Maykeye

BakaLLM. Part 16. Procrastinating in style

Hello, fairy dairy diary!~

(Image: Cirno rests)

The focus of the last several days has been the idea of Progressively Stacking 2.0. (Of course I skipped the part about keeping the optimizer state, as math-too-hard.)

After 3 epochs of training by parts, where each epoch trained only one third of the net, the loss is 4.4424.

That's comparable to the baka fanout-zeropad-upscaler after 1 epoch (4.4411 -- the fanout variant where the MLP was not upscaling yet).

However, I'm not expecting it to be that good for a while.
The plan for now is to train all layers up to ~epoch 15 and see what happens.
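
A minimal sketch of what "training by parts" could look like, assuming the model keeps its blocks in a plain `nn.ModuleList` called `layers` (that name and the whole setup are my illustration, not the actual BakaLLM code): freeze two thirds of the blocks each epoch and rebuild the optimizer from whatever is left trainable, which conveniently sidesteps the keep-the-optimizer-state question from above.

```python
import torch
from torch import nn

def set_trainable_third(model: nn.Module, epoch: int) -> None:
    """Leave one third of the blocks trainable, freeze the rest.

    Which third is active rotates with the epoch number.
    """
    blocks = list(model.layers)               # assumed nn.ModuleList of blocks
    third = (len(blocks) + 2) // 3            # ceil(n / 3)
    start = (epoch % 3) * third
    active = range(start, min(start + third, len(blocks)))
    for i, block in enumerate(blocks):
        block.requires_grad_(i in active)     # freeze the other two thirds

# Usage: rebuild the optimizer each epoch so it only sees the live params
# (and therefore keeps no state for the frozen ones).
# for epoch in range(3):
#     set_trainable_third(model, epoch)
#     opt = torch.optim.AdamW(
#         (p for p in model.parameters() if p.requires_grad), lr=1e-4)
#     train_one_epoch(model, opt)
```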

Depending on the result, I may then scale the network up to 1B and train it with LoRA to tie all the layers together.
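
One way to read "tie all the layers together with LoRA" is: every layer shares one set of frozen base weights, and each layer only learns a small low-rank delta on top. A hedged sketch under that reading; nothing here is the real BakaLLM module layout.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """A shared, frozen base Linear plus this layer's own low-rank delta."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)    # tied across layers, frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# One shared projection, a distinct LoRA delta per layer.
shared = nn.Linear(512, 512)
tied_layers = nn.ModuleList([LoRALinear(shared, rank=8) for _ in range(24)])
```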

Also, I was too lazy to change the number of layers at runtime, so for now I just disable them: the disabled layers run only the upscaler, while the params for attn and mlp are still there.
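
Roughly, that workaround could look like the block below: the attn and mlp parameters are still constructed (so the parameter layout doesn't change), but a disabled block's forward returns right after the upscaler. All names and shapes here are my guesses, not the actual BakaLLM code.

```python
import torch
from torch import nn

class BakaBlockSketch(nn.Module):
    """Hypothetical block: attn/mlp params exist even when the block is disabled."""

    def __init__(self, dim_in: int, dim_out: int, enabled: bool = True):
        super().__init__()
        self.enabled = enabled
        self.upscaler = nn.Linear(dim_in, dim_out)   # always runs
        self.attn = nn.MultiheadAttention(dim_out, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim_out, 4 * dim_out),
            nn.GELU(),
            nn.Linear(4 * dim_out, dim_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.upscaler(x)
        if not self.enabled:                 # disabled layer: upscaler only
            return x
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out
        return x + self.mlp(x)
```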

I hit a bug in rotary_embedding_torch (it didn't like requires_grad_ being called on different layers with different args), but it was already fixed in 0.5.3. Sweet.
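
For reference, the pattern in question is freezing some blocks while unfreezing others; the surrounding layout below (each toy block owning its own RotaryEmbedding) is only an illustration of that pattern, not my actual code.

```python
from torch import nn
from rotary_embedding_torch import RotaryEmbedding  # pip install rotary-embedding-torch

class TinyAttnBlock(nn.Module):
    """Toy block that owns its own rotary embedding (layout is an assumption)."""

    def __init__(self, dim_head: int = 32):
        super().__init__()
        self.rotary = RotaryEmbedding(dim=dim_head)
        self.qkv = nn.Linear(dim_head, 3 * dim_head)

layers = nn.ModuleList([TinyAttnBlock() for _ in range(2)])

# requires_grad_ called with different args on different layers:
# the frozen thirds vs. the third currently being trained.
layers[0].requires_grad_(False)
layers[1].requires_grad_(True)
```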

One day I'll throw in more recurrence, as intended!

But for now, I'm still chilling.
And so should you.
