Maykeye

BakaLLM, part 15: reaching for the stars!

Hello, fairy dairy diary!

The feeling when you realize your XL implementation was wrong.

Let's start with the good news. After implementing the bare minimum for Llama* I reached a loss of 3.94 after 3 epochs, which is very good, as I got rid of ~40M parameters in the process. I didn't think much about the upscaler and just padded zeros on the right, figuring the normalizer would inject something and that something would play nicely with the last layer.
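For the curious, here is a minimal sketch of what that zero-padding "upscaler" amounts to (not the actual BakaLLM code; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def upscale_pad(x: torch.Tensor, target_dim: int) -> torch.Tensor:
    """Widen the hidden state by appending zeros on the right.

    Sketch only: assumes x is (batch, seq, dim_model) with dim_model <= target_dim.
    """
    extra_dim = target_dim - x.shape[-1]
    # F.pad pads the last dimension with (left, right) amounts,
    # so this appends `extra_dim` zeros after the existing features.
    return F.pad(x, (0, extra_dim), value=0.0)

x = torch.randn(2, 16, 512)   # (batch, seq, dim_model)
y = upscale_pad(x, 768)       # (2, 16, 768); the last 256 features are zero
```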

[Figure: validation result]

And the training graph was also good.

[Figure: training graph]

Now, two pieces of bad news. First: the shared MLP didn't work well after all. I tried "correcting" it with LoRA, tried different patterns of sharing (i.e. which groups of layers share which MLP; 3 shared MLPs were better than 4 distinct MLPs for some reason), tried different dim_ff, but while the weight reduction was good, the loss reduction was not that good.
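For context, "shared MLP corrected with LoRA" means roughly the following: one MLP whose weights are tied across a group of layers, plus a small per-layer low-rank delta so each layer can drift a little. A hypothetical sketch (names, rank, and wiring are made up, not my exact code):

```python
import torch
import torch.nn as nn

class SharedMLPWithLoRA(nn.Module):
    """One up/down projection shared by `n_layers` layers,
    with a per-layer LoRA correction on the up-projection."""
    def __init__(self, dim_model: int, dim_ff: int, n_layers: int, rank: int = 8):
        super().__init__()
        # Shared (tied) weights, created once for the whole group of layers
        self.up = nn.Linear(dim_model, dim_ff)
        self.down = nn.Linear(dim_ff, dim_model)
        self.act = nn.GELU()
        # Per-layer low-rank factors: A is (dim_model, rank), B is (rank, dim_ff)
        self.lora_a = nn.Parameter(torch.randn(n_layers, dim_model, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_layers, rank, dim_ff))

    def forward(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Shared path plus this layer's low-rank delta
        h = self.up(x) + (x @ self.lora_a[layer_idx]) @ self.lora_b[layer_idx]
        return self.down(self.act(h))
```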

The second bad news is just embarrassing! For some reason, during the XL implementation I fed self-attention with the K, V of the last segment cached by the previous layer. Why? I don't remember. K is meant to hold self.k_proj(X), where k_proj belongs to the current layer, so that next time, on the same layer, we don't calculate self.k_proj(X) again but reuse the cached value. The same applies to V. Surprisingly, it worked well enough, which shows that when you throw poo at a model, it still can sieve through it somehow.

I didn't notice this error until I was hit by the different dim_model of llama* (here called "fan-out").

This is why the WIP model is called BakaLLM, of course.

After fixing the bug, llama*-tweaked became better than it was before the fix, so there's no need to pass K/V as-is between layers.
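For reference, a per-layer XL-style cache should look roughly like the sketch below: each layer caches its own k_proj/v_proj outputs for the previous segment and reuses them on the next one. A simplified illustration of the fixed behaviour (single head, no causal mask, made-up names, not the actual BakaLLM code):

```python
import torch
import torch.nn as nn

class XLAttentionSketch(nn.Module):
    """Per-layer K/V reuse across segments. The bug was feeding attention
    with K/V cached by the *previous layer* instead of the K/V this layer
    cached for the previous segment."""
    def __init__(self, dim_model: int):
        super().__init__()
        self.q_proj = nn.Linear(dim_model, dim_model)
        self.k_proj = nn.Linear(dim_model, dim_model)
        self.v_proj = nn.Linear(dim_model, dim_model)
        self.mem_k = None  # this layer's K from the previous segment
        self.mem_v = None  # this layer's V from the previous segment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seg_len, dim_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if self.mem_k is not None:
            # Attend over [previous segment memory, current segment]
            k = torch.cat([self.mem_k, k], dim=1)
            v = torch.cat([self.mem_v, v], dim=1)
        # Stash the current segment's K/V for the next segment (no gradient)
        seg_len = x.shape[1]
        self.mem_k = k[:, -seg_len:].detach()
        self.mem_v = v[:, -seg_len:].detach()
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```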

Technically I'd like to test all previously mainlined architectures, but I'll just fix RMT and check the result against the older RMT and llama* (here known as fanout).

Then next on the road map is to add a couple of simple upscalers: replace x = pad(x, (0, extra_dim), value=0.0) with either x = x 🐱 (x @ W_uscale_proj) to fill the extra space, or x = pad(x, (0, extra_dim), value=0.0) + (x @ W_uscale_proj) to fully project x. And as the name implies, I'd also like to try a "fan-in" that reduces the hidden size.
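Roughly, the two upscaler variants would look like this (🐱 being concatenation); a sketch with made-up module names, under the assumption that W_uscale_proj is a plain learned linear projection, not a commitment to the final implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatUpscaler(nn.Module):
    """x 🐱 (x @ W): keep x as-is and fill the extra dims with a learned projection."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.w_uscale_proj = nn.Linear(dim_in, dim_out - dim_in, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x, self.w_uscale_proj(x)], dim=-1)

class PadPlusProjectUpscaler(nn.Module):
    """pad(x) + (x @ W): zero-pad x to the new width and add a full projection."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.dim_out = dim_out
        self.w_uscale_proj = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        padded = F.pad(x, (0, self.dim_out - x.shape[-1]), value=0.0)
        return padded + self.w_uscale_proj(x)
```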

Well, happy new year, and chill yourself, for with global warming it's getting harder! Cya later, I'm recalculating 005_RMT for the next couple of days.
