MambaBit. The most cursed LLM?

Modern tokenizers come in all forms! Some, like Qwen's, support a vocabulary of ~150,000 tokens.

Byte-level models make do with 256 tokens.

Can we go lower?

(There should be a "you were so busy asking if you could" meme here, but dev.to complains.)

But the answer is yes. MambaBit comes with just 2 tokens: one for bit 0, one for bit 1. That's it. Yet somehow it still produces something that is not completely random.
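For concreteness, here is a minimal sketch of what a 2-token "tokenizer" could look like. This is just the obvious byte-to-bit encoding (MSB first) I'm assuming, not necessarily the exact bit order MambaBit uses:

```python
# Hypothetical 2-"token" vocabulary: every byte of text expands into 8 bit tokens.
def text_to_bits(text: str) -> list[int]:
    return [(byte >> i) & 1 for byte in text.encode("utf-8") for i in range(7, -1, -1)]

def bits_to_text(bits: list[int]) -> str:
    data = bytes(
        int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)
    )
    return data.decode("utf-8", errors="replace")

assert bits_to_text(text_to_bits("LEONTES:")) == "LEONTES:"
```

Every character costs eight tokens, so sequences are eight times longer than in a byte-level model before the model even gets going.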

The prompt "Behold the most cursed" becomes:

Behold the most cursed of men.

LEONTES:
Now means means me not so much as my father,
In the good many lord, and my father come.

It still learned words! And line breaks, and speaker names that precede normal text! Even at the bit level it produces words rather than gibberish. It's too cursed!

Also, we can go even lower and fully embrace the bitness: layers like nn.LayerNorm have a built-in bias, which means bit 1 can be encoded as params_for_1 + the normalization bias, and bit 0 as the normalization bias alone. Say bye to nn.Embedding!
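A minimal sketch of that idea, assuming a hypothetical BitEmbed module (not MambaBit's current code): the only stored vector is params_for_1, and the LayerNorm bias by itself stands in for bit 0.

```python
import torch
import torch.nn as nn

class BitEmbed(nn.Module):
    """Hypothetical replacement for nn.Embedding(2, dim)."""
    def __init__(self, dim: int):
        super().__init__()
        # The only "embedding" we store: the vector added when the bit is 1.
        self.params_for_1 = nn.Parameter(torch.randn(dim) * 0.02)
        # The LayerNorm's built-in bias doubles as the representation of bit 0.
        self.norm = nn.LayerNorm(dim)

    def forward(self, bits: torch.Tensor) -> torch.Tensor:
        # bits: (batch, seq) of 0/1 -> (batch, seq, dim)
        x = bits.unsqueeze(-1).float() * self.params_for_1  # all zeros where bit == 0
        return self.norm(x)  # a zero vector comes out as exactly the LayerNorm bias
```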

Same with lm_head: as of now the model produces output which is essentially [X, -X], where X is usually in the range -3..3.
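And a sketch of the single-logit head, again hypothetical: since the two logits are essentially [X, -X], one linear output is enough.

```python
import torch
import torch.nn as nn

class BitHead(nn.Module):
    """Hypothetical lm_head that predicts one score X instead of two logits."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = self.proj(h)                   # (batch, seq, 1), X roughly in -3..3
        return torch.cat([x, -x], dim=-1)  # same softmax as a 2-way lm_head
```

Since softmax([X, -X]) for the first class equals sigmoid(2X), you could even skip the concatenation and train a single sigmoid output directly.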
