Chasing the Next Transformer Killer - Part 2

Jul 21, 2024

This is the second part of a post about research toward faster Large Language Models; for the first part, click here.

In the previous part, we looked at efforts that improved the asymptotic complexity of traditional transformers. The improvement efforts become really exciting when researchers try to marry the advantages of attention mechanisms and RNNs. Yes, you heard that right: there is a way to bring recurrent neural networks back into the picture while still allowing parallel training of the network (overcoming the inherently serial nature of RNNs). The most prominent pioneer of this idea is RWKV (pronounced “RwaKuv” and named after its four major parameters: R, W, K, V).

Its main GitHub page defines it as a “Parallelizable RNN with Transformer-level LM Performance”. To understand more of the details behind this marvelous model, I recommend two sections of the Hugging Face RWKV article:

  1. Transformer Architecture vs RNNs
  2. RWKV attention formulation

As that article puts it: “The model architecture is very similar to classic transformer-based models (i.e. an embedding layer, multiple identical layers, layer normalization, and a Causal Language Modeling head to predict the next token).” A small sketch of the recurrent formulation follows below.
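To make the recurrent formulation concrete, here is a minimal, numerically naive NumPy sketch of RWKV-4 style “WKV” time mixing in its recurrent (inference) mode. The function name and shapes are my own, and the sketch omits token shift, channel mixing, the output projection, and the running-maximum trick real implementations use for numerical stability:

```python
import numpy as np

def rwkv_wkv_recurrent(r, k, v, w, u):
    """Numerically naive sketch of RWKV-4 style WKV time mixing, recurrent mode.

    r, k, v: receptance, key, value per time step, each of shape (T, C).
    w: per-channel decay, u: per-channel bonus for the current token, shape (C,).
    Omitted: token shift, channel mixing, output projection, and the
    running-maximum trick used in real implementations for numerical stability.
    """
    T, C = k.shape
    num = np.zeros(C)          # running decayed sum of exp(k_i) * v_i
    den = np.zeros(C)          # running decayed sum of exp(k_i)
    out = np.zeros((T, C))
    for t in range(T):
        cur = np.exp(u + k[t])                        # current token gets the bonus weight
        wkv = (num + cur * v[t]) / (den + cur)        # weighted average over past + current token
        out[t] = 1.0 / (1.0 + np.exp(-r[t])) * wkv    # receptance gates the output
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]  # decay the state, fold in the current token
        den = np.exp(-w) * den + np.exp(k[t])
    return out
```

The key point is that the state is just two running sums per channel, so inference costs the same per token no matter how long the context is, while the same computation can be unrolled over time and trained in parallel like a transformer.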

According to the RWKV authors, it is the leading sub-quadratic alternative to the transformer architecture:

RWKV models can compete with models one size category larger; for example, a 3-billion-parameter RWKV can match 7-billion-parameter transformer models on benchmarks while being at least as fast. Not to mention the smaller size, which makes edge deployment, for example on mobile devices, much more feasible. I was experimenting with MLC-LLM and tried some 7-billion- and 3-billion-parameter models: 7 billion parameters stretch the capability of today’s phones and yield only a few tokens per second at inference, while the 3-billion-parameter model ran about 10 times as fast. I have not been able to test 7-billion-parameter models with Google MediaPipe edge mobile apps yet, but the 2-billion-parameter Gemma (especially gemma-2b-it-gpu-int4) worked well.
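As a rough illustration of why the 7-billion class strains phones (my own back-of-the-envelope arithmetic, not measurements from these apps), weight memory alone scales with parameter count times bits per weight:

```python
def approx_weight_memory_gb(n_params_billions: float, bits_per_weight: int) -> float:
    """Weight-only memory estimate; ignores the KV cache / recurrent state,
    activations, and runtime overhead."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("3B model", 3.0), ("7B model", 7.0)]:
    for bits in (4, 8, 16):
        print(f"{name} @ {bits}-bit: ~{approx_weight_memory_gb(params, bits):.1f} GB of weights")

# A 7B model is ~3.5 GB of weights even at 4 bits, while a 3B model
# fits in ~1.5 GB, which is a much easier ask for today's phones.
```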

I was so interested in RWKV that I attended a tech meetup about it in San Francisco, and I discovered that there’s even more to it than efficiency. The community around RWKV is very international; it’s much more popular in Asia than here in the US. These international roots resulted in a naturally better-performing token vocabulary and tokenizer. If someone uses OpenAI’s tokenizer with a non-Latin alphabet, they can incur a 3x+ blow-up in token count, somewhat like the overhead of encoding Chinese or Japanese characters in UTF-8. Generative AI services are not cheap, and if an API bills by token count, this translates into a 3x+ cost increase, on top of slower responses, because inference time grows with prompt length: the longer your prompt, the longer it takes to generate the answer. It doesn’t have to be that way, though.
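You can observe the blow-up yourself with OpenAI’s open-source tiktoken library; the sample sentences below are my own, and the exact ratios depend on the encoding and text you pick:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # BPE vocabulary used by GPT-3.5/GPT-4 era models

samples = {
    "English": "Large language models are getting faster and cheaper.",
    "Chinese": "大型语言模型正变得越来越快、越来越便宜。",
    "Japanese": "大規模言語モデルはますます高速かつ安価になっています。",
}

for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang:8s} {len(text):3d} chars -> {n_tokens:3d} tokens "
          f"({n_tokens / len(text):.2f} tokens per character)")
```

English typically comes out around three to four characters per token with this vocabulary, while Chinese and Japanese text often needs one or more tokens per character.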

The vocabulary is an extremely important part of a transformer model. For example, the Bloomberg GPT developers realized that number representations are also not ideal with the traditional OpenAI tokenizer, so they custom-trained a tokenizer that handles numbers better; this piece is key to why Bloomberg GPT performs better on its specialized tasks. RWKV’s tokenizer, in turn, is friendly to foreign languages and alphabets and naturally serves them well: non-Latin alphabets incur only a 1.5-2.5x increase, while the English alphabet stays at about 1.5x.
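The number issue is also easy to see with the same library; which chunks a given number string breaks into depends entirely on the learned vocabulary, which is exactly what a custom-trained tokenizer like Bloomberg GPT’s gets to control (the sample numbers here are my own illustration):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for number in ["7", "1234", "3.14159", "1,250,000.75"]:
    pieces = [enc.decode([tok]) for tok in enc.encode(number)]
    print(f"{number!r:>16} -> {pieces}")
```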

On the other hand, one cannot simply grow the vocabulary to an arbitrary size, because certain algorithms have quadratic asymptotic consequences (for transformer-like models, not for RWKV), and much more training data would be needed to cover the larger vocabulary well. The availability of good training data is already a limiting factor for today’s large language model training. According to the RWKV researchers, the sweet spot for vocabulary size seems to be between ~32K and ~64K symbols.
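To give a feel for just one of those costs (my own illustration, not the RWKV team’s analysis): the embedding table and the output projection grow with the vocabulary, and every generated token requires a softmax over that entire vocabulary.

```python
def vocab_param_overhead(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters in the token embedding table plus the (optionally untied)
    output projection -- only one of the costs that grows with vocabulary size."""
    embedding = vocab_size * d_model
    unembedding = 0 if tied else vocab_size * d_model
    return embedding + unembedding

d_model = 2560  # assumed hidden size for a roughly 3B-parameter model
for vocab in (32_000, 65_536, 256_000):
    params = vocab_param_overhead(vocab, d_model)
    print(f"vocab {vocab:>7,}: ~{params / 1e6:.0f}M parameters in embedding + output head")
```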

Another interesting development along the lines of RNN-based models is the introduction of RecurrentGemma. This underlines that even Google DeepMind (part of the company where the transformer architecture was invented) believes that a mashup of RNNs and transformers can be beneficial and lucrative. RecurrentGemma is based on the Griffin architecture and mixes gated linear recurrences with local attention. Fortunately, it is open source just like the other Gemma variants, and I cannot wait for its follow-up versions. These cutting-edge models can play a crucial role on edge devices such as mobile phones. In the future, our computers and phones will run models locally, which can provide offline capability, save the round trips to servers, and increase privacy.
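To show what “gated linear recurrence” means in practice, here is a simplified NumPy sketch in the spirit of Griffin’s RG-LRU layer. The gate projections and update rule are my simplification, not the exact published layer (which uses a learned per-channel decay raised to a gated power and a sqrt(1 − a²) input scaling), and Griffin interleaves blocks like this with local sliding-window attention:

```python
import numpy as np

def gated_linear_recurrence(x, w_a, w_i):
    """Simplified gated linear recurrence in the spirit of Griffin's RG-LRU.

    x: inputs of shape (T, C); w_a, w_i: gate projection matrices of shape (C, C).
    The state h has a fixed size, so each generated token costs the same
    regardless of context length, unlike full softmax attention.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    T, C = x.shape
    h = np.zeros(C)
    out = np.zeros((T, C))
    for t in range(T):
        a_t = sigmoid(x[t] @ w_a)      # recurrence gate: how much of the state to keep
        i_t = sigmoid(x[t] @ w_i)      # input gate: how much of the new token to admit
        h = a_t * h + (1.0 - a_t) * (i_t * x[t])
        out[t] = h
    return out
```

As with RWKV, the per-step cost and the state size are constant, which is exactly what makes these models attractive for on-device inference.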
