
Alan Dao's personal blog
AI Researcher
Ever since my post Multi Modal Tokenizing With Chameleon, I have been working with my team at HomeBrew Research to build something. We wanted to give the community something new, not simply a replica of Chameleon (vision modality).
So we decided to work on a model that handles sound: one you can talk to and give commands to. A Llama model that can listen!
The Error of Death
Have you been constantly battling VRAM usage while fine-tuning LLMs, and constantly running into the error below?
RuntimeError: CUDA error: out of memory
The above is the destroyer of joy, the sudden stop of happiness, and the most dreadful error you might face as someone trying to train an AI model, or more specifically an LLM (because I assume it's the most VRAM-intensive of the bunch).
Today I went biking
And… I realized something I wanted. When I was biking, I wasn't able to use the phone for navigation. I was listening to music, but could not change tracks easily. … I realized how much I need a hands-free solution to control my phone while riding a bicycle. I work in AI; at this point I should have something that just parses my voice and picks what I need, a better (yes) Siri or something like that?
In LLMs, a very fundamental step is tokenizing. To make the LLM understand what you are inputting, you need to convert text into numbers.
But one might wonder: what about images, sounds, … everything other than text?
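To make the text-to-numbers step concrete, here is a minimal sketch of a toy word-level tokenizer. The helper names (`build_vocab`, `encode`, `decode`) and the `<unk>` token are made up for illustration; real LLM tokenizers typically work on subword units rather than whole words, so treat this as the idea only, not how any particular model does it.

```python
def build_vocab(corpus):
    """Assign an integer ID to each unique whitespace-separated word."""
    vocab = {"<unk>": 0}  # hypothetical ID reserved for unknown words
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn text into a list of integer IDs (unknown words map to <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def decode(ids, vocab):
    """Turn a list of integer IDs back into text."""
    inverse = {idx: word for word, idx in vocab.items()}
    return " ".join(inverse[idx] for idx in ids)


vocab = build_vocab("a llama model that can listen")
ids = encode("a llama can listen", vocab)
print(ids)                 # the numbers the model would actually see
print(decode(ids, vocab))  # round-trips back to the original text
```

The model never sees the characters themselves, only the list of IDs, which is why every modality has to be mapped to numbers in some way before it can be fed in.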
That’s exactly the question I will answer today.
The Logic Behind Tokenizing an Image
To tokenize an image, we must first understand the fundamental principles behind tokenizing text. There are three key aspects of text tokenization that differ significantly from image tokenization:
You may have seen people all over Reddit or Twitter saying…
“Model A has RoPE implemented.”
“We can make it run longer by changing the RoPE scaling.”
…and so on.
But for real? What the hell is RoPE, and how does it work? They say something about sin and cos, but what does that even mean? Now, I am about to demystify all of that, for your sake.
The intuition behind RoPE
To understand what RoPE is, we first need to review some high school math.
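As a rough sketch of where sin and cos come in, the snippet below rotates a 2D vector by an angle proportional to its position and checks that the dot product of two rotated vectors depends only on how far apart their positions are. The function names and the `theta` value are assumptions chosen for illustration; an actual RoPE implementation applies this kind of rotation to many pairs of dimensions, each with its own frequency.

```python
import math


def rotate(pair, position, theta=1.0):
    """Rotate a 2D vector by the angle position * theta (theta is an assumed base frequency)."""
    x, y = pair
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def dot(a, b):
    """Plain 2D dot product."""
    return a[0] * b[0] + a[1] * b[1]


q = (1.0, 2.0)   # toy "query" pair
k = (0.5, -1.0)  # toy "key" pair

# Both cases have the same relative offset (2 positions apart).
score_near = dot(rotate(q, 3), rotate(k, 1))
score_far = dot(rotate(q, 10), rotate(k, 8))

print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

Because rotations compose by adding angles, the score between a query and a key only "sees" the difference between their positions, which is the property that makes this encoding relative.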