This is going to be the first part of a series of posts explaining basic concepts about the many different corners of the vast world surrounding AI. The goal of these posts isn't to turn you into an AI expert overnight, but they will help you start understanding key pieces so you can get a lot more out of everything this new reality has to offer.
Many of us come from other fields, and since ChatGPT launched back in 2022 we've had to keep learning a bunch of concepts, acronyms, and jargon that used to be known only to the people who worked with them. Now it seems like “everyone gets it”.
Most of the concepts explained in these posts don't aim to go into huge depth, but they do aim to be specific enough to get the main idea across, so you can make better use of the many tools available and dig deeper into these concepts later on if you want to.
Let's start today with something basic…
Those weird numbers: 7B, 70B, 405B
You've probably seen names like Gemma-4-31B, Qwen-3-8B, or Llama-3.1-70B. And you've wondered what that "B" stuck next to a number means.
The short answer is that the B stands for billion, and it refers to the number of parameters in the model.
- Gemma-4-31B → 31 billion parameters.
- Qwen-3-8B → 8 billion parameters.
- Llama-3.1-70B → 70 billion parameters.
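If you like seeing things in code, here's a minimal sketch that reads that number straight from a model name. It assumes the name ends in a number followed by "B", which is an informal naming habit and not a rule every model follows. The function name `parameters_from_name` is made up for this example.

```python
import re
from typing import Optional


def parameters_from_name(model_name: str) -> Optional[int]:
    """Return the parameter count suggested by a trailing '<number>B', or None."""
    # A number (decimals allowed) followed by 'B' at the very end of the name
    match = re.search(r"(\d+(?:\.\d+)?)B$", model_name)
    if match is None:
        return None  # the name doesn't follow the convention
    return round(float(match.group(1)) * 1_000_000_000)


for name in ["Gemma-4-31B", "Qwen-3-8B", "Llama-3.1-70B"]:
    print(f"{name}: {parameters_from_name(name):,} parameters")
```

Note that the version number in Llama-3.1-70B is ignored: only the number glued to the final "B" counts.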
Wait but… what exactly are parameters?
When you train a model, what happens under the hood is that it's shown many billions of examples, and the model keeps adjusting a huge amount of internal numbers to "get it right" a little better each time. Once training is done, those numbers are the parameters. They're what defines the model.
Put another way: a parameter is, “simply”, a number the model learned during its training.
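To make that less abstract, here's a toy sketch of a "model" with exactly one parameter. It's an illustration under simplified assumptions (made-up data, plain gradient descent, a hypothetical function name), nowhere near how a real LLM is trained, but the idea of nudging a number until the answers improve is the same.

```python
def train_single_parameter(examples, learning_rate=0.05, epochs=200):
    """Learn w so that w * x approximates y for every (x, y) example."""
    w = 0.0  # the parameter starts out knowing nothing
    for _ in range(epochs):
        for x, y in examples:
            error = w * x - y
            # Nudge w in the direction that shrinks the error
            w -= learning_rate * error * x
    return w


# Made-up examples that follow the hidden rule y = 2x
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train_single_parameter(examples)
print(round(w, 3))  # the learned parameter ends up at 2.0
```

A real model does the same kind of adjusting, only with billions of these numbers at once.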
This matters because it explains why models sometimes "hallucinate": they don't have a table of facts to look things up in. They have an enormous set of learned statistical patterns, and sometimes the most likely pattern isn't the right one.
A typical question is: more parameters = better model? Generally yes, at least when comparing models from the same family and generation, but with a few nuances. More parameters means:
- More capacity to capture complex patterns.
- Better reasoning on hard tasks.
- More knowledge "memorized" during training.
But it also means:
- More memory needed to run them (the GPU has to load all those numbers).
- Slower at generating text.
- More expensive to serve (both in money and in maintainability).
If I know a model's parameters, can I figure out how much memory it needs?
Yes. Knowing the above, you can apply an old trick:
parameters (B) × bytes_per_parameter ≈ Memory (GB)
A parameter normally takes up 2 bytes (in today's standard format, which is called BF16, but don't worry about that now; we'll cover it later). So at a glance we can look at a model and see how much vRAM it needs:
- Mistral-7B → 7 × 2 = ~14 GB of vRAM.
- Llama-3.1-70B → 70 × 2 = ~140 GB of vRAM.
- Llama-3.1-405B → 405 × 2 = ~810 GB of vRAM.
That's the minimum you'd need just to load each of those models in its standard format.
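Why does the trick work so cleanly? One billion parameters times N bytes is N billion bytes, and a billion bytes is one (decimal) GB, so the "billions" cancel out. Here's a small sketch of that arithmetic. The function name is made up for this example, and the parameter counts are the rounded numbers from the model names, so treat the results as estimates.

```python
def weights_memory_gb(num_parameters: float, bytes_per_parameter: float = 2.0) -> float:
    """Estimate the memory taken by the weights alone, in decimal GB."""
    total_bytes = num_parameters * bytes_per_parameter
    return total_bytes / 1e9  # 1 GB = 10**9 bytes


# Rounded parameter counts taken from the model names
models = {"Mistral-7B": 7e9, "Llama-3.1-70B": 70e9, "Llama-3.1-405B": 405e9}
for name, params in models.items():
    print(f"{name}: ~{weights_memory_gb(params):.0f} GB of vRAM")
```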
But wait… so do I need a supercomputer at home?
If we look at the numbers above, loading a 70B model in its standard format (140 GB of vRAM) requires graphics cards that cost thousands of euros. It looks like local AI is only for the rich!
Keep calm, my friend, and don't panic. This is where the magic of quantization comes in.
To understand quantization, imagine that each parameter (that little number the model learned) is a super precise GPS coordinate: 40.4161754, -3.7032902. Storing that many decimals takes up a lot of space. What happens if we round it to 40.416, -3.703? It takes up less space, and for 99% of tasks, it'll get you to the same place.
Quantization does exactly that: it compresses the model's numbers by lowering their precision so they take up less memory and are faster to process.
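Here's what that rounding can look like in code. This is a minimal sketch of one simple scheme (a single shared scale that maps every value to an integer between -127 and 127). Real quantization tools use more refined variants, and the function names and sample weights here are made up for illustration.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.4161754, -0.7032902, 0.0123456, 0.25]  # made-up parameters
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.7f} -> {approx:+.7f}")
```

Each stored value now fits in one byte, and the restored numbers land close to the originals without matching them exactly. That small gap is the precision you trade for memory.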
The "flavors" of precision: FP16, FP8, INT4...
Remember we mentioned earlier that a standard parameter takes up 2 bytes? That's because it uses a 16-bit format (known as FP16 or BF16). From there, the community started "squeezing" those numbers:
- FP16 / BF16 (16-bit): The original format. Each parameter takes up 2 bytes. Maximum quality, maximum weight.
- FP8 / INT8 (8-bit): Precision cut in half. Each parameter takes up 1 byte. The model loses a fraction of its intelligence (in some cases almost imperceptible), but it takes up half the space.
- INT4 / 4-bit: Very reduced precision. Each parameter takes up just 0.5 bytes (half a byte). Here the model can become less accurate, but the memory savings are so massive that it can be worth it in some cases.
Note: even bigger reductions exist, but except in very specific cases they're usually not advisable, since the model loses so much precision that it “gets dumb”.
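If "half a byte" sounds odd, this small sketch may help. It uses Python's standard `struct` module to confirm the size of a 16-bit float and an 8-bit integer, and then shows how two 4-bit values can share one byte. The two packing helpers are made up for this example. Also note that `struct`'s 16-bit float is FP16; BF16 splits its 16 bits differently but takes up the same 2 bytes.

```python
import struct

# 'e' is struct's code for a 16-bit float, 'b' for a signed 8-bit integer
print(struct.calcsize("<e"))  # 2 bytes
print(struct.calcsize("<b"))  # 1 byte


def pack_int4_pair(low, high):
    """Store two unsigned 4-bit values (0 to 15) in a single byte."""
    if not (0 <= low <= 15 and 0 <= high <= 15):
        raise ValueError("4-bit values must be between 0 and 15")
    return (high << 4) | low


def unpack_int4_pair(byte):
    """Recover the two 4-bit values from one byte."""
    return byte & 0x0F, byte >> 4


packed = pack_int4_pair(3, 10)
print(packed, unpack_int4_pair(packed))  # 163 (3, 10)
```

Two parameters per byte is where the 0.5 bytes per parameter comes from.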
Calculating the memory a model needs like a pro
Now that you understand what parameters are and how they can be compressed, you can combine both concepts to know whether or not a model fits on your graphics card or on the server where you plan to run it.
The practical formula, the same “trick” we saw earlier:
parameters (B) × bytes_per_parameter ≈ Memory (GB)
But this time with its quantization flavors:
- For 16-bit (uncompressed) → Multiply by 2
- For 8-bit (FP8/INT8) → Multiply by 1
- For 4-bit (INT4) → Multiply by 0.5
Let's take Llama-3.1-70B so you can see the difference:
- Llama-3.1-70B in 16-bit: 70 × 2 = ~140 GB of vRAM.
- Llama-3.1-70B in 8-bit: 70 × 1 = ~70 GB of vRAM. (Now it's starting to be more manageable.)
- Llama-3.1-70B in 4-bit: 70 × 0.5 = ~35 GB of vRAM. (This is now much more accessible.)
One last tip for the real world: this formula gives you what the model takes up at rest (the weights). When you start talking to it and the model has to "think" and remember the conversation (what's called context memory), it needs extra space, and the longer the conversation, the more it needs.
As a general rule, always add at least 10% to 20% on top of the result of your calculation to give yourself some headroom so the model doesn't crash for lack of memory.
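Putting it all together, here's a minimal sketch that applies the formula, adds the headroom, and tells you whether a model fits. The function name, the 20% default, and the 48 GB card are assumptions made for the example; plug in your own numbers.

```python
BYTES_PER_PARAMETER = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def fits_in_vram(billions_of_parameters, precision, vram_gb, headroom=0.20):
    """Return (needed_gb, fits) for the weights plus a headroom fraction."""
    weights_gb = billions_of_parameters * BYTES_PER_PARAMETER[precision]
    needed_gb = weights_gb * (1 + headroom)
    return needed_gb, needed_gb <= vram_gb


# Hypothetical setup: a 70B model on a card with 48 GB of vRAM
for precision in BYTES_PER_PARAMETER:
    needed, fits = fits_in_vram(70, precision, vram_gb=48)
    verdict = "fits" if fits else "does not fit"
    print(f"70B in {precision}: ~{needed:.0f} GB needed, {verdict}")
```

In this example only the 4-bit version fits. The headroom matters too: an 8B model in 16-bit is ~16 GB of weights, which looks fine on a 16 GB card until you add the extra 20%.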
And that's it! Now you know what that "B" in the models means and how to instantly tell whether you'll be able to run it on your machine or not. See you in upcoming posts.
Want to burn tokens with no limits?
- If you're a company and you want to use open models so API costs don't bankrupt you, check out our plans and start saving today: Helmcode
- If you're not a company, join our private community of builders. Besides open model inference, you'll get access to resources, knowledge, workshops, and much more to build together as a community: NaN