A First Look At Gemma 4
Google released Gemma 4 today, and I’m excited. I had some fun with Gemma 3 a few months back: it was slow, but surprisingly smart for something that can run on a laptop. After that initial experience, I put “fine-tuning Gemma” on my to-do list; what if you could have a fast and cheap model that matches the performance of Gemini Pro on some specific tasks? Anyway, it didn’t happen; other things got in the way, such as Slay the Spire 2. 😅
I downloaded two different versions from Hugging Face:
- The 31B, quantized by unsloth as Q6_K.
- The 26B with 4B activation, quantized by bartowski as Q5_K_M.
Using llama.cpp on my laptop, we can compare the speed of both models, and whether they work at all. The laptop’s specifications are:
- 32 GB of RAM 💸
- AMD Ryzen 9 7940HS
- NVIDIA GeForce RTX 4060 with 8GB of VRAM
Installation of llama.cpp
I downloaded the CUDA-13-enabled llama.cpp build from its GitHub releases page. At first, it did not detect my GPU. It turns out I also had to download the CUDA 13 DLL files and drop them next to the llama-cli executable. I like that all the files sit flat in there: no need to think about where the DLLs should go. Once the three DLLs (the cudart and cublas libraries) were in place, llama-cli printed the name of my GPU when I ran it with --list-devices.
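For illustration, the folder ended up looking roughly like this (the exact DLL file names are my assumption from the usual CUDA naming pattern; take them from the runtime archive you actually download):

```text
llama-cli.exe
cudart64_13.dll      (CUDA runtime)
cublas64_13.dll      (cuBLAS)
cublasLt64_13.dll    (cuBLAS helper)
```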
I also went into the graphics settings of Windows to force-enable my dGPU whenever llama-cli.exe was running. I’m not sure whether that was a necessary step, but the setup worked, so I let it be.
Gemma 4 31B
I loaded the model into memory with .\llama-cli.exe --model ..\gemma-4-31B-it-Q6_K.gguf. After waiting for some time, the program was ready for my input. I prompted:
Write a simple C program from start to finish, single file.
For some time, nothing happened. Then, I saw the first thinking tokens starting to stream. It wasn’t gibberish; Gemma 4 was running on my laptop! It decomposed my prompt into a detailed goal, tried to infer what the intent was, etc.
Many minutes later, it produced this:
```c
#include <stdio.h>

int main() {
    // Variable declarations
    int num1, num2, sum;

    printf("--- Simple Addition Program ---\n");

    // Ask user for the first number
    printf("Enter first number: ");
    scanf("%d", &num1);

    // Ask user for the second number
    printf("Enter second number: ");
    scanf("%d", &num2);

    // Perform the addition
    sum = num1 + num2;

    // Display the result
    printf("The sum of %d and %d is: %d\n", num1, num2, sum);

    return 0;
}
```
It took a long time. The prompt was read at a speed of 17 tokens per second, but the generation ran at a measly 2 tokens per second. Depending on your use case, that might be acceptable.
I looked at the Performance panel of the Windows Task Manager during and after the generation. It showed that my 32 GB of RAM was completely filled, and most of the VRAM was as well. After the generation ended, 10.6 GB of RAM was still in use (I had WSL running, among other things). Could it be that llama.cpp ran out of RAM and swapped to disk?
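A back-of-the-envelope check makes that plausible. Assuming Q6_K stores roughly 6.56 bits per weight (a commonly cited effective rate for that quantization; treat it as an approximation), the weights alone of a 31B model nearly exhaust 32 GB of RAM plus 8 GB of VRAM once the OS and other programs take their share:

```python
# Rough estimate of the memory needed just for the quantized weights.
# 6.56 bits/weight for Q6_K is an approximation, not an exact spec.
params = 31e9           # 31B parameters
bits_per_weight = 6.56  # approximate effective rate of Q6_K

weight_bytes = params * bits_per_weight / 8
weight_gb = weight_bytes / 1e9
print(f"~{weight_gb:.1f} GB of weights")  # → ~25.4 GB of weights
```

That leaves little headroom for the KV cache and everything else, so swapping to disk would not be surprising.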
Gemma 4 26B-A4B
The same family as the model above, but presumably slightly dumber. I expected it to run faster since, from my understanding, only 4B parameters are ever active at once. That’s almost eight times fewer than the 31B version. Using primary-school math, I inferred that it would run at 14+ tokens per second. Given that only a few layers were on the GPU, though, I thought that was too optimistic… but I forgot two things: 26B is smaller than 31B, and the quantization was more severe.
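The primary-school math, spelled out (this is a naive proportionality assumption; real throughput also depends on memory bandwidth and offloading):

```python
# Naive estimate: generation speed scales inversely with active parameter count.
dense_speed = 2.0      # tok/s measured for the dense 31B model
dense_params = 31e9
active_params = 4e9    # only 4B parameters active per token in the MoE variant

estimate = dense_speed * dense_params / active_params
print(f"~{estimate:.1f} tok/s")  # → ~15.5 tok/s
```

That rounds down to the 14+ tokens per second mentioned above.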
I asked it to do the same thing, to write the C program… Wow, more than an order of magnitude faster! At an incredible speed of 27 tokens per second, it produced this:
```c
#include <stdio.h>

int main() {
    // 1. Declare variables to store integers
    int number1, number2, sum;

    printf("--- Simple Addition Program ---\n");

    // 2. Ask the user for the first number
    printf("Enter the first integer: ");
    // %d tells scanf to look for an integer
    // &number1 tells scanf to store the result in the memory address of number1
    scanf("%d", &number1);

    // 3. Ask the user for the second number
    printf("Enter the second integer: ");
    scanf("%d", &number2);

    // 4. Perform the calculation
    sum = number1 + number2;

    // 5. Display the result
    // \n moves the cursor to a new line
    printf("\nResult: %d + %d = %d\n", number1, number2, sum);

    return 0; // Tells the OS the program finished successfully
}
```
That’s essentially the same program as above, which makes me wonder whether the default llama.cpp temperature is zero. The prompt itself was processed at 45 tokens per second. When I asked a follow-up question, it processed the appended context at a speed of 283 tokens per second.
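One plausible explanation for that last number, assuming llama-cli kept the earlier conversation in its KV cache (I have not verified that this is what happened): only the newly appended tokens actually need processing, so the apparent speed over the whole context looks inflated. A worked example with made-up token counts:

```python
# Hypothetical numbers to illustrate the effect of prompt caching.
cached_tokens = 420  # already in the KV cache from the first exchange
new_tokens = 80      # the follow-up question
raw_speed = 45.0     # tok/s for actually processing prompt tokens

time_spent = new_tokens / raw_speed  # only the new tokens cost time
apparent = (cached_tokens + new_tokens) / time_spent
print(f"~{apparent:.0f} tok/s apparent")  # → ~281 tok/s apparent
```

With those invented counts, a raw 45 tok/s shows up as roughly 281 tok/s, which is in the same ballpark as the 283 tok/s I measured.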
Conclusion
On a laptop such as mine, equipped with lots of RAM and an NVIDIA GPU, Gemma 4 works well. Its mixture-of-experts version, which activates only 4B parameters at a time, is much more enjoyable to use. It looks like a great model for people and businesses that want to chat or process unstructured information without sharing their data with Big Tech.
The model was released today, so it’s too early to say what impact it will have. I look forward to the benchmark results, which should appear within the next week or so.