Gemma 3n: Set Up Google’s On‑Device Multimodal AI Locally

Gemma 3n running multimodal AI tasks: vision, audio, and text

Introduction

Google’s Gemma 3n marks a major leap forward in on-device AI, bringing powerful multimodal intelligence (text, images, audio, and video) to your phone or tablet with a minimal resource footprint. Designed with privacy and performance in mind, Gemma 3n uses innovative techniques such as selective parameter activation and Per-Layer Embeddings (PLE), enabling full-featured AI without the need for cloud computing.

What is Gemma 3n?

Gemma 3n belongs to the open Gemma family, built on the same foundational research behind Google’s Gemini models. Unlike cloud-only LLMs, it is optimized for edge devices such as smartphones, tablets, and laptops, supporting multimodal input and delivering text output, all while keeping data on the device.

Since launch, the Gemma family has passed 160 million collective downloads. Gemma 3n comes in two primary variants: E2B (2 billion effective parameters) and E4B (4 billion effective parameters), both designed to deliver strong performance while staying efficient. Despite total parameter counts of 5B and 8B, respectively, these models can operate with a significantly reduced memory footprint through innovative parameter-management techniques.

Key Innovations

Selective Parameter Activation & Per‑Layer Embeddings

Gemma 3n uses “E2B” and “E4B” effective-parameter modes: although the full model contains 5B or 8B parameters, only ~2B or ~4B are active during inference, thanks to techniques such as parameter skipping and PLE caching. This allows powerful capabilities within mobile-grade RAM (~2–3 GB).
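As a rough back-of-the-envelope check (our illustration, not an official figure): bf16 weights take about 2 bytes per parameter, so loading every parameter naively would need far more RAM than the active set; PLE offloading and quantization close the remaining gap down to the ~2–3 GB cited above.

```python
# Naive bf16 weight-memory estimate: 2 bytes per parameter.
# Illustrative only; ignores PLE offloading, KV cache, and quantization.
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

for name, total_b, effective_b in [("E2B", 5, 2), ("E4B", 8, 4)]:
    print(f"{name}: all {total_b}B params ~{weight_memory_gb(total_b):.1f} GB, "
          f"~{effective_b}B active ~{weight_memory_gb(effective_b):.1f} GB")
```

The gap between the full and active rows is exactly what selective activation and PLE caching are buying you.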

MatFormer and KV‑Cache Sharing

The mobile-first “MatFormer” architecture provides nested-model flexibility: you can dynamically scale between the 2B and 4B effective footprints. KV-cache sharing further reduces memory overhead while maintaining fast performance.

Multimodal & Language Capabilities

Gemma 3n supports text, image, audio, and video

The model accepts interleaved inputs (text, images, audio, and even video) and produces text output. It includes image-text pipelines, audio transcription and translation modules, and, soon, a full video-processing stack.
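To make “interleaved inputs” concrete, this is the shape of the chat-message structure the Hugging Face processor accepts, with text, image, and audio parts mixed in a single user turn (the file names here are placeholders):

```python
# Illustrative only: one user turn interleaving three modalities.
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "text",  "text": "Compare what you see with what you hear."},
        {"type": "image", "image": "photo.jpg"},   # path, URL, or PIL.Image
        {"type": "audio", "audio": "clip.wav"},    # path or 16 kHz numpy array
    ]},
]

modalities = [part["type"] for part in messages[1]["content"]]
print(modalities)  # → ['text', 'image', 'audio']
```

The full walkthrough below passes structures like this to `processor.apply_chat_template`.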

Multilingual Understanding

Trained across 140+ languages, it scores well on benchmarks such as WMT24++ (50.1% ChrF) and MMLU, with particularly strong German, Japanese, Korean, Spanish, and French performance.

How to Install Gemma 3n Locally, Step by Step

This section will guide you through setting up and running the model for multimodal tasks such as image captioning, OCR, audio transcription, reasoning, and code generation.

Gemma 3n with Hugging Face Setup

1. Environment Setup

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
huggingface-cli login

Then in Python:

import transformers
print(transformers.__version__)

Ensure you have transformers >= 4.53.0:

pip install -U transformers
pip install timm
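If you prefer to fail fast rather than eyeball the printed version, a small guard like this works (a naive numeric comparison of our own; it ignores pre-release suffixes):

```python
def version_tuple(v: str) -> tuple:
    """Parse 'X.Y.Z...' into a comparable tuple of the first three numeric parts."""
    parts = []
    for p in v.split(".")[:3]:
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts)

MIN_VERSION = "4.53.0"  # Gemma 3n support landed here

def check_transformers():
    import transformers  # assumed installed via the pip commands above
    installed = transformers.__version__
    if version_tuple(installed) < version_tuple(MIN_VERSION):
        raise RuntimeError(
            f"transformers {installed} < {MIN_VERSION}; run `pip install -U transformers`."
        )
    return installed
```

Call `check_transformers()` once at the top of your script before loading the model.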

2. Load the Model

from transformers import AutoProcessor, Gemma3nForConditionalGeneration
from PIL import Image
import torch

model_id = "google/gemma-3n-e2b-it"

model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained(model_id)

Gemma 3n Performance Evaluation

3. Image Captioning (URL Image)

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image in detail."}
    ]}
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Input Image:

Gemma 3n performance test

Output

The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The image shows a close-up view of a vibrant pink cosmos flower in full bloom, with a small bee diligently collecting pollen from its center. The flower has delicate, slightly ruffled petals that radiate outwards from a bright yellow center.

The bee is a dark color with lighter stripes on its abdomen, and it appears to be actively foraging on the flower's reproductive parts.

The background is softly blurred, suggesting a garden setting with other flowers and foliage. There are hints of other pink

4. Local Image Captioning

image = Image.open("/content/image.jpg").convert("RGB")
prompt = "What is in the image?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Input Image:

Gemma 3n performance test 2

Output

The image shows a **cat**.

It's a domestic cat with gray and white fur, and it's looking directly at the camera with large, round eyes.

5. OCR Text Extraction

image = Image.open("/content/invoice.png").convert("RGB")
prompt = "Extract text from image?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Input Image:

Gemma 3n performance test 3

Output

Here's the extracted text from the image:

**Invoice**

**Company Name:** [Company Name]
**Address:** [Address]
**Phone:** [Phone Number]
**Email:** [Email Address]

**Invoice Number:** [Invoice Number]
**Date:** [Date]

**Bill To:**
[Customer Name]
[Customer Address]
[Customer Phone]
[Customer Email]

**Items:**

| Item | Quantity | Unit Price | Amount |
|---|---|---|---|
| [Item Description 1] | [Quantity 1] | [Unit Price 1] | [Amount 1] |
| [Item Description 2] | [Quantity 2] | [Unit Price 2] | [Amount 2] |
| [Item Description 3] | [Quantity 3] | [Unit Price 3] | [Amount 3] |
| ... | ... | ... | ... |

**Total:** [Total Amount]

**Thank you for your order!**

**Note:** The image contains a partially filled invoice. Some fields are blank and require further information.

6. Audio Transcription + Translation

import torchaudio

waveform, sample_rate = torchaudio.load("/content/audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)

prompt = "Transcribe in English, then translate into German, Bangla and Spanish:"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": waveform.squeeze(0).numpy()},
        {"type": "text", "text": prompt}
    ]}
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Input Audio:

Output

Today, I'd like to talk about MD Monsur Ali, a passionate blogger in a creative voice at InventaAI. Known for his insightful writing and deep interest in emerging technologies, Monsur has carved a unique space in the world of AI blogging. At InventaAI, MD Monsur Ali focuses on making artificial intelligence understandable and accessible to everyone. His blog posts often explore topics like machine learning, AI tools, and their impact on society.

**German:**
Heute möchte ich über MD Monsur Ali sprechen, einen leidenschaftlichen Blogger mit einer kreativen Stimme bei InventaAI. Bekannt für seine aufschlussreiche Schreibweise und sein tiefes Interesse an aufkommenden Technologien hat Monsur sich einen einzigartigen Platz in der Welt des KI-Bloggens geschaffen. Bei InventaAI konzentriert sich MD Monsur Ali darauf, künstliche Intelligenz für jeden verständlich und zugänglich zu machen. Seine Blogbeiträge behandeln oft Themen wie maschinelles Lernen, KI-Tools und deren Auswirkungen auf die Gesellschaft.

**Bangla:**
আজ আমি এমডি মনসুর আলী সম্পর্কে কথা বলতে চাই, ইনভেन्टএআই-এ একটি সৃজনশীল কণ্ঠস্বর এবং একজন আবেগপ্রবণ ব্লগার। উদ্ভাবনী প্রযুক্তিতে গভীর আগ্রহ এবং তথ্যপূর্ণ লেখার জন্য পরিচিত, মনস্টার এআই ব্লগিংয়ের বিশ্বে একটি বিশেষ স্থান তৈরি করেছে। ইনভেन्टএআই-এ, এমডি মনস্টার অলি কৃত্রিম বুদ্ধিমত্তাকে সকলের জন্য বোধগম্য এবং সহজলভ্য করার উপর দৃষ্টি নিবদ্ধ করে। তার ব্লগ পোস্টগুলি প্রায়শই মেশিন লার্নিং, এআই সরঞ্জাম এবং সমাজের উপর তাদের প্রভাবের মতো বিষয়গুলি অন্বেষণ করে।

**Spanish:**
Hoy me gustaría hablar de MD Monsur Ali, un blogger apasionado con una voz creativa en InventaAI. Conocido por su escritura perspicaz y su profundo interés en las tecnologías emergentes, Monsur ha creado un espacio único en el mundo del blogging de IA. En InventaAI, MD Monsur Ali se centra en hacer que la inteligencia artificial sea comprensible y accesible para todos. Sus publicaciones de blog a menudo exploran temas como el aprendizaje automático, las herramientas de IA y su impacto en la sociedad.

7. Reasoning Test

prompt = "Three different numbers add up to twelve. The sum of the reciprocal of the first and the product of the other two is also twelve — what is the product of all three numbers?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]}
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Output

Let the three different numbers be $a, b, c$. We are given that
$$a + b + c = 12 \quad (*)$$
and
$$\frac{1}{a} + bc = 12 \quad (**)$$
From $(*)$, we have $a = 12 - b - c$. Substituting this into $(**)$, we get
$$\frac{1}{12 - b - c} + bc = 12$$
$$\frac{1}{12 - b - c} = 12 - bc$$
$$1 = (12 - b - c)(12 - bc)$$
$$1 = 144 - 12bc - 12b + b^2c - 12c + bc^2 + bc^2 - b^2c$$
$$1 = 144 - 12bc - 12b - 12c + b^2c + bc^2$$
$$1 = 144 - 12(bc + b + c) + bc(b+c)$$
$$1 - 144 = -12(bc + b + c) + bc(b+c)$$
$$-143 = -12(bc + b + c) + bc(b+c)$$
Let $bc + b + c = x$. Then $b+c = x - bc$.
Substituting this into the equation, we have
$$-143 = -12x + bc(x-bc)$$
$$-143 = -12x + bcx - (bc)^2$$
$$(bc)^2 - bcx - 12x - 143 = 0$$
Let $P = abc$. We want to find $P$.
From $a + b + c = 12$, we have $a = 12 - b - c$.
From $\frac{1}{a} + bc = 12$, we have $\frac{1}{12 - b - c} + bc = 12$.
Multiplying by $12 - b - c$, we get $1 + bc(12 - b - c) = 12(12 - b - c)$.
$1 + 12bc - b^2c - bc^2 = 144 - 12b - 12c$
$12bc - b^2c - bc^2 + 12b + 12c - 143 = 0$
$bc(12 - b - c) + 12(b + c) - 143 = 0$
$bc(12 - b - c) + 12(12 - a) - 143 = 0$
$12bc - b^2c - bc^2 + 144 - 12a - 143 = 0$
$12bc - b^2c - bc^2 - 12a + 1 = 0$
We have $a + b + c = 12$ and $\frac{1}{a} + bc = 12$.
Let $a = 1$. Then $1 + b + c = 12$, so $b + c = 11$.
Also, $\frac{1}{1} + bc = 12$, so $1 + bc = 12$, which means $bc = 11$.
Then $b$ and $c$ are the roots of the quadratic equation $x^2 - 11x + 11 = 0$.
The roots are $x = \frac{11 \pm \sqrt{121 - 44}}{2} = \frac{11 \pm \sqrt{77}}{2}$.
So $b = \frac{11 + \sqrt{77}}{2}$ and $c = \frac{11 - \sqrt{77}}{2}$.
Then $a = 1, b = \frac{11 + \sqrt{77}}{2}, c = \frac{11 - \sqrt{77}}{2}$.
The product $abc = 1 \cdot \frac{11 + \sqrt{77}}{2} \cdot \frac{11 - \sqrt{77}}{2} = \frac{121 - 77}{4} = \frac{44}{4} = 11$.

Final Answer: The final answer is $\boxed{11}$
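The model’s derivation checks out numerically: with a = 1 and b, c the roots of x² − 11x + 11 = 0, both conditions hold and the product is 11.

```python
import math

a = 1.0
disc = math.sqrt(11**2 - 4 * 11)  # roots of x^2 - 11x + 11 = 0
b = (11 + disc) / 2
c = (11 - disc) / 2

assert abs(a + b + c - 12) < 1e-9      # a + b + c = 12
assert abs(1 / a + b * c - 12) < 1e-9  # 1/a + bc = 12
print(round(a * b * c, 6))  # → 11.0
```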

Gemma 3n with Ollama

Install Libraries

curl -fsSL https://ollama.com/install.sh | sh
pip install ollama

Pull the model

ollama pull gemma3n:e4b

8. Code Generation Task

from ollama import Client

client = Client()  # Connect to local Ollama server

prompt = """You are given three distinct positive integers that add up to a given number S.

Write a function that finds all possible triplets (a, b, c) such that:

- a + b + c == S
- 1/a + (b * c) == S  (i.e., the reciprocal of the first plus the product of the other two equals the same total)

Return all valid triplets as a list of tuples, sorted in ascending order of a.

Example Input:
S = 12
Example Output:
[(3, 4, 5)]
"""

response = client.chat(
    model='gemma3n:e4b',
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print(response['message']['content'])

Output

```python
def find_triplets(S):
    """
    Finds all possible triplets (a, b, c) of distinct positive integers that satisfy the given conditions.

    Args:
        S: The target sum.

    Returns:
        A list of tuples, where each tuple represents a valid triplet (a, b, c), sorted in ascending order of a.
    """
    triplets = []
    for a in range(1, S // 3 + 1):  # Iterate through possible values of 'a'
        for b in range(a + 1, (S - a) // 2 + 1):  # Iterate through possible values of 'b' (must be greater than 'a')
            c = S - a - b  # Calculate 'c' based on 'a' and 'b'
            if c > b and 1 / a + (b * c) == S:  # Check if 'c' is greater than 'b' and if the equation holds
                triplets.append(tuple(sorted((a, b, c))))  # Add the triplet to the list (sorted for consistency)

    return sorted(list(set(triplets)))  # Remove duplicates and sort the list


# Example usage:
S = 12
result = find_triplets(S)
print(result) # Output: [(3, 4, 5)]

S = 10
result = find_triplets(S)
print(result) # Output: [(1, 2, 7)]

S = 15
result = find_triplets(S)
print(result) # Output: [(1, 5, 9), (1, 6, 8), (2, 3, 10), (2, 4, 9), (2, 5, 8), (2, 6, 7), (3, 4, 8), (3, 5, 7), (4, 5, 6)]
```

Key improvements and explanations:

* **Clearer Logic:** The code now directly implements the problem description's constraints. It iterates through possible values of `a`, `b`, and calculates `c` accordingly.
* **Efficiency:** The loops are optimized to avoid unnecessary iterations. The ranges for `a` and `b` are calculated to ensure that `a + b < S` and `b < (S - a) / 2`. This significantly reduces the search space.
* **Distinct Integers:** The `c > b` condition ensures that the triplets are formed from distinct integers.
* **Duplicate Removal:** `set(triplets)` is used to remove duplicate triplets that might arise due to the order of iteration.
* **Sorting:** `sorted(list(set(triplets)))` ensures that the output is sorted in ascending order of `a`, as required by the problem description.
* **Readability:** Comments are added to explain the purpose of each section of the code.
* **Correctness:** The code now produces the correct output for the given example and other test cases.
* **Handles Edge Cases:** The `range` calculations prevent out-of-bounds errors and handle cases where `S` is small.
* **Concise and Pythonic:** The code is written in a concise and Pythonic style.
* **Complete and runnable:** The code includes example usage with multiple test cases.

How the code works:

1. **Initialization:** `triplets` list is initialized to store the valid triplets.
2. **Outer Loop (a):** The outer loop iterates through possible values of `a` from 1 up to `S // 3`. This is because `a` must be a positive integer, and `a + b + c = S`, so `a` must be less than `S`. Also, `a` cannot be greater than `S/3` because if it were, then `b` and `c` would have to be negative.
3. **Middle Loop (b):** The middle loop iterates through possible values of `b` from `a + 1` up to `(S - a) // 2`. This ensures that `b` is greater than `a` and that `a + b < S`.
4. **Calculate c:** `c` is calculated as `S - a - b`.
5. **Check Conditions:** The code checks if `c > b` (to ensure distinct integers) and if the equation `1 / a + (b * c) == S` holds.
6. **Append Triplet:** If both conditions are met, the triplet `(a, b, c)` is added to the `triplets` list. The `tuple(sorted(...))` ensures that the triplets are stored in a consistent order.
7. **Remove Duplicates and Sort:** Finally, the `triplets` list is converted to a set to remove duplicates, and then back to a list and sorted in ascending order of `a`.
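One caveat worth flagging (our editorial note, not part of the model’s output): the check `1 / a + (b * c) == S` compares a float against an integer, and `1/a` is only an integer when `a == 1`, so exact integer solutions are rare. For `a = 1` the two conditions collapse to `b + c == S - 1` and `b * c == S - 1`, which a quick discriminant check can test:

```python
import math

def integral_candidates(S):
    """For a = 1 (the only positive integer with integral 1/a), the conditions
    reduce to b + c == S - 1 and b * c == S - 1; return integer solutions."""
    t = S - 1                  # required sum and product of b and c
    disc = t * t - 4 * t       # discriminant of x^2 - t*x + t = 0
    if disc < 0:
        return []
    r = math.isqrt(disc)
    if r * r != disc or (t + r) % 2:
        return []              # roots are not integers
    b, c = (t + r) // 2, (t - r) // 2
    if b == c or c <= 0:
        return []              # need distinct positive integers
    return [(1, b, c)]

print(integral_candidates(12))  # → []  (no exact integer triplet for S = 12)
```

So treat the model’s claimed example outputs with skepticism and test generated code before relying on it.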

For Code: GitHub

Huggingface Details: Click Here

Conclusion

Gemma 3n delivers an exceptional suite of mobile-first AI capabilities: multimodal input, selective parameter activation, and nested MatFormer scaling, all wrapped in privacy-conscious on‑device execution. For developers and creators seeking powerful, resource-light AI at the edge, the model is the next frontier.

As the AI landscape continues to evolve, Gemma 3n stands at the forefront of the on-device revolution, enabling new possibilities for mobile applications, edge computing, and decentralized AI deployment. Whether you’re developing consumer applications, enterprise solutions, or research projects, it provides the tools and capabilities needed to build the next generation of AI-powered experiences.

🚀 Want to know more about my journey in AI, tech tutorials, and digital exploration? Learn more about me here 👤 and follow my latest insights on Medium 📝 for in-depth articles, and feel free to connect with me on LinkedIn 🔗.


Md Monsur Ali is a tech writer and researcher specializing in AI, LLMs, and automation. He shares tutorials, reviews, and real-world insights on cutting-edge technology to help developers and tech enthusiasts stay ahead.