Gemma 4: Google's Open Multimodal AI You Can Run on Your Own Hardware

Google DeepMind releases Gemma 4 under Apache 2.0 with text, image, and audio support across four model sizes built for local deployment.

A New Standard for Open Multimodal AI

Google DeepMind has released Gemma 4, a family of open-weight multimodal models that represents a significant shift in how developers and businesses can deploy AI locally. Released under the Apache 2.0 license, Gemma 4 is designed from the ground up to run on personal hardware, from Android phones to consumer-grade GPUs, without sending data to the cloud.

The announcement is authored by Clement Farabet, VP of Research at Google DeepMind, and Olivier Lacombe, Group Product Manager, and it arrives at a moment when demand for on-device AI has never been higher. With over 400 million downloads across previous Gemma generations and more than 100,000 community variants in the "Gemmaverse," Google is doubling down on its open model strategy.

What makes Gemma 4 noteworthy is not just performance. It is the combination of multimodal capabilities, permissive licensing, and genuine edge deployment feasibility packed into models small enough to run on a laptop.

The Model Lineup: Four Sizes, Two Architectures

Gemma 4 ships as four distinct models spanning two architecture families, each targeting different deployment scenarios.

Dense Models

The smallest models, E2B and E4B, are purpose-built for edge deployment. E2B carries 2.3 billion effective parameters (5.1 billion with embeddings) across 35 layers, while E4B scales to 4.5 billion effective parameters (8 billion with embeddings) across 42 layers. Both support text, image, and native audio input, making them the most versatile models in the family for on-device applications.

The 31B dense model is the heavyweight, with 30.7 billion parameters across 60 layers. It targets workstation-class inference and supports text and image input with a 256K token context window. This is the model that competes directly with much larger open-weight alternatives from other providers.

Mixture-of-Experts Model

The 26B A4B model introduces a Mixture-of-Experts (MoE) architecture with 25.2 billion total parameters but only 3.8 billion active parameters during inference. It uses a configuration of 128 total experts with 8 active per token plus 1 shared expert, and supports text and image input with a 256K context window.

This design offers a compelling throughput advantage: inference speed approaches that of a 4B model while quality approaches that of a 25B model. Developers should understand the central MoE tradeoff, however: because the router can dispatch any token to any expert, all 25.2 billion parameters must be resident in memory, even though only a fraction activates per token.
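The tradeoff can be made concrete with a back-of-envelope sketch: per-token compute tracks active parameters, while resident memory tracks total parameters. The 2-FLOPs-per-parameter rule is a generic approximation, not a Gemma-specific figure.

```python
def per_token_gflops(active_params_b: float) -> float:
    """Approximate forward-pass compute per token, in GFLOPs
    (~2 FLOPs per active parameter, a standard rule of thumb)."""
    return 2 * active_params_b

def resident_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weights that must stay loaded, regardless of expert routing."""
    return total_params_b * bytes_per_param

# Figures from the article: 26B A4B MoE (25.2B total / 3.8B active)
# vs. the 31B dense model (30.7B parameters), both at BF16 (2 bytes/param).
moe_compute = per_token_gflops(3.8)        # scales with ACTIVE parameters
dense_compute = per_token_gflops(30.7)
moe_memory = resident_memory_gb(25.2, 2)   # scales with TOTAL parameters
```

The sketch slightly overshoots the 48 GB BF16 figure Google reports for the MoE model (embedding handling and packing details differ), but the asymmetry is the point: compute drops roughly 8x versus the dense model while memory barely moves.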

Memory Requirements and Hardware Planning

Google provides transparent memory requirements for base weights at different precision levels, which is essential for deployment planning.

For the edge models, E2B requires as little as 3.2 GB at Q4_0 quantization, making it viable for mobile devices. E4B needs 5 GB at Q4_0, still comfortable on modern smartphones. The 31B dense model demands 17.4 GB at Q4_0 or 58.3 GB at BF16, putting it in workstation or high-end consumer GPU territory. The 26B MoE model needs 15.6 GB at Q4_0 or 48 GB at BF16.

These figures cover weights only. Runtime overhead, including the KV cache for long context windows, activation buffers, and framework memory, adds to the total. Developers building long-context applications should budget significantly more memory than the base weight numbers suggest.
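A rough KV-cache estimate shows why. The layer count and context length below come from the specs above; the KV head count (8) and head dimension (128) are illustrative assumptions, since the article does not state them.

```python
def kv_cache_gb(num_layers: int, context_tokens: int,
                kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Worst-case KV-cache size if every layer cached the full context
    (factor of 2 for keys and values; bytes_per_value=2 assumes BF16)."""
    per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 1e9

# 31B dense model: 60 layers at the full 256K-token context
worst_case = kv_cache_gb(60, 256_000, 8, 128)
```

Gemma 4's hybrid attention caches only a sliding window on local layers, so the real footprint sits well below this worst case, but the estimate makes the lesson clear: at full context, the cache can rival the weights themselves.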

Multimodal Capabilities Across the Family

Every Gemma 4 model processes text and images natively. The smaller E2B and E4B models add native audio input, which is uncommon among open-weight models at this scale.

The vision encoder parameters vary by model size: approximately 150 million parameters for E2B and E4B, and 550 million for the larger 26B and 31B models. The audio encoder on the edge models adds approximately 300 million parameters each.

Google's launch materials claim all models natively process video and images. In practice, video support works through frame extraction rather than native temporal understanding, treating video as a sequence of image inputs. This is still useful for applications like video summarization or visual search, but developers should not expect true video-native comprehension.
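Since video support is frame extraction in practice, the client-side pattern is straightforward: sample evenly spaced frames and submit them as image inputs. The sampling below is pure arithmetic; how the frames are then fed to the model depends on your inference stack.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices across a clip."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# e.g. a 10-second clip at 30 fps, sampled down to 8 image inputs
indices = sample_frame_indices(300, 8)
```

Each selected frame is then passed to the model as an ordinary image, which is why summarization works but fine-grained temporal reasoning (motion, ordering of fast events) does not.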

[Figure: Gemma 4 model family]

Benchmark Performance: Punching Above Weight Class

Gemma 4 posts strong benchmark results that justify Google's "intelligence-per-parameter" positioning.

On MMLU Pro, the 31B model scores 85.2%, the 26B MoE hits 82.6%, and even the E4B manages 69.4%. For comparison, the previous-generation Gemma 3 27B scored 67.6%. On AIME 2026 (without tools), the 31B reaches 89.2% and the 26B MoE scores 88.3%, dramatic improvements over Gemma 3's 20.8%.

Coding performance is similarly strong. On LiveCodeBench v6, the 31B scores 80.0% and achieves a Codeforces ELO of 2150. The 26B MoE manages 77.1% and 1718 ELO respectively. Gemma 3 27B had an ELO of just 110, making this a generational leap.

For agentic and tool-use tasks, the Tau2 benchmark average shows the 31B at 76.9% and 26B MoE at 68.2%, compared to Gemma 3's 16.2%. This matters because native function calling and structured output support are baked into the architecture, not bolted on through prompting.

On third-party leaderboards, Google claims the 31B model ranks third and the 26B ranks sixth among open models on Arena AI's text leaderboard, with ELO ratings of 1452 and 1441 respectively. Google states the larger models outperform competitors "20 times their size" in Arena-style comparisons.

What Apache 2.0 Licensing Actually Means

Previous Gemma generations used more restrictive "community" licenses. The move to Apache 2.0 is a deliberate strategic shift that carries real practical consequences.

Apache 2.0 allows commercial use, modification, and redistribution with minimal restrictions. Businesses can embed Gemma 4 in proprietary products, fine-tune it for internal use, and deploy it in regulated environments without worrying about license compliance beyond standard attribution. This eliminates a significant barrier for enterprise adoption and on-premises deployment in industries like healthcare, finance, and government where licensing scrutiny is intense.

Google explicitly frames this as a response to developer demand for "digital sovereignty" over data, infrastructure, and deployment choices. In an era where data residency and AI governance are board-level concerns, permissive licensing is a competitive advantage rather than just a philosophical choice.

Architecture Details for Developers

Under the hood, Gemma 4 introduces several architectural features that matter for application development.

The hybrid attention mechanism interleaves local sliding-window attention with global full-context attention, with the final layer always using global attention. Global layers use unified keys and values with proportional RoPE (p-RoPE) for handling long contexts efficiently. Sliding window sizes are 512 tokens for E2B/E4B and 1024 tokens for the larger models.
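A sketch of what that layer schedule and its attention spans look like. The local-to-global interleave ratio is not stated above, so the 5:1 pattern here is purely illustrative; the final-layer-global rule and the window sizes follow the description.

```python
def layer_kinds(num_layers: int, locals_per_global: int = 5) -> list[str]:
    """Interleave local sliding-window layers with global layers.
    The 5:1 ratio is an assumption for illustration only."""
    kinds = ["global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"  # final layer always uses global attention
    return kinds

def visible_span(position: int, kind: str, window: int = 512) -> range:
    """Key positions a query at `position` may attend to (causal).
    window=512 matches E2B/E4B; the larger models use 1024."""
    if kind == "global":
        return range(0, position + 1)
    return range(max(0, position - window + 1), position + 1)
```

The memory payoff falls out of `visible_span`: local layers only ever need the last `window` tokens of KV cache, so only the global layers pay full-context cost.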

Gemma 4 introduces native support for the system role in conversations, native function calling, and a configurable "thinking mode" using a dedicated think token. These features reduce the prompt engineering fragility that has plagued tool-using agents built on earlier models.
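With native system-role and function-calling support, a request can carry structured tool schemas instead of prompt-engineered instructions. The example below uses the widely adopted OpenAI-style chat schema purely as an illustration; Gemma 4's concrete prompt template and tool-call wire format are not specified above, and `get_weather` is a hypothetical tool.

```python
import json

# Hypothetical tool definition in JSON-Schema style (illustrative only)
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Native system-role support means instructions live in a dedicated
# message rather than being prepended to the user turn.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What's the weather in Oslo?"},
]

payload = json.dumps({"messages": messages, "tools": [get_weather]})
```

Because role handling and tool schemas are part of the model's training rather than a prompting convention, malformed-call rates and template drift, the usual failure modes of bolted-on tool use, should be much lower.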

Distribution and Ecosystem

Google distributes Gemma 4 through multiple channels: Hugging Face, Kaggle, Ollama, Google AI Studio, and the AI Edge Gallery for mobile deployment. The Android AICore Developer Preview enables on-device inference for Android applications.

Framework support includes vLLM, llama.cpp, MLX, and other popular inference engines. This broad distribution reduces friction for developers who want to experiment quickly, and the existing Gemmaverse community of fine-tuners and integrators ensures rapid ecosystem growth.

What This Means for App Developers

For teams building AI-powered applications, Gemma 4 changes the calculus of build versus buy. Running a capable multimodal model locally means you can process sensitive documents, images, and audio without sending data to external APIs. The privacy implications are significant: user data never leaves the device or the company's infrastructure.

The edge models (E2B and E4B) open up entirely new categories of mobile and IoT applications where connectivity is unreliable or latency requirements are strict. A 3.2 GB model that understands text, images, and audio can power real-time translation, document scanning, accessibility features, and conversational interfaces without a network connection.

For developers building productivity tools, these advances translate into tangible user benefits. Email clients and communication platforms can leverage local multimodal models to understand attachments, classify messages, and draft contextual responses without cloud round-trips. Tools like Maylee already demonstrate this approach, using AI to auto-classify incoming emails and draft replies that match the sender's writing style, and open models like Gemma 4 make it possible to bring similar intelligence to entirely new categories of applications.

The MoE architecture of the 26B model offers a particularly interesting sweet spot for developers who need strong reasoning capabilities but want to serve multiple concurrent users. With only 3.8 billion active parameters per inference pass, throughput per GPU can be significantly higher than with a comparably accurate dense model.

[Figure: Gemma 4 cloud vs. on-device deployment]

The Competitive Landscape

Gemma 4 enters a crowded field of open-weight models. Alibaba's Qwen family, Meta's Llama, DeepSeek, and Mistral all compete for developer attention. Chinese open models from Zhipu/GLM, Moonshot/Kimi, and MiniMax often offer larger parameter counts.

Where Gemma 4 differentiates is in the combination of factors: Apache 2.0 licensing, genuine edge deployment capability with audio support, strong benchmarks relative to parameter count, and the backing of Google's distribution infrastructure. No single competitor currently matches all of these attributes simultaneously.

For teams evaluating local deployment options at the 30B parameter scale with permissive licensing, Gemma 4's 31B model is a top contender. For latency-sensitive applications where active parameter count matters more than total model size, the 26B MoE offers a unique proposition. And for on-device applications requiring audio understanding, the E2B and E4B models occupy a niche that few open-weight competitors address.

Looking Ahead

Gemma 4 is not just a model release. It is a signal that Google views the open-weight ecosystem as a strategic priority, not a side project. The Apache 2.0 licensing, the edge-native design, and the multimodal breadth all point toward a future where capable AI runs everywhere, from data centers to phones to embedded devices.

For app developers, the practical takeaway is clear: the barrier to building AI-native applications has dropped again. The models are free, the license is permissive, and the hardware requirements are increasingly reasonable. The question is no longer whether you can run multimodal AI locally, but what you will build with it.

Frequently Asked Questions

What license does Gemma 4 use and can I use it commercially?

Gemma 4 is released under the Apache 2.0 license, which allows commercial use, modification, and redistribution with minimal restrictions. This is a significant change from previous Gemma generations that used more restrictive community licenses.

How much memory do I need to run Gemma 4 locally?

Memory requirements vary by model and precision. The E2B model needs as little as 3.2 GB at Q4_0 quantization, E4B needs 5 GB, the 26B MoE needs 15.6 GB, and the 31B dense model needs 17.4 GB. These figures cover weights only and exclude runtime overhead like KV cache.

What modalities does Gemma 4 support?

All Gemma 4 models support text and image input with text output. The smaller E2B and E4B edge models also support native audio input. Video is processed as a sequence of image frames rather than through native temporal understanding.

What is the difference between the MoE and dense models?

The 26B A4B MoE model has 25.2 billion total parameters but only activates 3.8 billion per inference pass, offering faster throughput. The 31B dense model activates all 30.7 billion parameters every time, delivering higher quality but requiring more compute per token.

Can Gemma 4 run on a smartphone?

Yes. The E2B model at Q4_0 quantization requires only 3.2 GB of memory, making it feasible on modern smartphones. Google distributes it through the AI Edge Gallery and Android AICore Developer Preview for mobile deployment.

How does Gemma 4 compare to previous Gemma models?

Gemma 4 shows dramatic improvements over Gemma 3 27B across all benchmarks. For example, Codeforces ELO jumped from 110 to 2150 for the 31B model, AIME 2026 scores went from 20.8% to 89.2%, and agentic task performance on Tau2 increased from 16.2% to 76.9%.

What context length does Gemma 4 support?

The E2B and E4B edge models support 128K token contexts. The larger 26B MoE and 31B dense models support 256K token contexts.

Where can I download Gemma 4?

Gemma 4 weights are available through Hugging Face, Kaggle, Ollama, Google AI Studio, and the Google AI Edge Gallery. Framework support includes vLLM, llama.cpp, MLX, and other popular inference engines.

© 2026 Maylee. All rights reserved.