Google Previews “Gemini 3.1 Flash-Lite,” Its Fastest and Most Cost-Effective AI Model
On March 3, 2026, Google DeepMind released a preview of “Gemini 3.1 Flash-Lite,” the most cost-effective and fastest lightweight model in the Gemini 3 series.
Designed for applications requiring high-volume API requests and real-time processing, it achieves significant speed improvements and cost reductions compared to previous generation models.
Cost Performance and Basic Specifications
This model is available through Google AI Studio for developers and Vertex AI for enterprise users.
- Pricing: $0.25 per 1 million input tokens and $1.50 per 1 million output tokens, far lower than the rates for higher-tier models.
- Context Window: Supports up to 1,048,576 (approximately 1 million) input tokens, allowing for the processing of long texts, images, audio, video, and PDF files.
- Maximum Output: Capable of outputting up to 65,536 text tokens in a single request.
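At these rates, per-request cost is simple arithmetic over token counts. The helper below is purely illustrative (it is not part of any Google SDK) and uses the preview prices quoted above:

```python
# Preview rates quoted above (USD per 1 million tokens)
INPUT_RATE = 0.25
OUTPUT_RATE = 1.50

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed preview rates."""
    return (input_tokens / 1_000_000) * INPUT_RATE + \
           (output_tokens / 1_000_000) * OUTPUT_RATE

# A fairly large request: 200k input tokens, 8k output tokens
print(f"${request_cost(200_000, 8_000):.4f}")  # → $0.0620
```

Even a request using a fifth of the context window costs only a few cents, which is what makes the high-volume use cases below plausible.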
Improved Processing Speed and Benchmark Performance
Despite being a lightweight model, Gemini 3.1 Flash-Lite maintains high reasoning capabilities and multimodal performance.
- Faster Response Times: Compared to the previous Gemini 2.5 Flash, the Time To First Token (TTFT) is 2.5 times faster, and overall output speed has improved by 45%.
- Benchmark Results: It scored 86.9% on GPQA Diamond (which measures expert-level reasoning) and 76.8% on MMMU Pro (which includes image analysis), surpassing the scores of previous generation large models (such as Gemini 2.5 Flash).
“Thinking Levels” to Control Reasoning Depth Based on Tasks
The model includes a built-in feature that lets developers explicitly control how deeply the AI reasons about each request.
- Four-Tier Reasoning Adjustment: Depending on the task, users can select from four thinking levels: “minimal,” “low,” “medium,” and “high.”
- Resource Optimization: Lowering the thinking level minimizes latency for simple tasks that need real-time responses, while raising it improves accuracy on tasks involving complex conditional branching or UI generation.
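In practice, an application might decide the thinking level per task category before issuing the API call. The mapping below is a hypothetical policy of our own; only the four level names ("minimal", "low", "medium", "high") come from the model's documentation, and the task categories are assumptions:

```python
# Hypothetical policy mapping task categories to the four thinking levels.
# The category names are illustrative; only the level names are from the spec.
THINKING_POLICY = {
    "realtime_chat": "minimal",    # latency-critical, simple responses
    "classification": "low",
    "data_extraction": "medium",
    "ui_generation": "high",       # complex conditional logic
}

def thinking_level(task: str) -> str:
    """Return the thinking level for a task category, defaulting to 'low'."""
    return THINKING_POLICY.get(task, "low")

print(thinking_level("ui_generation"))  # → high
print(thinking_level("realtime_chat"))  # → minimal
```

The resulting string would then be passed to the API's thinking-level setting when building the request.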
Primary Anticipated Use Cases
Due to its low latency and low cost, it is optimized for high-frequency and large-scale processing, such as:
- Real-Time Translation and Text Classification: Instantly translating and classifying massive chat logs, customer support tickets, and user reviews.
- Structured Data Extraction: Building pipelines to extract specific entities from documents like receipts and specifications, and stably outputting them in JSON format.
- Model Routing: Acting as an “orchestrator” at the frontend of an application by receiving user input first, immediately answering simple questions, and routing only tasks that require advanced reasoning to higher-tier Pro models.
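The routing pattern in the last bullet can be sketched as a thin dispatcher in front of two models. The model identifiers and the escalation heuristic below are illustrative placeholders, not Google's actual routing logic (in a real deployment, Flash-Lite itself would typically make the routing decision):

```python
# Illustrative router: short, simple queries stay on the lightweight model;
# long or complexity-flagged queries escalate to a higher-tier model.
# Model names and the heuristic are stand-ins, not a real API contract.
LITE_MODEL = "gemini-3.1-flash-lite"   # hypothetical model identifier
PRO_MODEL = "gemini-pro"               # hypothetical higher-tier identifier

COMPLEX_HINTS = ("prove", "design", "refactor", "multi-step")

def route(query: str) -> str:
    """Pick a model id from a crude complexity estimate of the query."""
    is_complex = len(query.split()) > 50 or any(
        hint in query.lower() for hint in COMPLEX_HINTS
    )
    return PRO_MODEL if is_complex else LITE_MODEL

print(route("What time is it in Tokyo?"))      # → gemini-3.1-flash-lite
print(route("Design a multi-step migration"))  # → gemini-pro
```

The appeal of this pattern is economic: the cheap, fast model answers the bulk of traffic instantly, and only the minority of genuinely hard requests pay the latency and cost of a Pro-tier call.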
