Gemini 3.1 Flash-Lite

3.1 Flash-Lite · Family: Gemini

Gemini 3.1 Flash-Lite is the most cost-efficient thinking model in Google DeepMind's Gemini 3 series, designed for high throughput and low latency while retaining reasoning quality.

Preview · Limited access · LLM · Multimodal · Reasoning model · Tool-using model · Gemini

Context window: 1M tokens

Max output: 65,536 tokens

Release date: 29 April 2026

Research lab: Google DeepMind · Owner: Google

Access: API, Hosted · Deployment: Cloud

Overview

Gemini 3.1 Flash-Lite is an AI model developed by Google DeepMind, announced on April 29, 2026 as part of the Gemini 3.1 family. It is a scalable thinking model designed for high-volume tasks at low cost and latency.

The model supports flexible reasoning levels, allowing users to select how much thinking to apply per task. It has a 1M-token context window and a maximum output of 65,536 tokens, and it supports function calling, structured output, search as a tool, and code execution.

The model is available via Google AI Studio, the Gemini API, and Vertex AI. Its knowledge cutoff is January 2025, and its listed output speed is 363 tokens/s. It has the lowest price in the Gemini 3 series: $0.25 per 1M input tokens and $1.50 per 1M output tokens.
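The pricing and speed figures above translate into a simple back-of-the-envelope estimate. The sketch below is only an illustration: the function names are made up for this example, the rates are the ones quoted on this page (preview pricing may change), and the time estimate ignores time to first token and any thinking overhead.

```python
# Figures quoted in the overview above; treat them as assumptions, not guarantees.
INPUT_USD_PER_MILLION = 0.25
OUTPUT_USD_PER_MILLION = 1.50
OUTPUT_TOKENS_PER_SECOND = 363


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request in US dollars."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


def estimate_generation_seconds(output_tokens: int) -> float:
    """Return the approximate streaming time for the output tokens."""
    return output_tokens / OUTPUT_TOKENS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workload: 100,000 input tokens and 2,000 output tokens.
    print(f"cost: ${estimate_cost_usd(100_000, 2_000):.4f}")
    print(f"time: {estimate_generation_seconds(2_000):.1f} s")
```

For the hypothetical workload in the example, the input side dominates the cost (25,000 of the 28,000 token-price units), which is typical for document-analysis style requests.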

Classification

LLM · Multimodal · Reasoning model · Tool-using model

Family: Gemini

Applications

Coding, Content generation, Document analysis, Workflow automation, Writing assistance, Q&A / Question answering, Data analysis, Translation

Access & deployment

API, Hosted

Cloud

Weights: Closed

Key parameters

📏 Context: 1M

✓ Tools

📥 Input: text, image, audio, video…

Platforms

Vertex AI

Technical specification

Context window: 1M tokens

Max output tokens: 65,536 tokens per response

Knowledge cutoff: 1 Jan 2025

License: proprietary

Hardware requirements

Available only through Google cloud infrastructure (Gemini API, Vertex AI, Google AI Studio).

Features: Tool use

Modalities

Input: text, image, audio, video, documents

Output: text, code

Capabilities and applications

Native model capabilities

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Multi-step reasoning

Carrying out multi-step chains of reasoning across long, complex tasks.

Category: reasoning

Long context

Support for large context windows of tens to hundreds of thousands (or millions) of input tokens. This enables analysis of entire codebases, long documents, and many parallel conversations without losing earlier information. For comparison, GPT-5.1 supports 400,000 tokens, while Gemini 3.1 Flash-Lite is listed with a 1M-token context window.

Category: language
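A quick way to reason about long-context workloads is to check a prompt against the limits listed on this page before sending it. The sketch below is an assumption-laden illustration: it treats "1M" as exactly 1,000,000 input tokens (the precise limit may differ), treats the output cap as separate from the input limit, and uses hypothetical helper names.

```python
# Limits as listed on this page; the exact values enforced by the service may differ.
CONTEXT_WINDOW_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 65_536


def fits_context(prompt_tokens: int) -> bool:
    """Return True if the prompt fits in the listed context window."""
    return 0 <= prompt_tokens <= CONTEXT_WINDOW_TOKENS


def clamp_output(requested_tokens: int) -> int:
    """Cap a requested output length at the listed maximum."""
    return max(0, min(requested_tokens, MAX_OUTPUT_TOKENS))


def chunks_needed(total_tokens: int, chunk_budget: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Return how many requests are needed to cover a corpus of total_tokens."""
    if chunk_budget <= 0:
        raise ValueError("chunk_budget must be positive")
    # Ceiling division without floating point.
    return -(-max(0, total_tokens) // chunk_budget)


if __name__ == "__main__":
    # Hypothetical corpus of 2.5M tokens split across full-window requests.
    print(fits_context(2_500_000), chunks_needed(2_500_000), clamp_output(100_000))
```

In practice a chunk budget somewhat below the full window leaves room for instructions and tool definitions; the default here uses the full window only to keep the arithmetic easy to follow.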

Multimodal understanding

Category: multimodal

Coding

Generating, analysing and modifying code in many programming languages. Covers writing functions, debugging, refactoring, code review, and creating tests. Measured by benchmarks such as HumanEval and SWE-bench.

Category: coding

Function Calling

Category: planning

Structured output

Producing data in structured formats such as JSON.

Category: structured_generation
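Structured output is usually consumed by parsing and validating the reply before using it. The following is a minimal sketch using only the Python standard library; the field names and the sample reply are hypothetical, and it does not call any Gemini API.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "year": int}


def parse_structured_reply(raw: str) -> dict:
    """Parse a JSON reply and check that the required fields are present and typed."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return data


if __name__ == "__main__":
    # Made-up sample reply, standing in for a model response.
    sample = '{"title": "Quarterly report", "year": 2026}'
    print(parse_structured_reply(sample))
```

Validating on the client side is a design choice: even when a service constrains output to a schema, a local check keeps downstream code from trusting malformed or truncated replies.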

Audio understanding

Category: audio

Image understanding

Analysing and interpreting the content of images.

Category: vision

Video Understanding

Category: video

Chart understanding

Reading and interpreting charts, tables and diagrams.

Category: vision

Multilingual

Competence in many natural languages (from a few to over a hundred): understanding, generation, translation, and code-switching within a single conversation. Frontier models support a wide range of languages with comparable quality.

Category: language

Streaming output

Category: reasoning

Application domains

Coding, Content generation, Document analysis, Workflow automation, Writing assistance, Q&A / Question answering, Data analysis, Translation

Benchmark results

11 benchmarks

Humanity's Last Exam

accuracy · No tools, Gemini 3.1 Flash-Lite High

16.0%