Google Gemini 2.5 Flash-Lite (Beta)

Important

Pre-General Availability: 2025-08-29

The Gemini 2.5 Flash-Lite model (google.gemini-2.5-flash-lite) is the fastest and most budget-friendly multimodal reasoning model in the 2.5 family, optimized for low latency. Gemini 2.5 Flash and Gemini 2.5 Flash-Lite are both efficient models: Flash-Lite is optimized for lower cost and faster performance on high-volume, less complex tasks, whereas Gemini 2.5 Flash balances speed and intelligence for more complex applications.

Available in This Region

  • US East (Ashburn) (on-demand only)
Important

External Calls

The Google Gemini 2.5 models that can be accessed through the OCI Generative AI service are hosted externally by Google. Therefore, a call to a Google Gemini model through the OCI Generative AI service results in a call to a Google location.

Key Features

  • Model Name in OCI Generative AI: google.gemini-2.5-flash-lite
  • Available On-Demand: Access this model on-demand, through the Console playground or the API.
  • Multimodal Support: Input text, code, and images and get a text output. File inputs such as audio, video, and document files aren't supported. See Limits for the types and sizes of image inputs.
  • Knowledge: Has deep domain knowledge in science, mathematics, and code.
  • Context Length: One million tokens
  • Maximum Input Tokens: 1,048,576 (Console and API)
  • Maximum Output Tokens: 65,536 (default) (Console and API)
  • Excels at These Use Cases: General-purpose, high-throughput, cost-sensitive tasks that don't require complex reasoning, such as classification, translation, and intelligent routing. Examples include handling customer support inquiries and summarizing large-scale documents.
  • Has Reasoning: Yes. Includes text and visual reasoning and image understanding. For reasoning problems, increase the maximum output tokens. See Model Parameters.
  • Knowledge Cutoff: January 2025

See the following table for the features supported in the Google Vertex AI Platform (Beta) for OCI Generative AI, with links to each feature.

Supported Gemini 2.5 Flash-Lite Features
Feature | Supported?
Code execution | Yes
Tuning | No
System instructions | Yes
Structured output | Yes
Batch prediction | No
Function calling | Yes
Count Tokens | No
Thinking | No
Context caching | Yes, the model can cache the input tokens, but this feature isn't controlled through the API.
Vertex AI RAG Engine | No
Chat completions | Yes
Grounding | No

For key feature details, see the Google Gemini 2.5 Flash-Lite documentation.

Limits

Complex Prompts
The Gemini 2.5 Flash-Lite (Beta) model has its thinking process turned off to prioritize speed and cost, so it's not suited for complex tasks. For complex tasks, we recommend using the Google Gemini 2.5 Pro (Beta) model.
Image Inputs
  • Console: Upload one or more .png or .jpg images, each 5 MB or smaller.
  • API: Submit a base64-encoded version of an image. For example, a 512 x 512 image typically converts to around 1,610 tokens. Supported MIME types are image/png, image/jpeg, and image/webp.
    • Maximum images per prompt: 3,000
    • Maximum image size before encoding: 7 MB
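
The API limits above can be checked on the client before a request is sent. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and reading "7 MB" as 7 × 1024 × 1024 bytes is an assumption, not something the limits state.

```python
import base64

# Limits taken from the Image Inputs list above.
SUPPORTED_MIME_TYPES = {"image/png", "image/jpeg", "image/webp"}
MAX_IMAGE_BYTES = 7 * 1024 * 1024  # assumption: "7 MB" read as 7 MiB
MAX_IMAGES_PER_PROMPT = 3000


def encode_image(raw: bytes, mime_type: str) -> str:
    """Validate one image against the limits and return its base64 text."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    if len(raw) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 7 MB limit before encoding")
    return base64.b64encode(raw).decode("ascii")


def encode_images(images: list[tuple[bytes, str]]) -> list[str]:
    """Encode a batch of (bytes, mime_type) pairs intended for one prompt."""
    if len(images) > MAX_IMAGES_PER_PROMPT:
        raise ValueError("too many images for one prompt")
    return [encode_image(raw, mime) for raw, mime in images]
```

How the encoded string is placed in the request body depends on the API you call, so that step is left out here.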

On-Demand Mode

You can reach the pretrained foundational models in Generative AI through two modes: on-demand and dedicated. Here are key features for the on-demand mode:
  • You pay as you go for each inference call when you use the models in the playground or when you call the models through the API.

  • Low barrier to start using Generative AI.
  • Great for experimenting, proof of concepts, and evaluating the models.
  • Available for the pretrained models in regions not listed as (dedicated AI cluster only).
Tip

To ensure reliable access to Generative AI models in the on-demand mode, we recommend implementing a back-off strategy, which involves delaying requests after a rejection. Without one, repeated rapid requests can lead to further rejections over time, increased latency, and potential temporary blocking of the client by the Generative AI service. A back-off strategy, such as exponential back-off, distributes requests more evenly, reduces load, and improves retry success. This follows industry best practices and improves the overall stability and performance of your integration with the service.
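
One way to implement an exponential back-off is sketched below. It is illustrative only: RequestRejected is a hypothetical exception standing in for whatever error your client raises on a rejected request, and the retry count, base delay, cap, and use of random jitter are assumptions rather than values recommended by the service.

```python
import random
import time


class RequestRejected(Exception):
    """Hypothetical error standing in for a rejected or throttled request."""


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Upper bound of the wait before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_backoff(send, max_retries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call send(); after a rejection, wait a random time up to the bound, then retry."""
    for bound in backoff_delays(max_retries, base, cap):
        try:
            return send()
        except RequestRejected:
            # Random jitter spreads retries so that many clients
            # don't all retry at the same moment.
            sleep(random.uniform(0, bound))
    return send()  # final attempt; any error propagates to the caller
```

With the defaults, the wait bounds grow as 1, 2, 4, 8, and 16 seconds, so a request is attempted at most six times before the error is surfaced.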

Note

The Gemini models are available only in the on-demand mode.
Model Name | OCI Model Name | Getting Access
Gemini 2.5 Flash-Lite (Beta) | google.gemini-2.5-flash-lite | Contact Oracle Beta Programs

Release Date

Model | Beta Release Date | On-Demand Retirement Date | Dedicated Mode Retirement Date
google.gemini-2.5-flash-lite | 2025-08-29 | Tentative | This model isn't available for the dedicated mode.
Important

To learn about OCI Generative AI model deprecation and retirement, see Retiring the Models.

Model Parameters

To change the model responses, you can change the values of some parameters in the playground or the API.

Maximum output tokens

The maximum number of tokens that you want the model to generate for each response. Estimate four characters per token. Because you're prompting a chat model, the response depends on the prompt, and each response doesn't necessarily use up the maximum allocated tokens. For the Gemini 2.5 model series, the maximum output is 65,536 tokens (default) for each run.

Tip

For large inputs with difficult problems, set a high value for the maximum output tokens parameter.
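
The four-characters-per-token estimate can be turned into a quick sizing check. This is a rough sketch with hypothetical helper names. Actual token counts depend on the model's tokenizer, so treat the results as approximations.

```python
import math

MAX_OUTPUT_TOKENS = 65_536  # limit stated above
CHARS_PER_TOKEN = 4         # rough estimate stated above


def estimate_tokens(text: str) -> int:
    """Approximate the number of tokens in a piece of text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def approx_max_chars(max_tokens: int = MAX_OUTPUT_TOKENS) -> int:
    """Approximate how many characters a token budget can hold."""
    return max_tokens * CHARS_PER_TOKEN
```

Under this estimate, the 65,536-token maximum corresponds to roughly 262,144 characters of output.
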
Temperature

The level of randomness used to generate the output text. Min: 0, Max: 2, Default: 1

Tip

Start with the temperature set to 0 or to a value less than 1, and increase the temperature as you regenerate the prompts for a more creative output. High temperatures can introduce hallucinations and factually incorrect information.
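
To see why a higher temperature produces more random output, consider how temperature is commonly applied to token scores before sampling. The sketch below shows that general technique (a softmax over scores divided by the temperature). It is not a description of the Gemini implementation, and the function name is hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature == 0:
        # Limit case: all probability goes to the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A low temperature concentrates probability on the most likely token, and a high temperature flattens the distribution so that less likely tokens are picked more often.
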
Top p

A sampling method that controls the cumulative probability of the top tokens to consider for the next token. Assign p a decimal number between 0 and 1 for the probability. For example, enter 0.75 for the top 75 percent to be considered. Set p to 1 to consider all tokens.
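
The cumulative-probability cutoff can be illustrated with a small sketch. It keeps the most likely tokens until their combined probability reaches p, then renormalizes. The function name and the toy distribution are hypothetical, and this is the general technique rather than the model's actual implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1.
    return {token: prob / total for token, prob in kept.items()}


toy = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
print(sorted(top_p_filter(toy, 0.75)))  # tokens that survive at p = 0.75
```

In this toy distribution, p = 0.75 keeps only "the" and "a", and p = 1 keeps all four tokens.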

Top k

A sampling method in which the model chooses the next token randomly from the top k most likely tokens. In the Gemini 2.5 models, the top k has a fixed value of 64, which means that the model considers only the 64 most likely tokens (words or word parts) for each step of generation. The final token is then chosen from this list.
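
The top k step can be sketched as follows. The value 64 comes from the text above. The function name is hypothetical, and weighting the random choice by each token's probability is an assumption about how the final token is drawn.

```python
import random

TOP_K = 64  # fixed value stated above for the Gemini 2.5 models


def top_k_sample(probs, k=TOP_K, rng=random):
    """Pick the next token at random from the k most likely candidates."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in ranked]
    weights = [weight for _, weight in ranked]
    # Assumption: candidates are drawn in proportion to their probability.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With k set to 1, the function always returns the single most likely token, which makes the effect of the cutoff easy to check.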

Number of Generations (API only)

The numGenerations parameter in the API controls how many different response options the model generates for each prompt.

  • When you send a prompt, the Gemini model generates a set of possible answers. By default, it returns only the response with the highest probability (numGenerations = 1).
  • If you set the numGenerations parameter to a value from 2 to 8, the model generates that many distinct responses.
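
The parameter ranges in this section can be validated before a request is built. The following sketch is illustrative: numGenerations is the parameter name given above, but the other dictionary keys and the function name are hypothetical, so check them against the API reference before use.

```python
MAX_OUTPUT_TOKENS = 65_536  # maximum stated in this section


def build_params(max_tokens=MAX_OUTPUT_TOKENS, temperature=1.0,
                 top_p=None, num_generations=1):
    """Validate values against the documented ranges and return them as a dict."""
    if not 1 <= max_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError("max_tokens must be between 1 and 65,536")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if top_p is not None and not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if not 1 <= num_generations <= 8:
        raise ValueError("num_generations must be between 1 and 8")
    params = {
        "maxTokens": max_tokens,        # hypothetical key name
        "temperature": temperature,     # hypothetical key name
        "numGenerations": num_generations,
    }
    if top_p is not None:
        params["topP"] = top_p          # hypothetical key name
    return params
```

For example, build_params(temperature=0.2, num_generations=3) returns a dictionary that requests three responses, and a value of 9 for num_generations raises a ValueError. Top k is omitted because it has a fixed value of 64.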