Dynamic shared quota

Dynamic shared quota distributes the available on-demand capacity among all the queries that Google Cloud services are processing. Because capacity is shared dynamically instead of being fixed per project, you don't need to submit quota increase requests (QIRs).

Supported Google model versions

The following Google model versions support dynamic shared quota:

  • Gemini 1.5 Flash (gemini-1.5-flash-002)
  • Gemini 1.5 Pro (gemini-1.5-pro-002)

Other supported models

For information about Claude models that support dynamic shared quota, see Use the Claude models from Anthropic.

Example of how dynamic shared quota works

Google Cloud looks at the available capacity in a specific region, such as North America, and then looks at how many projects are sending requests. Consider a service that can support 100 queries per minute (QPM) and two projects, A and B, that each send 25 QPM. Their combined load of 50 QPM is well within capacity. If project A increases its rate to 75 QPM, the combined load is 100 QPM, which still fits, so dynamic shared quota supports the increase. If project A increases its rate to 100 QPM, the combined load of 125 QPM exceeds capacity, so dynamic shared quota throttles project A down to 75 QPM in order to continue serving project B at 25 QPM.
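The arithmetic in this example can be reproduced with a small simulation. The sketch below is illustrative only: the source does not say which algorithm Google Cloud uses to share capacity, so the code assumes a max-min fair allocation, which happens to produce the same numbers as the example. The function name allocate and the project names are hypothetical.

```python
def allocate(capacity, demands):
    """Share `capacity` (QPM) across projects using max-min fairness.

    ASSUMPTION: max-min fairness is used here only because it reproduces
    the documented example; it is not a documented Google Cloud algorithm.
    `demands` maps a project name to its requested QPM.
    """
    allocation = {}
    unsatisfied = dict(demands)
    remaining = float(capacity)

    while unsatisfied:
        # Equal share of whatever capacity is left for the unserved projects.
        share = remaining / len(unsatisfied)
        # Projects that ask for no more than the equal share are fully served.
        fits = [name for name, qpm in unsatisfied.items() if qpm <= share]
        if not fits:
            # Every remaining project wants more than its share: throttle them all.
            for name in unsatisfied:
                allocation[name] = share
            break
        for name in fits:
            allocation[name] = float(unsatisfied.pop(name))
            remaining -= allocation[name]

    return allocation


if __name__ == "__main__":
    for a_demand in (25, 75, 100):
        print(a_demand, allocate(100, {"A": a_demand, "B": 25}))
```

With a capacity of 100 QPM, project A is served in full at 25 and 75 QPM and is capped at 75 QPM when it asks for 100, while project B keeps its 25 QPM in every case.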

To troubleshoot errors that might occur with the use of dynamic shared quota, see Troubleshoot quota errors.

Considerations

Consideration: Control cost and prevent budget overruns.
Solution: Configure a self-imposed quota called a consumer quota override. For more information, see Creating a consumer quota override.

Consideration: Prioritize traffic.
Solution: Use Provisioned Throughput.

Consideration: Monitor your usage.
Solution: View the following metrics:
  • publisher/online_serving/token_count
  • publisher/online_serving/tokens
For more information, see the aiplatform section in the Cloud Monitoring documentation.
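
One possible way to read these metrics programmatically is through the Cloud Monitoring API. The following is a minimal sketch, not an official sample for this page. It assumes that the google-cloud-monitoring Python client library is installed, that the metric paths listed above live under the aiplatform.googleapis.com prefix, and that your credentials can read monitoring data. The helper names metric_filter and list_token_series are hypothetical.

```python
import time

# ASSUMPTION: the metric paths on this page are published under this prefix.
METRIC_PREFIX = "aiplatform.googleapis.com"


def metric_filter(metric_path):
    """Build a Cloud Monitoring filter string for one metric path."""
    return f'metric.type = "{METRIC_PREFIX}/{metric_path}"'


def list_token_series(project_id, metric_path, minutes=60):
    """Return time series for `metric_path` over the last `minutes` minutes."""
    # Imported lazily so the filter helper works without the client library.
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {
            "end_time": {"seconds": now},
            "start_time": {"seconds": now - minutes * 60},
        }
    )
    return client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": metric_filter(metric_path),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )


if __name__ == "__main__":
    print(metric_filter("publisher/online_serving/token_count"))
```

To fetch data, call list_token_series with your own project ID and iterate over the result. Check the metric name against the aiplatform section of the Cloud Monitoring documentation before you rely on it.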

What's next