Dynamic shared quota distributes on-demand capacity among all queries being processed by Google Cloud services. This capability eliminates the need for you to submit quota increase requests (QIRs).
Supported Google model versions
The Google models and their versions that support dynamic shared quota are the following:
- Gemini 1.5 Flash (gemini-1.5-flash-002)
- Gemini 1.5 Pro (gemini-1.5-pro-002)
Other supported models
For information about Claude models that support dynamic shared quota, see Use the Claude models from Anthropic.
Example of how dynamic shared quota works
Google Cloud looks at the available capacity in a specific region, such as North America, and then looks at how many projects are sending requests. Consider project A, which sends 25 queries per minute (QPM), and project B, which also sends 25 QPM. Suppose the service can support 100 QPM in total. If project A increases its rate to 75 QPM, then dynamic shared quota supports the increase, because the combined load of 100 QPM still fits within capacity. If project A increases its rate to 100 QPM, the combined demand of 125 QPM exceeds capacity, so dynamic shared quota throttles project A down to 75 QPM in order to continue to serve project B at 25 QPM.
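The arithmetic in this example can be reproduced with a small toy model. The sketch below assumes a max-min fair split of capacity, which is only one allocation rule that happens to match the numbers above; this page does not describe the algorithm that Google Cloud actually uses, and the function name `allocate` is hypothetical.

```python
def allocate(capacity_qpm: float, demand_qpm: dict[str, float]) -> dict[str, float]:
    """Toy max-min fair split of capacity across projects.

    Assumption: illustrative only, not the real dynamic shared quota algorithm.
    """
    allocation = {project: 0.0 for project in demand_qpm}
    unsatisfied = dict(demand_qpm)
    remaining = float(capacity_qpm)

    while unsatisfied and remaining > 1e-9:
        # Equal share of what is left among projects still waiting.
        share = remaining / len(unsatisfied)
        # Projects that ask for no more than the equal share are served in full.
        satisfied = {p: d for p, d in unsatisfied.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than the share: throttle each to the share.
            for project in unsatisfied:
                allocation[project] = share
            break
        for project, demand in satisfied.items():
            allocation[project] = float(demand)
            remaining -= demand
            del unsatisfied[project]

    return allocation


# Project A asks for 100 QPM, project B for 25 QPM, capacity is 100 QPM.
print(allocate(100, {"A": 100, "B": 25}))  # {'A': 75.0, 'B': 25.0}
```

Under this model, project B's smaller demand is served first and project A receives whatever capacity remains, which is why A is capped at 75 QPM rather than both projects being cut proportionally.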
To troubleshoot errors that might occur with the use of dynamic shared quota, see Troubleshoot quota errors.
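When a project is throttled, some of its requests may be rejected. One common client-side response is to retry with exponential backoff and jitter. The following is a minimal sketch under stated assumptions: `QuotaExceededError` is a hypothetical stand-in for whatever throttling error your client library raises, and `send_request` is any callable that you supply. Neither name comes from a Google Cloud API.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for a throttling error raised by a client library."""


def call_with_backoff(send_request, max_attempts=5, base_delay_s=1.0,
                      max_delay_s=32.0, sleep=time.sleep):
    """Call send_request, retrying throttled calls with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Double the delay ceiling on each attempt, up to max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter: wait a random time up to the ceiling so that
            # many throttled clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so that the waiting behavior can be replaced in tests. In normal use you would call `call_with_backoff(my_request_function)` and leave the defaults.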
Considerations
Consideration | Solution |
---|---|
Control cost and prevent budget overruns. | Configure a self-imposed quota called a consumer quota override. For more information, see Creating a consumer quota override. |
Prioritize traffic. | Use Provisioned Throughput. |
Monitor your usage. | View the metrics in the aiplatform section of the Cloud Monitoring documentation. |
What's next
- To learn more about Gemini models that support dynamic shared quota, see Gemini models.
- To learn more about Generative AI quotas and limits, see Generative AI on Vertex AI rate limits.
- To learn more about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and limits, see Understand quota values and system limits.