[Feature] Add support for token-based rate-limiting in RemoteInferenceEngine #1457

@jgreer013

Description

Feature request

Add support for token-based rate-limiting for both input and output tokens when using RemoteInferenceEngine.

Motivation / references

In addition to request-based limits, many API providers also impose token-based limits on both input and output tokens:

https://console.anthropic.com/settings/limits
https://platform.openai.com/settings/organization/limits

If we don't allow users to configure these limits, their requests start failing once the provider's token quota is exhausted.

Your contribution

  1. Update remote_params.py to add input and output token limits, and replace the custom politeness policy with a "requests per minute" (RPM) setting (see the sketch after this list). The current field is:
    politeness_policy: float = 0.0
  2. Update RemoteInferenceEngine to throttle on requests per minute rather than the custom politeness policy.
  3. Update RemoteInferenceEngine to sleep whenever either the RPM or the TPM limit is reached, whichever comes first. The current behavior is a fixed delay:
    await asyncio.sleep(remote_params.politeness_policy)
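
A minimal sketch of what the new fields in remote_params.py could look like. The field names (requests_per_minute, input_tokens_per_minute, output_tokens_per_minute) are placeholders rather than a settled API, and other existing RemoteParams fields are omitted:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RemoteParams:
        # Existing fixed-delay knob; would be replaced by an RPM setting.
        politeness_policy: float = 0.0

        # Proposed limits (hypothetical names); None means "no limit".
        requests_per_minute: Optional[int] = None
        input_tokens_per_minute: Optional[int] = None
        output_tokens_per_minute: Optional[int] = None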

Most APIs report in each successful response how many input and output tokens were processed for that request, so the engine can track usage against the token limits.
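
As a rough sketch (not the oumi implementation), a sliding-window limiter could track both request and token usage over the last 60 seconds and sleep until there is room under both budgets. The class and method names below are hypothetical:

    import asyncio
    import time
    from collections import deque


    class _RateLimiter:
        """Hypothetical sliding-window limiter over a 60-second window."""

        def __init__(self, requests_per_minute: int, tokens_per_minute: int):
            self._rpm = requests_per_minute
            self._tpm = tokens_per_minute
            self._events = deque()  # (timestamp, num_tokens) per completed request

        def _prune(self, now: float) -> None:
            # Drop events that have fallen out of the 60-second window.
            while self._events and now - self._events[0][0] > 60.0:
                self._events.popleft()

        async def acquire(self) -> None:
            # Wait (asynchronously) until both the RPM and TPM budgets have room.
            while True:
                now = time.monotonic()
                self._prune(now)
                if (len(self._events) < self._rpm
                        and sum(t for _, t in self._events) < self._tpm):
                    return
                # Sleep until the oldest event leaves the window.
                await asyncio.sleep(max(0.0, 60.0 - (now - self._events[0][0])))

        def record(self, input_tokens: int, output_tokens: int) -> None:
            # Record the usage reported by the API after a successful response.
            self._events.append((time.monotonic(), input_tokens + output_tokens))

The engine would call acquire() before sending a request and record() after a successful response. For OpenAI-style responses the counts appear under usage.prompt_tokens and usage.completion_tokens; Anthropic reports usage.input_tokens and usage.output_tokens.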
