Feature request
Add support for token-based rate-limiting for both input and output tokens when using RemoteInferenceEngine.
Motivation / references
In addition to request-based limits, many API providers also enforce token-based limits for both input and output tokens:
https://console.anthropic.com/settings/limits
https://platform.openai.com/settings/organization/limits
If we don't allow users to configure these limits, their requests start failing once a token limit is exceeded.
Your contribution
- Update remote_params.py to add input and output token limits, and replace the custom politeness policy (`politeness_policy: float = 0.0`) with a "requests per minute" (RPM) limit.
- Update RemoteInferenceEngine to sleep when either the RPM or the TPM limit is reached (whichever comes first), instead of the current fixed sleep `await asyncio.sleep(remote_params.politeness_policy)`. A sketch of what this could look like follows this list.
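A minimal sketch of the idea, assuming hypothetical field names (`requests_per_minute`, `tokens_per_minute`) and a helper class (`_RateLimiter`) that are not part of the current codebase; the final names in remote_params.py may differ:

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteParams:
    # Replaces the old fixed sleep (`politeness_policy: float = 0.0`).
    requests_per_minute: Optional[int] = None  # RPM limit (None = unlimited)
    tokens_per_minute: Optional[int] = None  # combined input+output TPM limit


class _RateLimiter:
    """Sleeps when either the RPM or the TPM budget for the current
    one-minute window is exhausted, whichever comes first."""

    def __init__(self, params: RemoteParams):
        self._params = params
        self._window_start = time.monotonic()
        self._requests = 0
        self._tokens = 0

    async def acquire(self) -> None:
        now = time.monotonic()
        elapsed = now - self._window_start
        if elapsed >= 60.0:
            # A new one-minute window has started: reset both budgets.
            self._window_start = now
            self._requests = 0
            self._tokens = 0
            elapsed = 0.0
        rpm = self._params.requests_per_minute
        tpm = self._params.tokens_per_minute
        if (rpm is not None and self._requests >= rpm) or (
            tpm is not None and self._tokens >= tpm
        ):
            # Whichever limit was hit first, sleep until the window rolls over.
            await asyncio.sleep(60.0 - elapsed)
            self._window_start = time.monotonic()
            self._requests = 0
            self._tokens = 0
        self._requests += 1

    def record_tokens(self, num_tokens: int) -> None:
        """Charge the input+output tokens reported by the API response."""
        self._tokens += num_tokens
```

RemoteInferenceEngine would then call `await limiter.acquire()` before each request and `limiter.record_tokens(...)` after each successful response.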
Most APIs report in a successful response how many input and output tokens were consumed by that request, which is what the TPM accounting can be based on.
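For example, OpenAI-compatible responses include a `usage` object with `prompt_tokens` and `completion_tokens`, while Anthropic responses report `usage.input_tokens` and `usage.output_tokens`. A small sketch of extracting a total token count from either shape (the helper name is hypothetical):

```python
def extract_total_tokens(response_json: dict) -> int:
    """Return the total input+output tokens reported by the API response."""
    usage = response_json.get("usage", {})
    # OpenAI-style keys, with Anthropic-style keys as a fallback.
    prompt = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    completion = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return prompt + completion


# After each successful request:
# limiter.record_tokens(extract_total_tokens(response_json))
```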