Feature request
Add support for token-based rate-limiting for both input and output tokens when using RemoteInferenceEngine.
Motivation / references
In addition to request-based limits, many API providers also enforce token-based limits for both input and output tokens:
https://console.anthropic.com/settings/limits
https://platform.openai.com/settings/organization/limits
If we don't allow users to configure these limits, their requests start failing once a token limit is exceeded.
Your contribution
- Update remote_params.py to add input and output token limits, and replace the custom politeness policy (`politeness_policy: float = 0.0`) with a "requests per minute" (RPM) limit.
- Update RemoteInferenceEngine to sleep when either the RPM or the TPM limit is reached (whichever comes first), instead of the current fixed sleep `await asyncio.sleep(remote_params.politeness_policy)`. A sketch of what this could look like follows this list.
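A minimal sketch of the idea, assuming hypothetical field names (`requests_per_minute`, `tokens_per_minute`) and a helper class (`_RateLimiter`) that are not part of the current codebase; the final names in remote_params.py may differ:

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteParams:
    # Replaces the old fixed sleep (`politeness_policy: float = 0.0`).
    requests_per_minute: Optional[int] = None  # RPM limit (None = unlimited)
    tokens_per_minute: Optional[int] = None  # combined input+output TPM limit


class _RateLimiter:
    """Sleeps when either the RPM or the TPM budget for the current
    one-minute window is exhausted, whichever comes first."""

    def __init__(self, params: RemoteParams):
        self._params = params
        self._window_start = time.monotonic()
        self._requests = 0
        self._tokens = 0

    async def acquire(self) -> None:
        now = time.monotonic()
        elapsed = now - self._window_start
        if elapsed >= 60.0:
            # A new one-minute window has started: reset both budgets.
            self._window_start = now
            self._requests = 0
            self._tokens = 0
            elapsed = 0.0
        rpm = self._params.requests_per_minute
        tpm = self._params.tokens_per_minute
        if (rpm is not None and self._requests >= rpm) or (
            tpm is not None and self._tokens >= tpm
        ):
            # Whichever limit was hit first, sleep until the window rolls over.
            await asyncio.sleep(60.0 - elapsed)
            self._window_start = time.monotonic()
            self._requests = 0
            self._tokens = 0
        self._requests += 1

    def record_tokens(self, num_tokens: int) -> None:
        """Charge the input+output tokens reported by the API response."""
        self._tokens += num_tokens
```

RemoteInferenceEngine would then call `await limiter.acquire()` before each request and `limiter.record_tokens(...)` after each successful response.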
Most APIs report in a successful response how many input and output tokens were consumed by that request, which is what the TPM accounting can be based on.
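For example, OpenAI-compatible responses include a `usage` object with `prompt_tokens` and `completion_tokens`, while Anthropic responses report `usage.input_tokens` and `usage.output_tokens`. A small sketch of extracting a total token count from either shape (the helper name is hypothetical):

```python
def extract_total_tokens(response_json: dict) -> int:
    """Return the total input+output tokens reported by the API response."""
    usage = response_json.get("usage", {})
    # OpenAI-style keys, with Anthropic-style keys as a fallback.
    prompt = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    completion = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return prompt + completion


# After each successful request:
# limiter.record_tokens(extract_total_tokens(response_json))
```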