2023-06-09 03:08:34+00:00

When scraping data at enterprise scale, you will inevitably hit the limit of what a remote server will accept. Once you exceed its allowed request count, the target API typically returns an HTTP 429 Too Many Requests status code. Failing to handle 429 responses gracefully can lead to IP bans or incomplete data syncs.

A robust crawler must monitor response headers, respect retry windows, and throttle its own request rates dynamically.


1. Inspecting Rate-Limit Headers

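Many API gateways return headers that describe your current quota and when the window resets. Common conventions include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset, alongside the standard Retry-After header. Exact names and units vary by provider, so check the documentation of the API you are calling.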
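
As a rough sketch, quota headers can be read from a response's header mapping like this. The X-RateLimit-* names are a widespread convention rather than a standard and are an assumption here, and RateLimitInfo and read_rate_limit are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    limit: Optional[int]      # total requests allowed in the window
    remaining: Optional[int]  # requests left in the current window
    reset: Optional[int]      # reset value as sent by the server (units vary by API)


def _to_int(value: Optional[str]) -> Optional[int]:
    """Return value as an int, or None if it is missing or not a whole number."""
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def read_rate_limit(headers: Mapping[str, str]) -> RateLimitInfo:
    # Assumed header names; substitute whatever the target API documents.
    return RateLimitInfo(
        limit=_to_int(headers.get("X-RateLimit-Limit")),
        remaining=_to_int(headers.get("X-RateLimit-Remaining")),
        reset=_to_int(headers.get("X-RateLimit-Reset")),
    )
```

With the requests library, passing res.headers works directly because that mapping is case-insensitive. A plain dict needs the exact header casing shown above.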

2. Implementing a Dynamic Wait Loop

Our crawler intercepts HTTP responses. If a 429 is encountered, it parses the Retry-After header and sleeps for that duration before retrying. If no header is present, we fall back to an exponential backoff loop:

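# Retrying HTTP 429 requests in Python
import time
import requests

def fetch_with_rate_limit_handling(url, params=None, max_retries=5, base_delay=1.0):
    """GET url as JSON, waiting and retrying on HTTP 429 up to max_retries times."""
    for attempt in range(max_retries + 1):
        res = requests.get(url, params=params, timeout=30)
        if res.status_code != 429:
            res.raise_for_status()  # surface any other 4xx/5xx error
            return res.json()
        if attempt == max_retries:
            break  # out of retries; do not sleep again
        retry_after = res.headers.get("Retry-After", "").strip()
        if retry_after.isdigit():
            delay = int(retry_after)  # server-specified wait in seconds
        else:
            delay = base_delay * 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ...
        print(f"Rate limited. Waiting for {delay}s...")
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")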
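
The Retry-After header may carry either a whole number of seconds or an HTTP-date, so a parser that accepts both forms is worth having. The helper below is a minimal sketch using only the standard library; parse_retry_after and its now parameter are hypothetical names introduced for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds, or None if the header is missing or unparseable."""
    if value is None:
        return None
    value = value.strip()
    # Form 1: delay in seconds, e.g. "120".
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A caller can treat a None result as "no usable hint from the server" and fall back to its own backoff schedule.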

Self-throttling helps our crawling pipeline run continuously without tripping security controls on supplier servers.
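
One simple way to self-throttle is to enforce a minimum gap between outgoing requests. The class below is a sketch of that idea, not the pipeline's actual limiter; MinIntervalThrottle and its parameters are hypothetical names, and the injectable clock and sleep arguments exist only to make it easy to test.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # may be raised at runtime, e.g. after a 429
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling throttle.wait() immediately before each request caps the rate at roughly one request per min_interval seconds. Increasing min_interval after a 429 and lowering it again after a run of successes would make the throttle adapt to the server's limits.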