Building High-Speed Asynchronous Crawlers in Python Using asyncio and requests

2023-04-26 20:17:08+00:00

Traditional web scraping scripts process requests sequentially: they send an HTTP request, block until the response arrives, parse the data, and only then move on to the next URL. Total runtime is therefore the sum of every request's latency, which becomes painfully slow once you are querying 10 or more external APIs. Asynchronous I/O lets you keep hundreds of requests in flight at once, so the time spent waiting on one response is used to send and receive others.

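Using Python's asyncio together with an async HTTP client such as aiohttp, or with requests offloaded to a thread pool, we can rewrite synchronous requests to run concurrently. Throughput then scales roughly with the number of requests in flight, up to the limits of your network and the target servers.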
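To make the difference concrete, here is a minimal sketch that compares the two flows without touching the network. The fake_request coroutine, the example.com URLs, and the 0.1 second latency are assumptions for illustration; asyncio.sleep stands in for the time a real request would spend waiting on a server.

```python
import asyncio
import time


async def fake_request(url: str, latency: float = 0.1) -> str:
    # Stand-in for a network call: asyncio.sleep yields control to the event loop
    await asyncio.sleep(latency)
    return f"response from {url}"


async def crawl_sequential(urls):
    # Await each request to completion before starting the next one
    return [await fake_request(u) for u in urls]


async def crawl_concurrent(urls):
    # Start every request, then wait for all of them together
    return await asyncio.gather(*(fake_request(u) for u in urls))


def timed(coro_fn, urls):
    # Run one crawl in a fresh event loop and measure its wall-clock time
    start = time.perf_counter()
    results = asyncio.run(coro_fn(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    _, seq = timed(crawl_sequential, urls)
    _, conc = timed(crawl_concurrent, urls)
    print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

The sequential version pays the latency once per URL, while the concurrent version pays it roughly once in total, because all the waits overlap.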

1. Replacing Blocking HTTP Calls

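The standard requests library is synchronous: requests.get blocks the calling thread until the response arrives. It does release the GIL while waiting on the socket, but if you call it directly inside a coroutine it stalls the event loop, and no other task can make progress. To write an async crawler, we either replace requests.get with an asynchronous client such as aiohttp, or keep requests and run the blocking call in an executor: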

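import asyncio
import requests

async def fetch_provider_sync_wrapper(url, params):
    loop = asyncio.get_running_loop()
    # Run the blocking requests.get call in the default thread pool executor.
    # requests has no default timeout; the 10 seconds here is an illustrative value.
    response = await loop.run_in_executor(
        None, lambda: requests.get(url, params=params, timeout=10)
    )
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
    return response.json()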
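On Python 3.9 and later, asyncio.to_thread is a shorter way to express the same offloading. The sketch below is an illustration rather than a drop-in replacement: blocking_fetch is a hypothetical stand-in that sleeps instead of calling requests.get, so the example runs without network access.

```python
import asyncio
import time


def blocking_fetch(url: str) -> dict:
    # Hypothetical stand-in for requests.get(url).json(); time.sleep blocks the thread
    time.sleep(0.1)
    return {"url": url, "status": 200}


async def fetch_all(urls):
    # asyncio.to_thread runs each blocking call in the default thread pool
    calls = [asyncio.to_thread(blocking_fetch, u) for u in urls]
    # gather returns results in the same order as the inputs
    return await asyncio.gather(*calls)


if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
    print(pages)
```

Because each call occupies a worker thread, the size of the default thread pool bounds how many blocking requests actually run at once.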

2. Implementing Concurrency Limits

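Launching 1,000 concurrent network tasks can exhaust local sockets and file descriptors, overwhelm the target servers, or get your IP banned. We control concurrency with asyncio.Semaphore, which caps the number of simultaneous requests. The semaphore must be created once and shared by every task; a semaphore created inside each task limits nothing. Note that with run_in_executor, the default thread pool size also caps how many blocking calls run at once.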

async def fetch_limited(sem, url, params):
    # sem is a single asyncio.Semaphore(10) shared by all tasks: max 10 concurrent requests
    async with sem:
        return await fetch_provider_sync_wrapper(url, params)
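A complete, runnable sketch of the pattern follows. The crawl function, the limit of 3, and the asyncio.sleep placeholder for the real request are assumptions for illustration; the in_flight and peak counters exist only to show that the cap holds.

```python
import asyncio


async def crawl(urls, limit=3):
    sem = asyncio.Semaphore(limit)  # Created once, shared by every task
    in_flight = 0  # Requests currently inside the semaphore
    peak = 0       # Highest value in_flight ever reached

    async def fetch(url):
        nonlocal in_flight, peak
        async with sem:  # Waits here while `limit` requests are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # Stand-in for the real request
            in_flight -= 1
            return url.upper()

    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak


if __name__ == "__main__":
    urls = [f"https://example.com/item/{i}" for i in range(10)]
    results, peak = asyncio.run(crawl(urls))
    print(f"fetched {len(results)} URLs, peak concurrency {peak}")
```

All ten tasks are scheduled immediately, but only as many as the limit allows are past the async with line at any moment; the rest wait their turn without blocking the event loop.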

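For I/O-bound crawls, adopting async patterns can cut cycle times from hours to minutes. The gain comes from overlapping network waits rather than from extra CPU work: the program stops idling while responses are in transit.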