2023-01-30 06:34:17+00:00

When building a web crawler that queries third-party APIs (such as Mouser, DigiKey, and Avnet) for pricing and stock information, you face a major challenge: external network requests are slow and prone to timeouts. If your system makes these API calls synchronously while handling a user request, the client connection will frequently time out.

A resilient design delegates these crawling tasks to an asynchronous pipeline backed by an AWS SQS queue, allowing workers to fetch data in parallel.


1. SQS-Triggered Crawling

When a customer searches for a part number, the system returns cached data immediately and sends a crawl request message to SQS. The queue triggers an AWS Lambda function, which reads the message and runs the provider API requests concurrently using Python's asyncio or Go's goroutines. Because the crawl runs outside the request path, the user-facing response stays fast even if one provider's API takes 5 seconds to respond.
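
The request-path half of this flow can be sketched as follows. This is a minimal illustration rather than the actual implementation: the function name `handle_search`, the dict-like `cache`, and the `{"part_number": ...}` message shape are assumptions made for the example. The only AWS call used is the SQS client's `send_message`, and the client is passed in so the function can be exercised without AWS credentials.

```python
import json


def handle_search(part_number, cache, sqs_client, queue_url):
    """Return cached data right away and enqueue a background crawl.

    `cache` is any dict-like object (hypothetical); `sqs_client` is expected
    to behave like boto3.client("sqs").
    """
    # Serve whatever is already cached; this is None on a cache miss.
    cached = cache.get(part_number)

    # Ask the crawler pipeline to refresh the data. The message body format
    # is an assumption for this sketch.
    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"part_number": part_number}),
    )
    return cached
```

In a real deployment `sqs_client` would typically be created once with `boto3.client("sqs")` and reused across requests. The user-facing handler never waits on a provider API, only on the cache lookup and the enqueue call.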

2. Mitigating Rate Limits with Batching

Lambda receives SQS messages in batches, so we must carefully choose the batch size (the BatchSize parameter of the event source mapping), which determines how many crawl requests a single invocation handles. To avoid hitting the rate limits of external APIs, we also have to bound how many invocations run at once: if 100 crawler instances start simultaneously, a provider might block our IP. We therefore set a reserved concurrency limit on the Lambda function that consumes the queue, which caps the maximum number of concurrent crawlers.
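
As a rough sketch of how these two knobs might be applied with boto3's Lambda client: `put_function_concurrency` sets reserved concurrency on the function, and `create_event_source_mapping` sets `BatchSize` for the queue trigger. The helper name `apply_crawler_throttle` and the default values are illustrative assumptions, not recommended settings. The `ScalingConfig` maximum concurrency on the mapping is an optional extra that the text above does not mention; it is included here only as one possible way to keep the SQS trigger from scaling past the function's cap.

```python
def apply_crawler_throttle(lambda_client, function_name, queue_arn,
                           batch_size=5, max_concurrency=2):
    """Configure throttling and return the worst-case crawl fan-out.

    `lambda_client` is expected to behave like boto3.client("lambda").
    The default numbers are placeholders for illustration.
    """
    # Hard cap on how many copies of the crawler function run at once.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrency,
    )

    # Connect the queue to the function and bound messages per invocation.
    lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=batch_size,
        # Optional (assumption, not from the text): limit how far the SQS
        # trigger itself scales out.
        ScalingConfig={"MaximumConcurrency": max_concurrency},
    )

    # If each invocation crawls every message in its batch concurrently,
    # at most this many part numbers hit a given provider at the same time.
    return batch_size * max_concurrency
```

With the placeholder defaults, the upper bound is 5 × 2 = 10 part numbers in flight per provider, which is the figure to compare against each provider's published rate limit.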

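# Example of concurrent crawler tasks in Python asyncio
import asyncio

# crawl_digikey, crawl_mouser, and crawl_arrow are assumed to be coroutine
# functions defined elsewhere, one per provider.
async def fetch_all_providers(part_number):
    tasks = [
        crawl_digikey(part_number),
        crawl_mouser(part_number),
        crawl_arrow(part_number),
    ]
    # return_exceptions=True puts a failed provider's exception in its result
    # slot instead of raising, so one failure does not discard the others.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results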
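
Because `return_exceptions=True` mixes successful results and exception objects in one list, the caller has to separate them before storing anything. The sketch below shows one way to do that; `fetch_and_split`, `crawl_ok`, and `crawl_broken` are hypothetical names, and the two stub crawlers stand in for real provider clients. It relies on `asyncio.gather` returning results in the same order as its inputs.

```python
import asyncio


async def crawl_ok(part_number):
    # Stub provider (hypothetical) that answers successfully.
    await asyncio.sleep(0)
    return {"part": part_number, "stock": 10}


async def crawl_broken(part_number):
    # Stub provider (hypothetical) that always fails.
    raise TimeoutError("provider did not respond")


async def fetch_and_split(part_number, crawlers):
    """Run every crawler concurrently and separate successes from failures.

    `crawlers` maps a provider name to a coroutine function.
    """
    names = list(crawlers)
    results = await asyncio.gather(
        *(crawlers[name](part_number) for name in names),
        return_exceptions=True,
    )

    ok, failed = {}, {}
    # gather preserves input order, so names and results line up.
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed[name] = repr(result)
        else:
            ok[name] = result
    return ok, failed
```

A caller could run this with `asyncio.run(fetch_and_split("LM358", {"good": crawl_ok, "bad": crawl_broken}))`, then write the `ok` entries to the cache and log or retry the `failed` ones.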

Using this asynchronous pipeline, we achieve robust data ingestion while maintaining a stable, rate-limit-compliant relationship with external providers.