DynamoDB Parallel Scans: How to Scan Millions of Records Concurrently

2023-03-14 13:25:42+00:00

DynamoDB is designed for single-digit-millisecond key-value lookups. Sometimes, though, you need to sweep an entire table: audits, migrations, or table-wide integrity checks. A single-threaded Scan reads the table sequentially, one page at a time (each page returns at most 1 MB of data), so processing millions of records this way can take hours.

To speed this up, DynamoDB supports Parallel Scans, which divide the table into independent segments that multiple workers (threads, processes, or separate hosts) process concurrently.

1. How Parallel Scans Work

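When executing a parallel scan, every Scan request carries two extra parameters: TotalSegments (the number of segments the table is divided into, normally equal to the number of workers) and Segment (a zero-based index from 0 to TotalSegments - 1 that identifies which segment this worker reads). DynamoDB divides the table's key space into that many logical segments, and no item belongs to more than one segment, so the workers never overlap. All workers must use the same TotalSegments value. Each worker can then scan its assigned segment independently: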

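# Executing a parallel scan segment in Python using Boto3
import boto3

dynamodb = boto3.client('dynamodb')

def process_item(item):
    # Hypothetical placeholder: replace with your own per-item logic.
    # Items arrive in DynamoDB's typed format, e.g. {'pk': {'S': 'user#1'}}.
    print(item)

def scan_segment(table_name, total_segments, segment_index):
    """Scan one segment of the table, following pagination to the end."""
    # The paginator re-issues Scan with ExclusiveStartKey until the
    # segment is exhausted, so no manual LastEvaluatedKey handling is needed.
    paginator = dynamodb.get_paginator('scan')
    page_iterator = paginator.paginate(
        TableName=table_name,
        TotalSegments=total_segments,
        Segment=segment_index,
        # PageSize caps the number of items evaluated per request.
        PaginationConfig={'PageSize': 100}
    )

    for page in page_iterator:
        for item in page['Items']:
            process_item(item)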
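
The function above handles a single segment; something still has to start one worker per segment. The following is a minimal sketch of that driver, not a definitive implementation. It assumes a thread pool is an acceptable way to run the workers, and the names parallel_scan, scan_segment_items, and handle_item are hypothetical helpers introduced for illustration. The client is passed in as an argument (for example the object returned by boto3.client('dynamodb')) so the sketch can be exercised without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment_items(client, table_name, total_segments, segment_index):
    """Yield every item in one segment, following LastEvaluatedKey manually."""
    kwargs = {
        'TableName': table_name,
        'TotalSegments': total_segments,
        'Segment': segment_index,
    }
    while True:
        page = client.scan(**kwargs)
        yield from page.get('Items', [])
        last_key = page.get('LastEvaluatedKey')
        if not last_key:
            break  # no more pages in this segment
        kwargs['ExclusiveStartKey'] = last_key


def parallel_scan(client, table_name, total_segments, handle_item):
    """Run one worker thread per segment; return the number of items handled.

    handle_item is a caller-supplied callback (hypothetical name) and must be
    safe to call from several threads at once.
    """
    def worker(segment_index):
        count = 0
        for item in scan_segment_items(
                client, table_name, total_segments, segment_index):
            handle_item(item)
            count += 1
        return count

    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        return sum(pool.map(worker, range(total_segments)))
```

A call might look like parallel_scan(boto3.client('dynamodb'), 'my-table', 8, process_item), where 'my-table' is a placeholder table name. Threads suit this workload because each worker spends most of its time waiting on the network; if per-item processing is CPU-heavy, separate processes or hosts are likely a better fit.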

2. Managing Provisioned Capacity

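Parallel scans consume Read Capacity Units (RCUs) very quickly, because every segment reads as fast as it can and a Scan is charged for the data it reads, not the data it returns. If you run 20 concurrent scan segments against a table with provisioned capacity, you can easily exhaust that capacity and throttle user traffic. It is usually safer to run parallel scans on tables set to On-Demand Capacity, or to temporarily raise the provisioned RCUs before starting the scan and lower them again afterwards.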
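
If changing the table's capacity settings is not an option, another approach is to slow each worker down on the client side. The sketch below is one possible way to do that, under stated assumptions: it asks DynamoDB to report usage with ReturnConsumedCapacity='TOTAL', then sleeps whenever the worker is ahead of a per-segment budget. The function name throttled_scan_segment, the max_rcu_per_second parameter, and the injectable sleep and clock arguments are all hypothetical and exist only for illustration.

```python
import time


def throttled_scan_segment(client, table_name, total_segments, segment_index,
                           max_rcu_per_second, handle_item,
                           sleep=time.sleep, clock=time.monotonic):
    """Scan one segment while keeping average RCU usage under a budget.

    Returns the total capacity units the segment consumed.
    """
    kwargs = {
        'TableName': table_name,
        'TotalSegments': total_segments,
        'Segment': segment_index,
        # Ask DynamoDB to report the capacity each request consumed.
        'ReturnConsumedCapacity': 'TOTAL',
    }
    consumed_total = 0.0
    started = clock()
    while True:
        page = client.scan(**kwargs)
        for item in page.get('Items', []):
            handle_item(item)

        consumed = page.get('ConsumedCapacity', {}).get('CapacityUnits', 0.0)
        consumed_total += consumed

        # Earliest moment at which the capacity used so far fits the budget.
        earliest = started + consumed_total / max_rcu_per_second
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)

        last_key = page.get('LastEvaluatedKey')
        if not last_key:
            return consumed_total
        kwargs['ExclusiveStartKey'] = last_key
```

With 20 segments and a table-wide allowance of, say, 1,000 RCUs for the scan, each worker would be given max_rcu_per_second=50. This limits only the average rate; a single large page can still cause a short burst, so leaving headroom for user traffic remains a good idea.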