Reindexing Millions of Legacy Entities: Batch Migration Routines on Cloud Datastore

2020-06-23 05:25:42+00:00

When you change a Cloud Datastore model so that a previously unindexed property becomes indexed, Datastore does not index existing entities retroactively. The new index setting applies only to entities created or modified *after* the change was deployed. To apply it to legacy data, you must rewrite every existing entity of that kind. If you try to rewrite a large dataset in a single synchronous request, the script will hit execution timeouts.

By writing a cursor-based batch migrator in Python, we can safely reindex millions of entities in the background.

1. Implementing the Cursor-Based Batch Reindexer

We write a background migration script that queries entities in batches of 1,000 using database cursors to track progress:

# batch_reindexer.py
from google.cloud import ndb
import time

class LegacyData(ndb.Model):
    old_field = ndb.StringProperty()
    # Newly indexed field
    category = ndb.StringProperty(indexed=True)

def reindex_batch(cursor=None, batch_size=1000):
    """Rewrite every LegacyData entity, one page at a time, starting at cursor."""
    client = ndb.Client()

    # Open a single context for the whole run. ndb raises an error if a second
    # context is opened on the same thread, so we loop instead of recursing.
    with client.context():
        query = LegacyData.query()

        while True:
            # Fetch the next page, starting where the previous one ended
            entities, next_cursor, more = query.fetch_page(batch_size, start_cursor=cursor)

            # Rewrite each entity unchanged to force index updates
            for entity in entities:
                entity.put()

            if entities:
                print(f"Reindexed {len(entities)} records.")

            if not (more and next_cursor):
                break

            cursor = next_cursor
            # Sleep briefly to avoid database write rate limits
            time.sleep(0.5)

        print("Reindexing complete!")
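
To get a feel for why this has to run in the background, the following sketch estimates the number of batches and the time spent in pauses alone, using the batch size of 1,000 and the 0.5 second sleep from the script. The helper names and the figure of five million entities are assumptions for illustration; actual run time also includes the reads and writes themselves, which are not modeled here.

```python
import math


def batches_needed(total_entities, batch_size=1000):
    """Number of pages required to cover every entity."""
    return math.ceil(total_entities / batch_size)


def minimum_pause_seconds(total_entities, batch_size=1000, pause_seconds=0.5):
    """Time spent sleeping between batches (no pause after the last one)."""
    batches = batches_needed(total_entities, batch_size)
    return max(batches - 1, 0) * pause_seconds


if __name__ == "__main__":
    total = 5_000_000  # hypothetical dataset size
    print(batches_needed(total))         # 5000
    print(minimum_pause_seconds(total))  # 2499.5
```

Even before counting any Datastore latency, a dataset of that size spends roughly 40 minutes in pauses, far longer than a typical request deadline.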

2. Safe Migration Loops

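Because each page returns a cursor marking where the query left off, the migration can be paused and resumed from that point, provided the cursor is saved between runs. Each run can then stay within execution time limits, and the short pause between batches keeps the migration from crowding out regular write traffic.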
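
Resuming only works if the last cursor survives between runs. The sketch below shows one possible way to checkpoint it to a local JSON file. It is deliberately independent of ndb so it can be run as is: `run_resumable`, `load_state`, `save_state`, and the checkpoint file layout are hypothetical names, and `make_fake_pager` is a stand-in for a function that would wrap `query.fetch_page` and pass the cursor around in serialized string form (for example via the cursor's `urlsafe()` method).

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved checkpoint, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"cursor": None, "done": False}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path, state):
    """Write the checkpoint atomically so a crash cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def run_resumable(fetch_page, rewrite, checkpoint_path, max_batches=None):
    """Process pages from the last checkpoint; return how many items were rewritten.

    fetch_page(cursor) must return (items, next_cursor, more), where the
    cursors are strings (or None for the first page).
    """
    state = load_state(checkpoint_path)
    if state["done"]:
        return 0

    cursor = state["cursor"]
    processed = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        items, next_cursor, more = fetch_page(cursor)
        for item in items:
            rewrite(item)
        processed += len(items)
        batches += 1

        if not (more and next_cursor):
            save_state(checkpoint_path, {"cursor": None, "done": True})
            break

        # Only advance the checkpoint after the whole page was rewritten
        cursor = next_cursor
        save_state(checkpoint_path, {"cursor": cursor, "done": False})
    return processed


def make_fake_pager(data, page_size):
    """In-memory stand-in for a Datastore query; the cursor is a list offset."""
    def fetch_page(cursor):
        start = int(cursor) if cursor else 0
        items = data[start:start + page_size]
        end = start + len(items)
        return items, str(end), end < len(data)
    return fetch_page
```

Because the checkpoint is written only after a full page has been rewritten, a crash mid-page causes that page to be rewritten again on the next run. That is harmless here, since rewriting an unchanged entity is idempotent.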