The Cost of Naive Document Pipelines
Google Cloud Document AI bills per page, and the rate depends on the processor you use. This article uses $0.03 per page as a working figure for invoice processing; check the current pricing for your own processor. The per-page rate looks small, but it adds up quickly once you process thousands of documents a month. A common mistake is to process every document that lands in the input bucket and only then filter out the irrelevant ones in post-processing code.
For example, suppose your pipeline needs to exclude certain invoices (e.g., sample documents or specific vendor types whose filenames are prefixed with "EX"). If that check runs after the Document AI API call, you still pay for every page processed. If 20% of your documents are irrelevant and page counts are similar across documents, roughly 20% of your Document AI spend buys output you immediately discard.
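To put a number on that waste, here is a minimal sketch of the arithmetic. The monthly volume is a hypothetical input, the rate is the $0.03 per page figure quoted above, and the calculation assumes every document has a similar page count, so 20% of documents means 20% of pages.

```shell
# Illustrative inputs only; substitute your own volume and rate.
pages_per_month=10000     # hypothetical monthly page volume
irrelevant_percent=20     # share of pages the pipeline later discards
price_per_page=0.03       # per-page rate used in this article

# wasted_cost PAGES PERCENT RATE -> dollars spent on pages that get discarded
wasted_cost() {
  awk -v p="$1" -v pct="$2" -v rate="$3" \
    'BEGIN { printf "%.2f\n", p * pct / 100 * rate }'
}

wasted_cost "$pages_per_month" "$irrelevant_percent" "$price_per_page"
# prints 60.00
```

At these assumed numbers, $60 of a $300 monthly bill pays for pages that are thrown away.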
The Solution: Move Filtering to the Edge
To cut costs, filtering must happen before files reach Document AI, and ideally before they reach GCS at all. Depending on your architecture, there are two easy ways to filter ahead of processing.
Option 1: Excluding Files During Sync (using Rclone)
If you sync your files from a local machine or shared cloud drive to GCS using rclone, you can apply exclusion patterns directly during the sync command:
# Skip invoices whose filename starts with "EX" (add --dry-run to preview first).
# A pattern like "*EX*.pdf" would also skip any name that merely contains "EX".
rclone copy gdrive:Invoices gcs:my-bucket/input/ --exclude "EX*.pdf"
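When the exclusion rules grow beyond one pattern, it can help to keep them in a file and preview the effect before copying anything. The sketch below uses rclone's `--exclude-from` and `--dry-run` flags. The file name `exclude-patterns.txt`, the `upload_invoices` wrapper, and the `sample-*.pdf` rule are assumptions made for illustration; the remotes are the ones from the command above.

```shell
# Assumed rules file: one rclone filter pattern per line, # starts a comment.
cat > exclude-patterns.txt <<'EOF'
# Vendor invoices whose filename starts with EX
EX*.pdf
# Hypothetical naming scheme for sample documents
sample-*.pdf
EOF

# Hypothetical wrapper: extra arguments are passed through to rclone,
# so `upload_invoices --dry-run` previews and `upload_invoices` copies.
upload_invoices() {
  rclone copy gdrive:Invoices gcs:my-bucket/input/ \
    --exclude-from exclude-patterns.txt "$@"
}

# Preview first, then run for real:
# upload_invoices --dry-run
# upload_invoices
```

Keeping the patterns in a file means the rules can be reviewed and versioned separately from the sync command.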
Option 2: Deleting Files Post-Upload, Pre-Processing
If files are uploaded directly to GCS by an automated script or a third party, run a cleanup step against the bucket before calling the Document AI API:
# Remove excluded PDFs (filenames starting with "EX") before starting the batch run.
# Quote the URL so gsutil, not your shell, expands the wildcard.
gsutil rm "gs://my-bucket/input/EX*.pdf"
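In an automated pipeline, the cleanup step is often wrapped so that it logs what it deletes and does not fail on days when nothing matches. gsutil typically exits with an error when a wildcard matches no objects, which can abort a script. The `purge_excluded` helper below is a hypothetical sketch of that guard, not part of gsutil.

```shell
# Hypothetical helper: preview matches, then delete only if something matched.
purge_excluded() {
  pattern="$1"
  # gsutil ls fails when nothing matches; treat that as an empty result.
  matches=$(gsutil ls "$pattern" 2>/dev/null) || matches=""
  if [ -z "$matches" ]; then
    echo "nothing to remove"
    return 0
  fi
  printf '%s\n' "$matches"      # log what is about to be deleted
  gsutil -m rm "$pattern"       # -m runs the deletes in parallel
}

# Example call (bucket and prefix taken from the command above):
# purge_excluded "gs://my-bucket/input/EX*.pdf"
```

Because deletion is irreversible unless the bucket has versioning enabled, logging the matched objects first gives you a record of what was skipped.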
Comparing the Approaches
When deciding whether to filter before or after the Document AI call, keep this guideline in mind:
- Post-processing filter: Best when you need to audit every file, or you don't know which files to exclude until you parse their internal text contents.
- Pre-upload filter: Best when you can identify files by filename, file size, or metadata. It is the most cost-effective method, because excluded files never reach Document AI and never incur a per-page charge.
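A filename and file-size check of the kind described above can be sketched as a small shell function that runs on the machine holding the files. The `should_upload` name and the 20 MB ceiling are assumptions for illustration; the "EX" prefix is the one from the earlier example.

```shell
# Hypothetical size ceiling; adjust to whatever your pipeline accepts.
MAX_BYTES=$((20 * 1024 * 1024))

# should_upload FILE -> exit status 0 to upload, 1 to skip
should_upload() {
  name=$(basename "$1")
  case "$name" in
    EX*)   return 1 ;;   # excluded vendor prefix
    *.pdf) ;;            # PDFs continue to the size check
    *)     return 1 ;;   # anything else is skipped
  esac
  # Arithmetic expansion strips any padding wc adds around the number.
  size=$(($(wc -c < "$1")))
  [ "$size" -gt 0 ] && [ "$size" -le "$MAX_BYTES" ]
}

# Example: stage only the files that pass the check.
# for f in ./invoices/*; do should_upload "$f" && cp "$f" ./staging/; done
```

Every file that fails the check is never uploaded and never billed.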
Moving your filters to the edge of your cloud pipeline is a simple shift, and the savings scale with the volume you skip. For example, at $0.03 per page, filtering out 20% of 50,000 pages a month avoids $300 in Document AI charges.