The Reality of Out-of-the-Box Document AI Models
Google Cloud Document AI is a powerful tool for converting unstructured documents into structured JSON. However, when using a pre-trained processor (like the default Invoice Parser) on varying layouts, extraction rates can be disappointingly low. During a recent project, we observed that out-of-the-box accuracy for extracting custom product codes and invoice dates was around 16%.
While the long-term solution is to train a custom processor using labeled documents, that takes hours of manual work. In the short term, you need a solution that bridges the gap. The answer lies in combining structured extraction with rule-based OCR fallbacks.
Implementing a Text Fallback Strategy
When Document AI processes a document, it returns both structured entities and the raw, flat text of the document. If the structured processor fails to locate a field, we can fall back to searching the raw OCR text using regular expressions.
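The flow above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. The helpers `collect_entities` and `extract_with_fallback` are hypothetical names, and the document is a plain stand-in object. Its attribute names (`text`, `entities`, `type_`, `mention_text`, `confidence`) are assumed to mirror those of the Document AI Python client.

```python
import re
from types import SimpleNamespace


def collect_entities(document, min_confidence=0.5):
    """Group entity mention texts by entity type, skipping weak or empty ones."""
    found = {}
    for entity in document.entities:
        value = (entity.mention_text or "").strip()
        if value and entity.confidence >= min_confidence:
            found.setdefault(entity.type_, []).append(value)
    return found


def extract_with_fallback(document, field, fallback):
    """Prefer structured entities; otherwise run `fallback` on the raw OCR text."""
    structured = collect_entities(document).get(field)
    if structured:
        return structured, "structured"
    return fallback(document.text), "text_fallback"


# Stand-in for a processed document (hypothetical values, not real API output).
doc = SimpleNamespace(
    text="ACME S.L.\nFECHA: 15/01/2024\nTOTAL 120.00",
    entities=[
        SimpleNamespace(type_="total_amount", mention_text="120.00", confidence=0.92)
    ],
)

# No invoice_date entity exists, so the regex fallback runs on the raw text.
dates, source = extract_with_fallback(
    doc, "invoice_date", lambda text: re.findall(r"\b\d{2}/\d{2}/\d{4}\b", text)
)
print(dates, source)
```

Returning the source label alongside the value makes it easy to tell later which records came from the structured parser and which came from the text fallback.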
Extracting Alphanumeric Codes via Regex
For example, if your business requires extracting product references that start with "7" followed by at least five more digits, you can parse the raw text as a safety net:
import re

def extract_from_text_fallback(text: str, already_extracted: set, prefix: str = "7") -> list:
    """Extract product codes from raw text when Document AI misses them."""
    # Match the prefix followed by 5 or more digits, bounded by word boundaries
    pattern = rf'\b{re.escape(prefix)}\d{{5,}}\b'
    matches = re.findall(pattern, text)
    # Drop codes the structured parser already returned; sort for stable output
    unique_codes = set(matches) - already_extracted
    return sorted(unique_codes)
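To see what the word boundaries in that pattern accept and reject, here is a small check on made-up text. The sample string and the `already_extracted` set are hypothetical, chosen only to exercise the edge cases.

```python
import re

prefix = "7"  # same prefix convention as the product codes described above
pattern = rf"\b{re.escape(prefix)}\d{{5,}}\b"

# Hypothetical OCR text: one valid code, one too short, one after a hyphen,
# and two glued to letters (which the word boundaries should reject).
sample = "Ref 712345 qty 2, ref 7123, SKU-7654321, AB712345, 712345X"

matches = re.findall(pattern, sample)
print(matches)  # ['712345', '7654321']

# Codes the structured parser already returned are filtered out.
already_extracted = {"712345"}
remaining = sorted(set(matches) - already_extracted)
print(remaining)  # ['7654321']
```

The hyphen in "SKU-7654321" counts as a boundary, so that code is picked up. If such prefixed tokens should be excluded in your data, the pattern would need a stricter lookbehind.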
Multi-Level Date Extraction Fallbacks
Extracting the invoice date is another common pain point. One approach is a multi-level fallback chain, where each level is tried only if the previous one found nothing:
- Accept Low-Confidence Entities: Lower the default confidence threshold for the invoice_date entity.
- Keyword Context Matching: Scan the text around keywords like "FECHA", "DATE", or "INVOICE DATE" using positional regex.
- Positional Heuristics: Use the first date pattern found in the first 500 characters of the document (often where headers reside).
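The three levels can be chained as in the sketch below. It is an illustration under stated assumptions rather than the exact code from the project. The function name `extract_invoice_date`, the 0.3 threshold, the 40-character keyword window, and the date shapes in `DATE_RE` are all hypothetical choices. Entities are stand-in objects whose attribute names (`type_`, `mention_text`, `confidence`) are assumed to mirror the Document AI Python client.

```python
import re
from types import SimpleNamespace

# Assumed date shapes: 31/12/2024, 31-12-24, 31.12.2024, or ISO 2024-12-31.
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b")
# Keywords from the list above; the longer phrase is listed first so it wins.
KEYWORD_RE = re.compile(r"\b(?:INVOICE\s+DATE|FECHA|DATE)\b", re.IGNORECASE)


def extract_invoice_date(entities, text, min_confidence=0.3, window=40, header_chars=500):
    """Return (date_string, level), or (None, None) if every level fails."""
    # Level 1: accept an invoice_date entity even at low confidence.
    candidates = [
        e for e in entities
        if e.type_ == "invoice_date"
        and e.confidence >= min_confidence
        and e.mention_text.strip()
    ]
    if candidates:
        best = max(candidates, key=lambda e: e.confidence)
        return best.mention_text.strip(), "entity"

    # Level 2: look for a date within `window` characters after a keyword.
    for keyword in KEYWORD_RE.finditer(text):
        nearby = text[keyword.end():keyword.end() + window]
        match = DATE_RE.search(nearby)
        if match:
            return match.group(0), "keyword"

    # Level 3: first date-like pattern in the header region of the text.
    match = DATE_RE.search(text[:header_chars])
    if match:
        return match.group(0), "position"
    return None, None


# Hypothetical inputs, one per level.
weak = [SimpleNamespace(type_="invoice_date", mention_text="12/03/2024", confidence=0.35)]
print(extract_invoice_date(weak, ""))
print(extract_invoice_date([], "Factura 001\nFECHA: 05-04-2024\nTotal"))
print(extract_invoice_date([], "ACME Corp 2024-01-15 Invoice 99"))
```

The returned strings are not normalized. Day-first versus month-first ambiguity (for example 05-04-2024) still has to be resolved by whatever date parser you apply afterwards.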
Results & Performance Impact
By implementing these Python post-processing fallbacks, the extraction rate of product codes in our test set went from 60% to 100%, and the extraction rate for invoice dates increased by 50%. Although these fallback records are sparse (containing only the code or date, without line-item table context), they gave the client the essential data points they needed immediately.