Automated Meta Scraping: Harvesting OpenGraph, Twitter, and Sailthru Headers

When cataloging online assets, speed and precision are paramount. Curation workflows that require users to manually copy and paste page titles, publication dates, and author names are slow and error-prone. Yet because every website structures its page body differently, a single scraper that parses the visible content of every site is close to impossible to build.

The solution is standardized header harvesting. By programmatically reading the metadata tags that sites already publish for social previews and SEO (OpenGraph, Twitter Cards, and Sailthru), we can harvest clean metadata across 95% of major sites with a single script.

The Ingestion Strategy

Rather than writing a custom parser for every website, we look for standardized metadata schemas embedded in the HTML head. We read the following tag families, choosing the exact priority order per field:

OpenGraph Tags (og:*): Supported by almost all modern websites for social media preview cards. Declared with the property attribute.
Twitter Card Tags (twitter:*): Supported by news sites and media platforms. Declared with the name attribute, although some sites use property instead.
Sailthru Tags (sailthru.*): Common among news portals and corporate publishers that use Sailthru. Declared with the name attribute and a dot separator, for example sailthru.title.
HTML Fallbacks: Standard sources such as document.title and the meta author tag as a clean backup.
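
The priority-ordered lookup described above can be sketched as one generic helper. This is a minimal illustration, not the article's own code: the name firstMetaContent, the optional doc parameter (added so the helper can be exercised outside a browser), and the publication-date selector list are assumptions. Which date tags a given site actually publishes varies, so treat the list as a starting point.

```javascript
// Returns the trimmed `content` value of the first selector that matches a
// tag with a non-empty value, or `fallback` when none does.
// `doc` defaults to the page's document; passing it explicitly is optional.
function firstMetaContent(selectors, fallback, doc) {
    doc = doc || document;
    for (const selector of selectors) {
        const tag = doc.querySelector(selector);
        const content = tag ? (tag.getAttribute("content") || "").trim() : "";
        if (content) {
            return content;
        }
    }
    return fallback;
}

// Hypothetical priority list for a publication date (an assumption, not
// taken from the article): OpenGraph article time first, then Sailthru,
// then a plain meta date tag.
const PUBLISHED_DATE_SELECTORS = [
    'meta[property="article:published_time"]',
    'meta[name="sailthru.date"]',
    'meta[name="date"]',
];
```

In a content script the call would look like firstMetaContent(PUBLISHED_DATE_SELECTORS, ""), which returns an empty string when no date tag is present.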

Step 1: Harvesting the Best Subject Title

We write a robust parsing utility inside our content script to extract the most descriptive title from the metadata headers, falling back to the standard page title if metadata tags are absent:

function getBestSubjectLine() {
    // Selectors in priority order. Sailthru and Twitter Card tags are
    // declared with the name attribute; some sites emit twitter:* with
    // property instead, so both forms are checked.
    const selectors = [
        'meta[name="sailthru.title"]',
        'meta[property="og:title"]',
        'meta[name="twitter:title"], meta[property="twitter:title"]',
    ];

    for (const selector of selectors) {
        const tag = document.querySelector(selector);
        const content = tag ? (tag.getAttribute("content") || "").trim() : "";
        if (content) {
            return content;
        }
    }

    // Fall back to the standard page title.
    return (document.title || "").trim();
}

Step 2: Resolving the Author Name

Author naming schemas are notoriously inconsistent. Here is a helper that resolves the author's name across several news and publishing conventions:

function getBestAuthor() {
    // Selectors in priority order.
    const selectors = [
        // article:author_name is not part of the OpenGraph spec, which
        // defines article:author (often a profile URL rather than a name).
        'meta[property="article:author_name"]',
        // Sailthru tags use the name attribute.
        'meta[name="sailthru.author"]',
        // Standard meta author; the i flag matches "author", "Author", etc.
        'meta[name="author" i]',
    ];

    for (const selector of selectors) {
        const tag = document.querySelector(selector);
        const content = tag ? (tag.getAttribute("content") || "").trim() : "";
        if (content) {
            return content;
        }
    }

    // No author tag found.
    return "";
}

Step 3: Compiling Site Names

To record the source publisher (e.g. "TechCrunch", "Medium", or "Wired"), we check og:site_name first, then the application-name and apple-mobile-web-app-title meta tags:

function getBestSiteName() {
    // Selectors in priority order. og:site_name is an OpenGraph property;
    // application-name and apple-mobile-web-app-title are ordinary meta
    // names, so they are matched on the name attribute.
    const selectors = [
        'meta[property="og:site_name"]',
        'meta[name="application-name"]',
        'meta[name="apple-mobile-web-app-title"]',
    ];

    for (const selector of selectors) {
        const tag = document.querySelector(selector);
        const content = tag ? (tag.getAttribute("content") || "").trim() : "";
        if (content) {
            return content;
        }
    }

    // No site name tag found.
    return "";
}
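
When none of these tags is present, one possible last resort is to derive a label from the page's hostname. The sketch below is an assumption layered on top of the article's helper: siteNameFromHostname is a hypothetical name, and stripping only a leading "www." is a deliberately simple choice that will not turn "wired.com" into "Wired".

```javascript
// Derives a rough site label from a hostname: lowercases it, trims it, and
// drops a leading "www." prefix. Returns "" for missing input.
function siteNameFromHostname(hostname) {
    return String(hostname || "")
        .trim()
        .toLowerCase()
        .replace(/^www\./, "");
}

// Possible use in a content script:
//   const site = getBestSiteName() || siteNameFromHostname(window.location.hostname);
```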

Harvester Design Best Practices

Trim and Clean: Always call .trim() on extracted attribute values to strip leading and trailing whitespace, including stray tabs and newlines.
Run at document_end: Set "run_at": "document_end" in the content_scripts entry of your manifest. At document_start the head has not been parsed yet, so the meta tags are not in the DOM. At document_end the DOM is complete, but subresources such as images have not finished loading, so extraction is not held up by heavy assets.
Graceful Fallbacks: Never let a missing tag break execution. Return an empty string or fall back to raw document properties such as document.title.
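
Two of these practices are easy to capture as small helpers. The following is a sketch under stated assumptions: cleanMetaValue and whenDomReady are hypothetical names, and the optional doc parameter exists only so the guard can be exercised without a browser. The guard is useful when a script may be injected before the DOM has been parsed, since the meta tags cannot be queried until then.

```javascript
// Normalizes a raw attribute value: null or undefined become "", runs of
// whitespace (spaces, tabs, newlines) collapse to a single space, and both
// ends are trimmed.
function cleanMetaValue(raw) {
    if (raw === null || raw === undefined) {
        return "";
    }
    return String(raw).replace(/\s+/g, " ").trim();
}

// Runs `callback` once the DOM has been parsed. If parsing is still in
// progress, it waits for DOMContentLoaded; otherwise it runs immediately.
function whenDomReady(callback, doc) {
    doc = doc || document;
    if (doc.readyState === "loading") {
        doc.addEventListener("DOMContentLoaded", () => callback(), { once: true });
    } else {
        callback();
    }
}
```

A content script could then wrap its extraction as whenDomReady(() => { const title = cleanMetaValue(getBestSubjectLine()); }).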

Implementing a unified, prioritized metadata harvesting pipeline inside your content scripts lets users catalog web resources with a single click, removing most manual input and the errors that come with it.
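
One way to wire the single-click flow is to have the content script answer a message from the extension's popup or background script. The sketch below assumes a Chrome-style extension; the message type "harvest-metadata" and the makeHarvestListener helper are hypothetical names, while chrome.runtime.onMessage.addListener is the standard extension messaging API. Each harvester runs in its own try block so that one failure does not discard the other fields.

```javascript
// Hypothetical message type; any string agreed on by both sides would work.
const HARVEST_REQUEST = "harvest-metadata";

// Builds a listener for chrome.runtime.onMessage. `harvesters` maps field
// names to zero-argument functions such as getBestSubjectLine.
function makeHarvestListener(harvesters) {
    return function (message, sender, sendResponse) {
        if (!message || message.type !== HARVEST_REQUEST) {
            return false; // not our message; leave it to other listeners
        }
        const record = {};
        for (const [field, harvest] of Object.entries(harvesters)) {
            try {
                record[field] = String(harvest() || "").trim();
            } catch (err) {
                record[field] = ""; // a failing harvester must not break the rest
            }
        }
        sendResponse(record);
        return false; // the response was sent synchronously
    };
}

// Registration inside a content script (skipped outside an extension).
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
    chrome.runtime.onMessage.addListener(makeHarvestListener({
        title: getBestSubjectLine,
        author: getBestAuthor,
        siteName: getBestSiteName,
    }));
}
```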