When cataloging online assets, speed and precision are paramount. Curation workflows that require users to manually copy-paste page titles, publication dates, and author names are slow and error-prone. However, because every website lays out its visible content differently, building a single scraper around page layout is close to impossible.
The solution is Standardized Header Harvesting: programmatically reading the metadata tags that sites already publish for social sharing and SEO, such as OpenGraph, Twitter Cards, and Sailthru. Because these schemas are shared across publishers, a single script can harvest clean metadata across 95% of major sites.
The Ingestion Strategy
Rather than writing a custom parser for every website, we look for standardized metadata schemas embedded in the HTML head. We check the following sources, with the exact priority order chosen per field:
- OpenGraph Tags (og:*): Supported by almost all modern websites for social media preview cards.
- Twitter Card Tags (twitter:*): Supported by news sites and media platforms.
- Sailthru Tags (sailthru.*): Extremely popular among major news portals and corporate publishing outlets. Note that these use a dot separator and the name attribute.
- HTML Fallbacks: Standard tags like document.title and meta author tags as a clean backup.
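The priority-ordered lookup described above can be sketched as one small helper. This is a minimal illustration under stated assumptions, not the article's own code: the names firstMetaContent and getTitle and the doc parameter are hypothetical, and any object that exposes querySelector can stand in for the document.

```javascript
// Hypothetical helper: return the content of the first matching <meta> tag.
// `selectors` is an array of CSS selectors, listed from highest to lowest priority.
function firstMetaContent(selectors, doc = globalThis.document) {
  for (const selector of selectors) {
    const tag = doc.querySelector(selector);
    const content = tag ? tag.getAttribute("content") : null;
    if (content && content.trim()) {
      return content.trim();
    }
  }
  return ""; // nothing matched: the caller applies its own fallback
}

// Example: OpenGraph first, then Twitter Cards, then the plain page title.
function getTitle(doc = globalThis.document) {
  return (
    firstMetaContent(['meta[property="og:title"]', 'meta[name="twitter:title"]'], doc) ||
    doc.title ||
    ""
  );
}
```

Keeping the priority list as data makes it easy to reorder or extend the schemas per field without touching the lookup logic.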
Step 1: Harvesting the Best Subject Title
We write a robust parsing utility inside our content script to extract the most descriptive title from the metadata headers, falling back to the standard page title if metadata tags are absent:
function getBestSubjectLine() {
  // Fall back to the standard page title when no metadata tag is present.
  const subject = document.title || "";

  // 1. Check Sailthru title (Sailthru tags use name="sailthru.*")
  const sailthru = document.querySelector('meta[name="sailthru.title"]');
  if (sailthru && sailthru.getAttribute("content")) {
    return sailthru.getAttribute("content");
  }

  // 2. Check OpenGraph title
  const og = document.querySelector('meta[property="og:title"]');
  if (og && og.getAttribute("content")) {
    return og.getAttribute("content");
  }

  // 3. Check Twitter title (Twitter Cards specify name=, but some sites use property=)
  const twitter = document.querySelector(
    'meta[name="twitter:title"], meta[property="twitter:title"]'
  );
  if (twitter && twitter.getAttribute("content")) {
    return twitter.getAttribute("content");
  }

  return subject;
}
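Sites are inconsistent about whether a metadata key appears in the name attribute or the property attribute, so a lookup can miss a tag that is actually present. One way to guard against that, sketched here with a hypothetical metaSelector helper that is not part of the original script, is to build a selector that matches either attribute.

```javascript
// Hypothetical helper: build a selector matching <meta name="key"> or <meta property="key">.
function metaSelector(key) {
  // Escape backslashes and double quotes so the key is safe inside a quoted attribute value.
  const escaped = String(key).replace(/["\\]/g, "\\$&");
  return `meta[name="${escaped}"], meta[property="${escaped}"]`;
}

// Usage inside a content script:
//   const tag = document.querySelector(metaSelector("twitter:title"));
```

When both variants exist on a page, querySelector returns whichever element comes first in document order.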
Step 2: Resolving the Author Name
Author naming schemas are notoriously inconsistent. Here is a production-grade helper that resolves the author's name across various news and publishing conventions:
function getBestAuthor() {
  const author = "";

  // Check article:author_name, a publisher extension; the standard OpenGraph
  // article:author is defined as a profile reference and often holds a URL.
  const ogAuthor = document.querySelector('meta[property="article:author_name"]');
  if (ogAuthor && ogAuthor.getAttribute("content")) {
    return ogAuthor.getAttribute("content");
  }

  // Check Sailthru author (Sailthru tags use the name attribute)
  const sailthruAuthor = document.querySelector('meta[name="sailthru.author"]');
  if (sailthruAuthor && sailthruAuthor.getAttribute("content")) {
    return sailthruAuthor.getAttribute("content");
  }

  // Check standard meta author variations
  const metaAuthor = document.querySelector('meta[name="author"]') ||
    document.querySelector('meta[name="Author"]');
  if (metaAuthor && metaAuthor.getAttribute("content")) {
    return metaAuthor.getAttribute("content");
  }

  return author;
}
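Because author values arrive in inconsistent shapes, a cleanup pass after extraction can help. The sketch below is an assumption about what such a pass might look like: normalizeAuthor is a hypothetical name, and the specific rules (dropping a leading "By", rejecting profile URLs) are illustrative choices rather than anything the schemas require.

```javascript
// Hypothetical cleanup step for a raw author string taken from a meta tag.
function normalizeAuthor(raw) {
  if (typeof raw !== "string") return "";
  // Collapse runs of whitespace (including newlines) and trim the ends.
  let author = raw.replace(/\s+/g, " ").trim();
  // Some tags hold a profile URL instead of a name; treat those as "no author".
  if (/^https?:\/\//i.test(author)) return "";
  // Drop a leading byline prefix such as "By " or "by: ".
  author = author.replace(/^by[:\s]+/i, "");
  return author;
}

// Usage inside a content script:
//   const author = normalizeAuthor(getBestAuthor());
```

The prefix rule only fires when "by" is followed by a colon or whitespace, so names that merely start with those letters are left alone.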
Step 3: Compiling Site Names
To record the source publisher (e.g. "TechCrunch", "Medium", or "Wired"), we check application properties and brand headers:
function getBestSiteName() {
  const siteName = "";

  // 1. Check og:site_name (OpenGraph tags use the property attribute)
  const ogSite = document.querySelector('meta[property="og:site_name"]');
  if (ogSite && ogSite.getAttribute("content")) {
    return ogSite.getAttribute("content");
  }

  // 2. Check application-name (a standard meta name, so match on name=)
  const appName = document.querySelector('meta[name="application-name"]');
  if (appName && appName.getAttribute("content")) {
    return appName.getAttribute("content");
  }

  // 3. Check mobile app title (also declared with name=)
  const appleTitle = document.querySelector('meta[name="apple-mobile-web-app-title"]');
  if (appleTitle && appleTitle.getAttribute("content")) {
    return appleTitle.getAttribute("content");
  }

  return siteName;
}
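When none of these tags is present, the function returns an empty string. One possible last resort, sketched here as an assumption rather than as part of the original design, is to derive a label from the page's hostname. The name siteNameFromHostname is hypothetical.

```javascript
// Hypothetical last-resort fallback: derive a site label from a hostname.
function siteNameFromHostname(hostname) {
  if (typeof hostname !== "string") return "";
  // Trim, lowercase, and drop a leading "www." so "www.Wired.com" becomes "wired.com".
  return hostname.trim().toLowerCase().replace(/^www\./, "");
}

// Usage inside a content script:
//   const site = getBestSiteName() || siteNameFromHostname(window.location.hostname);
```

A hostname is less readable than a brand name, but it is always available and still identifies the source.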
Harvester Design Best Practices
- Trim and Clean: Always call .trim() on extracted attributes to strip leading and trailing spaces, tabs, and newlines.
- Run after the head is parsed: Set "run_at": "document_end" in your manifest configuration, or keep the default "document_idle". A content script injected at "document_start" runs before the DOM is constructed, so the meta tags are not yet available to querySelector. At "document_end" the DOM is complete, but the script still runs before images and other subresources finish loading.
- Graceful Fallbacks: Never let a missing tag break execution. Return empty strings and fall back to raw window properties.
Implementing a unified, prioritized metadata harvesting pipeline inside your content scripts lets users catalog web resources with a single click, removing manual copy-paste and the errors that come with it.
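As a closing sketch, the extractors can be combined into one call that applies the trimming and graceful-fallback practices listed above. The names cleanMetaValue and harvestMetadata are hypothetical, and the shape of the returned record is an assumption made for illustration.

```javascript
// Hypothetical helper: normalize a raw value to a trimmed string.
function cleanMetaValue(value) {
  // A missing tag or attribute yields null or undefined; map those to "".
  if (value == null) return "";
  return String(value).trim();
}

// Hypothetical pipeline: run each named extractor without letting one failure
// abort the rest. `extractors` maps a field name to a zero-argument function.
function harvestMetadata(extractors) {
  const record = {};
  for (const [field, extractor] of Object.entries(extractors)) {
    try {
      record[field] = cleanMetaValue(extractor());
    } catch (err) {
      record[field] = ""; // a throwing extractor yields an empty field
    }
  }
  return record;
}

// Usage inside a content script:
//   const record = harvestMetadata({
//     title: getBestSubjectLine,
//     author: getBestAuthor,
//     site: getBestSiteName,
//   });
```

Passing the extractors in as a map keeps the pipeline independent of any one field, so new fields can be added without changing the loop.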