Manufacturer Name Normalization: Resolving String Variations ('HRS' vs. 'Hirose') in Python

2023-06-23 13:25:42+00:00

When aggregating catalog data from multiple suppliers, inconsistent manufacturer names are a major challenge. Different distributors write the same manufacturer differently: Hirose Electric appears as HRS, Hirose, HIROSE ELECTRIC, or Hirose (HRS). Left unnormalized, your storefront facets will show four separate filters for one manufacturer.

To clean up this data, we build a Manufacturer Normalization Engine using mapping tables and string similarity algorithms.

1. Creating the Canonical Mapping Table

We define a dictionary mapping common spelling variations to a single, canonical manufacturer name:

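# Keys are lowercase with single spaces, matching the cleanup done in
# normalize_manufacturer() below.
MANUFACTURER_MAP = {
    "hrs": "Hirose",
    "hirose": "Hirose",
    "hirose electric": "Hirose",
    "hirose(hrs)": "Hirose",
    "hirose (hrs)": "Hirose",  # spaced form, as seen in supplier feeds
    "ti": "Texas Instruments",
    "texas inst": "Texas Instruments",
    "stmicro": "STMicroelectronics",
    "st microelectronics": "STMicroelectronics",
}

def normalize_manufacturer(name: str) -> str:
    # Trim the ends and collapse internal runs of whitespace.
    display_name = " ".join((name or "").split())
    if not display_name:
        return "Unknown"
    # Unmapped names are returned as supplied (whitespace-cleaned):
    # str.title() would mangle acronyms such as "NXP" into "Nxp".
    return MANUFACTURER_MAP.get(display_name.lower(), display_name)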

2. Fuzzy String Fallback

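If we encounter a new variation that is not in our dictionary, we can fall back to fuzzy string matching with a library such as rapidfuzz or Python's built-in difflib. When a cleaned name scores 90% or higher against a known alias or canonical name, we map it automatically, log the change, and flag it for manual review so a confirmed variation can be added to the mapping table. Fuzzy matching catches typos and near-spellings such as "Hirose Electrc"; an abbreviation like "HRS" shares too few characters with "Hirose" to score well and still needs an explicit table entry.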
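As a rough illustration, the fallback could be sketched with the standard library's difflib. The table ALIASES, the list REVIEW_QUEUE, the logger name, and the function fuzzy_normalize are hypothetical names introduced for this example, and the 0.9 cutoff applies to difflib's similarity ratio, which only approximates a "90% match".

```python
import difflib
import logging

logger = logging.getLogger("manufacturer_normalization")  # hypothetical logger name

# Hypothetical, trimmed-down alias table so the sketch runs on its own.
# Canonical names are included (lowercased) so they can be matched too.
ALIASES = {
    "hrs": "Hirose",
    "hirose": "Hirose",
    "hirose electric": "Hirose",
    "ti": "Texas Instruments",
    "texas inst": "Texas Instruments",
    "texas instruments": "Texas Instruments",
    "stmicro": "STMicroelectronics",
    "stmicroelectronics": "STMicroelectronics",
}

# Hypothetical stand-in for a real manual-review workflow.
REVIEW_QUEUE = []


def fuzzy_normalize(name, cutoff=0.9):
    """Return (manufacturer_name, was_fuzzy_match) for a raw supplier string."""
    cleaned = " ".join((name or "").split())
    if not cleaned:
        return "Unknown", False

    key = cleaned.lower()
    if key in ALIASES:
        # Exact hit in the mapping table: no fuzzy logic needed.
        return ALIASES[key], False

    # Best candidate whose similarity ratio is at least `cutoff`, if any.
    matches = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    if not matches:
        # Nothing close enough: keep the supplier's spelling untouched.
        return cleaned, False

    canonical = ALIASES[matches[0]]
    logger.info("Fuzzy-mapped %r to %r via alias %r", cleaned, canonical, matches[0])
    REVIEW_QUEUE.append((cleaned, canonical))  # flag for manual review
    return canonical, True
```

With this table, fuzzy_normalize("Hirose Electrc") resolves to Hirose and lands in the review queue, while an unrelated name such as "Murata" is passed through unchanged. Very short aliases such as "ti" are risky fuzzy targets, so a production version might exclude them from the candidate list.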

Applying this normalization during the ingestion phase keeps the search index clean and the navigation filters consistent.
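To make the effect on facets concrete, here is a small self-contained sketch that counts manufacturer values before and after normalization at ingestion time. The rows, the SKU values, the CANONICAL table, and the helper canonical_name are all made up for illustration.

```python
from collections import Counter

# Hypothetical alias table, kept tiny so the example runs on its own.
CANONICAL = {
    "hrs": "Hirose",
    "hirose": "Hirose",
    "hirose electric": "Hirose",
    "hirose (hrs)": "Hirose",
    "ti": "Texas Instruments",
}


def canonical_name(raw):
    """Map a raw supplier string to its canonical name, if one is known."""
    cleaned = " ".join(raw.split())
    return CANONICAL.get(cleaned.lower(), cleaned)


# Made-up rows standing in for merged supplier feeds.
supplier_rows = [
    {"sku": "SKU-001", "manufacturer": "HRS"},
    {"sku": "SKU-002", "manufacturer": "Hirose"},
    {"sku": "SKU-003", "manufacturer": "HIROSE ELECTRIC"},
    {"sku": "SKU-004", "manufacturer": "Hirose (HRS)"},
    {"sku": "SKU-005", "manufacturer": "TI"},
]

# Facet values as they would appear without normalization.
raw_facets = Counter(row["manufacturer"] for row in supplier_rows)

# Normalize each row during ingestion, then count again.
ingested_rows = [
    {**row, "manufacturer": canonical_name(row["manufacturer"])}
    for row in supplier_rows
]
normalized_facets = Counter(row["manufacturer"] for row in ingested_rows)

print(len(raw_facets), "raw facet values")
print(dict(normalized_facets))
```

In this toy data set the five raw values collapse into two facets, Hirose and Texas Instruments.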