Fuzzy Matching in Electronic Component Search: Tuning Elasticsearch for Complex Part Numbers

2023-02-28 03:08:34+00:00

Electronic component part numbers (such as DM3AT-SF-PEJM5(11) or 5015-120-001) are highly structured. They contain manufacturer prefixes, mounting codes, packaging identifiers, and revisions separated by dashes, slashes, or parentheses. Default text analysis handles them poorly: it either splits a part number into too many small tokens or fails to match when the user types only part of the number.

To give engineers fast, accurate search results, we need a custom Elasticsearch analyzer that understands part number syntax.

1. Custom N-Gram Tokenization

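Standard tokenizers split text on whitespace or punctuation and emit only whole words, so a partial query like DM3 or PEJM5 finds nothing. We configure a custom edge n-gram tokenizer that emits prefixes of 3 to 20 characters from the start of each word. Its token_chars setting keeps only letters and digits, so dashes, slashes, and parentheses act as token boundaries and never appear inside a token. A lowercase filter makes matching case-insensitive. Note that a segment shorter than min_gram, such as SF, produces no tokens at all: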

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "part_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "part_number_analyzer": {
          "type": "custom",
          "tokenizer": "part_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
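
To see what this analyzer would index, the following Python sketch approximates its behavior outside Elasticsearch. It is an illustration only: the function name edge_ngrams is hypothetical, and the regular expression covers ASCII letters and digits, whereas the real "letter" and "digit" classes are Unicode-aware.

```python
import re

def edge_ngrams(text, min_gram=3, max_gram=20):
    """Approximate the tokens emitted by part_number_analyzer (ASCII only)."""
    tokens = []
    # Any run of letters/digits is a word; everything else is a boundary,
    # mirroring token_chars: ["letter", "digit"].
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()  # the "lowercase" token filter
        # Emit prefixes of length min_gram..max_gram, anchored at the word start.
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens

print(edge_ngrams("DM3AT-SF-PEJM5(11)"))
# ['dm3', 'dm3a', 'dm3at', 'pej', 'pejm', 'pejm5']
```

The segments SF and 11 are shorter than min_gram and contribute nothing, so a query for either of them alone would not match this document through the n-gram field.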

2. Tuning Match Queries

When querying, we use a bool query that combines a heavily boosted exact-match clause with a lower-boost fuzzy clause. A document that matches both clauses scores highest, so exact matches rank at the top, while partial or slightly misspelled part numbers still appear further down the results.
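
As a minimal sketch of such a query, the Python function below builds the request body as a plain dictionary. The field names part_number (assumed to use part_number_analyzer) and part_number.raw (assumed to be a keyword sub-field), the function name, and the boost values are all assumptions made for illustration; they do not come from the mapping shown earlier.

```python
import json

def build_part_query(user_input, exact_boost=10.0, fuzzy_boost=1.0):
    """Build a bool query body: boosted exact clause plus a fuzzy clause."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact match on the untokenized value (hypothetical
                    # keyword sub-field); term queries are case-sensitive.
                    {"term": {"part_number.raw": {
                        "value": user_input,
                        "boost": exact_boost,
                    }}},
                    # Fuzzy match on the analyzed field; "AUTO" picks the
                    # allowed edit distance from the term length.
                    {"match": {"part_number": {
                        "query": user_input,
                        "fuzziness": "AUTO",
                        "boost": fuzzy_boost,
                    }}},
                ],
                # Require at least one of the two clauses to match.
                "minimum_should_match": 1,
            }
        }
    }

body = build_part_query("DM3AT-SF-PEJM5(11)")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request. One design choice this sketch leaves open is the search-time analyzer: with an edge n-gram index analyzer it is common to set a separate search_analyzer on the field so that the query text is not itself expanded into n-grams.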