Electronic component part numbers (such as DM3AT-SF-PEJM5(11) or 5015-120-001) are highly structured. They contain manufacturer prefixes, mounting codes, packaging identifiers, and revisions separated by dashes, slashes, or parentheses. Default full-text analysis handles them poorly: it either splits a part number into too many tokens or fails to match when the user types only part of it.
To provide instant, accurate search results for engineers, we must write custom Elasticsearch analyzers that understand part number syntax.
1. Custom N-Gram Tokenization
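Standard tokenizers split text on whitespace or punctuation and index only whole words. For part numbers, we also want partial queries such as DM3 or PEJM5 to match. We configure a custom edge n-gram tokenizer that emits every prefix of 3 to 20 characters from the start of each word. Its token_chars setting treats only letters and digits as part of a word, so dashes, slashes, and parentheses act as separators and never reach the index. A lowercase filter makes matching case-insensitive: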
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "part_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "part_number_analyzer": {
          "type": "custom",
          "tokenizer": "part_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
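To see which tokens this analyzer would index, the following Python sketch approximates its behavior without calling Elasticsearch. It is an illustration only: the function name edge_ngrams is hypothetical, and the ASCII-only regular expression is a simplification of the letter and digit character classes used by the real tokenizer.

```python
import re


def edge_ngrams(text, min_gram=3, max_gram=20):
    """Approximate part_number_analyzer: split on non-alphanumerics,
    emit prefixes of each word, then lowercase them."""
    tokens = []
    # Assumption: ASCII letters and digits stand in for the
    # "letter" and "digit" token_chars classes.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        # Words shorter than min_gram produce no tokens at all.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
    return tokens


if __name__ == "__main__":
    print(edge_ngrams("DM3AT-SF-PEJM5(11)"))
    # ['dm3', 'dm3a', 'dm3at', 'pej', 'pejm', 'pejm5']
```

Note that the segments SF and 11 are shorter than min_gram and therefore yield no tokens in this sketch. If short segments such as packaging codes need to be searchable, a lower min_gram may be worth evaluating.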
2. Tuning Match Queries
When querying, we use a bool query that combines a heavily boosted exact-match clause with a lower-boosted fuzzy clause. Exact matches then score highest and rank first, while partial or slightly misspelled part numbers still appear further down the results.
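One possible shape for such a query is sketched below as a Python function that builds the request body. The field names part_number (analyzed with the n-gram analyzer) and part_number.raw (an unanalyzed keyword sub-field), the function name, and the boost values are all assumptions made for illustration; they do not come from an existing mapping.

```python
import json


def build_part_query(user_input, exact_boost=10.0, fuzzy_boost=1.0):
    """Build a bool query: boosted exact match plus a weaker fuzzy match."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact clause: term queries are not analyzed, so the
                    # hypothetical part_number.raw field must store the
                    # part number exactly as the user is expected to type it.
                    {"term": {"part_number.raw": {
                        "value": user_input,
                        "boost": exact_boost,
                    }}},
                    # Fuzzy clause: tolerates small typos on the n-gram field.
                    {"match": {"part_number": {
                        "query": user_input,
                        "fuzziness": "AUTO",
                        "boost": fuzzy_boost,
                    }}},
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_part_query("DM3AT-SF"), indent=2))
```

The resulting dictionary can be serialized to JSON and sent as the body of a search request. The ratio between the two boosts, rather than their absolute values, determines how strongly exact matches are favored, so it is worth tuning against real queries.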