Tuning Elasticsearch for Complex String Lookups: Edge N-Grams and Case-Insensitive Matching

2022-08-14 18:00:00+00:00

Modern applications increasingly rely on search-as-you-type interfaces. Standard text analysis splits text into whole words and indexes each word as a single token, so a document containing "Apple" won't match a query for "Ap" until the user types the full word. In enterprise directories, where users search by customer-name or SKU prefixes, this whole-word tokenization fails.

To enable partial string searches, we configure custom Edge N-Gram Tokenizers inside our Elasticsearch index settings.

1. Constructing Custom Analyzers

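We configure the index to generate sub-word tokens anchored at the start of each word. With the settings below, a name like Enterprise Platform is split on non-alphanumeric characters, lowercased, and indexed as the prefixes en, ent, ente, and so on up to enterprise, followed by pl, pla, plat, and so on up to platform: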

// Elasticsearch index setting for partial matches
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "partial_match_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
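
To make the token stream concrete, the following is a rough pure-Python approximation of what the analyzer above emits. It is only a sketch: the function name edge_ngrams is hypothetical, and the regular expression stands in for the token_chars setting, so edge cases may differ from Elasticsearch's own tokenizer.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=15):
    """Approximate the edge_ngram tokenizer followed by the lowercase filter."""
    tokens = []
    # token_chars ["letter", "digit"]: any other character ends the current word.
    for word in re.findall(r"[^\W_]+", text):
        word = word.lower()
        # Emit every prefix from min_gram up to max_gram characters long.
        # Words shorter than min_gram produce no tokens at all.
        for size in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:size])
    return tokens


if __name__ == "__main__":
    print(edge_ngrams("Enterprise Platform"))
```

Under these assumptions, each additional word contributes roughly one token per character beyond the first, which is the growth the next section is concerned with.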

2. Restricting Index Size

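While Edge N-Grams enable fast partial matches, they generate several sub-tokens for every word, which increases index size. The max_gram limit of 15 caps how many tokens a single word can add. We also apply the custom analyzer at index time only and use a plain lowercasing analyzer at search time, so the query text is matched as typed against the stored prefixes instead of being expanded into n-grams itself. Limiting the n-gram analyzer to the fields that need prefix search keeps the disk footprint low.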
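
One way to express the index-time versus search-time split is through the analyzer and search_analyzer mapping parameters. The sketch below builds such an index body as a Python dictionary and prints it as JSON. The field name name and the analyzer name lowercase_search_analyzer are assumptions made for illustration and do not come from the configuration above; the search analyzer here uses the built-in standard tokenizer, which may split some inputs slightly differently from the letter-and-digit rule used at index time.

```python
import json

# Hypothetical names: the "name" field and "lowercase_search_analyzer"
# are illustrative assumptions, not part of the original settings.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_ngram_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Index time: expand each word into lowercase prefixes.
                "partial_match_analyzer": {
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer",
                    "filter": ["lowercase"],
                },
                # Search time: split into words and lowercase, no n-grams.
                "lowercase_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "partial_match_analyzer",
                "search_analyzer": "lowercase_search_analyzer",
            }
        }
    },
}

if __name__ == "__main__":
    # The printed JSON could serve as the body of an index-creation request.
    print(json.dumps(index_body, indent=2))
```

With a mapping along these lines, a query for "ent" is analyzed to the single token ent, which matches the stored prefix of enterprise directly.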