Search Systems

Why Not Just Use SQL?

-- A leading-wildcard LIKE cannot use a B-tree index: O(n) full table scan, no ranking
SELECT * FROM products WHERE description LIKE '%bluetooth speaker%';
-- 1M products → scans all 1M rows and returns matches with no relevance score

Elasticsearch advantages:

Inverted index → term lookup without scanning every document
Relevance scoring (TF-IDF, BM25)
Fuzzy matching, stemming, synonyms
Near real-time search (new documents become searchable after the next refresh, 1s by default)
Horizontal scaling built-in

Inverted Index

Documents:
  Doc1: "bluetooth speaker portable"
  Doc2: "wireless speaker home"
  Doc3: "bluetooth headphones portable"

Inverted Index:
  "bluetooth"  → [Doc1, Doc3]
  "speaker"    → [Doc1, Doc2]
  "portable"   → [Doc1, Doc3]
  "wireless"   → [Doc2]
  "headphones" → [Doc3]
  "home"       → [Doc2]

Query "bluetooth speaker":
  "bluetooth" → [Doc1, Doc3]
  "speaker"   → [Doc1, Doc2]
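  Intersection (AND) → Doc1, the only document containing both terms
  Union (OR, the match query default) → Doc1, Doc2, Doc3, with Doc1 scored highest because it matches both terms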
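
The lookup above can be sketched in a few lines of plain Java. This is a toy illustration, not how Lucene stores postings: the class name InvertedIndexSketch and its methods are made up for this example, tokenization is a whitespace split plus lowercasing, and there is no scoring.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical class): whitespace tokens, lowercase, no scoring.
final class InvertedIndexSketch {
    // term -> sorted set of document IDs (the posting list)
    private final Map<String, TreeSet<String>> postings = new LinkedHashMap<>();

    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND semantics: keep only documents that contain every query term.
    List<String> searchAll(String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the candidate set
            } else {
                result.retainAll(docs);         // later terms intersect with it
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    // Builds the three sample documents used in this section.
    static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Doc1", "bluetooth speaker portable");
        index.add("Doc2", "wireless speaker home");
        index.add("Doc3", "bluetooth headphones portable");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(sample().searchAll("bluetooth speaker")); // prints [Doc1]
    }
}
```

The query touches only the two posting lists it needs, so its cost depends on the size of those lists rather than on the total number of documents.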

Elasticsearch Concepts

Concept	SQL Equivalent
Index	Table
Document	Row
Field	Column
Mapping	Schema
Shard	Partition
Replica	Read replica

Index Settings

PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "my_analyzer" },
      "name_keyword": { "type": "keyword" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "description": { "type": "text" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date" },
      "in_stock": { "type": "boolean" }
    }
  }
}

Common Query Patterns

Full-Text Search with Boost

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "bluetooth speaker",
      "fields": ["name^3", "description", "tags^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}

Boolean Query

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "speaker" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 50, "lte": 200 } } },
        { "term": { "category": "electronics" } },
        { "term": { "in_stock": true } }
      ],
      "should": [
        { "match": { "tags": "featured" } }
      ]
    }
  },
  "sort": [
    { "_score": "desc" },
    { "created_at": "desc" }
  ]
}

Faceted Search (Aggregations)

GET /products/_search
{
  "query": { "match": { "name": "speaker" } },
  "aggs": {
    "by_category": {
      "terms": { "field": "category", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    },
    "avg_price": { "avg": { "field": "price" } }
  }
}
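
To make the bucket semantics concrete, here is a minimal in-memory sketch of what the terms and range aggregations compute over the matching hits. It assumes a hypothetical Product record and a FacetSketch class that are not part of any Elasticsearch API; the only behavior taken from Elasticsearch is that a range bucket includes its from value and excludes its to value.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of facet counting (not an Elasticsearch API).
final class FacetSketch {
    // Stand-in for an indexed product document.
    record Product(String name, String category, double price) {}

    // Same idea as a "terms" aggregation: count hits per category value.
    static Map<String, Long> countByCategory(List<Product> hits) {
        Map<String, Long> counts = new TreeMap<>();
        for (Product p : hits) {
            counts.merge(p.category(), 1L, Long::sum);
        }
        return counts;
    }

    // Same idea as the "range" aggregation above:
    // each bucket includes its lower bound and excludes its upper bound.
    static Map<String, Long> countByPriceRange(List<Product> hits) {
        double[] bounds = {50, 100, 200};
        String[] labels = {"*-50", "50-100", "100-200", "200-*"};
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.put(label, 0L);
        }
        for (Product p : hits) {
            int bucket = 0;
            while (bucket < bounds.length && p.price() >= bounds[bucket]) {
                bucket++;
            }
            counts.merge(labels[bucket], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Product> hits = List.of(
            new Product("Mini speaker", "electronics", 49.99),
            new Product("Shelf speaker", "electronics", 50.00),
            new Product("Studio speaker", "audio", 199.00));
        System.out.println(countByCategory(hits));
        System.out.println(countByPriceRange(hits));
    }
}
```

In Elasticsearch the counts are computed per shard and merged, which is why facets come back in the same response as the hits.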

Spring Data Elasticsearch

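// Document mapping
import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
import org.springframework.data.elasticsearch.annotations.InnerField;
import org.springframework.data.elasticsearch.annotations.MultiField;
import org.springframework.data.elasticsearch.annotations.Setting;

@Document(indexName = "products")
@Setting(settingPath = "es-settings.json")
public class ProductDocument {
    @Id
    private String id;

    // Analyzed for full-text search, plus a name.keyword sub-field for sorting and exact match
    @MultiField(
        mainField = @Field(type = FieldType.Text, analyzer = "english"),
        otherFields = @InnerField(suffix = "keyword", type = FieldType.Keyword)
    )
    private String name;

    // Searched by the queries below, so it needs a mapping as well
    @Field(type = FieldType.Text)
    private String description;

    @Field(type = FieldType.Float)
    private BigDecimal price;

    @Field(type = FieldType.Keyword)
    private String category;

    @Field(type = FieldType.Keyword)
    private List<String> tags;

    @Field(type = FieldType.Date)
    private Instant createdAt;

    // Getters and setters omitted for brevity
}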

// Repository
public interface ProductSearchRepository
        extends ElasticsearchRepository<ProductDocument, String> {

    @Query("{\"multi_match\": {\"query\": \"?0\", \"fields\": [\"name^3\", \"description\"]}}")
    Page<ProductDocument> search(String query, Pageable pageable);
}

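// Service with custom query
// Imports assume Spring Data Elasticsearch 5.x with the Elasticsearch Java API client
import co.elastic.clients.json.JsonData;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.client.elc.NativeQuery;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.Query;
import org.springframework.stereotype.Service;

@Service
public class ProductSearchService {
    private final ElasticsearchOperations esOps;

    // Constructor injection keeps the dependency final and easy to replace in tests
    public ProductSearchService(ElasticsearchOperations esOps) {
        this.esOps = esOps;
    }

    // ElasticsearchOperations.search returns SearchHits (hits, scores, total count)
    public SearchHits<ProductDocument> search(ProductSearchRequest req) {
        Query query = NativeQuery.builder()
            .withQuery(q -> q.bool(b -> b
                // Scored clause: fuzzy multi-field match with per-field boosts
                .must(m -> m.multiMatch(mm -> mm
                    .query(req.getKeyword())
                    .fields("name^3", "description", "tags^2")
                    .fuzziness("AUTO")
                ))
                // Unscored clause: price range filter.
                // JsonData bounds follow the older client API; newer client versions may need a typed range variant.
                .filter(f -> f.range(r -> r
                    .field("price")
                    .gte(JsonData.of(req.getMinPrice()))
                    .lte(JsonData.of(req.getMaxPrice()))
                ))
            ))
            .withPageable(PageRequest.of(req.getPage(), req.getSize()))
            .build();

        return esOps.search(query, ProductDocument.class);
    }
}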

Autocomplete / Search-as-You-Type

Completion Suggester

PUT /products
{
  "mappings": {
    "properties": {
      "name_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "search_analyzer": "simple"
      }
    }
  }
}

// Index a document with suggestions
POST /products/_doc
{
  "name": "Bluetooth Speaker",
  "name_suggest": {
    "input": ["Bluetooth Speaker", "bluetooth speaker", "speaker"],
    "weight": 100
  }
}

// Query
GET /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "blue",
      "completion": {
        "field": "name_suggest",
        "size": 5,
        "fuzzy": { "fuzziness": 1 }
      }
    }
  }
}

Edge N-Gram (Type-ahead)

Better for partial word matching:

"analysis": {
  "tokenizer": {
    "edge_ngram_tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
    }
  }
}
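// Index time: "blue" → "b", "bl", "blu", "blue"
// Searching "blu" then matches "blue", "bluetooth", etc., provided the field's
// search_analyzer (e.g. "standard") does not n-gram the query itself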
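
A rough sketch of what the tokenizer emits for a single token, in plain Java. EdgeNGramSketch and its methods are hypothetical names for this illustration; the real tokenizer also handles token boundaries and character classes, which are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of edge n-gram generation for one token.
final class EdgeNGramSketch {
    // Returns every prefix of the token whose length is between minGram and maxGram.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // The typed text matches when it equals one of the indexed prefixes.
    // The query side is lowercased but not n-grammed.
    static boolean matchesPrefix(String indexedToken, String typed) {
        return edgeNGrams(indexedToken.toLowerCase(), 1, 20).contains(typed.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("blue", 1, 20));         // prints [b, bl, blu, blue]
        System.out.println(matchesPrefix("bluetooth", "blu")); // prints true
    }
}
```

The trade-off is index size: every token is stored once per prefix length, so a larger max_gram costs more disk and memory in exchange for longer matchable prefixes.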

Search Architecture

User Input → Search API
                ↓
           Query Parser (handle special chars, operators, quotes)
                ↓
           Query Builder (build ES query with filters, boosts)
                ↓
           Elasticsearch Cluster
           (5 primary shards + 1 replica each = 10 shards total for 100M docs, spread across the data nodes)
                ↓
           Result Ranker (personalization, A/B test scoring)
                ↓
           Response Formatter
                ↓
           Client

Data Sync: DB → Elasticsearch

Option 1: Dual-write (application writes to DB, then to ES)
  Risk: the two writes do not share a transaction, so a partial failure leaves them inconsistent

Option 2: CDC (Change Data Capture) via Debezium
  DB transaction log (binlog/WAL) → Debezium → Kafka → ES Consumer
  Reliable, near real-time (~1s lag)

Option 3: Batch sync (nightly full reindex)
  Simple, but stale data during day

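// Debezium + Kafka → ES indexer
// DebeziumMessage is assumed to be an application-defined type for the change-event envelope
@KafkaListener(topics = "dbserver1.public.products")
public void onProductChange(DebeziumMessage msg) {
    // Debezium op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if ("d".equals(msg.getOp())) {
        // Deletes carry the old row in "before"; "after" is null
        productSearchRepository.deleteById(msg.getBefore().getId());
    } else {
        // save() overwrites by ID, so replayed events are idempotent
        ProductDocument doc = mapper.toDocument(msg.getAfter());
        productSearchRepository.save(doc);
    }
}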

Relevance Tuning

Technique	Effect
Field boosting	Name matches > description matches
Freshness boost	Recent products scored higher
Popularity boost	High-view products ranked up
Personalization	User's history influences ranking
Synonym expansion	"TV" matches "television"
Stemming	"running" matches "run", "runs"
Fuzzy matching	"speker" matches "speaker"
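
The fuzzy matching row can be illustrated with an edit-distance check. The sketch below is a simplification under stated assumptions: it uses plain Levenshtein distance (Elasticsearch also counts a transposition of two adjacent characters as one edit by default), and FuzzySketch is a hypothetical class. The length thresholds mirror the documented behavior of "fuzziness": "AUTO".

```java
// Hypothetical illustration of typo tolerance via edit distance.
final class FuzzySketch {
    // Classic Levenshtein distance using two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // AUTO fuzziness: 0-2 chars must match exactly, 3-5 chars allow 1 edit, longer terms allow 2.
    static int autoMaxEdits(String term) {
        int n = term.length();
        if (n <= 2) {
            return 0;
        }
        return n <= 5 ? 1 : 2;
    }

    static boolean fuzzyMatches(String typed, String indexedTerm) {
        return levenshtein(typed, indexedTerm) <= autoMaxEdits(typed);
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("speker", "speaker"));  // prints 1
        System.out.println(fuzzyMatches("speker", "speaker")); // prints true
    }
}
```

A search engine does not compare the query against every indexed term this way; it walks the term dictionary with an automaton. The sketch only shows which pairs of terms count as a match.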

Interview Questions

Q: Why is Elasticsearch (or search engine) better than SQL LIKE for full-text search?

A: Search engines use analyzers, inverted indexes, and ranking algorithms optimized for linguistic matching. A leading-wildcard LIKE cannot use a standard index, so it scans every row, and it returns matches with no relevance score.

Q: What is an inverted index? How does it enable fast search?

A: An inverted index maps terms to posting lists of matching document IDs. Queries become fast set operations on postings instead of scanning full documents.

Q: How do you keep Elasticsearch in sync with your primary database?

A: Publish DB changes through CDC or outbox events and update Elasticsearch asynchronously with retries. Keep reindex/replay tooling for drift recovery.

Q: What is BM25 and how does it improve over TF-IDF?

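A: BM25 is a probabilistic ranking function that saturates the contribution of repeated terms (parameter k1) and normalizes for document length (parameter b). It has been Elasticsearch's default similarity since 5.0 and usually produces more relevant rankings than classic TF-IDF.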
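
As a rough illustration, the per-term BM25 score can be written out directly. This sketch uses the textbook formula with Elasticsearch's default parameters (k1 = 1.2, b = 0.75) and the Lucene-style IDF; Lucene's actual implementation differs in constant factors that do not change the ranking. Bm25Sketch and its method names are hypothetical.

```java
// Hypothetical illustration of the textbook BM25 score for one query term in one document.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of document-length normalization

    // Rare terms get a higher weight; the "1 +" keeps the value positive.
    static double idf(long docCount, long docsWithTerm) {
        return Math.log(1.0 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    static double score(double termFreq, double docLength, double avgDocLength,
                        long docCount, long docsWithTerm) {
        // Documents longer than average are penalized, shorter ones are favored.
        double lengthNorm = 1.0 - B + B * (docLength / avgDocLength);
        // Grows with termFreq but never exceeds K1 + 1 (saturation).
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docsWithTerm) * tfPart;
    }

    public static void main(String[] args) {
        // Same term, same document length: the tenth occurrence adds far less than the second.
        System.out.println(score(1, 100, 100, 1_000_000, 1_000));
        System.out.println(score(2, 100, 100, 1_000_000, 1_000));
        System.out.println(score(10, 100, 100, 1_000_000, 1_000));
    }
}
```

Classic TF-IDF keeps growing with term frequency, so a document that repeats a word many times can dominate; the saturation term here is what prevents that.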

Q: How would you design an autocomplete/search-as-you-type feature?

A: Index edge n-grams or use completion suggester with popularity signals and typo tolerance. Cache top prefixes and debounce client queries.

Q: What is faceted search and how do you implement it?

A: Faceted search shows aggregated counts by attributes (brand, price range, category) alongside results. Implement with filtered aggregations over indexed keyword/numeric fields.

Q: How do you handle Elasticsearch going down while still accepting writes to your primary DB?

A: Keep DB as source of truth, queue indexing events durably, and replay when cluster recovers. Degrade gracefully by serving fallback search modes if needed.
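
The queue-and-replay idea can be sketched as follows. This is an in-memory stand-in only: in practice the durable queue would be something like a Kafka topic or an outbox table, and the Indexer interface here is a hypothetical placeholder for the code that writes to Elasticsearch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of buffering index events while the cluster is down.
// An in-memory deque is NOT durable; it only stands in for a durable log.
final class IndexReplaySketch {
    // Placeholder for the component that writes one event to Elasticsearch.
    interface Indexer {
        void index(String event) throws Exception;
    }

    private final Deque<String> pending = new ArrayDeque<>();

    void enqueue(String event) {
        pending.addLast(event);
    }

    int pendingCount() {
        return pending.size();
    }

    // Replays events in arrival order and stops at the first failure,
    // so ordering is preserved and nothing is dropped. Returns the number indexed.
    int flush(Indexer indexer) {
        int indexed = 0;
        while (!pending.isEmpty()) {
            String event = pending.peekFirst();
            try {
                indexer.index(event);
            } catch (Exception e) {
                break; // cluster still unavailable: keep the event for the next attempt
            }
            pending.removeFirst(); // remove only after a successful write
            indexed++;
        }
        return indexed;
    }
}
```

Because an event is removed only after the write succeeds, a crash between the two steps replays it, so the indexing operation should be idempotent (for example, an upsert keyed by document ID).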

Q: How would you scale Elasticsearch to handle 1 billion product documents?

A: Plan shard count by index size/throughput, use index lifecycle tiers, and separate ingest from query workloads. Optimize mappings, refresh intervals, and replica strategy for SLA.

Q: What are the trade-offs of Elasticsearch's near real-time (~1s) indexing?

A: NRT improves throughput by batching refreshes but introduces short visibility delay after writes. Lower refresh intervals reduce delay but increase indexing overhead.

Q: How do you tune relevance in search results?

A: Tune analyzers, field boosts, synonym handling, and business signals (freshness/popularity). Evaluate with offline judgments and online A/B metrics like CTR and conversion.
