Search Systems
Why Not Just Use SQL?
-- LIKE with a leading wildcard cannot use a regular index: O(n) full table scan, no ranking
SELECT * FROM products WHERE description LIKE '%bluetooth speaker%';
-- 1M products → scans all 1M rows, returns no relevance score
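Elasticsearch advantages:
- Inverted index → near-constant-time term lookup, independent of the number of documents
- Relevance scoring (BM25 by default; TF-IDF in older versions)
- Fuzzy matching, stemming, synonyms
- Near real-time search (new documents become searchable after the next refresh, 1s by default)
- Horizontal scaling built in (shards and replicas)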
Inverted Index
Documents:
Doc1: "bluetooth speaker portable"
Doc2: "wireless speaker home"
Doc3: "bluetooth headphones portable"
Inverted Index:
"bluetooth" → [Doc1, Doc3]
"speaker" → [Doc1, Doc2]
"portable" → [Doc1, Doc3]
"wireless" → [Doc2]
"headphones" → [Doc3]
Query "bluetooth speaker":
"bluetooth" → [Doc1, Doc3]
"speaker" → [Doc1, Doc2]
Intersection → Doc1 (matches both terms, so it scores highest; with OR matching, Doc2 and Doc3 also match but rank lower)
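The lookup above can be sketched in a few lines of Java. This is a toy model, not how Lucene stores postings: the class and method names (`InvertedIndexSketch`, `add`, `searchAll`, `sample`) are hypothetical, tokenization is a plain whitespace split, and there is no scoring. It only shows that a query becomes a set intersection over posting lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: hypothetical names, not an Elasticsearch or Lucene API.
final class InvertedIndexSketch {
    // term -> sorted set of document IDs (the posting list)
    private final Map<String, TreeSet<String>> postings = new LinkedHashMap<>();

    // Lowercase, split on whitespace, and record the document under each term.
    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND semantics: intersect the posting lists of every query term.
    List<String> searchAll(String query) {
        TreeSet<String> result = null;
        for (String term : query.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    // The three example documents from this section.
    static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Doc1", "bluetooth speaker portable");
        index.add("Doc2", "wireless speaker home");
        index.add("Doc3", "bluetooth headphones portable");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(sample().searchAll("bluetooth speaker")); // prints [Doc1]
    }
}
```

The work done per query depends on the length of the posting lists involved, not on the total number of documents, which is the point of the structure.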
Elasticsearch Concepts
| Concept | SQL Equivalent |
|---|---|
| Index | Table |
| Document | Row |
| Field | Column |
| Mapping | Schema |
| Shard | Partition |
| Replica | Read replica |
Index Settings
PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "my_analyzer" },
      "name_keyword": { "type": "keyword" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "description": { "type": "text" },
      "tags": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" }
    }
  }
}
Common Query Patterns
Full-Text Search with Boost
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "bluetooth speaker",
      "fields": ["name^3", "description", "tags^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}
Boolean Query
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "speaker" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 50, "lte": 200 } } },
        { "term": { "category": "electronics" } },
        { "term": { "in_stock": true } }
      ],
      "should": [
        { "match": { "tags": "featured" } }
      ]
    }
  },
  "sort": [
    { "_score": "desc" },
    { "created_at": "desc" }
  ]
}
Faceted Search (Aggregations)
GET /products/_search
{
  "query": { "match": { "name": "speaker" } },
  "aggs": {
    "by_category": {
      "terms": { "field": "category", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    },
    "avg_price": { "avg": { "field": "price" } }
  }
}
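To make the bucket boundaries of the `price_ranges` aggregation concrete, here is a small in-memory sketch in Java. In a range aggregation each bucket includes its `from` value and excludes its `to` value, so a price of exactly 50 lands in the second bucket. The class name, method names, and bucket labels are hypothetical and chosen for illustration; Elasticsearch generates its own bucket keys.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics the four price buckets from the aggregation above.
final class PriceFacetSketch {
    private static final List<String> LABELS = List.of("<50", "50-100", "100-200", ">=200");

    // "from" is inclusive, "to" is exclusive, matching range aggregation semantics.
    static String bucketFor(double price) {
        if (price < 50) {
            return "<50";
        }
        if (price < 100) {
            return "50-100";
        }
        if (price < 200) {
            return "100-200";
        }
        return ">=200";
    }

    // Count documents per bucket, keeping empty buckets so the facet list stays stable.
    static Map<String, Integer> countByBucket(List<Double> prices) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : LABELS) {
            counts.put(label, 0);
        }
        for (double price : prices) {
            counts.merge(bucketFor(price), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByBucket(List.of(49.99, 50.0, 99.99, 100.0, 200.0)));
    }
}
```

Keeping zero-count buckets in the result is a design choice for UIs that should not reshuffle their filter list between searches.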
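Spring Data Elasticsearch
// Document mapping
@Document(indexName = "products")
@Setting(settingPath = "es-settings.json")
public class ProductDocument {
    @Id
    private String id;

    // Analyzed text for search, plus a "name.keyword" sub-field for exact match, sorting, and aggregations
    @MultiField(
        mainField = @Field(type = FieldType.Text, analyzer = "english"),
        otherFields = @InnerField(suffix = "keyword", type = FieldType.Keyword)
    )
    private String name;

    @Field(type = FieldType.Float)
    private BigDecimal price;

    @Field(type = FieldType.Keyword)
    private String category;

    @Field(type = FieldType.Date)
    private Instant createdAt;
}

// Repository
public interface ProductSearchRepository
        extends ElasticsearchRepository<ProductDocument, String> {
    // ?0 is replaced with the first method argument
    @Query("{\"multi_match\": {\"query\": \"?0\", \"fields\": [\"name^3\", \"description\"]}}")
    Page<ProductDocument> search(String query, Pageable pageable);
}

// Service with custom query
// NativeQuery and the lambda builders need Spring Data Elasticsearch 4.4 or later
import co.elastic.clients.json.JsonData;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.client.elc.NativeQuery;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.Query;
import org.springframework.stereotype.Service;

@Service
public class ProductSearchService {
    @Autowired private ElasticsearchOperations esOps;

    // esOps.search returns SearchHits, which carries each hit's score alongside the document
    public SearchHits<ProductDocument> search(ProductSearchRequest req) {
        Query query = NativeQuery.builder()
            .withQuery(q -> q.bool(b -> b
                // Scored clause: keyword match with per-field boosts
                .must(m -> m.multiMatch(mm -> mm
                    .query(req.getKeyword())
                    .fields("name^3", "description", "tags^2")
                    .fuzziness("AUTO")
                ))
                // Unscored clause: price bounds only narrow the result set
                .filter(f -> f.range(r -> r
                    .field("price")
                    .gte(JsonData.of(req.getMinPrice()))
                    .lte(JsonData.of(req.getMaxPrice()))
                ))
            ))
            .withPageable(PageRequest.of(req.getPage(), req.getSize()))
            .build();
        return esOps.search(query, ProductDocument.class);
    }
}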
Autocomplete / Search-as-You-Type
Completion Suggester
PUT /products
{
  "mappings": {
    "properties": {
      "name_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "search_analyzer": "simple"
      }
    }
  }
}
// Index document with suggestions
POST /products/_doc
{
  "name": "Bluetooth Speaker",
  "name_suggest": {
    "input": ["Bluetooth Speaker", "bluetooth speaker", "speaker"],
    "weight": 100
  }
}
// Query
GET /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "blue",
      "completion": {
        "field": "name_suggest",
        "size": 5,
        "fuzzy": { "fuzziness": 1 }
      }
    }
  }
}
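Edge N-Gram (Type-ahead)
Better for partial word matching:
"analysis": {
  "tokenizer": {
    "edge_ngram_tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20,
      "token_chars": ["letter", "digit"]
    }
  },
  "analyzer": {
    "autocomplete": {
      "type": "custom",
      "tokenizer": "edge_ngram_tokenizer",
      "filter": ["lowercase"]
    }
  }
}
// "blue" → "b", "bl", "blu", "blue"
// Searching "blu" matches "blue", "bluetooth", etc.
// Set a non-n-gram search_analyzer (e.g. "standard") on the field so the query itself is not split into prefixes.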
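As a rough illustration of what the tokenizer emits for a single word, the following Java sketch generates edge n-grams by hand. It is not the Lucene implementation; `EdgeNGramSketch` and `edgeNGrams` are hypothetical names, and the sketch handles one already-split token rather than a full text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hand-rolled prefix generation for one token.
final class EdgeNGramSketch {
    // Returns every prefix of the token whose length is between minGram and maxGram.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("blue", 1, 20)); // prints [b, bl, blu, blue]
    }
}
```

Because every prefix is stored at index time, a query for "blu" is an ordinary exact term lookup. The cost is a larger index, which is why `min_gram` and `max_gram` are worth tuning.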
Search Architecture
User Input → Search API
↓
Query Parser (handle special chars, operators, quotes)
↓
Query Builder (build ES query with filters, boosts)
↓
Elasticsearch Cluster
(5 primary shards × 1 replica each = 10 shard copies for 100M docs, spread across the data nodes)
↓
Result Ranker (personalization, A/B test scoring)
↓
Response Formatter
↓
Client
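One small piece of the Query Parser stage can be sketched as follows, assuming user input is forwarded to a `query_string`-style query where certain characters act as operators. The class and method names are hypothetical. The sketch backslash-escapes the reserved characters and drops `<` and `>`, which cannot be escaped and have to be removed. If the pipeline only builds `match` or `multi_match` queries, this step is not needed.

```java
// Illustrative only: escapes query_string operators in raw user input.
final class QueryEscapeSketch {
    // Characters treated as operators by the query_string syntax (excluding < and >).
    private static final String RESERVED = "+-=&|!(){}[]^\"~*?:\\/";

    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == '<' || c == '>') {
                continue; // cannot be escaped, so remove them
            }
            if (RESERVED.indexOf(c) >= 0) {
                out.append('\\'); // prefix operator characters with a backslash
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("wi-fi (portable) speaker"));
    }
}
```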
Data Sync: DB → Elasticsearch
Option 1: Dual-write (application writes to DB, then to ES, in the same request)
Risk: partial failure → inconsistency (ES cannot take part in the DB transaction)
Option 2: CDC (Change Data Capture) via Debezium
DB binlog/WAL → Debezium → Kafka → ES Consumer
Reliable, near real-time (~1s lag)
Option 3: Batch sync (nightly full reindex)
Simple, but data goes stale during the day
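// Debezium + Kafka → ES indexer
// DebeziumMessage is an application-defined view of the Debezium change-event envelope
@KafkaListener(topics = "dbserver1.public.products")
public void onProductChange(DebeziumMessage msg) {
    // Debezium op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if ("d".equals(msg.getOp())) {
        // Deletes carry the old row in "before"; "after" is null
        productSearchRepository.deleteById(msg.getBefore().getId());
    } else {
        // save() indexes by ID, so replaying the same event is idempotent
        ProductDocument doc = mapper.toDocument(msg.getAfter());
        productSearchRepository.save(doc);
    }
}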
Relevance Tuning
| Technique | Effect |
|---|---|
| Field boosting | Name matches > description matches |
| Freshness boost | Recent products scored higher |
| Popularity boost | High-view products ranked up |
| Personalization | User's history influences ranking |
| Synonym expansion | "TV" matches "television" |
| Stemming | "running" matches "run", "runs" |
| Fuzzy matching | "speker" matches "speaker" |
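The fuzzy matching row can be made concrete with an edit-distance sketch. With `"fuzziness": "AUTO"`, Elasticsearch allows 0 edits for terms of 1 to 2 characters, 1 edit for 3 to 5 characters, and 2 edits for longer terms. The Java below is an illustration with hypothetical names; it uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of two adjacent characters as a single edit.

```java
// Illustrative only: plain Levenshtein distance plus the AUTO fuzziness thresholds.
final class FuzzySketch {
    // Classic two-row dynamic programming edit distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Allowed edits for "fuzziness": "AUTO", based on the query term length.
    static int autoFuzziness(String term) {
        int n = term.length();
        if (n <= 2) {
            return 0;
        }
        if (n <= 5) {
            return 1;
        }
        return 2;
    }

    static boolean fuzzyMatches(String queryTerm, String indexedTerm) {
        return levenshtein(queryTerm, indexedTerm) <= autoFuzziness(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("speker", "speaker"));  // prints 1
        System.out.println(fuzzyMatches("speker", "speaker")); // prints true
    }
}
```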
Interview Questions
Q: Why is Elasticsearch (or search engine) better than SQL LIKE for full-text search?
A: Search engines run text through analyzers (tokenizing, lowercasing, stemming), store the terms in an inverted index, and rank matches with a scoring function such as BM25. SQL LIKE with a leading wildcard cannot use a regular index, so it scans every row, and it returns matches without any relevance score.
Q: What is an inverted index? How does it enable fast search?
A: An inverted index maps each term to a posting list of the document IDs that contain it. A query looks up its terms and combines their posting lists with set operations (intersection, union) instead of scanning every document.
Q: How do you keep Elasticsearch in sync with your primary database?
A: Publish DB changes through CDC or outbox events and apply them to Elasticsearch asynchronously, with retries and idempotent writes keyed by document ID. Keep reindex/replay tooling for drift recovery.
Q: What is BM25 and how does it improve over TF-IDF?
A: BM25 is a probabilistic ranking function. Compared with plain TF-IDF, it saturates term frequency (repeating a term gives diminishing returns) and normalizes for document length, which usually produces more relevant rankings.
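For reference, the textbook form of BM25 is shown below. The notation is the conventional one and does not come from this page: f(q_i, D) is the frequency of term q_i in document D, |D| is the document length, avgdl is the average document length in the index, N is the number of documents, and n(q_i) is the number of documents containing q_i. Elasticsearch's defaults are k_1 = 1.2 and b = 0.75.

```latex
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```

The k_1 term caps how much repeated occurrences can add, and b controls how strongly long documents are penalized. Setting b = 0 turns length normalization off.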
Q: How would you design an autocomplete/search-as-you-type feature?
A: Index edge n-grams or use the completion suggester, weight suggestions by popularity, and allow typo tolerance. Cache the most frequent prefixes and debounce client requests.
Q: What is faceted search and how do you implement it?
A: Faceted search shows aggregated counts by attribute (brand, price range, category) alongside the results. Implement it with terms and range aggregations over keyword and numeric fields, scoped to the same filtered query.
Q: How do you handle Elasticsearch going down while still accepting writes to your primary DB?
A: Keep the DB as the source of truth, queue indexing events durably, and replay them when the cluster recovers. Degrade gracefully by serving a fallback search mode if needed.
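The queue-and-replay idea can be sketched as below. This is an in-memory illustration with hypothetical names; in a real system the backlog would live in a durable log such as a Kafka topic, and the `indexer` predicate stands in for a call to Elasticsearch that returns false when the cluster is unavailable.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Illustrative only: ordered backlog that is replayed once indexing succeeds again.
final class IndexReplaySketch {
    private final Deque<String> pending = new ArrayDeque<>();

    // Record an indexing event; the DB write has already been committed at this point.
    void enqueue(String event) {
        pending.addLast(event);
    }

    // Send events in order and stop at the first failure so ordering is preserved.
    // Returns the number of events that were indexed in this pass.
    int drain(Predicate<String> indexer) {
        int sent = 0;
        while (!pending.isEmpty()) {
            if (!indexer.test(pending.peekFirst())) {
                break;
            }
            pending.removeFirst();
            sent++;
        }
        return sent;
    }

    int backlog() {
        return pending.size();
    }

    public static void main(String[] args) {
        IndexReplaySketch queue = new IndexReplaySketch();
        queue.enqueue("upsert:42");
        queue.enqueue("delete:7");
        System.out.println(queue.drain(event -> false)); // cluster down: prints 0
        System.out.println(queue.drain(event -> true));  // cluster back: prints 2
    }
}
```

An event is removed only after the indexer reports success, so a crash mid-drain leads to a retry rather than a lost update. That is why the indexing operation itself should be idempotent.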
Q: How would you scale Elasticsearch to handle 1 billion product documents?
A: Choose the shard count from index size and throughput, use index lifecycle tiers, and separate ingest from query workloads. Tune mappings, refresh intervals, and replica count to meet the SLA.
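A back-of-the-envelope shard calculation might look like the sketch below. Every number in it is an assumption for illustration: an average of 2,000 bytes per indexed document and a target of 40 GB per shard, which sits inside the commonly cited guideline of keeping shards in the tens of gigabytes. Real sizing should be based on measured index size and query latency.

```java
// Illustrative only: rough shard-count arithmetic under assumed sizes.
final class ShardPlanSketch {
    // Primary shards needed so that no shard exceeds the target size (rounded up).
    static int primaryShards(long docCount, long avgDocBytes, long targetShardBytes) {
        long totalBytes = docCount * avgDocBytes;
        return (int) Math.max(1, (totalBytes + targetShardBytes - 1) / targetShardBytes);
    }

    // Each primary has `replicas` copies, so the cluster hosts primaries * (1 + replicas) shards.
    static int totalShardCopies(int primaries, int replicas) {
        return primaries * (1 + replicas);
    }

    public static void main(String[] args) {
        int primaries = primaryShards(1_000_000_000L, 2_000L, 40_000_000_000L);
        System.out.println(primaries);                      // prints 50
        System.out.println(totalShardCopies(primaries, 1)); // prints 100
    }
}
```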
Q: What are the trade-offs of Elasticsearch's near real-time (~1s) indexing?
A: NRT improves throughput by batching refreshes, but a write is not visible to search until the next refresh (1s by default). A shorter refresh interval reduces the delay but increases indexing overhead.
Q: How do you tune relevance in search results?
A: Tune analyzers, field boosts, synonym handling, and business signals (freshness, popularity). Evaluate with offline relevance judgments and online A/B metrics such as CTR and conversion.