Search Systems


Why Not Just Use SQL?

-- LIKE with a leading wildcard cannot use a B-tree index: O(n) full table scan, no ranking
SELECT * FROM products WHERE description LIKE '%bluetooth speaker%';
-- 1M products → scans all 1M rows, no relevance score

Elasticsearch advantages:

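  • Inverted index → term lookup cost does not grow with document count (often summarized as O(1))
  • Relevance scoring (BM25 by default since 5.0; classic TF-IDF before that)
  • Fuzzy matching, stemming, synonyms
  • Near real-time search (new documents become searchable after the next refresh, 1s by default)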
  • Horizontal scaling built-in

Inverted Index

Documents:
Doc1: "bluetooth speaker portable"
Doc2: "wireless speaker home"
Doc3: "bluetooth headphones portable"

Inverted Index:
"bluetooth"  → [Doc1, Doc3]
"speaker"    → [Doc1, Doc2]
"portable"   → [Doc1, Doc3]
"wireless"   → [Doc2]
"headphones" → [Doc3]

Query "bluetooth speaker":
"bluetooth" → [Doc1, Doc3]
"speaker"   → [Doc1, Doc2]
Intersection → Doc1 (matches both terms, so it scores highest; under the default OR semantics of a match query, Doc2 and Doc3 still match with lower scores)
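
The lookup above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not how Lucene stores postings: the class name `InvertedIndexSketch`, its methods, and the whitespace-plus-lowercase "analyzer" are assumptions made for the example, and it only implements AND semantics (pure intersection, no scoring).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: a tiny in-memory inverted index with AND queries.
class InvertedIndexSketch {
    // term -> sorted set of IDs of the documents containing it (the posting list)
    private final Map<String, SortedSet<String>> postings = new HashMap<>();

    // "Analyze" the text (lowercase + split on whitespace) and record each term.
    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the documents that contain every query term (posting-list intersection).
    SortedSet<String> searchAll(String query) {
        SortedSet<String> result = null;
        for (String term : query.toLowerCase().split("\\s+")) {
            SortedSet<String> docs = postings.get(term);
            if (docs == null) {
                return new TreeSet<String>(); // unknown term: nothing can match
            }
            if (result == null) {
                result = new TreeSet<String>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Doc1", "bluetooth speaker portable");
        index.add("Doc2", "wireless speaker home");
        index.add("Doc3", "bluetooth headphones portable");
        System.out.println(index.searchAll("bluetooth speaker")); // prints [Doc1]
    }
}
```

The query never touches document text: it reads two posting lists and intersects them, which is why the cost depends on posting-list sizes rather than on the total number of documents.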

Elasticsearch Concepts

Concept     SQL Equivalent
Index       Table
Document    Row
Field       Column
Mapping     Schema
Shard       Partition
Replica     Read replica

Index Settings

PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "my_analyzer" },
      "name_keyword": { "type": "keyword" },
      "price": { "type": "float" },
      "category": { "type": "keyword" },
      "description": { "type": "text" },
      "tags": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" }
    }
  }
}

Common Query Patterns

Full-Text Search with Boost

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "bluetooth speaker",
      "fields": ["name^3", "description", "tags^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}

Boolean Query

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "speaker" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 50, "lte": 200 } } },
        { "term": { "category": "electronics" } },
        { "term": { "in_stock": true } }
      ],
      "should": [
        { "match": { "tags": "featured" } }
      ]
    }
  },
  "sort": [
    { "_score": "desc" },
    { "created_at": "desc" }
  ]
}

Faceted Search (Aggregations)

GET /products/_search
{
  "query": { "match": { "name": "speaker" } },
  "aggs": {
    "by_category": {
      "terms": { "field": "category", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    },
    "avg_price": { "avg": { "field": "price" } }
  }
}

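Spring Data Elasticsearch

// Document mapping (ProductDocument.java)
import java.math.BigDecimal;
import java.time.Instant;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.*;

@Document(indexName = "products")
@Setting(settingPath = "es-settings.json")
public class ProductDocument {
    @Id
    private String id;

    // Analyzed text for full-text search, plus a "name.keyword" sub-field for exact match and sorting
    @MultiField(
        mainField = @Field(type = FieldType.Text, analyzer = "english"),
        otherFields = @InnerField(suffix = "keyword", type = FieldType.Keyword)
    )
    private String name;

    @Field(type = FieldType.Float)
    private BigDecimal price;

    @Field(type = FieldType.Keyword)
    private String category;

    @Field(type = FieldType.Date)
    private Instant createdAt;

    // Getters and setters omitted
}

// Repository (ProductSearchRepository.java)
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.annotations.Query;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

public interface ProductSearchRepository
        extends ElasticsearchRepository<ProductDocument, String> {

    // ?0 is replaced with the first method argument
    @Query("{\"multi_match\": {\"query\": \"?0\", \"fields\": [\"name^3\", \"description\"]}}")
    Page<ProductDocument> search(String query, Pageable pageable);
}

// Service with custom query (ProductSearchService.java)
import co.elastic.clients.json.JsonData;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.client.elc.NativeQuery;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.Query;
import org.springframework.stereotype.Service;

@Service
public class ProductSearchService {
    @Autowired private ElasticsearchOperations esOps;

    // ElasticsearchOperations.search returns SearchHits<T>, which carries hits, scores, and total count
    public SearchHits<ProductDocument> search(ProductSearchRequest req) {
        Query query = NativeQuery.builder()
            .withQuery(q -> q.bool(b -> b
                // Scored full-text clause
                .must(m -> m.multiMatch(mm -> mm
                    .query(req.getKeyword())
                    .fields("name^3", "description", "tags^2")
                    .fuzziness("AUTO")
                ))
                // Non-scoring, cacheable filter clause.
                // The range builder shape differs across Java client versions; adjust to the version in use.
                .filter(f -> f.range(r -> r
                    .field("price")
                    .gte(JsonData.of(req.getMinPrice()))
                    .lte(JsonData.of(req.getMaxPrice()))
                ))
            ))
            .withPageable(PageRequest.of(req.getPage(), req.getSize()))
            .build();

        return esOps.search(query, ProductDocument.class);
    }
}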

Autocomplete / Search-as-You-Type

Completion Suggester

PUT /products
{
  "mappings": {
    "properties": {
      "name_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "search_analyzer": "simple"
      }
    }
  }
}

// Index document with suggestions
POST /products/_doc
{
  "name": "Bluetooth Speaker",
  "name_suggest": {
    "input": ["Bluetooth Speaker", "bluetooth speaker", "speaker"],
    "weight": 100
  }
}

// Query
GET /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "blue",
      "completion": {
        "field": "name_suggest",
        "size": 5,
        "fuzzy": { "fuzziness": 1 }
      }
    }
  }
}

Edge N-Gram (Type-ahead)

Better for partial word matching:

"analysis": {
  "tokenizer": {
    "edge_ngram_tokenizer": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
    }
  }
}
// Index time: "blue" → "b", "bl", "blu", "blue"
// Searching "blu" matches "blue", "bluetooth", etc.
// Use a non-n-gram search_analyzer (e.g. "standard") so the query itself is not split into prefixes.
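
To make the tokenizer's output concrete, here is a small Java sketch that produces the same prefixes for a single token. It is an illustration only: `EdgeNGramSketch` and `edgeNGrams` are hypothetical names, and the real tokenizer also handles token splitting (`token_chars`), which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of what an edge n-gram tokenizer emits for one token.
class EdgeNGramSketch {
    // Returns every prefix of the token whose length is between minGram and maxGram.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("blue", 1, 20)); // prints [b, bl, blu, blue]
        // "blu" is one of the indexed prefixes of "bluetooth", so a query for "blu" finds it
        System.out.println(edgeNGrams("bluetooth", 1, 20).contains("blu")); // prints true
    }
}
```

The trade-off is index size: every token is stored once per prefix length, so a lower `max_gram` or a higher `min_gram` reduces storage at the cost of matching fewer partial inputs.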

Search Architecture

User Input → Search API
    ↓
Query Parser (handle special chars, operators, quotes)
    ↓
Query Builder (build ES query with filters, boosts)
    ↓
Elasticsearch Cluster
(5 primary shards × 1 replica each = 10 shard copies for 100M docs, spread across the cluster's nodes)
    ↓
Result Ranker (personalization, A/B test scoring)
    ↓
Response Formatter
    ↓
Client

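Data Sync: DB → Elasticsearch

Option 1: Dual-write (application writes to the DB, then to ES)
  Risk: ES cannot join the DB transaction, so a partial failure → inconsistency

Option 2: CDC (Change Data Capture) via Debezium
  DB change log (binlog/WAL) → Debezium → Kafka → ES Consumer
  Reliable, near real-time (~1s lag)

Option 3: Batch sync (nightly full reindex)
  Simple, but data is stale during the day

// Debezium + Kafka → ES indexer
// DebeziumMessage is an application-defined type for the Debezium change-event envelope
@KafkaListener(topics = "dbserver1.public.products")
public void onProductChange(DebeziumMessage msg) {
    // Debezium op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if ("d".equals(msg.getOp())) {
        // Deletes carry the old row in "before"; "after" is null
        productSearchRepository.deleteById(msg.getBefore().getId());
    } else {
        // Creates, updates, and snapshot reads: upsert the latest row state
        ProductDocument doc = mapper.toDocument(msg.getAfter());
        productSearchRepository.save(doc);
    }
}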

Relevance Tuning

Technique            Effect
Field boosting       Name matches > description matches
Freshness boost      Recent products scored higher
Popularity boost     High-view products ranked up
Personalization      User's history influences ranking
Synonym expansion    "TV" matches "television"
Stemming             "running" matches "run", "runs"
Fuzzy matching       "speker" matches "speaker"
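
As an illustration of the fuzzy-matching row, the sketch below computes plain Levenshtein edit distance and applies the length thresholds that Elasticsearch documents for `"fuzziness": "AUTO"` (0 edits for 1-2 characters, 1 edit for 3-5, 2 edits for longer terms). The class and method names are hypothetical, and unlike Elasticsearch's default this version does not count a transposition as a single edit.

```java
// Hypothetical sketch: edit-distance check behind fuzzy matching.
class FuzzyMatchSketch {
    // Classic two-row dynamic-programming Levenshtein distance
    // (insertions, deletions, substitutions; no transpositions).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Allowed edits for a term of the given length under "AUTO" fuzziness.
    static int autoFuzziness(int termLength) {
        if (termLength <= 2) return 0;
        if (termLength <= 5) return 1;
        return 2;
    }

    // True if the indexed term is within the allowed edit distance of the query term.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm) {
        return levenshtein(queryTerm, indexedTerm) <= autoFuzziness(queryTerm.length());
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("speker", "speaker"));  // prints 1
        System.out.println(fuzzyMatches("speker", "speaker")); // prints true
    }
}
```

A real engine does not compare the query against every indexed term this way; the sketch only shows the matching criterion, not the lookup strategy.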

Interview Questions

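Q: Why use a search engine instead of SQL LIKE queries?

A: Search engines use analyzers, inverted indexes, and ranking algorithms optimized for linguistic matching. SQL LIKE with a leading wildcard cannot use a regular index, so it scans every row, and it has no relevance scoring.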

Q: What is an inverted index and why does it make search fast?

A: An inverted index maps terms to posting lists of matching document IDs. Queries become fast set operations on postings instead of scanning full documents.

Q: How do you keep Elasticsearch in sync with your primary database?

A: Publish DB changes through CDC or outbox events and update Elasticsearch asynchronously with retries. Keep reindex/replay tooling for drift recovery.

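Q: What is BM25 and how does it improve over TF-IDF?

A: BM25 is a probabilistic ranking function in which term frequency saturates (controlled by k1) and document length is normalized (controlled by b). Repeated terms and long documents therefore cannot dominate the score the way they can under plain TF-IDF, which usually yields more relevant rankings.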
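
The saturation and length-normalization effects can be seen in a short Java sketch of the textbook BM25 term score. The names are hypothetical, k1 = 1.2 and b = 0.75 are the commonly used defaults, and the IDF form is the Lucene-style `ln(1 + (N - n + 0.5) / (n + 0.5))`; actual Lucene scores may differ by a constant factor, which does not change the ranking.

```java
// Hypothetical sketch of the textbook BM25 score for a single query term.
class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation
    static final double B = 0.75;  // strength of document-length normalization

    // Rare terms (small docFreq relative to docCount) get a higher weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tf: occurrences of the term in the document
    // docLen / avgDocLen: document length relative to the average in the index
    static double termScore(double tf, double docLen, double avgDocLen,
                            long docCount, long docFreq) {
        double lengthNorm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * (tf * (K1 + 1.0)) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        long n = 1_000_000, df = 1_000;
        double once = termScore(1, 100, 100, n, df);
        double tenTimes = termScore(10, 100, 100, n, df);
        double longDoc = termScore(1, 400, 100, n, df);
        // Ten occurrences score higher than one, but far less than ten times higher
        System.out.printf("tf=1: %.3f, tf=10: %.3f%n", once, tenTimes);
        // The same single occurrence counts for less in a document four times the average length
        System.out.printf("average length: %.3f, long document: %.3f%n", once, longDoc);
    }
}
```

However often a term repeats, its score stays below `(K1 + 1) * idf`, which is the saturation property that plain TF-IDF lacks.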

Q: How would you design an autocomplete/search-as-you-type feature?

A: Index edge n-grams or use completion suggester with popularity signals and typo tolerance. Cache top prefixes and debounce client queries.

Q: What is faceted search and how do you implement it?

A: Faceted search shows aggregated counts by attributes (brand, price range, category) alongside results. Implement with filtered aggregations over indexed keyword/numeric fields.
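
Conceptually, a facet is a count per bucket over the documents that match the query. The Java sketch below mimics a terms facet and a range facet in memory. It is an assumption-laden illustration with hypothetical names, not how Elasticsearch computes aggregations. Range buckets follow the aggregation's convention that `from` is inclusive and `to` is exclusive.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: computing facet counts over the matching documents.
class FacetSketch {
    // Terms facet: number of matching documents per distinct value (e.g. category).
    static Map<String, Long> termsFacet(List<String> values) {
        Map<String, Long> counts = new TreeMap<>();
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    // Range facet: bounds {50, 100, 200} yield buckets
    // (-inf, 50), [50, 100), [100, 200), [200, +inf).
    static long[] rangeFacet(List<Double> prices, double[] bounds) {
        long[] buckets = new long[bounds.length + 1];
        for (double price : prices) {
            int i = 0;
            while (i < bounds.length && price >= bounds[i]) {
                i++;
            }
            buckets[i]++;
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(termsFacet(List.of("electronics", "audio", "electronics")));
        // prints {audio=1, electronics=2}
        long[] byPrice = rangeFacet(List.of(49.99, 50.0, 120.0, 200.0), new double[] {50, 100, 200});
        System.out.println(java.util.Arrays.toString(byPrice)); // prints [1, 1, 1, 1]
    }
}
```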

Q: How do you handle Elasticsearch going down while still accepting writes to your primary DB?

A: Keep DB as source of truth, queue indexing events durably, and replay when cluster recovers. Degrade gracefully by serving fallback search modes if needed.

Q: How would you scale Elasticsearch to handle 1 billion product documents?

A: Plan shard count by index size/throughput, use index lifecycle tiers, and separate ingest from query workloads. Optimize mappings, refresh intervals, and replica strategy for SLA.

Q: What are the trade-offs of Elasticsearch's near real-time (~1s) indexing?

A: NRT improves throughput by batching refreshes but introduces short visibility delay after writes. Lower refresh intervals reduce delay but increase indexing overhead.

Q: How do you tune relevance in search results?

A: Tune analyzers, field boosts, synonym handling, and business signals (freshness/popularity). Evaluate with offline judgments and online A/B metrics like CTR and conversion.