NoSQL & MongoDB: Complete Guide

What is MongoDB?

MongoDB is a document-oriented NoSQL database and one of the most widely used. Instead of storing data in rows and columns like MySQL or PostgreSQL, it stores data as flexible JSON-like documents grouped into collections.

| Concept | Description |
| --- | --- |
| Documents, not rows | Each record is a rich JSON/BSON document. Fields can be nested and arrays can hold objects; no rigid schema is required. |
| Collections, not tables | Documents live in collections. Unlike SQL tables, a collection doesn't enforce that every document has the same fields. |
| Horizontal scale | Designed to shard across many servers rather than scale up to one bigger one. |
| Flexible schema | Add a new field to one document without a migration script. |

A document looks like this

Forget rows with foreign-key joins. One user document holds everything about that user:

// SQL: 3 tables (users, addresses, order_refs) + JOINs
// MongoDB: one self-contained document
{
"_id": "ObjectId(\"64abc123...\")",
"name": "Alice Tran",
"email": "[email protected]",
"createdAt": "ISODate(\"2024-01-15T09:00:00Z\")",
"address": {
"city": "Ho Chi Minh City",
"country": "VN"
},
"tags": ["premium", "verified"],
"preferences": {
"theme": "dark",
"notifications": true
}
}

BSON vs JSON

MongoDB stores documents as BSON (Binary JSON), a binary format that extends the JSON data model with extra types such as ObjectId, Date, Decimal128, and BinData. The driver converts between BSON and your language's native types automatically.


Why MongoDB?

Problems it solves

| Problem | How MongoDB helps |
| --- | --- |
| Schema rigidity | In SQL, adding or changing a column on a very large table (say 500M rows) can require locks and a long-running migration, depending on the engine and the kind of change. In MongoDB you just start writing the new field; old documents simply won't have it. |
| Impedance mismatch | ORMs exist because objects don't map cleanly to tables. MongoDB documents are shaped like objects, so nested structures map almost 1:1 and most of the mapping layer disappears. |
| Write throughput at scale | Horizontal sharding distributes writes across many nodes, so total write capacity can grow beyond what a single primary server can handle. |
| Variable data shapes | Product catalogs where a laptop has 40 fields and a T-shirt has 8 different ones. EAV (entity-attribute-value) in SQL is painful; MongoDB handles this naturally. |

MongoDB vs. relational databases

| Topic | Relational (PostgreSQL) | MongoDB |
| --- | --- | --- |
| Data model | Tables, rows, columns | Collections, documents (JSON/BSON) |
| Schema | Fixed; DDL migrations required | Flexible; optional schema validation |
| Joins | Native SQL JOINs | $lookup (aggregation) or embedding |
| Transactions | Full ACID since the 1980s | Single-document operations are atomic; multi-document ACID transactions since v4.0 (v4.2 on sharded clusters) |
| Horizontal scale | Hard; read replicas scale reads, but writes go to a single primary | Native sharding across many nodes |
| Query language | SQL (declarative, universal) | MQL, the MongoDB Query Language (JSON-based) |
| Best for | Complex queries, strict integrity, BI | Documents, scale-out, evolving schemas |

When NOT to use MongoDB

Financial ledgers, stock inventory, reservation systems: anywhere strict ACID consistency across many related records is the primary concern, a mature relational database is usually the better choice. MongoDB added multi-document transactions, but they carry a real performance cost.

Decision guide

Is data hierarchical / document-shaped?
→ MongoDB ✅

Do you need complex multi-table JOINs + referential integrity?
→ PostgreSQL

Is it high-volume time-series (IoT, monitoring)?
→ MongoDB time-series collections (v5+) or InfluxDB/TimescaleDB

Need full-text search as the primary feature?
→ Elasticsearch (or Atlas Search on top of MongoDB)

Need sub-millisecond caching?
→ Redis (MongoDB is a disk database)

Core Concepts

The hierarchy

| MongoDB | SQL equivalent | Description |
| --- | --- | --- |
| Database | Database | A namespace containing collections. One server can host many databases. |
| Collection | Table | A group of documents. No enforced schema by default. |
| Document | Row | A BSON record: a set of key-value pairs, can be nested. |
| Field | Column | A key-value pair inside a document. Value can be any BSON type. |
| _id | Primary key | Every document must have a unique _id. Auto-generated as ObjectId if omitted. |
| Index | Index | B-tree or special index on one or more fields to speed up queries. |

ObjectId: the default primary key

MongoDB auto-generates a 12-byte ObjectId for a document's _id when you don't supply one. It encodes a 4-byte timestamp, a 5-byte random value generated once per process, and a 3-byte incrementing counter, so it is unique in practice, roughly time-sortable, and needs no central counter:

ObjectId("64abc1230000000000000001")
//        ^^^^^^^^ ^^^^^^^^^^ ^^^^^^
//        4 bytes  5 bytes    3 bytes
//        Unix TS  random     counter
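
ObjectId includes a timestamp

You can extract the creation time of any document for free with document._id.getTimestamp(). It has one-second resolution and reflects when the ObjectId was generated, so a separate createdAt field is only needed when you want more precision or supply your own _id values.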
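
The timestamp sits in the first 4 bytes (8 hex characters) of the ObjectId, so it can be recovered without a driver. A minimal sketch in plain JavaScript; `objectIdToDate` is a hypothetical helper name, and in mongosh or a driver you would normally call `getTimestamp()` instead:

```javascript
// Hypothetical helper: decode the creation time embedded in an ObjectId.
// Assumes `hex` is the 24-character hexadecimal form of the ObjectId.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  // First 4 bytes = seconds since the Unix epoch (big-endian).
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdToDate("64abc1230000000000000001");
console.log(created.toISOString());
```

Because only whole seconds are stored, two documents created within the same second cannot be ordered by this value alone.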

BSON data types

| Type | BSON type string | Example use |
| --- | --- | --- |
| String | string | Names, descriptions |
| Integer (32/64-bit) | int / long | Counts, IDs |
| Double | double | Coordinates, ratings |
| Decimal128 | decimal | Money: use this, not double! |
| Boolean | bool | Flags |
| Date | date | Timestamps, always UTC |
| ObjectId | objectId | Document IDs, references |
| Array | array | Tags, order items, addresses |
| Object (embedded doc) | object | Nested documents |
| Null | null | Optional absent field |
| Binary data | binData | File bytes, UUIDs |

MongoDB can enforce a JSON Schema on a collection. This gives you flexible schema where needed, strict rules where important:

db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["name", "email"],
properties: {
name: { bsonType: "string", minLength: 1 },
email: { bsonType: "string", pattern: "^.+@.+$" },
age: { bsonType: "int", minimum: 0, maximum: 150 }
}
}
},
validationAction: "error" // or "warn" to log without rejecting
})

Data Modelling

The most important MongoDB skill. The central question: should related data be embedded in the same document, or referenced across documents?

Embedding (denormalization)

Store related data inside the parent document. One read fetches everything, with no JOINs:

{
"_id": "ObjectId(\"...\")",
"userId": "alice",
"items": [
{ "sku": "LAPTOP-X1", "qty": 1, "price": 1299.00 },
{ "sku": "HDMI-CBL", "qty": 2, "price": 14.99 }
],
"shippingAddress": { "city": "HCMC", "zip": "70000" }
}

Referencing (normalization)

Store a reference (usually the ObjectId of the related document) and look up that document separately, much as you would with a foreign key in SQL. MongoDB does not enforce the reference:

// Order document
{ "_id": ObjectId("order1"), "userId": ObjectId("user1"), "total": 129.99 }

// User document (separate collection)
{ "_id": ObjectId("user1"), "name": "Alice", "email": "alice@..." }

// Query with $lookup (like a LEFT JOIN)
db.orders.aggregate([
{ $lookup: {
from: "users",
localField: "userId",
foreignField: "_id",
as: "user"
}}
])

The embed vs. reference decision

| Use embed when… | Use reference when… |
| --- | --- |
| Data is always read together | Data is accessed independently |
| Relationship is 1-to-few (≤ ~100 sub-items) | Relationship is 1-to-many (thousands+) |
| Sub-item has no lifecycle without the parent | Sub-item exists independently (e.g. a Product) |
| Sub-items rarely change individually | Sub-items are updated frequently |
| Document size stays under 16 MB | Embedding would exceed the 16 MB document limit |

The 16 MB document limit

MongoDB documents cannot exceed 16 MB. An order with 2 line items is fine to embed. A social post with unbounded comments is not: never embed the comments. Use a separate comments collection referenced by postId.

Real-world examples

Blog post + metadata: tags and SEO are always read with the post and have no independent lifecycle:

{ "title": "...", "author": "...", "tags": [], "seo": { "slug": "...", "metaDesc": "..." } }

Order + line items: items are bounded (≤ 100) and always shown with the order:

{ "userId": "...", "items": [{ "sku": "X1", "qty": 1, "price": 1299 }], "shippingAddress": {} }

🔬 Senior deep-dive: advanced schema patterns

Bucket pattern

For time-series data, instead of one document per event (millions of tiny documents), bucket events into time windows. This dramatically reduces index size and improves scan performance:

// Anti-pattern: one doc per sensor reading (millions of docs)
{ sensorId: "s1", ts: ISODate("..."), value: 22.5 }

// Bucket pattern: one doc per sensor per hour (thousands of docs)
{
sensorId: "s1",
hour: ISODate("2024-01-15T09:00:00Z"),
count: 60,
sum: 1350.0,
readings: [
{ ts: ISODate("2024-01-15T09:00:01Z"), v: 22.5 },
{ ts: ISODate("2024-01-15T09:01:02Z"), v: 22.6 }
// ... 58 more
]
}
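
To make the bucket shape concrete, here is a minimal sketch in plain JavaScript that groups raw readings into one document per sensor per hour. The function name `bucketByHour` and the sample readings are assumptions for illustration; the output fields follow the bucket document above.

```javascript
// Group raw readings into one bucket per sensor per hour.
// Assumes each reading looks like { sensorId, ts: Date, value: number }.
function bucketByHour(readings) {
  const buckets = new Map();
  for (const { sensorId, ts, value } of readings) {
    // Truncate the timestamp to the start of its UTC hour.
    const hour = new Date(ts);
    hour.setUTCMinutes(0, 0, 0);
    const key = `${sensorId}|${hour.toISOString()}`;
    if (!buckets.has(key)) {
      buckets.set(key, { sensorId, hour, count: 0, sum: 0, readings: [] });
    }
    const bucket = buckets.get(key);
    bucket.count += 1;
    bucket.sum += value;
    bucket.readings.push({ ts, v: value });
  }
  return [...buckets.values()];
}

const sampleReadings = [
  { sensorId: "s1", ts: new Date("2024-01-15T09:00:01Z"), value: 22.5 },
  { sensorId: "s1", ts: new Date("2024-01-15T09:01:02Z"), value: 22.6 },
  { sensorId: "s1", ts: new Date("2024-01-15T10:00:00Z"), value: 23.0 },
];
const hourly = bucketByHour(sampleReadings);
console.log(hourly.length); // 2 buckets: 09:00 and 10:00
```

Storing count and sum alongside the readings means an hourly average can be read from the bucket without scanning the array.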

Extended reference pattern

Reference the foreign document's _id, but also duplicate the most frequently read fields into the referencing document. This trades some write complexity for zero-join reads on the hot path:

// Order line item: productId is the reference,
// but name + imageUrl are duplicated to avoid $lookup on every order display
{
productId: ObjectId("..."),
name: "USB-C Charger 65W", // duplicated field
imageUrl: "/images/charger.jpg", // duplicated field
qty: 2,
unitPrice: 24.99
}

Trade-off

If name changes in the products collection, order history shows the old name, which is often correct for an order (you want the name at time of purchase). Choose deliberately.

Computed pattern

Pre-compute expensive aggregations and store the result directly on the document. Read the cached value instead of recomputing on every request:

{ "productId": "P1", "totalReviews": 1847, "avgRating": 4.3, "lastComputed": "..." }

Recomputed via a background job or $merge pipeline periodically.
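
As an alternative to periodic recomputation, the stored values can also be maintained incrementally on each write. The sketch below is plain JavaScript with a hypothetical `applyRating` helper; it uses the running-mean formula so no review history has to be re-read.

```javascript
// Fold one new rating into a stored summary without rescanning all reviews.
// `summary` mirrors the document above: { totalReviews, avgRating }.
function applyRating(summary, rating) {
  const totalReviews = summary.totalReviews + 1;
  // Running mean: newAvg = oldAvg + (x - oldAvg) / n
  const avgRating = summary.avgRating + (rating - summary.avgRating) / totalReviews;
  return { ...summary, totalReviews, avgRating, lastComputed: new Date() };
}

let ratingSummary = { productId: "P1", totalReviews: 0, avgRating: 0 };
for (const rating of [5, 4, 3]) {
  ratingSummary = applyRating(ratingSummary, rating);
}
console.log(ratingSummary.totalReviews, ratingSummary.avgRating); // 3 4
```

The incremental form keeps reads cheap but cannot correct itself if a review is edited or deleted, which is one reason a periodic full recompute is still useful.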

Subset pattern

Store only the N most recent (or most relevant) array items in the parent document and keep the full list in a separate collection. This keeps the working set small and hot in memory:

// Product: only the 5 most recent reviews embedded
{ "_id": "P1", "name": "...", "recentReviews": [/* last 5 */] }

// Full reviews in separate collection
{ "productId": "P1", "text": "...", "rating": 5, "createdAt": "..." }

Outlier pattern

Most documents are "normal" (a book with 5 reviews). A few are outliers (a bestseller with 50,000). Use an overflow flag + overflow collection instead of bloating the main document:

// Normal book: all reviews embedded
{ "_id": "B1", "reviews": [/* ≤ 200 */], "hasOverflow": false }

// Bestseller: truncated + overflow flag
{ "_id": "B2", "reviews": [/* first 200 */], "hasOverflow": true }

// Overflow docs
{ "bookId": "B2", "reviews": [/* next 200 */], "page": 2 }
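
The split between the main document and its overflow pages can be sketched in plain JavaScript. `splitReviews` and `PAGE_SIZE` are hypothetical names, and the page size of 200 simply follows the example above:

```javascript
// Split a review list into a main document plus overflow pages.
// PAGE_SIZE = 200 follows the example above; tune it for real data.
const PAGE_SIZE = 200;

function splitReviews(bookId, reviews) {
  const main = {
    _id: bookId,
    reviews: reviews.slice(0, PAGE_SIZE),
    hasOverflow: reviews.length > PAGE_SIZE,
  };
  const overflow = [];
  // Pages are numbered from 2 because the main document acts as page 1.
  for (let start = PAGE_SIZE, page = 2; start < reviews.length; start += PAGE_SIZE, page += 1) {
    overflow.push({ bookId, page, reviews: reviews.slice(start, start + PAGE_SIZE) });
  }
  return { main, overflow };
}

const bestseller = splitReviews("B2", Array.from({ length: 450 }, (_, i) => ({ n: i })));
console.log(bestseller.main.hasOverflow, bestseller.overflow.length); // true 2
```

Readers of a normal book never touch the overflow collection; only documents with hasOverflow set pay for the extra query.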

CRUD Operations

MongoDB Query Language (MQL) uses JSON objects for both filters and update specifications.

Create

// Insert one
db.users.insertOne({
name: "Alice", email: "[email protected]", age: 28
})
// → { acknowledged: true, insertedId: ObjectId("...") }

// Insert many (bulk)
db.users.insertMany([
{ name: "Bob", age: 32 },
{ name: "Carol", age: 25 }
], { ordered: false }) // ordered:false = continue on partial failure

Read

// Find all documents
db.users.find({})

// Filter: age > 25 AND name starts with "A"
db.users.find({
age: { $gt: 25 },
name: { $regex: /^A/i }
})

// Projection: return only name + email, exclude _id
db.users.find({ age: { $gte: 18 } }, { name: 1, email: 1, _id: 0 })

// Sort, skip, limit: pagination
db.users.find({})
.sort({ createdAt: -1 }) // -1 = descending
.skip(20)
.limit(10)

// Query on embedded field (dot notation)
db.users.find({ "address.city": "HCMC" })

// Query on array: documents where tags contains "premium"
db.users.find({ tags: "premium" })

Query operators cheat-sheet

| Operator | Meaning | Example |
| --- | --- | --- |
| $eq | Equals | { age: { $eq: 25 } } |
| $ne | Not equals | { status: { $ne: "banned" } } |
| $gt / $gte | Greater than / or equal | { price: { $gte: 100 } } |
| $lt / $lte | Less than / or equal | { stock: { $lt: 5 } } |
| $in | Value in array | { status: { $in: ["active", "trial"] } } |
| $nin | Value not in array | { role: { $nin: ["admin"] } } |
| $and | Logical AND | { $and: [{ age: { $gt: 18 } }, { verified: true }] } |
| $or | Logical OR | { $or: [{ city: "HCM" }, { city: "HN" }] } |
| $exists | Field exists | { phone: { $exists: true } } |
| $regex | Regular expression | { email: { $regex: /@gmail/ } } |
| $elemMatch | Array element matches all conditions | { items: { $elemMatch: { qty: { $gt: 1 }, sku: "X" } } } |
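
The difference between $elemMatch and plain dot-notation conditions is easy to miss, so here is a small illustration in plain JavaScript. The two functions are stand-ins that mimic the matching semantics on an in-memory document; they are not driver code, and the sample order is made up.

```javascript
// Plain-JavaScript stand-ins for two MongoDB array queries (not driver code).
const order = {
  items: [
    { sku: "X", qty: 1 },
    { sku: "Y", qty: 5 },
  ],
};

// Like { "items.qty": { $gt: 1 }, "items.sku": "X" }:
// each condition may be satisfied by a different array element.
function matchesDotNotation(doc) {
  return doc.items.some((i) => i.qty > 1) && doc.items.some((i) => i.sku === "X");
}

// Like { items: { $elemMatch: { qty: { $gt: 1 }, sku: "X" } } }:
// a single element must satisfy every condition.
function matchesElemMatch(doc) {
  return doc.items.some((i) => i.qty > 1 && i.sku === "X");
}

console.log(matchesDotNotation(order)); // true
console.log(matchesElemMatch(order));   // false
```

Reach for $elemMatch whenever several conditions must hold for the same array element.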

Update

// Update one document
db.users.updateOne(
{ email: "[email protected]" }, // filter
{ $set: { name: "Alice Tran" }, // update operator
$currentDate: { updatedAt: true } }
)

// Increment a counter atomically
db.products.updateOne(
{ _id: ObjectId("...") },
{ $inc: { stock: -1 } } // atomic: safe under concurrency
)

// Push to an array
db.users.updateOne(
{ _id: ObjectId("...") },
{ $push: { tags: "verified" } }
)

// Upsert: insert if not found
db.users.updateOne(
{ email: "[email protected]" },
{ $set: { name: "New User" } },
{ upsert: true }
)

Update operators cheat-sheet

| Operator | Effect |
| --- | --- |
| $set | Set field value (creates if missing) |
| $unset | Remove a field from the document |
| $inc | Increment/decrement a number atomically |
| $mul | Multiply a field's value |
| $rename | Rename a field |
| $push | Append to an array |
| $addToSet | Append to array only if not already present |
| $pull | Remove matching elements from array |
| $pop | Remove first or last element of array |
| $currentDate | Set field to current date/timestamp |

Delete

// Delete one matching document
db.users.deleteOne({ email: "[email protected]" })

// Delete all matching documents
db.sessions.deleteMany({ expiresAt: { $lt: new Date() } })

// findOneAndDelete: atomically retrieve + delete (useful for job queues)
const job = await db.jobs.findOneAndDelete(
{ status: "pending" },
{ sort: { createdAt: 1 } }
)

Aggregation Pipeline

The aggregation pipeline transforms a collection through a series of stages, like Unix pipes for data. Each stage takes input documents and emits output to the next.

Collection → [$match] → [$group] → [$sort] → [$limit] → Result

Example: revenue by category

db.orders.aggregate([
// Stage 1: only completed orders created on or after 2024-01-01
{ $match: {
status: "completed",
createdAt: { $gte: ISODate("2024-01-01") }
}},

// Stage 2: unwind items array (one doc per line item)
{ $unwind: "$items" },

// Stage 3: join with products collection to get category
{ $lookup: {
from: "products",
localField: "items.productId",
foreignField: "_id",
as: "product"
}},

// Stage 4: $lookup returns an array, so flatten it to a single product
{ $unwind: "$product" },

// Stage 5: group by category, sum revenue
{ $group: {
_id: "$product.category",
totalRevenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } },
orderCount: { $sum: 1 } // counts line items, since each doc is now one item
}},

// Stage 6: sort highest revenue first
{ $sort: { totalRevenue: -1 }},

// Stage 7: rename _id to category
{ $project: { category: "$_id", totalRevenue: 1, orderCount: 1, _id: 0 }}
])
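
To see what the stages compute, the same revenue-by-category logic can be written against in-memory arrays. This is only an illustration in plain JavaScript: the sample orders and products are made up, and `revenueByCategory` is a hypothetical name.

```javascript
// In-memory stand-in for the pipeline: $match -> $unwind -> $lookup -> $group -> $sort.
// The sample data is made up for illustration.
const products = [
  { _id: "p1", category: "laptops" },
  { _id: "p2", category: "cables" },
];
const orders = [
  { status: "completed", createdAt: new Date("2024-03-01"), items: [
    { productId: "p1", qty: 1, price: 1000 },
    { productId: "p2", qty: 2, price: 10 },
  ]},
  { status: "cancelled", createdAt: new Date("2024-03-02"), items: [
    { productId: "p1", qty: 1, price: 1000 },
  ]},
];

function revenueByCategory(orderDocs, productDocs) {
  const categoryOf = new Map(productDocs.map((p) => [p._id, p.category]));
  const totals = new Map();
  orderDocs
    .filter((o) => o.status === "completed" && o.createdAt >= new Date("2024-01-01")) // $match
    .flatMap((o) => o.items)                                                          // $unwind
    .forEach((item) => {
      const category = categoryOf.get(item.productId);                                // $lookup
      totals.set(category, (totals.get(category) ?? 0) + item.qty * item.price);      // $group
    });
  return [...totals]
    .map(([category, totalRevenue]) => ({ category, totalRevenue }))
    .sort((a, b) => b.totalRevenue - a.totalRevenue);                                 // $sort
}

console.log(revenueByCategory(orders, products));
```

The cancelled order is dropped by the filter step, so only the completed order's two line items contribute to the totals.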

Key stages reference

| Stage | SQL equivalent | Use |
| --- | --- | --- |
| $match | WHERE | Filter early; should come before expensive stages |
| $project | SELECT cols | Include/exclude/rename/compute fields |
| $group | GROUP BY | Aggregate with $sum, $avg, $min, $max, $push, $addToSet |
| $sort | ORDER BY | Sort (use an index when possible; put right after $match) |
| $limit | LIMIT | Take first N documents |
| $skip | OFFSET | Skip N documents (avoid large skips) |
| $unwind | Explode (similar to UNNEST) | Deconstruct array: one doc per element |
| $lookup | LEFT JOIN | Join with another collection |
| $addFields | Computed column | Add computed fields without removing others |
| $facet | Multiple GROUP BY | Run multiple sub-pipelines on the same input within one stage |
| $bucket | Histogram | Group by numeric range (price brackets, age ranges) |
| $out / $merge | INSERT INTO / UPSERT | Write pipeline results to a collection |
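
The advice to avoid large skips is commonly followed by paginating on a range condition instead (often called keyset pagination): remember the last document of the previous page and ask for documents that sort after it. A minimal sketch, assuming pages are ordered by createdAt descending with _id as a tie-breaker; `nextPageQuery` is a hypothetical helper that only builds the query arguments.

```javascript
// Hypothetical helper: build find() arguments for keyset pagination on createdAt (descending).
// `lastSeen` is the last document of the previous page, or null for the first page.
function nextPageQuery(lastSeen, pageSize = 10) {
  const filter = lastSeen
    ? {
        $or: [
          { createdAt: { $lt: lastSeen.createdAt } },
          // Tie-breaker on _id keeps the order stable when timestamps collide.
          { createdAt: lastSeen.createdAt, _id: { $lt: lastSeen._id } },
        ],
      }
    : {};
  return { filter, sort: { createdAt: -1, _id: -1 }, limit: pageSize };
}

const firstPage = nextPageQuery(null);
const secondPage = nextPageQuery({ _id: "u42", createdAt: new Date("2024-01-15T09:00:00Z") });
// Usage in mongosh: db.users.find(q.filter).sort(q.sort).limit(q.limit)
console.log(JSON.stringify(secondPage.sort)); // {"createdAt":-1,"_id":-1}
```

Unlike skip, the cost of this query does not grow with the page number, provided an index on { createdAt: -1, _id: -1 } exists.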

Performance rule: $match and $sort early

Place $match as the first stage so MongoDB can use an index and skip irrelevant documents. A $match placed after $group cannot use an index, because it filters computed results, so everything before it must be processed in full. A $sort immediately after $match can leverage a compound index covering the same fields.

🔬 Senior deep-dive: $facet for multi-dimensional analytics

$facet runs multiple sub-pipelines on the same input documents within a single stage, which suits search pages that return results, category counts, and a total in one round trip:

db.products.aggregate([
{ $match: { inStock: true } },
{ $facet: {
// sub-pipeline 1: paginated results
"results": [
{ $sort: { price: 1 } },
{ $skip: 0 }, { $limit: 20 }
],
// sub-pipeline 2: category counts for faceted filters UI
"categoryFacets": [
{ $group: { _id: "$category", count: { $sum: 1 } } }
],
// sub-pipeline 3: total count for pagination
"total": [{ $count: "n" }]
}}
])

Indexes

Indexes are the single biggest lever for MongoDB performance. Without the right index, a query does a full collection scan, reading every document. With one, lookups on the indexed fields become O(log n).

Index types

db.users.createIndex({ age: 1 })
// 1 = ascending, -1 = descending

Single-field index: the basic type. Works for equality, range, and sort queries on that field.

The ESR rule for compound indexes

When designing a compound index, field order matters. Follow Equality → Sort → Range:

| Position | Type | Example field | Why |
| --- | --- | --- | --- |
| 1st | Equality | userId | Eliminates most documents immediately |
| 2nd | Sort | createdAt (used in .sort()) | Pre-sorts; avoids an in-memory sort stage |
| 3rd | Range | status (used in $in) | Applied last, after equality and sort have narrowed the set |

// Query: orders for user X, sorted newest-first, only "shipped" or "delivered"
db.orders.find({ userId: "alice", status: { $in: ["shipped", "delivered"] } })
.sort({ createdAt: -1 })

// Correct ESR compound index:
db.orders.createIndex({ userId: 1, createdAt: -1, status: 1 })
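
The ordering rule can be captured in a few lines. The sketch below is plain JavaScript with a hypothetical `buildEsrIndex` helper; it only assembles the key specification you would pass to createIndex, relying on the fact that JavaScript objects keep string keys in insertion order.

```javascript
// Hypothetical helper: assemble a compound index spec in Equality -> Sort -> Range order.
// `sort` uses the same { field: 1 | -1 } shape as .sort().
function buildEsrIndex({ equality = [], sort = {}, range = [] }) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  for (const [field, direction] of Object.entries(sort)) spec[field] = direction;
  for (const field of range) {
    // Skip range fields that already appear earlier in the index.
    if (!(field in spec)) spec[field] = 1;
  }
  return spec;
}

const esrSpec = buildEsrIndex({
  equality: ["userId"],
  sort: { createdAt: -1 },
  range: ["status"],
});
console.log(JSON.stringify(esrSpec)); // {"userId":1,"createdAt":-1,"status":1}
```

The result matches the index shown above; treat the helper as a reminder of the ordering, and confirm the final choice with explain().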

Diagnosing with explain()

db.orders.find({ userId: "alice" }).explain("executionStats")

// Key fields:
// winningPlan: contains an "IXSCAN" stage ✅ rather than "COLLSCAN" 🔴
// executionStats.nReturned → documents returned
// executionStats.totalDocsExamined → documents scanned
// → good ratio: nReturned ≈ totalDocsExamined
// executionStats.executionTimeMillis → wall time
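
Reading these fields by eye gets tedious, so the check can be scripted. This is a sketch in plain JavaScript: `summarizeExplain` is a hypothetical helper, and it assumes the common output shape in which child stages hang off inputStage or inputStages (some server versions nest the plan differently).

```javascript
// Hypothetical helper: summarize the output of .explain("executionStats").
// Assumes the usual shape: queryPlanner.winningPlan plus executionStats.
function summarizeExplain(explain) {
  const stages = [];
  // Walk the plan tree; child stages hang off inputStage / inputStages.
  const walk = (node) => {
    if (!node) return;
    stages.push(node.stage);
    walk(node.inputStage);
    (node.inputStages ?? []).forEach(walk);
  };
  walk(explain.queryPlanner.winningPlan);

  const { nReturned, totalDocsExamined } = explain.executionStats;
  return {
    usesIndex: stages.includes("IXSCAN"),
    collectionScan: stages.includes("COLLSCAN"),
    // Close to 1 is good; a large ratio means many documents were read and discarded.
    docsExaminedPerReturned: nReturned === 0 ? totalDocsExamined : totalDocsExamined / nReturned,
  };
}

// Hand-written sample in the shape described above (not real server output).
const sampleExplain = {
  queryPlanner: { winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN" } } },
  executionStats: { nReturned: 10, totalDocsExamined: 10 },
};
console.log(summarizeExplain(sampleExplain));
```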

Index overhead on writes

Every index costs write overhead: MongoDB must update each affected index on every insert and delete, and on every update that touches an indexed field. Don't create indexes speculatively. Measure first with explain(), then add the minimum indexes needed.

🔬 Senior deep-dive: partial indexes and covered queries

Partial indexes

Index only documents matching a filter. Dramatically reduces index size when most queries target a subset:

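// Only index orders that are still active.
// "Active order" queries are 95% of traffic; delivered orders rarely queried.
// partialFilterExpression does not accept $ne, so list the active statuses
// explicitly (the values here are hypothetical; $in needs a recent server version).
db.orders.createIndex(
{ userId: 1, createdAt: -1 },
{ partialFilterExpression: { status: { $in: ["pending", "shipped"] } } }
)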

// Sparse unique index: unique email, documents without the field are allowed
// (an explicit null value is still indexed, so only one document may have email: null)
db.users.createIndex({ email: 1 }, { unique: true, sparse: true })

Covered queries

A query is "covered" when all requested fields are in the index, so MongoDB never touches the actual documents. Fastest possible read: O(log n) with zero document I/O.

// Index: { userId: 1, status: 1, total: 1 }
// Query + projection only uses indexed fields → fully covered
db.orders.find(
{ userId: "alice", status: "shipped" },
{ total: 1, _id: 0 } // must exclude _id to be covered!
)
// explain() → winningPlan.stage: "PROJECTION_COVERED" and totalDocsExamined: 0

Use Cases

Where MongoDB excels

| Use case | Why MongoDB fits |
| --- | --- |
| E-commerce product catalog | Products in different categories have completely different attributes. A laptop has RAM and CPU; a T-shirt has size and colour. Flexible schema handles this naturally, with no EAV tables needed. |
| User profiles and preferences | User data is a natural document: profile info, settings, social links. Most reads fetch the whole profile at once, making embedding ideal. |
| Real-time event logging | Clickstream, activity feeds, analytics events: high write throughput with flexible schemas. Time-series collections (v5.0+) improve storage and query efficiency further. |
| Geospatial applications | Native 2dsphere indexes enable "find restaurants within 2 km" with one query. GeoJSON support is first-class. |
| CMS / content management | Articles, recipes, landing pages all have variable structure. No schema migration when editors add a new content type. |
| Gaming: player state & leaderboards | Player inventory, progress, and achievements are natural documents. Flexible schema means adding a new game feature doesn't require migrating 10M player records. |

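Geospatial example

// $near needs a geospatial index on the queried field
db.restaurants.createIndex({ location: "2dsphere" })

db.restaurants.find({
location: {
$near: {
$geometry: { type: "Point", coordinates: [106.66, 10.78] }, // [longitude, latitude]
$maxDistance: 2000 // metres
}
}
})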

Where to think twice

| Scenario | Better choice | Why |
| --- | --- | --- |
| Financial ledger, double-entry accounting | PostgreSQL | Complex transactions, strict integrity, audit trails |
| Complex reporting with ad-hoc JOINs | PostgreSQL / data warehouse | SQL is vastly better for multi-table analytics |
| Very high-volume IoT / monitoring | InfluxDB / TimescaleDB | Dedicated time-series engines often outperform general-purpose databases on compression and query speed |
| Full-text search as primary feature | Elasticsearch | Better relevance scoring, richer text analysis |
| Sub-millisecond caching | Redis | In-memory; MongoDB is a disk database |

Advanced Topics (Senior)

Replica sets

A replica set is a group of MongoDB instances that maintain the same dataset. Typical production setup: 1 primary + 2 secondaries. All writes go to the primary; secondaries replicate asynchronously via the oplog.

// Java driver: read from secondary for analytics queries
MongoCollection<Document> col = db.getCollection("orders")
.withReadPreference(ReadPreference.secondary())
.withReadConcern(ReadConcern.MAJORITY);

Read preference options:

| Option | Behaviour |
| --- | --- |
| primary (default) | All reads to primary: consistent, no replication lag |
| primaryPreferred | Read from primary when available, fallback to secondary |
| secondary | Always read from a secondary; may be slightly stale |
| secondaryPreferred | Read from secondary when available, fallback to primary |
| nearest | Read from the node with lowest network latency |

Sharding

Sharding partitions data across multiple replica sets ("shards"). A mongos router directs queries to the correct shard(s) based on the shard key.

| Shard key strategy | Pros | Cons / gotchas |
| --- | --- | --- |
| Hashed (e.g. _id hashed) | Even distribution, no hotspots | Range queries scatter across all shards |
| Ranged (e.g. createdAt) | Range queries hit one shard | Monotonic keys (time, ObjectId) cause write hotspots |
| Compound (e.g. userId + date) | Locality per user, efficient range queries | Complex to choose; requires high cardinality |

The shard key is hard to change

Treat the shard key as a long-lived architectural decision. Newer releases relax the original immutability (refineCollectionShardKey in 4.4, reshardCollection in 5.0), but resharding a large collection is slow and resource-intensive, and until you do it a poor shard key causes hotspots or scatter-gather on every read. Design carefully before sharding.
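
The hotspot problem with monotonic keys can be shown with a toy routing model in plain JavaScript. Everything here is an assumption for illustration: four shards, fixed range boundaries, and an FNV-1a hash standing in for MongoDB's own hashing. It is not how mongos actually routes chunks.

```javascript
// Toy model of shard key routing; not MongoDB's real chunk or hash logic.
const SHARDS = 4;

// FNV-1a string hash, used only as a stand-in for a hashed shard key.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Ranged: shard i owns keys below boundaries[i]; the last shard owns the tail.
function rangedShard(key, boundaries) {
  const idx = boundaries.findIndex((b) => key < b);
  return idx === -1 ? boundaries.length : idx;
}

function hashedShard(key) {
  return fnv1a(String(key)) % SHARDS;
}

// Monotonic keys (e.g. timestamps) that are all newer than the existing boundaries.
const boundaries = [1000, 2000, 3000];
const newKeys = Array.from({ length: 1000 }, (_, i) => 5000 + i);

const rangedHits = new Set(newKeys.map((k) => rangedShard(k, boundaries)));
const hashedHits = new Set(newKeys.map(hashedShard));
console.log(rangedHits.size); // 1: every insert lands on the last shard
console.log(hashedHits.size);
```

With the ranged key, all new writes pile onto a single shard, while the hashed key spreads them out at the cost of scattering range queries.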

Write concern and read concern

| Setting | Value | Meaning |
| --- | --- | --- |
| writeConcern: w | "majority" | Write acknowledged by a majority of nodes; durable, survives primary failure |
| writeConcern: w | 1 | Acknowledged by primary only; faster, but the write can be lost if the primary fails before it replicates |
| writeConcern: w | 0 | Fire-and-forget; no acknowledgement |
| readConcern | "local" | Read what's on this node (default); may be rolled back |
| readConcern | "majority" | Read data that a majority has acknowledged; will not be rolled back |
| readConcern | "linearizable" | Strongest; reflects all majority-acknowledged writes that completed before this read. Slowest. |
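
What "majority" means in numbers is simple arithmetic. A small sketch in plain JavaScript (the function names are hypothetical, and it ignores special cases such as arbiters):

```javascript
// Members that must acknowledge a w: "majority" write in a set of n voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many members can be unavailable while majority writes still succeed.
function tolerableFailures(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [3, 4, 5]) {
  console.log(n, majority(n), tolerableFailures(n));
}
// 3 2 1
// 4 3 1
// 5 3 2
```

A fourth member adds no extra fault tolerance over three, which is one reason replica sets are usually sized with an odd number of voting members.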

Multi-document ACID transactions

Since v4.0, MongoDB supports multi-document ACID transactions. Use them when you must atomically update multiple documents across collections:

const session = client.startSession();
try {
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" }
});

// Both ops succeed or both roll back
await db.collection("accounts").updateOne(
{ _id: "alice" }, { $inc: { balance: -100 } }, { session }
);
await db.collection("accounts").updateOne(
{ _id: "bob" }, { $inc: { balance: 100 } }, { session }
);

await session.commitTransaction();
} catch (err) {
await session.abortTransaction();
throw err; // surface the failure instead of silently swallowing it
} finally {
await session.endSession();
}

Transactions are expensive in MongoDB

MongoDB transactions add significant latency, conflict with concurrent writes to the same documents, and are aborted if they run longer than 60 seconds by default (transactionLifetimeLimitSeconds). Use them sparingly, only when atomicity across multiple documents is truly required. If you're reaching for transactions everywhere, your data model is likely wrong. Consider embedding the related data instead.

Change streams

Change streams let applications subscribe to real-time data change notifications, much like CDC (Change Data Capture). They are built on top of the oplog and are fully resumable:

const changeStream = db.collection("orders").watch([
{ $match: { "fullDocument.status": "payment_received" } }
], { fullDocument: "updateLookup" });

changeStream.on("change", (event) => {
// Trigger fulfillment workflow
fulfillmentService.process(event.fullDocument);
});

Common use cases: event sourcing, real-time dashboards, cache invalidation, CDC pipelines to a data warehouse.


Advanced Schema Patterns (Senior)

Named patterns reference

| Pattern | Core idea | Best for |
| --- | --- | --- |
| Polymorphic | Store different entity shapes in one collection with a type discriminator | Product catalogs, assets with common query paths |
| Computed | Pre-compute and cache expensive aggregations on the document | Read-heavy aggregates (avg rating, total reviews) |
| Subset | Embed only the N most recent/relevant sub-items; full list in a separate collection | Posts with many comments, products with many reviews |
| Outlier | Use an overflow flag + overflow collection for documents that would exceed normal bounds | Social posts by celebrities, viral content |
| Extended reference | Reference by _id but duplicate frequently-read fields to avoid $lookup on hot reads | Order line items (duplicate product name, image URL) |
| Bucket | Group time-series events into time-window documents | IoT sensor data, clickstream, metrics |

Tree structures

| Pattern | Best for |
| --- | --- |
| Parent reference | Simple parent lookup; listing children |
| Children references | Direct children fast; deep traversal slow |
| Array of ancestors | Fast "all ancestors" query; good for breadcrumbs |
| Materialized path | Regex prefix match on path string; simple subtree queries |
| Nested sets | Fast subtree reads; very expensive writes |
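
As an illustration of the materialized path row, the sketch below finds a subtree with an anchored prefix regex. It is plain JavaScript over an in-memory array; the category data, the comma-delimited path format, and the helper names are assumptions for the example.

```javascript
// Materialized path: each node stores its full ancestry as a delimited string.
// Sample categories are made up for illustration.
const categories = [
  { _id: "electronics", path: ",electronics," },
  { _id: "laptops", path: ",electronics,laptops," },
  { _id: "ultrabooks", path: ",electronics,laptops,ultrabooks," },
  { _id: "books", path: ",books," },
];

// Escape regex metacharacters so ids are matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Subtree of a node = every path that starts with that node's path.
// In MongoDB the same anchored regex would be used in a filter on the path field.
function subtree(nodes, root) {
  const prefix = new RegExp("^" + escapeRegex(root.path));
  return nodes.filter((n) => n._id !== root._id && prefix.test(n.path));
}

console.log(subtree(categories, categories[1]).map((n) => n._id)); // [ 'ultrabooks' ]
```

Because the regex is anchored at the start of the string, an index on the path field can serve it as a prefix scan.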

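Spring Data MongoDB

import java.time.LocalDateTime;
import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;
import org.springframework.data.mongodb.repository.Query;
import org.springframework.stereotype.Repository;

@Document(collection = "users")
public class User {
@Id private String id;
private String name;
private String email;
private int age; // needed by findByAgeGreaterThan below
private List<Address> addresses; // embedded documents; Address needs a city property
private LocalDateTime createdAt;
}

@Repository
public interface UserRepository extends MongoRepository<User, String> {
// Spring derives query from method name
List<User> findByAddressesCity(String city);

// Custom MQL query
@Query("{ 'orders.status': ?0 }")
List<User> findByOrderStatus(String status);

// Paginated result
Page<User> findByAgeGreaterThan(int age, Pageable pageable);
}

🎯 Interview Questions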

Q1. What is MongoDB and how does it differ from a relational database?

MongoDB is a document-oriented NoSQL database that stores data as BSON documents in collections rather than rows in tables. Key differences: schema is flexible (documents in a collection can have different fields), relationships are handled by embedding or referencing rather than JOINs, it scales horizontally via sharding whereas relational databases have traditionally scaled vertically, and hierarchical data maps naturally onto documents. The trade-off is giving up some relational guarantees for flexibility and scalability.

Q2. When would you embed data vs. reference it?

Embed when: data is always read together, the relationship is 1-to-few (bounded count), and the sub-document has no lifecycle outside the parent (e.g. order line items, address inside a user). Reference when: the relationship is 1-to-many with unbounded growth (e.g. comments on a post), the sub-document is accessed or updated independently, it's shared by multiple parents, or embedding would exceed 16 MB. The guiding principle: design for your application's access patterns, not for normalisation.

Q3. What is the aggregation pipeline and how does it work?

The aggregation pipeline transforms a collection through a series of stages, each taking the output of the previous stage as input, similar to Unix pipes. Common stages: $match (filter), $group (aggregate), $sort, $lookup (join), $unwind (explode arrays), $project (reshape). For performance, put $match as early as possible so MongoDB can use an index before the expensive stages run.

Q4. How do indexes work in MongoDB? What is the ESR rule?

MongoDB uses B-tree indexes (plus specialised types like 2dsphere and text). An index stores a sorted structure of field values with pointers to documents, enabling O(log n) lookup instead of O(n) collection scan. The ESR rule for compound indexes: Equality fields first (narrows the scan set most), then Sort fields (allows index-ordered traversal, avoiding in-memory sort), then Range fields last ($gt, $in, $lt). Verify with explain("executionStats"): look for IXSCAN, not COLLSCAN.

Q5. How does MongoDB support ACID transactions? When should you use them?

Since v4.0, MongoDB supports multi-document ACID transactions across collections and (since v4.2) across shards. They use snapshot isolation and writeConcern: "majority" for durability. However, they add significant overhead: write conflicts, longer latency, coordination cost. Best practice: use transactions only when you must atomically update multiple documents across collections. In many cases a well-designed document model (embedding related data) removes the need for a transaction, because a single-document write is already atomic, which is both faster and simpler.

Q6. What is a replica set and how does failover work?

A replica set is a group of MongoDB instances (typically 3) that keep identical data. One node is the primary (handles all writes); secondaries replicate via the oplog asynchronously. If the primary becomes unreachable, the remaining nodes hold an election: a node with a sufficiently up-to-date oplog that wins a majority of votes becomes the new primary. With default settings an election typically completes in around 12 seconds; during that window writes fail or wait. writeConcern: "majority" ensures writes are on a majority of nodes before acknowledging, so acknowledged writes survive primary failure.

Q7 (Senior). What is sharding and how do you choose a shard key?

Sharding horizontally partitions data across multiple replica sets. A mongos router maps each document to a shard based on the shard key. A good shard key must: (1) have high cardinality, meaning many distinct values to spread data; (2) avoid monotonic growth, because time-based keys concentrate all writes on the "last" shard; (3) match your query patterns, so if most queries filter by userId, shard on userId to avoid scatter-gather. The shard key is hard to change: get it wrong and you face a costly reshard operation.

Q8 (Senior). What is the oplog and how do change streams use it?

The oplog (operations log) is a capped collection in the local database that records all write operations as idempotent entries. Secondaries tail the oplog to replicate. Change streams are a user-facing API built on the oplog: they translate raw oplog entries into user-friendly change events with a resume token. Change streams are resumable: if your consumer crashes, it restarts from the last processed token. They support pipeline filtering and deliver full document state with fullDocument: "updateLookup". Use cases: event sourcing, real-time dashboards, cache invalidation, CDC pipelines.

