NoSQL & MongoDB: Complete Guide
What is MongoDB?
MongoDB is a document-oriented NoSQL database, and one of the most widely used. Instead of storing data in rows and columns like MySQL or PostgreSQL, it stores data as flexible JSON-like documents grouped into collections.
| Concept | Description |
|---|---|
| Documents, not rows | Each record is a rich JSON/BSON document. Fields can be nested and arrays can hold objects; no rigid schema is required. |
| Collections, not tables | Documents live in collections. Unlike SQL tables, a collection doesn't enforce that every document has the same fields. |
| Horizontal scale | Designed to shard across many servers rather than scale up to one bigger one. |
| Flexible schema | Add a new field to one document without a migration script. |
A document looks like this
Forget rows with foreign-key joins. One user document holds everything about that user:
// SQL: 3 tables (users, addresses, order_refs) + JOINs
// MongoDB: one self-contained document
{
"_id": "ObjectId(\"64abc123...\")",
"name": "Alice Tran",
"createdAt": "ISODate(\"2024-01-15T09:00:00Z\")",
"address": {
"city": "Ho Chi Minh City",
"country": "VN"
},
"tags": ["premium", "verified"],
"preferences": {
"theme": "dark",
"notifications": true
}
}
MongoDB stores documents as BSON (Binary JSON), a superset of JSON with extra types like ObjectId, Date, Decimal128, and BinData. The driver converts between BSON and your language's native types automatically.
Why MongoDB?
Problems it solves
| Problem | How MongoDB helps |
|---|---|
| Schema rigidity | In SQL, adding a column to a 500M-row table can require a lock and a long-running migration. In MongoDB you just start writing the new field; old documents simply won't have it. |
| Impedance mismatch | ORMs exist because objects don't map cleanly to tables. MongoDB documents are shaped like objects: nested structures map almost 1:1, which thins out the mapping layer. |
| Write throughput at scale | Horizontal sharding distributes writes across many nodes, so write capacity can grow by adding shards instead of being capped by a single primary. |
| Variable data shapes | Product catalogs where a laptop has 40 fields and a T-shirt has 8 different ones. EAV in SQL is painful; MongoDB handles this naturally. |
MongoDB vs. relational databases
| Topic | Relational (PostgreSQL) | MongoDB |
|---|---|---|
| Data model | Tables, rows, columns | Collections, documents (JSON/BSON) |
| Schema | Fixed: DDL migrations required | Flexible: optional schema validation |
| Joins | Native SQL JOINs | $lookup (aggregation) or embedding |
| Transactions | Full ACID, mature for decades | Single-document writes are always atomic; multi-document ACID since v4.0 (v4.2 across shards) |
| Horizontal scale | Hard; read replicas only (writes to one) | Native sharding across many nodes |
| Query language | SQL (declarative, universal) | MQL, the MongoDB Query Language (JSON-based) |
| Best for | Complex queries, strict integrity, BI | Documents, scale-out, evolving schemas |
Financial ledgers, stock inventory, reservation systems: anywhere strict ACID consistency across many related records is the primary concern, a mature relational database is usually the better choice. MongoDB added multi-document transactions, but they carry a real performance cost.
Decision guide
Is data hierarchical / document-shaped?
→ MongoDB ✅
Do you need complex multi-table JOINs + referential integrity?
→ PostgreSQL
Is it high-volume time-series (IoT, monitoring)?
→ MongoDB time-series collections (v5+) or InfluxDB/TimescaleDB
Need full-text search as the primary feature?
→ Elasticsearch (or Atlas Search on top of MongoDB)
Need sub-millisecond caching?
→ Redis (MongoDB is a disk database)
Core Concepts
The hierarchy
| MongoDB | SQL equivalent | Description |
|---|---|---|
| Database | Database | A namespace containing collections. One server can host many databases. |
| Collection | Table | A group of documents. No enforced schema by default. |
| Document | Row | A BSON record: a set of key-value pairs, which can be nested. |
| Field | Column | A key-value pair inside a document. Value can be any BSON type. |
| _id | Primary key | Every document must have a unique _id. Auto-generated as ObjectId if omitted. |
| Index | Index | B-tree or special index on one or more fields to speed up queries. |
ObjectId: the default primary key
MongoDB auto-generates a 12-byte ObjectId for a document's _id when you don't supply one. It encodes a timestamp, a per-process random value, and an incrementing counter, so it is unique in practice across machines, roughly time-sortable, and needs no central counter:
ObjectId("64abc1230000000000000001")
//        ^^^^^^^^ 4-byte Unix timestamp (seconds)
//                ^^^^^^^^^^ 5-byte random value, unique per process
//                          ^^^^^^ 3-byte incrementing counter
You can read the creation time of a document from an auto-generated ObjectId for free with document._id.getTimestamp(). If one-second precision is enough, you may not need a separate createdAt field.
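To make the timestamp claim concrete, here is a minimal sketch in plain JavaScript that decodes the creation time from an ObjectId's hex string without any driver. The helper name objectIdToDate is hypothetical; in mongosh or a driver you would call getTimestamp() instead.

```javascript
// Decode the creation time embedded in an ObjectId's 24-character hex string.
// The first 4 bytes (8 hex characters) are a big-endian Unix timestamp in seconds.
// objectIdToDate is an illustrative helper, not part of any MongoDB API.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdToDate("64abc1230000000000000001");
console.log(created.toISOString());
```

Because the timestamp leads the value, sorting by _id roughly sorts by creation time, which is why the text calls ObjectIds time-sortable.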
BSON data types
| Type | BSON type string | Example use |
|---|---|---|
| String | string | Names, descriptions |
| Integer (32/64-bit) | int / long | Counts, IDs |
| Double | double | Coordinates, ratings |
| Decimal128 | decimal | Money: use this, not double! |
| Boolean | bool | Flags |
| Date | date | Timestamps, always stored as UTC |
| ObjectId | objectId | Document IDs, references |
| Array | array | Tags, order items, addresses |
| Object (embedded doc) | object | Nested documents |
| Null | null | Optional absent field |
| Binary data | binData | File bytes, UUIDs |
Schema validation (optional but recommended)
MongoDB can enforce a JSON Schema on a collection. This gives you a flexible schema where you need it and strict rules where they matter:
db.createCollection("users", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["name", "email"],
properties: {
name: { bsonType: "string", minLength: 1 },
email: { bsonType: "string", pattern: "^.+@.+$" },
age: { bsonType: "int", minimum: 0, maximum: 150 }
}
}
},
validationAction: "error" // or "warn" to log without rejecting
})
Data Modelling
Data modelling is the most important MongoDB skill. The central question: should related data be embedded in the same document, or referenced across documents?
Embedding (denormalization)
Store related data inside the parent document. One read fetches everything, with no JOINs:
{
"_id": "ObjectId(\"...\")",
"userId": "alice",
"items": [
{ "sku": "LAPTOP-X1", "qty": 1, "price": 1299.00 },
{ "sku": "HDMI-CBL", "qty": 2, "price": 14.99 }
],
"shippingAddress": { "city": "HCMC", "zip": "70000" }
}
Referencing (normalization)
Store a foreign key (usually an ObjectId) and look up the related document separately, much as you would in SQL:
// Order document
{ "_id": ObjectId("order1"), "userId": ObjectId("user1"), "total": 129.99 }
// User document (separate collection)
{ "_id": ObjectId("user1"), "name": "Alice", "email": "alice@..." }
// Query with $lookup (like a LEFT JOIN)
db.orders.aggregate([
{ $lookup: {
from: "users",
localField: "userId",
foreignField: "_id",
as: "user"
}}
])
The embed vs. reference decision
| Use embed when… | Use reference when… |
|---|---|
| Data is always read together | Data is accessed independently |
| Relationship is 1-to-few (≤ ~100 sub-items) | Relationship is 1-to-many (thousands+) |
| Sub-item has no lifecycle without the parent | Sub-item exists independently (e.g. a Product) |
| Sub-items rarely change individually | Sub-items are updated frequently |
| Document size stays under 16 MB | Embedding would exceed the 16 MB document limit |
MongoDB documents cannot exceed 16 MB. An order with 2 line items is fine to embed. A social post with unbounded comments is not: never embed the comments. Use a separate comments collection referenced by postId.
Real-world examples
✅ Embed
Blog post + metadata: tags and SEO are always read with the post and have no independent lifecycle:
{ "title": "...", "author": "...", "tags": [], "seo": { "slug": "...", "metaDesc": "..." } }
Order + line items: items are bounded (≤ 100) and always shown with the order:
{ "userId": "...", "items": [{ "sku": "X1", "qty": 1, "price": 1299 }], "shippingAddress": {} }
🔗 Reference
Blog post + comments: comments are unbounded, so they are referenced by postId:
// comments collection
{ "postId": ObjectId("..."), "text": "Great post!", "author": "Bob" }
Order line item → product: products exist independently and change on their own:
// item inside order references product
{ "productId": ObjectId("..."), "qty": 2, "unitPrice": 24.99 }
🔬 Senior deep-dive: advanced schema patterns
Bucket pattern
For time-series data, instead of one document per event (millions of tiny documents), bucket events into time windows. This dramatically reduces index size and improves scan performance:
// Anti-pattern: one doc per sensor reading (millions of docs)
{ sensorId: "s1", ts: ISODate("..."), value: 22.5 }
// Bucket pattern: one doc per sensor per hour (thousands of docs)
{
sensorId: "s1",
hour: ISODate("2024-01-15T09:00:00Z"),
count: 60,
sum: 1350.0,
readings: [
{ ts: ISODate("2024-01-15T09:00:01Z"), v: 22.5 },
{ ts: ISODate("2024-01-15T09:01:02Z"), v: 22.6 }
// ... 58 more
]
}
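As a rough illustration of how raw readings could be folded into hourly buckets before they are written, the sketch below groups an in-memory array by sensor and hour. The function name bucketByHour and the input shape are assumptions made for this example; the output fields mirror the bucket document above.

```javascript
// Group raw sensor readings into one bucket document per sensor per hour.
// bucketByHour and the { sensorId, ts, value } input shape are illustrative
// assumptions; the output fields mirror the bucket document shown above.
function bucketByHour(readings) {
  const buckets = new Map();
  for (const { sensorId, ts, value } of readings) {
    const hour = new Date(ts);
    hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour (UTC)
    const key = `${sensorId}|${hour.toISOString()}`;
    if (!buckets.has(key)) {
      buckets.set(key, { sensorId, hour, count: 0, sum: 0, readings: [] });
    }
    const bucket = buckets.get(key);
    bucket.count += 1;
    bucket.sum += value;
    bucket.readings.push({ ts: new Date(ts), v: value });
  }
  return [...buckets.values()];
}

const bucketDocs = bucketByHour([
  { sensorId: "s1", ts: "2024-01-15T09:00:01Z", value: 22.5 },
  { sensorId: "s1", ts: "2024-01-15T09:01:02Z", value: 22.6 },
  { sensorId: "s1", ts: "2024-01-15T10:00:00Z", value: 23.0 },
]);
console.log(bucketDocs.length); // 2 buckets: 09:00 and 10:00
```

Keeping count and sum on the bucket means an hourly average can be read without touching the readings array at all.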
Extended reference pattern
Reference the foreign document's _id, but also duplicate the most frequently read fields into the referencing document. This trades some write complexity for zero-join reads on the hot path:
// Order line item: productId is the reference,
// but name + imageUrl are duplicated to avoid $lookup on every order display
{
productId: ObjectId("..."),
name: "USB-C Charger 65W", // duplicated field
imageUrl: "/images/charger.jpg", // duplicated field
qty: 2,
unitPrice: 24.99
}
If name changes in the products collection, order history shows the old name, which is often correct for an order (you want the name at time of purchase). Choose deliberately.
Computed pattern
Pre-compute expensive aggregations and store the result directly on the document. Read the cached value instead of recomputing on every request:
{ "productId": "P1", "totalReviews": 1847, "avgRating": 4.3, "lastComputed": "..." }
Recomputed via a background job or $merge pipeline periodically.
Subset pattern
Store only the N most recently-accessed array items in the parent document; keep the full list in a separate collection. Keeps the working set small and hot in memory:
// Product: only 5 most recent reviews embedded
{ "_id": "P1", "name": "...", "recentReviews": [/* last 5 */] }
// Full reviews in separate collection
{ "productId": "P1", "text": "...", "rating": 5, "createdAt": "..." }
Outlier pattern
Most documents are "normal" (a book with 5 reviews). A few are outliers (a bestseller with 50,000). Use an overflow flag + overflow collection instead of bloating the main document:
// Normal book: all reviews embedded
{ "_id": "B1", "reviews": [/* ≤ 200 */], "hasOverflow": false }
// Bestseller: truncated + overflow flag
{ "_id": "B2", "reviews": [/* first 200 */], "hasOverflow": true }
// Overflow docs
{ "bookId": "B2", "reviews": [/* next 200 */], "page": 2 }
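One way to picture the write side of the outlier pattern is a helper that splits a review list into the main document and its overflow pages. This is only a sketch: splitReviews is a hypothetical name, and the 200-item page size simply follows the comments in the documents above.

```javascript
// Split a review list into a main document plus overflow pages.
// splitReviews is an illustrative helper; the 200-item threshold and the
// field names (reviews, hasOverflow, bookId, page) follow the documents above.
function splitReviews(bookId, reviews, pageSize = 200) {
  const main = {
    _id: bookId,
    reviews: reviews.slice(0, pageSize),
    hasOverflow: reviews.length > pageSize,
  };
  const overflow = [];
  for (let start = pageSize, page = 2; start < reviews.length; start += pageSize, page += 1) {
    overflow.push({ bookId, reviews: reviews.slice(start, start + pageSize), page });
  }
  return { main, overflow };
}

const sampleReviews = Array.from({ length: 450 }, (_, i) => ({ rating: 5, n: i }));
const { main, overflow } = splitReviews("B2", sampleReviews);
console.log(main.hasOverflow, overflow.length); // true 2
```

Readers of a normal book never pay for the overflow logic; only documents with hasOverflow set need the second query.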
CRUD Operations
MongoDB Query Language (MQL) uses JSON objects for both filters and update specifications.
Create
// Insert one (the field values are illustrative; the source omits them)
db.users.insertOne({
name: "Alice Tran",
age: 28
})
// → { acknowledged: true, insertedId: ObjectId("...") }
// Insert many (bulk)
db.users.insertMany([
{ name: "Bob", age: 32 },
{ name: "Carol", age: 25 }
], { ordered: false }) // ordered:false = continue on partial failure
Read
// Find all documents
db.users.find({})
// Filter: age > 25 AND name starts with "A"
db.users.find({
age: { $gt: 25 },
name: { $regex: /^A/i }
})
// Projection: return only name + email, exclude _id
db.users.find({ age: { $gte: 18 } }, { name: 1, email: 1, _id: 0 })
// Sort, skip, limit: pagination
db.users.find({})
.sort({ createdAt: -1 }) // -1 = descending
.skip(20)
.limit(10)
// Query on embedded field (dot notation)
db.users.find({ "address.city": "HCMC" })
// Query on array: documents where tags contains "premium"
db.users.find({ tags: "premium" })
Query operators cheat-sheet
| Operator | Meaning | Example |
|---|---|---|
| $eq | Equals | { age: { $eq: 25 } } |
| $ne | Not equals | { status: { $ne: "banned" } } |
| $gt / $gte | Greater than / or equal | { price: { $gte: 100 } } |
| $lt / $lte | Less than / or equal | { stock: { $lt: 5 } } |
| $in | Value in array | { status: { $in: ["active", "trial"] } } |
| $nin | Value not in array | { role: { $nin: ["admin"] } } |
| $and | Logical AND | { $and: [{ age: { $gt: 18 } }, { verified: true }] } |
| $or | Logical OR | { $or: [{ city: "HCM" }, { city: "HN" }] } |
| $exists | Field exists | { phone: { $exists: true } } |
| $regex | Regular expression | { email: { $regex: /@gmail/ } } |
| $elemMatch | Array element matches all conditions | { items: { $elemMatch: { qty: { $gt: 1 }, sku: "X" } } } |
Update
// Update one document
db.users.updateOne(
{ _id: ObjectId("...") }, // filter: which document to update
{ $set: { name: "Alice Tran" }, // update operator
$currentDate: { updatedAt: true } }
)
// Increment a counter atomically
db.products.updateOne(
{ _id: ObjectId("...") },
{ $inc: { stock: -1 } } // atomic, safe under concurrency
)
// Push to an array
db.users.updateOne(
{ _id: ObjectId("...") },
{ $push: { tags: "verified" } }
)
// Upsert: insert if not found
db.users.updateOne(
{ email: "new.user@example.com" }, // filter (illustrative; the source omits it)
{ $set: { name: "New User" } },
{ upsert: true }
)
Update operators cheat-sheet
| Operator | Effect |
|---|---|
| $set | Set field value (creates if missing) |
| $unset | Remove a field from the document |
| $inc | Increment/decrement a number atomically |
| $mul | Multiply a field's value |
| $rename | Rename a field |
| $push | Append to an array |
| $addToSet | Append to array only if not already present |
| $pull | Remove matching elements from array |
| $pop | Remove first or last element of array |
| $currentDate | Set field to current date/timestamp |
Delete
// Delete one matching document
db.users.deleteOne({ _id: ObjectId("...") })
// Delete all matching documents
db.sessions.deleteMany({ expiresAt: { $lt: new Date() } })
// findOneAndDelete: atomically retrieve + delete (useful for job queues)
const job = await db.jobs.findOneAndDelete(
{ status: "pending" },
{ sort: { createdAt: 1 } }
)
Aggregation Pipeline
The aggregation pipeline transforms a collection through a series of stages, like Unix pipes for data. Each stage takes input documents and emits output to the next.
Collection → [$match] → [$group] → [$sort] → [$limit] → Result
Example: revenue by category
db.orders.aggregate([
// Stage 1: only completed orders from 2024
{ $match: {
status: "completed",
createdAt: { $gte: ISODate("2024-01-01"), $lt: ISODate("2025-01-01") }
}},
// Stage 2: unwind items array → one doc per line item
{ $unwind: "$items" },
// Stage 3: join with products collection to get category
{ $lookup: {
from: "products",
localField: "items.productId",
foreignField: "_id",
as: "product"
}},
// Stage 4: $lookup returns an array, so flatten it to a single product
{ $unwind: "$product" },
// Stage 5: group by category, sum revenue
{ $group: {
_id: "$product.category",
totalRevenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } },
lineItemCount: { $sum: 1 } // counts line items, since each order was unwound
}},
// Stage 6: sort highest revenue first
{ $sort: { totalRevenue: -1 }},
// Stage 7: rename _id to category
{ $project: { category: "$_id", totalRevenue: 1, lineItemCount: 1, _id: 0 }}
])
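For intuition about what each stage does to the data, the pipeline above can be mimicked with ordinary array methods on a tiny in-memory dataset. This is not how MongoDB executes it; the sample orders, product IDs, and categories are invented for the example.

```javascript
// In-memory analogue of $match → $unwind → $lookup → $group → $sort.
// All sample data below is made up for illustration.
const orders = [
  { status: "completed", items: [{ productId: "p1", qty: 1, price: 1000 }, { productId: "p2", qty: 2, price: 15 }] },
  { status: "completed", items: [{ productId: "p2", qty: 1, price: 15 }] },
  { status: "cancelled", items: [{ productId: "p1", qty: 1, price: 1000 }] },
];
const products = new Map([
  ["p1", { category: "laptops" }],
  ["p2", { category: "cables" }],
]);

const revenueByCategory = orders
  .filter((order) => order.status === "completed") // $match
  .flatMap((order) => order.items) // $unwind: "$items"
  .map((item) => ({ item, product: products.get(item.productId) })) // $lookup + $unwind
  .reduce((acc, { item, product }) => { // $group
    const runningTotal = acc.get(product.category) ?? 0;
    return acc.set(product.category, runningTotal + item.qty * item.price);
  }, new Map());

const sorted = [...revenueByCategory]
  .map(([category, totalRevenue]) => ({ category, totalRevenue })) // $project
  .sort((a, b) => b.totalRevenue - a.totalRevenue); // $sort
console.log(sorted);
// [ { category: 'laptops', totalRevenue: 1000 }, { category: 'cables', totalRevenue: 45 } ]
```

The cancelled order drops out at the first step, which is the same reason the text recommends filtering as early as possible.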
Key stages reference
| Stage | SQL equivalent | Use |
|---|---|---|
| $match | WHERE | Filter early, before expensive stages |
| $project | SELECT cols | Include/exclude/rename/compute fields |
| $group | GROUP BY | Aggregate with $sum, $avg, $min, $max, $push, $addToSet |
| $sort | ORDER BY | Sort (can use an index when placed right after $match) |
| $limit | LIMIT | Take first N documents |
| $skip | OFFSET | Skip N documents (avoid large skips) |
| $unwind | Explode JOIN | Deconstruct array → one doc per element |
| $lookup | LEFT JOIN | Join with another collection |
| $addFields | Computed column | Add computed fields without removing others |
| $facet | Multiple GROUP BY | Run multiple sub-pipelines on the same input within one stage |
| $bucket | Histogram | Group by numeric range (price brackets, age ranges) |
| $out / $merge | INSERT INTO / UPSERT | Write pipeline results to a collection |
Put $match and $sort early. Place $match as the first stage so MongoDB can use an index and skip irrelevant documents. A $match after $group cannot use an index, so every document flows through the earlier stages first. A $sort immediately after $match can leverage a compound index covering the same fields.
🔬 Senior deep-dive: $facet for multi-dimensional analytics
$facet runs multiple sub-pipelines on the same input within a single stage, which suits search pages that return results, category counts, and a total in one round trip:
db.products.aggregate([
{ $match: { inStock: true } },
{ $facet: {
// sub-pipeline 1: paginated results
"results": [
{ $sort: { price: 1 } },
{ $skip: 0 }, { $limit: 20 }
],
// sub-pipeline 2: category counts for faceted filters UI
"categoryFacets": [
{ $group: { _id: "$category", count: { $sum: 1 } } }
],
// sub-pipeline 3: total count for pagination
"total": [{ $count: "n" }]
}}
])
Indexes
Indexes are the single biggest lever for MongoDB performance. Without the right index, queries do a full collection scan, reading every document. With one, lookups become O(log n).
Index types
Single field
db.users.createIndex({ age: 1 })
// 1 = ascending, -1 = descending
The basic index type. Works for equality, range, and sort queries on that field.
Compound
db.orders.createIndex({ userId: 1, createdAt: -1 })
Serves queries on userId alone or on userId + createdAt together. Field order matters; see the ESR rule below.
Multikey (array)
db.users.createIndex({ tags: 1 })
// auto-multikey when field is an array
Indexes each element of the array. Enables fast { tags: "premium" } queries.
Text
db.articles.createIndex({ title: "text", body: "text" })
Tokenizes and stems words. Enables { $text: { $search: "mongodb sharding" } } queries.
Geospatial
db.shops.createIndex({ location: "2dsphere" })
Enables $near and $geoWithin queries for location-based searches.
Unique
db.users.createIndex({ email: 1 }, { unique: true })
Enforces uniqueness. Inserting a duplicate throws an E11000 duplicate key error.
The ESR rule for compound indexes
When designing a compound index, field order matters. Follow Equality → Sort → Range:
| Position | Type | Example field | Why |
|---|---|---|---|
| 1st | Equality | userId | Eliminates most documents immediately |
| 2nd | Sort | createdAt (used in .sort()) | Pre-sorts results and avoids an in-memory sort stage |
| 3rd | Range | status (used in $in) | Applied last after equality/sort narrow the set |
// Query: orders for user X, sorted newest-first, only "shipped" or "delivered"
db.orders.find({ userId: "alice", status: { $in: ["shipped", "delivered"] } })
.sort({ createdAt: -1 })
// Correct ESR compound index:
// (MongoDB can treat a short $in list like an equality match, so confirm the chosen plan with explain())
db.orders.createIndex({ userId: 1, createdAt: -1, status: 1 })
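The ordering rule can be captured in a few lines. The sketch below assembles an index specification from fields you have already classified as equality, sort, or range; buildEsrIndex is a hypothetical helper, and the classification itself remains a judgment call made per query.

```javascript
// Build a compound index spec in Equality → Sort → Range order.
// buildEsrIndex is an illustrative helper, not a MongoDB API. It relies on
// JavaScript objects preserving insertion order for non-numeric string keys.
function buildEsrIndex({ equality = [], sort = {}, range = [] }) {
  const spec = {};
  for (const field of equality) spec[field] = 1;
  for (const [field, direction] of Object.entries(sort)) {
    if (!(field in spec)) spec[field] = direction;
  }
  for (const field of range) {
    if (!(field in spec)) spec[field] = 1;
  }
  return spec;
}

const indexSpec = buildEsrIndex({
  equality: ["userId"],
  sort: { createdAt: -1 },
  range: ["status"],
});
console.log(JSON.stringify(indexSpec)); // {"userId":1,"createdAt":-1,"status":1}
```

The resulting object could be passed to createIndex as-is, since key order in the spec is what determines field order in the index.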
Diagnosing with explain()
db.orders.find({ userId: "alice" }).explain("executionStats")
// Key fields:
// queryPlanner.winningPlan: look for an "IXSCAN" stage ✅ rather than "COLLSCAN" 🔴
// executionStats.nReturned: documents returned
// executionStats.totalDocsExamined: documents scanned
// → good ratio: nReturned ≈ totalDocsExamined
// executionStats.executionTimeMillis: wall time
Every index costs write overhead: MongoDB must update every affected index on every insert, update, and delete. Don't create indexes speculatively. Measure first with explain(), then add the minimum indexes needed.
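When checking many queries, it can help to reduce an explain result to the few numbers that matter. The sketch below walks the winning plan and computes the examined-to-returned ratio. It assumes the classic explain shape (queryPlanner.winningPlan with nested inputStage); newer server versions may nest the plan differently, and the sample object is hand-written, not real server output.

```javascript
// Collect every stage name in a plan tree (inputStage / inputStages nesting).
function collectStages(plan, stages = []) {
  if (!plan) return stages;
  if (plan.stage) stages.push(plan.stage);
  collectStages(plan.inputStage, stages);
  for (const child of plan.inputStages ?? []) collectStages(child, stages);
  return stages;
}

// Summarise an explain("executionStats") result. summariseExplain is an
// illustrative helper; it assumes the classic explain output shape.
function summariseExplain(explain) {
  const stages = collectStages(explain.queryPlanner.winningPlan);
  const { nReturned, totalDocsExamined } = explain.executionStats;
  return {
    stages,
    usesCollectionScan: stages.includes("COLLSCAN"),
    // Documents examined per document returned; close to 1 is ideal.
    examinedPerReturned: nReturned === 0 ? totalDocsExamined : totalDocsExamined / nReturned,
  };
}

// Hand-written sample in the classic explain shape (not real server output).
const sampleExplain = {
  queryPlanner: {
    winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "userId_1" } },
  },
  executionStats: { nReturned: 12, totalDocsExamined: 12, executionTimeMillis: 1 },
};
console.log(summariseExplain(sampleExplain));
```

A ratio far above 1 usually means the index narrows the scan less than expected, even when the plan does report IXSCAN.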
🔬 Senior deep-dive: partial indexes and covered queries
Partial indexes
Index only documents matching a filter. Dramatically reduces index size when most queries target a subset:
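// Only index orders that are still active (not yet delivered).
// "Active order" queries are 95% of traffic; delivered orders rarely queried.
// partialFilterExpression does not accept $ne, so list the active statuses
// explicitly ($in here needs MongoDB 6.0+; the status values are illustrative).
db.orders.createIndex(
{ userId: 1, createdAt: -1 },
{ partialFilterExpression: { status: { $in: ["payment_received", "shipped"] } } }
)
// Sparse unique index: unique email, and documents without an email field are allowed
// (an explicit null is still indexed, so at most one document may store email: null)
db.users.createIndex({ email: 1 }, { unique: true, sparse: true })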
Covered queries
A query is "covered" when every field it filters on and returns is in the index, so MongoDB never touches the actual documents. This is the fastest possible read: O(log n) with zero document I/O.
// Index: { userId: 1, status: 1, total: 1 }
// Query + projection only uses indexed fields → fully covered
db.orders.find(
{ userId: "alice", status: "shipped" },
{ total: 1, _id: 0 } // must exclude _id to be covered!
)
// explain() → winningPlan.stage: "PROJECTION_COVERED", with totalDocsExamined: 0
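The coverage condition can be approximated with a small check over the index keys, the filter, and the projection. This is a deliberately simplified sketch: isCovered is a hypothetical helper that only handles top-level field names and inclusion projections, and it ignores cases such as multikey indexes and logical operators like $or.

```javascript
// Rough check: could this index cover the query? Simplified on purpose:
// top-level fields only, inclusion projections only, no multikey or $or handling.
// isCovered is an illustrative helper, not a MongoDB API.
function isCovered(indexSpec, filter, projection) {
  const indexed = new Set(Object.keys(indexSpec));
  const filterOk = Object.keys(filter).every((field) => indexed.has(field));
  const returned = Object.entries(projection)
    .filter(([, value]) => value === 1)
    .map(([field]) => field);
  // _id is returned by default unless the projection sets _id: 0.
  if (projection._id !== 0 && !returned.includes("_id")) returned.push("_id");
  return filterOk && returned.every((field) => indexed.has(field));
}

const orderIndex = { userId: 1, status: 1, total: 1 };
const orderFilter = { userId: "alice", status: "shipped" };
console.log(isCovered(orderIndex, orderFilter, { total: 1, _id: 0 })); // true
console.log(isCovered(orderIndex, orderFilter, { total: 1 })); // false: _id is not in the index
```

The second call shows why the projection must exclude _id: the default _id field forces a document fetch unless the index happens to contain it.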
Use Cases
Where MongoDB excels
| Use case | Why MongoDB fits |
|---|---|
| E-commerce product catalog | Products in different categories have completely different attributes. A laptop has RAM and CPU; a T-shirt has size and colour. A flexible schema handles this naturally, with no EAV tables needed. |
| User profiles and preferences | User data is a natural document: profile info, settings, social links. Most reads fetch the whole profile at once, making embedding ideal. |
| Real-time event logging | Clickstream, activity feeds, analytics events: high write throughput with flexible schemas. Time-series collections (v5.0+) improve storage and query efficiency further. |
| Geospatial applications | Native 2dsphere indexes enable "find restaurants within 2 km" with one query. GeoJSON support is first-class. |
| CMS / content management | Articles, recipes, landing pages all have variable structure. No schema migration when editors add a new content type. |
| Gaming: player state & leaderboards | Player inventory, progress, and achievements are natural documents. Flexible schema means adding a new game feature doesn't require migrating 10M player records. |
Geospatial example
db.restaurants.find({
location: {
$near: {
$geometry: { type: "Point", coordinates: [106.66, 10.78] },
$maxDistance: 2000 // metres
}
}
})
Where to think twice
| Scenario | Better choice | Why |
|---|---|---|
| Financial ledger, double-entry accounting | PostgreSQL | Complex transactions, strict integrity, audit trails |
| Complex reporting with ad-hoc JOINs | PostgreSQL / data warehouse | SQL is vastly better for multi-table analytics |
| Very high-volume IoT / monitoring | InfluxDB / TimescaleDB | Dedicated time-series engines outperform general-purpose on compression and query speed |
| Full-text search as primary feature | Elasticsearch | Better relevance scoring, richer text analysis |
| Sub-millisecond caching | Redis | In-memory; MongoDB is a disk database |
Advanced Topics (Senior)
Replica sets
A replica set is a group of MongoDB instances that maintain the same dataset. Typical production setup: 1 primary + 2 secondaries. All writes go to the primary; secondaries replicate asynchronously via the oplog.
// Java driver: read from secondary for analytics queries
MongoCollection<Document> col = db.getCollection("orders")
.withReadPreference(ReadPreference.secondary())
.withReadConcern(ReadConcern.MAJORITY);
Read preference options:
| Option | Behaviour |
|---|---|
| primary (default) | All reads go to the primary: consistent, no replication lag |
| primaryPreferred | Read from primary when available, fallback to secondary |
| secondary | Always read from a secondary; results may be slightly stale |
| secondaryPreferred | Read from secondary when available, fallback to primary |
| nearest | Read from the node with lowest network latency |
Sharding
Sharding partitions data across multiple replica sets ("shards"). A mongos router directs queries to the correct shard(s) based on the shard key.
| Shard key strategy | Pros | Cons / gotchas |
|---|---|---|
| Hashed (e.g. _id hashed) | Even distribution, no hotspots | Range queries scatter across all shards |
| Ranged (e.g. createdAt) | Range queries hit one shard | Monotonic keys (time, ObjectId) cause write hotspots |
| Compound (e.g. userId + date) | Locality per user, efficient range queries | Complex to choose; requires high cardinality |
The shard key is close to a permanent architectural decision. MongoDB 4.4 added shard key refinement and 5.0 added live resharding (reshardCollection), but practically speaking a poor shard key, one that causes hotspots or scatter-gather on every read, still means an expensive reshard. Design carefully before sharding.
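The hotspot problem with monotonic keys is easy to see in a toy simulation. The sketch below assigns 1,000 ever-increasing keys to four shards, once by range and once by hash. The MD5-plus-modulo hashing and the fixed range boundaries are stand-ins chosen for the example; MongoDB's real chunk assignment and balancing work differently.

```javascript
const { createHash } = require("node:crypto");

// Ranged sharding: shard i owns keys below upperBounds[i]; the last shard
// owns everything at or above the final bound.
function rangedShard(key, upperBounds) {
  const index = upperBounds.findIndex((bound) => key < bound);
  return index === -1 ? upperBounds.length : index;
}

// Hashed sharding: hash the key first, then spread hash values over shards.
// MD5 + modulo is a stand-in here, not MongoDB's actual chunk assignment.
function hashedShard(key, shardCount) {
  const digest = createHash("md5").update(String(key)).digest();
  return digest.readUInt32BE(0) % shardCount;
}

function countPerShard(keys, pickShard, shardCount) {
  const counts = new Array(shardCount).fill(0);
  for (const key of keys) counts[pickShard(key)] += 1;
  return counts;
}

// 1,000 monotonically increasing keys, all newer than the existing ranges.
const newKeys = Array.from({ length: 1000 }, (_, i) => 3000 + i);
const rangedCounts = countPerShard(newKeys, (key) => rangedShard(key, [1000, 2000, 3000]), 4);
const hashedCounts = countPerShard(newKeys, (key) => hashedShard(key, 4), 4);
console.log("ranged:", rangedCounts); // every new write lands on the last shard
console.log("hashed:", hashedCounts); // writes spread across all shards
```

The same simulation also hints at the cost of hashing: consecutive keys end up on different shards, so a range query over them has to visit all of them.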
Write concern and read concern
| Setting | Value | Meaning |
|---|---|---|
| writeConcern: w | "majority" | Write acknowledged by a majority of nodes: durable, survives primary failure |
| writeConcern: w | 1 | Acknowledged by primary only: faster, but risks data loss on primary crash |
| writeConcern: w | 0 | Fire-and-forget: no acknowledgement |
| readConcern | "local" | Read what's on this node (default); the data may later be rolled back |
| readConcern | "majority" | Read data that a majority has acknowledged; it will not be rolled back |
| readConcern | "linearizable" | Strongest: reflects all majority-acknowledged writes that completed before this read. Slowest. |
Multi-document ACID transactions
Since v4.0, MongoDB supports multi-document ACID transactions. Use them when you must atomically update multiple documents across collections:
const session = client.startSession();
try {
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" }
});
// Both ops succeed or both roll back
await db.collection("accounts").updateOne(
{ _id: "alice" }, { $inc: { balance: -100 } }, { session }
);
await db.collection("accounts").updateOne(
{ _id: "bob" }, { $inc: { balance: 100 } }, { session }
);
await session.commitTransaction();
} catch (err) {
await session.abortTransaction();
throw err; // surface the failure instead of silently swallowing it
} finally {
await session.endSession();
}
MongoDB transactions add significant latency: they hold locks and a storage-engine snapshot until they commit or abort, and a transaction that runs longer than 60 seconds is aborted by default (transactionLifetimeLimitSeconds). Use them sparingly, only when atomicity across multiple documents is truly required. If you're reaching for transactions everywhere, your data model is likely wrong. Consider embedding the related data instead.
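Transactions can also fail for transient reasons such as write conflicts, so callers usually retry the whole unit of work. The sketch below shows one possible retry loop in plain JavaScript. It assumes errors expose hasErrorLabel(), as MongoDB driver errors do; runWithRetry and FakeTransientError are hypothetical names, and the Node.js driver's session.withTransaction() already provides comparable retry behaviour out of the box.

```javascript
// Retry a unit of work when the failure is labelled as transient.
// runWithRetry is an illustrative helper; maxAttempts is an arbitrary default.
async function runWithRetry(work, { maxAttempts = 3 } = {}) {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await work(attempt);
    } catch (err) {
      const transient =
        typeof err?.hasErrorLabel === "function" &&
        err.hasErrorLabel("TransientTransactionError");
      // Give up on non-transient errors or once the attempt budget is spent.
      if (!transient || attempt >= maxAttempts) throw err;
    }
  }
}

// Stand-in for a driver error so the sketch runs without a database.
class FakeTransientError extends Error {
  hasErrorLabel(label) {
    return label === "TransientTransactionError";
  }
}

let calls = 0;
const outcome = runWithRetry(async () => {
  calls += 1;
  if (calls < 3) throw new FakeTransientError("write conflict");
  return "committed";
});
outcome.then((value) => console.log(value, "after", calls, "attempts"));
```

In real code the work callback would start the transaction, run the updates with the session, and commit, so that each retry begins from a clean transaction.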
Change streams
Change streams let applications subscribe to real-time data change notifications, much like CDC (Change Data Capture). They are built on top of the oplog and are resumable:
const changeStream = db.collection("orders").watch([
{ $match: { "fullDocument.status": "payment_received" } }
], { fullDocument: "updateLookup" });
changeStream.on("change", (event) => {
// Trigger fulfillment workflow
fulfillmentService.process(event.fullDocument);
});
Common use cases: event sourcing, real-time dashboards, cache invalidation, CDC pipelines to a data warehouse.
Advanced Schema Patterns (Senior)
Named patterns reference
| Pattern | Core idea | Best for |
|---|---|---|
| Polymorphic | Store different entity shapes in one collection with a type discriminator | Product catalogs, assets with common query paths |
| Computed | Pre-compute and cache expensive aggregations on the document | Read-heavy aggregates (avg rating, total reviews) |
| Subset | Embed only the N most recent/relevant sub-items; full list in a separate collection | Posts with many comments, products with many reviews |
| Outlier | Use an overflow flag + overflow collection for documents that would exceed normal bounds | Social posts by celebrities, viral content |
| Extended reference | Reference by _id but duplicate frequently-read fields to avoid $lookup on hot reads | Order line items (duplicate product name, image URL) |
| Bucket | Group time-series events into time-window documents | IoT sensor data, clickstream, metrics |
Tree structures
| Pattern | Best for |
|---|---|
| Parent reference | Simple parent lookup; listing children |
| Children references | Direct children fast; deep traversal slow |
| Array of ancestors | Fast "all ancestors" query, good for breadcrumbs |
| Materialized path | Regex prefix match on path string, simple subtree queries |
| Nested sets | Fast subtree reads; very expensive writes |
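To illustrate the materialized path row, the sketch below stores each node's full path as a comma-delimited string and builds an anchored prefix regex for subtree queries. The category data, the choice to include the node itself in its own path, and the helper names are all assumptions for this example; the filter is evaluated in memory here, but the same object could be passed to find().

```javascript
// Materialized path: each node stores its path from the root as a string.
// In this variant the path includes the node itself (an assumption of this sketch).
const categories = [
  { _id: "electronics", path: ",electronics," },
  { _id: "laptops", path: ",electronics,laptops," },
  { _id: "ultrabooks", path: ",electronics,laptops,ultrabooks," },
  { _id: "books", path: ",books," },
];

// Escape regex metacharacters so path text is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter matching a node and all of its descendants.
// The ^ anchor makes it a prefix match, which is what lets an index on path help.
function subtreeFilter(parentPath) {
  return { path: { $regex: new RegExp("^" + escapeRegex(parentPath)) } };
}

const filter = subtreeFilter(",electronics,laptops,");
const subtree = categories
  .filter((category) => filter.path.$regex.test(category.path))
  .map((category) => category._id);
console.log(subtree); // [ 'laptops', 'ultrabooks' ]
```

Moving a subtree means rewriting the path of every descendant, which is the main write-side cost of this pattern.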
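Spring Data MongoDB
import java.time.LocalDateTime;
import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.repository.MongoRepository;
import org.springframework.data.mongodb.repository.Query;
import org.springframework.stereotype.Repository;
// Address is assumed to be a plain class with a city field (not shown).
@Document(collection = "users")
public class User {
@Id private String id;
private String name;
private String email;
private int age; // needed by findByAgeGreaterThan below
private List<Address> addresses; // embedded documents
private LocalDateTime createdAt;
}
@Repository
public interface UserRepository extends MongoRepository<User, String> {
// Spring derives the query from the method name (matches addresses.city)
List<User> findByAddressesCity(String city);
// Custom MQL query; assumes user documents embed an orders array (not mapped above)
@Query("{ 'orders.status': ?0 }")
List<User> findByOrderStatus(String status);
// Paginated result
Page<User> findByAgeGreaterThan(int age, Pageable pageable);
}
🎯 Interview Questions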
Q1. What is MongoDB and how does it differ from a relational database?
MongoDB is a document-oriented NoSQL database that stores data as BSON documents in collections rather than rows in tables. Key differences: schema is flexible (documents in a collection can have different fields), relationships are handled by embedding or referencing rather than JOINs, it scales horizontally via sharding while relational databases typically scale vertically, and hierarchical data maps 1:1 to documents. The trade-off is giving up some relational guarantees for flexibility and scalability.
Q2. When would you embed data vs. reference it?
Embed when: data is always read together, the relationship is 1-to-few (bounded count), and the sub-document has no lifecycle outside the parent (e.g. order line items, address inside a user). Reference when: the relationship is 1-to-many with unbounded growth (e.g. comments on a post), the sub-document is accessed or updated independently, it's shared by multiple parents, or embedding would exceed 16 MB. The guiding principle: design for your application's access patterns, not for normalisation.
Q3. What is the aggregation pipeline and how does it work?
The aggregation pipeline transforms a collection through a series of stages, each taking the output of the previous stage as input, similar to Unix pipes. Common stages: $match (filter), $group (aggregate), $sort, $lookup (join), $unwind (explode arrays), $project (reshape). For performance, put $match first so MongoDB can use an index early.
Q4. How do indexes work in MongoDB? What is the ESR rule?
MongoDB uses B-tree indexes (plus specialised types like 2dsphere and text). An index stores a sorted structure of field values with pointers to documents, enabling O(log n) lookup instead of an O(n) collection scan. The ESR rule for compound indexes: Equality fields first (narrows the scan set most), then Sort fields (allows index-ordered traversal, avoiding an in-memory sort), then Range fields last ($gt, $in, $lt). Verify with explain("executionStats"): look for IXSCAN, not COLLSCAN.
Q5. How does MongoDB support ACID transactions? When should you use them?
Since v4.0, MongoDB supports multi-document ACID transactions across collections and (since v4.2) across shards. They use snapshot isolation and writeConcern: "majority" for durability. However, they add significant overhead: lock contention, longer latency, coordination cost. Best practice: use transactions only when you must atomically update multiple documents across collections. For most use cases, a well-designed document model (embedding related data) eliminates the need for transactions entirely, which is both faster and simpler.
Q6. What is a replica set and how does failover work?
A replica set is a group of MongoDB instances (typically 3) that keep identical data. One node is the primary (handles all writes); secondaries replicate via the oplog asynchronously. If the primary becomes unreachable, the remaining nodes hold an election: a node with a sufficiently up-to-date oplog that wins a majority of votes becomes the new primary. Failover typically takes 10 to 30 seconds. writeConcern: "majority" ensures writes are on a majority of nodes before acknowledging, so they survive primary failure without data loss.
Q7 (Senior). What is sharding and how do you choose a shard key?
Sharding horizontally partitions data across multiple replica sets. A mongos router sends each operation to the right shard based on the shard key. A good shard key must: (1) have high cardinality, with many distinct values to spread data; (2) avoid monotonic growth, since time-based keys concentrate all writes on the "last" shard; (3) match your query patterns: if most queries filter by userId, shard on userId to avoid scatter-gather. The shard key is hard to change: get it wrong and you face a costly reshard operation.
Q8 (Senior). What is the oplog and how do change streams use it?
The oplog (operations log) is a capped collection in the local database that records all write operations as idempotent entries. Secondaries tail the oplog to replicate. Change streams are a user-facing API built on the oplog: they translate raw oplog entries into user-friendly change events with a resume token. Change streams are resumable: if your consumer crashes, it restarts from the last processed token. They support pipeline filtering and deliver full document state with fullDocument: "updateLookup". Use cases: event sourcing, real-time dashboards, cache invalidation, CDC pipelines.