Database Design & Normalization
SQL vs NoSQL Decision Guideโ
Before designing a database schema, you must decide between a Relational (SQL) and Non-Relational (NoSQL) database.
| Criteria | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, strictly enforced | Flexible, dynamic document |
| Relationships | Strong (JOINs, Foreign Keys) | Weak (denormalized / embedded) |
| Transactions | Strict ACID compliance | Eventual consistency (usually) |
| Query Flexibility | High (arbitrary SQL queries) | Low (query-by-design) |
| Scaling Writes | Hard (requires complex sharding) | Built-in horizontal scaling |
| Best For | Financial, transactional, relational data | Catalogs, user content, fast sessions |
Entity-Relationship Modelingโ
An ER diagram models the domain before creating tables.
Core concepts:
- Entity: a thing with independent existence (User, Order, Product)
- Attribute: property of an entity (name, email, price)
- Relationship: association between entities (User places Order)
- Cardinality: 1:1, 1:N, M:N
Relationship Typesโ
One-to-One (1:1):
User โโโโโโโโ UserProfile (one user has one profile)
One-to-Many (1:N):
User โโโโโโ<< Order (one user has many orders)
Many-to-Many (M:N):
Order >>โโโโ<< Product (via junction table: order_items)
Junction Table (Bridge Table) for M:Nโ
CREATE TABLE order_items (
order_id BIGINT REFERENCES orders(id) ON DELETE CASCADE,
product_id BIGINT REFERENCES products(id),
quantity INT NOT NULL DEFAULT 1,
unit_price DECIMAL(10,2) NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Normal Formsโ
Normalization reduces data redundancy and prevents update anomalies.
Anomaly Types (Why Normalize?)โ
Unnormalized table: Students
| student_id | student_name | course_id | course_name | instructor |
|------------|-------------|-----------|---------------|------------|
| 1 | Alice | CS101 | Intro to CS | Prof. Lee |
| 1 | Alice | CS201 | Data Struct | Prof. Kim |
| 2 | Bob | CS101 | Intro to CS | Prof. Lee |
- Insert anomaly: can't add a course without a student
- Delete anomaly: deleting last student in a course loses course info
- Update anomaly: changing "Prof. Lee" requires updating multiple rows
First Normal Form (1NF)โ
Rule: Every column must contain atomic (indivisible) values; no repeating groups.
-- โ Violates 1NF (multi-value column)
| order_id | products |
| 1 | "apple, banana" |
-- โ
1NF compliant
| order_id | product |
| 1 | apple |
| 1 | banana |
Second Normal Form (2NF)โ
Rule: Must be in 1NF + every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies).
Only applies to tables with composite primary keys.
-- โ Violates 2NF
-- PK = (order_id, product_id), but product_name depends only on product_id
| order_id | product_id | product_name | quantity |
| 1 | P1 | Apple | 5 |
| 2 | P1 | Apple | 3 | โ product_name duplicated
-- โ
2NF: split out product info
Table: order_items (order_id, product_id, quantity) -- PK: both columns
Table: products (product_id, product_name, price) -- PK: product_id
Third Normal Form (3NF)โ
Rule: Must be in 2NF + no transitive dependencies (non-key attributes must not depend on other non-key attributes).
-- โ Violates 3NF: zip_code โ city (transitive: id โ zip_code โ city)
| customer_id | name | zip_code | city |
| 1 | Alice | 10001 | New York |
| 2 | Bob | 90210 | LA |
-- โ
3NF: separate zip/city
Table: customers (customer_id, name, zip_code)
Table: zip_codes (zip_code, city, state)
Boyce-Codd Normal Form (BCNF)โ
Rule: Stricter than 3NF. For every non-trivial functional dependency X โ Y, X must be a superkey.
Handles edge cases with overlapping candidate keys.
-- Example: professor teaches one course per room
-- (student, course) โ professor
-- (student, professor) โ course
-- professor โ course (professor only teaches one course โ violates BCNF)
Normal Forms Summaryโ
| Normal Form | Requirement |
|---|---|
| 1NF | Atomic values, no repeating groups |
| 2NF | 1NF + no partial dependencies on PK |
| 3NF | 2NF + no transitive dependencies |
| BCNF | 3NF + every determinant is a superkey |
| 4NF | BCNF + no multi-valued dependencies |
| 5NF | 4NF + no join dependencies |
In practice, 3NF is the target for OLTP. BCNF and beyond are mainly academic.
Denormalizationโ
Intentionally violating normal forms for performance.
| Technique | Description | When |
|---|---|---|
| Redundant columns | Store derived/related data in same table | Avoid expensive JOINs |
| Pre-aggregated columns | order_count, total_spent on user table | Avoid COUNT/SUM aggregations |
| Duplicate tables | Materialized views, summary tables | OLAP reporting |
| JSON columns | Embed variable attributes | Sparse/dynamic attributes |
-- Denormalized: store order count on user to avoid JOIN + COUNT
ALTER TABLE users ADD COLUMN order_count INT DEFAULT 0;
-- Keep in sync with triggers or application logic
UPDATE users SET order_count = order_count + 1 WHERE id = :userId;
Schema Design Patternsโ
Polymorphic Associationโ
One table references multiple entity types:
-- โ Complex: single table for all entity types
CREATE TABLE comments (
id BIGINT PRIMARY KEY,
body TEXT,
entity_type VARCHAR(50), -- 'post', 'video', 'product'
entity_id BIGINT -- no FK enforcement!
);
-- โ
Better: separate tables per entity
CREATE TABLE post_comments (id, post_id REFERENCES posts, body);
CREATE TABLE product_comments (id, product_id REFERENCES products, body);
Soft Deleteโ
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMP NULL;
-- "Delete"
UPDATE users SET deleted_at = NOW() WHERE id = ?;
-- Query active users
SELECT * FROM users WHERE deleted_at IS NULL;
-- Add partial index to optimize
CREATE INDEX idx_active_users ON users (id) WHERE deleted_at IS NULL;
Audit / History Tableโ
CREATE TABLE orders_history (
history_id BIGSERIAL PRIMARY KEY,
order_id BIGINT,
changed_at TIMESTAMP DEFAULT NOW(),
changed_by VARCHAR(100),
operation CHAR(1), -- 'I', 'U', 'D'
old_data JSONB,
new_data JSONB
);
Vertical Partitioningโ
Vertical partitioning splits a table by columns, separating frequently accessed "hot" columns from rarely accessed "cold" columns to optimize memory/cache usage and reduce disk I/O.
-- Hot columns (frequently accessed)
CREATE TABLE user_core (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
name VARCHAR(100),
last_login TIMESTAMPTZ
);
-- Cold columns (rarely accessed)
CREATE TABLE user_profile (
user_id BIGINT REFERENCES user_core(id),
bio TEXT,
avatar_url TEXT,
preferences JSONB
);
Table Partitioning (PostgreSQL)โ
Table partitioning splits a single logical table into multiple physical child tables based on ranges, lists, or hash keys. This is extremely useful for high-volume logs or time-series data.
-- Time-based range partitioning
CREATE TABLE events (
id BIGSERIAL,
user_id BIGINT,
event_type VARCHAR(50),
created_at TIMESTAMPTZ NOT NULL,
data JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_q1 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE events_2024_q2 PARTITION OF events
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Drop old partitions instantly (bypasses slow, locking DELETE queries)
DROP TABLE events_2023_q1;
Hierarchical Data (Adjacency List)โ
CREATE TABLE categories (
id INT PRIMARY KEY,
name VARCHAR(100),
parent_id INT REFERENCES categories(id) -- self-referential
);
-- Recursive CTE to query full tree (PostgreSQL)
WITH RECURSIVE tree AS (
SELECT id, name, parent_id, 0 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, t.depth + 1
FROM categories c JOIN tree t ON c.parent_id = t.id
)
SELECT * FROM tree ORDER BY depth, name;
Alternative patterns: Nested Sets, Closure Table, **Materialized Path`.
EAV โ Entity-Attribute-Valueโ
-- โ Flexible but querying is painful
CREATE TABLE product_attributes (
product_id BIGINT,
attr_name VARCHAR(50),
attr_value VARCHAR(200)
);
-- Finding products where color=red AND size=L requires multiple joins
Better alternative: JSONB in PostgreSQL for dynamic attributes.
JPA Entity Relationship Annotationsโ
// One-to-Many
@Entity
public class User {
@OneToMany(mappedBy = "user", cascade = CascadeType.ALL, orphanRemoval = true)
private List<Order> orders = new ArrayList<>();
}
@Entity
public class Order {
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "user_id")
private User user;
}
// Many-to-Many
@Entity
public class Order {
@ManyToMany
@JoinTable(
name = "order_items",
joinColumns = @JoinColumn(name = "order_id"),
inverseJoinColumns = @JoinColumn(name = "product_id")
)
private Set<Product> products = new HashSet<>();
}
๐ฏ Interview Questionsโ
Q1. What are the three types of anomalies that normalization prevents?
Insert anomaly: can't add data about an entity without adding another entity. Delete anomaly: deleting one entity unintentionally removes data about another. Update anomaly: the same fact is stored in multiple places and must be updated everywhere consistently.
Q2. Explain 2NF and give an example of a violation.
2NF requires no partial dependencies โ every non-key column must depend on the entire composite PK. Example violation: a table with PK (order_id, product_id) that also stores product_name. product_name depends only on product_id (partial dependency). Fix: move product_name to a separate products table.
Q3. What is the difference between 3NF and BCNF?
3NF allows a non-key attribute to determine another non-key attribute if the determinant is a candidate key. BCNF is stricter: every determinant must be a superkey. Most violations of BCNF (beyond 3NF) involve tables with multiple overlapping candidate keys โ relatively rare in practice.
Q4. When should you denormalize a database?
Denormalize when JOINs become a measurable performance bottleneck; for pre-computed aggregates on frequently queried summary data; for OLAP/reporting workloads where read performance matters more than write consistency; or when the data model needs to match a specific access pattern (e.g., Cassandra requires denormalization by design).
Q5. What is the EAV (Entity-Attribute-Value) pattern and what are its problems?
EAV stores arbitrary key-value pairs as rows rather than columns โ allows dynamic attributes without schema changes. Problems: no data type enforcement, complex querying (pivoting EAV into columns requires many JOINs), poor query performance, no FK constraints. Better alternatives: JSONB columns in PostgreSQL, or a proper document store like MongoDB.
Q6. How do you model a many-to-many relationship in a relational database?
Create a junction (bridge) table that holds FKs from both entities as a composite PK. The junction table can carry additional attributes of the relationship (e.g., quantity, unit_price in order_items). In JPA, use
@ManyToManywith@JoinTable, or better โ model the junction as its own@Entityif it has attributes.
Q7. What is a soft delete and what are its trade-offs?
Soft delete marks records as deleted (deleted_at timestamp) rather than removing them. Pros: preserves history, supports undo. Cons: all queries must include
WHERE deleted_at IS NULL; indexes become less effective unless partial indexes are used; related data integrity must be manually managed. Consider using an audit/history table instead.
Q8. What are the options for storing hierarchical data in a relational database?
Adjacency list (parent_id FK): simple, but requires recursive CTEs for deep queries. Closure table: stores all ancestor-descendant pairs; fast queries, expensive inserts. Materialized path: stores path as string (
/1/5/12/); easy queries with LIKE, but updates are expensive. Nested sets: left/right boundary values; very fast reads, complex writes. Choice depends on read/write ratio and tree depth.
Advanced Editorial Pass: Schema Design as Evolution Strategyโ
Senior Engineering Focusโ
- Design schemas for change frequency and ownership boundaries.
- Balance normalization against read-path simplicity and cost.
- Encode intent through constraints, naming, and relationship clarity.
Failure Modes to Anticipateโ
- Schema drift from undocumented business rule changes.
- Accidental coupling between unrelated bounded contexts.
- Migration pain from ambiguous key and relationship choices.
Practical Heuristicsโ
- Version schema decisions with rationale.
- Use migration rehearsals on realistic snapshots.
- Keep bounded context boundaries explicit in data ownership.