Skip to main content

Database Design & Normalization

SQL vs NoSQL Decision Guideโ€‹

Before designing a database schema, you must decide between a Relational (SQL) and Non-Relational (NoSQL) database.

CriteriaSQLNoSQL
SchemaFixed, strictly enforcedFlexible, dynamic document
RelationshipsStrong (JOINs, Foreign Keys)Weak (denormalized / embedded)
TransactionsStrict ACID complianceEventual consistency (usually)
Query FlexibilityHigh (arbitrary SQL queries)Low (query-by-design)
Scaling WritesHard (requires complex sharding)Built-in horizontal scaling
Best ForFinancial, transactional, relational dataCatalogs, user content, fast sessions

Entity-Relationship Modelingโ€‹

An ER diagram models the domain before creating tables.

Core concepts:

  • Entity: a thing with independent existence (User, Order, Product)
  • Attribute: property of an entity (name, email, price)
  • Relationship: association between entities (User places Order)
  • Cardinality: 1:1, 1:N, M:N

Relationship Typesโ€‹

One-to-One (1:1):
User โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ UserProfile (one user has one profile)

One-to-Many (1:N):
User โ”€โ”€โ”€โ”€โ”€โ”€<< Order (one user has many orders)

Many-to-Many (M:N):
Order >>โ”€โ”€โ”€โ”€<< Product (via junction table: order_items)

Junction Table (Bridge Table) for M:Nโ€‹

CREATE TABLE order_items (
order_id BIGINT REFERENCES orders(id) ON DELETE CASCADE,
product_id BIGINT REFERENCES products(id),
quantity INT NOT NULL DEFAULT 1,
unit_price DECIMAL(10,2) NOT NULL,
PRIMARY KEY (order_id, product_id)
);

Normal Formsโ€‹

Normalization reduces data redundancy and prevents update anomalies.

Anomaly Types (Why Normalize?)โ€‹

Unnormalized table: Students
| student_id | student_name | course_id | course_name | instructor |
|------------|-------------|-----------|---------------|------------|
| 1 | Alice | CS101 | Intro to CS | Prof. Lee |
| 1 | Alice | CS201 | Data Struct | Prof. Kim |
| 2 | Bob | CS101 | Intro to CS | Prof. Lee |
  • Insert anomaly: can't add a course without a student
  • Delete anomaly: deleting last student in a course loses course info
  • Update anomaly: changing "Prof. Lee" requires updating multiple rows

First Normal Form (1NF)โ€‹

Rule: Every column must contain atomic (indivisible) values; no repeating groups.

-- โŒ Violates 1NF (multi-value column)
| order_id | products |
| 1 | "apple, banana" |

-- โœ… 1NF compliant
| order_id | product |
| 1 | apple |
| 1 | banana |

Second Normal Form (2NF)โ€‹

Rule: Must be in 1NF + every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies).

Only applies to tables with composite primary keys.

-- โŒ Violates 2NF
-- PK = (order_id, product_id), but product_name depends only on product_id
| order_id | product_id | product_name | quantity |
| 1 | P1 | Apple | 5 |
| 2 | P1 | Apple | 3 | โ† product_name duplicated

-- โœ… 2NF: split out product info
Table: order_items (order_id, product_id, quantity) -- PK: both columns
Table: products (product_id, product_name, price) -- PK: product_id

Third Normal Form (3NF)โ€‹

Rule: Must be in 2NF + no transitive dependencies (non-key attributes must not depend on other non-key attributes).

-- โŒ Violates 3NF: zip_code โ†’ city (transitive: id โ†’ zip_code โ†’ city)
| customer_id | name | zip_code | city |
| 1 | Alice | 10001 | New York |
| 2 | Bob | 90210 | LA |

-- โœ… 3NF: separate zip/city
Table: customers (customer_id, name, zip_code)
Table: zip_codes (zip_code, city, state)

Boyce-Codd Normal Form (BCNF)โ€‹

Rule: Stricter than 3NF. For every non-trivial functional dependency X โ†’ Y, X must be a superkey.

Handles edge cases with overlapping candidate keys.

-- Example: professor teaches one course per room
-- (student, course) โ†’ professor
-- (student, professor) โ†’ course
-- professor โ†’ course (professor only teaches one course โ€” violates BCNF)

Normal Forms Summaryโ€‹

Normal FormRequirement
1NFAtomic values, no repeating groups
2NF1NF + no partial dependencies on PK
3NF2NF + no transitive dependencies
BCNF3NF + every determinant is a superkey
4NFBCNF + no multi-valued dependencies
5NF4NF + no join dependencies

In practice, 3NF is the target for OLTP. BCNF and beyond are mainly academic.


Denormalizationโ€‹

Intentionally violating normal forms for performance.

TechniqueDescriptionWhen
Redundant columnsStore derived/related data in same tableAvoid expensive JOINs
Pre-aggregated columnsorder_count, total_spent on user tableAvoid COUNT/SUM aggregations
Duplicate tablesMaterialized views, summary tablesOLAP reporting
JSON columnsEmbed variable attributesSparse/dynamic attributes
-- Denormalized: store order count on user to avoid JOIN + COUNT
ALTER TABLE users ADD COLUMN order_count INT DEFAULT 0;

-- Keep in sync with triggers or application logic
UPDATE users SET order_count = order_count + 1 WHERE id = :userId;
Denormalization moves consistency responsibility to the application. Always have a clear update strategy.

Schema Design Patternsโ€‹

Polymorphic Associationโ€‹

One table references multiple entity types:

-- โŒ Complex: single table for all entity types
CREATE TABLE comments (
id BIGINT PRIMARY KEY,
body TEXT,
entity_type VARCHAR(50), -- 'post', 'video', 'product'
entity_id BIGINT -- no FK enforcement!
);

-- โœ… Better: separate tables per entity
CREATE TABLE post_comments (id, post_id REFERENCES posts, body);
CREATE TABLE product_comments (id, product_id REFERENCES products, body);

Soft Deleteโ€‹

ALTER TABLE users ADD COLUMN deleted_at TIMESTAMP NULL;

-- "Delete"
UPDATE users SET deleted_at = NOW() WHERE id = ?;

-- Query active users
SELECT * FROM users WHERE deleted_at IS NULL;

-- Add partial index to optimize
CREATE INDEX idx_active_users ON users (id) WHERE deleted_at IS NULL;

Audit / History Tableโ€‹

CREATE TABLE orders_history (
history_id BIGSERIAL PRIMARY KEY,
order_id BIGINT,
changed_at TIMESTAMP DEFAULT NOW(),
changed_by VARCHAR(100),
operation CHAR(1), -- 'I', 'U', 'D'
old_data JSONB,
new_data JSONB
);

Vertical Partitioningโ€‹

Vertical partitioning splits a table by columns, separating frequently accessed "hot" columns from rarely accessed "cold" columns to optimize memory/cache usage and reduce disk I/O.

-- Hot columns (frequently accessed)
CREATE TABLE user_core (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
name VARCHAR(100),
last_login TIMESTAMPTZ
);

-- Cold columns (rarely accessed)
CREATE TABLE user_profile (
user_id BIGINT REFERENCES user_core(id),
bio TEXT,
avatar_url TEXT,
preferences JSONB
);

Table Partitioning (PostgreSQL)โ€‹

Table partitioning splits a single logical table into multiple physical child tables based on ranges, lists, or hash keys. This is extremely useful for high-volume logs or time-series data.

-- Time-based range partitioning
CREATE TABLE events (
id BIGSERIAL,
user_id BIGINT,
event_type VARCHAR(50),
created_at TIMESTAMPTZ NOT NULL,
data JSONB
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_q1 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE events_2024_q2 PARTITION OF events
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Drop old partitions instantly (bypasses slow, locking DELETE queries)
DROP TABLE events_2023_q1;

Hierarchical Data (Adjacency List)โ€‹

CREATE TABLE categories (
id INT PRIMARY KEY,
name VARCHAR(100),
parent_id INT REFERENCES categories(id) -- self-referential
);

-- Recursive CTE to query full tree (PostgreSQL)
WITH RECURSIVE tree AS (
SELECT id, name, parent_id, 0 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, t.depth + 1
FROM categories c JOIN tree t ON c.parent_id = t.id
)
SELECT * FROM tree ORDER BY depth, name;

Alternative patterns: Nested Sets, Closure Table, **Materialized Path`.

EAV โ€” Entity-Attribute-Valueโ€‹

-- โŒ Flexible but querying is painful
CREATE TABLE product_attributes (
product_id BIGINT,
attr_name VARCHAR(50),
attr_value VARCHAR(200)
);
-- Finding products where color=red AND size=L requires multiple joins

Better alternative: JSONB in PostgreSQL for dynamic attributes.


JPA Entity Relationship Annotationsโ€‹

// One-to-Many
@Entity
public class User {
@OneToMany(mappedBy = "user", cascade = CascadeType.ALL, orphanRemoval = true)
private List<Order> orders = new ArrayList<>();
}

@Entity
public class Order {
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "user_id")
private User user;
}

// Many-to-Many
@Entity
public class Order {
@ManyToMany
@JoinTable(
name = "order_items",
joinColumns = @JoinColumn(name = "order_id"),
inverseJoinColumns = @JoinColumn(name = "product_id")
)
private Set<Product> products = new HashSet<>();
}

๐ŸŽฏ Interview Questionsโ€‹

Q1. What are the three types of anomalies that normalization prevents?

Insert anomaly: can't add data about an entity without adding another entity. Delete anomaly: deleting one entity unintentionally removes data about another. Update anomaly: the same fact is stored in multiple places and must be updated everywhere consistently.

Q2. Explain 2NF and give an example of a violation.

2NF requires no partial dependencies โ€” every non-key column must depend on the entire composite PK. Example violation: a table with PK (order_id, product_id) that also stores product_name. product_name depends only on product_id (partial dependency). Fix: move product_name to a separate products table.

Q3. What is the difference between 3NF and BCNF?

3NF allows a non-key attribute to determine another non-key attribute if the determinant is a candidate key. BCNF is stricter: every determinant must be a superkey. Most violations of BCNF (beyond 3NF) involve tables with multiple overlapping candidate keys โ€” relatively rare in practice.

Q4. When should you denormalize a database?

Denormalize when JOINs become a measurable performance bottleneck; for pre-computed aggregates on frequently queried summary data; for OLAP/reporting workloads where read performance matters more than write consistency; or when the data model needs to match a specific access pattern (e.g., Cassandra requires denormalization by design).

Q5. What is the EAV (Entity-Attribute-Value) pattern and what are its problems?

EAV stores arbitrary key-value pairs as rows rather than columns โ€” allows dynamic attributes without schema changes. Problems: no data type enforcement, complex querying (pivoting EAV into columns requires many JOINs), poor query performance, no FK constraints. Better alternatives: JSONB columns in PostgreSQL, or a proper document store like MongoDB.

Q6. How do you model a many-to-many relationship in a relational database?

Create a junction (bridge) table that holds FKs from both entities as a composite PK. The junction table can carry additional attributes of the relationship (e.g., quantity, unit_price in order_items). In JPA, use @ManyToMany with @JoinTable, or better โ€” model the junction as its own @Entity if it has attributes.

Q7. What is a soft delete and what are its trade-offs?

Soft delete marks records as deleted (deleted_at timestamp) rather than removing them. Pros: preserves history, supports undo. Cons: all queries must include WHERE deleted_at IS NULL; indexes become less effective unless partial indexes are used; related data integrity must be manually managed. Consider using an audit/history table instead.

Q8. What are the options for storing hierarchical data in a relational database?

Adjacency list (parent_id FK): simple, but requires recursive CTEs for deep queries. Closure table: stores all ancestor-descendant pairs; fast queries, expensive inserts. Materialized path: stores path as string (/1/5/12/); easy queries with LIKE, but updates are expensive. Nested sets: left/right boundary values; very fast reads, complex writes. Choice depends on read/write ratio and tree depth.


Advanced Editorial Pass: Schema Design as Evolution Strategyโ€‹

Senior Engineering Focusโ€‹

  • Design schemas for change frequency and ownership boundaries.
  • Balance normalization against read-path simplicity and cost.
  • Encode intent through constraints, naming, and relationship clarity.

Failure Modes to Anticipateโ€‹

  • Schema drift from undocumented business rule changes.
  • Accidental coupling between unrelated bounded contexts.
  • Migration pain from ambiguous key and relationship choices.

Practical Heuristicsโ€‹

  1. Version schema decisions with rationale.
  2. Use migration rehearsals on realistic snapshots.
  3. Keep bounded context boundaries explicit in data ownership.

Compare Nextโ€‹