File Systems & I/O

File System Concepts

What is a File System?

A file system organizes data on storage devices, providing:

Naming: Human-readable names mapped to data locations.
Hierarchy: Directories and subdirectories.
Metadata: Permissions, timestamps, size, ownership.
Access control: Read/write/execute permissions.

File Types

Type	Description	Linux Indicator
Regular file	Data (text, binary)	`-`
Directory	Contains entries (name → inode)	`d`
Symbolic link	Pointer to another path	`l`
Hard link	Another name for same inode	(same `-`)
Block device	Disk, SSD (block I/O)	`b`
Character device	Terminal, keyboard (stream I/O)	`c`
Named pipe (FIFO)	IPC mechanism	`p`
Socket	Network/IPC endpoint	`s`

File Access Methods

Method	Description	Use Case
Sequential	Read/write in order	Logs, streaming
Direct (Random)	Access any record by position	Databases
Memory-mapped	Map file into address space	Large files, shared memory
Index-based	Index points to data blocks	Database indexes

Disk Structure & Geometry

Disk:
┌──────────────────────┐
│  Track 0  ──────────── ◄── Cylinders (stack of tracks)
│  Track 1
│  ...
│  Track N
└──────────────────────┘

Sector = smallest unit (512B or 4096B)
Block  = OS unit (multiple sectors, typically 4KB)

Disk Access Time

Total time = Seek Time + Rotational Latency + Transfer Time

HDD:
  Seek:     1–15ms (avg ~5ms)
  Rotation: 0–8ms  (avg ~4ms for 7200 RPM)
  Transfer: <1ms
  Total avg: ~10ms

SSD (NVMe):
  Seek: ~0.1ms
  Read latency: 20–100µs
  Write: 100–500µs

Disk Scheduling Algorithms

1. FCFS (First Come First Served)

Requests served in arrival order. Fair but high seek time.

2. SSTF (Shortest Seek Time First)

Service request closest to current head position. Low seek time but starvation of far requests.

3. SCAN (Elevator Algorithm)

Head moves in one direction, services requests, reaches end, reverses. Like an elevator.

Head at 53, requests: 98,183,37,122,14,124,65,67
Direction: ↑
53→65→67→98→122→124→183 (end)→183→124→...→14

Better than FCFS; no starvation

4. C-SCAN (Circular SCAN)

Like SCAN but only services in one direction; jumps to start after reaching end. More uniform wait times than SCAN.

5. C-LOOK

Like C-SCAN but only goes as far as the last request (doesn't go to disk end).

Used in practice: Linux uses a deadline scheduler (for HDDs) or mq-deadline, bfq, kyber (for NVMe).

File System Implementation

Inode (Index Node)

An inode stores all file metadata except the filename:

Inode:
  - File type, permissions
  - Owner UID, GID
  - Size in bytes
  - Timestamps: atime, mtime, ctime
  - Link count
  - Block pointers:
      Direct blocks: 12 × block address
      Single indirect: → block of addresses
      Double indirect: → block → block of addresses
      Triple indirect: → block → block → block of addresses

stat /etc/passwd          # Show inode info
ls -i /etc/passwd         # Show inode number

Directory Structure

A directory is a file containing (name → inode number) mappings.

Directory entry:
[ inode_number (4B) | name_length (1B) | file_type (1B) | name (variable) ]

Hard Links vs Symbolic Links

	Hard Link	Symbolic Link
Points to	Inode	Path string
Across filesystems	No	Yes
If original deleted	Still works	Dangling link
Works on directories	No	Yes
`ls -l` shows	Same size as original	Link path

ln source hardlink         # Hard link
ln -s source symlink       # Symbolic link

Free Space Management

Method	Description
Bit vector (bitmap)	1 bit per block; fast to find contiguous free blocks
Linked list	Free blocks chained together; no random access
Grouping	Linked list with groups of n free block addresses
Counting	Run-length encoding of free block runs

Common File Systems

ext4 (Linux default)

Uses extents (contiguous block ranges) for efficiency.
Journaling: Logs metadata changes before applying → crash recovery.
Max file size: 16TB; max FS size: 1EB.
Features: delayed allocation, multiblock allocation.

XFS

High-performance, designed for large files and parallel I/O.
Uses B+ tree for extent tracking.
Default on RHEL/CentOS.

Btrfs

Copy-on-write (CoW) filesystem.
Built-in snapshots, RAID, checksums, compression.
Subvolumes.

FAT32 / exFAT

Simple File Allocation Table.
exFAT: used for flash drives, cross-platform compatible.
No permissions, no journaling.

NTFS (Windows)

Master File Table (MFT) instead of inodes.
Journaling, ACLs, compression, encryption.

Journaling

Prevents file system corruption on crash by recording changes to a journal (log) before applying them.

Modes (ext4)

Mode	What's journaled	Speed	Safety
`writeback`	Metadata only (no order guarantee)	Fastest	Least safe
`ordered`	Metadata + data written before metadata commit	Default	Good
`journal`	Both metadata and data	Slowest	Safest

Log-Structured File System (LFS)

All writes go to a sequential log. No seeks on write. Garbage collection reclaims old segments. Used in SSDs, JFFS2, F2FS.

RAID (Redundant Array of Independent Disks)

Level	Description	Redundancy	Speed	Min Disks
RAID 0	Striping	None	↑↑ Read & Write	2
RAID 1	Mirroring	Full copy	↑ Read	2
RAID 5	Striping + distributed parity	1 disk	↑ Read	3
RAID 6	Striping + 2 parity	2 disks	↑ Read	4
RAID 10	RAID 1 + RAID 0	Full copy	↑↑ Both	4

Virtual File System (VFS)

VFS provides a uniform interface regardless of underlying filesystem type.

Application (open/read/write/close)
       ↓
   VFS Layer (vfs_open, vfs_read, ...)
       ↓
┌───────────┬──────────┬─────────────┐
│  ext4     │   xfs    │  tmpfs/proc │
└───────────┴──────────┴─────────────┘

VFS abstractions: superblock, inode, dentry (directory entry), file.

`/proc` and `/sys` Filesystems

/proc: Virtual filesystem exposing kernel and process information as files.
/sys (sysfs): Exposes kernel objects (devices, drivers).

cat /proc/cpuinfo          # CPU info
cat /proc/meminfo          # Memory stats
cat /proc/net/tcp          # TCP connections
ls /proc/1234/             # Process 1234 info

I/O Subsystem

I/O Methods

Method	Description	CPU Use
Programmed I/O (polling)	CPU busy-waits for I/O to complete	100%
Interrupt-driven I/O	CPU does other work; device raises interrupt	Low
DMA (Direct Memory Access)	Device transfers data directly to RAM	Minimal

I/O Scheduling

Linux block I/O schedulers:

mq-deadline: Deadline-based scheduling, prevents starvation.
BFQ (Budget Fair Queuing): Fair bandwidth, good for interactive use.
Kyber: Low-latency for NVMe.
None: No reordering (fastest for NVMe where seek time is negligible).

cat /sys/block/sda/queue/scheduler    # Current scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler  # Change it

Page Cache

OS caches disk reads in RAM (page cache). Writes are buffered (write-back). Dramatically reduces I/O.

sync                        # Flush dirty pages to disk
echo 3 > /proc/sys/vm/drop_caches  # Drop page cache (for benchmarking)

Java I/O & NIO

Traditional I/O (java.io)

// Blocking, stream-based
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
    String line;
    while ((line = br.readLine()) != null) { process(line); }
}

Java NIO (Non-blocking I/O)

// Channel + Buffer model
Path path = Paths.get("file.txt");
try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
    ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
    while (channel.read(buffer) > 0) {
        buffer.flip();
        // process buffer
        buffer.clear();
    }
}

Java NIO Selector (Multiplexed I/O)

Selector selector = Selector.open();
ServerSocketChannel server = ServerSocketChannel.open();
server.configureBlocking(false);
server.register(selector, SelectionKey.OP_ACCEPT);

while (true) {
    selector.select();
    Set<SelectionKey> keys = selector.selectedKeys();
    for (SelectionKey key : keys) {
        if (key.isAcceptable()) { /* accept */ }
        if (key.isReadable())   { /* read */   }
    }
    keys.clear();
}

Memory-Mapped Files (Java)

try (FileChannel fc = FileChannel.open(path)) {
    MappedByteBuffer buf = fc.map(
        FileChannel.MapMode.READ_ONLY, 0, fc.size());
    // Direct memory access — no copy to JVM heap
    byte b = buf.get(1000);
}

Common Interview Questions

Q1: What is the difference between a hard link and a symbolic link?

Hard link: another name pointing to the same inode — deleting the original doesn't affect it; can't cross filesystems; can't link directories. Symbolic link: a file containing a path string — if the target is deleted, the link is broken; can cross filesystems.

Q2: How does journaling prevent file system corruption?

Journaling writes a record of intended changes to a journal log before modifying the actual file system structures. On crash recovery, the OS replays or discards incomplete journal entries, ensuring the file system remains consistent.

Q3: What is the difference between `read()` and `mmap()`?

read() copies data from kernel page cache into a user-space buffer (two copies: disk→page cache, page cache→user buffer). mmap() maps the page cache directly into the process address space — access requires only one copy. For large files or shared access, mmap() is significantly faster.

Q4: What happens when you `delete` a file on Linux?

unlink() decrements the inode's link count. When it reaches 0 AND no process has the file open, the inode and data blocks are freed. If a process has it open, the data remains accessible until the last close().

Q5: What is a buffer cache and a page cache?

The page cache (modern Linux) caches file data in memory pages. The old buffer cache cached raw disk blocks. Since Linux 2.4, they are unified: file I/O goes through the page cache, which also serves as the buffer cache.

Q6: What is `fsync()` and when should you call it?

fsync(fd) flushes dirty pages for a file to disk, blocking until complete. Call it when you need durable writes (e.g., database transaction commit). Without it, a crash can lose recently written data still in the page cache.

Q7: What is RAID 5 and what are its limitations?

RAID 5 stripes data with distributed parity across N disks — can tolerate one disk failure. Limitation: during rebuild (after a failure), every block is read from all remaining disks. If a second disk fails during rebuild, all data is lost. For critical data, prefer RAID 6 (tolerates 2 failures) or RAID 10.

Q8: What is the difference between blocking and non-blocking I/O?

Blocking I/O: the calling thread blocks until the I/O operation completes (e.g., read() waits for data). Non-blocking I/O: the call returns immediately with EAGAIN/EWOULDBLOCK if data isn't ready. Non-blocking I/O with select()/poll()/epoll() is the basis for high-performance event-driven servers (Netty, Node.js).

File System Concepts​

What is a File System?​

File Types​

File Access Methods​

Disk Structure & Geometry​

Disk Access Time​

Disk Scheduling Algorithms​

1. FCFS (First Come First Served)​

2. SSTF (Shortest Seek Time First)​

3. SCAN (Elevator Algorithm)​

4. C-SCAN (Circular SCAN)​

5. C-LOOK​

File System Implementation​

Inode (Index Node)​

Directory Structure​

Hard Links vs Symbolic Links​

Free Space Management​

Common File Systems​

ext4 (Linux default)​

XFS​

Btrfs​

FAT32 / exFAT​

NTFS (Windows)​

Journaling​

Modes (ext4)​

Log-Structured File System (LFS)​

RAID (Redundant Array of Independent Disks)​

Virtual File System (VFS)​

/proc and /sys Filesystems​

I/O Subsystem​

I/O Methods​

I/O Scheduling​

Page Cache​

Java I/O & NIO​

Traditional I/O (java.io)​

Java NIO (Non-blocking I/O)​

Java NIO Selector (Multiplexed I/O)​

Memory-Mapped Files (Java)​

Common Interview Questions​

Q1: What is the difference between a hard link and a symbolic link?​

Q2: How does journaling prevent file system corruption?​

Q3: What is the difference between read() and mmap()?​

Q4: What happens when you delete a file on Linux?​

Q5: What is a buffer cache and a page cache?​

Q6: What is fsync() and when should you call it?​

Q7: What is RAID 5 and what are its limitations?​

Q8: What is the difference between blocking and non-blocking I/O?​

Advanced Editorial Pass: File System and I/O Path Engineering​

Senior Engineering Focus​

Failure Modes to Anticipate​

Practical Heuristics​

Compare Next​