Skip to main content

File Systems & I/O

File System Concepts​

What is a File System?​

A file system organizes data on storage devices, providing:

  • Naming: Human-readable names mapped to data locations.
  • Hierarchy: Directories and subdirectories.
  • Metadata: Permissions, timestamps, size, ownership.
  • Access control: Read/write/execute permissions.

File Types​

TypeDescriptionLinux Indicator
Regular fileData (text, binary)-
DirectoryContains entries (name β†’ inode)d
Symbolic linkPointer to another pathl
Hard linkAnother name for same inode(same -)
Block deviceDisk, SSD (block I/O)b
Character deviceTerminal, keyboard (stream I/O)c
Named pipe (FIFO)IPC mechanismp
SocketNetwork/IPC endpoints

File Access Methods​

MethodDescriptionUse Case
SequentialRead/write in orderLogs, streaming
Direct (Random)Access any record by positionDatabases
Memory-mappedMap file into address spaceLarge files, shared memory
Index-basedIndex points to data blocksDatabase indexes

Disk Structure & Geometry​

Disk:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Track 0 ──────────── ◄── Cylinders (stack of tracks)
β”‚ Track 1
β”‚ ...
β”‚ Track N
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Sector = smallest unit (512B or 4096B)
Block = OS unit (multiple sectors, typically 4KB)

Disk Access Time​

Total time = Seek Time + Rotational Latency + Transfer Time

HDD:
Seek: 1–15ms (avg ~5ms)
Rotation: 0–8ms (avg ~4ms for 7200 RPM)
Transfer: <1ms
Total avg: ~10ms

SSD (NVMe):
Seek: ~0.1ms
Read latency: 20–100Β΅s
Write: 100–500Β΅s

Disk Scheduling Algorithms​

1. FCFS (First Come First Served)​

Requests served in arrival order. Fair but high seek time.

2. SSTF (Shortest Seek Time First)​

Service request closest to current head position. Low seek time but starvation of far requests.

3. SCAN (Elevator Algorithm)​

Head moves in one direction, services requests, reaches end, reverses. Like an elevator.

Head at 53, requests: 98,183,37,122,14,124,65,67
Direction: ↑
53β†’65β†’67β†’98β†’122β†’124β†’183 (end)β†’183β†’124β†’...β†’14

Better than FCFS; no starvation

4. C-SCAN (Circular SCAN)​

Like SCAN but only services in one direction; jumps to start after reaching end. More uniform wait times than SCAN.

5. C-LOOK​

Like C-SCAN but only goes as far as the last request (doesn't go to disk end).

Used in practice: Linux uses a deadline scheduler (for HDDs) or mq-deadline, bfq, kyber (for NVMe).


File System Implementation​

Inode (Index Node)​

An inode stores all file metadata except the filename:

Inode:
- File type, permissions
- Owner UID, GID
- Size in bytes
- Timestamps: atime, mtime, ctime
- Link count
- Block pointers:
Direct blocks: 12 Γ— block address
Single indirect: β†’ block of addresses
Double indirect: β†’ block β†’ block of addresses
Triple indirect: β†’ block β†’ block β†’ block of addresses
stat /etc/passwd # Show inode info
ls -i /etc/passwd # Show inode number

Directory Structure​

A directory is a file containing (name β†’ inode number) mappings.

Directory entry:
[ inode_number (4B) | name_length (1B) | file_type (1B) | name (variable) ]
Hard LinkSymbolic Link
Points toInodePath string
Across filesystemsNoYes
If original deletedStill worksDangling link
Works on directoriesNoYes
ls -l showsSame size as originalLink path
ln source hardlink # Hard link
ln -s source symlink # Symbolic link

Free Space Management​

MethodDescription
Bit vector (bitmap)1 bit per block; fast to find contiguous free blocks
Linked listFree blocks chained together; no random access
GroupingLinked list with groups of n free block addresses
CountingRun-length encoding of free block runs

Common File Systems​

ext4 (Linux default)​

  • Uses extents (contiguous block ranges) for efficiency.
  • Journaling: Logs metadata changes before applying β†’ crash recovery.
  • Max file size: 16TB; max FS size: 1EB.
  • Features: delayed allocation, multiblock allocation.

XFS​

  • High-performance, designed for large files and parallel I/O.
  • Uses B+ tree for extent tracking.
  • Default on RHEL/CentOS.

Btrfs​

  • Copy-on-write (CoW) filesystem.
  • Built-in snapshots, RAID, checksums, compression.
  • Subvolumes.

FAT32 / exFAT​

  • Simple File Allocation Table.
  • exFAT: used for flash drives, cross-platform compatible.
  • No permissions, no journaling.

NTFS (Windows)​

  • Master File Table (MFT) instead of inodes.
  • Journaling, ACLs, compression, encryption.

Journaling​

Prevents file system corruption on crash by recording changes to a journal (log) before applying them.

Modes (ext4)​

ModeWhat's journaledSpeedSafety
writebackMetadata only (no order guarantee)FastestLeast safe
orderedMetadata + data written before metadata commitDefaultGood
journalBoth metadata and dataSlowestSafest

Log-Structured File System (LFS)​

All writes go to a sequential log. No seeks on write. Garbage collection reclaims old segments. Used in SSDs, JFFS2, F2FS.


RAID (Redundant Array of Independent Disks)​

LevelDescriptionRedundancySpeedMin Disks
RAID 0StripingNone↑↑ Read & Write2
RAID 1MirroringFull copy↑ Read2
RAID 5Striping + distributed parity1 disk↑ Read3
RAID 6Striping + 2 parity2 disks↑ Read4
RAID 10RAID 1 + RAID 0Full copy↑↑ Both4

Virtual File System (VFS)​

VFS provides a uniform interface regardless of underlying filesystem type.

Application (open/read/write/close)
↓
VFS Layer (vfs_open, vfs_read, ...)
↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ext4 β”‚ xfs β”‚ tmpfs/proc β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

VFS abstractions: superblock, inode, dentry (directory entry), file.

/proc and /sys Filesystems​

  • /proc: Virtual filesystem exposing kernel and process information as files.
  • /sys (sysfs): Exposes kernel objects (devices, drivers).
cat /proc/cpuinfo # CPU info
cat /proc/meminfo # Memory stats
cat /proc/net/tcp # TCP connections
ls /proc/1234/ # Process 1234 info

I/O Subsystem​

I/O Methods​

MethodDescriptionCPU Use
Programmed I/O (polling)CPU busy-waits for I/O to complete100%
Interrupt-driven I/OCPU does other work; device raises interruptLow
DMA (Direct Memory Access)Device transfers data directly to RAMMinimal

I/O Scheduling​

Linux block I/O schedulers:

  • mq-deadline: Deadline-based scheduling, prevents starvation.
  • BFQ (Budget Fair Queuing): Fair bandwidth, good for interactive use.
  • Kyber: Low-latency for NVMe.
  • None: No reordering (fastest for NVMe where seek time is negligible).
cat /sys/block/sda/queue/scheduler # Current scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler # Change it

Page Cache​

OS caches disk reads in RAM (page cache). Writes are buffered (write-back). Dramatically reduces I/O.

sync # Flush dirty pages to disk
echo 3 > /proc/sys/vm/drop_caches # Drop page cache (for benchmarking)

Java I/O & NIO​

Traditional I/O (java.io)​

// Blocking, stream-based
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
String line;
while ((line = br.readLine()) != null) { process(line); }
}

Java NIO (Non-blocking I/O)​

// Channel + Buffer model
Path path = Paths.get("file.txt");
try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
while (channel.read(buffer) > 0) {
buffer.flip();
// process buffer
buffer.clear();
}
}

Java NIO Selector (Multiplexed I/O)​

Selector selector = Selector.open();
ServerSocketChannel server = ServerSocketChannel.open();
server.configureBlocking(false);
server.register(selector, SelectionKey.OP_ACCEPT);

while (true) {
selector.select();
Set<SelectionKey> keys = selector.selectedKeys();
for (SelectionKey key : keys) {
if (key.isAcceptable()) { /* accept */ }
if (key.isReadable()) { /* read */ }
}
keys.clear();
}

Memory-Mapped Files (Java)​

try (FileChannel fc = FileChannel.open(path)) {
MappedByteBuffer buf = fc.map(
FileChannel.MapMode.READ_ONLY, 0, fc.size());
// Direct memory access β€” no copy to JVM heap
byte b = buf.get(1000);
}

Common Interview Questions​

Hard link: another name pointing to the same inode β€” deleting the original doesn't affect it; can't cross filesystems; can't link directories. Symbolic link: a file containing a path string β€” if the target is deleted, the link is broken; can cross filesystems.

Q2: How does journaling prevent file system corruption?​

Journaling writes a record of intended changes to a journal log before modifying the actual file system structures. On crash recovery, the OS replays or discards incomplete journal entries, ensuring the file system remains consistent.

Q3: What is the difference between read() and mmap()?​

read() copies data from kernel page cache into a user-space buffer (two copies: disk→page cache, page cache→user buffer). mmap() maps the page cache directly into the process address space — access requires only one copy. For large files or shared access, mmap() is significantly faster.

Q4: What happens when you delete a file on Linux?​

unlink() decrements the inode's link count. When it reaches 0 AND no process has the file open, the inode and data blocks are freed. If a process has it open, the data remains accessible until the last close().

Q5: What is a buffer cache and a page cache?​

The page cache (modern Linux) caches file data in memory pages. The old buffer cache cached raw disk blocks. Since Linux 2.4, they are unified: file I/O goes through the page cache, which also serves as the buffer cache.

Q6: What is fsync() and when should you call it?​

fsync(fd) flushes dirty pages for a file to disk, blocking until complete. Call it when you need durable writes (e.g., database transaction commit). Without it, a crash can lose recently written data still in the page cache.

Q7: What is RAID 5 and what are its limitations?​

RAID 5 stripes data with distributed parity across N disks β€” can tolerate one disk failure. Limitation: during rebuild (after a failure), every block is read from all remaining disks. If a second disk fails during rebuild, all data is lost. For critical data, prefer RAID 6 (tolerates 2 failures) or RAID 10.

Q8: What is the difference between blocking and non-blocking I/O?​

Blocking I/O: the calling thread blocks until the I/O operation completes (e.g., read() waits for data). Non-blocking I/O: the call returns immediately with EAGAIN/EWOULDBLOCK if data isn't ready. Non-blocking I/O with select()/poll()/epoll() is the basis for high-performance event-driven servers (Netty, Node.js).


Advanced Editorial Pass: File System and I/O Path Engineering​

Senior Engineering Focus​

  • Design I/O behavior for durability goals and performance budgets together.
  • Understand journaling, fsync semantics, and write amplification trade-offs.
  • Account for disk scheduler and storage class variability.

Failure Modes to Anticipate​

  • Assuming write completion implies durable persistence.
  • Unbounded synchronous I/O in request paths.
  • I/O saturation hidden behind application retries.

Practical Heuristics​

  1. Set explicit durability levels per operation type.
  2. Measure queue depth and disk latency alongside app latency.
  3. Use batching and async pipelines with strict failure semantics.

Compare Next​