Linux Internals & System Calls
Linux Kernel Architectureโ
User Space
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Applications: bash, java, nginx, postgres, ... โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ System Calls (glibc / syscall)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ Kernel Boundary
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ System Call Interface โ
โ โโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ
โ โ Process โ โ Memory โ โ VFS โ โ Network โ โ
โ โ Scheduler โ โ Manager โ โ Layer โ โ Stack โ โ
โ โโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Device Drivers โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ Hardware Abstraction Layer โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Hardware: CPU, RAM, NIC, Disk, ...
Kernel Modesโ
- Kernel Mode (Ring 0): Full hardware access. Kernel code runs here.
- User Mode (Ring 3): Restricted. Applications run here.
Switching to kernel mode is expensive (~1ยตs) โ that's why system calls should be minimized in hot paths.
System Callsโ
A system call is the mechanism by which a user-space process requests a service from the kernel.
System Call Flow (x86-64 Linux)โ
1. App calls glibc wrapper (e.g., read())
2. glibc sets syscall number in RAX, args in RDI, RSI, RDX, R10, R8, R9
3. Executes syscall instruction
4. CPU switches to Ring 0, jumps to kernel entry point
5. Kernel validates args, performs operation
6. Returns result in RAX (negative = error โ errno)
7. CPU switches back to Ring 3
Essential System Callsโ
Process Managementโ
| Syscall | Description |
|---|---|
fork() | Create child process |
exec*() | Replace process image |
wait() / waitpid() | Wait for child |
exit() | Terminate process |
getpid() / getppid() | Get PID |
kill(pid, sig) | Send signal to process |
clone() | Create thread/process with fine-grained sharing |
File I/Oโ
| Syscall | Description |
|---|---|
open(path, flags) | Open/create a file โ fd |
read(fd, buf, n) | Read n bytes |
write(fd, buf, n) | Write n bytes |
close(fd) | Release file descriptor |
lseek(fd, offset, whence) | Move file position |
mmap(addr, len, prot, flags, fd, off) | Map file/device into memory |
munmap(addr, len) | Unmap |
fsync(fd) | Flush to disk |
stat(path, statbuf) | Get file metadata |
unlink(path) | Delete file |
rename(old, new) | Rename/move file (atomic within filesystem) |
Memoryโ
| Syscall | Description |
|---|---|
brk(addr) | Adjust heap end |
mmap(MAP_ANON) | Allocate anonymous memory |
mprotect(addr, len, prot) | Change memory permissions |
madvise(addr, len, advice) | Advise kernel on memory usage |
Networkโ
| Syscall | Description |
|---|---|
socket(domain, type, proto) | Create socket |
bind(fd, addr, len) | Bind to address |
listen(fd, backlog) | Mark as passive |
accept(fd, addr, len) | Accept connection |
connect(fd, addr, len) | Connect to server |
send(fd, buf, n, flags) | Send data |
recv(fd, buf, n, flags) | Receive data |
epoll_create/ctl/wait | Scalable I/O multiplexing |
I/O Multiplexingโ
select | poll | epoll | |
|---|---|---|---|
| Max FDs | 1024 | Unlimited | Unlimited |
| Complexity | O(n) | O(n) | O(1) amortized |
| API | Bitmask | Array of pollfd | Event list |
| Re-registration | Every call | Every call | Once |
| Best for | โคfew hundred FDs | Medium | Thousands of FDs |
// epoll example
int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);
struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1);
for (int i = 0; i < n; i++) {
handle(events[i].data.fd);
}
Signalsโ
Signals are asynchronous notifications sent to processes.
Common Signalsโ
| Signal | Number | Default | Description |
|---|---|---|---|
| SIGHUP | 1 | Terminate | Hangup / config reload |
| SIGINT | 2 | Terminate | Ctrl+C |
| SIGQUIT | 3 | Core dump | Ctrl+\ |
| SIGKILL | 9 | Terminate | Cannot be caught or ignored |
| SIGSEGV | 11 | Core dump | Invalid memory access |
| SIGPIPE | 13 | Terminate | Broken pipe (write to closed socket) |
| SIGTERM | 15 | Terminate | Graceful shutdown (catchable) |
| SIGCHLD | 17 | Ignore | Child process changed state |
| SIGSTOP | 19 | Stop | Cannot be caught or ignored |
| SIGCONT | 18 | Continue | Resume stopped process |
| SIGUSR1/2 | 10/12 | Terminate | User-defined signals |
kill -SIGTERM 1234 # Graceful shutdown
kill -9 1234 # Force kill (SIGKILL)
kill -SIGUSR1 1234 # Trigger log rotation in many daemons
Signal Handling in Javaโ
Signal.handle(new Signal("TERM"), signal -> {
log.info("SIGTERM received, shutting down gracefully");
shutdown();
});
// Or via Runtime hook (more portable):
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
log.info("Shutdown hook triggered");
cleanup();
}));
Namespaces (Isolation)โ
Linux namespaces provide isolation for system resources โ the foundation of containers.
| Namespace | Isolates | Flag |
|---|---|---|
| PID | Process IDs | CLONE_NEWPID |
| Network | Interfaces, routes, firewall | CLONE_NEWNET |
| Mount | Filesystem mount points | CLONE_NEWNS |
| UTS | Hostname, domain name | CLONE_NEWUTS |
| IPC | Semaphores, message queues | CLONE_NEWIPC |
| User | UIDs, GIDs | CLONE_NEWUSER |
| cgroup | cgroup root | CLONE_NEWCGROUP |
# What namespace a process is in:
ls -la /proc/1234/ns/
# Run a process in new namespaces:
unshare --pid --fork --mount-proc bash
Control Groups (cgroups)โ
cgroups limit and account for resource usage of process groups.
Resources Controlledโ
- cpu: CPU time allocation (shares, quota, period).
- memory: RAM limit, OOM behavior.
- blkio: Block device I/O throttling.
- net_cls/net_prio: Network priority.
- pids: Limit number of processes.
# cgroups v2 (modern Linux)
# Create a cgroup:
mkdir /sys/fs/cgroup/myapp
# Set memory limit (512MB):
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/myapp/memory.max
# Set CPU limit (50% of one core):
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# Add process to cgroup:
echo 1234 > /sys/fs/cgroup/myapp/cgroup.procs
Docker uses Namespaces + cgroupsโ
Container = Namespaces (isolation) + cgroups (resource limits) + Union FS
The /proc Filesystemโ
/proc/
โโโ [PID]/ # Per-process info
โ โโโ cmdline # Command line args
โ โโโ environ # Environment variables
โ โโโ fd/ # File descriptors
โ โโโ maps # Virtual memory mappings
โ โโโ status # Process status
โ โโโ stat # Scheduler stats
โ โโโ net/ # Network stats (process's netns)
โโโ cpuinfo # CPU details
โโโ meminfo # Memory statistics
โโโ interrupts # Interrupt counters
โโโ net/tcp # TCP socket table
โโโ net/dev # Network interface stats
โโโ sys/ # Kernel tunables (sysctl)
โโโ loadavg # Load averages (1, 5, 15 min)
# Process memory map (virtual memory areas):
cat /proc/1234/maps
# or better:
cat /proc/1234/smaps
# Open file descriptors of a process:
ls -la /proc/1234/fd
# See what files a process has open:
lsof -p 1234
Essential Linux Toolsโ
Process Inspectionโ
ps aux # All processes
ps -ejH # Process tree
pstree -p # Visual process tree
top / htop # Interactive process monitor
pidstat -u 1 # Per-process CPU stats
strace -p 1234 # Trace syscalls of running process
strace -c ./program # Syscall summary
ltrace ./program # Trace library calls
Memoryโ
free -h # Memory summary
vmstat 1 # Virtual memory stats
cat /proc/meminfo # Detailed memory info
pmap -x 1234 # Process memory map
smem # Memory by process (PSS)
I/O & Diskโ
iostat -xz 1 # Disk I/O stats
iotop # I/O by process
lsof +D /path # Processes using directory
fuser /path # Processes using file
df -h # Disk space
du -sh /path # Directory size
Networkโ
ss -tlnp # TCP listening sockets + PID
netstat -tlnp # (older equivalent)
tcpdump -i eth0 port 8080 # Capture packets
curl -v http://example.com # HTTP debug
nmap -sV host # Port scan
Performance Analysisโ
perf stat ./program # CPU performance counters
perf top # System-wide CPU profiling
perf record ./program # Record samples
perf report # Analyze samples
dmesg | tail -50 # Kernel messages
journalctl -u myservice # Systemd service logs
System Call Tracing with straceโ
# Trace all syscalls:
strace ./myprogram
# Trace specific syscalls:
strace -e trace=open,read,write ./myprogram
# Trace with timestamps:
strace -tt ./myprogram
# Count syscalls:
strace -c ./myprogram
# Attach to running process:
strace -p 1234
# Trace children too:
strace -f ./myprogram
Typical strace outputโ
execve("./myprogram", [...], [...]) = 0
brk(NULL) = 0x55a3b2000000
mmap(NULL, 4096, PROT_READ|PROT_WRITE,...) = 0x7f9a12000000
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0...", 832) = 832
close(3) = 0
...
write(1, "Hello, World!\n", 14) = 14
exit_group(0) = ?
Common Interview Questionsโ
Q1: What is the difference between a process and a container?โ
A container is not a separate OS โ it's a Linux process (or group of processes) running with: namespaces (isolation for PID, network, filesystem, etc.) and cgroups (resource limits). The kernel is shared with the host OS. A VM has its own kernel and hardware emulation.
Q2: What happens during a fork() system call?โ
The kernel creates a new PCB for the child, copies the parent's page tables (marking pages copy-on-write), assigns a new PID, and returns 0 to the child and the child's PID to the parent. Actual page copying is deferred until a write occurs.
Q3: Why can't SIGKILL be caught?โ
SIGKILL bypasses the process's signal handlers entirely โ it's delivered directly by the kernel to terminate the process. This guarantees that a process can always be killed, preventing processes from ignoring termination. SIGTERM should be tried first to allow graceful shutdown.
Q4: What is epoll and how does it scale better than select?โ
select/poll require passing all watched FDs on every call (O(n) scan). epoll maintains a kernel-side data structure โ you register FDs once, and epoll_wait returns only the FDs that are ready. Complexity is O(1) per event, not O(n) per FD. Essential for high-connection servers.
Q5: What is a file descriptor?โ
A file descriptor is a small integer (0=stdin, 1=stdout, 2=stderr, then 3+) that's an index into the process's file descriptor table โ a per-process array of open file descriptions. Each entry points to a kernel file struct containing the current offset, flags, and the underlying inode. Inherited across fork().
Q6: What does sysctl do?โ
sysctl reads and sets kernel parameters at runtime via /proc/sys/. Example: sysctl -w net.core.somaxconn=65535 increases the max TCP connection backlog. Changes are ephemeral unless written to /etc/sysctl.conf.
Q7: What is the difference between /proc and /sys?โ
/proc: Exposes process and kernel state (memory maps, CPU info, network tables). Historical, somewhat unorganized. /sys (sysfs): Structured representation of kernel objects (devices, drivers, buses). Read/write files to configure devices. More organized than /proc.
Q8: How does mmap work and what are its advantages?โ
mmap() creates a mapping between a file and the process's virtual address space. The kernel maps file pages on demand (lazy loading). Advantages: avoids double-copying (file โ page cache โ user buffer), allows multiple processes to share the same physical pages (shared libraries, IPC), and access patterns like random seeks are as efficient as array indexing.
Advanced Editorial Pass: Kernel Boundary Awareness for Backend Engineersโ
Senior Engineering Focusโ
- Model syscall overhead and kernel transitions in hot paths.
- Use namespaces and cgroups as governance tools for isolation and limits.
- Treat procfs and tracing as first-class diagnostics interfaces.
Failure Modes to Anticipateโ
- Excessive syscall chatter degrading throughput under load.
- Resource limits misconfigured across cgroups causing false app failures.
- Signal handling bugs leading to unsafe shutdown behavior.
Practical Heuristicsโ
- Profile syscall mix before optimization work.
- Validate container limits with controlled stress tests.
- Standardize graceful signal handling and shutdown sequencing.