CPU Scheduling
CPU scheduling decides which process/thread runs on the CPU at any given time. The scheduler is the OS component responsible for this.
Scheduling Conceptsโ
CPUโI/O Burst Cycleโ
Processes alternate between:
- CPU burst: Actively executing instructions.
- I/O burst: Waiting for I/O to complete.
CPU-bound processes: Long CPU bursts, infrequent I/O (e.g., video encoding).
I/O-bound processes: Short CPU bursts, frequent I/O (e.g., web server, database).
Preemptive vs Non-Preemptiveโ
| Non-Preemptive | Preemptive | |
|---|---|---|
| Description | Process runs until it voluntarily yields or blocks | OS can forcibly remove a running process |
| Latency | Unpredictable | Bounded |
| Complexity | Simple | Complex (need synchronization) |
| Examples | FCFS, SJF (non-preemptive) | Round Robin, SRTF, Linux CFS |
Scheduling Metricsโ
| Metric | Definition |
|---|---|
| CPU Utilization | % of time CPU is busy (target: 40โ90%) |
| Throughput | Processes completed per unit time |
| Turnaround Time | Completion time โ Arrival time |
| Waiting Time | Time spent in ready queue |
| Response Time | First response โ Arrival time |
Scheduling Algorithmsโ
1. First-Come, First-Served (FCFS)โ
- Type: Non-preemptive
- Processes scheduled in arrival order.
- Convoy Effect: Short processes stuck behind a long one.
Process: P1(24ms) P2(3ms) P3(3ms), all arrive at t=0
Order: P1 โ P2 โ P3
Avg Wait: (0 + 24 + 27) / 3 = 17ms โ poor
2. Shortest Job First (SJF)โ
- Type: Non-preemptive (or preemptive โ SRTF)
- Selects process with shortest next CPU burst.
- Optimal average waiting time (non-preemptive: among non-preemptive algorithms).
- Problem: Cannot know burst length in advance โ estimate using exponential averaging:
ฯ(n+1) = ฮฑ * t(n) + (1 โ ฮฑ) * ฯ(n)
Where t(n) = actual nth burst, ฯ(n) = predicted nth burst, ฮฑ โ [0, 1].
3. Shortest Remaining Time First (SRTF)โ
- Type: Preemptive version of SJF.
- When a new process arrives, if its burst < remaining time of current process โ preempt.
- Optimal average waiting time overall.
- Starvation risk for long processes.
4. Round Robin (RR)โ
- Type: Preemptive
- Each process gets a fixed time quantum (typically 10โ100 ms).
- After quantum expires, process goes to end of ready queue.
Quantum = 4ms
P1(24ms) P2(3ms) P3(3ms)
P1(4)โP2(3)โP3(3)โP1(4)โ...โP1(4)โP1(4)โP1(4)โP1(4)
Avg wait = (6 + 4 + 7) / 3 โ 5.67ms
Quantum size matters:
- Too large โ degenerates to FCFS.
- Too small โ too many context switches (overhead).
- Rule of thumb: 80% of CPU bursts should be shorter than the quantum.
5. Priority Schedulingโ
- Each process has a priority; highest priority runs first.
- Preemptive or non-preemptive.
- Starvation: Low-priority processes may never run.
- Solution: Aging โ gradually increase priority of waiting processes.
6. Multilevel Queue Schedulingโ
Ready queue is divided into separate queues (e.g., foreground/background). Each queue has its own scheduling algorithm.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ Interactive (RR, high priority)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค โ System processes
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค โ Batch (FCFS, low priority)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
7. Multilevel Feedback Queue (MLFQ)โ
Processes can move between queues based on behavior:
- Use a lot of CPU โ demoted to lower queue (longer quantum).
- Wait too long โ promoted (aging).
This is used in many real OS schedulers.
Linux Completely Fair Scheduler (CFS)โ
Linux uses CFS (since kernel 2.6.23) for normal processes (SCHED_NORMAL).
Core Ideaโ
Each process maintains a virtual runtime (vruntime) โ how much CPU time it has used, weighted by priority (nice value). The scheduler always picks the process with the lowest vruntime.
vruntime += actual_runtime * (NICE_0_LOAD / weight)
Higher-priority (lower nice) processes have higher weight โ vruntime grows slower โ they get more CPU.
Data Structureโ
CFS uses a red-black tree (sorted by vruntime) for O(log n) process selection. The leftmost node is always the next process to run.
Nice Valuesโ
- Range: โ20 (highest priority) to +19 (lowest priority).
- Each step โ 10% more/less CPU time.
nice -n 10 ./my_program # start with nice=10
renice -n -5 -p 1234 # change running process
Scheduling Classes (Priority Order)โ
SCHED_DEADLINEโ EDF (Earliest Deadline First), real-time.SCHED_FIFO/SCHED_RRโ Real-time, fixed priority 1โ99.SCHED_NORMAL/SCHED_BATCH/SCHED_IDLEโ CFS.
Real-Time Schedulingโ
Hard vs Soft Real-Timeโ
| Hard Real-Time | Soft Real-Time | |
|---|---|---|
| Deadline miss | Catastrophic (safety-critical) | Degraded quality |
| Examples | Airbags, pacemakers | Video streaming, audio |
Algorithmsโ
- Rate Monotonic (RM): Static priority; shorter period โ higher priority. Optimal for static-priority preemptive scheduling.
- Earliest Deadline First (EDF): Dynamic priority; closest deadline โ highest priority. Optimal for preemptive scheduling (can achieve 100% utilization).
Multiprocessor Schedulingโ
Asymmetric vs Symmetricโ
- AMP (Asymmetric): One master CPU handles scheduling, others run user code.
- SMP (Symmetric): Each CPU self-schedules from a shared ready queue.
Processor Affinityโ
The tendency to keep a process on the same CPU to exploit warm cache:
- Soft affinity: OS tries but does not guarantee same CPU.
- Hard affinity: Process is pinned to specific CPU(s).
taskset -c 0,1 ./program # Pin to CPU 0 and 1
Load Balancingโ
- Push migration: Overloaded CPU pushes tasks to idle CPUs.
- Pull migration: Idle CPU pulls tasks from busy CPUs.
NUMA-Aware Schedulingโ
In Non-Uniform Memory Access systems, accessing local memory is faster. The scheduler tries to keep processes on CPUs near their memory.
Common Interview Questionsโ
Q1: What is the difference between preemptive and non-preemptive scheduling?โ
Preemptive: OS can interrupt a running process (e.g., timer interrupt). Non-preemptive: Process runs until it voluntarily yields, completes, or blocks. Preemptive provides better responsiveness but requires synchronization mechanisms.
Q2: Why is SJF optimal but rarely used in practice?โ
SJF minimizes average waiting time but requires knowing the next CPU burst length in advance, which is impossible. It can be approximated using exponential averaging of past bursts.
Q3: What is the convoy effect?โ
In FCFS, a CPU-bound process with a long burst holds the CPU while many short I/O-bound processes wait. This leads to poor CPU and I/O device utilization.
Q4: How does Linux CFS prevent starvation?โ
CFS tracks vruntime for all processes. A sleeping process's vruntime is set to min_vruntime of the tree when it wakes, so it gets scheduled quickly but doesn't completely skip ahead. No process is ever fully starved.
Q5: What is the difference between SCHED_FIFO and SCHED_RR in Linux?โ
Both are real-time. SCHED_FIFO: process runs until it blocks or yields โ no time quantum. SCHED_RR: like FIFO but with a time quantum; when it expires, the process goes to the back of its priority queue.
Q6: What is processor affinity and why does it matter?โ
Affinity keeps a process on the same CPU to benefit from hot cache data. Cache misses on migration are expensive (100s of ns). Linux provides sched_setaffinity() syscall.
Q7: How does the time quantum in Round Robin affect performance?โ
- Large quantum โ behaves like FCFS, high turnaround for short jobs.
- Small quantum โ more responsive but high context-switch overhead.
- Ideal: quantum just larger than a typical interaction burst.
Q8: What is aging in scheduling?โ
Aging incrementally increases the priority of a process that has been waiting a long time, preventing starvation. For example, increase priority by 1 every 15 minutes of waiting.
Advanced Editorial Pass: CPU Scheduling, Fairness, and Tail Latencyโ
Senior Engineering Focusโ
- Reason about p99 latency through scheduler behavior, not average CPU utilization.
- Account for cgroup quotas and NUMA effects in containerized deployments.
- Align workload priority policy with business-critical paths.
Failure Modes to Anticipateโ
- CPU starvation due to noisy-neighbor effects.
- Misleading utilization metrics masking run-queue congestion.
- Latency regressions after container CPU limit changes.
Practical Heuristicsโ
- Track run-queue length and throttling metrics with request latency.
- Pin high-priority workloads carefully; validate with load tests.
- Re-check scheduler assumptions after kernel/runtime upgrades.