WTF, Linux?

Sometimes, weird things happen on your system.

The first thing to look at on any given system is the output of top:

```
top - 16:34:38 up 2 days, 1:46, 19 users, load average: 0.74, 0.45, 0.39
Tasks: 290 total, 1 running, 288 sleeping, 0 stopped, 1 zombie
%Cpu(s): 2.5 us, 5.2 sy, 0.0 ni, 91.8 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
MiB Mem : 15785.9 total, 1554.9 free, 3441.4 used, 10789.5 buff/cache
MiB Swap: 976.0 total, 976.0 free, 0.0 used. 11682.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10856 tycho     20   0 1575288  66744  15424 S   5.9   0.4  25:42.61 hangups
 4046 root      20   0  311044  48760  27076 S   4.0   0.3  30:22.79 Xorg
12812 tycho     20   0 1127988 316872 152760 S   3.0   2.0  13:29.05 chrome
12852 tycho     20   0  370588 103756  61092 S   2.0   0.6   6:22.84 chrome
30949 tycho     20   0   51804  18328  11976 S   2.0   0.1   0:00.81 urxvt
31013 tycho     20   0   12268   4316   3428 R   2.0   0.0   0:01.20 top
  642 root     -51   0       0      0      0 S   1.0   0.0   8:55.92 irq/134-iwlwifi
26517 tycho     20   0  603672 147324  78360 S   1.0   0.9   0:09.02 chrome
    1 root      20   0  166772  11124   7692 S   0.0   0.1   0:07.28 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.09 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0   0.0   0:03.80 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0   0.0   3:36.11 rcu_sched
   11 root      rt   0       0      0      0 S   0.0   0.0   0:00.70 migration/0
   12 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0
```

top allows you to sort by what's using lots of CPU, memory, etc. This machine is not loaded at all, according to the load average:

load average: 0.74, 0.45, 0.39

The first number is the load average over the last minute, the second is the load average over the last five minutes, and the third is the load average over the last 15 minutes. Note that load averages scale with the number of cores, so a machine with four cores and a load average of 4 is fully CPU bound. Additionally, the load average counts the processes that "wanted" to run, not just the ones that were actually running (on Linux it also counts processes in uninterruptible sleep, which are usually waiting on I/O). So a machine with four cores and a load average of 16 is, if that load is all runnable work, 4x oversubscribed on CPU.
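The per-core arithmetic above can be sketched in a few lines of Python. This is only an illustration: `load_per_core` is a made-up helper name, and the script assumes a Unix-like system where `os.getloadavg()` is available.

```python
import os


def load_per_core(loads, cores):
    """Divide each load average by the core count.

    A result near 1.0 means roughly one runnable process per core;
    well above 1.0 means processes are queueing for CPU.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(load / cores for load in loads)


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages.
    loads = os.getloadavg()
    cores = os.cpu_count() or 1
    normalized = load_per_core(loads, cores)
    for label, load, norm in zip(("1 min", "5 min", "15 min"), loads, normalized):
        print(f"{label}: {load:.2f} total, {norm:.2f} per core")
```

For the examples in the text, `load_per_core((4.0, 16.0), 4)` gives `(1.0, 4.0)`: fully busy, and 4x oversubscribed.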

But the most magical line in top is the per-cpu state line:

%Cpu(s): 2.5 us, 5.2 sy, 0.0 ni, 91.8 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st

Often, a careful reading of this line can tell you what is going on with a system. The numbers here are percentages of CPU time. In this summary line they are averaged across all CPUs and add up to about 100; it is the per-process %CPU column that, like load averages, can sensibly exceed 100 on a multi-core machine. But the most interesting things here are the suffixes. From the top man page:

    us, user    : time running un-niced user processes
    sy, system  : time running kernel processes
    ni, nice    : time running niced user processes
    id, idle    : time spent in the kernel idle handler
    wa, IO-wait : time waiting for I/O completion
    hi : time spent servicing hardware interrupts
    si : time spent servicing software interrupts
    st : time stolen from this vm by the hypervisor
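To make these states less magical, here is a rough sketch of how similar percentages can be computed by hand. On Linux, the kernel exposes cumulative per-state tick counters on the first line of `/proc/stat`, in the order user, nice, system, idle, iowait, irq, softirq, steal. Sampling twice and dividing each delta by the total gives the share of time spent in each state. The function names below are made up for illustration, the code assumes a kernel recent enough to report the steal field, and it only approximates what top prints.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of
# /proc/stat, labelled with the suffixes top uses for each state.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Return cumulative tick counts keyed by top's suffixes."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate CPU line (the first line of /proc/stat)."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = parse_cpu_line(read_cpu_line())
    time.sleep(1)  # sampling interval; top uses its refresh delay
    second = parse_cpu_line(read_cpu_line())
    pct = cpu_percentages(first, second)
    order = ("us", "sy", "ni", "id", "wa", "hi", "si", "st")
    print("%Cpu(s): " + ", ".join(f"{pct[k]:.1f} {k}" for k in order))
```

Because the deltas are divided by the total across all states, the results always sum to 100 regardless of how many cores the machine has.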

Mostly, you'll see high numbers in the us, sy, id, and wa columns. Large values in us or ni mean that the workload is CPU bound. These will generally correlate with non-zero values in sy, since sy is time spent running in the kernel, for example servicing system calls on behalf of those processes. Large values in sy relative to us almost always indicate a problem