Linux Host Troubleshooting
The text DevOps for the Desperate covers a range of scenarios for troubleshooting issues on a remote linux host.
Below are notes on each scenario and the tools used to investigate. Basic cli knowledge, sudo rights, and ssh access are assumed.
How to troubleshoot
Before troubleshooting can begin it is important to have a framework to guide the investigation.
The following approach can be used for methodical troubleshooting:
- Start simple
- Build mental model
- Develop a theory
- Use consistent tools
- Keep notes
- Ask for help
High load average
The linux metric load average indicates how busy a host is. CPU usage and disk IO is used to compute the metric. If a host is in an impaired state load average may be a factor.
To troubleshoot first examine load average and then identify processes contributing to a high load. As a general rule if load average is greater than CPU core count there may be stalled processes impacting performance.
uptime
The uptime command displays how long the host has been running, number of users, and 1/5/15 minute load averages. Note the difference in load times to infer if the host is expierncing a high average load over time. If load is greater than CPU core count, continue investigating.
mlr@pop-os:~$ uptime
19:28:23 up 16 min, 1 user, load average: 0.25, 0.43, 0.36
top
The top command provides information about processes running on the host. It provides CPU, memory, and process information. This tool refreshes itself every 3.0 seconds, so let it run for several cycles and note the differences between values.
If a process PID has high %CPU or %MEM it may be contributing to high load averages. The COMMAND field indicates the name of the process and can be used as a starting point to investigate the further.
mlr@pop-os:~$ top
top - 19:33:09 up 21 min, 1 user, load average: 0.33, 0.34, 0.34
Tasks: 295 total, 1 running, 294 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 0.4 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7663.1 total, 3067.7 free, 2129.3 used, 2466.2 buff/cache
MiB Swap: 4095.5 total, 4095.5 free, 0.0 used. 4625.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
249 root -51 0 0 0 0 S 2.3 0.0 0:16.12 irq/156-DLL0945:00
1413 root 15 -5 1386604 114068 69684 S 2.0 1.5 0:32.99 Xorg
2261 mlr 15 -5 635864 54248 40084 S 1.3 0.7 0:08.50 gnome-terminal-
1614 mlr 15 -5 5357140 265588 118724 S 1.0 3.4 0:32.49 gnome-shell
High memory usage
Spikes in traffic, memory leaks, or failing applications can cause memory to be consumed at high rates. By design, linux allocates all memory to cache and buffers also making free memory appear low.
The first step is to confirm that the host is really running low on memory or if the kernel is simply swapping cached and buffered memory between processes. Then move to identify the memory consuming processes and handle them.
free
The free -hm command displays free and used system memory at the time it is run. The -hm flag outputs memory usage in a human readable format. The mem: row indicates actual RAM usages while the swap: row is related to memory written to disk.
If the free column in the swap row is low the host is writing memory to the disk and running slow. The used and free columns can be misleading. Reference the available column to get a feel for how much memory is actually available for new processes.
free -hm
total used free shared buff/cache available
Mem: 7.5Gi 2.1Gi 3.0Gi 643Mi 2.4Gi 4.5Gi
Swap: 4.0Gi 0B 4.0Gi
vmstat
The vmstat 1 5 command provides information about processes, memory, IO, disks, and CPU state. The 1 5 arguements will set vmstat to poll the host for information 5 times every minute. This makes memory trends easier to spot.
The first row of data in the report is system average since boot. The memory sections provides information on memory moving between free, buff, and cache. The swap section shows memory being paged in and out of the disk. Low relative free memory and lots of swap activity indicate that the host consuming high rates of free memory and relying swapped disk memory.
The r column indicates the number of processes waiting to run while the b column indicates the number of processes sleeping. High count in r indicates a possible CPU bottleneck. High count in b indicates that the host is waiting on disk or network I/O.
mlr@pop-os:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3092416 107712 2437708 0 0 137 72 342 468 2 1 97 0 0
0 0 0 3085652 107712 2443952 0 0 0 96 1309 1748 1 0 99 0 0
0 0 0 3085400 107712 2441900 0 0 0 0 2976 3101 1 1 99 0 0
0 0 0 3096512 107712 2439596 0 0 0 0 3068 2655 1 1 99 0 0
0 0 0 3096260 107712 2439660 0 0 0 112 303 540 0 0 100 0 0
ps
The ps -efly --sort=-rss | head command provides a snapshot of all running processes and memory usage. The efly --sort=-rss | head flag sorts the processes by highest memory usage and shows the top ten results. The CMD column in the output shows the name of each process. The RSS column gives the amount of memory being used by the process in kilobytes.
ps -efly --sort=-rss | head
S UID PID PPID C PRI NI RSS SZ WCHAN STIME TTY TIME CMD
S mlr 1807 1578 0 85 5 630440 345151 do_pol 19:12 ? 00:00:08 io.elementary.appcenter -s
S mlr 1614 1316 2 75 -5 265828 1347579 do_pol 19:12 ? 00:00:41 /usr/bin/gnome-shell
High iowait
A host has high iowait when it is spending too much time waiting for disk IO. This metric is measured by tracking the percentage of time a CPU is idle while waiting for IO disk request. High iowait creates higher average load and CPU usage. Intense application read and writing or slow network storage can be the root cause.
A small amount of iowait is normal on a modern system. The challenge is differentiating normal iowait with sustained high iowait over a period. After identifying high iowait move to finding the process responsible.
iostat
The iostat -xz 1 20 command reports IO and CPU stats for storage devices mounted to the host. The flag -xz 1 20 polls the system 20 times every second and returns an extended statistic format. The %iowait column will show what percent of time the CPU is waiting on disk requests. The w/s column indicates the number of writes per second hitting a disk and the util column indicates disk utilization.
Reviewing polling results for a period of time should help identify if sustained high iowait is affecting the host.
mlr@pop-os:~$ iostat -xz 1 20
Linux 5.17.5-76051705-generic (pop-os) 08/07/2022 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.86 0.23 0.85 0.14 0.00 96.93
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.10 2.63 0.00 0.00 0.16 25.65 0.01 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 25.17 938.55 15.53 38.15 0.22 37.28 21.51 520.71 15.15 41.32 1.54 24.20 0.00 0.00 0.00 0.00 0.00 0.00 0.89 0.52 0.04 1.84
iotop
The sudo iotop -oPab command displays IO usage relative to processes on the host. It is similiar to top. The flag -oPab will constantly poll the host and return cummulative IO stats. Elevated permissions are required to run iotop. The IO column will show IO usage and the PID and COMMAND columns can be used to identify process.
Reviewing the polling results will help identify what process or proccesses are creating high iowait.
mlr@pop-os:~$ sudo iotop -oPab
Total DISK READ: 0.00 B/s | Total DISK WRITE: 443.33 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 391.87 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
352 be/3 root 0.00 B 364.00 K ?unavailable? [jbd2/nvme0n1p3-8]
405 be/4 root 0.00 B 64.00 K ?unavailable? systemd-journald
1077 be/4 root 0.00 B 20.00 K ?unavailable? packagekitd
Out of disk space
At some point a host will run out of disk space. This can be caused by an application, accumulated logs, or build up of files. The drive and file system with low disk space needs to be identified first. Then the isolated drive can be searched to locate the files consuming large amounts of disk space.
df
The df -h command displays disk usage on all mounted filesystems. The flag -h returns a human readable output. Review the size, used, and use% columns to evaluate what disks under filesystem are close to capacity.
mlr@pop-os:~$ df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 767M 1.9M 765M 1% /run
/dev/nvme0n1p3 226G 33G 182G 16% /
tmpfs 3.8G 0 3.8G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p1 497M 362M 136M 73% /boot/efi
/dev/nvme0n1p2 4.0G 2.6G 1.5G 64% /recovery
tmpfs 767M 180K 767M 1% /run/user/1000
find
The sudo find / -type f -size +100M -exec du -ah {} + | sort -hr | head command searches a specified portion of the filesystem for directories and files that match a criteria. In this case the entire command searches the root filesystem for all files greater than 100mb, sorts by size, and then displays the top ten largest files. Elevated permissions are required. Evaluating the output will provide large files to review and a link to the processes filling the disk.
mlr@pop-os:~$ sudo find / -type f -size +100M -exec du -ah {} + | sort -hr | head
2.6G /var/cache/pop-upgrade/recovery.iso
2.4G /recovery/casper-1FE5-33A5/filesystem.squashfs
666M /home/mlr/Desktop/export/apple_health_export/export.xml
Connection refused
When a host is impaired its internal APIs may refuse connections over a network. When inspecting application logs a connection refused error over a port may be observed. Troubleshooting will involve checking network status to and from the host.
curl
The curl command is used to check if another webserver is responding to requests. This will help confirm if the impaired host is completely down for all users. If the internal API is impaired a connection refused or connection timeout may be returned. This implies the message packet is getting dropped at the host port or firewall.
mlr@pop-os:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ss
The sudo ss -l -n -p | grep 4000 command will dump socket information on a host. It can be used to check if the API is actually listening on a port. The flag -l -n -p pulls all listening sockets, does not resolve HTTP/SSH, and reports the process using the process. The output is piped to grep to search for the desired port. Elevated permissions are required to see all processes.
mlr@pop-os:~$ sudo ss -l -n -p | grep 4000
tcp LISTEN 0 4096 127.0.0.1:4000 0.0.0.0:* users:(("bundle",pid=5253,fd=8))
tcpdump
The sudo tcpdump -ni any tcp port 4000 command can be used capture network traffic on a host. This will verify if traffic is reaching the impaired host. Executing the command will begin capturing and inspecting tcp packets recieved on all interfaces. The flag -ni any tcp stops dns resolution and tells tcpdump to listen for traffic from 8080. Elevated permissions required. Reviewing the flags in the output will show if connections are being refused. If repeated [S] and [R] flags are observed this implies that a remote IP is attempting to sync with the host but the connection is being reset.
mlr@pop-os:~$ sudo tcpdump -ni any tcp port 4000
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
19:50:24.454916 lo In IP 127.0.0.1.56080 > 127.0.0.1.4000: Flags [S], seq 2500379491, win 65495, options [mss 65495,sackOK,TS val 1673811859 ecr 0,nop,wscale 7], length 0
19:50:24.454927 lo In IP 127.0.0.1.4000 > 127.0.0.1.56080: Flags [S.], seq 1261009642, ack 2500379492, win 65483, options [mss 65495,sackOK,TS val 1673811859 ecr 1673811859,nop,wscale 7], length 0
19:50:24.454937 lo In IP 127.0.0.1.56080 > 127.0.0.1.4000: Flags [.], ack 1, win 512, options [nop,nop,TS val 1673811859 ecr 1673811859], length 0