2023-09-27, 9 minute read, for Graduates and Juniors
To avoid AWS IPv4 charges, I was migrating my IPv4 EC2 instance to IPv6. However, after spending hours fiddling with AWS VPC, network, and security configuration, I realised that my ISP doesn't support IPv6 :/ . Here I discuss some of the techniques I have learned about debugging.
Debugging is not usually taught at uni; instead, it is typically learned on the job. Although the fix for most bugs requires only small changes, reproducing the bug can be quite frustrating and time-consuming. I've always found debugging to be one of the best ways to gain a better understanding of the system I'm working on.
In this post I describe some of the tools that I use to help me find and analyze bugs so that I can reproduce and fix them. I will be using Linux in my examples.
In my experience, there are two ways to approach debugging. But before you do anything else, your first step should be to make educated guesses about where the problem is occurring.
For instance, consider the scenario where we aim to explore all potential configurations of a given program. The cost grows with the number of parameters: with n boolean parameters there are 2^n possibilities. If each combination takes 10 minutes to try and you have 10 boolean parameters, it would take about 170 hours (10 minutes * 2^10) to try them all. That's why it's so important to make initial guesses and eliminate candidates as you go.
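A quick sanity check of that arithmetic in the shell:

# 2^10 combinations * 10 minutes each, converted to hours
echo $(( 2**10 * 10 / 60 ))   # prints 170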
Now, back to the approaches. In both cases you need to deploy the software in a well-known, reproducible environment.
Ad hoc, trial and error, usually your first instinct, draws on your experience with similar systems and their common failure modes.
This method involves going back to basics and deeply investigating the components of the system. How deep depends on what you're debugging. Here I typically come up with a plan for working out which component is the most likely culprit.
Which approach should you pick? In my junior experience, it really depends; here are some variables that should help you decide:
The important thing is to keep track of what you have tried, including failures (it could take weeks); these notes will help you narrow down the list of possible culprits.
Logs hold the state of the application at any given time and show how the system transitions from one state to another. Some of the benefits include:
Unfortunately, logs are not usually source controlled or kept around for long because they can be massive. Although popular cloud services provide support for viewing previous builds, those logs cannot easily be diffed against the current build.
If I'm debugging a binary or a big executable (big code base), I typically append > logs/log_$(date +%F_%T) when executing it. This maintains a directory with logs of all my previous executions.
YOUR_COMMAND > logs/log_$(date +%F_%T)
# Prefixed with git hash
YOUR_COMMAND > logs/log_$(git rev-parse --short HEAD)_$(date +%F_%T)
Using tee, we can pipe the output into two streams, such as stdout and a file:
YOUR_COMMAND | tee logs/log_$(git rev-parse --short HEAD)_$(date +%F_%T)
That way you can monitor the terminal while also storing a copy of the output.
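One detail worth noting: plain > only captures stdout. Error messages, which are often the most interesting part when debugging, go to stderr, so redirect that stream as well:

# Capture both stdout and stderr in the timestamped log
YOUR_COMMAND > logs/log_$(date +%F_%T) 2>&1
# Same idea while still watching the output live
YOUR_COMMAND 2>&1 | tee logs/log_$(git rev-parse --short HEAD)_$(date +%F_%T)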
Now, once I have enough logs, it can be daunting to pinpoint the exact error. For example, I might have an application producing hundreds of lines per request, so I need a tool to filter them quickly.
Regular expressions to the rescue. These are some of the regular expressions I have used:
# Lines containing "error" or "fatal"
.*error|fatal.*
# An IPv4 address followed by a port, e.g. 10.0.0.1:443
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+\b
# Stack trace frames (lines starting with "at ...")
^\s+at .*
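As a concrete example, assuming the logs/ directory from earlier, these patterns can be fed straight to grep:

# Show only error/fatal lines across all saved logs
grep -E ".*error|fatal.*" logs/log_*
# Pull out every IP:port pair (-P for Perl regex, -o to print only the match)
grep -P -o "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+\b" logs/log_*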
Regular expressions can feel a bit overwhelming, but learning a single concept every week is enough to cover most features within a few months.
Notes:
Execute this tcpdump command to get the IP of this website; -P is for Perl regular expressions and -o is for printing only the matching pattern:
sudo tcpdump host matada.org -i any -l | grep -P -o "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
Then refresh this page; it should show you the IP of this website (nslookup will do the same).
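For comparison, a plain DNS lookup gives you the same address without sniffing any packets:

nslookup matada.org
# or
dig +short matada.org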
Sometimes I'm dealing with a network issue. If I'm not sure, I try to test layer by layer, starting from the bottom of the stack until I reach the application layer.
Some of the tools that I have used include:
| TCP/IP layer | Tool |
|---|---|
| Application | dig, nmap, tcpdump |
| Transport | ss, telnet, tcpdump, nc |
| Network | ping, traceroute, tcpdump |
| Data Link | arping |
| Physical | Human eyes |
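As a sketch of that bottom-up approach (skipping the physical and data-link checks, and using example.com and port 443 as placeholders for your own host and service):

# Network layer: can we reach the host at all?
ping -c 3 example.com
# Transport layer: is the TCP port open?
nc -vz example.com 443
# Application layer: does DNS resolve, and does the service answer?
dig +short example.com
curl -I https://example.com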
Try reading the man page of each tool, or if you just want a quick glance at how to use them, I recommend checking out tldr.
SSHing into unfamiliar systems or containers without much knowledge of the environment can be tricky. Thankfully, each running process has a list of open file descriptors that we can use to peek inside. A file descriptor is a unique identifier, or reference, to an open file; it can also reference a socket or an I/O stream.
lsof -f -p $$
$$ is the process ID of your current shell (the terminal you're typing in).
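The same trick works for any process you can identify, not just your shell; for example (the process name and port below are placeholders):

# Open files and sockets of another process
sudo lsof -p $(pgrep -o YOUR_PROCESS)
# Or work backwards: which process owns a given port?
sudo lsof -i :8080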
If you need to check the stdin/stdout/stderr of a running process, you can tail its file descriptors.
# for stdin
tail -f /proc/<PID>/fd/0
# for stdout
tail -f /proc/<PID>/fd/1
# for stderr
tail -f /proc/<PID>/fd/2
Sometimes the logging mechanism runs on a separate thread, so tailing the main thread (process) might not get you the expected output. In that case, check under the /proc/<PID>/task/<TID>/fd/ directory:
ls -l /proc/<PID>/task/<TID>/fd/1
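If you're not sure which thread ID to look at, list them first:

# List the thread IDs of a process
ls /proc/<PID>/task/
# Then tail the stdout of the thread that does the logging
tail -f /proc/<PID>/task/<TID>/fd/1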
Sometimes you have a binary and no source code. We can try to reverse engineer the program by placing ourselves between the binary and the system itself; think of it as a middleman who can see the communication between the two parties. This is where strace and latrace (or ltrace) come in.
strace allows you to run a command and trace the system calls it makes.
System calls --> kernel space
latrace allows you to run a command and trace dynamic library calls using the LD_AUDIT libc feature.
Library calls --> user space (although a library might perform a syscall itself)
Note:
Let's trace network-related system calls while calling curl:
sudo strace -e trace=network curl -I https://www.google.com
# on an already running process, warning very long output
sudo strace -e trace=network -p $(pgrep firefox)
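strace can also filter on other syscall families; tracing file-related calls, for instance, is a quick way to see which config files a program actually reads (YOUR_COMMAND is a placeholder):

# Trace file-related system calls, following any child processes
strace -f -e trace=file YOUR_COMMAND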
Let's trace dynamically linked library calls while calling openssl:
latrace -c openssl rand -base64 32
The output should contain calls made to libcrypto.
I also highly recommend revisiting your development tools; they might provide flags that come in handy on your journey. Typically, verbose, debug, and trace flags are your first choices. Some tools require exporting environment variables for more specific debug information.
Try these and look for the debug flags:
node --help
man python
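As an example of the environment-variable approach (app.js is a placeholder for your own script):

# Node.js: built-in debug logging for selected core modules
NODE_DEBUG=http,net node app.js
# Python: equivalent to the -v flag, prints import activity
PYTHONVERBOSE=1 python -c "import json"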
As you continue debugging, the frustration can really set in, as you're emotionally invested in this. I've found that working on something different for a while might trigger some neurons that weren't being activated before. These new neural connections at your disposal could provide you with a fresh perspective on how to approach your problem.
Once you know how to reproduce the bug and have analysed the problem, you breathe a sigh of relief. You work on a fix, test it through automated testing, and then document your findings, including things such as:
.. And you cross your fingers and hope your fix does not introduce any Unforeseen Consequences. Your team gives you a pat on the back and you move on to smash your next task.