Debugging tools and tricks

2023-09-27, a 9 minute read for graduates and juniors

To avoid AWS IPv4 charges, I was migrating my IPv4 EC2 instance to IPv6. However, after spending hours fiddling with AWS VPC, network and security configuration, I realised that my ISP doesn't support IPv6 :/ . Here I discuss some of the techniques I have learned about debugging.


Debugging is not usually taught at uni; instead, it is typically learned on the job. Although the solution to most bugs requires only small changes, reproducing the bug can be quite frustrating and time-consuming. I've always found debugging to be one of the best ways to gain a better understanding of the system I'm working on.

In this post I describe some of the tools that I use to help me find and analyze bugs so that I can reproduce and fix them. I will be using Linux in my examples.

I will find you and I will fix you meme

Approach

In my experience, there are two ways to approach debugging. But before you do anything, your first step is to make educated guesses as to where the problem is occurring.

For instance, consider the scenario where we aim to explore all potential configurations of a given program. The cost grows with the number of parameters: if you have n boolean parameters, you have 2^n combinations. Assume each combination takes 10 minutes and you have 10 boolean parameters; it would take roughly 170 hours (10 minutes * 2^10) to try them all. So it's very important to make initial guesses and eliminate candidates as you go.
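
A quick sanity check of that arithmetic in the shell (just a throwaway sketch):

# 10 boolean parameters, 10 minutes per combination
echo "$((2 ** 10)) combinations"
echo "$((2 ** 10 * 10 / 60)) hours to try them all"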

Now back to the approaches. In both cases you need to deploy the software in a well-known, reproducible environment.

Top down:

Ad hoc, trial and error, usually your first instinct; it relies on your experience of dealing with similar systems and their common failures.

  • Advantages
    • Could be significantly quicker
  • Disadvantages
    • Limited to superficial or immediately observable issues
    • Without a plan, context switching can become overwhelming, as you switch from trying one thing to another
    • Progress can be ambiguous: sometimes you are not sure whether you moved closer to finding the bug or regressed, because you lack the knowledge of how the application works.

Bottom up:

This method involves going back to basics and deeply investigating the components of the system. How deep? Well, that depends on what you're debugging. Here I typically come up with a plan for covering the components that are most likely to be the culprit.

  • Advantages
    • You gain a deeper understanding of the system.
    • Plenty of aha moments, as you learn why your application needs to talk to all these services.
  • Disadvantages
    • It can feel like an eternity, and might lead to the sunk cost fallacy.

In my (junior) experience, it really depends. Here are some variables that should help you decide:

  • Visibility: do you have access to the other services that the software interacts with?
  • Size: how big is your application, and is the architecture a monolith or microservices?
  • Familiarity: how well do you know the environment, including the network?
  • Resources: time, documentation, and access to team member wisdom.
  • Tools at your disposal: can you install and use any tool, or are you limited to a subset?

The important thing is to keep track of what you have tried, including failures (debugging could take weeks); this will help you narrow down the list of possible culprits.
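
Nothing fancy is needed for this; a minimal sketch (the file name and note text below are just placeholders):

# append a timestamped note after every attempt
echo "$(date +%F_%T) tried disabling the cache -> no change" >> debug-notes.log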

Tools

Logs

Logs hold the state of the application at any given time and show how the system transitions from one state to another. Some of the benefits include:

  • I can always diff the current and previous build
    • this gives a glance at the change, including any errors
  • Looking at previous build logs can be easier than checking out a previous commit and running the application, especially when the application talks to external APIs/services which by this point might have changed or been deprecated

Unfortunately, logs are not usually source controlled or kept around for long because they can be massive. Although popular cloud services provide support for viewing previous builds, those logs cannot be easily diffed with the current build.

Dumping logs

If I'm debugging a binary or a big executable (big code base), I typically add > logs/log_$(date +%F_%T) when executing it. This maintains a directory with logs of all my previous executions.

Example

# make sure the directory exists first
mkdir -p logs

YOUR_COMMAND > logs/log_$(date +%F_%T)

# Prefixed with git hash
YOUR_COMMAND > logs/log_$(git rev-parse --short HEAD)_$(date +%F_%T)

Using tee, we can send the output to two destinations, such as stdout and a file:

YOUR_COMMAND | tee logs/log_$(git rev-parse --short HEAD)_$(date +%F_%T)

That way you can monitor the terminal while also storing a copy of the output.
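
To revisit the diffing point from earlier, here is a minimal sketch that compares the two most recent log files (assuming the logs/ directory and naming scheme from the examples above):

# oldest of the two most recent logs first, newest second
diff $(ls -t logs/log_* | head -2 | tac)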

Regular expressions

Once I have enough logs, it can be daunting to pinpoint the exact error. For example, I might have an application that produces hundreds of lines per request, so I need a tool to filter them quickly.

Regular expressions to the rescue. These are some of the regular expressions I have used:

  • General errors .*error|fatal.*
  • Network IP and port \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+\b
  • Stack trace ^\s+at .*

Regular expressions can feel a bit overwhelming, but picking up a single concept every week is enough to cover most features in a few months' time.
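
For instance, running the patterns above against the log files from earlier looks something like this (a sketch assuming GNU grep, whose -P flag enables Perl-compatible regular expressions):

# lines mentioning errors, case-insensitive
grep -i -P ".*error|fatal.*" logs/log_*

# stack trace frames
grep -P "^\s+at .*" logs/log_*

# IP:port pairs
grep -o -P "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+\b" logs/log_*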

Notes:

  1. Not all search tools (within IDEs) support the complete range of regular expression features, and behaviour can vary even for the same feature, so just be aware of that.
  2. Be careful with recursive regular expressions; they can slow down or even kill your IDE.

Example

Execute this tcpdump command to get the IP of this website; grep's -P flag enables Perl-compatible regular expressions and -o prints only the matching part:

sudo tcpdump host matada.org -i any -l | grep -P -o "\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

Then refresh this page and it should show you the IP of this website (nslookup will do the same).
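
For comparison, the same answer without sniffing any traffic:

nslookup matada.org

# or
dig +short matada.org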

Network

Sometimes I'm dealing with a network issue. If I'm not sure, I try to test layer by layer, starting from the bottom of the stack until I reach the application layer.

Some of the tools that I have used include:

TCP/IP layer    Tool
Application     dig, nmap, tcpdump
Transport       ss, telnet, tcpdump, nc
Network         ping, traceroute, tcpdump
Data Link       arping
Physical        Human eyes

Try reading the man page of each tool, or if you just want a quick glance at how to use them, I recommend checking out tldr.
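
As a rough sketch of working up the stack with these tools (reusing the matada.org host from the earlier example; the port is just an assumption, adjust it to whatever your application uses):

# Network layer: can we reach the host at all?
ping -c 3 matada.org

# Transport layer: can we open a TCP connection to the port?
nc -vz matada.org 443

# Application layer: does DNS resolve, and does the endpoint answer?
dig +short matada.org
curl -I https://matada.org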

Open file descriptors

SSHing into unfamiliar systems or containers without much knowledge of the environment can be tricky. Thankfully, each running process has a list of open file descriptors that we can peek into. A file descriptor is a unique identifier or reference to an open file; it can also reference a socket or an I/O stream.

Example

lsof -f -p $$

$$ is the process ID of your current shell

If you need to check the stdin/stdout/stderr of a running process, you can tail its file descriptors.

# for stdin 
tail -f /proc/<PID>/fd/0

# for stdout 
tail -f /proc/<PID>/fd/1

# for stderr
tail -f /proc/<PID>/fd/2

Sometimes the logging mechanism is running on a separate thread, so tailing the main thread (process) might not get you the expected output. In that case, check under the /proc/<PID>/task/<TID>/fd/ directory:

ls -l /proc/<PID>/task/<TID>/fd/1
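
If you are not sure which thread IDs (TIDs) a process has, here is a quick way to list them:

# thread IDs live under /proc/<PID>/task
ls /proc/<PID>/task

# or list the threads with ps
ps -T -p <PID>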

System calls and Library calls

Sometimes you have a binary and no source code. We can try to reverse engineer the program by placing ourselves between the binary and the system itself; think of it as a middleman who can see the communication between two parties. This is where strace and latrace (or ltrace) come in.

strace allows you to run a command and trace system calls. System calls --> kernel space

latrace allows you to run a command and trace dynamic library calls using glibc's LD_AUDIT feature. Library calls --> user space (although a library might itself perform a syscall)

Note:

  1. ltrace doesn't work on recent versions of Ubuntu

Example

Let's trace network-related system calls while calling curl:

sudo strace -e trace=network curl -I https://www.google.com

# on an already running process; warning: very long output
sudo strace -e trace=network -p "$(pgrep firefox)"

Let's trace dynamic library calls while calling openssl:

latrace -c openssl rand -base64 32

The output should contain calls made to libcrypto.

Flags

I also highly recommend revisiting your development tools; they might provide flags that come in handy on your journey. Typically, verbose, debug and trace flags are your first choices. Some tools require exporting an environment variable for more specific debug information.

Example

Try these and look for the debug flags:

node --help
man python
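
As an example of the environment variable approach, Node.js reads NODE_DEBUG to enable verbose logging for its core modules (app.js below is just a placeholder), and bash has its own trace flag:

# verbose output from Node's core http and net modules
NODE_DEBUG=http,net node app.js

# print every command a bash script runs
bash -x ./my-script.sh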

Taking a break

As you continue debugging, frustration can really set in, because you're emotionally invested in the problem. I've found that working on something different for a while might trigger some neurons that weren't being activated before. These new neural connections could give you a fresh perspective on how to approach your problem.

FIN

Once you know how to reproduce the bug and have analysed the problem, you breathe a sigh of relief, work on a fix, test it through automated testing, and then document your findings, including things such as:

  • What was the root cause of the issue?
  • What was the impact?
  • What steps you took to remedy it?
  • And what steps would you take to prevent it in the future?

.. And you cross your fingers and hope your fix does not introduce any Unforeseen Consequences. Your team gives you a pat on the back and you move on to smash your next task.