This week, we had a few weird crashes with an HTTP server which we could not easily reproduce, and we had a hard time pinpointing the source of the issue. We knew the problems were triggered by bad input, but since the process was continuously receiving around 3,000 requests per second at the time, it was pretty hard to isolate the exact request(s) that made it crash.
The idea we had was to capture HTTP request data up to the point where the process crashed. Then we would open the trace and look for the last successful requests; the faulty one would be in there somewhere.
The naive approach would have been to run tcpdump until the process crashed, then to stop tcpdump and to look into the packet capture file (pcap). This solution seemed impractical because the crashes were very infrequent and random. The process could run for hours/days without crashing... we would have ended up with a massive pcap file that would have been too big to transfer or work with.
After a good look at the tcpdump man page, we figured that we could leverage a few of its parameters:
tcpdump -i interface -w file -G rotate_seconds -W filecount -C file_size -z command pcap-filter
This command line does the following:
- Capture packets on interface.
- Write the matching packets into files named file, with a file number appended (from 0 to filecount-1).
- Switch to a new file if rotate_seconds have elapsed or if file_size has been reached for the current file. When reaching filecount-1 files, tcpdump goes back to file 0 and overwrites its content, thereby creating a circular buffer of pcap files.
- After switching to a new file, run command and pass the old pcap filename as its first argument.
With this, we have a circular buffer of pcap files of reasonable size and we can run some code when switching to a new file.
Usually, the 'code' part is used to compress the pcap file (e.g.
-z gzip), but we use it to check for the error condition we are interested in and have it kill tcpdump to stop the capture. For example:
#!/bin/bash
ROUTER_PID=8761
CURRENT_PID=$(pgrep router_runner)
if [ "$ROUTER_PID" != "$CURRENT_PID" ]; then
    kill $PPID
fi
In this case, we were chasing a crash, so the quick and dirty way to know whether the process had crashed was to check whether its PID was still the same. If it wasn't, or if the process didn't exist anymore, the script would kill its parent (tcpdump) to stop the packet capture.
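A related trick: kill -0 sends no signal at all, it only checks whether a PID is still valid, so a hook like ours could also test for process existence directly. A minimal sketch (it uses the shell's own PID, $$, so it runs anywhere; in a real -z hook you would test the server's PID and kill $PPID instead of echoing):

```shell
#!/bin/bash
# kill -0 delivers no signal; it only reports whether the PID exists
# (and whether we are allowed to signal it).
if kill -0 "$$" 2>/dev/null; then
    echo "process $$ is alive"
else
    echo "process $$ is gone"
fi
```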
The command run by tcpdump can check for whatever condition we're interested in. For example, we had another interesting case where the symptoms were a lot of sockets in CLOSE_WAIT state. The script could check for this and signal the parent.
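We never published that variant, but a sketch of such a check might look like this (the netstat invocation and the threshold of 100 are assumptions; tune both for your system):

```shell
#!/bin/bash
# Hypothetical -z hook: stop the capture when too many sockets
# are stuck in CLOSE_WAIT. The threshold is arbitrary.
LIMIT=100
COUNT=$(netstat -tan 2>/dev/null | grep -c 'CLOSE_WAIT')
if [ "$COUNT" -gt "$LIMIT" ]; then
    kill $PPID   # tcpdump is our parent; killing it stops the capture
fi
```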
For the crash we were chasing, the full command we used was:
tcpdump -i any -G 60 -C 10 -W 15 -z $PWD/router_pid.sh -vv -w 9973.pcap dst port 9973
It created 15 numbered files (the last being 9973.pcap14), each containing at most 10 MB or 60 seconds' worth of requests, and the router_pid.sh script was executed each time tcpdump switched to a new output file.
We let the capture run for a few hours and eventually the process crashed. Sure enough the culprit was in the pcap files.
By default, on Ubuntu 12.04 at least, AppArmor is configured to deny the syscalls tcpdump requires to call out to the external command passed via the -z option.
The quick fix for this was to set AppArmor to 'complain' mode for tcpdump:

aa-complain /usr/sbin/tcpdump

The aa-complain command is part of the apparmor-utils package, in case you need to install it.
One caveat: the -G rotate_seconds argument works only if tcpdump has captured at least one packet; if the filter doesn't match anything, files won't be rotated. In our case it didn't matter much, but if tcpdump is started only with -G rotate_seconds, one would assume it exits after rotate_seconds. That is not the case when tcpdump has not captured any packet: it will just hang there until it gets at least one.
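If you need the capture to stop after a fixed wall-clock time regardless of traffic, one workaround (an assumption on our part, not something we ran) is to wrap tcpdump in coreutils timeout(1), which kills its child after the given duration:

```shell
#!/bin/bash
# In production this would be something like:
#   timeout 60 tcpdump -i any -w capture.pcap 'dst port 9973'
# timeout exits with status 124 when the time limit is hit.
# Demonstrate the behavior with a harmless stand-in for tcpdump:
timeout 1 sleep 5
echo "timeout exit status: $?"
```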
Of course, we did NOT discover this by running

tcpdump -G 60 ...

in a cronjob and ending up with 200 tcpdump processes running. That never happened ;-)