DEV Community

Leszek Walszewski
Leszek Walszewski

Posted on

Debug php-fpm with strace

Problem

I've been dealing with a problem in one of my projects for quite some time. I think this is the worst kind of problem: it only happens in production, it occurs randomly, and it doesn't leave any traces in either PHP logs or server logs. When the problem occurs, there is a terribly long page load that ends in a timeout.

The project is quite old with a large technology debt. No framework was used, so it's very difficult to find common problems or get community advice.

The client won't invest a lot of money in this project, but it's still profitable, so he is trying to get every task done at a low cost.
Therefore, there is no opportunity to purchase a New Relic license or a similar paid profiling service. So, how can we trace low-level problems with the tools that we already have?

Strace to the rescue

Strace is the Linux tool for tracing system processes. It can be used to debug low-level interactions between them.

The basic usage command is:

 

$ strace -p 26380
strace: Process 26380 attached
...
Enter fullscreen mode Exit fullscreen mode

When the failure occurs again, the only thing left to do is execute a command that will connect the Strace to every php-fpm sub-process:

sudo strace -f $(pgrep php-fpm | sed 's/\([0-9]*\)/\-p \1/g') -o ~/stracelogs/log5 -t
Enter fullscreen mode Exit fullscreen mode
-f trace child processes 
-o save result to the file
-t add timestamps to each line
Enter fullscreen mode Exit fullscreen mode

As you can see, we use a combination of phrep and sed to find all child php-fpm processes to attach to. 

Logs may appear to be complex, but we can sometimes find interesting information in them. I have found this:

...
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("xxx.xxx.xx.xx")}, 16) = 0
poll([{fd=6, events=POLLOUT}], 1, 0)    = 1 ([{fd=6, revents=POLLOUT}])
sendmmsg(6, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="&\5\1\0\0\1\0\0\0\0\0\0\6portal\7interia\2pl\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\36\r\1\0\0\1\0\0\0\0\0\0\6portal\7interia\2pl\0\0"..., iov_len=35}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, msg_len=35}], 2, MSG_NOSIGNAL) = 2
poll([{fd=6, events=POLLIN}], 1, 5000)  = 1 ([{fd=6, revents=POLLIN}])
ioctl(6, FIONREAD, [156])               = 0
recvfrom(6, "&\5\201\200\0\1\0\1\0\3\0\3\6portal\7interia\2pl\0\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("xxx.xxx.xx.xx")}, [28->16]) = 156
poll([{fd=6, events=POLLIN}], 1, 4999)  = 1 ([{fd=6, revents=POLLIN}])
ioctl(6, FIONREAD, [93])                = 0
recvfrom(6, "\36\r\201\200\0\1\0\0\0\1\0\0\6portal\7interia\2pl\0\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("xxx.xxx.xx.xx")}, [28->16]) = 93
close(6)                                = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
fcntl(6, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("xxx.xx.xx.xxx")}, 16) = -1 EINPROGRESS (Die Operation ist jetzt in Bearbeitung)
poll([{fd=6, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 60000
...
Enter fullscreen mode Exit fullscreen mode

And there, the process got stuck.
Additional it was unable to connect to the ip xxx.xx.xx.xxx by CURL.

The page went online again when this IP responded.

After a little workaround, I found an issue in the code. The reason for the crush was the file_get_contents() function that tried to load the URL directly and put it in HTML's header.

Conclusion

Strace is a powerful tool. Sometimes, even basic knowledge of it is enough to find the cause of a problem. This is especially true when we don't have the budget for specialized, dedicated debugging software.

Top comments (0)