22 March 2008 - 20:13
Troubleshooting Defunct (Zombie) Processes on Linux
At work (LNI) we have a tool which implements a remotely controlled RDMA agent using the OpenFabrics interface. The agent is used for compliance, interoperability, and performance testing. For quite some time we’ve had a problem where sometimes the agent hangs and then after sending SIGINT (ctrl-C or kill) the agent shows up as “defunct” with a Zombie state in the output of ps.
Normally a zombie process means that the process has died but remains in the process table because the parent hasn’t called wait() to “reap” the process and retrieve the return code. If you kill the parent, the zombie process becomes parented by init (process 1) and init reaps it. But, the problem we were having was clearly not this. Killing the parent did nothing. And the defunct agent appeared to be holding onto resources. It was not possible to start a new agent since the defunct one continued to hold a socket open, which would never be true with the usual meaning of a defunct process. The only workaround was to reboot the system.
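To make the usual meaning of "defunct" concrete, here is a minimal sketch that creates an ordinary zombie and then reaps it. It is not part of the agent; the helper names process_state and demo_zombie are made up for illustration, and it assumes a Linux system with /proc mounted.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state of a process from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child that exits at once; return its state before it is reaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with status 7
    # Parent: poll until the child has finished exiting, WITHOUT calling wait().
    deadline = time.monotonic() + 5.0
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)
    # Reaping: waitpid() collects the exit status and frees the process table entry.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(demo_zombie())
```

Between the child's exit and the waitpid() call the child sits in state "Z" and holds no sockets or other resources, which is why the agent's behavior did not fit this picture.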
The key to the real answer is that the agent uses multiple threads with the POSIX threads library. Individual threads and processes under Linux are both viewed as tasks to the process management code. Threads are implemented using one task designated as “thread group leader” and a “thread group id” present in each task_struct. By default, the ps utility displays just the thread group leader, hiding the other tasks.
Working at the OpenFabrics Alliance interoperability event last week at UNH-IOL we once again experienced the defunct problem. This time I attacked it in earnest and discovered the “ps aumx” options (“m” being the critical one, which displays threads individually). This showed the thread group leader in “Z” state, and then the key: another thread was stuck in the “D” uninterruptible sleep state. In this state the task is sleeping in kernel mode and cannot be interrupted by any signal, including SIGKILL (signal 9, which normally cannot be ignored). Thus this thread is unkillable until whatever condition it is waiting for in kernel mode is cleared. If that condition never clears, the only solution is to reboot the system.
When the SIGINT (or SIGKILL or whatever) was delivered to the agent, all the tasks (threads) received the signal. Yet one couldn’t exit because it was in “D”, so the thread group leader remained in the deceptive “Z” state, which in this case indicated that the process was still around. This was confirmed by running ps aumx on the agent after a suspected hang but before attempting to kill it. This time there were the usual numerous threads, one of which was listed in state “D”.
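The same per-thread view that “ps aumx” gives can be read straight from /proc. The following is a rough sketch, assuming a Linux /proc layout; thread_states and stuck_threads are hypothetical helper names, not part of ps or of the agent.

```python
import os


def thread_states(pid):
    """Map each thread ID of `pid` to its one-letter state (Linux /proc only)."""
    states = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                data = f.read()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
        # The state is the first field after the parenthesized command name.
        states[int(tid)] = data.rsplit(")", 1)[1].split()[0]
    return states


def stuck_threads(pid):
    """Return the IDs of threads in uninterruptible sleep (state "D")."""
    return [tid for tid, state in thread_states(pid).items() if state == "D"]


if __name__ == "__main__":
    # Inspect this script's own threads as a harmless demonstration.
    print(thread_states(os.getpid()))
```

The thread group leader appears under a thread ID equal to the process ID. In the situation described above, one would expect that entry to read "Z" while another entry reads "D".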
So how to debug from here? It was possible to use the Magic SysRq Key to obtain a listing of the current tasks on the system. This displayed a stack trace showing the execution context of each task running in kernel mode, including the agent task stuck in uninterruptible sleep. Using this it was possible to determine what the task was doing which caused it to slip into a coma.
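For machines without a convenient keyboard, the same task dump can be requested from software. This is only a sketch under assumptions that go beyond the original write-up: it assumes root privileges, a kernel with SysRq enabled, and (for the second helper) a newer kernel that exposes per-task stacks under /proc. The function names request_sysrq and kernel_stack are invented for illustration.

```python
def request_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to run a SysRq command, like pressing Alt-SysRq-<key>.

    "t" dumps every task; "w" dumps only blocked (state "D") tasks.
    The output goes to the kernel log (read it with dmesg). Needs root.
    """
    if len(key) != 1:
        raise ValueError("a SysRq command is a single character")
    with open(trigger_path, "w") as f:
        f.write(key)


def kernel_stack(pid, tid):
    """Return the kernel stack of one task as text, or None if unavailable.

    /proc/<pid>/task/<tid>/stack is normally readable only by root and
    exists only on kernels built with stack trace support.
    """
    try:
        with open(f"/proc/{pid}/task/{tid}/stack") as f:
            return f.read()
    except OSError:
        return None
```

As root, request_sysrq("w") followed by dmesg should list just the blocked tasks, which is usually a much shorter dump than the full task list.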
I wanted to make sure this got written up because I spent way too much time looking on Google and finding pages that described the usual meaning of “defunct” processes but didn’t touch on this deceptive alternate meaning. So hopefully now it’ll be found! Extra thanks to Professor Robert Russell who helped me troubleshoot all of this.
11 Comments | Tags: computers
22 Mar 2008 - 23:15
Hey, not that it makes a difference to the final analysis, but I’ve always heard that if you send a signal to a multi-threaded (pthreads) process, the signal gets sent to *one* of the threads, not all of them.
The signal will be sent to some randomly chosen thread, unless you set up per-thread signal masks (I forget which pthread_*() function does this).
22 Mar 2008 - 23:25
Technically true although as you say it doesn’t matter to the analysis. The effect of SIGINT is to cause all threads to exit (much like calling exit() from any of the threads).
23 Mar 2008 - 1:43
sounds like a linux kernel bug to me. did you send lkml a note, or pointer to this one?
23 Mar 2008 - 2:05
I don’t believe this is a kernel bug. The behavior of thread group leader remaining in EXIT_ZOMBIE is mentioned in comments in kernel/exit.c, and makes sense. If anything the way ps displays threads can be deceptive and perhaps it would be better to display the task in “D”.
25 Mar 2008 - 11:18
Often I’ve encountered a situation where Firefox is hanging on a web page that displays flash video. Even though I close Firefox and look for processes using “ps -ef|grep fire”, I see none. Firefox still will hang if I start again and go to the flash web page.
Only re-booting the machine seems to fix the problem. Is this a similar problem? I’ve also encountered similar problems when messing with the SystemSettings app in Kubuntu when trying to configure less-than-100%-compatible wi-fi devices. Hope this helps…
25 Mar 2008 - 12:15
Just a suggestion from the pthread trenches.
1) Add a signal handler to your main program before creating any threads. Trap all signals except for SIGTERM and SIGINT. This handler will be called for any child thread that generates an exception. Before exiting from the handler, call pthread_kill_other_threads_np(). This will give you a clean restart.
2) Audit your code to make sure you’re using thread reentrant functions (localtime_r, hostname_r etc). This could be the cause of your deadlock.
25 Mar 2008 - 12:57
Good advice. I’ve only recently started working with this codebase so who knows whether or not it is using the proper reentrant functions. I’ll need to look into that.
Regarding the signal handler idea, I don’t think this will solve this particular problem. I THINK the reason for the hang is improper cleanup of IB verbs resources (specifically outstanding unacked events) and this causes the function to block in uninterruptible sleep. It may be possible to ack the events in a signal handler and get a clean stop. But this is still a theory.
25 Mar 2008 - 19:48
Regarding the signal handler idea, I don’t think this will solve this particular problem.
It will if your child/master thread is getting a SIGSEGV or SIGILL and there is no way to trap it. (Without getting into details I can’t explain very well without a whiteboard, the Linux kernel will mark the PID a Zombie when a posix thread child crashes.) Under pthreads the child has the same process ID as the parent.
In your exception routine make sure to call backtrace() and backtrace_symbols_fd() (see the glibc manual) to locate where the thread crashed. Use the posix functions to log the errors because access to the glibc functions may be Tango Uniform (use open/write vs fopen/fwrite to log errors). Having a thread exception routine for my pthread programs has saved me countless hours of troubleshooting.
You didn’t state what kernel your application is running under so I reserve the right to give you the wrong information (there are subtle differences between 2.2, 2.4 and 2.6 and pthread versions as well).
IB verbs resources (specifically outstanding unacked events) a
I have no clue what you’re talking about.
who knows whether or not it is using the proper reentrant functions. I’ll need to look into that.
On the Posix website you can get a list of the pthread reentrant functions.
BTW: The only reason I came to your website is that Linux Today had this entry listed as a way to troubleshoot defunct Zombie processes. I thought I was going to learn something new :)
Good Luck.
25 Mar 2008 - 21:29
No. The child thread *cannot* receive a signal because it is stuck in uninterruptible sleep (that is the “D” state in ps’s output). In this case the signal is pending and will be delivered if and when whatever event the kernel is waiting for (using wait_for_completion()) ever happens. In this case, most likely due to some sort of deadlock, the completion never comes. The thread hasn’t crashed at all - instead it is hanging in kernel mode waiting for something that will probably never happen.
You can see the uninterruptible sleep in action if you pull the plug on an NFS server. Processes that attempt to read from the NFS filesystem will block in state “D” until the NFS server returns, or the system is rebooted. They *cannot* be interrupted, and cannot receive signals until they wake.
IB Verbs is the interface to InfiniBand and iWARP RDMA devices (see http://openfabrics.org/). Using this from userspace (and kernel space) involves an asynchronous programming model where a request is queued and then executed by the hardware (this is also a gross simplification :). Once the operation is completed a “completion queue” event arrives, which needs to be acknowledged. If the connection (called a “queue pair” since it is a send and receive queue) is torn down with unacknowledged completions it may block waiting for them to be acknowledged. This *might* be the problem seen here, since the kernel space hang happened inside cleanup routines related to the destruction of a queue pair.
26 Mar 2008 - 8:58
A process reading on an NFS drive can be revived: unmount the NFS filesystem with -lf or so (the key here is -f). Eventually you’ll get an I/O timeout and all programs will continue working.
What are you doing? I mean why are you blocking on IO in kernelspace? Can it be converted to nonblocking IO?
26 Mar 2008 - 9:24
Yeah, the NFS mount case can now be fixed with -f. But in the meantime reading processes are in “D” and can’t be interrupted. There are other hardware issues which can cause things to be stuck in uninterruptible sleep perpetually.
The kernel space code used wait_for_completion() which causes uninterruptible sleep. I didn’t write that code, and I haven’t had time to look into why it chose this, except to say wait_for_completion is fairly common in driver code (there is now a timeoutable variant, but it is new, and may not always be appropriate in driver code). In this case I was working with beta drivers, beta hardware, so who knows.