Two steps:
Find the web server parent process id, then use ps to query the list of child procsses.
# PARENT=`cat /path/to/serverroot/logs/httpd.pid` # ps -ef | grep $PARENT
Use the second column as the list of all process ids.
This will include piped log programs and perhaps other support programs that start up with the server, but since they can cause hangs too it is best to gather information about them.
For the 1.3 server there is usually only one real thread, but two issues can complicate this and it is best to go ahead and try to get all thread backtraces:
With the 2.0 server, note that if mod_cgid is active then one of the child processes will have a backtrace that looks like this:
cgid_server() at 0xd84521a0 cgid_start() at 0xd84538f4 cgid_init() at 0xd84537ac ap_run_post_config() at 0x100034f4 main() at 0x10000d30
This is the cgid daemon process, and it usually doesn't contribute to a web server hang, but please include that output for completeness.
In the examples below, "12345" is the process id of the process we're getting a backtrace of. Perform the procedure on each child process.
Grab this Perl script and run it like this:
If ThreadsPerChild is > 150:
# ./aixthrdbt.pl -q 12345
The "-q" option tells it to get less information. That isn't good, but with large numbers of threads per child, dbx can be very slow at retrieving the backtraces.
If ThreadsPerChild <= 150:
# ./aixthrdbt.pl 12345 Getting backtrace for pid 12345... The backtrace for pid 12345 is in file backtraces.12345.
If ThreadsPerChild is more than 150, do this and send the output, because otherwise it will take a very long time to get all the backtraces.
# dbx -a 12345 Waiting to attach to process 12345 ... Successfully attached to httpd. warning: Directory containing httpd could not be determined. Apply 'use' command to initialize source path. Type 'help' for help. reading symbolic information ...warning: no source compiled with -g stopped in pth_usched._event_sleep [/usr/lib/libpthreads.a] at 0xd00579dc ($t3) 0xd00579dc (_event_sleep+0x90) 80410014 lwz r2,0x14(r1) (dbx) thread thread state-k wchan state-u k-tid mode held scope function $t1 wait 0xf100009e1969dfb8 running 5603495 k no sys read $t2 zomb terminated 5636279 k no sys $PTRGL >$t3 run blocked 1179771 k no sys _event_sleep $t4 wait 0xf100008790010640 blocked 4292717 k no sys _event_sleep $t5 wait 0xf10000879000b940 blocked 3031229 k no sys _event_sleep $t6 wait 0xf10000879000f340 blocked 3981345 k no sys _event_sleep $t7 wait 0xf100008790014140 blocked 5259459 k no sys _event_sleep $t8 wait 0xf10000879000b440 blocked 2949243 k no sys _event_sleep $t9 wait 0xf100008790012440 blocked 4784355 k no sys _event_sleep $t10 wait 0xf1000087900108c0 blocked 4333687 k no sys _event_sleep $t11 wait 0xf10000879000c540 blocked 3227757 k no sys _event_sleep $t12 wait 0xf10000879000d7c0 blocked 3530975 k no sys _event_sleep $t13 wait 0xf10000879000a240 blocked 2654383 k no sys _event_sleep $t14 wait 0xf100008790002640 blocked 622809 k no sys _event_sleep $t15 wait 0xf10000879000c3c0 blocked 3203173 k no sys _event_sleep $t16 wait 0xf100008790016640 blocked 5865705 k no sys _event_sleep $t17 wait 0xf1000087900126c0 blocked 4825323 k no sys _event_sleep $t18 wait 0xf100008790015040 blocked 5505171 k no sys _event_sleep $t19 wait 0xf10000879000c640 blocked 3244143 k no sys _event_sleep $t20 wait 0xf10000879000dcc0 blocked 3612869 k no sys _event_sleep $t21 wait 0xf100008790013840 blocked 5111857 k no sys _event_sleep $t22 wait 0xf10000879000f5c0 blocked 4022313 k no sys _event_sleep $t23 wait 0xf100008790004240 blocked 1081589 k no sys _event_sleep $t24 wait 0xf10000879000f840 blocked 4063281 k no sys _event_sleep $t25 wait 0xf10000879000cdc0 blocked 3367051 k no sys _event_sleep $t26 wait 0xf10000879000df40 blocked 3653837 k no sys _event_sleep $t27 wait 0xf10000879000ec40 blocked 3866625 k no sys _event_sleep $t28 wait 0xf1000089c3f6aa00 running 3899401 k no sys socket
At this point, we sort of know some information about the different threads but it will take too long to get backtraces of all of them. As a compromise between quick and completeness, get a backtrace for a one thread per unique function listed. For example, there are several different functions listed in the previous display: read (for thread 1), $PTRGL (for thread 2), _event_sleep (for most of the threads), and socket (for thread 28). Run these commands to get a backtrace for each different function listed:
(dbx) th current 1 (dbx) where (dbx) th current 2 (dbx) where (dbx) th current 3 (dbx) where (dbx) th current 28 (dbx) whereOkay, now detach to exit dbx and leave the process unchanged:
(dbx) detach
As long as there are not hundreds of threads per child process, do this for each child process and save the output:
# dbx -a 12345 Waiting to attach to process 12345 ... Successfully attached to httpd. warning: Directory containing httpd could not be determined. Apply 'use' command to initialize source path. Type 'help' for help. reading symbolic information ...warning: no source compiled with -g stopped in pth_usched._event_sleep [/usr/lib/libpthreads.a] at 0xd00579dc ($t3) 0xd00579dc (_event_sleep+0x90) 80410014 lwz r2,0x14(r1) (dbx) thread thread state-k wchan state-u k-tid mode held scope function $t1 wait 0xf100009e1969dfb8 running 5603495 k no sys read $t2 zomb terminated 5636279 k no sys $PTRGL >$t3 run blocked 1179771 k no sys _event_sleep $t4 wait 0xf100008790010640 blocked 4292717 k no sys _event_sleep $t5 wait 0xf10000879000b940 blocked 3031229 k no sys _event_sleep $t6 wait 0xf10000879000f340 blocked 3981345 k no sys _event_sleep $t7 wait 0xf100008790014140 blocked 5259459 k no sys _event_sleep $t8 wait 0xf10000879000b440 blocked 2949243 k no sys _event_sleep $t9 wait 0xf100008790012440 blocked 4784355 k no sys _event_sleep $t10 wait 0xf1000087900108c0 blocked 4333687 k no sys _event_sleep $t11 wait 0xf10000879000c540 blocked 3227757 k no sys _event_sleep $t12 wait 0xf10000879000d7c0 blocked 3530975 k no sys _event_sleep $t13 wait 0xf10000879000a240 blocked 2654383 k no sys _event_sleep $t14 wait 0xf100008790002640 blocked 622809 k no sys _event_sleep $t15 wait 0xf10000879000c3c0 blocked 3203173 k no sys _event_sleep $t16 wait 0xf100008790016640 blocked 5865705 k no sys _event_sleep $t17 wait 0xf1000087900126c0 blocked 4825323 k no sys _event_sleep $t18 wait 0xf100008790015040 blocked 5505171 k no sys _event_sleep $t19 wait 0xf10000879000c640 blocked 3244143 k no sys _event_sleep $t20 wait 0xf10000879000dcc0 blocked 3612869 k no sys _event_sleep $t21 wait 0xf100008790013840 blocked 5111857 k no sys _event_sleep $t22 wait 0xf10000879000f5c0 blocked 4022313 k no sys _event_sleep $t23 wait 0xf100008790004240 blocked 1081589 k no sys _event_sleep $t24 wait 0xf10000879000f840 blocked 4063281 k no sys _event_sleep $t25 wait 0xf10000879000cdc0 blocked 3367051 k no sys _event_sleep $t26 wait 0xf10000879000df40 blocked 3653837 k no sys _event_sleep $t27 wait 0xf10000879000ec40 blocked 3866625 k no sys _event_sleep $t28 wait 0xf1000089c3f6aa00 running 3899401 k no sys socket (dbx) th current 1 (dbx) where glink.read() at 0x1002e080 ap_mpm_pod_check() at 0x1002df38 ap_mpm_pod_check() at 0x1002df38 child_main() at 0x1002a408 make_child() at 0x1002a0dc startup_children() at 0x10029f0c ap_mpm_run() at 0x1002c0d4 main() at 0x10000d8c (dbx) th current 2 (dbx) where (and so on) (dbx) detach #
# pstack 12345 > backtraces.12345