getting backtraces of running children

Two steps:

  1. get list of child processes
  2. get backtrace of each thread in each child process

getting list of child processes

Find the web server parent process id, then use ps to query the list of child procsses.

# PARENT=`cat /path/to/serverroot/logs/httpd.pid`
# ps -ef | grep $PARENT

Use the second column as the list of all process ids.

This will include piped log programs and perhaps other support programs that start up with the server, but since they can cause hangs too it is best to gather information about them.

getting a backtrace of each thread within each child process

For the 1.3 server there is usually only one real thread, but two issues can complicate this and it is best to go ahead and try to get all thread backtraces:

  • some third party modules create their own threads, and we need to know what they are doing
  • on Solaris, there are some system-created threads that don't do anything interesting; by getting all threads on Solaris we make sure we know what the interesting ones are doing

    With the 2.0 server, note that if mod_cgid is active then one of the child processes will have a backtrace that looks like this:

    cgid_server() at 0xd84521a0
    cgid_start() at 0xd84538f4
    cgid_init() at 0xd84537ac
    ap_run_post_config() at 0x100034f4
    main() at 0x10000d30
    

    This is the cgid daemon process, and it usually doesn't contribute to a web server hang, but please include that output for completeness.

    In the examples below, "12345" is the process id of the process we're getting a backtrace of. Perform the procedure on each child process.

    AIX, via Perl script

    Grab this Perl script and run it like this:

    If ThreadsPerChild is > 150:

    # ./aixthrdbt.pl -q 12345
    

    The "-q" option tells it to get less information. That isn't good, but with large numbers of threads per child, dbx can be very slow at retrieving the backtraces.

    If ThreadsPerChild <= 150:

    # ./aixthrdbt.pl 12345
    Getting backtrace for pid 12345...
    The backtrace for pid 12345 is in file backtraces.12345.
    
    AIX, the hard way, in case you don't have Perl installed

    If ThreadsPerChild is more than 150, do this and send the output, because otherwise it will take a very long time to get all the backtraces.

    # dbx -a 12345
    Waiting to attach to process 12345 ...
    Successfully attached to httpd.
    warning: Directory containing httpd could not be determined.
    Apply 'use' command to initialize source path.
    
    Type 'help' for help.
    reading symbolic information ...warning: no source compiled with -g
    
    stopped in pth_usched._event_sleep [/usr/lib/libpthreads.a] at
    0xd00579dc ($t3)
    0xd00579dc (_event_sleep+0x90) 80410014        lwz   r2,0x14(r1)
    (dbx) thread
     thread  state-k     wchan    state-u    k-tid   mode held scope
     function
     $t1     wait      0xf100009e1969dfb8 running  5603495     k   no  sys  read
     $t2     zomb                 terminated 5636279     k   no   sys  $PTRGL
    >$t3     run                  blocked  1179771     k   no   sys _event_sleep
     $t4     wait      0xf100008790010640 blocked  4292717     k   no  sys  _event_sleep
     $t5     wait      0xf10000879000b940 blocked  3031229     k   no  sys  _event_sleep
     $t6     wait      0xf10000879000f340 blocked  3981345     k   no  sys  _event_sleep
     $t7     wait      0xf100008790014140 blocked  5259459     k   no  sys  _event_sleep
     $t8     wait      0xf10000879000b440 blocked  2949243     k   no  sys  _event_sleep
     $t9     wait      0xf100008790012440 blocked  4784355     k   no  sys  _event_sleep
     $t10    wait      0xf1000087900108c0 blocked  4333687     k   no  sys  _event_sleep
     $t11    wait      0xf10000879000c540 blocked  3227757     k   no  sys  _event_sleep
     $t12    wait      0xf10000879000d7c0 blocked  3530975     k   no  sys  _event_sleep
     $t13    wait      0xf10000879000a240 blocked  2654383     k   no  sys  _event_sleep
     $t14    wait      0xf100008790002640 blocked   622809     k   no  sys  _event_sleep
     $t15    wait      0xf10000879000c3c0 blocked  3203173     k   no  sys  _event_sleep
     $t16    wait      0xf100008790016640 blocked  5865705     k   no  sys  _event_sleep
     $t17    wait      0xf1000087900126c0 blocked  4825323     k   no  sys  _event_sleep
     $t18    wait      0xf100008790015040 blocked  5505171     k   no  sys  _event_sleep
     $t19    wait      0xf10000879000c640 blocked  3244143     k   no  sys  _event_sleep
     $t20    wait      0xf10000879000dcc0 blocked  3612869     k   no  sys  _event_sleep
     $t21    wait      0xf100008790013840 blocked  5111857     k   no  sys  _event_sleep
     $t22    wait      0xf10000879000f5c0 blocked  4022313     k   no  sys  _event_sleep
     $t23    wait      0xf100008790004240 blocked  1081589     k   no  sys  _event_sleep
     $t24    wait      0xf10000879000f840 blocked  4063281     k   no  sys  _event_sleep
     $t25    wait      0xf10000879000cdc0 blocked  3367051     k   no  sys  _event_sleep
     $t26    wait      0xf10000879000df40 blocked  3653837     k   no  sys  _event_sleep
     $t27    wait      0xf10000879000ec40 blocked  3866625     k   no  sys  _event_sleep
     $t28    wait      0xf1000089c3f6aa00 running  3899401     k   no  sys  socket
    

    At this point, we sort of know some information about the different threads but it will take too long to get backtraces of all of them. As a compromise between quick and completeness, get a backtrace for a one thread per unique function listed. For example, there are several different functions listed in the previous display: read (for thread 1), $PTRGL (for thread 2), _event_sleep (for most of the threads), and socket (for thread 28). Run these commands to get a backtrace for each different function listed:

    (dbx) th current 1
    (dbx) where
    (dbx) th current 2
    (dbx) where
    (dbx) th current 3
    (dbx) where
    (dbx) th current 28
    (dbx) where
    
    Okay, now detach to exit dbx and leave the process unchanged:
    (dbx) detach
    

    As long as there are not hundreds of threads per child process, do this for each child process and save the output:

    # dbx -a 12345
    Waiting to attach to process 12345 ...
    Successfully attached to httpd.
    warning: Directory containing httpd could not be determined.
    Apply 'use' command to initialize source path.
    
    Type 'help' for help.
    reading symbolic information ...warning: no source compiled with -g
    
    stopped in pth_usched._event_sleep [/usr/lib/libpthreads.a] at
    0xd00579dc ($t3)
    0xd00579dc (_event_sleep+0x90) 80410014        lwz   r2,0x14(r1)
    (dbx) thread
     thread  state-k     wchan    state-u    k-tid   mode held scope
     function
     $t1     wait      0xf100009e1969dfb8 running  5603495     k   no  sys  read
     $t2     zomb                 terminated 5636279     k   no   sys  $PTRGL
    >$t3     run                  blocked  1179771     k   no   sys _event_sleep
     $t4     wait      0xf100008790010640 blocked  4292717     k   no  sys  _event_sleep
     $t5     wait      0xf10000879000b940 blocked  3031229     k   no  sys  _event_sleep
     $t6     wait      0xf10000879000f340 blocked  3981345     k   no  sys  _event_sleep
     $t7     wait      0xf100008790014140 blocked  5259459     k   no  sys  _event_sleep
     $t8     wait      0xf10000879000b440 blocked  2949243     k   no  sys  _event_sleep
     $t9     wait      0xf100008790012440 blocked  4784355     k   no  sys  _event_sleep
     $t10    wait      0xf1000087900108c0 blocked  4333687     k   no  sys  _event_sleep
     $t11    wait      0xf10000879000c540 blocked  3227757     k   no  sys  _event_sleep
     $t12    wait      0xf10000879000d7c0 blocked  3530975     k   no  sys  _event_sleep
     $t13    wait      0xf10000879000a240 blocked  2654383     k   no  sys  _event_sleep
     $t14    wait      0xf100008790002640 blocked   622809     k   no  sys  _event_sleep
     $t15    wait      0xf10000879000c3c0 blocked  3203173     k   no  sys  _event_sleep
     $t16    wait      0xf100008790016640 blocked  5865705     k   no  sys  _event_sleep
     $t17    wait      0xf1000087900126c0 blocked  4825323     k   no  sys  _event_sleep
     $t18    wait      0xf100008790015040 blocked  5505171     k   no  sys  _event_sleep
     $t19    wait      0xf10000879000c640 blocked  3244143     k   no  sys  _event_sleep
     $t20    wait      0xf10000879000dcc0 blocked  3612869     k   no  sys  _event_sleep
     $t21    wait      0xf100008790013840 blocked  5111857     k   no  sys  _event_sleep
     $t22    wait      0xf10000879000f5c0 blocked  4022313     k   no  sys  _event_sleep
     $t23    wait      0xf100008790004240 blocked  1081589     k   no  sys  _event_sleep
     $t24    wait      0xf10000879000f840 blocked  4063281     k   no  sys  _event_sleep
     $t25    wait      0xf10000879000cdc0 blocked  3367051     k   no  sys  _event_sleep
     $t26    wait      0xf10000879000df40 blocked  3653837     k   no  sys  _event_sleep
     $t27    wait      0xf10000879000ec40 blocked  3866625     k   no  sys  _event_sleep
     $t28    wait      0xf1000089c3f6aa00 running  3899401     k   no  sys  socket
    (dbx) th current 1
    (dbx) where
    glink.read() at 0x1002e080
    ap_mpm_pod_check() at 0x1002df38
    ap_mpm_pod_check() at 0x1002df38
    child_main() at 0x1002a408
    make_child() at 0x1002a0dc
    startup_children() at 0x10029f0c
    ap_mpm_run() at 0x1002c0d4
    main() at 0x10000d8c
    (dbx) th current 2
    (dbx) where
    (and so on)
    (dbx) detach
    # 
    
    Solaris
    # pstack 12345 > backtraces.12345