|Web Techniques Magazine
Volume 1, Issue 1
By Lincoln Stein
Lincoln is the director of informatics at the MIT Genome Center, and author of the book How to Set Up and Maintain a World Wide Web Site (Addison-Wesley, 1995). He can be contacted at http://www-genome.wi.mit.edu/~lstein/.
Back when my Web site got (at most) a hundred or so accesses a day, I leisurely browsed through the access and error logs, saying to myself "Oh, here's an access from the U.K.! And here's one from Hong Kong!" But as my site became more popular and the hundreds of accesses swelled to thousands, reading through the logs became impractical. When daily accesses shot past 10,000 and the access log began to grow by five megabytes each week, I began to panic.
Logs are a blessing and a curse. On the one hand, they provide a rich source of information. You can determine who's browsing your Web site, what they're reading, and when they're reading it. With this information you can find out what works and what doesn't, adjusting your site to maximize reader satisfaction. You can also find broken links, pages that can't be read because nothing points to them, and CGI scripts that crash. Logs can also tell you when remote users are trying to break into your system. On the other hand, logs can quickly overwhelm you with information. A popular site's access log can grow by 50 megabytes a day and, unless special provisions are made to delete or cycle the logs, a site's logs can rapidly fill up a hard disk's entire partition.
Most servers provide at least two log files--the access log and error log. The access log provides a blow-by-blow accounting of every access to your server and includes goodies such as the host name of the requester, URL requested, and date and time of the request. The error log, as its name implies, logs problems--requests for documents that don't exist, attempts to access protected documents, and internal errors from CGI scripts and the server itself. Your server may provide additional log files.
Most servers use the "common" format for the access log introduced by NCSA httpd for UNIX. Each line of this file records the request for one document at your site. A typical entry from an access log might look like Example 1(a) which shows an access from host portio.wi.mit.edu on February 3, 1995 at 5:42 PM EST. The document /pictures/small_logo.gif was requested using the usual GET method. The server returned an "OK" status code of 200, then transferred 2172 bytes of image data. The -0500 after the date means that the local time zone is five hours earlier than Greenwich Mean Time, in this case Eastern Standard Time.
Example 1(b) shows the full format for the log-file entries. In particular, the rfc931 field contains the remote user name returned by an Internet-identification-service daemon running on the remote user's machine. Unfortunately, most systems don't run this service, and the field is often blank. Even if the information is provided, it's not reliable user identification.
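The common-format line is regular enough to split with a single pattern. Here is a minimal Python sketch (the article's own code is Perl; the function name and Python are my choices, not the author's). The sample line in the usage note reconstructs Example 1(a) from the details given above:

```python
import re

# Common log format: host rfc931 authuser [date] "request" status bytes
COMMON_LOG = re.compile(
    r'^(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)$'
)

def parse_common_log(line):
    """Split one common-format access-log line into named fields,
    or return None if the line doesn't match."""
    m = COMMON_LOG.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # The byte count is "-" when no document body was transferred.
    rec["bytes"] = int(rec["bytes"]) if rec["bytes"].isdigit() else 0
    return rec
```

Feeding it the Example 1(a) entry, `parse_common_log('portio.wi.mit.edu - - [03/Feb/1995:17:42:15 -0500] "GET /pictures/small_logo.gif HTTP/1.0" 200 2172')`, yields a record with the host, the blank rfc931 field ("-"), the 200 status, and the 2172-byte count.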
The Error Log
The access log records all hits on your site; the error log records only those that go awry. There are two types of entry in the error log: warning and error messages generated by the Web server itself, and error messages generated by any running CGI scripts. Example 2 shows both types of entry. The time-stamped messages that begin with the label httpd are generated by the HTTP server itself and follow a standard format. Everything else, including two messages from ppmtogif and a couple from an unidentified Perl program, is warning output generated by CGI scripts.
Because the error log can provide you with early warnings of problems, you should scan through it at regular intervals. Things to watch for include:
- "File does not exist" or "no multi in this directory," which indicate that someone tried to access a nonexistent URL. If this warning occurs frequently, it's a sign that a link is broken.
- "File permissions deny server access." The server tried to retrieve a document, but it didn't have sufficient privileges to read it. Documents you intend to send over the Web must be readable by the server. On UNIX systems, this usually means making them world readable.
- "Password mismatch" indicates that a user tried to access a protected document, but typed an incorrect password. A series of these may indicate an attempt to gain unauthorized access to private material.
- "Client denied by server configuration." Access to a directory was restricted to certain IP addresses, and someone from outside the list of allowed IP addresses tried to gain access. This may also indicate an attempt to gain unauthorized access.
- "Connection timed out" occurs when the remote browser breaks the connection before receiving a requested document in its entirety. While this usually means that the user got bored and pressed the Stop button, it may also indicate that your server is prematurely timing out connections to users on slow links. Consider increasing the timeout values in your server's configuration files.
- "Malformed header from script" is a warning that a buggy CGI script is producing bad output that the server can't interpret. Usually there's also some sort of error message from the script immediately before or after this entry in the log file.
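Scanning for the messages above can be automated. This Python sketch (my own illustration, not from the article; the function name is hypothetical) counts how often each known warning appears so you can spot, say, a broken link generating hundreds of "File does not exist" entries:

```python
from collections import Counter

# The warning strings worth flagging, taken from the list above.
PATTERNS = [
    "File does not exist",
    "no multi in this directory",
    "File permissions deny server access",
    "Password mismatch",
    "Client denied by server configuration",
    "Connection timed out",
    "Malformed header from script",
]

def scan_error_log(lines):
    """Tally occurrences of each known warning in an error-log stream."""
    counts = Counter()
    for line in lines:
        for pat in PATTERNS:
            if pat in line:
                counts[pat] += 1
    return counts
```

Run it over the error log once a week (`scan_error_log(open("error_log"))`) and a sudden spike in any one category tells you where to look first.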
If your log files runneth over, you need to get control of the situation. One option is to turn off logging altogether. This can be done in UNIX-based servers (such as NCSA httpd, Apache, and the CERN Web server) using /dev/null as the name for one or more log files in the server's configuration file, or by commenting out the log-file directive. The Windows-based server WebSite allows you to turn logging off by blanking out the name of the log files in the configuration dialog, while the Macintosh-based MacHTTP allows you to comment out the logging directive in its configuration file.
Turning off logging completely is somewhat draconian, and throws away valuable information. A better solution is to cycle the logs. At periodic intervals (typically a week), you close the current log files, rename them, and start fresh ones. The oldest logs are then archived or deleted entirely.
Many commercial Web servers come with prepackaged log-cycling utilities. For example, the popular Windows-based WebSite server does its own log cycling: "access.001" is renamed "access.002," "access.002" is renamed "access.003," and so on, and the current access log then becomes "access.001." You can signal the server to cycle the logs either by requesting a special URL called /~cycle-both or by running a program called "logcycle.exe." Using logcycle.exe and the Windows NT scheduler service (or the Windows 95 WinCron utility), you can cycle the logs every Sunday at 1:00 AM with:
at 01:00 /every:Su "c:\website\support\logcycle -ae"
You'll probably want to regularly delete the oldest log files or archive them with a compression utility such as PKZIP or GZIP. WebStar, a commercial Macintosh server, has no log-cycling utility. It's unsafe to move the log file around on the Macintosh while the server is working with it, so you'll have to shut down WebStar, rename or archive the log file by hand, and restart the server.
For UNIX-based servers, the basic log-cycle script looks something like Example 3. The UNIX file system allows you to rename the log file even while the server is active and writing to it. The last line sends a HUP signal to the server, taking advantage of the fact that most UNIX-based servers write their process IDs to a file called "httpd.pid" (or something similar). When it receives this signal, the server closes the previous log file and opens a fresh one.
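The rename-then-HUP sequence can be sketched in Python (the article's Example 3 would be a shell or Perl script; the function names, the "logs/httpd.pid" path, and the default cycle count here are illustrative assumptions):

```python
import os
import signal

def rename_plan(log, max_cycles):
    """Return (src, dst) rename pairs, oldest cycle first, so that
    log.2 -> log.3, log.1 -> log.2, and finally log -> log.1."""
    pairs = [("%s.%d" % (log, i), "%s.%d" % (log, i + 1))
             for i in range(max_cycles - 1, 0, -1)]
    pairs.append((log, log + ".1"))
    return pairs

def cycle_log(log, pid_file="logs/httpd.pid", max_cycles=4):
    """Shuffle the cycled logs down, then HUP the server so it
    closes the old log file and opens a fresh one."""
    for src, dst in rename_plan(log, max_cycles):
        if os.path.exists(src):
            # Safe on UNIX even while the server is writing to the file.
            os.rename(src, dst)
    with open(pid_file) as f:
        os.kill(int(f.read().strip()), signal.SIGHUP)
```

Renaming before signaling matters: the server keeps writing to the renamed file through its open descriptor, so no entries are lost between the rename and the HUP.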
A more sophisticated log-cycling script would be more customizable, handle the error and other logs as well as the access log, and archive old log files rather than deleting them. Listing One shows the cycling script that I use at my site. It's written in Perl and uses the gzip program to compress and archive the oldest log files. By changing the constants @LOGNAMES, %ARCHIVES, and $MAXCYCLES, you can adjust the names of the log files you wish to cycle, choose which ones to compress and archive, and determine how many logs to keep hanging around before compressing them. You can run this script by hand when needed, or install it into a cron job to run it daily or weekly.
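Listing One itself is Perl; in the same spirit, the compress-and-archive step might look like the following Python sketch (the constant and function names mimic the listing's @LOGNAMES/$MAXCYCLES idea but are my own inventions):

```python
import glob
import gzip
import os
import shutil

MAXCYCLES = 4  # keep this many uncompressed cycles around (illustrative)

def cycles_to_archive(paths, max_cycles=MAXCYCLES):
    """Pick cycled-log names (e.g. "access_log.5") whose numeric
    suffix exceeds max_cycles; those are due for compression."""
    due = []
    for path in paths:
        suffix = path.rsplit(".", 1)[-1]
        if suffix.isdigit() and int(suffix) > max_cycles:
            due.append(path)
    return due

def archive_old_cycles(log, max_cycles=MAXCYCLES):
    """gzip-compress and remove every cycle of `log` older than max_cycles."""
    for path in cycles_to_archive(glob.glob(log + ".*"), max_cycles):
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
```

A cron job running this once a week after the cycling step keeps only the most recent few logs uncompressed, with everything older tucked away as .gz files.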
It would be a shame to collect all available information, then let it go to waste. Fortunately there are a number of freely available utilities for crunching and munching log files to produce useful summaries and charts. Because it is easy to set up, my favorite log-cruncher utility is "Wusage," by Thomas Boutell. Wusage examines a week's worth of access logs and produces an HTML report that includes the total accesses, top ten hosts accessing your site, and ten most popular documents at your site.
Wusage also creates a GIF graphic showing your site's weekly usage and arranges for the graphic and weekly reports to be viewed from your Web pages. Wusage is available for UNIX, Windows, and Windows NT servers.
A more powerful log-crunching tool is Roy Fielding's "WWWStat," a Perl script that lets you summarize and sort access-log files in endless ways--by item accessed, by hour, day, week, or month of access, by the domain of the remote host, by IP address, or by country of origin. You can combine the reports produced by WWWStat with GWStat, a graphics package written by Qiegang Long, to produce a variety of colorful column, line, and pie charts graphing usage of your site. With a bit of effort, you can link these utilities up to CGI scripts or cron jobs to produce online reports.
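The core of what these tools compute--top hosts and top documents--fits in a few lines of Python (again my own sketch, not code from Wusage or WWWStat; the function name is hypothetical):

```python
import re
from collections import Counter

# host, then the requested path out of the quoted "METHOD path protocol" field.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)[^"]*" \d{3} \S+$')

def top_hosts_and_docs(lines, n=10):
    """Return the n most frequent requesting hosts and requested
    documents from a common-format access log."""
    hosts, docs = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            hosts[m.group(1)] += 1
            docs[m.group(2)] += 1
    return hosts.most_common(n), docs.most_common(n)
```

Calling `top_hosts_and_docs(open("access_log"))` gives you the two headline tables of a Wusage-style report; everything else those packages add is sorting, date bucketing, and presentation.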
In summary, log files must be treated with respect. Give them a little attention, and they'll help you keep your site continually improving. Ignore them, and they'll take over your hard drive!
For More Information
CGI Modules (Perl CGI library):
Source code listed in this article: