data verification

what is a checksum?

A checksum of some data (a string, file, image, CD-ROM, etc.) is a big hexadecimal number. A checksum serves as a fingerprint of the data: in practice, different pieces of data have different fingerprints.

MD5 is one of the standard ways to compute checksums; it is described in RFC 1321.

  string          md5 checksum
  hello world!    fc3ff98e8c6a0d3087d515c0473f8677
  hello world?    71dd0680b60d8231bf2b1d177c9e42bc
  world hello!    01779663d0b4a600e2e1271e0a12d84a

Two properties make md5 checksums very useful:
  1. When two files are different, their md5 checksums will almost certainly be different too, even if the files differ by only 1 bit.
    Instead of comparing two files, you can compare checksums.

  2. When a file has some md5 checksum, it is practically impossible to make another file with the same checksum.
    In other words, it is practically impossible to change a given file and keep the checksum intact.

So, if you have a copy (of a copy of a copy) of some file, you can check its validity by comparing its checksum with the checksum of the original.

If you think this is too good to be true, you are right. There are infinitely many files with the same checksum, and since 2004 researchers have shown that someone who prepares both files in advance can construct two different files with the same md5 checksum. In a practical situation, accidental collisions don't occur, so a matching checksum still shows that a copy was not damaged; to guard against deliberate tampering, verify the signature as well.
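
As a small illustration of the table above, the following Python sketch computes md5 checksums of strings with the standard hashlib module. The function name md5_hex is an assumption made for this example; it is not part of the verifier.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the md5 checksum of a string as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The three strings differ only slightly, yet their checksums are unrelated.
    for s in ("hello world!", "hello world?", "world hello!"):
        print(f"{s}  {md5_hex(s)}")
```

Run as a script, this should print one line per string, in the same layout as the table.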

how to compute a checksum

Because MD5 is a public standard, implementations are widely available. Most Unix/Linux systems have a command-line utility to compute md5 checksums; look for md5, md5sum, etc. Source is available through GNU coreutils, in the textutils subset.
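
Where no command-line utility is at hand, a checksum can also be computed with a few lines of code. The sketch below uses Python's hashlib and reads the file in chunks, so large files need not fit in memory. The names md5_of_file and matches are assumptions made for this example.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str) -> bool:
    """Compare a file's checksum with a published one, ignoring case and surrounding whitespace."""
    return md5_of_file(path) == expected.strip().lower()
```

A call such as matches("some-package.tgz", published_md5) then plays the role of comparing the output of md5sum with the published value by eye.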

For MS platforms, see the "verify your Apache software" page.

how the verifier works

data files

Currently, five (.db) data files are used to maintain state: SIGS, STAT, FILE, HIST and DEAD.

SIGS contains state about signatures and packages with missing signatures.
  good        key occurs in some KEYS file and signature can be verified
  cant        signature can't be verified ; key can't be found in any KEYS file
  bad         file/sig inconsistency
  undef       gpg balks, probably not a signature
  miss        signature without a package
  unsig       package without a signature

  key ID      or undef
  file owner  or uid
  signer      as reported by gpg
SIGS is updated hourly by mk_update.
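
To make the six states concrete, here is a minimal sketch of how a checker might choose between them. The function sigs_state, its parameters and the order of the tests are all assumptions made for illustration; the source does not say how mk_update actually decides.

```python
def sigs_state(has_package: bool, has_signature: bool,
               gpg_parsed: bool = True, key_known: bool = True,
               verified: bool = False) -> str:
    """Map (hypothetical) inspection results to one of the SIGS states."""
    if not has_package and not has_signature:
        raise ValueError("nothing to classify")
    if not has_package:
        return "miss"    # signature without a package
    if not has_signature:
        return "unsig"   # package without a signature
    if not gpg_parsed:
        return "undef"   # gpg balks, probably not a signature
    if not key_known:
        return "cant"    # key can't be found in any KEYS file
    return "good" if verified else "bad"
```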

STAT contains md5 state about all files (symlinks are ignored).
  file    md5     computed when the file is new or updated
          date    mtime of file at the moment the md5 was computed
STAT is updated hourly by mk_update. New and changed (determined by date/mtime) files are inspected.
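
The update rule for STAT can be sketched as follows: walk the tree, skip symlinks, and recompute the md5 only when a file is new or its mtime differs from the stored date. This is an illustration in Python, not the verifier's own code; the function update_stat and the in-memory dict standing in for the .db file are assumptions.

```python
import hashlib
import os


def update_stat(stat: dict, root: str) -> list:
    """Refresh stat ({path: (md5, mtime)}) for new or changed files under root.

    Returns the list of paths that were inspected.
    """
    inspected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlinks are ignored
            mtime = int(os.path.getmtime(path))
            old = stat.get(path)
            if old is not None and old[1] == mtime:
                continue  # unchanged according to date/mtime
            with open(path, "rb") as handle:
                md5 = hashlib.md5(handle.read()).hexdigest()
            stat[path] = (md5, mtime)
            inspected.append(path)
    return inspected
```

Relying on mtime keeps the hourly run cheap, at the price of missing a change that leaves the mtime untouched.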

FILE contains the inverse of STAT regarding file <-> md5. The keys are all md5's found in STAT.
  md5     files   all files matching some md5
FILE is recomputed hourly from STAT by mk_update.
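
Recomputing FILE from STAT amounts to inverting a mapping. A minimal sketch, assuming STAT is held as a dict of path to (md5, mtime); the name build_file_index is hypothetical:

```python
from collections import defaultdict


def build_file_index(stat: dict) -> dict:
    """Invert {path: (md5, mtime)} into {md5: [paths...]}."""
    index = defaultdict(list)
    for path, (md5, _mtime) in sorted(stat.items()):
        index[md5].append(path)
    return dict(index)
```

Two files with identical content end up in the same list, which is how a single md5 can match several files.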

HIST has the same structure as STAT, but contains only information on packages. Once inserted, tuples are never automatically updated or deleted. Root can delete by hand. HIST is used to detect changes in published software. HIST is updated hourly by mk_update.
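
The insert-once behaviour of HIST can be sketched as below. The functions is_package and record_history are assumptions made for this example; the tgz/zip/jar test is taken from the mk_update usage message, and plain dicts stand in for the .db files.

```python
def is_package(path: str) -> bool:
    """Treat tgz/zip/jar files as packages."""
    return path.endswith((".tgz", ".zip", ".jar"))


def record_history(hist: dict, stat: dict) -> list:
    """Insert first-seen package entries into hist; never update existing ones.

    Returns the packages whose current md5 differs from the initial md5.
    """
    changed = []
    for path, entry in stat.items():
        if not is_package(path):
            continue
        initial = hist.setdefault(path, entry)  # keeps an existing tuple as-is
        if initial[0] != entry[0]:
            changed.append(path)
    return changed
```

Because existing tuples are never overwritten, a package that is silently replaced after publication keeps showing up as changed until someone resets its history by hand.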

DEAD contains info on dead symlinks.
  file    owner   the file system owner of the dead symlink
DEAD is updated hourly by mk_update.
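
Finding dead symlinks is a short exercise: a symlink is dead when the link itself exists but its target does not. The sketch below is an illustration only; the function name find_dead_symlinks is an assumption, and the owner is reported as a numeric uid.

```python
import os


def find_dead_symlinks(root: str) -> dict:
    """Map each dead symlink under root to the uid of its file system owner."""
    dead = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # exists() follows the link, so it is False for a dangling target.
            if os.path.islink(path) and not os.path.exists(path):
                dead[path] = os.lstat(path).st_uid
    return dead
```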

the web application: md5.cgi

The web application md5.cgi takes an md5 from the browser and shows a page with information about all files matching that md5. Per file is reported:

Here is the usage message of md5.cgi:
  Usage: md5.cgi [-v] [-q] [-d] [-n] [-t] [sum=md5sum]
  The options and argument are for testing only.
  As a cgi script, the program should be used without any options.
  option v  : be verbose
  option q  : be quiet
  option d  : show debug info
  option n  : dry run
  option t  : force test ; any other option implies -t
  arguments : sum=md5sum return matching file(s) or error message
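
The lookup behind sum=md5sum can be sketched in a few lines. This is not the code of md5.cgi; the function lookup, the validation rule and the dict standing in for the FILE database are assumptions made for illustration.

```python
import re
from urllib.parse import parse_qs

MD5_RE = re.compile(r"[0-9a-f]{32}")


def lookup(query_string: str, file_index: dict) -> list:
    """Return the files matching the md5 in 'sum=...'; raise ValueError on bad input."""
    values = parse_qs(query_string).get("sum", [])
    if len(values) != 1:
        raise ValueError("expected exactly one sum=md5sum argument")
    md5 = values[0].strip().lower()
    if not MD5_RE.fullmatch(md5):
        raise ValueError("not an md5 checksum: 32 hexadecimal digits expected")
    return file_index.get(md5, [])
```

Validating the argument before touching the database means that a malformed request yields an error message rather than an empty result.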

database updates: mk_update

Program mk_update is used to update the databases ; it also generates the report pages.

Here is the usage message of mk_update:

  Usage: mk_update [-v] [-q] [-d] [-n] [-c] [-upd] [-chk] [-keys] [-page]
         mk_update -ok file
  option v    : be verbose
  option q    : be quiet
  option d    : show debug info
  option n    : dry run
  option c    : clean ; start with empty databases
  option keys : import all keys
  option upd  : compute and store md5's of new or changed files
  option chk  : compare all stored values of xxx.md5 files with md5 of xxx
              : for every tgz/zip/jar file, compare current and initial md5.
  option page : generate html pages
  option ok   : change history ; for file, set initial md5 to current md5 ;
                must run as root (or henkp :-)
  The order of processing is always : -upd -chk -keys -sigs -page
  option -d    implies -v   ; of course
  option -c    implies -upd ; to recompute the databases
  option -page implies -chk ; -page needs the date gathered by -chk
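
The implication rules and the fixed processing order in the usage message can be modelled as follows. This is a sketch of the stated rules only; the names ORDER, IMPLIES and resolve are hypothetical and not taken from mk_update.

```python
# Processing order and implications as stated in the usage message.
ORDER = ["upd", "chk", "keys", "sigs", "page"]
IMPLIES = {"d": ["v"], "c": ["upd"], "page": ["chk"]}


def resolve(options):
    """Expand implied options; return (all flags, processing steps in order)."""
    flags = set(options)
    pending = list(options)
    while pending:
        for implied in IMPLIES.get(pending.pop(), []):
            if implied not in flags:
                flags.add(implied)
                pending.append(implied)
    steps = [step for step in ORDER if step in flags]
    return flags, steps
```

For example, resolve(["page", "c"]) yields the steps upd, chk, page, regardless of the order in which the options were given.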

-c
  clear the databases ; recompute everything
-keys
  import into pgp all KEYS files from www.apache.org/dist and /home/*/.pgpkey
-upd
  1. update STAT/FILE/HIST : compute and store md5's of new or changed files ;
  2. update SIGS :
    • recompute signatures that are not good ;
    • check unsigned packages for new signatures ;
    • inspect real signature and package files found in step 1 ;
    • inspect symlinks to (real) signature and package files found in step 1 ;
-chk
  check all .md5 files against the source ; compare actual values with the values in HIST
-page
  generate the md5 checks and sig checks reports
-ok file
  for file, set initial md5 to current md5 ; when file is all, set history for all files with different initial and current md5's
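
The first half of the -chk step, comparing the value stored in xxx.md5 with the md5 of xxx, might look like the sketch below. The function check_md5_file is an assumption made for this example, and so is the way the checksum is extracted: it takes the first run of 32 hexadecimal digits found in the .md5 file, since the layout of such files varies.

```python
import hashlib
import re


def check_md5_file(md5_path: str) -> bool:
    """Compare the checksum stored in xxx.md5 with the md5 of xxx."""
    if not md5_path.endswith(".md5"):
        raise ValueError("expected a path ending in .md5")
    target = md5_path[:-len(".md5")]
    with open(md5_path, "r", errors="replace") as handle:
        found = re.search(r"\b[0-9a-fA-F]{32}\b", handle.read())
    if found is None:
        return False  # no checksum in the .md5 file
    with open(target, "rb") as handle:
        actual = hashlib.md5(handle.read()).hexdigest()
    return actual == found.group(0).lower()
```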

the perl library: Checker.pm

Most of the work is done in the perl library: Checker.pm. Some features :