A checksum of some data (a string, file, image, CD-ROM, etc.) is a big hexadecimal number. A checksum serves as a fingerprint of the data: different pieces of data have different fingerprints. MD5 is one of the standard ways to compute checksums; it is described in RFC 1321.
Two properties make md5 checksums very useful:
- When two files are different, their md5 checksums will also be different, even if the files differ by only 1 bit.

      string          md5 checksum
      hello world!    fc3ff98e8c6a0d3087d515c0473f8677
      hello world?    71dd0680b60d8231bf2b1d177c9e42bc
      world hello!    01779663d0b4a600e2e1271e0a12d84a

  Instead of comparing two files, you can compare checksums. So, if you have a copy (of a copy of a copy) of some file, you can check its validity by comparing its checksum with the checksum of the original.
- When a file has some md5 checksum, it is impossible to make another file with the same checksum. It is impossible to change a file and keep the checksum intact.

If you think this is too good to be true, you are right. There are infinitely many files with the same checksum, but two different files do not end up with the same md5 checksum by accident. In a practical situation, accidental collisions don't occur. Deliberately constructed pairs of files with the same md5 checksum have been published since 2004, however, so an md5 checksum guards against corruption rather than against a determined attacker.
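As a minimal sketch of the first property, the following computes the checksums of three nearly identical strings. It assumes Python's standard `hashlib` module purely for illustration (the checker itself is a Perl program), and the helper name `md5_hex` is hypothetical.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the md5 checksum of a string as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Three strings that differ only slightly; each one gets its own checksum.
for s in ("hello world!", "hello world?", "world hello!"):
    print(f"{s:14} {md5_hex(s)}")
```

Changing a single character changes the whole checksum, which is why comparing two 32-digit values is enough to tell two files apart.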
Because MD5 is a public standard, implementations are widely available. Most Unix/Linux systems have a command-line utility to compute md5 checksums; look for md5, md5sum, etc. Source is available through GNU coreutils, in the textutils subset. For MS platforms, see the "verify your apache software" page.
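Where no command-line utility is at hand, the same check can be sketched in a few lines. This is an illustration only, assuming Python's `hashlib`; the function names `md5_of_file` and `matches_published` are hypothetical and not part of any tool mentioned here.

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the md5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path: str, published: str) -> bool:
    """Compare a file's checksum with a published one (case-insensitive)."""
    return md5_of_file(path) == published.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for large downloads such as release tarballs.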
Currently, five (.db) data files are used to maintain state about:
- signatures -- .asc and .sig files
- packages -- .tar.gz, .tgz, .Z, .zip and .jar files
- dead symlinks
- all files
- SIGS
- contains state about signatures and packages with missing signatures.
SIGS is updated hourly by mk_update.
    key                    attributes   comment
    signature or package   state        good    key occurs in some KEYS file and signature can be verified
                                        cant    signature can be verified ; key can't be found in any KEYS file
                                        bad     file/sig inconsistency
                                        undef   gpg balks, probably not a signature
                                        miss    signature without a package
                                        unsig   package without a signature
                           key          ID or undef
                           file         owner or uid
                           signer       as reported by gpg
- STAT
- contains md5 state about all files (symlinks are ignored)
STAT is updated hourly by mk_update. New and changed (determined by date/mtime) files are inspected.
    key    attributes   comment
    file   md5          computed when the file is new or updated
           date         mtime of file at the moment the md5 was computed
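The mtime-based refresh of STAT can be sketched as follows. This is not the code in mk_update; it is a small Python illustration that assumes STAT behaves like a dictionary keyed by file name with `md5` and `date` attributes, and the function name `update_stat` is hypothetical.

```python
import hashlib
import os

def update_stat(stat: dict, paths) -> list:
    """Refresh `stat` in place; return the files whose md5 was (re)computed."""
    inspected = []
    for path in paths:
        if os.path.islink(path):
            continue  # symlinks are ignored
        mtime = os.stat(path).st_mtime
        entry = stat.get(path)
        if entry is not None and entry["date"] == mtime:
            continue  # same mtime as when the md5 was computed: skip
        with open(path, "rb") as fh:
            md5 = hashlib.md5(fh.read()).hexdigest()
        stat[path] = {"md5": md5, "date": mtime}
        inspected.append(path)
    return inspected
```

Only new files and files whose mtime has moved are read, so an hourly run stays cheap when little has changed.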
- FILE
- contains the inverse of STAT regarding file <->md5. The keys are all md5's found in STAT.
FILE is recomputed hourly from STAT by mk_update.
    key    attributes   comment
    md5    files        all files matching some md5
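Recomputing FILE from STAT amounts to inverting a mapping. A minimal sketch, assuming both databases can be modelled as Python dictionaries (the name `invert_stat` is hypothetical):

```python
from collections import defaultdict

def invert_stat(stat: dict) -> dict:
    """Build the md5 -> files mapping from the file -> md5 mapping."""
    files_by_md5 = defaultdict(list)
    for path, entry in stat.items():
        files_by_md5[entry["md5"]].append(path)
    # Sort for a stable result; identical files share one md5 key.
    return {md5: sorted(paths) for md5, paths in files_by_md5.items()}
```

Because the result is derived entirely from STAT, rebuilding it from scratch on every run is simpler than keeping it in sync incrementally.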
- HIST
- HIST has the same structure as STAT, but contains only information on packages. Once inserted, tuples are never automatically updated or deleted. Root can delete by hand. HIST is used to detect changes in published software. HIST is updated hourly by mk_update.
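The insert-once behaviour of HIST, and how it exposes changed packages, can be sketched like this. The dictionary model and the names `update_hist`, `changed_packages` and `PACKAGE_SUFFIXES` are assumptions made for the example; the suffix list is the package list given earlier.

```python
PACKAGE_SUFFIXES = (".tar.gz", ".tgz", ".Z", ".zip", ".jar")

def update_hist(hist: dict, stat: dict) -> None:
    """Record the initial md5 of each package; never overwrite an entry."""
    for path, entry in stat.items():
        if path.endswith(PACKAGE_SUFFIXES) and path not in hist:
            hist[path] = dict(entry)

def changed_packages(hist: dict, stat: dict) -> list:
    """Return packages whose current md5 differs from the initial md5."""
    return sorted(
        path for path, first in hist.items()
        if path in stat and stat[path]["md5"] != first["md5"]
    )
```

Since existing entries are never touched, a package that is silently replaced keeps its original md5 in HIST and shows up as changed.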
- DEAD
- DEAD contains info on dead symlinks.
DEAD is updated hourly by mk_update.
    key    attributes   comment
    file   owner        the file system owner of the dead symlink
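Collecting the information kept in DEAD could look roughly like the sketch below. It assumes a Unix system (for the `pwd` module) and a hypothetical function name `find_dead_symlinks`; falling back to the numeric uid when no user name is known is also an assumption of the example.

```python
import os
import pwd

def find_dead_symlinks(root: str) -> dict:
    """Map each dead symlink under `root` to its file system owner."""
    dead = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # A symlink is dead when its target does not exist.
            if os.path.islink(path) and not os.path.exists(path):
                uid = os.lstat(path).st_uid  # owner of the link itself
                try:
                    owner = pwd.getpwuid(uid).pw_name
                except KeyError:
                    owner = str(uid)  # no user name for this uid
                dead[path] = owner
    return dead
```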
The web application md5.cgi takes an md5 from the browser and shows a page with information about all files matching that md5. For each file, it reports:
- a warning in red, if a bad signature exists
- the non-existence of a signature, if the file is a package
- the result of a (cheap) dynamic consistency check of the contents of a matching .md5 file, if present
- the result of a comparison of the md5 and the initial md5

Here is the usage message of md5.cgi:

    Usage: md5.cgi [-v] [-q] [-d] [-n] [-t] [sum=md5sum]
    ----------------------------------------------------------------------
    The options and argument are for testing only. As a cgi script,
    the program should be used without any options.
    ----------------------------------------------------------------------
    option v  : be verbose
    option q  : be quiet
    option d  : show debug info
    option n  : dry run
    option t  : force test ; any other option implies -t
    arguments : sum=md5sum return matching file(s) or error message
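The core lookup behind the `sum=md5sum` argument can be sketched as follows. This is not the md5.cgi source; it assumes the FILE database can be modelled as a dictionary from md5 to a list of files, and the name `lookup` and the error texts are hypothetical.

```python
import re
from urllib.parse import parse_qs

MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")

def lookup(query_string: str, file_db: dict):
    """Return (files, None) on success or (None, error message) on failure."""
    values = parse_qs(query_string).get("sum", [])
    # Reject anything that is not exactly one 32-digit hexadecimal value.
    if len(values) != 1 or not MD5_HEX.fullmatch(values[0]):
        return None, "expected exactly one sum=<32 hex digits> argument"
    files = file_db.get(values[0].lower(), [])
    if not files:
        return None, "no file matches this md5"
    return files, None
```

Validating the argument before touching the database keeps arbitrary browser input away from the lookup.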
Program mk_update is used to update the databases ; it also generates the report pages.

Here is the usage message of mk_update:

    Usage: mk_update [-v] [-q] [-d] [-n] [-c] [-upd] [-chk] [-keys] [-page]
           mk_update -ok file
    option v    : be verbose
    option q    : be quiet
    option d    : show debug info
    option n    : dry run
    option c    : clean ; start with empty databases
    option keys : import all keys
    option upd  : compute and store md5's of new or changed files
    option chk  : compare all stored values of xxx.md5 files with md5 of xxx
                : for every tgz/zip/jar file, compare current and initial md5.
    option page : generate html pages
    option ok   : change history ; for file, set initial md5 to current md5
                ; must run as root (or henkp :-)
    ----------------------------------------------------------------------
    The order of processing is always : -upd -chk -keys -sigs -page
    option -d    implies -v   ; of course
    option -c    implies -upd ; to recompute the databases
    option -page implies -chk ; -page needs the date gathered by -chk
    ----------------------------------------------------------------------
- -c
- clear the databases ; recompute everything
- -keys
- import into pgp all KEYS files from www.apache.org/dist and /home/*/.pgpkey
- -upd
- 1. update STAT/FILE/HIST : compute and store md5's of new or changed files ;
  2. update SIGS :
     - recompute signatures that are not good ;
     - check unsigned packages for new signatures ;
     - inspect real signature and package files found in step 1 ;
     - inspect symlinks to (real) signature and package files found in step 1 ;
- -chk
- check all .md5 files against the source ; compare actual values with the values in HIST
- -sigs
- -page
- generate the md5 checks and sig checks reports
- -ok file
- for file, set initial md5 to current md5 ; when file is all, set history for all files with different initial and current md5's
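The check that -chk performs on a xxx.md5 file can be sketched as below. It is an illustration only: it assumes the .md5 file contains the checksum as one unbroken run of 32 hexadecimal digits (layouts that split the digits into groups are not handled), and the name `check_md5_file` and its return values are hypothetical.

```python
import hashlib
import os
import re

MD5_RE = re.compile(r"\b[0-9a-fA-F]{32}\b")

def check_md5_file(md5_path: str) -> str:
    """Compare the value stored in xxx.md5 with the md5 of xxx.

    Returns 'ok', 'mismatch', or 'unreadable'.
    """
    if not md5_path.endswith(".md5"):
        raise ValueError("expected a path ending in .md5")
    target = md5_path[: -len(".md5")]
    with open(md5_path, "r", errors="replace") as fh:
        match = MD5_RE.search(fh.read())
    if match is None or not os.path.isfile(target):
        return "unreadable"
    with open(target, "rb") as fh:
        actual = hashlib.md5(fh.read()).hexdigest()
    return "ok" if actual == match.group(0).lower() else "mismatch"
```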
Most of the work is done in the perl library: Checker.pm. Some features :
- file locking is used for the .db files. The file locks are of the advisory, non-blocking kind.
- the relevant stuff is configurable ; it could be used to check out archive.apache.org. Brrr...
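An advisory, non-blocking file lock of the kind mentioned above can be sketched in Python with `fcntl.flock` (Checker.pm itself is Perl, so this only illustrates the idea on a Unix system; the name `try_lock` is hypothetical).

```python
import fcntl

def try_lock(path: str):
    """Return an open, exclusively locked file, or None if it is locked.

    The lock is advisory: it only affects processes that also ask for it.
    It is released when the returned file object is closed.
    """
    fh = open(path, "a")  # create the file if needed, never truncate
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh
```

With a non-blocking lock, a second update run that starts while the first is still busy can give up at once rather than queue behind it.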