 CD Home < Web Techniques < 1997 < May  

Searching With Isearch

Moving Beyond WAIS

By Nassib Nassar

Looking for a search engine that meets your specific requirements can be a formidable task. There are so many to choose from that you practically need a search engine to find them all. Over the years of working with the Wide Area Information Server (WAIS) model, it became clear that while searching methods and implementations were continually improving, we needed a model that would not become obsolete after a few months. Isearch was developed as a building block for beginning a new information system or adapting to an existing server. This article aims to introduce Isearch and give you an idea of how it can be used.

Search-Engine History

As search-engine expert Erik Scott puts it, "In the beginning, there was grep." Searching software has come a long way since grep, and search algorithms are continually becoming smarter and faster. One of the earliest popular search systems on the Internet was WAIS, an information system that combined an indexing and searching engine with an information-retrieval interface, making it possible to search databases anywhere on the Internet. Unfortunately, the word WAIS came to mean almost anything: an information-retrieval model, a network protocol, a search system, a software package, and a company name.

The most widely used implementation of WAIS was freeWAIS (see "Online"), a WAIS software package free to the public. In 1992, the National Science Foundation gave a grant to the MCNC Center for Networked Information Discovery and Retrieval (CNIDR) to support the freeWAIS software and develop further techniques for building distributed information systems. During this time, we at CNIDR began experimenting with a new and somewhat different WAIS model; the searching component of this new model became Isearch.

Searching versus Retrieving

The freeWAIS software was actually two programs in one: a search engine that could be used to index and search data, and a retrieval-protocol implementation that provided a standard interface for network searching. The nuts and bolts of searching took place on a single computer, but the retrieving capability allowed the search engine to be accessed from any computer on the Internet.

The protocol that freeWAIS used was an early version of ANSI/NISO Z39.50 -- a cryptic name, to say the least, but a very powerful and robust network protocol for Internet searching. Z39.50 can handle precise search queries and document-retrieval requirements that a Web interface simply cannot support in a standardized way. Companies that sell data-searching access have continued to develop the Z39.50 protocol for their customers.

Isite and Isearch

One of the problems with freeWAIS was that the search engine and the retrieval protocol were mixed together rather than separated into different modules. The C source code was also a mess and very difficult to maintain; over time, both the protocol and search engine were becoming outdated. In the course of adapting freeWAIS to various projects and updating the protocol code, Jim Fullton, Kevin Gamiel, and I began to experiment with various search-engine algorithms as alternatives to the freeWAIS search code.

Based on this work, we started developing a new information system in 1994. Kevin Gamiel rewrote the entire Z39.50 protocol engine with a modular and extensible design, and I wrote a search engine in C++; we combined these two components into a freely available software package called Isite. Isearch, the search-engine component, can be used alone without the added complexity of the Isite protocol components. Isearch-cgi, a CGI interface for Isearch, is a retrieval layer for searching a Web page that is sufficient for most Web-searching purposes. After installing Isearch and Isearch-cgi, the rest of Isite can be added to existing Isearch data collections to provide a Z39.50-compliant retrieval layer.

Getting and Installing Isearch

The latest information about Isearch can be found at the CNIDR Web site; Isearch and Isearch-cgi are available via FTP (see "Online").

These directories contain the source code and some precompiled binaries of the latest version. Isearch has been ported to run on several platforms, including Solaris, SunOS, Linux, and SGI Irix. You will need to unzip and untar the distribution package after downloading.

You can compile the source code on many UNIX systems by simply typing "make," although it is a good idea to read the README file first. The only unusual requirement for compiling Isearch is that you must have C++ fully installed on your system, including gcc and the g++ library. If you encounter problems during the "make" process, you can get help on the ISITE-L mailing list. (Send a message to listproc@cnidr.org with the command "SUBSCRIBE ISITE-L FIRSTNAME LASTNAME" as the body of your message.) After the compilation stage, or after downloading a precompiled distribution, you should run "make install" as root to install the binary files, by default to /usr/local/bin/.

There are three executable command-line utilities: Iindex, Isearch, and Iutil. Before you can perform searches on your data, you must run Iindex on it to create index files. You can search your data based on those index files, and Iutil has various functions for maintaining your index files. I generally refer to the collection of index files for a given data set as "database files," but they are not related in any way to relational databases.

Indexing Your Data

Example 1 shows the typical form for Iindex. For a database_name called /home/MYDB, for example, Iindex will create a set of files beginning with "MYDB" in the /home directory. The -t or document_type option specifies the type of indexed data; see Table 1. Iindex looks up the document_type name in a list of registered C++ classes, or "doctypes," which it calls both during indexing (to parse the field structure of your documents) and during searching (to handle presentation of documents to the user). The -m or data_block option tells Iindex how many megabytes of data to load into memory at a time (the default is 1 MB if -m is not used). In the current Isearch version, this option has a very significant impact on indexing time for data that doesn't fit in memory. If you have enough physical memory, the ideal is to set this option larger than your data collection, so that the entire indexing process is done in memory. In most cases this is not possible, and you should set it as high as you can while staying well below the amount of free physical memory.
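The sizing advice for -m can be written down as a small rule of thumb. The following Python sketch is only an illustration: the function name and the headroom factor (how much of free memory to hand to the indexer) are my assumptions, not anything Iindex defines.

```python
def choose_data_block_mb(collection_bytes, free_memory_bytes, headroom=0.5):
    """Suggest a whole-megabyte value for the Iindex -m option.

    headroom is a hypothetical safety factor: the fraction of free
    physical memory we are willing to give to the indexer.
    """
    MB = 1024 * 1024
    # Memory budget in megabytes, staying well below free physical memory.
    budget_mb = int(free_memory_bytes * headroom) // MB
    # Megabytes needed to hold the whole collection at once, rounded up.
    needed_mb = -(-collection_bytes // MB)
    if needed_mb + 1 <= budget_mb:
        # The collection fits: pick a value larger than the collection.
        return needed_mb + 1
    # Otherwise go as high as the budget allows, but never below 1 MB.
    return max(1, budget_mb)
```

The result would be passed as the number after -m. A different headroom may suit machines that are doing other work during indexing.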

At the end of the Iindex command-line options, the data files are listed or a file mask is used. A typical example might look like this:

Iindex -d WEBSITE -t HTML -m 5 /web/*.html

This creates a new database called WEBSITE in the current directory, indexing the files in /web/*.html, parsing them as HTML (using the HTML doctype), and loading 5 MB of data at a time.

The complete set of Iindex commands and the list of supported doctypes can be displayed by typing Iindex with no options; see Table 2.

To demonstrate indexing real HTML data, I have selected a set of poems by Paul Jones called "What the Welsh and Chinese Have In Common" (sunsite.unc.edu/pjones/poetry/). You can use any small collection of Web pages to follow along. To index the pages, first move to a directory where you want your Isearch database files created:

cd /DATABASES

Then create the indexes:

Iindex -d JONES -t HTML /poetry/*.html

Since my sample pages amount to less than a megabyte, the -m option is unnecessary. Iindex creates several files in the current directory.

These files contain special information for Isearch's fast searching; the program needs these database files and the original indexed data (in this case, the HTML files). If the original data has been modified in any way, you will need to re-index for searching to work properly.
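Because stale indexes are easy to overlook, a timestamp comparison can serve as a rough reminder to re-index. This Python sketch assumes, as described earlier, that every index file's name begins with the database name; the function name is hypothetical, and modification times are only a heuristic for "the data has changed."

```python
import glob
import os

def needs_reindex(database_path, data_files):
    """Return True if any data file is newer than the oldest index file.

    database_path is the name given to Iindex -d, for example
    "/DATABASES/JONES". Assumption: index file names begin with it.
    """
    index_files = glob.glob(glob.escape(database_path) + "*")
    if not index_files:
        return True  # nothing has been indexed yet
    oldest_index = min(os.path.getmtime(path) for path in index_files)
    return any(os.path.getmtime(path) > oldest_index for path in data_files)
```

A nightly job could call this with the list of indexed HTML files and run Iindex again only when it returns True.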

If you are indexing a Web site that has HTML files in subdirectories, you can use the -f option to tell Iindex to read the names of files to index from a text file that lists the names. Thus, you can build a list of the files you want to index in a directory tree. For example, using the UNIX find command and piping the output to a file:

find website/ -name "*.html" -print > filelist

We can then use that file with the -f option of Iindex:

Iindex -d WEB -m 5 -t HTML -f filelist

This would index all files matching *.html under the website/ directory.
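On systems where find is unavailable, or when the list is built from a larger script, the same file list can be produced in Python. This is a minimal sketch; the function names are mine, and the command it builds uses only the Iindex options shown above.

```python
import fnmatch
import os

def write_file_list(root, list_path, pattern="*.html"):
    """Write one matching path per line, like find root -name pattern -print."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    matches.sort()  # stable order makes the list easy to compare between runs
    with open(list_path, "w") as out:
        for path in matches:
            out.write(path + "\n")
    return matches

def iindex_command(database, list_path, doctype="HTML", data_block_mb=5):
    """Build the Iindex argument list for indexing from a file list."""
    return ["Iindex", "-d", database, "-m", str(data_block_mb),
            "-t", doctype, "-f", list_path]
```

After write_file_list("website/", "filelist"), the list returned by iindex_command("WEB", "filelist") could be handed to subprocess.run to start indexing.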

Searching

The Isearch command-line options look a lot like those of Iindex. For a typical test run with our sample database, I searched for three common poetic words: "sorrow," "vanity," and "moose":

Isearch -d JONES sorrow vanity moose

Isearch found three documents that contain at least one of the three search terms we specified, and it ranked them in order of relevance to those words; see Example 2. The scores are scaled to the range 0 to 100, and the documents are sorted in order of score. The headline for each document, known as the "brief record," is extracted by the doctype and printed out under the filename. You can select one of the files to view by number, and Isearch will print out the entire document, in this case the raw HTML file. Isearch also supports nested Boolean queries with operators such as AND and OR. To specify a Boolean query, use the -infix option:

Isearch -d JONES -infix "((night and moon and cold) or moose)"

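To see what a nested Boolean query means in terms of matching documents, it can help to evaluate one against a toy index. The Python sketch below is not how Isearch is implemented; it is a small recursive-descent evaluator over an in-memory word index. It assumes that AND binds more tightly than OR when parentheses are omitted, which the fully parenthesized query above does not depend on.

```python
import re

def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index

def evaluate(query, index):
    """Evaluate an infix query built from words, AND, OR, and parentheses."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "or":
            take()
            result = result | parse_and()  # union of matching documents
        return result

    def parse_and():
        result = parse_operand()
        while peek() == "and":
            take()
            result = result & parse_operand()  # intersection
        return result

    def parse_operand():
        token = take()
        if token == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        if token in (")", "and", "or"):
            raise ValueError("unexpected token: " + token)
        return set(index.get(token, set()))

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result
```

With an index built from a few short documents, evaluate("((night and moon and cold) or moose)", index) returns the names of documents that contain all of the first three words, plus any document that mentions a moose.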
The complete set of options for Isearch is listed in Table 3. The -p option is useful for telling Isearch to print a certain piece of each document as a headline in the list of search results. The most typical value to use with -p is the name of a field in the document. Iutil lists the recognized fields for any given doctype. In the example Iutil -d JONES -vf, where JONES was indexed with the HTML doctype, Iutil would recognize TITLE, A, A@, A@HREF, H1, D1, DT, ADDRESS, and DL.

These fields name regions of text that Iindex has recognized as subcomponents of the indexed documents. We searched only for words in the document text, but Isearch also supports "fielded searching," which is restricted to matches within a certain field. For example, the query term "kites" would match any document that contained the word "kites," but "title/kites" would only match documents that contained the word "kites" inside the TITLE field.
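The difference between a plain term and a fielded term can be illustrated with a toy index that records each word twice: once for the whole document and once for each field it occurs in. This Python sketch only mirrors the "field/word" notation described above; the data layout and function names are assumptions, not the Isearch index format.

```python
import re

WORD = re.compile(r"[a-z0-9]+")

def index_document(index, name, fields, body):
    """Record every word under the whole document and under each field."""
    for word in WORD.findall(body.lower()):
        index.setdefault((None, word), set()).add(name)
    for field, text in fields.items():
        for word in WORD.findall(text.lower()):
            index.setdefault((field.upper(), word), set()).add(name)

def match_term(index, term):
    """Match 'word' anywhere, or 'field/word' only inside that field."""
    field, slash, word = term.rpartition("/")
    key = (field.upper() if slash else None, word.lower())
    return index.get(key, set())
```

Here body is the full document text (including the text of its fields) and fields maps field names such as TITLE to their contents. A document that mentions kites only in its body matches "kites" but not "title/kites".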

Fields are also used during the presentation of documents, most commonly in the presentation of search results. By default, Isearch displays a special field called B, or Brief record. The -p option lets you specify a different field to present. In the case of the HTML doctype, B is mapped directly to the TITLE field, so -p TITLE would produce the same result as not using -p at all. With some doctypes, the Brief record is a combination of more than one field; the MAILFOLDER doctype, for example, combines the From and the Subject fields. Another special field is F, or Full record, which is always defined as the entire document. When you select a document to view in Isearch, the F field is used to retrieve the complete document. B and F are called "element sets," and are defined by the doctype; in fact, all presentation in Isearch is done by specifying element sets. With an option like -p TITLE, the element set is defined as TITLE, and in practice doctypes recognize field names as element sets. But element sets are not required to relate to fields; they have no necessary link to the list of fields returned by Iutil -vf. Element sets are abstract views of a document that can be interpreted by the doctype in any way. However, the allowable element sets are usually defined at a minimum as the document fields recognized by the doctype and the two special element sets, B and F.
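One way to picture element sets is as names that a doctype is free to interpret however it likes. The Python sketch below is a simplified model, not the Isearch C++ interface: the class names, the record layout, and the choice to join the Brief record fields with a space are all assumptions made for illustration.

```python
class Doctype:
    """Hypothetical stand-in for the presentation side of a doctype."""

    brief_fields = ()  # fields that make up the B element set

    def present(self, record, element_set):
        """Return the text of one element set for a parsed record.

        record is a dict with a "text" key (the whole document) and a
        "fields" dict mapping field names to their text.
        """
        if element_set == "F":
            return record["text"]  # Full record: always the entire document
        if element_set == "B":
            parts = [record["fields"].get(name, "") for name in self.brief_fields]
            return " ".join(part for part in parts if part)
        # Any other element set name is treated here as a field name.
        return record["fields"].get(element_set, "")

class HtmlDoctype(Doctype):
    brief_fields = ("TITLE",)  # B maps directly to TITLE

class MailFolderDoctype(Doctype):
    brief_fields = ("From", "Subject")  # B combines two fields
```

In this model, presenting B and presenting TITLE give the same result for HTML, which matches the behavior of -p TITLE described above.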

Adding a Web Interface With Isearch-cgi

The Isearch-cgi package lets you set up Web access to your Isearch database through CGI scripts. Isearch-cgi includes Configure, a special script that creates the scripts to put behind your Web server. Configure takes one argument, the directory in which your databases are located:

Configure /DATABASES

Configure creates two scripts, ifetch and isearch, which need to be copied to the cgi-bin directory for your Web server:

cp ifetch isearch /httpd/cgi-bin/

You can then create the Web-search page for your database. Isearch-cgi includes a program called search_form that makes the Web page for you; all you have to do is give it the database directory and the database name, and redirect the output to an HTML file:

search_form /DATABASES JONES >/poetry/Poems.html

That's all there is to it. You can pull up the Poems.html page and begin searching.

More About Doctypes and the Isearch Architecture

I have tried to give an impression of the Isearch basics, but the Isearch model is very customizable, and the engine's behavior can be modified depending on the document being processed. You can even mix different doctypes in the same Isearch database and search multiple data types transparently. Every time you access a document, during indexing or searching, its doctype is consulted to handle the document's special requirements. The Isearch architecture consists of the shell, the search-engine library, and the doctypes. The command-line utilities (Iindex, Isearch, and Iutil) and the Isearch-cgi utilities are examples of Isearch shells that give you a user interface for accessing the C++ search-engine library (it does the real indexing and searching work). The doctypes are intended to be configured and extended for your specific requirements.

Isearch comes with a variety of useful doctypes, many of which were developed and distributed publicly by various authors. If you know some C++, you can write your own doctypes or create descendant classes of existing doctypes. This allows you to design your customizations in individual modules, separate from the Isearch distribution, making the entire system very easy to maintain over time. For example, Listing One extends the HTML doctype to print the TITLE in italics. If the algorithms within the search-engine library are improved, the doctypes and shells can generally stay the same. If your data formats change, you can change or extend your doctypes without affecting the shell interface, the main library, or how your users perform searches. If you add the rest of the Isite package in the future, to provide Z39.50 access, the Isite server will access your data through its own Isearch shell, via the Isearch library and your doctypes.
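The descendant-class idea can be sketched in a few lines. Real doctypes are C++ classes inside Isearch; the Python analogue below only shows the shape of the technique (override one presentation step, defer everything else to the parent), and every class, method, and registry name in it is hypothetical.

```python
class HtmlDoctype:
    """Hypothetical base class: presents one field of a parsed record."""

    def present(self, fields, element_set):
        return fields.get(element_set, "")

class ItalicTitleHtml(HtmlDoctype):
    """Descendant that italicizes the TITLE and inherits everything else."""

    def present(self, fields, element_set):
        text = super().present(fields, element_set)
        if element_set == "TITLE" and text:
            return "<I>" + text + "</I>"
        return text

# Registry keyed by doctype name, analogous to the lookup done for -t.
# The name ITALICHTML is made up for this example.
DOCTYPES = {"HTML": HtmlDoctype, "ITALICHTML": ItalicTitleHtml}
```

Because the descendant lives in its own module and touches only the TITLE presentation, changes to the parent's parsing or to the search library would not require rewriting it.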

For more information and to browse some of the databases we have created with Isearch, visit the CNIDR Web site. I also recommend the UNIX Web Server Book, Second Edition, by R. Douglas Matthews et al. (Ventana Press, 1997), which includes information about Isearch and a CD of the software.

(Get the source code for this article here.)


Nassib designs distributed information systems for the MCNC Center for Networked Information Discovery and Retrieval (CNIDR) in Research Triangle Park, NC. He has been programming computers for 16 years, since the age of 9, and is also a trained classical pianist.




Copyright © 2003 CMP Media LLC