NeatSeeker User's and Developer's Guide

Sami Lempinen

Table of Contents

1. Introduction

About NeatSeeker
NeatSeeker Framework License

2. Search servlet quickstart

Editing the properties file
Copying the servlet components to the container space
Indexing the HTML files
Restarting your servlet container

3. Developing and extending NeatSeeker

Using the Ant build system
Extending NeatSeeker

A. NeatSeeker Indexer-HOWTO

1. Introduction

Table of Contents

About NeatSeeker
NeatSeeker Framework License

NeatSeeker is a collection of Java classes for building search engines. It also offers reference implementations of an HTML indexer and a Servlet API 2.2 compliant Java servlet that can be used for indexing and searching web sites. If you are running a Servlet API 2.2 compatible servlet container such as Apache Tomcat, NeatSeeker provides an easy way of adding search capability to your website.

NeatSeeker is published under an Apache-style license. You must read and approve the conditions presented in it before using the software. You are free to distribute NeatSeeker to the extent permitted by the license, and you are most welcome to participate in the development effort.

The purpose of the NeatSeeker project is to develop a simple, all-purpose indexing and searching framework that can be used in a CGI environment, with Java servlets and from the command line. NeatSeeker does not require a database backend, as it uses serialized Java objects for storing the indexes.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/).

NeatSeeker Framework License

Before using and/or distributing NeatSeeker, you must accept the conditions of the following license:

/*
 * $Id: guide.xml,v 1.1 2000/10/07 14:23:48 lempinen Exp $
 * ====================================================================
 * 
 * The NeatSeeker Framework License, Version 1.0
 * Based on the Apache Software License
 *
 * Copyright (c) 1999 Sami Lempinen (lempinen@iki.fi). All rights 
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer. 
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution, if
 *    any, must include the following acknowlegement:  
 *       "This product includes NeatSeeker software developed by
 *        Sami Lempinen (lempinen@iki.fi)."
 *    Alternately, this acknowlegement may appear in the software itself,
 *    if and wherever such third-party acknowlegements normally appear.
 *
 * 4. The names "NeatSeeker", "NeatMaker", and "Sami Lempinen"
 *    must not be used to endorse or promote products derived
 *    from this software without prior written permission. For written 
 *    permission, please contact lempinen@iki.fi.
 *
 * 5. Products derived from this software may not be called "NeatSeeker"
 *    nor may "NeatSeeker" appear in their names without prior written
 *    permission of Sami Lempinen.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL SAMI LEMPINEN OR NEATSEEKER
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This product includes software developed by the 
 * Apache Software Foundation (http://www.apache.org/).
 *
 */

2. Search servlet quickstart

Table of Contents

Editing the properties file
Copying the servlet components to the container space
Indexing the HTML files
Restarting your servlet container

This chapter gives step-by-step instructions on how to quickly add search capability to your website. The process here assumes that the following prerequisites are met:

You have downloaded the NeatSeeker software from the NeatSeeker SourceForge project site and unpacked the distribution.
You have a functional installation of a Java 2 runtime or development environment.
You have a functional Servlet API 2.2 compatible Java servlet container (e.g. Apache Tomcat).

Setting up a search engine for your web site consists of the following simple steps:

Editing the properties file to match your environment.
Copying the Java classes to the servlet container space.
Indexing your files.
(Possibly) restarting your servlet container.

Note

The instructions given here do not require any external tools. However, you can perform most of the tasks described here automatically using the Jakarta Ant build tool. For more information, see Chapter 3, Developing and extending NeatSeeker.

Also, the instructions assume a UNIX-based system. For other operating systems, you need to adapt the instructions accordingly.

Editing the properties file

A property file is used both during indexing and searching, and it is generally a good idea to use the same property file both for indexing and running the search servlet.

The build/ subdirectory contains an already-built servlet component hierarchy for NeatSeeker. Before packaging the components into a WAR file or copying them to the servlet context, you need to edit the build/WEB-INF/NeatSeeker.properties file.

Take the following steps:

Open the property file build/WEB-INF/NeatSeeker.properties in a text editor of your choice.

Change the necessary properties to reflect your environment. The property file is well documented in itself, and for now, you should pay most attention to the following items:

Table 2.1. Common configuration options

Option	Description
neatseeker.repository	Where the search engine index is stored
neatseeker.stopwords	Whether or not a stopword list is used during indexing. If you set this to `true`, make sure that you provide a path to a plain-text stopword list with the `neatseeker.stoplist` option.
neatseeker.cleanup	Whether the existing index should be deleted when you start the indexing process
neatseeker.startpoint	The top-level directory for the indexing process. If the HTML files you want to index are located in `/home/httpd/html/`, for instance, specify that path here.
neatseeker.includes	A Jakarta Ant style include pattern list, which tells the indexer which files should be entered in the index. The setting `**/*.html,**/*.htm`, for instance, will cause all files ending in `.html` or `.htm` to be included in the index.
neatseeker.resultprefix	A prefix added to the search results when they are displayed. The index stores the file names relative to the `neatseeker.startpoint` path. Often, it is necessary to prepend e.g. `http://hostname/` in order to convert the path to a valid URL.

Once you have made the necessary changes, save the file and exit the text editor.
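
To make the options in Table 2.1 more concrete, the following sketch loads a small set of illustrative values with java.util.Properties and shows how a result prefix could turn a path stored relative to the start point into a URL. Every value (paths, host name) is made up for the example, and the ConfigSketch class and its toUrl helper are hypothetical; this is not NeatSeeker's own code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ConfigSketch {
    // Illustrative values only; adjust every path and host name to your site.
    static final String SAMPLE =
          "neatseeker.repository   = /var/neatseeker/index\n"
        + "neatseeker.stopwords    = false\n"
        + "neatseeker.cleanup      = true\n"
        + "neatseeker.startpoint   = /home/httpd/html/\n"
        + "neatseeker.includes     = **/*.html,**/*.htm\n"
        + "neatseeker.resultprefix = http://hostname/\n";

    // Parses property-file text into a Properties object.
    static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Hypothetical helper: prepend the result prefix to a path that the
    // index stores relative to neatseeker.startpoint.
    static String toUrl(Properties p, String relativePath) {
        return p.getProperty("neatseeker.resultprefix", "") + relativePath;
    }

    public static void main(String[] args) {
        Properties p = load(SAMPLE);
        System.out.println(toUrl(p, "docs/index.html"));
    }
}
```

With the sample values, a file indexed as docs/index.html would be shown in the results as http://hostname/docs/index.html.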

Copying the servlet components to the container space

Next, the compiled servlet components in the build/ subdirectory need to be copied to the servlet container web applications directory. With Apache Tomcat, for example, the default location for this is the webapps directory.

If we assume that the webapps directory is located in /opt/apps/jakarta-tomcat/webapps, you can use the following commands to copy the servlet components to a servlet context called neatseeker:

% mkdir /opt/apps/jakarta-tomcat/webapps/neatseeker/
% cp -r build/* /opt/apps/jakarta-tomcat/webapps/neatseeker/

On non-UNIX platforms, use the appropriate tool (e.g. the Windows Explorer) to copy the contents of the build/ directory to the servlet container space.

It's that simple. If you want to learn how to perform the above step as part of an automated build/install process, see Chapter 3, Developing and extending NeatSeeker.

Indexing the HTML files

Once the servlet components have been installed, you are ready to start the indexing process.

Note

In version 1.0, the indexing process is rather memory and CPU intensive, as the index is held in memory before dumping it onto disk. For reasonably sized webspaces (< 100 MB), this should not be a problem. According to my own observations, the indexing process memory requirement is approximately 30-50% of the size of the indexable material.
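
As a rough worked example of that rule of thumb, the sketch below computes the 30-50% range for a given amount of indexable material. The class and method names are hypothetical, and the percentages are only the observation quoted above, not a guarantee.

```java
final class MemoryEstimate {
    // Returns {low, high} in MB, applying the 30-50% observation
    // to the size of the indexable material.
    static double[] estimateMb(double materialMb) {
        return new double[] { materialMb * 0.30, materialMb * 0.50 };
    }

    public static void main(String[] args) {
        double[] range = estimateMb(80);
        System.out.printf("Expect roughly %.0f-%.0f MB of memory%n", range[0], range[1]);
    }
}
```

For 80 MB of HTML this gives a range of about 24 to 40 MB.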

The WEB-INF/bin/neatmaker shell script provides a convenient wrapper around the indexer class (lempinen.neatseeker.core.NeatMaker). If you are on a UNIX-based system that is shell script capable, perform the following steps to build the index:

Change to the WEB-INF/bin directory in the servlet context where you copied the servlet components. For example:
```
% cd /opt/apps/jakarta-tomcat/webapps/neatseeker/WEB-INF/bin
```
Enter the following command to start the indexing:
```
% ./neatmaker -Dneatseeker.properties=../NeatSeeker.properties 
```
The indexing process starts, providing you with information about its progress.

Note

The NeatSeeker property management mechanism allows you to conveniently override any of the properties in the property file using command line system properties. If, for example, you want to create an index in a different place, you can append -Dneatseeker.repository=/different/path to the above command.
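
One common way to get this kind of override behaviour is to consult the JVM system properties first and fall back to the property file. The sketch below shows that pattern under the assumption that a -D value simply wins over the file value; the OverrideSketch class and its lookup method are hypothetical and may differ from how NeatSeeker implements it internally.

```java
import java.util.Properties;

final class OverrideSketch {
    // Returns the -D system property if one is set, otherwise the
    // value read from the property file (or null if neither exists).
    static String lookup(Properties fileProps, String key) {
        String override = System.getProperty(key);
        return override != null ? override : fileProps.getProperty(key);
    }

    public static void main(String[] args) {
        Properties fileProps = new Properties();
        fileProps.setProperty("neatseeker.repository", "/var/neatseeker/index");
        // Run with -Dneatseeker.repository=/different/path to see the override.
        System.out.println(lookup(fileProps, "neatseeker.repository"));
    }
}
```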

If you are on a non-shell-script-capable system, you can start the indexing using the Java command-line tool. Take the following steps:

Change to the WEB-INF directory in the servlet container space.

Enter the following command:

C:\> java -classpath lib/neatseeker.jar;lib/Tidy-unofficial.jar
     -Dneatseeker.properties=NeatSeeker.properties
     lempinen.neatseeker.core.NeatMaker

(All of the above should be entered on a single command line. The classpath separator shown, a semicolon, is the Windows form; UNIX-based systems use a colon.)

In an environment where the documents are modified often, it may be beneficial to run the indexer periodically, e.g. from the cron daemon.

Restarting your servlet container

As the last step in the process, you should (at least with Tomcat) restart the servlet container so that it will detect the newly installed NeatSeeker.

After you have restarted the servlet engine, direct your web browser to the servlet context and try the search engine. The default search page gives instructions concerning the query syntax.

3. Developing and extending NeatSeeker

Table of Contents

Using the Ant build system
Extending NeatSeeker

This chapter gives information about extending NeatSeeker to provide indexing and searching methods for different types of material.

Using the Ant build system

Before you start working with the NeatSeeker source code, you should familiarise yourself with the Jakarta Ant build system. Ant is a cross-platform all-Java general purpose build tool that is particularly well suited for building Java projects. If you do not have Ant yet, get it from the Apache Jakarta website. The build file supplied with NeatSeeker is compatible with the stable 1.1 build of Ant.

The NeatSeeker build file has a number of targets for performing different tasks. These are listed in the following table.

Table 3.1. NeatSeeker build targets

Target	Purpose	Example
compile (default)	Compiles the Java sources into bytecode classes.	`ant compile`
jar	Builds a JAR of the NeatSeeker classes.	`ant jar`
build	Constructs a servlet build in the `build/` subdirectory.	`ant build`
war	Constructs a WAR (Web Application Archive) from the contents of the `build/` directory.	`ant war`
install_servlet	Installs the contents of the `build/` subdirectory into a servlet context.	`ant -Dwebapps=/path/to/webapps -Dcontext=neatseeker install_servlet`
install_war	Installs the WAR file into a servlet context.	`ant -Dwebapps=/path/to/webapps -Dcontext=neatseeker install_war`
javadoc	Builds the Javadoc documentation.	`ant javadoc`

Extending NeatSeeker

The NeatSeeker framework has been designed to be as easily extensible as possible. It requires very little effort to develop new types of indexers for different types of data (email, text files, etc.).

The doc/Indexer-HOWTO.txt file contains a brief introduction to the Indexer interface architecture. The same information can be found in Appendix A of this document.

Also, the Javadoc documentation should be fairly helpful.

I welcome any questions you may have concerning NeatSeeker development. Apart from mailing to me directly, you have the following resources at your disposal:

Any feedback you may have about installing, using or extending NeatSeeker is very welcome.

A. NeatSeeker Indexer-HOWTO

The following document gives a brief description of the tasks involved in developing a new type of indexer for NeatSeeker. The document is also available on the project documentation page.

================================================================
HOW TO WRITE A NEATSEEKER INDEXER

Author: Sami Lempinen (lempinen@iki.fi)
$Id: Indexer-HOWTO.txt,v 1.3 2000/10/07 14:24:32 lempinen Exp $
================================================================

1. Introduction

The easiest way to develop an indexer for arbitrary material (text,
XML etc.) is to inherit lempinen.neatseeker.core.AbstractIndexer. It
has a number of convenience functions already defined, and leaves only
the details up to you.

Writing an indexer requires that you implement the
lempinen.neatseeker.core.Indexer interface. If you bypass
AbstractIndexer, you need to implement the methods defined in the
interface yourself. If you inherit AbstractIndexer, you only need to
implement

	public void init(Configuration c);

	       and

	public void process (InputStream in, String uri);

In fact, if your implementation does not require additional attributes
that need initialising, you do not even need to worry about
implementing init().

2. Program control flow

When a calling party (e.g. lempinen.neatseeker.core.NeatMaker)
requests an Indexer using the IndexerFactory.getIndexer static method,
a new Indexer is created according to the configuration parameters
(see section 4 below) and its init() method is called with the
Configuration object as an argument. Furthermore, the Factory sets the
Collector partner for the Indexer according to the settings in the
Configuration.

To start the actual indexing, the calling party calls the start()
method of the Indexer object. The default action in AbstractIndexer is
to simply continue by calling the start() method of the Collector, and
there should be no reason to override this.

The Collector then starts traversing either the filesystem or a
networked resource and calls back the process() function for each
resource it encounters.
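
The callback flow described above can be sketched as follows. All types
in the sketch are simplified, hypothetical stand-ins (a Map replaces
the Configuration, and the Collector walks an in-memory map instead of
a filesystem); only the order of the calls is taken from this HOWTO.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ControlFlowSketch {
    // Hypothetical stand-in for the Indexer interface.
    interface Indexer {
        void init(Map<String, String> config);
        void start();
        void process(InputStream in, String uri);
    }

    // Stand-in Collector: visits each resource and calls back process().
    static final class Collector {
        private final Map<String, String> resources;
        private final Indexer indexer;

        Collector(Map<String, String> resources, Indexer indexer) {
            this.resources = resources;
            this.indexer = indexer;
        }

        void start() {
            for (Map.Entry<String, String> e : resources.entrySet()) {
                byte[] data = e.getValue().getBytes(StandardCharsets.UTF_8);
                indexer.process(new ByteArrayInputStream(data), e.getKey());
            }
        }
    }

    // Minimal indexer that only records which URIs it was handed.
    static final class RecordingIndexer implements Indexer {
        final List<String> seen = new ArrayList<>();
        private Collector collector;

        void setCollector(Collector c) { collector = c; }
        public void init(Map<String, String> config) { /* nothing to set up */ }
        public void start() { collector.start(); } // delegate, as AbstractIndexer does
        public void process(InputStream in, String uri) { seen.add(uri); }
    }

    // Plays the roles of the factory (create, init, set Collector)
    // and of the calling party (start).
    static List<String> run(Map<String, String> resources) {
        RecordingIndexer indexer = new RecordingIndexer();
        indexer.init(Collections.<String, String>emptyMap());
        indexer.setCollector(new Collector(resources, indexer));
        indexer.start();
        return indexer.seen;
    }

    public static void main(String[] args) {
        Map<String, String> resources = new LinkedHashMap<>();
        resources.put("index.html", "<p>hello</p>");
        resources.put("about.html", "<p>about</p>");
        System.out.println(run(resources));
    }
}
```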

3. Processing the data

This way, you can implement a simple indexer by specifying appropriate
actions for process(InputStream in, String uri).

The InputStream is your indexable data. The URI string is given by
the Collector as a unique identifier for the indexable resource. The
processing can then proceed as follows:

 - You should create a new Target in the beginning of process and use
   its setURI() method to set the URI to the string passed in by the
   Collector.

 - Use the repository lookahead method getNextID() to obtain a unique
   NeatSeeker identifier for the Target.

 - For every term you index, create an Entry that contains the term,
   the ID returned by getNextID() and the word position of the term
   within the resource.

 - Call the add(Entry) method in AbstractIndexer to add the term to
   the index.

 - At the end of processing a resource, update the index Statistics
   object with the incrementDocumentCount() and incrementWordCount()
   methods.

 - Update the word count of the Target with the setSize() method.

 - Put the Target in the repository with the repository.put() method.

Every call to process() should carry out the above tasks. For a
complete example on how this can be done, look at
lempinen.neatseeker.html.HTMLIndexer.
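
As a rough illustration of the steps listed above, here is a
self-contained sketch for plain text. Target, Entry, Statistics and
Repository are hypothetical stand-ins that mirror only the method names
mentioned in this HOWTO; their signatures (for example
incrementWordCount(int) and put(id, target)) are assumptions, so check
the Javadoc for the real ones.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ProcessSketch {
    // Hypothetical stand-ins; the real NeatSeeker classes differ.
    static final class Target {
        String uri;
        int size;
        void setURI(String u) { uri = u; }
        void setSize(int s) { size = s; }
    }
    static final class Entry {
        final String term;
        final int id;
        final int position;
        Entry(String term, int id, int position) {
            this.term = term; this.id = id; this.position = position;
        }
    }
    static final class Statistics {
        int documents;
        int words;
        void incrementDocumentCount() { documents++; }
        void incrementWordCount(int n) { words += n; } // assumed signature
    }
    static final class Repository {
        final Map<Integer, Target> targets = new HashMap<>();
        int getNextID() { return targets.size() + 1; }
        void put(int id, Target t) { targets.put(id, t); } // assumed signature
    }

    final Repository repository = new Repository();
    final Statistics statistics = new Statistics();
    final List<Entry> index = new ArrayList<>();

    // Stands in for the add(Entry) method of AbstractIndexer.
    void add(Entry e) { index.add(e); }

    public void process(InputStream in, String uri) {
        Target target = new Target();            // new Target, tagged with the URI
        target.setURI(uri);
        int id = repository.getNextID();         // unique identifier for the Target
        int position = 0;                        // word position within the resource
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String term : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (term.isEmpty()) continue;
                    add(new Entry(term, id, position++)); // one Entry per term
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        statistics.incrementDocumentCount();     // update the index statistics
        statistics.incrementWordCount(position);
        target.setSize(position);                // word count of this Target
        repository.put(id, target);              // store the Target
    }

    public static void main(String[] args) {
        ProcessSketch s = new ProcessSketch();
        byte[] data = "Hello, indexing world".getBytes(StandardCharsets.UTF_8);
        s.process(new ByteArrayInputStream(data), "doc1.txt");
        System.out.println(s.index.size() + " entries indexed");
    }
}
```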

4. Adding your indexer to the runtime properties

As the Indexers are created by a Factory, the actual implementing
class is determined at runtime. You should tell NeatSeeker about your
new indexer by adding a few entries to the property file
(e.g. etc/NeatSeeker.properties).

For instance, if you have created an indexer for XML documents of some
sort, and the indexer class is com.abba.xml.XMLIndexer, you should add
the following entries to the property file:

    # Default indexer MIME
    neatseeker.indexer.mime		= text/xml

    # Fully qualified class names of different indexers
    neatseeker.indexer.text/xml	        = com.abba.xml.XMLIndexer
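
To show how such entries can select a class at runtime, the sketch
below resolves the default MIME type to a class name and instantiates
it by reflection. This is only an assumption about the general
technique, not the actual IndexerFactory code; the nested XmlIndexer
class is a hypothetical placeholder for your own indexer.

```java
import java.util.Properties;

final class IndexerLookupSketch {
    // Hypothetical placeholder for an indexer implementation.
    public static final class XmlIndexer { }

    // Looks up the class configured for the default MIME type and
    // creates an instance through its no-argument constructor.
    static Object createIndexer(Properties props) {
        String mime = props.getProperty("neatseeker.indexer.mime");
        String className = props.getProperty("neatseeker.indexer." + mime);
        if (className == null) {
            throw new IllegalArgumentException("No indexer configured for " + mime);
        }
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("neatseeker.indexer.mime", "text/xml");
        props.setProperty("neatseeker.indexer.text/xml", XmlIndexer.class.getName());
        System.out.println(createIndexer(props).getClass().getSimpleName());
    }
}
```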

This should be enough to get you started. If you have problems, do not
hesitate to contact the author.