NeatSeeker is a collection of Java classes for building search engines. It also offers reference implementations of an HTML indexer and a Servlet API 2.2 compliant Java servlet that can be used for indexing and searching web sites. If you are running a Servlet API 2.2 compatible servlet container such as Apache Tomcat, NeatSeeker provides an easy way of adding search capability to your website.
NeatSeeker is published under an Apache-style license. You must read and accept its conditions before using the software. You are free to distribute NeatSeeker to the extent permitted by the license, and you are most welcome to participate in the development effort.
The purpose of the NeatSeeker project is to develop a simple, all-purpose indexing and searching framework that can be used in a CGI environment, with Java servlets and from the command line. NeatSeeker does not require a database backend, as it uses serialized Java objects for storing the indexes.
This product includes software developed by the Apache Software Foundation (http://www.apache.org/).
Before using and/or distributing NeatSeeker, you must accept the conditions of the following license:
/*
 * $Id: guide.xml,v 1.1 2000/10/07 14:23:48 lempinen Exp $
 * ====================================================================
 *
 * The NeatSeeker Framework License, Version 1.0
 * Based on the Apache Software License
 *
 * Copyright (c) 1999 Sami Lempinen (firstname.lastname@example.org). All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution, if
 *    any, must include the following acknowlegement:
 *       "This product includes NeatSeeker software developed by
 *        Sami Lempinen (email@example.com)."
 *    Alternately, this acknowlegement may appear in the software itself,
 *    if and wherever such third-party acknowlegements normally appear.
 *
 * 4. The names "NeatSeeker", "NeatMaker", and "Sami Lempinen"
 *    must not be used to endorse or promote products derived
 *    from this software without prior written permission. For written
 *    permission, please contact firstname.lastname@example.org.
 *
 * 5. Products derived from this software may not be called "NeatSeeker"
 *    nor may "NeatSeeker" appear in their names without prior written
 *    permission of Sami Lempinen.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL SAMI LEMPINEN OR NEATSEEKER
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This product includes software developed by the
 * Apache Software Foundation (http://www.apache.org/).
 *
 */
This chapter gives step-by-step instructions on how to quickly add sophisticated search capability to your website. The process assumes that the following prerequisites are met:
You have downloaded the NeatSeeker software from the NeatSeeker SourceForge project site and unpacked the distribution.
You have a functional installation of a Java 2 runtime or development environment.
You have a functional Servlet API 2.2 compatible Java servlet container (e.g. Apache Tomcat).
Setting up a search engine for your web site consists of the following simple steps:
The instructions given here do not require any external tools. You can, however, perform most of the tasks automatically using the Jakarta Ant build tool. For more information, see Chapter 3, Developing and extending NeatSeeker.
Also, the instructions assume a UNIX-based system. For other operating systems, you need to adapt the instructions accordingly.
A property file is used during both indexing and searching, and it is generally a good idea to use the same property file for the indexer and the search servlet.
The build/ subdirectory contains an already-built servlet component hierarchy for NeatSeeker. Before packaging the components into a WAR file or copying them to the servlet context, you need to edit the build/WEB-INF/NeatSeeker.properties file.
Take the following steps:
Open the property file build/WEB-INF/NeatSeeker.properties in a text editor of your choice.
Change the necessary properties to reflect your environment. The property file is well documented in itself, and for now, you should pay most attention to the following items:
|Property||Description|
|neatseeker.repository||Where the search engine index is stored|
|neatseeker.stopwords||Whether or not a stopword list is used during indexing. If you set this to true, make sure that you provide a path to a plain-text stopword list with the neatseeker.stoplist option.|
|neatseeker.cleanup||Whether the existing index should be deleted when you start the indexing process|
|neatseeker.startpoint||The top-level directory for the indexing process. If the HTML files you want to index are located in /home/httpd/html/, for instance, specify that path here.|
|neatseeker.includes||A Jakarta Ant format directory task specification, which tells the indexer which files should be entered in the index. The setting **/*.html,**/*.htm, for instance, will cause all files ending in .html or .htm to be included in the index.|
|neatseeker.resultprefix||A prefix added to the search results when they are displayed. The index stores the file names relative to the neatseeker.startpoint path. Often, it is necessary to prepend e.g. http://hostname/ in order to convert the path to a valid URL.|
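To illustrate how neatseeker.resultprefix relates the stored relative paths to URLs, here is a minimal Java sketch. The class and method names (ResultPrefixSketch, toResultUrl) are hypothetical and are not part of NeatSeeker. The sketch only assumes that the prefix and the stored path are joined with exactly one slash between them; the actual servlet may join them differently.

```java
// Hypothetical helper, not part of NeatSeeker: shows how a result prefix
// could be joined to a path stored relative to neatseeker.startpoint.
class ResultPrefixSketch {

    // Joins prefix and relative path with exactly one "/" between them.
    static String toResultUrl(String prefix, String relativePath) {
        String left = prefix.endsWith("/")
                ? prefix.substring(0, prefix.length() - 1)
                : prefix;
        String right = relativePath.startsWith("/")
                ? relativePath.substring(1)
                : relativePath;
        return left + "/" + right;
    }

    public static void main(String[] args) {
        // A file indexed as docs/index.html below the start point
        System.out.println(toResultUrl("http://hostname/", "docs/index.html"));
    }
}
```

With the prefix http://hostname/ and the stored path docs/index.html, the sketch prints http://hostname/docs/index.html.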
Once you have made the necessary changes, save the file and exit the text editor.
Next, the compiled servlet components in the build/ subdirectory need to be copied to the servlet container web applications directory. With Apache Tomcat, for example, the default location for this is the webapps directory.
If we assume that the webapps directory is located in /opt/apps/jakarta-tomcat/webapps, you can use the following commands to copy the servlet components to a servlet context called neatseeker:
% mkdir /opt/apps/jakarta-tomcat/webapps/neatseeker/
% cp -r build/* /opt/apps/jakarta-tomcat/webapps/neatseeker/
On non-UNIX platforms, use the appropriate tool (e.g. the Windows Explorer) to copy the contents of the build/ directory to the servlet container space.
It's that simple. If you want to learn how to perform the above step as a part of an automated build/install process, see Chapter 3, Developing and extending NeatSeeker.
Once the servlet components have been installed, you are ready to start the indexing process.
In version 1.0, the indexing process is rather memory- and CPU-intensive, because the index is held in memory before it is written to disk. For reasonably sized webspaces (under 100 MB), this should not be a problem. According to my own observations, the indexing process needs memory equal to approximately 30-50% of the size of the indexable material.
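As a rough worked example of the 30-50% observation, the following Java sketch computes the expected memory range for a given amount of indexable material. The class and method names are hypothetical, and the figures are only as reliable as the observation itself.

```java
// Rough arithmetic only: applies the 30-50% observation to a given size.
class IndexMemoryEstimate {

    // Returns {low, high}: the estimated memory need in megabytes.
    static long[] estimateMb(long indexableMb) {
        long low = indexableMb * 30 / 100;
        long high = indexableMb * 50 / 100;
        return new long[] { low, high };
    }

    public static void main(String[] args) {
        long[] range = estimateMb(100);
        System.out.println(range[0] + "-" + range[1] + " MB");
    }
}
```

For 100 MB of indexable material the estimate is 30-50 MB. If the indexer runs out of memory when started directly with the java command, the standard JVM option -Xmx can be used to raise the maximum heap size.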
The WEB-INF/bin/neatmaker shell script provides a convenient wrapper around the indexer class (lempinen.neatseeker.core.NeatMaker). If you are on a UNIX-based system that is shell script capable, perform the following steps to build the index:
Change to the WEB-INF/bin directory in the servlet context where you copied the servlet components. For example:
% cd /opt/apps/jakarta-tomcat/webapps/neatseeker/WEB-INF/bin
Enter the following command to start the indexing:
% ./neatmaker -Dneatseeker.properties=../NeatSeeker.properties
The indexing process starts, providing you with information about its progress.
The NeatSeeker property management mechanism allows you to conveniently override any of the properties in the property file using command line system properties. If, for example, you want to create an index in a different place, you can append -Dneatseeker.repository=/different/path to the above command.
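The override behaviour can be pictured with the following Java sketch. It is not NeatSeeker's actual property management code: the class name PropertyOverrideSketch and its resolve method are hypothetical, and the sketch assumes that any system property whose name starts with "neatseeker." takes precedence over the same key in the property file.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch, not NeatSeeker's own code: values from the property
// file are overridden by -D system properties of the same name.
class PropertyOverrideSketch {

    // Copies fileProps, then lets every "neatseeker." key found in
    // overrides replace (or add to) the values from the file.
    static Properties resolve(Properties fileProps, Properties overrides) {
        Properties result = new Properties();
        for (String key : fileProps.stringPropertyNames()) {
            result.setProperty(key, fileProps.getProperty(key));
        }
        for (String key : overrides.stringPropertyNames()) {
            if (key.startsWith("neatseeker.")) {
                result.setProperty(key, overrides.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // -Dneatseeker.properties=... names the property file to read
        String path = System.getProperty("neatseeker.properties");
        Properties fileProps = new Properties();
        if (path != null) {
            try (InputStream in = new FileInputStream(path)) {
                fileProps.load(in);
            }
        }
        Properties effective = resolve(fileProps, System.getProperties());
        System.out.println(effective.getProperty("neatseeker.repository"));
    }
}
```

Run with -Dneatseeker.properties=NeatSeeker.properties -Dneatseeker.repository=/different/path, the sketch prints /different/path regardless of the value in the file.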
If you are on a non-shell-script-capable system, you can start the indexing using the Java command-line tool. Take the following steps:
Change to the WEB-INF directory in the servlet container space.
Enter the following command:
C:\> java -classpath lib/neatseeker.jar;lib/Tidy-unofficial.jar -Dneatseeker.properties=NeatSeeker.properties lempinen.neatseeker.core.NeatMaker
(All of the above should be entered on a single command line. On Windows, as shown here, the classpath entries are separated by a semicolon; on UNIX-based systems, use a colon instead.)
In an environment where the documents are modified often, it may be beneficial to run the indexer periodically, e.g. from the cron daemon.
As the last step in the process, you should (at least with Tomcat) restart the servlet container so that it will detect the newly installed NeatSeeker.
After you have restarted the servlet engine, direct your web browser to the servlet context and try the search engine. The default search page gives instructions concerning the query syntax.
This chapter gives information about extending NeatSeeker to provide indexing and searching methods for different types of material.
Before you start working with the NeatSeeker source code, you should familiarise yourself with the Jakarta Ant build system. Ant is a cross-platform all-Java general purpose build tool that is particularly well suited for building Java projects. If you do not have Ant yet, get it from the Apache Jakarta website. The build file supplied with NeatSeeker is compatible with the stable 1.1 build of Ant.
The NeatSeeker build file has a number of targets for performing different tasks. These are listed in the following table.
|Target||Description||Command|
|compile (default)||Compiles the Java sources into bytecode classes.||ant compile|
|jar||Builds a JAR of the NeatSeeker classes.||ant jar|
|build||Constructs a servlet build in the build/ subdirectory.||ant build|
|war||Constructs a WAR (Web Application Archive) from the contents of the build/ directory.||ant war|
|install_servlet||Installs the contents of the build/ subdirectory into a servlet context.||ant -Dwebapps=/path/to/webapps -Dcontext=neatseeker install_servlet|
|install_war||Installs the WAR file into a servlet context.||ant -Dwebapps=/path/to/webapps -Dcontext=neatseeker install_war|
|javadoc||Builds the Javadoc documentation.||ant javadoc|
The NeatSeeker framework has been designed to be as easily extensible as possible. It requires very little effort to develop new types of indexers for different types of data (email, text files etc.).
The doc/Indexer-HOWTO.txt file contains a brief introduction to the Indexer interface architecture. The same information can be found in Appendix A of this document.
Also, the Javadoc documentation should be fairly helpful.
I welcome any questions you may have concerning NeatSeeker development. Apart from mailing to me directly, you have the following resources at your disposal:
The following document gives a brief description of the tasks involved in developing a new type of indexer for NeatSeeker. The document is also available on the project documentation page.
================================================================
HOW TO WRITE A NEATSEEKER INDEXER
Author: Sami Lempinen (email@example.com)
$Id: Indexer-HOWTO.txt,v 1.3 2000/10/07 14:24:32 lempinen Exp $
================================================================

1. Introduction

The easiest way to develop an indexer for arbitrary material (text,
XML etc.) is to inherit lempinen.neatseeker.core.AbstractIndexer. It
has a number of convenience functions already defined, and leaves
only the details up to you.

Writing an indexer requires that you implement the
lempinen.neatseeker.core.Indexer interface. If you bypass
AbstractIndexer, you need to implement the methods defined in the
interface yourself. If you inherit AbstractIndexer, you only need to
implement

    public void init(Configuration c)

and

    public void process (InputStream in, String uri);

In fact, if your implementation does not require additional
attributes that need initialising, you do not even need to worry
about implementing init().

2. Program control flow

When a calling party (e.g. lempinen.neatseeker.core.NeatMaker)
requests an Indexer using the IndexerFactory.getIndexer static
method, a new Indexer is created according to the configuration
parameters (see section 4 below) and its init() method is called with
the Configuration object as an argument. Furthermore, the Factory
sets the Collector partner for the Indexer according to the settings
in the Configuration.

To start the actual indexing, the calling party calls the start()
method of the Indexer object. The default action in AbstractIndexer
is to simply continue by calling the start() method of the Collector,
and there should be no reason to override this.

The Collector then starts traversing either the filesystem or a
networked resource and calls back the process() function for each
resource it encounters.

3. Processing the data

This way, you can implement a simple indexer by specifying
appropriate actions for process(InputStream in, String uri). The
InputStream is your indexable data. The URI string is given by the
Collector as a unique identifier for the indexable resource.

The processing can then proceed as follows:

- You should create a new Target in the beginning of process and use
  its setURI() method to set the URI to the string passed in by the
  Collector.
- Use the repository lookahead method getNextID() to obtain a unique
  NeatSeeker identifier for the Target.
- For every term you index, create an Entry that contains the term,
  the ID returned by getNextID() and the word position of the term
  within the resource.
- Call the add(Entry) method in AbstractIndexer to add the term to
  the index.
- At the end of processing a resource, update the index Statistics
  object with the incrementDocumentCount() and incrementWordCount()
  methods.
- Update the word count of the Target with the setSize() method.
- Put the Target in the repository with the repository.put() method.

Every call to process() should carry out the above tasks. For a
complete example on how this can be done, look at
lempinen.neatseeker.html.HTMLIndexer.

4. Adding your indexer to the runtime properties

As the Indexers are created by a Factory, the actual implementing
class is determined at runtime. You should tell NeatSeeker about your
new indexer by adding a few entries to the property file (e.g.
etc/NeatSeeker.properties).

For instance, if you have created an indexer for XML documents of
some sort, and the indexer class is com.abba.xml.XMLIndexer, you
should add the following entries to the property file:

    # Default indexer MIME
    neatseeker.indexer.mime = text/xml

    # Fully qualified class names of different indexers
    neatseeker.indexer.text/xml = com.abba.xml.XMLIndexer

This should be enough to get you started. If you have problems, do
not hesitate to contact the author.
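The processing steps listed in section 3 of the HOWTO can be sketched in Java roughly as follows. This is a self-contained illustration, not NeatSeeker code: the nested Target, Entry, Statistics and Repository classes are simplified, hypothetical stand-ins that only borrow the method names mentioned in the HOWTO, and the real classes in lempinen.neatseeker.core may well have different signatures (the arguments of repository.put(), for instance, are an assumption). Splitting on non-word characters, lowercasing terms and counting positions from zero are likewise choices made for the sketch. For the actual implementation, see lempinen.neatseeker.html.HTMLIndexer.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;

// Hypothetical sketch. The nested types are simplified stand-ins named after
// the classes in the HOWTO; the real NeatSeeker signatures may differ.
class PlainTextIndexerSketch {

    static class Target {
        String uri;
        int size;
        void setURI(String uri) { this.uri = uri; }
        void setSize(int size) { this.size = size; }
    }

    static class Entry {
        final String term;
        final int id;
        final int position;
        Entry(String term, int id, int position) {
            this.term = term;
            this.id = id;
            this.position = position;
        }
    }

    static class Statistics {
        int documents;
        int words;
        void incrementDocumentCount() { documents++; }
        void incrementWordCount() { words++; }
    }

    static class Repository {
        final Map<Integer, Target> targets = new HashMap<>();
        int getNextID() { return targets.size() + 1; }
        void put(int id, Target target) { targets.put(id, target); }
    }

    final Repository repository = new Repository();
    final Statistics statistics = new Statistics();
    final List<Entry> index = new ArrayList<>();

    // Stand-in for the add(Entry) method that AbstractIndexer provides.
    void add(Entry entry) { index.add(entry); }

    // Follows the HOWTO steps for one resource handed over by the Collector.
    public void process(InputStream in, String uri) {
        Target target = new Target();
        target.setURI(uri);
        int id = repository.getNextID();

        int position = 0;
        // Assumption: terms are runs of ASCII word characters, lowercased.
        Scanner words = new Scanner(in, "UTF-8").useDelimiter("\\W+");
        while (words.hasNext()) {
            String term = words.next().toLowerCase(Locale.ROOT);
            add(new Entry(term, id, position));
            statistics.incrementWordCount(); // assumed to count one word per call
            position++;
        }

        statistics.incrementDocumentCount();
        target.setSize(position);
        repository.put(id, target);
    }

    public static void main(String[] args) {
        PlainTextIndexerSketch indexer = new PlainTextIndexerSketch();
        byte[] data = "Hello search world".getBytes(StandardCharsets.UTF_8);
        indexer.process(new ByteArrayInputStream(data), "docs/hello.txt");
        for (Entry e : indexer.index) {
            System.out.println(e.term + " @ " + e.position);
        }
    }
}
```

In a real indexer, the class would extend AbstractIndexer and rely on its repository, statistics and add(Entry) facilities instead of declaring its own.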