Data Services Newsletter

Volume 3 : No 4 : December 2001

A Review of the IRIS DMC Web Search Engine

In the past few years, the IRIS DMC World Wide Web presence has been growing appreciably. Data access services, online manuals, and earthquake references abound, making it increasingly difficult to guide visitors to the information they are looking for. The sensible solution has been to add a search feature to the web site to help users quickly and easily find web pages of interest.

About a week of effort back in 1997 produced a web crawler engine that scans web pages and keeps a “dictionary” of those pages for quick reference. A web crawler is an automated program that follows links to Web pages, in the same way that a user would click on links in a Web browser to view other pages. The name given to the IRIS DMC web crawler is “WARP.” In its four years of operation, the service has been functioning reliably and is virtually unchanged since its inception.

On a regular basis, the WARP web crawler starts its scan with the IRIS Home Page, notes down all of the words in the document, and then collects all of the links on that page. Each of those links is then examined one by one, and the contents and links of each linked page are recorded as well. This search pattern repeats in an ever-growing collection of links and web pages, much like a squirrel running up each branch of a tree, and each page that is accessed is added to the dictionary. It is this dictionary that you reference when you use the search engine on the IRIS Home Page.
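The crawl pattern described above can be sketched in a few lines of Python. This is only an illustration, not WARP's actual code: the function name `crawl`, the in-memory `SITE` table standing in for real web pages, and the choice to follow links depth-first are all assumptions made for the example.

```python
def crawl(start: str, pages: dict[str, tuple[str, list[str]]]) -> dict[str, list[str]]:
    """Follow links from `start` and build a page -> words "dictionary".

    `pages` is a hypothetical stand-in for the web: it maps a page name
    to (page text, list of links found on that page).
    """
    dictionary: dict[str, list[str]] = {}
    to_visit = [start]
    while to_visit:
        url = to_visit.pop()  # last in, first out: run up one branch at a time
        if url in dictionary or url not in pages:
            continue  # already recorded, or outside the pages we may scan
        text, links = pages[url]
        dictionary[url] = text.lower().split()  # note down the words
        to_visit.extend(reversed(links))  # keep the page's own link order
    return dictionary


# Hypothetical three-page site; "external" is linked to but never scanned.
SITE = {
    "home": ("IRIS home page", ["data", "manuals"]),
    "data": ("data access services", ["home", "external"]),
    "manuals": ("online manuals", []),
}

index = crawl("home", SITE)
print(list(index))  # ['home', 'data', 'manuals']
```

Pages already in the dictionary are skipped, so links that point back to the home page do not send the crawler around in circles.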

So, what stops WARP from growing and growing until it searches the whole Internet? The answer to that comes through a special reference that each web server must include in order to be searched. Therefore, when we provide a link to a university, news center, or commercial site on one of our web pages, they don’t have to worry about having their entire site searched. Special arrangements have been made with institutions such as the Federation of Digital Seismic Networks and the Albuquerque Seismic Laboratory in New Mexico, and regular indexing is performed on those remote sites, searchable from the IRIS Home Page. It’s what you could call a “neighborhood” web crawler.

In order to help visitors make effective use of the IRIS search feature, two concepts will be illustrated here. First is how to enter an effective search query. The second is how to interpret the results. Unlike other search pages that you might be familiar with, the IRIS search page does not make use of logical glue words like AND and OR and doesn’t use special characters or quotes to include meaning in the query. Things are kept simple in that you enter a list of words relevant to your topic of interest, separating them with spaces. It’s case-insensitive and it ignores punctuation.
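As a rough illustration of these query rules, the sketch below turns a query into a list of keywords. The function name `normalize_query` is hypothetical, and deleting punctuation outright (rather than treating it as a space) is an assumption; the actual WARP behavior may differ.

```python
import string


def normalize_query(query: str) -> list[str]:
    """Lowercase the query, delete punctuation, and split on spaces."""
    drop_punctuation = str.maketrans("", "", string.punctuation)
    return query.lower().translate(drop_punctuation).split()


print(normalize_query("Loma Prieta, earthquake!"))  # ['loma', 'prieta', 'earthquake']
```

In this sketch, words such as AND and OR receive no special treatment and simply become ordinary keywords.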

To make the best query, try listing a few unique words related to the subject of interest. Common words will match many pages you are not interested in. One or two keywords are often enough to give a good match, but if the results are not what you wanted, try different keywords, more keywords specific to your search, or the same keywords in a different order. Another tip is to enter a short phrase such as ‘Loma Prieta earthquake’, since the search engine gives strong matches to such ordered word patterns.

When the results come back, the 50 most relevant matches are shown on the screen. Relevance is determined by a scoring system that assigns ‘points’ according to how well each web page matches the search query. The score appears as a number in parentheses at the end of each result listing, and can be used as a rule of thumb for how well a given page matched your query.

The score value is built up from a set of test conditions applied to each web page, with the results of the tests added together. Some test conditions add a high score to the total, while others result in just small gains. The overall score should be a good reflection of how appropriate the web page is to your search parameters. To illustrate this further, the scoring conditions are as follows:

Search Query                                              Added Score
--------------------------------------------------------  ------------------------------------------------------------
fits to text in the URL address                           50 per keyword
fits to phrase in page title bar                          20 times number of words in phrase
fits to phrase in first few sentences of web page text    10 times number of words in phrase
matches word found in web page                            number of occurrences of word in page, up to a maximum of 10
partially matches word in web page                        1
matches more than one word                                double the added word match score

By looking at these scoring values, you can see that you find fairly good matches at a score of 20 or greater and get even better candidates at 50 or greater. The best results come from matches to the title of a page, or the first few sentences, which generally best reflect the topic of a web page. Having the score values as a guide will allow you to navigate the search results intelligently and choose the best route to find what you are looking for.
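To see how the scoring conditions add up, here is a small Python sketch. It is one reading of the conditions, not the WARP implementation. The `Page` class, the `score` function, and the example URL are hypothetical, and several details are assumptions: the "phrase" is taken to be the whole query in order, a partial match means a keyword appearing inside a longer word, and the doubling applies to the summed word-match score when two or more keywords match exactly.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    lead: str  # first few sentences of the page text
    body: str  # full page text


def _words(text: str) -> list[str]:
    """Lowercase words with punctuation ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _has_phrase(words: list[str], phrase: list[str]) -> bool:
    """True if `phrase` occurs in `words` as a consecutive run."""
    n = len(phrase)
    return n > 0 and any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def score(page: Page, keywords: list[str]) -> int:
    total = 50 * sum(1 for k in keywords if k in page.url.lower())  # URL: 50 per keyword
    if _has_phrase(_words(page.title), keywords):
        total += 20 * len(keywords)  # phrase in title
    if _has_phrase(_words(page.lead), keywords):
        total += 10 * len(keywords)  # phrase in first few sentences
    body = _words(page.body)
    word_score, exact_matches = 0, 0
    for k in keywords:
        count = body.count(k)
        if count:
            word_score += min(count, 10)  # occurrences, capped at 10
            exact_matches += 1
        elif any(k in w for w in body):
            word_score += 1  # partial match (assumed: substring of a word)
    if exact_matches > 1:
        word_score *= 2  # more than one word matched
    return total + word_score


page = Page(
    url="http://example.org/quakes/loma.html",  # made-up address
    title="Loma Prieta Earthquake Summary",
    lead="Seismograms from the Loma Prieta earthquake.",
    body="Seismograms from the Loma Prieta earthquake. The earthquake records are archived.",
)
print(score(page, ["loma", "prieta", "earthquake"]))  # 148
```

For this made-up page, the total of 148 comes from 50 (one keyword in the URL), 60 (three-word phrase in the title), 30 (phrase in the opening sentence), and 8 (four word occurrences, doubled).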

The WARP search engine at the IRIS DMC is intended as a service to our users, to make their visits to our Web site and our affiliated Web sites as productive and convenient as possible. It is our hope that you find the search feature useful. If you encounter problems you wish to discuss with us, please contact us.

by Rob Casey (IRIS Data Management Center)
