Wednesday, April 11, 2007

LS 500 How a Search Engine Works

Liddy, E. (2001). How a Search Engine Works. Searcher 9(5):38-45.


"Search engine" is the more popular term for an Information Retrieval (IR) system. Whichever term you use, the system contains four elements:

  • A Document Processor
  • A Query Processor
  • A Search and Matching Function
  • A Ranking Capability


Document Processor

The document processor prepares, processes and inputs the documents, pages or sites that users search against. Document processors perform some of the following steps:

  • Normalize the document stream to a predefined format
  • Break the document stream into retrievable units
  • Isolate and metatag subdocument pieces
  • Identify potential indexable elements in documents
  • Delete stop words
  • Stem terms
  • Extract index entries
  • Compute weights
  • Create and update the main inverted file, which the search engine searches in order to match queries to documents
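To make a few of these steps concrete, here is a minimal sketch of stop-word deletion, stemming and inverted-file creation. It is only an illustration: the stop-word list, the `naive_stem` suffix stripper and the sample documents are all made up for this example, and a real system would use a proper stemmer and a much larger stop list.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(term):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each index term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

# Made-up sample documents.
docs = {
    1: "Searching the engines",
    2: "The search engine indexes documents",
}
index = build_inverted_index(docs)
print(index["search"])
```

The per-document counts stored in the inverted file are the raw material for the weighting step.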


Query Processor

Query processing has seven possible steps:

  • Tokenize query terms
  • Recognize query terms vs. special operators
  • Delete stop words
  • Stem words
  • Create query representation
  • Expand query terms
  • Compute weights
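The seven steps can be sketched roughly as follows. This is an assumption-laden toy, not how any particular engine does it: the operator set, the `SYNONYMS` expansion table, the 1.0 and 0.5 weights and the rule that only uppercase AND/OR/NOT count as operators are all invented for the example.

```python
import re

# All of these tables are hypothetical and exist only for this sketch.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "for"}
OPERATORS = {"AND", "OR", "NOT"}
SYNONYMS = {"car": ["automobile"]}

def naive_stem(term):
    """Strip a trailing 's'; a stand-in for a real stemmer."""
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term

def process_query(query):
    # 1. Tokenize the query.
    tokens = re.findall(r"[A-Za-z0-9]+", query)
    # 2. Separate special operators from ordinary terms.
    operators = [t for t in tokens if t in OPERATORS]
    terms = [t.lower() for t in tokens if t not in OPERATORS]
    # 3 and 4. Delete stop words, then stem.
    terms = [naive_stem(t) for t in terms if t not in STOP_WORDS]
    # 5 and 7. Represent the query as weighted terms.
    weights = {t: 1.0 for t in terms}
    # 6. Expand with synonyms at a lower (assumed) weight.
    for t in terms:
        for synonym in SYNONYMS.get(t, []):
            weights.setdefault(synonym, 0.5)
    return {"operators": operators, "weights": weights}

result = process_query("the cars AND repairs")
print(result)
```

A system that stops after the first few steps would simply skip the expansion and weighting parts.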


Search and Matching Function

How a system carries out its search and matching functions depends on which theoretical model of information retrieval underlies its design. Searching the inverted file for documents that meet the query requirements, referred to simply as “matching”, is typically a standard binary search, regardless of whether query processing stops after the first two, five or all seven steps. Some search engines score documents not on their content but on the relationships among documents or on the past retrieval history of documents and pages. After the similarity is computed for each document in the matching subset, the system presents an ordered list to the searcher. How sophisticated that ordering is depends on the model the system uses and on how advanced the document and query weighting mechanisms are. Some very sophisticated systems go the extra mile and let users provide relevance feedback or modify their query based on the results they were given.
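As a rough picture of the binary-search matching described above, the sketch below looks up each query term in a sorted term list and intersects the document lists. The terms, document IDs and the choice to require every term (a Boolean AND) are assumptions for the example; ranking is left out.

```python
from bisect import bisect_left

# Made-up inverted file: a sorted term list with a parallel list of postings.
terms = ["engine", "index", "query", "search"]
postings = [[1, 2], [2], [3], [1, 2, 3]]

def lookup(term):
    """Binary-search the sorted term list and return that term's doc ids."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def match_all(query_terms):
    """Return the doc ids that contain every query term."""
    result = None
    for term in query_terms:
        docs = set(lookup(term))
        result = docs if result is None else result & docs
    return sorted(result or [])

print(match_all(["search", "engine"]))
```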


What Document Features Make a Good Match to a Query

Term Frequency

How frequently a term appears in a document is one of the most obvious ways to determine a document’s relevance to a query. However, several situations can undermine this premise. Many words have multiple meanings, such as “pool” or “fire.” Also, in some domains certain words are so common that their usefulness as a sign of relevance declines sharply.


Location of Terms

Many search engines give preference to words found in the title, lead paragraph or metadata of a document. Query terms that appear in the title of a document or page are therefore frequently weighted more heavily than terms that appear only in the body. Likewise, a document with query terms in its section headings or first paragraph may be more likely to be relevant.
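One simple way to picture location weighting is to count matches per field and multiply by a field weight. The field names and the 3/2/1 weights below are assumptions chosen for illustration, not values from the article.

```python
# Assumed field weights: a title match counts three times a body match.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def location_score(doc, query_terms):
    """Sum query-term occurrences per field, scaled by the field's weight."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        for term in query_terms:
            score += weight * words.count(term)
    return score

# Made-up documents: the same words, but in different places.
doc_a = {"title": "search engines", "body": "how they work"}
doc_b = {"title": "how they work", "body": "search engines explained"}

print(location_score(doc_a, ["search"]), location_score(doc_b, ["search"]))
```

With these weights, the document that has the term in its title scores higher than the one that has it only in the body.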


Link Analysis

Link analysis works like bibliographic citation practices. It is based on how well connected each page is, as defined by Hubs and Authorities: Hub documents link to large numbers of other pages (out-links), and Authority documents are those referred to by many other pages, that is, those with a high number of “in-links.”
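The Hub and Authority idea can be sketched with a small iterative calculation in the spirit of the HITS algorithm, which uses the same vocabulary. The link graph, the iteration count and the normalization are assumptions for this example; the article does not say which algorithm any particular engine uses.

```python
def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict of page -> list of out-links."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auths[dst] += hubs[src]
        # Hub score: sum of the authority scores of pages linked to.
        hubs = {p: sum(auths[d] for d in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths

# Made-up graph: A links to C and D, B links to C.
links = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hubs, auths = hits(links)
print(max(hubs, key=hubs.get), max(auths, key=auths.get))
```

In this graph, A comes out as the strongest hub because it has the most out-links, and C as the strongest authority because it has the most in-links.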


Popularity

Google and several other search engines use popularity to determine page relevance. Popularity uses data on how often users choose a page to predict that page's relevance.


Date of Publication

Some search engines assume that the newer the information is, the more likely it is to be relevant to the user. These engines present the most current results first, followed by the older ones.


Length

When two documents contain the same query terms, the search engine prefers the document in which the terms occur more often relative to the document's length.
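In other words, the term count is divided by the document length. A minimal sketch, assuming whitespace-separated words and made-up sample texts:

```python
def normalized_tf(text, term):
    """Occurrences of the term divided by the total number of words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term) / len(words)

# Both documents contain "search" once, but one is five times longer.
short_doc = "search engines rank pages"
long_doc = "search " + "filler " * 19

print(normalized_tf(short_doc, "search"), normalized_tf(long_doc, "search"))
```

The short document wins because one occurrence in four words is a larger share than one occurrence in twenty.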


Proximity of Query Terms

When the query terms occur near each other in a document, the document is more likely to be relevant to the query than when the terms are far apart.
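Proximity can be measured as the smallest number of word positions between two query terms. The function below is a sketch under that assumption, with invented sample sentences; how an engine turns the distance into a score is not covered here.

```python
def min_distance(text, term_a, term_b):
    """Smallest gap in word positions between two terms, or None if one is absent."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

near = "information retrieval systems"
far = "information about the history of document retrieval"

print(min_distance(near, "information", "retrieval"))
print(min_distance(far, "information", "retrieval"))
```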


Proper Nouns

Proper nouns sometimes receive greater weight, since many searches are for people, places or things.


Summary and Reflection

Until now, search engine providers have mostly opted for simpler rather than more complex processing of documents and queries. This leaves the bulk of the work to searchers, who must pick their way through the results to find what they are seeking. Hopefully this status quo will not last, and search engines will keep improving the quality of their processing.

I have to honestly say it never occurred to me to ask how or what exactly happens when I perform a search. It was interesting to learn how complex the search process is and what all the different components are. Just today I saw an additional article (see below for link) from ZDnet.com stating that Google drew 64 percent of search queries for the month of March. Overall I found the article very enlightening and informative as to how the whole process works. I certainly won’t look at performing a search the same way again.


http://news.zdnet.com/2100-9595_22-6175248.html?part=rss&tag=feed&subj=zdnn
