Intellectual Property in Cyberspace

Terms You Need to Know: Search Engines

John M. Mrsich, Meeka Jun^a

Reprinted with permission from the May 1997 issue of Multimedia & Web Strategist.
© 1998 The New York Law Publishing Company. All rights reserved.

THIS MONTH'S Terminology Tips looks at search engines and other terms you need to know to locate resources on the Internet. While the Net can be an invaluable source of useful and timely information, the breadth and volume of that information can seem overwhelming. Search engines permit Internet users to identify Internet resources (on the Web, FTP servers and Usenet) based upon specific criteria. Understanding how search engines work is critical for users (in order to select the most effective search vehicles), and for site owners and Webmasters (in order to lure surfers to their Web sites).

Spiders, Crawlers and Robots

Most search engines consist of three components: the spider, the index and the search-engine software. A spider (also known as a robot, crawler or indexer) is a program that scans the Web, crawling from link to link, visiting Web pages, recording URLs (uniform resource locators) and building an index for the search engine. A spider or robot generally starts from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages and the most popular sites on the Web. The spider or robot requests documents from Web sites, follows the links on each Web page and deposits its findings in the index.
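The link-to-link crawl described above can be sketched in a few lines of Python. This is only an illustration, not how any particular engine works: the names `TOY_WEB`, `crawl` and `toy_fetch_links` are hypothetical, and a small in-memory table of links stands in for real requests to Web sites.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
TOY_WEB = {
    "http://example.com/whats-new": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/whats-new"],
    "http://example.com/c": [],
}


def toy_fetch_links(url):
    """Stand-in for requesting a document and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit every URL reachable from the seed list once, recording its links."""
    index = {}                 # URL -> links found on that page
    queue = deque(seed_urls)   # URLs waiting to be visited
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in index:       # already visited; skip to avoid loops
            continue
        links = fetch_links(url)
        index[url] = links     # deposit the findings in the index
        for link in links:
            if link not in index:
                queue.append(link)
    return index


index = crawl(["http://example.com/whats-new"], toy_fetch_links)
```

Starting from the single "What's New" page, the sketch reaches all four pages even though two of them link back to pages it has already seen.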

Spiders or robots focus on two primary attributes of a good search engine, namely, freshness and completeness. A "freshness spider" regularly revisits web pages already contained in the index to verify that they still exist and to gather any changes or additions to them.

Some engines use a "smart spider," which keeps a record of when a page changes over time, and then predicts how often it should revisit. A "completeness spider" crawls through the Net in search of new pages that it finds either as a result of its regular crawling activity or the submission of pages by webmasters.
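One way a "smart spider" might predict its revisit schedule is to average the gaps between the changes it has already observed. The sketch below assumes exactly that heuristic; the function name, the seven-day default and the one-day minimum are illustrative choices, not values taken from any real search engine.

```python
def predict_revisit_interval(change_days, default_interval=7.0, min_interval=1.0):
    """Estimate how many days to wait before revisiting a page.

    change_days lists the days (counted from any fixed starting point)
    on which the spider observed that the page had changed.
    """
    if len(change_days) < 2:
        # Not enough history to measure a gap; fall back to the default.
        return default_interval
    ordered = sorted(change_days)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = sum(gaps) / len(gaps)
    # Never schedule visits more often than the minimum interval.
    return max(min_interval, average_gap)


weekly_page = predict_revisit_interval([0, 7, 14, 21])
new_page = predict_revisit_interval([3])
```

A page that has changed every seven days is scheduled for a visit every seven days, while a page with only one recorded change gets the default interval.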

Indexes, Catalogues and Search Engine Software

The index, also known as the catalogue, is the repository or database where the spider or robot stores the HTML documents it finds. Some search engines, known as full-text search engines, index every word on a web page.

Other search engines, known as abstract search engines, create a condensed copy of each web page.
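The difference between the two kinds of index can be illustrated with a short Python sketch. The full-text version records every word on every page. The abstract version here simply keeps the first few words of each page, which is an assumption made for illustration, since engines differ in how they condense a page. All function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_full_text_index(pages):
    """Map every word to the set of pages (by URL) that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def build_abstract(text, max_words=5):
    """Condensed copy of a page: here, just its first few words (an assumption)."""
    return " ".join(text.split()[:max_words])


sample_pages = {
    "http://example.com/spiders": "The spider crawls the Web from link to link",
    "http://example.com/indexes": "The index stores the pages the spider finds",
}
full_text_index = build_full_text_index(sample_pages)
abstracts = {url: build_abstract(text) for url, text in sample_pages.items()}
```

With the full-text index, a word that appears only near the end of a page (such as "finds") can still be looked up; with the five-word abstracts it cannot.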

The search engine software searches for web pages containing the search terms entered by a user, ranks the web pages found based primarily on the frequency of the search terms in the HTML document and then displays the matches according to the order as ranked.
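Ranking by term frequency, as described above, can be sketched as follows. The scoring rule (add up how many times each search term appears on the page) is a deliberately simple assumption, and the name `rank_pages` is hypothetical; actual engines weigh additional factors.

```python
import re
from collections import Counter


def rank_pages(pages, query):
    """Return (url, score) pairs, highest term frequency first."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    scores = {}
    for url, text in pages.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        score = sum(counts[term] for term in terms)
        if score > 0:          # pages with no matching term are not listed
            scores[url] = score
    # Sort by descending score; break ties alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sample_pages = {
    "http://example.com/a": "Search engines index the Web",
    "http://example.com/b": "Search tips: search early, search often",
    "http://example.com/c": "Nothing relevant here",
}
ranked = rank_pages(sample_pages, "search")
```

Here the second page is listed first because the search term appears on it three times, against once on the first page; the third page is omitted.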

Keyword Searching and Power Searching

The effectiveness of a search depends not only on the search engine you choose, but also on the search terms you enter. The search capabilities of the various search engines range from the simple (e.g., keyword searching) to the complex (i.e., power searching). Keyword searching, which is the most common form of text indexing on the Web, simply locates all web pages containing any or all of the terms entered by the user, depending on the search engine.

More complex types of searching include stemming, which allows for a search of a portion of a word or phrase (e.g., a query for "Web" would come back with "Webmaster"); phrase searching, which enables a search to focus on a group of words together; proximity searching, which allows for a search of words depending on how close the words appear to each other; wildcards, which function as place holders to fill in any unknown characters within a search; media searching, which allows for searches of sounds or images (also known as "image searching"); and field searching, which allows for searches of page titles, URLs, domains and hyperlinks. Finally, Boolean searches are queries based on a combination of terms through the use of connectors, e.g., "and" and "or."
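Several of these search types can be illustrated with short Python functions that test a single page of text. This is a sketch only: the function names are hypothetical, distance is measured in words, and the wildcard characters (`*` for any run of characters, `?` for a single character) follow the convention of Python's `fnmatch` module rather than the syntax of any particular search engine.

```python
import fnmatch
import re


def words(text):
    """Lowercase words of the text, in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(text, phrase):
    """True if the words of the phrase appear together, in order."""
    w, p = words(text), words(phrase)
    return bool(p) and any(w[i:i + len(p)] == p for i in range(len(w) - len(p) + 1))


def proximity_match(text, first, second, max_distance):
    """True if the two words appear within max_distance words of each other."""
    w = words(text)
    first_positions = [i for i, word in enumerate(w) if word == first.lower()]
    second_positions = [i for i, word in enumerate(w) if word == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions for j in second_positions)


def wildcard_match(text, pattern):
    """True if any word fits the pattern ('*' = any characters, '?' = one)."""
    return any(fnmatch.fnmatchcase(word, pattern.lower()) for word in words(text))


def boolean_and(text, *terms):
    """True if every term appears in the text."""
    present = set(words(text))
    return all(term.lower() in present for term in terms)


def boolean_or(text, *terms):
    """True if at least one term appears in the text."""
    present = set(words(text))
    return any(term.lower() in present for term in terms)


page = "The Webmaster submitted the web page to a search engine"
```

For the sample sentence, `phrase_match(page, "search engine")` succeeds while `phrase_match(page, "engine search")` does not, and `wildcard_match(page, "web*")` finds both "web" and "Webmaster".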

Searching by Meta Tags

The searching capability exhibited by search engines is not limited to the ability to pick up terms contained in the text of the web page.

Internet content can also be searched by way of meta tags. Meta tags are codes contained within web pages that provide a description (other than the actual text contained in the web page) that can be searched. The meta tag was designed to hold information about an HTML page, rather than the content that is visible on the page.

The three most relevant meta tags for search engine indexing are the description, keywords and robots tags.

- The description tag supplies a description of the web page that is displayed in place of the summary the engine would ordinarily create.

- The keywords tag provides keywords for the engine to associate with your page.

- The robots tag allows a Webmaster to specify that a particular page should not be indexed by a search engine.
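To show what these three tags look like and how a program might read them, the sketch below parses a sample page with Python's standard `html.parser` module. The sample page, the class name `MetaTagParser` and the helper functions are invented for illustration. The `noindex` value is the conventional robots directive for asking engines not to index a page; it does not appear in the text above.

```python
from html.parser import HTMLParser

# Hypothetical page carrying all three meta tags in its <head>.
SAMPLE_PAGE = """<html><head>
<title>Example</title>
<meta name="description" content="A short guide to search engines.">
<meta name="keywords" content="search engines, spiders, indexes">
<meta name="robots" content="noindex">
</head><body>Visible text.</body></html>"""


class MetaTagParser(HTMLParser):
    """Collects the name/content pairs of every <meta> tag on a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html):
    """Return a dict of meta tag names to their content values."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta


def may_index(meta):
    """False if the robots tag asks engines not to index the page."""
    directives = [d.strip().lower() for d in meta.get("robots", "").split(",")]
    return "noindex" not in directives


meta = extract_meta(SAMPLE_PAGE)
```

None of the three values is displayed to a visitor reading the page; they are available only to programs, such as spiders, that read the underlying HTML.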

Searching by Search Directories

Search engines should not be confused with search directories. While search engines find web pages automatically through the use of technology, search directories depend on human intervention. A search directory requires a human to submit a site, to classify the site and to select the site from a list of sites. As a result of human intervention, search directories tend to be better categorized, but less current than search engines.

Searching With Search Managers

Search managers (also referred to as intelligent agents, information agents or agents) are tools that maximize the functionality of all the search engines by coordinating simultaneous intelligent queries into multiple Web search engines. This feature enables the user to enter his or her search terms a single time, but have more than one search engine activated.
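The "enter once, search many" idea can be sketched in Python as follows. The two engine functions are hypothetical stand-ins that return fixed lists of URLs; a real search manager would send the query to actual search engines over the network. Merging the results by dropping duplicates is one simple design choice among many.

```python
from concurrent.futures import ThreadPoolExecutor


def engine_a(query):
    """Hypothetical search engine returning canned results."""
    return ["http://example.com/1", "http://example.com/2"]


def engine_b(query):
    """A second hypothetical engine whose results partly overlap."""
    return ["http://example.com/2", "http://example.com/3"]


def meta_search(query, engines):
    """Send one query to several engines at once and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        # pool.map returns the result lists in the same order as the engines.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:    # keep only the first occurrence of each URL
                seen.add(url)
                merged.append(url)
    return merged


combined = meta_search("search engines", [engine_a, engine_b])
```

The user supplies the query a single time, both engines are consulted in parallel, and the page that both engines return appears only once in the combined list.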

(a) John M. Mrsich and Meeka Jun are associates at New York's Brown Raysman Millstein
Felder & Steiner LLP.