theITbridge, Fully Integrated E-business Solutions
 Anatomy of a Search Engine Part 1

Crawling search engines are those that use automated programs, often referred to as "spiders" or "crawlers", to gather information from the Internet. Most crawling search engines consist of five main parts:

crawler: a specialised automated program that follows the links found on web pages, and directs the spider by finding new sites for it to visit.
spider: an automatic, browser-like program that downloads the documents the crawler has found on the web.
indexer: a program that "reads" the pages downloaded by the spider. It does most of the work of deciding what your site is about.
database (the "index"): storage for the pages that have been downloaded and processed.
results engine: generates search results from the database, according to your query.

There are some minor variations to this. For instance, Ask Jeeves (www.ask.co.uk) uses a "natural language query processor", which allows you to enter a question in plain language. The query processor then analyses your question, decides what you mean, and "translates" that into a query that the results engine will understand. This happens very quickly, and out of sight of users of Ask Jeeves, so it seems as though the computer is able to understand English.

Spiders and crawlers are often referred to as "robots", especially in official documents such as the robots exclusion standard.
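
To give a feel for what the robots exclusion standard looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robot name `ExampleSpider`, the example.com addresses and the rules themselves are made up for illustration; a real robot would read the rules from the site's own robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules: every robot is asked to stay out of /private/.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# "ExampleSpider" is a hypothetical robot name used only for this sketch.
allowed_home = rules.can_fetch("ExampleSpider", "http://www.example.com/index.html")
allowed_private = rules.can_fetch("ExampleSpider", "http://www.example.com/private/page.html")

print(allowed_home)     # True
print(allowed_private)  # False
```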

Crawler:

When a spider downloads pages, it is on the lookout for links. They are easy for it to spot, because in HTML they are always written the same way, as anchor tags with an href attribute. The crawler then decides where the spider should go next, based on those links and its existing list of URLs. Often, any new links it finds when revisiting a site are added to the list. When you add your URL to a Search Engine, it is the crawler that you are asking to visit your site.
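
The idea of spotting links and keeping a list of URLs can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the class names `LinkExtractor` and `Frontier`, and the example.com addresses, are invented for the example, and a real crawler would also decide which URLs to visit first and how often to revisit them.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class Frontier:
    """The crawler's list of URLs: those waiting and those already seen."""

    def __init__(self, seed_urls):
        self.queue = deque(seed_urls)
        self.seen = set(seed_urls)

    def add_links_from(self, page_url, html):
        """Queue every link on the page that has not been seen before."""
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(page_url, href)  # resolve relative links
            if absolute not in self.seen:
                self.seen.add(absolute)
                self.queue.append(absolute)

    def next_url(self):
        """Return the next URL for the spider, or None if the list is empty."""
        return self.queue.popleft() if self.queue else None


frontier = Frontier(["http://www.example.com/"])
page = '<a href="/about.html">About</a> <a href="http://www.example.com/">Home</a>'
frontier.add_links_from(frontier.next_url(), page)
print(list(frontier.queue))  # ['http://www.example.com/about.html']
```

The home page link is not queued again because it is already in the list of URLs the crawler has seen.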

Spider:

A spider is an automated program that downloads the documents that the crawler sends it to. It works very much as a browser does when it connects to a website and downloads pages. Most spiders aren't interested in images though, and don't ask for them to be sent. You can see what the spiders see by going to a web page, clicking the right-hand button on your mouse, then selecting "view source" in the menu that appears.
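
As a rough sketch of the downloading step, the Python below fetches a single document the way a very simple spider might. The function name `download` and the robot name `ExampleSpider/0.1` are assumptions made for the example. Note that it only ever asks for the document itself: any images the page refers to would need separate requests, which this sketch never makes.

```python
import urllib.request

USER_AGENT = "ExampleSpider/0.1"  # hypothetical robot name for this sketch


def download(url, timeout=10):
    """Fetch one document and return its source as text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a character set.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A "data:" URL carries the page inside the URL itself, so this example runs
# without a network connection. A real spider would be given http:// URLs.
source = download("data:text/html,%3Cp%3Ehello%3C%2Fp%3E")
print(source)
```

What comes back is the same text you see with "view source": the HTML, not the rendered page.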

Indexer:

This is the part of the system that decides what your page is about. The words on the page are "read". Some are thrown away because they are so common ("and", "it", "the", etc.). The indexer also examines the HTML code that makes up your page, looking for other clues as to which words you consider to be important. Words in bold, italic or header tags will be given more weight. This is also where the meta information (the keywords and description tags) for your page will be analysed.
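
A toy version of this weighting can be sketched in Python. Everything specific here is an assumption made for illustration: the short stop-word list, the choice of tags, and the weight of 3 for emphasised words are arbitrary, and real indexers are far more elaborate and do not publish their rules.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"and", "it", "the", "a", "of", "to", "is"}  # illustrative only
EMPHASIS_TAGS = {"b", "strong", "i", "em", "h1", "h2", "h3"}
EMPHASIS_WEIGHT = 3  # arbitrary value chosen for this sketch


class WeightedIndexer(HTMLParser):
    """Counts the words on a page, giving emphasised words extra weight."""

    def __init__(self):
        super().__init__()
        self.weights = Counter()
        self.emphasis_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.emphasis_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.add_words(attributes.get("content") or "", 1)

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.emphasis_depth > 0:
            self.emphasis_depth -= 1

    def handle_data(self, data):
        weight = EMPHASIS_WEIGHT if self.emphasis_depth else 1
        self.add_words(data, weight)

    def add_words(self, text, weight):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:  # common words are thrown away
                self.weights[word] += weight


indexer = WeightedIndexer()
indexer.feed(
    '<meta name="keywords" content="widgets">'
    "<h1>Blue Widgets</h1>"
    "<p>The widgets and the <b>gadgets</b> ship today.</p>"
)
print(indexer.weights["widgets"])  # 5: once in meta, 3 for the header, once in the text
print(indexer.weights["gadgets"])  # 3: appears once, in bold
print(indexer.weights["ship"])     # 1: plain text
```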

Database:

The database is where the information gathered by the indexer is stored. Google claims the largest database at the time of writing, with over 3 billion documents. Even assuming that the average size of each document is only a few tens of kilobytes, this can easily run to many terabytes of data (1 terabyte = 1,000 gigabytes = 1 million megabytes), which will obviously require vast amounts of storage.
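
The arithmetic is easy to check. The short Python sketch below takes the 3 billion documents quoted above and assumes an average of 20 kilobytes per document, which is one reading of "a few tens of kilobytes"; the function name `estimate_storage_tb` is made up for the example.

```python
KB_PER_TB = 1_000 ** 3  # 1 TB = 1,000 GB = 1,000,000 MB = 1,000,000,000 KB


def estimate_storage_tb(documents, average_size_kb):
    """Rough storage needed, in terabytes, for the raw documents alone."""
    return documents * average_size_kb / KB_PER_TB


# 3 billion documents (from the text) at an assumed 20 KB each.
total_tb = estimate_storage_tb(3_000_000_000, 20)
print(total_tb)  # 60.0
```

At 50 kilobytes per document the same sum gives 150 terabytes, so the answer depends heavily on the assumed average size.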

Results engine:

This is in many ways the most important part of any Search Engine. The results engine is the customer-facing portion of a Search Engine, and as such is the focus of most optimisation efforts. It is the results engine's function to return the pages most relevant to a user's query.

When a user types in a keyword or phrase, the results engine must decide which pages are most likely to be useful to the user. The method it uses to decide that is called its "algorithm". You may hear Search Engine Optimisation (SEO) experts discuss "algos" or "breaking the algo" for a particular search engine. After all, if you know what the criteria being used are, you can write pages to take advantage of them.
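
To make the idea of an "algorithm" concrete, here is a deliberately simple ranking sketch in Python. It is not any real search engine's method, since those are kept secret: it just adds up the weights of the query words for each page and sorts the pages by that score. The function name `rank`, the example.com addresses and the weights are all invented for the example.

```python
def rank(query, index):
    """Return page URLs ordered by a simple relevance score, best first.

    `index` maps each URL to a dictionary of word -> weight, of the kind an
    indexer might produce. A page's score is the sum of the weights of the
    query words it contains; pages that score zero are left out.
    """
    terms = query.lower().split()
    scores = {}
    for url, weights in index.items():
        score = sum(weights.get(term, 0) for term in terms)
        if score > 0:
            scores[url] = score
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


index = {
    "http://www.example.com/widgets.html": {"blue": 3, "widgets": 5},
    "http://www.example.com/gadgets.html": {"gadgets": 4, "widgets": 1},
    "http://www.example.com/contact.html": {"contact": 2},
}
results = rank("blue widgets", index)
print(results)
```

Here the widgets page scores 8 and comes first, the gadgets page scores 1, and the contact page is dropped because it contains neither word. It also shows why knowing the criteria matters: anyone who knew this rule could raise a page's position simply by repeating the right words.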


© Copyright 2008, theITbridge Ltd.