|
Crawling search engines are those that use automated programs,
often referred to as "spiders" or "crawlers", to gather
information from the Internet. Most crawling search engines consist of five
main parts:
crawler: a specialised automated program that follows links found on web pages, and directs the spider by finding new sites for it to visit.
spider: an automatic browser-like program that downloads the documents the crawler has found on the web.
indexer: a program that "reads" the pages that are downloaded by spiders. It does most of the work of deciding what your site is about.
database (the "index"): simply storage of the pages downloaded and processed.
results engine: generates search results out of the database, according to your query.
There are some minor variations on this structure. For instance, Ask Jeeves
(www.ask.co.uk) uses a "natural language query processor", which allows you to
enter a question in plain language. The query processor then analyses your
question, decides what you mean, and "translates" that into a query that
the results engine will understand. This happens very quickly, and out of
sight of users of Ask Jeeves, so it seems as though the computer is able to
understand English.
Spiders and crawlers are often referred to as "robots", especially
in official documents like the robots exclusion standard.
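As a rough illustration of how a robot might honour the robots exclusion
standard, here is a minimal sketch using Python's standard urllib.robotparser
module. The robots.txt rules, the "ExampleBot" name and the example.com URLs
are all made up for the example; a real robot would download the rules from
the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/private/page.html"))
```

With these rules, the first URL is permitted and the second is not, because
its path starts with the disallowed /private/ prefix.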
Crawler:
When a spider downloads pages, it is on the lookout for links. They are
easy for it to spot, because they always take the same form in a page's HTML
(an anchor tag with an href attribute). The crawler then decides where the
spider should go next, based on those links and its existing list of URLs.
Often, any new links it finds when revisiting a site are added to its list.
When you add your URL to a search engine, it is the crawler you are
requesting to visit your site.
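The link-spotting and URL list described above can be sketched as follows.
This is only an illustration built on Python's standard library: the
LinkFinder and Frontier names, the sample page and the example.com addresses
are assumptions made for the example, not a description of how any particular
search engine does it.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkFinder(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for all links found in `html`."""
    finder = LinkFinder()
    finder.feed(html)
    return [urljoin(base_url, link) for link in finder.links]

class Frontier:
    """The crawler's list of URLs; each URL is queued at most once."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Return the next URL for the spider, or None if the list is empty."""
        return self.queue.popleft() if self.queue else None

# Made-up page content standing in for a document the spider downloaded.
page = '<p><a href="/about.html">About</a> <a href="http://www.example.org/">Other</a></p>'
frontier = Frontier()
for url in extract_links(page, "http://www.example.com/index.html"):
    frontier.add(url)
```

Keeping a set of URLs already seen is one simple way to stop the same page
being queued every time another page links to it.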
Spider:
A spider is an automated program that downloads the documents that the crawler
sends it to. It works very much as a browser does when it connects to a website
and downloads pages. Most spiders aren't interested in images though, and
don't ask for them to be sent. You can see what the spiders see by going to
a web page, clicking the right-hand button on your mouse, then selecting "view
source" in the menu that appears.
Indexer:
This is the part of the system that decides what your page is about. The
words on the page are "read". Some are thrown away, as they are
so common ("and", "it", "the", etc.). The indexer will also examine the HTML
code which makes up your page, looking for other clues as to which words you
consider to be important. Words in bold, italic or header tags will be given
more weight. This is also where the meta information (the keywords and
description tags) for your site will be analysed.
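A toy version of this process might look like the sketch below. The list of
common words, the set of tags treated as emphasis and the weight of 3 are all
arbitrary choices made for the example; real indexers are far more elaborate
and their actual weightings are not given here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"and", "it", "the", "a", "of", "to", "is"}   # tiny illustrative list
EMPHASIS_TAGS = {"b", "strong", "i", "em", "h1", "h2", "h3"}
EMPHASIS_WEIGHT = 3   # arbitrary figure, chosen only for the example

class Indexer(HTMLParser):
    """Counts words in a page, weighting emphasised words more heavily."""
    def __init__(self):
        super().__init__()
        self.weights = Counter()
        self.depth = 0   # how many emphasis tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        weight = EMPHASIS_WEIGHT if self.depth else 1
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            if word not in STOP_WORDS:   # very common words are thrown away
                self.weights[word] += weight

def index_page(html):
    """Return a Counter mapping each kept word to its total weight."""
    indexer = Indexer()
    indexer.feed(html)
    indexer.close()
    return indexer.weights

weights = index_page("<h1>Garden Sheds</h1><p>The sheds and the <b>garden</b> tools</p>")
```

In this sample page, "garden" appears once in a header and once in bold, so it
ends up with a higher weight than "tools", which appears only in plain text.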
Database:
The database is where the information gathered by the indexer is stored.
Google claims the largest database at the time of writing, with over 3 billion
documents. Even assuming that the average size of each document is only a few
tens of kilobytes, this can easily run to many terabytes of data
(1 terabyte = 1,000 gigabytes = 1 million megabytes), which will obviously
require vast amounts of storage.
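The arithmetic can be checked with a few lines of Python. The figure of 30
kilobytes per document is an assumption standing in for "a few tens of
kilobytes"; the document count is the one quoted above.

```python
DOCUMENTS = 3_000_000_000     # "over 3 billion documents", from the text
AVG_SIZE_KB = 30              # assumed value for "a few tens of kilobytes"
KB_PER_TB = 1_000_000_000     # 1 TB = 1,000 GB = 1,000,000 MB = 1,000,000,000 KB

def storage_tb(documents, avg_size_kb):
    """Estimated raw storage in terabytes (decimal units)."""
    return documents * avg_size_kb / KB_PER_TB

print(storage_tb(DOCUMENTS, AVG_SIZE_KB))   # prints 90.0
```

So under these assumptions the raw pages alone would take roughly 90
terabytes, before counting any extra data the index itself needs.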
Results engine:
This is in many ways the most important part of any search engine. The results
engine is the customer-facing portion of a search engine, and as such is the
focus of most optimisation efforts. It is the results engine's function to
return the pages most relevant to a user's query.
When a user types in a keyword or phrase, the results engine must decide
which pages are most likely to be useful to the user. The method it uses to
decide that is called its "algorithm". You may hear Search Engine
Optimisation (SEO) experts discuss "algos" or "breaking the
algo" for a particular search engine. After all, if you know what the
criteria being used are, you can write pages to take advantage of them.
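To make the idea of an "algorithm" concrete, here is a deliberately simple
sketch. It ranks pages by adding up the weights of the query words found on
each page. The index contents, the URLs and the scoring rule are all invented
for illustration; real search engines use many more criteria and do not
publish them.

```python
# Hypothetical index: page URL -> {word: weight}, as an indexer might produce.
INDEX = {
    "http://www.example.com/sheds.html": {"garden": 6, "sheds": 4, "tools": 1},
    "http://www.example.com/tools.html": {"garden": 1, "tools": 5},
    "http://www.example.com/cats.html": {"cats": 7},
}

def search(query, index=INDEX):
    """Return URLs of matching pages, highest total weight first."""
    terms = query.lower().split()
    scores = {}
    for url, weights in index.items():
        score = sum(weights.get(term, 0) for term in terms)
        if score > 0:   # pages containing none of the words are left out
            scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)

results = search("garden tools")
```

For the query "garden tools", sheds.html scores 7 and tools.html scores 6, so
sheds.html is listed first. Knowing a rule like this, a page author could
raise a page's score simply by emphasising the right words, which is exactly
why the criteria matter so much to optimisers.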
|