Search Engines and Related Tools

We’ll discuss search engines, the technology related to them (including Web crawlers and search software), and how to implement those things so that taxonomies are put to good use.

A search engine is different from search software. People use the terms interchangeably, but they are not the same. Search software is an application, a body of code. A search engine is a collection of servers holding a large amount of data; it indexes the Internet and delivers the results to users as HTML pages. It does that by spidering around to gather information, usually from the meta name fields in the page headers and now increasingly from the full text, then storing that information and searching it to find matches. You all have some idea of how Google, Lycos, Bing, and the others work, so you know what happens here. Again, a search engine is different from search software.

A spider, particularly a Web spider, is something that crawls across the pages and extracts the information that it needs. It then deposits that information in the search engine application so that it can be searched. What it really stores is the location of the information, the URL, and the keywords and the text of the page itself so that it can be searched.
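To make that concrete, here is a minimal sketch of what a spider might pull from a single page: the URL, the meta name keywords, and the visible text. It uses only Python's standard `html.parser` module. The names `PageExtractor`, `index_record`, and `SAMPLE_HTML`, and the example.com URL, are made up for illustration; a real crawler does a great deal more than this.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the meta keywords and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a reader would actually see on the page.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def index_record(url, html):
    """Build the record a spider might hand to the index (hypothetical shape)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "keywords": parser.keywords,
        "text": " ".join(parser.text_parts),
    }


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = """<html><head>
<title>Taxonomy basics</title>
<meta name="keywords" content="taxonomy, thesaurus, indexing">
</head><body><p>A taxonomy organizes terms.</p></body></html>"""

record = index_record("http://example.com/taxonomy", SAMPLE_HTML)
print(record)
```

The record holds exactly the three things described above: where the page lives, its keywords, and its text.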

Search software looks at a discrete collection, which could be huge. It is the software itself, not the search engine that is crawling across the Web. Search software is likely to be applied to internal content or to a single website.

Search engines and metadata are connected in that the crawler goes looking for information and pulls it out of the HTML header, the part of the page that is formatted for the search software. The metadata is only one part of that header. The crawler pulls that information and the URL (Uniform Resource Locator), and it might also pull the page and cache it so that it can be displayed quickly.

Creating metadata for your website(s) is fairly straightforward. The metadata can help crawlers find appropriate information. Without metadata, structure, and categorization, web crawlers have nothing to work on. We need metadata to sort it all out and to do so in an organized way.
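As a rough illustration of how simple this can be, the sketch below builds the description and keywords meta tags for a page header. The function name `build_meta_tags` and the sample values are invented for this example; the `meta` element with `name="description"` and `name="keywords"` is standard HTML.

```python
from html import escape


def build_meta_tags(description, keywords):
    """Return description and keywords <meta> tags for a page header."""
    # Escape the values so quotes or ampersands cannot break the attribute.
    description_value = escape(description, quote=True)
    keywords_value = escape(", ".join(keywords), quote=True)
    return "\n".join([
        f'<meta name="description" content="{description_value}">',
        f'<meta name="keywords" content="{keywords_value}">',
    ])


# Hypothetical values; each keyword appears once.
tags = build_meta_tags(
    "An introduction to taxonomies and search",
    ["taxonomy", "thesaurus", "indexing"],
)
print(tags)
```

The keywords here would ideally come from your taxonomy, so that the same controlled terms describe the content everywhere.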

Search software is what somebody in an organization applies to their own information. If you don’t have metadata, structured information, and keywords applied to your content, the spiders can’t work well, so your information is not as discoverable as it could be. We need that metadata so that the information on the Web can be sorted out in an organized way.

The metadata tags are the first place on a Web page that crawlers look. They mine there first. If they don’t find the information there, they go deeper into the page, and the page is ranked lower because the information was not as easy to find.

The flip side is that if the crawler finds the same keyword used 17 times in the meta name keywords field, it may recognize that tactic and rank the page lower because of it. So, if I want to own the word ‘taxonomies’ and have it ranked really high, I might be tempted to put that word five times in my meta name keywords field, but that would be an error on my part. I should instead repeat the word ‘taxonomy’ in the full text of my page. That would rank me higher.
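You can check your own pages for this mistake before a crawler does. The sketch below counts how often each term appears in a keywords field and reports the ones that are repeated. The function name `repeated_keywords` and the `max_repeats` threshold are assumptions made for this example; it is not how any particular search engine detects the tactic.

```python
from collections import Counter


def repeated_keywords(content, max_repeats=1):
    """Return the terms that appear more than max_repeats times in a keywords field."""
    # Compare terms case-insensitively, so "Taxonomies" and "taxonomies" count together.
    terms = [t.strip().lower() for t in content.split(",") if t.strip()]
    counts = Counter(terms)
    return {term: n for term, n in counts.items() if n > max_repeats}


# Hypothetical keywords field that stuffs one term.
stuffed = "taxonomies, Taxonomies, search, taxonomies"
print(repeated_keywords(stuffed))
```

An empty result means each keyword appears only once, which is what you want in that field.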

But, yes, that is where the spiders go first. It depends on the spider and how it crawls, but practically all of the spiders go first to the meta name fields. Later they either go on to the full text of the page or skip it. If there is no recent date, if it looks like the page has not been updated in a month, the spider’s attitude is ‘I’m out of here.’

So they depend on the information in that field to know whether to mine the page further. They crawl first to the meta name text and they harvest what they can, and then the algorithm makes the decision as to what they are going to do with the page and how they are going to arrange it. Then they decide if they are going to cache the page or not, and whether they are going to cache just the first page or whether they are going to cache the entire site.

If they are going to cache the entire site, the spider has to grab the entire site. People can make it available or lift it. If it is a lifted site and it is properly done, then it will be crawled on some schedule. They can crawl every two weeks; there are some sites that they crawl every 15 minutes. They crawl some of the new sites very frequently. If they have crawled the site before, they have your page, they have cached it, they show that, and while they are showing that page they go out and get the current page. They update the cache based on users’ keywords.
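The show-the-cached-copy-then-refresh behavior can be sketched in a few lines. This is a simplified illustration under several assumptions: the class name `PageCache`, the `fetch` callable, and the 900-second (15-minute) interval are all invented here, and the refresh happens in the same call rather than in the background as a real engine would do it.

```python
import time


class PageCache:
    """Serves the cached copy of a page and refreshes it once it is stale."""

    def __init__(self, fetch, recrawl_seconds):
        self.fetch = fetch  # hypothetical callable: url -> page content
        self.recrawl_seconds = recrawl_seconds
        self._store = {}  # url -> (time fetched, page content)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(url)
        if entry is None:
            # Never crawled before: fetch, cache, and show the page.
            page = self.fetch(url)
            self._store[url] = (now, page)
            return page
        fetched_at, page = entry
        if now - fetched_at >= self.recrawl_seconds:
            # Show the old copy now; store the current page for the next request.
            self._store[url] = (now, self.fetch(url))
        return page


# Stand-in for the live site: each fetch returns the next version of the page.
versions = iter(["v1", "v2", "v3"])
cache = PageCache(lambda url: next(versions), recrawl_seconds=900)

first = cache.get("http://example.com/", now=0)      # first crawl
second = cache.get("http://example.com/", now=100)   # still fresh
third = cache.get("http://example.com/", now=1000)   # stale: old copy shown, refreshed behind it
fourth = cache.get("http://example.com/", now=1001)  # refreshed copy shown
print(first, second, third, fourth)
```

The third request is the interesting one: the user still sees the old page, and only the following request gets the updated copy.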

In the next installment, we’ll look at how search works.

Marjorie M.K. Hlava, President, Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
