Microsoft Research Approach to Hierarchical Taxonomy Building

June 3, 2010  
Posted in Access Insights, News, Text processing

June 3, 2010 – Earlier this year we learned about a novel method for building taxonomies. We located the complete paper, “Building Taxonomy of Web Search Intents for Name Entity Queries,” on Microsoft Research’s Web site. Like most technical papers, it contains a hefty dose of math, but the authors also offer hints about what Microsoft may make available in its enterprise and consumer search systems. One example is Microsoft’s focus on “intent phrases.” The idea is that a system can ascertain the reason the user is seeking information. The authors also employed an interesting method for validating the performance of their system: Microsoft Research used human judges recruited via Mechanical Turk. The point of this check was to show that the Microsoft method can build trees of phrases that capture the relationships between important search intents.


Building a Keyword List

June 3, 2010  
Posted in News, Term lists, Text processing

June 3, 2010 – One of the methods we use to build a keyword list requires good, old-fashioned research. In the days of 4×6 notecards, library time was an essential first step. We recently learned about a shortcut that is rumored to be popular among some search engine optimization specialists. We want to share the process because it is a variation on the industrial-strength methods used by firms that specialize in developing seed term lists.

The first step is to obtain a list of high-traffic or important Web sites in a particular subject field. Most of the Web analytics firms publish these lists. We have seen them in the ClickZ newsletter. Then download a product called KeywordThief 3.0 from a shareware site like Brothersoft. (Note: we are not recommending this method or making any guarantees about it. We only want to call it to your attention.)


Once KeywordThief has been installed, it will go to each site on the seed list, locate the keywords, and output a list of these words. Among the claims made for the software, which costs about US $20, are that it:

  • Extracts the keywords from the keywords metatag of the sites listed in the SERPs (search engine result pages) of Google, AltaVista, Open Directory, or other search engines.
  • Calculates the occurrence count and percentage of every keyword found.
  • Sorts the keywords in frequency or alphabetical order.
  • Exports the keywords to a plain text file.

The system generates an ASCII file of terms. The developer’s Web site is here.
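
We have not examined how KeywordThief works internally, so the following is only a minimal sketch of the steps in the list above: pull terms from the keywords metatag, count them, compute percentages, sort, and export plain text. The names MetaKeywordParser, keyword_report, export_plain_text, and SAMPLE_PAGES are our own illustrations, and the sample pages are hard-coded strings rather than pages fetched from a results list.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            for term in content.split(","):
                term = term.strip().lower()
                if term:
                    self.keywords.append(term)


def extract_keywords(html):
    """Return the normalized keyword terms found in one HTML page."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def keyword_report(pages):
    """Return (keyword, count, percentage) rows, most frequent first.

    Ties are broken alphabetically.
    """
    counts = Counter()
    for html in pages:
        counts.update(extract_keywords(html))
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(term, n, 100.0 * n / total) for term, n in rows]


def export_plain_text(rows):
    """Format the report as tab-separated plain text, one keyword per line."""
    return "\n".join(f"{term}\t{n}\t{pct:.1f}%" for term, n, pct in rows)


# Hypothetical seed pages standing in for downloaded sites.
SAMPLE_PAGES = [
    '<html><head><meta name="keywords" content="taxonomy, indexing, thesaurus"></head></html>',
    '<html><head><meta name="Keywords" content="Indexing, metadata"></head></html>',
]

if __name__ == "__main__":
    print(export_plain_text(keyword_report(SAMPLE_PAGES)))
```

A list built this way reflects only what site owners chose to put in their metatags, so it is a starting point for a seed term list, not a vetted vocabulary.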


Compliance and Indexing

June 2, 2010  
Posted in News, Text processing

June 2, 2010 – “Backing Up Corporate Data – How Deduplication Helps” references indexing, but the focus is on the challenge of deduplication. Indexing can help in certain deduplication operations, and it can also be a problem. In free-form indexing environments, two copies of a document may have different index terms assigned. Is this enough to make a document different from another document that differs only in its metadata? On the surface, the idea is silly. When two people index with uncontrolled methods, the document is the “same”. The metadata are not germane. But what happens when two identical documents are indexed with uncontrolled terms and one user assigns the name of a person of interest? The explicit name of the entity does not appear in the document, but the human assigned an entity name to add value to that particular document. Are the documents still identical? Why not retain just the index terms? We think it is important to find out if the entity tag was accurate. Should inaccurate uncontrolled terms be retained?

The write-up focuses on some broad issues in deduplication. We found this passage interesting:

With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy reducing storage and bandwidth demand to only 1 MB. However, indexing of all data is still retained should that data ever be required. Enterprise data today is more dispersed and diverse than ever. And with over 30% critical corporate data sitting on PCs, administrators can no longer hold the end user responsible for its protection. The best corporate data protection solutions combine source-based data deduplication and continuous data protection.

We think the author is tracking with us, and we understand the need to trim duplicates. But once again, the definition of “duplicate” is the first step in figuring out what to keep and what to discard.
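
To make the point about definitions concrete, here is a minimal sketch of our own (it does not come from the write-up or from any product). It stores one copy per fingerprint and keeps every index term, then shows that the number of stored copies depends on whether the index terms are part of the fingerprint. The function names and the sample records, including the entity name, are hypothetical.

```python
import hashlib


def fingerprint(text, index_terms=()):
    """Hash the text, optionally together with its normalized index terms."""
    normalized = "|".join(sorted(term.strip().lower() for term in index_terms))
    payload = text + "\x00" + normalized
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(records, include_terms=False):
    """Store one copy of the text per fingerprint.

    records is a list of (doc_id, text, index_terms) tuples.
    Returns (store, terms_by_key, references):
      store         maps fingerprint -> the single stored copy of the text
      terms_by_key  maps fingerprint -> union of index terms from all copies
      references    maps doc_id -> fingerprint of the stored copy
    """
    store = {}
    terms_by_key = {}
    references = {}
    for doc_id, text, terms in records:
        key = fingerprint(text, terms if include_terms else ())
        store.setdefault(key, text)
        terms_by_key.setdefault(key, set()).update(terms)
        references[doc_id] = key
    return store, terms_by_key, references


# Two copies of the same text; one indexer added an entity name.
RECORDS = [
    ("copy-1", "Quarterly report text", ["finance"]),
    ("copy-2", "Quarterly report text", ["finance", "Jane Doe"]),
]

if __name__ == "__main__":
    by_content, merged_terms, _ = deduplicate(RECORDS)
    by_content_and_terms, _, _ = deduplicate(RECORDS, include_terms=True)
    print(len(by_content), "copy stored when only content defines a duplicate")
    print(len(by_content_and_terms), "copies stored when index terms count too")
```

Under the content-only definition the entity tag survives in the merged term set even though only one copy of the text is kept, which is one way to trim duplicates without discarding the human-assigned terms. Whether an inaccurate tag should survive is a separate question the sketch does not answer.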


Sponsored by Access Innovations.

SEO and LSI May Help Boost Your Web Site in a Results List

June 2, 2010  
Posted in News, Text processing

June 2, 2010 – SEO is an acronym for search engine optimization. The idea is that indexing, content, and links can boost a Web page in Google’s search result list. LSI is shorthand for latent semantic indexing. A definition from PCForHire says:

LSI uses word associations to help search engines know more accurately what a page is about.

If you navigate to Google and enter define:LSI, Google does not recognize this acronym, and our hunch is that most people don’t know what LSI means either.

We read “SEO And LSI How To Use Latent Semantic Indexing” and noted some interesting points in the article:

First, the write-up references LSA, a related method of processing content. LSA stands for latent semantic analysis. We were surprised by the inclusion of both LSI and LSA in a short, essentially nontechnical write-up. Both LSI and LSA seem to be a far reach for the SEO consultants with whom we have interacted.

Second, LSI, according to the write-up, “is utilized by Google primarily to detect spam, in respect of excessive repetition of keywords in an effort to fool the various search engines into providing a excessive listing for that keyword.” We don’t know too much about Google’s numerical recipes, but with large amounts of data, Google’s use of a mathematical method makes sense. Somewhere along the line we heard that the Oingo / Applied Semantics technology played a part in AdSense and other Google services.

Third, we found this passage interesting but it left us asking, “How can a Webmaster use these technologies?”

LSI is used to determine the true meaning of homonyms, heteronyms and polysemes. Homonyms are spelled and pronounced the same, however have different meanings, reminiscent of lock, with three meanings. A heteronym is a phrase spelled the identical as one other, but with a special pronunciation and which means, such as lead: a steel or to be in front. Polysemes are phrases spelled the same, and from the same root, however used in another way corresponding to a mole – a burrowing animal, or a mole – a spy deliberately placed in an organization. Each moles have the same root, but the words are utilized in completely different contexts. LSI or LSA can be utilized to determine the difference by way of analysis of the other words within the text.
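
The last sentence of that passage, telling senses apart by looking at the other words in the text, can be illustrated with something far simpler than LSI. The sketch below is our own stand-in, not LSI or LSA: it counts the overlap between a sentence and a hand-picked set of context words for each sense of “mole.” Real LSI derives its word associations from a large body of text using a matrix factorization, not from lists typed in by hand. The word lists and the name guess_sense are hypothetical.

```python
# Hand-picked context words per sense (an assumption for illustration only;
# LSI would learn associations from a corpus instead).
SENSE_CONTEXT = {
    "animal": {"burrow", "garden", "tunnel", "soil", "lawn"},
    "spy": {"agency", "organization", "secret", "leak", "intelligence"},
}


def guess_sense(sentence, sense_context=SENSE_CONTEXT):
    """Pick the sense whose context words overlap most with the sentence.

    Returns None when no context word appears at all.
    """
    words = {word.strip(".,;:!?").lower() for word in sentence.split()}
    scores = {sense: len(words & context) for sense, context in sense_context.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_sense("The mole dug a tunnel under the lawn."))
    print(guess_sense("The agency feared a mole would leak secret files."))
```

For a Webmaster, the practical reading is modest: the words surrounding an ambiguous term are what let a system pick the intended meaning, so a page about the burrowing animal should use the vocabulary of gardens and tunnels.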

The article contains a wealth of information. If you are in search of a better SEO method, you may find some useful signposts in the Creative Digi Works article.


Sponsored by Data Harmony.
