We live in a data-rich world where organizations, governments, and individuals can analyze almost anything. At present, the most easily indexed material on the web is text. However, an estimated 89 to 96 percent of internet content is actually something else: images, video, audio, and files of every type imaginable. This statistic comes from The Intelligencer in their article, “Searching deep and dark: Building a Google for the less visible parts of the web.”

This means the majority of online content isn’t available in a form that search engines can easily index. It either requires a user to log in or is generated dynamically by a program that runs when a user visits the page.

Indexing at this scale must be automated, so how can we teach computers to recognize, index, and search all the different types of material available online? Many are now turning their attention to the “deep web” and the “dark web.” The “surface web” is the online world we can see; the “deep web” is closely related but far less visible, both to human users and to the search engines that crawl the web to catalog it.

Melody K. Smith

Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.