Text Analytics vs. Semantic Content

February 22, 2012  
Posted in News, semantic, Text processing

There has been an online debate recently about the distinction between text analytics and semantic content enrichment. Many think there are no differences, just terminology preferences, while others believe there are distinct differences in the definitions and their uses.

We found this interesting information on CMSWire in their article, “Smart Content Reviewed: Text Analytics & Semantic Content Enrichment.” Text analytics is actually a set of software and transformational steps that discover business value in “unstructured” text. Content enrichment is as simple as an HTML anchor tag or as complex as unstructured or semi-structured data that has been “enriched” with a context that is further linked to the structured knowledge of a domain. In the case of search results, this means enriched content can surface results that are not explicitly related to the original search.

Melody K. Smith

Sponsored by Access Innovations, the world leader in thesaurus, ontology, and taxonomy creation and metadata application.

Font Awareness

September 5, 2011  
Posted in Access Insights, Featured, Text processing

Shelf Awareness is a Web-based resource for readers and those in the book trade. It was brought to my attention by Marilyn Dahl, Shelf Awareness’s book review editor. (Full disclosure, Marilyn is my cousin.) I bring this wonderful resource to your attention for a variety of reasons, not the least of which is their free email newsletter service for readers and one for professionals in the book trade. Twice a week I get an email containing reviews of a wide variety of new, recent, and sometimes classic fiction and nonfiction works.

What stimulated this blog post is a review by John McFarland of Simon Garfield’s “Just My Type: A Book About Fonts”. Fonts, it seems, have been hotly debated since “Gutenberg invented movable type” and probably as far back as scribes selecting which mallet and chisel combination to use. The issues have become even more contentious as digital has emerged as a competitor to traditional reading of books. It is one thing to spend all day hunched over your monitor plowing through email and spreadsheets, pondering Word docs, and puzzling out PowerPoint presentations, and quite another to curl up on the sofa with your digital reader. (Recall that the last two features in this blog pointed out that the printed book is not dead, but growing, even as e-book growth outpaces hard-copy growth.)

The choice of fonts for long reads using digital devices will probably prove to be more critical than for print. Eyestrain has been a big challenge for screen makers, software developers, and e-book publishers. With more than 100,000 fonts, the choice can be overwhelming.

Fonts have personalities and personalities have fonts. If you want to discover what font matches your personality, visit Pentagram to learn how to best ‘font’ your image. I did and it was fun.

To convey meaning – to achieve understanding – requires a complex interplay of the author, the reader, and the media, but also the environment, timing, delivery, and so much more. How does font contribute to, or inhibit, understanding? Together with layout, color, and point size (the size of the font), the choice of font can make for a pleasing palette for learning and pleasure reading, or cause distraction and mental noise, lessening enjoyment and learning.

Meaning can be tricky to sleuth out, even under the best of circumstances. Font selection contributes to the overall organization and visual appearance of the printed or digital “page”. The choice of words to convey the “aboutness” of a document requires as much finesse and artistic sensibility as font choice. Fortunately, there is help available for designing and implementing a knowledge organization system (KOS). You must start by choosing the right thesaurus or taxonomy before selecting and assigning terms that label an item. A thesaurus is both a KOS and a knowledge representation system (KRS). It conveys a lot about a field, domain, or discipline. How the thesaurus is organized, laid out, and displayed, and yes, even the choice of fonts, shows how a field is organized and how concepts central to the field are related.

There is no need to try too hard to create a connection between fonts and taxonomies, but in the effort to build useful KOSs, small things make a big difference. The hierarchical layout of a thesaurus and the serif/sans serif debate do have an impact on researchers in their hunt for information. (The serif/sans serif debate centers on visual viability on digital displays. The serif tails are thought to degrade on most displays.) The ability to review hundreds of articles could hinge on the small things, like having the right set (and the right-sized set) of documents and minimal eyestrain. Font choice can help ease the pain by producing an appealing visual experience. A database accurately and effectively indexed using the right thesaurus produces the best document set possible, greatly easing the burden of a researcher.

As for me, I’m a “Perpetua Titling Light” man.

Jay Ven Eman, CEO
Access Innovations

You Cannot Classify Unless You Connect

October 4, 2010  
Posted in Featured, indexing, Taxonomy, Text processing

October 4, 2010 — One of the facts of classification and indexing is that one has to have access to the content. Most users of a system assume that any content in an organization is available in an electronic system. That assumption is incorrect. The reason is that certain systems use a proprietary file format and do not make exporting data a feature. In fact, even well-known systems such as Lotus Notes and SAP R/3 can be problematic to index.

The trick in gaining access to content in proprietary, mission-critical systems is what is variously described as a filter, a content connector or connector, or file conversion. Stated simply, without one of these key filters, the content in a third party system remains locked up and can be accessed only by logging into that system and accessing that proprietary system’s functions.
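To make the idea concrete, here is a minimal sketch of how a set of connectors might sit in front of an indexing system. Everything in it is an assumption made for illustration: the function names, the `CONNECTORS` registry, and the `.xyz` extension are hypothetical and do not describe any vendor's product.

```python
import csv
import io
import os


def plain_text_connector(raw: bytes) -> str:
    """Hypothetical connector: plain text needs decoding only."""
    return raw.decode("utf-8")


def csv_connector(raw: bytes) -> str:
    """Hypothetical connector: flatten CSV cells into one indexable string."""
    rows = csv.reader(io.StringIO(raw.decode("utf-8")))
    return " ".join(cell for row in rows for cell in row)


# Assumed registry: one connector per file extension the indexer understands.
CONNECTORS = {
    ".txt": plain_text_connector,
    ".csv": csv_connector,
}


def extract_for_indexing(filename: str, raw: bytes):
    """Return indexable text, or None when no connector exists for the format."""
    extension = os.path.splitext(filename)[1].lower()
    connector = CONNECTORS.get(extension)
    return connector(raw) if connector else None
```

In this sketch, a file whose extension has no registered connector (the made-up `.xyz`, for example) returns `None`. Its content never reaches the indexer, which is the "locked up" situation described above.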

The somewhat esoteric world of connectors is now becoming of greater interest.

On Monday, August 9, 2010, i2 filed a complaint, which is explained in “Media Advisory from i2.” The plaintiff, i2 (www.i2.co.uk), makes allegations related to its intellectual property. You can access the legal documents via Scribd. i2 and Palantir are involved in content processing, data management, and various analytics processes. More about i2 is here. More about Palantir is here. Years ago I did some work for i2 and learned that the firm’s technologies were widely used in intelligence, law enforcement, and related market sectors. Palantir is more of a newcomer. Palantir received an infusion of $90 million in venture funding in 2010.

In the world of connector vendors, only Palantir has a filter to import i2’s content stored in the proprietary ANB file type. Navigate to the Palantir Web site at www.palantir.com and search for DevZone or click here.

If the data on this page is correct, which may not be the case, Palantir has had a connector that handles a file type that is somewhat unusual. i2, in my experience, was quite particular about keeping certain information under wraps. I learned this when I did a small project for the company’s former president, Mike Hunter, but that was four or five years ago. Careful management of data was a hallmark of i2 and a strong part of the firm’s culture when I interacted with the firm and a handful of its engineers.

Now that i2 and Palantir have embarked on a journey into the US judicial system, the future of connectors could be in the hands of a judge or jury. The decision may alter the landscape of the connector market. Companies like Autonomy, EntropySoft, and Oracle offer connectors. However, if the legal spat produces one of those interesting decisions, the freedom to reverse engineer a file type may become an issue.

If so, content that users want in a single system indexed with a common thesaurus and categorized under a single classification system may never arrive. The user will have to adapt to a world in which different systems will have to be accessed in order to retrieve needed information. Then the relevant information may have to be assembled manually.

Stay tuned.

Stephen E Arnold

Microsoft and Its Taxonomy of Intents

July 19, 2010 — Xiaoxin Yin and Sarthak Shah wrote a paper published in April 2010, “Building Taxonomy of Web Search Intents for Name Entity Queries.” We revisited this paper because Yahoo announced that it was considering a service that used popular queries for a new Yahoo news service. You can read a summary of the service here.

Microsoft’s partnership with Yahoo may make certain linguistic and content processing technology part of the partners’ search services. We dug out our copy of the Yin and Shah paper because it contains a number of interesting and thought-provoking ideas. If you have not seen a copy of this research paper, you can get a copy at this link.

What we recalled is that searches for the names of people, places, products, and things are a popular way to locate information. For example, a teen may want to find Lady Gaga’s concert dates from a mobile phone. The query “Lady Gaga” should display basic information. If the search system knows the user’s profile, concert data could be among the top results for that person’s query.

However, entities are notoriously difficult to get right. Many systems rely on controlled term lists, dictionaries, recursive extraction, and sometimes hybrid methods.

The Microsoft team applies the novel idea of building a hierarchical taxonomy of the generic search intents for a class of name entities. The idea is that intent and entities in combination provide a useful “hook” to use when responding to queries. (Queries can come from a person or from a software component.) The Microsoft methods, according to the paper:

…are purely based on search logs, and do not utilize any existing taxonomies such as Wikipedia.
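As a rough illustration of what "purely based on search logs" can mean, the sketch below strips a known entity name from each logged query, counts what is left over as an intent phrase, and groups the phrases by their first word. It is not the algorithm from the Yin and Shah paper. The log entries, the entity list, and the function names are all invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical search-log sample: (query, frequency). Not data from the paper.
SEARCH_LOG = [
    ("lady gaga concert dates", 120),
    ("lady gaga concert tickets", 95),
    ("lady gaga lyrics", 80),
    ("madonna concert dates", 60),
    ("madonna lyrics", 40),
    ("madonna biography", 15),
]

# Assumed list of known entities in one class (here, musicians).
ENTITIES = ["lady gaga", "madonna"]


def intent_phrases(log, entities):
    """Count the words left over after removing a known entity name from each query."""
    counts = Counter()
    for query, freq in log:
        for entity in entities:
            if query.startswith(entity + " "):
                counts[query[len(entity):].strip()] += freq
                break
    return counts


def group_by_head_word(counts):
    """Build a crude two-level hierarchy: first word of the phrase, then the full phrases."""
    tree = defaultdict(dict)
    for phrase, freq in counts.items():
        tree[phrase.split()[0]][phrase] = freq
    return dict(tree)


intents = intent_phrases(SEARCH_LOG, ENTITIES)
taxonomy = group_by_head_word(intents)
```

With this sample, "concert dates" and "concert tickets" end up under a shared "concert" node, and the counts are pooled across both musicians. The point is only that intents generic to a class of entities can be read off the logs without consulting an external taxonomy.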

Our view is that this approach may yield some useful data. However, in our experience, Microsoft is pushing toward automation of this process. The key to success will be processing information quickly and then making those outputs available to the search system.

The weak spot in any automated system is latency. If the system lacks current entity information, in today’s fast moving data flows, the user may run a query and not get the results expected.

This challenge is one that any company relying on automated entity extraction faces. Perhaps more robust hardware and high performance infrastructure will address the challenge?

In our opinion, humans and specialized tagging methods are required to deal with certain types of entities. On the day the iPad became available, was that entity identified and available? Our tests showed that the string “iPad” was the way to obtain the information. “Side searches” and facets did not pick up the iPad as a new class of device. No problem in consumer products. Problem in other information arenas, however.

Stephen E Arnold

Yahoo and Popularity Driven News

July 6, 2010 — Access Innovations, the sponsor of this post, is in the indexing and classification business. Humans have grouped events, people, and information for millennia.

We read with interest the New York Times article about search and the Softpedia commentary. When you read this, the New York Times article will not be easily available. The Softpedia article, which is the focus of our comment, is “Yahoo Experiments with Search Data Driven News”.

The idea is that Yahoo will prowl through log files, identify what users are clicking on, and then use those data such as keywords and phrases to generate a news page.

The passage that caught our attention was:

Yahoo in particular has been adamant about content which is now central to its strategy. The company is going through some massive changes as it shifts its focus to the core properties and content generation. Yahoo has put together a great editorial team, with many editors victims of layoffs at major traditional publications. But it’s also dabbling with alternative approaches, pioneered by companies like Associated Content and Demand Media. Having recently acquired the former, Yahoo is now using it as inspiration for a new blog which will leverage Yahoo’s comprehensive search data to guide the editorial decisions.

The use of click streams to “define” news is interesting. If users follow a normal distribution, then, by definition, the news will favor the topics with the most popular appeal. The approach works for popular culture, but what about important subjects that get little or no attention in the click stream? Examples range from scientific, technical, and medical information to subjects that are unknown to a mass audience, such as replacement parts for a Honda generator.
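A small sketch shows why a page ranked only by clicks squeezes out low-traffic subjects. The click log and the function names below are invented for illustration and say nothing about how Yahoo actually builds its page.

```python
from collections import Counter

# Hypothetical click log: one topic label per click. Invented for illustration.
CLICKS = (
    ["celebrity gossip"] * 500
    + ["sports scores"] * 300
    + ["movie trailers"] * 150
    + ["medical research"] * 40
    + ["honda generator parts"] * 10
)


def popularity_page(clicks, slots):
    """Return the topics that fill a news page when ranked only by click count."""
    return [topic for topic, _ in Counter(clicks).most_common(slots)]


def click_share(clicks, topics):
    """Fraction of all clicks that the chosen topics account for."""
    counts = Counter(clicks)
    return sum(counts[t] for t in topics) / len(clicks)


page = popularity_page(CLICKS, slots=3)
share = click_share(CLICKS, page)
```

In this made-up sample, the three most-clicked topics fill the page and account for 95 percent of the clicks. The medical and generator-parts topics never appear, no matter how important they are to the few people who need them.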

The Yahoo method could be one of those recursive functions that iterate until reaching zero. In this case, will Yahoo news become a content zero?

Stephen E Arnold

Post sponsored by Access Innovations

Semantic Taxonomy Induction

At lunch today, the conversation turned to algorithmic methods of applying tags to content objects. At the table was a person deeply steeped in human-centric methods and one academic who embraced seed lists and algorithmic approaches. What was interesting was that both of these individuals agreed on the need for improved tagging of content objects.

In the course of the conversation, both of my lunch partners mentioned a paper I had not read. “Semantic Taxonomy Induction from Heterogeneous Evidence” became available in 2006. It was one of the first presentations of a method that is becoming increasingly important at a certain large Web search company, if the information I gleaned from the lunch conversation is accurate.

I tracked down a copy of this paper. Here’s the abstract:

We propose a novel algorithm for inducing semantic taxonomies. Previous algorithms for taxonomy induction have typically focused on independent classifiers for discovering new single relationships based on hand-constructed or automatically discovered textual patterns. By contrast, our algorithm flexibly incorporates evidence from multiple classifiers over heterogeneous relationships to optimize the entire structure of the taxonomy, using knowledge of a word’s coordinate terms to help in determining its hypernyms, and vice versa. We apply our algorithm on the problem of sense-disambiguated noun hyponym acquisition, where we combine the predictions of hypernym and coordinate term classifiers with the knowledge in a preexisting semantic taxonomy (WordNet 2.1). We add 10,000 novel synsets to WordNet 2.1 at 84% precision, a relative error reduction of 70% over a non-joint algorithm using the same component classifiers. Finally, we show that a taxonomy built using our algorithm shows a 23% relative F-score improvement over WordNet 2.1 on an independent test set of hypernym pairs.

If you want to get a copy of this interesting paper by Rion Snow, Daniel Jurafsky, and Andrew Ng, click here. Hurry, the document could be removed without warning.

Stephen E Arnold, July 1, 2010

World Congress 2011 Focuses on Data Mining and Machine Learning

June 29, 2010  
Posted in Autoindexing, News, Text processing

June 29, 2010 – Learn about the latest developments and trends in intelligent data and signal analysis in New York, August 30 – September 3, 2011.

World Congress 2011 – The Frontiers in Intelligent Data and Signal Analysis combines stellar scientific events with specialized industrial forums and exhibitions, offering a complete platform for scientists, engineers, and decision makers from industry, as well as for professionals with an interest in this subject. I-Newswire shared news of this event in their article, “World Congress 2011 The Frontiers in Intelligent Data and Signal Analysis”. Intelligent data and signal analysis has gained importance in our everyday life because we are part of the digital age. Every day a multitude of information is created that can be processed by computers. This information often exists as database entries in numerical or symbolic form. However, some is contained in digital signals that are collected and stored, as well as in image data.

This is a great opportunity to inform yourself about the new data analysis methods. The World Congress is also a good arena for companies, decision makers from industry and marketing, and for networking.

Melody K. Smith

Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.

Indexing: Chopping Logic

June 4, 2010  
Posted in Access Insights, News, Text processing

June 4, 2010 – The Nieman Journalism Lab’s “Aggregators, Curators, and Indexers: There’s a Difference and It Matters” makes a great deal of sense. You will want to read the original write-up by C. W. Anderson. For us, the most interesting comment was:

Anytime you hear someone talk about Google News, The Huffington Post, Gawker, blogging, aggregating, curation, and indexing as if they are the same phenomenon, ignore them. And if they attach that discussion to a set of policy recommendations, without acknowledging the full complexity of what it is people actually do when they aggregate, curate, and index information — well, then you should put your fingers in your ears and run in the other direction.

We have one point to add: professional indexing will become more, not less, important.

Search Capabilities Expanding to Arabic Language

June 4, 2010  
Posted in News, Text processing

June 4, 2010 – The recently unveiled SAMAR Project team consists of the French government and international news agency Agence France-Presse (AFP). Their goal is to find a high-quality semantic search strategy for AFP’s Arabic-language news, audio, and video multimedia content.

The Arabic language presents unique challenges for content producers. The difficult linguistic patterns prohibit proper semantic tagging. This potential solution stirs hope that it could be a model platform for Arabic news organizations around the globe. “SAMAR Project: Mapping Arabic Language to Aid News Searchers” reported on this new endeavor. Among the companies taking part in the SAMAR project is the Paris-based corporation TEMIS, the consortium’s ninth member. TEMIS provides knowledge extraction and information analysis and discovery with its text analytics enterprise solution Luxid. This content enrichment tool aids in understanding the Arabic language.

A daunting task, but if achieved it could help bridge a cultural and linguistic gap of giant proportions.

Melody K. Smith

ITC Infotech Escalates its Metatagging Solution

June 4, 2010  
Posted in News, Text processing

June 4, 2010 – Several years ago we heard about ITC Infotech, the India-based services company, and its Metatag+ solution. In 2009, the Gilbane Group published “ITC Infotech Introduces New Solution for the Media and Entertainment Industry.” The product dropped off our radar. Yesterday we fielded an inquiry about Metatag+. Our recollection is that the product combined manual and automated processes and tossed in some repository functions as well. We wanted to provide a snapshot of what the solution delivered:

Metatag+ enables organizations to effectively manage taxonomy, and control vocabulary and schema management across assets. A comprehensive workflow allows ingestion of metadata into the assets, thereby facilitating a productive search environment and enhanced collaboration across assets and related metadata. ITC Infotech also has tailored solutions and services to help customers enhance revenue propensity and reduce total cost of ownership (TCO), while transitioning to the new media world. ITC Infotech has a dedicated media and entertainment practice, focusing on building specific solution accelerators and cost take-out propositions. The practice has developed solution accelerators spanning across areas such as metadata management, asset life cycle management, and intellectual property management, besides cost take-out propositions in digitization, cataloging, technology operations and testing services. Source: Enterpriser.in.

If a reader knows about the status of Metatag+, use the Comments section of this blog to add to our understanding of this product.
