Limitations of Fuzzy Matching of Lexical Variants

April 14, 2014  
Posted in Access Insights, Featured, semantic

Some vendors of text analytics software claim that their software can identify the occurrences of text reflecting specific taxonomy terms  (with the strong, and false, implication that it identifies all such occurrences) using “fuzzy matching” or “fuzzy term matching.” Some explanations of the technology, from Techopedia and Wikipedia, show that it is a fairly crude mathematical approach, similar to the co-occurrence statistical approaches that such software also tends to use, and no match for rule-based indexing approaches that derive their effectiveness from human intelligence.

I remember searching online for information about the Data Harmony Users Group (DHUG) meeting. Google, in its infinitely fuzzy wisdom, asked, “Did you mean ‘thug’?”

As explained in Techopedia,

Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences from a database. When an exact match is not found for a sentence or phrase, fuzzy matching can be applied. Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold matching percentage set by the application.

Fuzzy matching is mainly used in computer-assisted translation and other related applications.

Fuzzy matching searches a translation memory for a query’s phrases or words, finding derivatives by suggesting words with approximate matching in meanings as well as spellings.

The fuzzy matching technique applies a matching percentage. The database returns possible matches for the queried word between a certain percentage (the threshold percentage) and 100 percent.

So far, fuzzy matching is not capable of replacing humans in language translation processing.

And the Wikipedia article on the subject explains as follows:

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern. The usual primitive operations are:

    • insertion: cot → coat
    • deletion: coat → cot
    • substitution: coat → cost

 These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted:

    • insertion: co*t → coat
    • deletion: coat → co*t
    • substitution: coat → cost

The most common application of approximate matchers until recently has been spell checking.
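To make the mechanics concrete, here is a minimal sketch, in Python, of the edit distance calculation described above, together with the kind of threshold test Techopedia mentions. (The fuzzy_match helper and its 70 percent threshold are invented for the illustration; they are not any vendor’s implementation.)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(candidate: str, term: str, threshold: float = 0.7) -> bool:
    """Accept the candidate if its similarity to the term meets the threshold percentage."""
    distance = edit_distance(candidate.lower(), term.lower())
    similarity = 1.0 - distance / max(len(candidate), len(term))
    return similarity >= threshold

print(edit_distance("cot", "coat"))   # 1 (one insertion)
print(edit_distance("coat", "cost"))  # 1 (one substitution)

# With a 70 percent threshold, "cost" fuzzily "matches" the term "coat"
# even though the concepts have nothing to do with each other -- a false positive.
print(fuzzy_match("cost", "coat"))    # True

Nothing in that arithmetic knows anything about meaning; it only counts character operations.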

In blog comments, readers have pointed out the large number of false positives this approach would create. As a small example of the possibilities, think what would happen with the American Society of Civil Engineers (ASCE) thesaurus and “bridging the gap.” Not to mention the many, many words and acronyms that a rule-based approach could easily disambiguate.

And then there are all the occurrences of the concept that would be totally missed by the fuzzy matching approach. Not all synonyms, nor all the other kinds of expressions of various concepts, are lexical variants that are similar character strings. Fuzzy matching has no way of dealing with these alternative expressions, and they happen often.

There is another problem with these approaches. They are sometimes tied in with weighting, with the “edit distance” mentioned in the Wikipedia article used to downgrade the supposed relevance of a lexical variant. Why on earth should a variant be downgraded, if the intended concept is completely identical to the one expressed by the corresponding preferred term?

Nor does the fuzzy approach save human time and effort. Rules covering a wide variety of lexical variants can be written using truncated strings as the basic text to match, with proximity conditions added as desired to make those rules more accurate.
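As an illustration only (this is plain Python, not M.A.I. rule syntax, and the truncated string and window size are made up for the example), a rule built on a truncated string with a proximity condition might look something like this:

import re

def rule_match(text: str, stem: str, context: str, window: int = 40) -> bool:
    """Match a truncated string (stem) only when a context word appears
    within `window` characters -- a rough stand-in for a proximity condition."""
    for m in re.finditer(rf"\b{stem}\w*", text, re.IGNORECASE):
        lo, hi = max(0, m.start() - window), m.end() + window
        if re.search(rf"\b{context}\w*", text[lo:hi], re.IGNORECASE):
            return True
    return False

# The stem "bridg" covers bridge, bridges, and bridging; requiring a nearby
# engineering word keeps "bridging the gap" from being indexed under Bridges.
print(rule_match("The suspension bridge spans the river", "bridg", "span"))  # True
print(rule_match("bridging the gap between departments", "bridg", "span"))   # False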

Sometimes (in fact, rather frequently), a concept is expressed in separate parts over the course of a sentence, a paragraph, or a larger span of text. There is no way that I’m aware of that fuzzy matching can deal with that. A rule-based approach can.

In short, fuzzy matching has serious deficiencies as far as indexing and search are concerned, and is vastly inferior to a rule-based approach.

Barbara Gilles, Taxonomist
Access Innovations

The Semantics of Whisk(e)y

April 7, 2014  
Posted in Access Insights, Featured, Taxonomy

As noted last week in our article, “A Spirit of Another Name,” Saveur has created a glossary of Whisk(e)ys. However, as we all know, a glossary does not a taxonomy make – it can, however, be a good starting point.

One of the problems, of course, is that national styles – and even spellings – are mutable.

In general, “whiskey” comes from Ireland and the United States, while “whisky” (no ‘e’) comes from Canada and Scotland.  However, well-known bourbon Maker’s Mark long ago decided to buck the semantic trend and drop the “e” despite being an all-American brand.

The Japanese usually use the Scottish version (being heavily influenced by Scotch). However, there are now whisk(e)ys being made in France, Wales, Germany, Australia, Finland, India, Sweden, Spain, and the Czech Republic, to name a few. Adoption of one or the other spelling variant is, well, varied.

Besides, once we accept that “whiskey” and “whisky” are synonyms, the spelling has little to do with the semantics.

It’s more important to understand the legal (and in some cases, traditional but perhaps not codified) production requirements that define the various styles; these are often (but not always) defined by name and region.

For example: Canadian whisky is, by law, allowed to have up to 9.09% “flavorings” – a category of adulterants with no definition (but which in practice includes artificial colors, many different sweeteners, prune and other fruit juices, etc.) – and this is what gives Canadian whisky its characteristic sweet taste. Scotch whisky can have caramel color added, but no flavorings. Bourbon, on the other hand, can be cut to proof with water, but must by law have no additives for either color or flavor.

Straight Bourbon Whiskey is, in fact, the most strictly defined and regulated of the whiskeys – although, contrary to common beliefs, it need not be made in Kentucky. It must, however, be produced in the United States from spring water; the mash (mixture of grains) must comprise at least 51% corn (the rest being barley, wheat, and rye); it must be aged no less than 24 months in new charred American white oak barrels; and of course it must not contain any additives. (For the record, Rye is identical to Bourbon with the very important exception that it must contain no less than 51% rye.)

Other factors are also in play. For example, Irish whiskey is almost always triple-distilled, while Scotch is almost always double-distilled. Scotch is further delineated by region (Highland, Campbeltown, Islay, Highland Islands, Lowland; Speyside is a sub-region of Highland) and drying methods (whether peat, gas, or coal is used to dry the grain to stop the germination process) as well as the various permutations of blends, single malts, and vatted malts (by many names), not to mention other variants such as single-barrel, cask-strength, and various “finishes” in casks which formerly held other kinds of spirits.

Now that is a categorization problem.

In constructing a taxonomy of whisk(e)y, a faceted approach might be best. However, given the limited space here, let’s just take a quick crack.  (The Top Term is of course “Whiskey” UF=Whisky.)

Whiskey Blends
. . Blended Whiskey
. . Grain whiskey
. . . . Corn whiskey
. . Vatted Malts   [UF=Blended Malts   SN=Comprised of various single malts, no “grain”]
. . Single Malts
Whiskey Production
. . Peated Whiskey   [RT=Islay Whisky]
. . Pot still Whiskey
. . Single barrel Whiskey   [UF=Single-barrel Whiskey]
. . Small batch Whiskey
. . Whiskey Distillation
. . . . Double Distillation   [UF=Double-distilled]
. . . . Triple Distillation   [UF=Triple-distilled]
Whiskey Regions
. . American Whiskey
. . . . Bourbon Whiskey
. . . . California Whiskey
. . . . Oregon Whiskey
. . . . [add other states as necessary]
. . . . Rye Whiskey
. . . . Tennessee Whiskey
. . Australian Whiskey
. . Canadian Whisky
. . European Whiskeys
. . . . Czech Whisky
. . . . Finnish Whisky
. . . . French Whisky
. . . . German Whisky
. . . . Irish Whiskey
. . . . Scotch Whisky
. . . . . . Campbeltown Whisky
. . . . . . Highland Whisky
. . . . . . . . Speyside Whisky
. . . . . . . . Highland Island Whisky
. . . . . . Islay Whisky
. . . . . . Lowland Whisky
. . . . Spanish Whisky
. . . . Swedish Whisky
. . . . Welsh Whisky
. . . . [add others as needed]
. . Indian Whisky
. . Japanese Whisky
Whiskey Strengths
. . Cask Strength
. . Overproof   [SN=95 proof or higher]
. . Standard proof   [SN=80 to 94 proof*]

*this is a little arbitrary but reflects industry norms

UF=Use For

SN=Scope Note

RT=Related term

Clearly I’m missing cask finishes (mostly in Scotch, but now reaching Bourbon territory) and a few other things. (Hey, it’s just a blog post.)  Ages, also, could quickly become a problem: no one wants a list of cardinal numbers in their thesaurus.

The various brands could, then, be narrower terms in the hierarchy I’ve sketched out above.

To avoid massive categorization issues and massive duplication (going Netflix-style won’t work, as “Single malt overproof cask-finished Campbeltown Scotch whiskey” is a pretty unwieldy taxonomy term), you’d have to apply multiple labels to categorize each individual item. It helps to imagine this being used for e-commerce (as opposed to scholarly document categorization): think about browseable tabs on a website. You’d want to find Laphroaig under both “Peated Whiskeys” and “Islay Whiskys” so that people could find what they were looking for using multiple approaches, as in the sketch below.
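As a small illustration of that multi-label idea, here is a minimal sketch in Python; the WhiskeyItem fields and facet labels are invented for the example (borrowing terms from the hierarchy above), and the browse function mimics the “browseable tab” behavior just described:

from dataclasses import dataclass, field

@dataclass
class WhiskeyItem:
    """A catalog item tagged with terms from several facets of the whisk(e)y hierarchy."""
    name: str
    region: list[str] = field(default_factory=list)
    production: list[str] = field(default_factory=list)
    strength: list[str] = field(default_factory=list)
    blend: list[str] = field(default_factory=list)

laphroaig_10 = WhiskeyItem(
    name="Laphroaig 10",
    region=["Scotch Whisky", "Islay Whisky"],
    production=["Peated Whiskey"],
    strength=["Standard proof"],
    blend=["Single Malts"],
)

def browse(items: list[WhiskeyItem], facet: str, term: str) -> list[str]:
    """Return every item carrying the given term in the given facet."""
    return [item.name for item in items if term in getattr(item, facet)]

items = [laphroaig_10]
print(browse(items, "production", "Peated Whiskey"))  # ['Laphroaig 10']
print(browse(items, "region", "Islay Whisky"))        # ['Laphroaig 10']

The same bottle surfaces under more than one tab without duplicating the record or minting an unwieldy compound term.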

This is why I described it as a “faceted approach.” But let’s not get into that now. For the same reason, though, I’m going to stop while I’m ahead.

Bob Kasenchak, Project Coordinator
Access Innovations

Knowledge Organization Systems and Return on Investment (KOSs and ROI)

Let’s call him George. George was having a very bad day. He needed legal advice. So, over his lunch hour he scheduled an appointment. After George described his situation, the lawyer pulled a book from the shelf behind him. After briefly scanning it and checking precedent, he confidently told George with a smile, “Relax. This is a slam dunk.” On his way out of the office, George gazed at the consultation invoice and muttered to himself: “15 minutes! Only 15 minutes and he charged me $325.00!”

On his way back to the office, George’s car started making some serious grinding noises whenever he turned to the right. He pulled into the local Fix-It-All Garage and described the noise to the technician. After turning over the keys, George looked through the large glass window as the mechanic pulled here and tugged there at his car up on the lift. After only a few keystrokes at his computer station, the technician spent the next ten minutes installing what looked to George like a $10.00 part. In the blink of an eye, George was standing at the counter with another invoice. He called his wife to grumble: “He pulled and tugged in two different spots and then charged me $325.00. I’m in the wrong line of work!”

Certain that his ulcer was acting up, George stopped at the clinic on his way home that day. The doctor, who agreed to fit him in right away, asked a few short questions, consulted his desk reference guide,  and started writing a prescription. Moments later, speechless George could only grimace as he faced yet another hefty bill. Poor George.

Besides a considerable amount of cash, what was George missing? Someone might say that George knew the cost of everything, but the value of nothing. Because George was able to successfully confront and overcome several perplexing and complex problems, someone else might say, “What a great quality of life George has!” It all looked so deceptively easy. However, to focus only on the “interface” is to fail to consider the years of training and experience behind each professional who knew just what questions to ask, just where to look and pull and tug, and just which resource to consult.

How does one measure the true value of successful information organization, navigation, and retrieval? Access Innovations Inc. offers superior customer service, ease of product use, and support, combined with years of experience in order to provide outstanding quality. Speak with the CEO of Access Innovations, Inc., Jay Van Eman, about the qualitative and quantitative criteria used to assess successful KOSs and the proper rationale for measuring ROI in your setting. Are you getting real value for the cost?

Check out these additional resources:

Why Knowledge Management Is Important To The Success Of Your Company

The Use of Return on Investment (ROI) in the Performance Measurement and Evaluation of Information Systems

ROI & Impact: Quantitative & Qualitative Measures for Taxonomies

Eric Ziecker, Information Consultant
Access Innovations, Inc.

Of Taxonomies, Biology, and Moneyball

March 24, 2014  
Posted in Access Insights, Featured, Taxonomy

Baseball and biology are not commonly found in the same conceptual space. Neither do you find taxonomy associated with baseball, but in recent news these connections were made. Grant Bisbee, editor of “Baseball Nation”, digresses into the arcane as he laments the coming of the “He’s In the Best Shape of His Life” season. This is the time of year baseball writers must assess the prospects for the coming season, and clichés and hyperbole reign. The dubious practice of evaluating the physical condition of players runs rampant as spring training begins. With tongue in cheek, Bisbee tries to shape a taxonomy to classify this spring ritual. His would be the taxonomy of the “In the Best-shape Stories”.

Bisbee suggests three categories: “Player X Got in Shape”, “Fixed Eyesight”, and “A Serious, Previously Undiscovered Affliction”. He compiles representative stories for each classification. He is somewhat skeptical that a player could be in the “best shape of his life” as a result of some of the reported training regimes. Can someone really put on 18 pounds of muscle in the off-season? Legally? Shouldn’t vision examinations be a regular part of baseball teams’ operations? After all, they’re paying millions for their players’ hand-eye coordination.

Are there baseball taxonomies already out there that could help Mr. Bisbee classify his apocryphal collection? A “Google” search of baseball taxonomies returns a million items more or less – plus or minus – give or take – a few hundred thousand. Not much help there. At the very least, Bisbee could expand the classification to include other sports and how each portrays their mystical, preseason rituals. How are preseason body rebuilding rhythms disturbed by “Seasonus Interruptus”?

And biology is experiencing a potential season-disrupting trend. Tom Spears, reporting for the “Ottawa Citizen”, filed the story, “Taxing times for taxonomy”. On a more serious note, Mr. Spears chronicles the demise of field-based taxonomic study in biology in favor of lab work. Computers and DNA studies have relegated classification to the closet. While the work done in laboratories is vital to the field, biologist Ernest Small is quoted: “How can you do a study of forests without knowing the trees?” The reverse is also true; you can’t really study a tree without understanding its forest. Commenting on his lab-centric colleagues, Dr. Small laments “the lack of knowledge of what plants and animals make up our world”. Spears goes on to write, “We need to understand whole species, not just genes, if we are to solve the problems of agriculture, fisheries, insect pests, ecology and the spread of diseases.”

Taxonomies are central to understanding a field as a whole, whether biology or sports. And they are critical to understanding individual members of the taxonomy. The taxonomy of plants and animals tells you about the relationships and characteristics of its members. The child, or narrower node in a taxonomy, inherits the properties of the parent, or broader node. Reviewing an entire hierarchical branch and its relationship to other branches conveys the properties, similarities, and differences of branches. This knowledge can lead to sleuthing out meaningful trends and can help explain interactions or identify possible interactions. A taxonomy for organizing sports articles might reveal interesting groupings and trends. Are there more injuries being reported this season? Are there geographic trends or team trends? Applying graphical analysis visualizes the content in ways that make trends easier to spot.

What these two diverse stories have in common, besides the core theme of taxonomies, is the need for hands-on, down-and-dirty fieldwork. Sports writers need to see players perform during preseason to really assess their fitness. Biologists can learn a great deal about a species by looking through a microscope, but they can’t really understand behavior that way. They need to observe firsthand. When the field data and lab data are gathered, placing them in context gives them meaning and leads to insights. Taxonomy is ideal for grouping data in meaningful ways, and that can lead to insights. It can also lead you back to the data at some future point. Taxonomy is often focused on organizing content to make search easy, fast, reliable, and replicable. Taxonomy also defines a domain or conceptual space. It can help you find your way and keep you from getting lost. Equally important, it provides meaning to the conceptual space you’re wandering through. Spring is almost here – enjoy the first Danaus plexippus of the season and know it has a genus and a tribe and a family and more. Play ball!

Jay Ven Eman, CEO, Access Innovations

First published on January 30, 2012.

Release of Data Harmony® Software Version 3.9 Planned for March 31, 2014

March 17, 2014  
Posted in Access Insights, Featured, indexing, metadata

Data Harmony, a division of Access Innovations, Inc., has announced that version 3.9 of its suite of software tools will be released on March 31, 2014. Current customers will be the first to receive the new version.

“Our recent enhancements provide greater search efficiency while browsing your thesaurus through callouts and hyperlinks to related and narrower terms,” said Allex Lyons, a member of the Data Harmony programming team. “Dropdown menus in the search window display subject terms with an auto-complete function that assists users of any skill or experience level.”

“The Thesaurus Master™  module is a wonderful tool for creating and maintaining taxonomies, and when it’s coupled with M.A.I.™, our customers have a world-class semantic enrichment platform,” comments Bob Kasenchak, Project Coordinator at Access Innovations. “In addition, this release of the software works with our new Semantic Fingerprinting application for creation of disambiguated author networks. For a number of reasons, author nets are becoming a critical component of the publishing workflow for many of our publishing clients.”

Access Innovations’ editorial team collaborated with Data Harmony programmers to update the user guides and improve technical documentation for all software modules. “For version 3.9, MAIstro is featuring several new thesaurus export formats, and there are great new functions inside the user interface,” noted Kirk Sanders, Project Manager, Access Innovations Editorial Department. “In addition, we’re releasing several new Web service modules that extend Data Harmony software in intriguing ways. I’m eager to see how our clients gain from the creative implementation of Data Harmony Inline Tagging, Search Harmony, and the Data Harmony Author Submission System — and the other Web services!” The Data Harmony Web service modules offer the power to configure instantaneous Web functions inside a graphical user interface based on Data Harmony software actions.

Current Data Harmony customers can expect to be contacted about version 3.9 by the end of March, according to Marjorie M. K. Hlava, President of Access Innovations. Further information will be available on the Data Harmony website at www.dataharmony.com.

 

About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

Semantic Fingerprinting for Name Disambiguation

[Image: semantic fingerprint illustration]

Many institutions and organizations – notably (but not limited to) publishers – have large, or sometimes very, very large, lists of names. These names come from member directories, employees and staff, clients and customers, marketing, development, and many other sources; indeed, oftentimes the lists from various departments in the same organization are not connected or resolved with one another in any way.

This growing problem has given rise to a sub-field in the information/data industry variously called “named entity disambiguation” or “author disambiguation” or “name disambiguation”, among other monikers. In the academic publishing space, disambiguation of author names is a common challenge.

In a nutshell, given a list of names—let’s say, oh, 3.2 million names—we need to determine which ones are the same person and which are not. We might proceed as follows:

[Image: a sample list of author name variants]

The goal is, as automatically as possible, to sort out which of these records should be merged. Once accomplished, you (a publisher) could make a webpage for each author listing all publications and so forth for your users to browse.

Clearly, some of the names above are potentially the same person, while others are not. For example, B. Caldwell Smith, B.C. Smith, Brandon C. Smith, and Brandon Caldwell Smith look like they might be the same person. To find out without looking at every name and every article (3.2 million, remember?), we need more information.

To accomplish this task, the metadata associated with each author is examined and compared to try to eliminate duplicates. For example, from each article we can associate an author with their co-authors, the institution with which they were affiliated when the paper was published, email addresses, dates of publication, and so forth.

[Image: the same name list with metadata added, such as institutions and publication years]

Well, some things are clearer, but some are not. Whereas before we may have suspected that Rodger Smith and Roger Smith were different people, it turns out they published at the same institution in the same year; maybe the difference is just a typo? And maybe Brandon C. Caldwell moved from Harvard to Yale (not unheard of) sometime between 1961 and 1972?

At Access Innovations we’ve been developing a way to add some certainty to the process using semantic metadata—it’s not a silver bullet, but it is a bigger gun. We call the process “semantic fingerprinting” and it’s based on our thesaurus and indexing technology.

Every author’s works (papers, conference proceedings, editorial roles) associate that author with one or more pieces of content, and for each piece of content we have indexing terms from a thesaurus particular to that client. By associating the author directly with the indexing terms, we develop a semantic profile (or “fingerprint”) for each one. Since most authors write multiple papers (see “Lotka’s Law”), we compile the subject terms from each paper to make a more complete profile; obviously, the more papers we have, the more accurate these profiles are.
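As a rough sketch of the idea (this is not the Data Harmony implementation; the names, terms, and the Jaccard comparison are chosen purely for illustration), compiling and comparing two such fingerprints might look like this in Python:

from collections import Counter

def semantic_fingerprint(papers: list[list[str]]) -> Counter:
    """Compile the thesaurus terms from all of an author's papers into one term profile."""
    fingerprint = Counter()
    for terms in papers:
        fingerprint.update(terms)
    return fingerprint

def similarity(fp_a: Counter, fp_b: Counter) -> float:
    """Jaccard overlap of the two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical records: two similar name strings and the index terms of their papers.
rodger = semantic_fingerprint([
    ["Plasma physics", "Tokamaks"],
    ["Plasma physics", "Magnetic confinement"],
])
roger = semantic_fingerprint([
    ["Organic chemistry", "Catalysis"],
])

# A low score suggests two distinct researchers despite the nearly identical names.
print(similarity(rodger, roger))  # 0.0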

Returning to our example:

[Image: the name list with areas of expertise added from the indexing terms]

What we suspected to be perhaps one person based on our best information turns out pretty obviously to be two distinct researchers once the areas of expertise are added to the equation.

While the process is far from foolproof, it does help to automate the disambiguation process, which cuts down on the number of human hours required to review the work.

The concept of the “semantic fingerprint” can be applied to a paper, a school, an editor, or any other entity for which subject metadata is available. So this same basic process can be used for other purposes; for example, to:

  • Disambiguate institution names
  • Match articles to peer reviewers or editors
  • Demonstrate what areas of research are exploding at:
    • A journal
    • A college
    • A research laboratory

As datasets get cleaner and cleaner, the accuracy of, and uses for, semantic technologies—such as Access Innovations’ Semantic Fingerprinting techniques—will continue to increase.

Bob Kasenchak, Project Coordinator
Access Innovations

Semantic Fingerprinting image © Access Innovations, Inc.

Pumas and Cougars and Snails, OH MY!

When you use a thesaurus for indexing content covering multiple disciplines, the need for disambiguation of terms increases. This fact of thesaurus life was well illustrated in a presentation at this year’s DHUG (Data Harmony Users Group) meeting. The presentation, by Rachel Drysdale, Taxonomy Manager of the Public Library of Science (PLOS), was titled “The PLOS Thesaurus: the first year.”

While Rachel discussed a variety of aspects of thesaurus implementation and maintenance, what caught my interest and sympathy as a fellow taxonomist was her description of what she called “taxonomy funnies.” Anyone who has been a taxonomist for any length of time has run into such funnies: problems that are chuckle-worthy but still need to be dealt with.

In the talk, Rachel discussed the refinement of indexing rules. PLOS maintains its thesaurus in MAIstro, a Data Harmony software application that integrates Thesaurus Master, a taxonomy management tool, with M.A.I., an indexing application in which a “rule base” of indexing rules is maintained. In MAIstro, when a term is added to a thesaurus, a simple identity rule is automatically created in the associated rule base. So when the Animals branch was being developed, the addition of “Pumas” caused the creation of a rule that looked like this:

Text to Match [in the text being read and parsed by M.A.I.]: pumas

USE [Indexing term] Pumas

M.A.I. also recognizes singular and plural variants. In the absence of any rule or condition to the contrary, the rule above would cause the indexing term “Pumas” to be automatically assigned, or suggested to a human editor, whenever the text string “puma” was encountered.

PLOS content has good coverage of zoological topics, but is also especially heavy on molecular biology, particularly genetics. The PLOS wordsmiths were mystified when they found that multitudes of genetics articles were being indexed with the term “Pumas”. True, there might have been a sprinkling of articles about wild feline genetics, but this would not account for the number of articles that boasted the “Pumas” descriptor.

The taxonomists at PLOS looked at the articles in question and found the culprit. “PUMA” was appearing in those articles, as an acronym for a gene whose full name is “p53 upregulated modulator of apoptosis.” (I can’t blame the geneticists for using an acronym for that one. The full name isn’t very conversation friendly.) And it’s not specific to pumas; humans have it, and so do such diverse creatures as fish and frogs. So the PLOS taxonomists modified the indexing rule, adding conditions that required at least one other word or phrase having to do with the world of wild feline creatures to be present before “Pumas” could be assigned or suggested. The addition of a few synonyms and quasi-synonyms for pumas made the rule richer and better able to disambiguate pumas from PUMAs. The rule ended up looking like this:

Text to Match: pumas

IF (MENTIONS “feline*” OR MENTIONS “jaguar*” … OR MENTIONS “panther*” OR MENTIONS “cougar*” OR MENTIONS “catamount*” …)

USE Pumas

ENDIF
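Outside of M.A.I.’s own rule syntax, the logic of that rule can be sketched as a simple contextual check. (This Python fragment is an illustration of the idea, not Data Harmony code, and the context list is abbreviated.)

import re

FELINE_CONTEXT = ["feline", "jaguar", "panther", "cougar", "catamount"]

def suggest_pumas(text: str) -> bool:
    """Suggest the indexing term 'Pumas' only when the text mentions pumas
    and at least one wild-cat context word is also present."""
    if not re.search(r"\bpumas?\b", text, re.IGNORECASE):
        return False
    return any(re.search(rf"\b{word}", text, re.IGNORECASE) for word in FELINE_CONTEXT)

print(suggest_pumas("PUMA expression in apoptotic cells"))           # False
print(suggest_pumas("Puma and jaguar ranges overlap in the Andes"))  # True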

The next indexing run was much better. Alas, there were still some articles inappropriately indexed with “Pumas”. What was wrong? The PLOS editors did some more detective work.

It turned out that some of the problem articles were about the toxoplasma parasite, which has many variant strains and is found in a wide variety of organisms, including people, frogs, and cats. One of those variant strains is known as COUGAR. A conceptual relationship with actual cougar critters does exist; the variant was first discovered in a group of Canadian cougars. That’s rather tangential, though. The toxoplasma articles in question aren’t really about cougars. The problem was that as far as animals (and the PLOS rule base) were concerned, “Cougars” is a synonym of “Pumas”. So when the indexing system read “COUGAR” in the text, “Pumas” got popped onto the list of subject terms for each of those toxoplasma articles.

The next critter slithering amok through the PLOS records was the snail. What would make snails unruly? The real culprit is once again a gene in disguise, in this case SNAI1, naturally referred to frequently as SNAIL. Once such a culprit is properly identified, it’s a straightforward matter to modify the rule so that the wrong term is no longer suggested or assigned, by considering likely contexts and reflecting them in the rule conditions. One bonus of the situation is that the same rule can be further modified to enable indexing of the formerly problematic document with a more appropriate term.

There’s no reason to be afraid of the wild animals in your thesaurus, as long as you stay alert for them. You can tame the mighty mountain lion and the slithery snail.

Barbara Gilles, Taxonomist
Access Innovations

Access Innovations Named in KMWorld’s Annual “100 Companies That Matter in Knowledge Management”

February 24, 2014  
Posted in Access Insights, Business strategy, Featured

Access Innovations, Inc., a leader in digital data organization, announces its inclusion on KMWorld’s annual list of the “Top 100 Companies That Matter in Knowledge Management.”

[Image: KMWorld 100 Companies That Matter, 2014]
Access Innovations, Inc. is featured for its third year, after debuting on the list in 2005. Other notable companies given a spot on the 2014 top 100 list include Adobe, Google, IBM, and Microsoft.

“Criteria for inclusion vary, but all companies have things in common. Access Innovations, Inc. has proven to define the spirit of practical innovation by blending sparkling technology with a deep, fundamental commitment to customer success,” says Hugh McKellar, KMWorld editor-in-Chief.

Marjorie M. K. Hlava, president of Access Innovations, says she is honored by her company’s accolade. “Access Innovations prides itself on pushing the edges of technology to meet the needs of the next generation of knowledge management,” she says. “It’s challenging and rewarding to be at the cutting edge of knowledge management, and it’s delightful to be recognized as a leader in the field, making content findable for our customers and their users.”

The Top 100 Companies That Matter list is compiled annually by editorial colleagues, analysts, theorists and practitioners. Unlike many other trade lists, inclusion is not purchased and is at the sole discretion of KMWorld’s editors.

For a full list of the Top 100 Companies That Matter in Knowledge Management, pick up the March issue of KMWorld, which is available on newsstands now, or click here to view the online article.

About Access Innovations, Inc.

Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.

About KMWorld

The leading information provider serving the knowledge, document, and content management systems market, KMWorld informs more than 45,000 subscribers about the components and processes—and subsequent success stories—that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc.

Information and Data Governance/Management – Adding Ideas From Complementary Disciplines

February 10, 2014  
Posted in Access Insights, Featured, Standards

After critical business information has been identified at a high level and a focal has been assigned, best practices from complementary disciplines can be incorporated.

Identify the main subjects for a business-specific controlled vocabulary 

Each company or organization develops its own language for talking about what it does. Like all languages, organizational languages are based on a common way of seeing and thinking. A technology or farm machinery company may use alphanumeric designations to identify thousands of products. An entertainment firm may use cryptic acronyms in discussing thousands of events or programs. An agricultural organization may talk about plans and events as they relate to “the harvest.” Even within one company, the language varies between departments. A finance department is likely to have a language that is different from the language used in research or operations departments.

Controlled vocabularies provide the key for translating organizational language between departments, between new and experienced employees, and between internal and external stakeholders. Controlled vocabularies also provide the basis for consistent analysis, visualization, and reporting, as well as effective search, retrieval, and distribution. Maintaining effective, business-specific controlled vocabularies provides a competitive advantage. They can also provide operational advantages by supporting the translation of business concepts into rapidly evolving IT technical concepts.
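As a tiny, purely hypothetical sketch of what such a translation key does (the labels and the preferred term below are invented, and a real controlled vocabulary would also carry relationships and cross-references), consider:

# Variant labels used by different departments map to one preferred term.
CONTROLLED_VOCABULARY = {
    "harvest 2014": "Fall Harvest Campaign",
    "fh-14": "Fall Harvest Campaign",
    "autumn push": "Fall Harvest Campaign",
}

def normalize(label: str) -> str:
    """Translate a department-specific label to the preferred term,
    falling back to the original label if it is not yet in the vocabulary."""
    return CONTROLLED_VOCABULARY.get(label.strip().lower(), label)

print(normalize("FH-14"))        # Fall Harvest Campaign
print(normalize("Autumn push"))  # Fall Harvest Campaign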

Creating and maintaining controlled vocabularies, including relationships and cross-references, has been a best practice in library science, information science, and records management for a very long time. Over time, effective principles, practices, and standards have been developed for them, but currently marketed tools do not always use them.

At the beginning, it is important to identify the main business subject areas that might benefit from a controlled vocabulary. They may be specific to one or more industries, to a discipline, or to a technique. Existing vocabularies and standards can then be identified as building blocks or goals for future cooperative efforts.  Industry and subject vocabularies can usually be found through associations, through research, or through vocabulary lists such as Access Innovations’ TaxoBank. General standards for creating and maintaining vocabularies can be found through ANSI and ISO and apply more generally than technology-specific standards.

Once pertinent subjects, vocabularies and standards are identified, basic policies should be established regarding their use and upkeep. Like all languages, organizational languages change and evolve with use.   Because cultural and business environments are rapidly evolving, vocabulary policies need to support rapid innovation and creativity.

Define the general types of needed metadata

The metadata needed to identify and track business critical information and data is specific to an organization and is a corporate asset. Defining and maintaining it in a consistent, reliable, useful form is essential, even when a tool provides “OOB (Out Of the Box)” implementations or automated discovery. At a minimum, tools must be configured, and business vocabularies, codes, users, and processes retrofitted to the tool and then maintained. Ongoing tool success and return on investment require significant effort and investment in the definition of policies, standards, vocabularies, processes, and procedures. Usually this involves changes in work, roles, and responsibilities, for which planning and ongoing management are essential and need to be added to tool costs.

As with vocabulary creation and maintenance, many effective principles, practices, and standards have been developed over time for metadata definition and maintenance, but tools do not always use them.

At the beginning, it is important to identify the main areas of business concern and vulnerability, such as regulatory compliance, product liability, cross-departmental standardization and communication, fulfillment of marketing strategies, or a variety of business specific customer and product-related issues. Each of these areas requires specific tracking techniques and processes that dictate specific metadata. ANSI, ISO, and technology specific standards, such as those designed for the Internet, may be applicable to a business.  Determining which standards are applicable will require research.

Governance Level Understanding of Information and Data Needs 

Developing a governance-level understanding of information and data needs, consisting of the four steps outlined in this and a previous blog posting in this series, can be handled as a set of time-bounded projects. This high-level understanding will be invaluable in providing a business-oriented basis for prioritizing and managing additional work, scoping and justifying the creation of an information and data governance program, and evaluating and efficiently implementing cost-effective new technologies.

Watch future blog postings for more on this subject.

Judith Gerber (guest blogger), JGG Enterprises

Sponsored by Access Innovations, the world leader in taxonomies, metadata, and semantic enrichment to make your content findable.

Marjorie M.K. Hlava Selected to Receive Prestigious Miles Conrad Award from the National Federation of Advanced Information Services (NFAIS™)

February 3, 2014  
Posted in Access Insights, Featured

Marjorie (Margie) M.K. Hlava, president of Albuquerque-based Access Innovations, Inc., has been selected to receive the prestigious Miles Conrad Award from the National Federation of Advanced Information Services (NFAIS). The award will be presented at the upcoming 2014 NFAIS Annual Conference in Philadelphia, PA, February 23-25, 2014. In keeping with longstanding tradition, Marjorie will present a 45-minute lecture with her perspective on the information industry during the NFAIS Annual Conference.

“The objective of the Miles Conrad Memorial Lecture, established in 1965 in commemoration of NFAIS founder G. Miles Conrad, is to recognize and honor those members of the information community who have made significant contributions to the field of information science and to NFAIS itself,” said Marcie Granahan, Executive Director of NFAIS. “The lecture is presented every year at the organization’s annual conference by an outstanding person on a suitable topic in the field of abstracting and indexing, but above the level of any individual service.”

“Margie Hlava is a well-known and well-respected information industry pioneer,” said NFAIS President Suzanne BeDell. “She has worked behind the scenes for most of the major information organizations, including many NFAIS member organizations. Margie believes that you learn as much as you receive by being active in professional organizations, and she has been intimately involved in the standards process for much of her career. She served for seven years on the NISO board and was personally involved in the development of many NISO standards. She chaired the Special Libraries Association’s (SLA) Standards Committee for nine years, has chaired the NFAIS Standards Committee since 2001, and is currently a member of the NISO Content and Collection Management Topic Committee. Margie is one of my predecessors, having been NFAIS President from 2003 to 2004, as well as President of other organizations such as the American Society for Information Science and Technology and Documentation Abstracts. She has served on the Board of SLA twice and currently serves on several boards, including those for the ASIS&T Bulletin of which she is Chair, Information Systems and Use, Places and Spaces, University of North Carolina SILS, and the SLA Taxonomy Division, of which she is the founding chair. Margie also is a volunteer outside the information industry, serving on the boards of the New Mexico Information Commons, the Hubbell House Alliance, New Mexico Data Stream, and the Hubbell Family Historical Society. The NFAIS Board is delighted to confer our organization’s highest honor upon her.”

“Previous recipients are people I have long admired and looked up to as luminaries in our field,” said Marjorie Hlava when she was notified of the award. “I am truly honored to be among them.”

Margie was educated as a botanist and trained by NASA as an information engineer, a position she worked in for five years. She was a beta tester on the NASA Recon, Dialog, and other early online host systems such as BRS and SDC. She was also the Information Director for the Department of Energy National Energy Information Center and its affiliate NEICA. She rose to the position of Information Director before taking her team private as Access Innovations, Inc. in 1978.

Margie’s abiding research interests center on speeding the human processes in knowledge management through productivity enhancements. She has developed the Data Harmony software suite specifically to increase accuracy and consistency while streamlining the clerical aspects in editorial and indexing tasks. The most recent innovation is applying those systems to medical records for medical claims compliance in a new division, Access Integrity.

Margie’s work has been acknowledged through numerous awards throughout her career, including ASIS&T’s Watson Davis award, and recognition both as an SLA Fellow and as a Woman of Influence for Technology. She is the author of two books and over 200 articles. She holds two U.S. patents encompassing 21 patent claims. She has no intention of resting on her laurels, but plans to continue her adventures in information science and explore the boundaries of new technology and methodologies. A complete list of prior Miles Conrad Award winners can be found on the NFAIS website.

 

About Access Innovations, Inc. – www.accessinn.com, www.dataharmony.com, www.taxodiary.com Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs.  Data Harmony is used by publishers, governments, and corporate clients throughout the world.

About NFAIS – www.nfais.org Founded in 1958, NFAIS is a membership organization of more than 55 of the world’s leading producers of databases and related information services, information technology, and library services in the sciences, engineering, social sciences, business, and the arts and humanities.  For more information on NFAIS and its member organizations, on NFAIS Annual Conferences and meetings or the Miles Conrad Memorial Lecture series, contact Jill O’Neill, Director of Communication and Planning (jilloneill@nfais.org or 215-893-1561), or visit the NFAIS website.
