Facets and Search

February 18, 2013  
Posted in Access Insights, Featured, indexing

Most of you who have studied library science or information science are familiar with faceted classification as developed by Indian mathematician and librarian S. R. Ranganathan (1892-1972). His main contribution was the colon classification system, the first faceted classification. It allowed multiple classifications to be assigned to an object, rather than a single pre-coordinated taxonomic designation.

Dr. Ranganathan is still pretty popular in the search world, especially in e-commerce. The Endeca faceted search module is popular with online marketers, because they are looking at lots of different kinds of ways to filter data.

In the case of Endeca, the biggest uses are in retail. So, if people are offering online ordering, their system is generally going to be an Endeca search system. That’s because you have one object: a shirt that comes in five colors and four sizes, and maybe has some other attributes. You want to be able to search on any one of those classifications and get the same shirt.

So, I want it in a women’s size whatever, and he wants it in a men’s size whatever. He wants it in blue and I want it in green. You can make those orders with the same general properties. The shirt classification has a lot of sub-facets to it. So the system searches all of those different facets, which we know as size and shape and color, and gets the shirt ordered. It isn’t a single taxonomic list; it is a lot of sub-lists that identify that object.

In the taxonomy, you could have built each of those out as a separate branch. More likely, you would build them all as little taxonomies that are separate, because they are basically little pick lists or authority lists. Each one of them is internally consistent. If I want women’s clothing on L.L. Bean, I am going to click on Women’s Clothing, and then the website is going to tell me that it has pants, shirts, and other things, and then I can choose from those. They are facets. I am taking a hierarchical path through them, much as in Ranganathan’s scheme. This is called faceted search. I can click through and get ever more detail.

In Lucene, which is an open source search system, I can do the same thing but a bit differently, because here the facet count is giving me the individual item. I have a lot of different facets, so I can search by manufacturer, picking from a drop-down list of manufacturers. Each of these narrows the search. I can go by price, or I can go by resolution or by zoom range, and all of those things are just narrowing down my search.
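To make the narrowing concrete, here is a minimal sketch in Python. It is not Endeca or Lucene code; the catalogue, the facet names, and the helper functions `facet_filter` and `facet_counts` are all made up for illustration. It shows the two things a faceted interface does: filter on any combination of facets, and count what is left under each facet value.

```python
from collections import Counter

# Hypothetical catalogue: each item carries several independent facets.
CATALOG = [
    {"id": 1, "item": "shirt", "department": "women", "color": "green", "size": "M"},
    {"id": 2, "item": "shirt", "department": "men", "color": "blue", "size": "L"},
    {"id": 3, "item": "shirt", "department": "men", "color": "green", "size": "L"},
    {"id": 4, "item": "pants", "department": "women", "color": "blue", "size": "M"},
]

def facet_filter(items, **selected):
    """Keep items whose facets match every selected value (an implicit AND)."""
    return [it for it in items
            if all(it.get(facet) == value for facet, value in selected.items())]

def facet_counts(items, facet):
    """Count how many of the remaining items fall under each value of a facet."""
    return Counter(it[facet] for it in items)

shirts = facet_filter(CATALOG, item="shirt")
print(facet_counts(shirts, "color"))                         # counts per color among shirts
print(facet_filter(shirts, department="men", color="blue"))  # narrow on two more facets
```

Each click in a faceted interface corresponds to one more keyword argument to `facet_filter`, and the numbers shown next to each choice correspond to `facet_counts` on whatever is left.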

Once again, I think we can see the influence of Boolean logic in search.

Marjorie M.K. Hlava, President, Access Innovations

 

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.

Inverted Files, Parsing, Discovery, and Clustering

February 11, 2013  
Posted in Access Insights, Featured, search

Inverted files and Boolean are basic to most or all search.

So, what we have is the inverted file – that big alphabetical list of all the terms and where they came from – and a way to combine them. At the end of the day, no matter which direction you come from on the high end and even higher presentation levels, you are depending on these two to make it work. If you add that taxonomy to that, you are strengthening the use of your terms.

Here’s an example of some text.

Pretend that this box shows you a page and focus on any one term, like thesaurus.

If we were to build an inverted file of this data, this is what we would be doing; we would make an alphabetic list.

If we wanted to make it much more complex, we would add stop words and we would add the conditions.

We could say that this term appears in these places. That would give me and my computer a way to combine the terms, because I know from their positions where they are. At the end of the day, this is what a very simple inverted index display looks like; it is basic to search.
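The original slides are not reproduced here, so the following is a small Python sketch of the same idea under stated assumptions: the two sample pages, the stop word list, and the function name `build_inverted_index` are invented for illustration. It builds an alphabetical list of terms, skips stop words, and records where each term appears.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # illustrative list only

def build_inverted_index(docs):
    """Map each term to {doc_id: [word positions]}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            if word in STOP_WORDS:
                continue  # stop words keep their position slot but are not indexed
            index[word].setdefault(doc_id, []).append(position)
    return dict(sorted(index.items()))  # alphabetical, like a printed inverted file

docs = {
    "page1": "A thesaurus is a controlled vocabulary",
    "page2": "The taxonomy and the thesaurus support search",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Because the positions are kept, the same structure can support combining terms by where they occur, for example requiring that two words sit next to each other.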

We also use a lot of other stuff, like stemming and truncation and wild cards and other ways to do search. In addition, we accommodate all those misspellings, variant spellings, and so forth, so that we can make search even better.

Right truncation is very common, and so are wild cards and some stemming algorithms. Left hand truncation is hard, because you have to build an entire inverted file for every letter of everything to the left side. It makes an extremely large index very quickly. Wild cards are okay term substitutions, but left truncation is very hard and very expensive to implement.  Before you ask for it, look at what it is going to do to your system.
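A rough Python sketch shows why the two directions cost so differently. Right truncation can walk the alphabetical term list that already exists. For left truncation, one common approach (an assumption here, not necessarily what any given product does) is to keep a second sorted list of every term spelled backwards, which is extra index to build and store. The term list and function names are made up for illustration.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Right truncation: all terms starting with prefix, found by binary search."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

terms = sorted(["index", "indexer", "indexing", "reindex", "taxonomy", "thesaurus"])
reversed_terms = sorted(t[::-1] for t in terms)  # the extra index left truncation needs

def suffix_matches(sorted_reversed_terms, suffix):
    """Left truncation, done as right truncation over the reversed terms."""
    return sorted(t[::-1] for t in prefix_matches(sorted_reversed_terms, suffix[::-1]))

print(prefix_matches(terms, "index"))           # right truncation: index*
print(suffix_matches(reversed_terms, "index"))  # left truncation: *index
```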

It is said that discovery is a very popular action, and it has received a great deal of the research monies and a great deal of attention. However, only about 2% of searchers’ time is actually spent in discovery. 98% of their time is spent doing whatever search is needed to update themselves. So, if they are mostly updating themselves, that calls for one kind of system, and if they are in discovery mode, they might want a different kind of system.

One of the big questions in search is, what kind of search are you going to do? Are you doing discovery – looking for new things and new ways of combining stuff? Or are you trying to do an exhaustive everything search? Do you need relevance, recall, and precision, or are you in discovery mode? It makes a big difference as to what kinds of search stuff you need to design, depending on what the users are primarily going to be doing.

Vivisimo, for instance, is a discovery system. It searches on the fly, does automatic clustering, and it doesn’t return the same clusters each time. Each time, the clusters are new. It is not the same search; it’s not the same results. It really angers some researchers to get different stuff the next time. What they really want is additive results. They want to see what they got before, and anything that is new. That’s a different search presentation.

I won’t go into detail about all the types of clustering. Suffice it to say that you can do it in many different ways.

You can group everything together into hierarchical clusters, or you can partition stuff. Even in a cluster you can add a thesaurus, although it won’t really like it; but you can.

This is a view by Vivisimo of its own clustering module. The automatic clustering is shown on the side; this is what will change each time you search.

Remember, I showed you FAST earlier. This is yet another kind of model of how search is done. We have a core engine, and then everything else is connected by application programming interfaces (APIs). These are modules. You don’t get all of these when you buy it; you get just some of them. They are different faces of the data. So, if you wanted to search a bunch of different databases, then you would buy the federation module. If you wanted to go to a bunch of different query servers, then you’d get the module for that. If you want to look at it in different languages, particularly in different character sets, then you would get this module. If you want to use the results, you would need this one. If you want to connect it to other search engines, you need this one. If you want to display those clusters on the fly, then this is the cluster module. Collaboration is where people can view and tag those documents and show them to other people. That’s where you keep what you have, because next time you search, it will be different.

Marjorie M.K. Hlava, President, Access Innovations

 


Kinds of Search

February 4, 2013  
Posted in Access Insights, Featured, search

There are search systems that are advertised as keyword search. There are ones that say that they are Bayesian. There are ones that are Boolean. There are ones that are primarily ranking algorithms.

The theories behind the different kinds of search are based on the work and theories of several famous guys, two of whom are dead, two of whom are still alive.

The first of these theorists was George Boole, a mathematician and logician who lived from 1815 to 1864.

Yes, this is the man behind Boolean algebra, an algebraic system of logic.

Boolean algebra is the basis of Boolean search, in which Web pages are handled as elements of Boolean sets.
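In code, the set view is quite literal. The sketch below uses Python sets with made-up postings (which page numbers contain which term) to show AND, OR, and NOT as set intersection, union, and difference.

```python
# Hypothetical postings: term -> set of page ids containing it.
postings = {
    "taxonomy":  {1, 2, 4},
    "thesaurus": {2, 3},
    "search":    {1, 2, 3, 5},
}

and_result = postings["taxonomy"] & postings["search"]     # taxonomy AND search
or_result = postings["taxonomy"] | postings["thesaurus"]   # taxonomy OR thesaurus
not_result = postings["search"] - postings["thesaurus"]    # search NOT thesaurus

print(sorted(and_result), sorted(or_result), sorted(not_result))
```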

The next theorist of interest is Thomas Bayes, a mathematician who lived from 1702 to 1761.

Bayes’ Theorem, which uses probability inductively, established a mathematical basis for probability inference. The theorem provides a way of calculating, from the frequency with which an event has occurred in prior trials, the probability that it will occur in future trials.

We’ve seen already that search technology often involves probability; some of this technology is based on Bayes’ Theorem. While Bayesian methods have value, they need to be used with caution.
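As a worked illustration, Bayes’ Theorem says P(A|B) = P(B|A) × P(A) / P(B). The Python sketch below applies it to a search-flavored question: how likely is a document to be relevant, given that it contains a certain term? The function name and all three input probabilities are invented for the example, not measured from any system.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(relevant | term) by Bayes' Theorem.

    prior               = P(relevant)
    likelihood          = P(term | relevant)
    false_positive_rate = P(term | not relevant)
    """
    # P(term), summed over the relevant and not-relevant cases.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Assumed numbers: 10% of documents are relevant, 80% of relevant documents
# contain the term, and 5% of non-relevant documents contain it too.
p = posterior(prior=0.10, likelihood=0.80, false_positive_rate=0.05)
print(round(p, 3))
```

The caution in the text applies here as well: the answer is only as good as the probabilities fed in, which in a statistical search system come from training data.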

Our next person of interest is Peter D. Turney, a Canadian expert in computational linguistics. One of his main contributions to search technology is Turney’s algorithm, which is useful for sentiment analysis, the intent behind the words.

Finally, there is Marco Dorigo, an Italian researcher working in Belgium and the initial developer of the ant colony optimization algorithms. Search that uses this approach sends out artificial ants (software agents) and mimics the search behavior of real ants.

We’ll also take a look at a search approach with a more corporate background: the ranking algorithms of Google.

The terms that you input, the queries that you type, are called the term inputs. Those term inputs are weighted based on an extremely large number of factors within Google. If you were to get the Google Search Appliance to run your in-house searches, it would be using exactly the same algorithms. Google will update those algorithms for you when it upgrades the algorithms on the main search box, assuming you subscribe to the system. That means that the depth of your in-house search depends on the same ranking algorithms that you get on Google.

You might like that kind of search if it works well for you. If you don’t like that search and you feel like you get too many false drops, then you are going to get the same situation in-house. One of the ways you can mitigate that is with the data that you load to Google. You can make it partially well-formed by adding taxonomy terms to it, and then your search improves. Some people have just thrown up their hands: “Oh no! They are giving me a Google Search Appliance. I don’t know why I ever bothered to do a taxonomy in the first place.” However, if you add those terms at the time that you are loading the data to Google, you can search the taxonomy and improve your search results.

Another approach used in search, and compatible with the statistical approaches, is natural language processing (NLP). NLP uses natural human language input, or output, or both, in combination with computer processing.

There are certain basic tenets of natural language processing. All of them are used, to some extent and to varying degrees, in search software. Some of them use it as a base layer and never let anyone touch it. Some use it as an interface for people. Often they are black box applications because the algorithms, particularly the grammatical ones, are hard to follow and you don’t want people messing with them because they break them. These have not been as popular lately as they were some years ago.

What has become more popular is automatic language processing (ALP).

This is where you could apply a lot of different tenets. I used to go to computational linguistics meetings and artificial intelligence meetings and automatic translation meetings and ASIS&T meetings, and suddenly a lot of people that I saw in those artificial intelligence meetings were showing up at the ASIS&T meetings. There is a lot of overlap in these fields, but what is happening with ALP is that they are trying to automate most of the pieces. It’s essentially NLP with a few more computational heuristics, and it lends itself well to search.

Another kind of search is statistical search.

It is primarily Bayesian. It has different levels of usage of the Bayesian algorithms. You might hear about the ‘smart factors’ from Cornell, or ‘cluster analysis’ from other sources; you might hear about neural nets and neural networks from concept analysts; you might hear about co-occurrence engines from Autonomy; you might hear about Bayesian inference engines from other people. They are all doing a variation on a theme. They are all doing some kind of heavy statistical analysis of the data and trying (also depending on some of the ALP or NLP techniques) to derive a search result for you.

What’s basic to all of these kinds of searches are inverted files and Boolean logic. We’ll cover inverted files (among other topics) in the next installment.

Marjorie M.K. Hlava, President, Access Innovations

 


Measuring Accuracy in Search

January 28, 2013  
Posted in Access Insights, Featured, search

When people talk about how accurate the search is, there are lots of different ways to measure that. This list indicates some of the ways that we talk about measuring accuracy.

Relevance is one of those ways to think of accuracy. It is the computer system’s guess, based on what it knows about you, of how likely it is that the set it presents in answer actually meets your query. It is guessing that this is the right set for you.

Recall compares how many of the records in the database match your query with how many of those you actually got. If you are that smoking gun detective or that doctoral student or that patent researcher, you want 100% recall.

Precision is the percentage or ratio of the units returned that actually meet your query. So, if you see that first screen of ten, if eight of them are really something you are interested in, then you have 80% precision.

The information technology societies used to focus on precision and recall. That is what we talked about back when. Now we don’t talk about that. Now we talk about usability studies and other things.  If you look at the literature from the early 1990s back to the mid-1980s, you’ll find many articles about precision and recall. You don’t find much of that anymore.

Accuracy can also be measured with hits, misses, and noise, which is another way to think about it. Hits are what the human searcher wants and the computer shows, that is, satisfactory search findings; misses are what the searcher should have been shown but the computer didn’t show; and noise consists of results that the computer showed that the searcher didn’t want, also called false drops.
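Hits, misses, and noise line up neatly with precision and recall, as the small Python sketch below tries to show. The record numbers are made up, and the function name `evaluate` is just for illustration: ten records come back, eight of them wanted, out of sixteen wanted records in the whole database.

```python
def evaluate(returned, relevant):
    """Split a result set into hits, misses, and noise, then derive the ratios."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant    # wanted and shown
    misses = relevant - returned  # wanted but not shown
    noise = returned - relevant   # shown but not wanted (false drops)
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {"hits": hits, "misses": misses, "noise": noise,
            "precision": precision, "recall": recall}

returned = range(1, 11)                              # the first screen of ten
relevant = set(range(1, 9)) | set(range(11, 19))     # sixteen records the searcher wants
report = evaluate(returned, relevant)
print(len(report["hits"]), len(report["misses"]), len(report["noise"]))
print(report["precision"], report["recall"])
```

A screen can look good on precision while recall is poor, which is exactly the situation the patent researcher cannot afford.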

Ranking is another way of showing you the results – putting the ones that are thought to be the best first. This means that the search system is guessing which results are most likely to be accurate, and which are more likely to be accurate than others. There are lots and lots of ways to do that.

Linguistic analysis involves profiling you – the user – and coming up with the right answers.

There are some other ways that people measure accuracy, and one of them is how fast the query came back to you. What is your processing speed? Is it in nanoseconds or is it in 30 seconds? How fast is the search response to the user? Users are really spoiled and impatient, which is why you have all of that cache building going on behind the scenes. The users will think they are getting it all at once, but actually they are only getting what they are reading on the screen. The rest is queuing up while they read.

That’s part of results processing. If I’ve run a query, how long does it take to build that cache to give me a full answer? How fast is it going to display my results? If you have an SQL system, for example, is it building that response on the fly? Can you refine the search? Can you do a search within the search? Can you do a re-purposed search that will give you more information?

Other kinds of business rules are used to measure the accuracy of search and how well it meets your needs.

Relevance is a measure of how well the documents answer your needs. It is a very subjective measure. It is different for different user communities. It really depends on the information resources and the tension between the user needs and the context available.  So, it might be the case that there is nothing relevant to your search in the entire corpus.

In the old days, when we talked about precision and recall, everyone said that relevance was the confidence rating. Then Google came along and everyone said “What about the relevance?” Some people would say that relevance is a canard, that it doesn’t mean anything. However, Google kind of changed that perception so that everyone is now looking for the relevance. Some people would say that relevance is a result of precision and recall. The ones that had high precision/high recall, those are the ones that are really relevant to you. Those algorithms have also gone out the window, and relevance is now measured in many different ways, but not with an algorithm for precision and recall.

There are formulas.  I’m going to let you study these.

These are the traditional definitions of recall, precision, and relevance. These are the ones held by the ASIS&T community. There is a great deal of information written about these formulas, practically all of it prior to 1990.
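The slide with the formulas is not reproduced in this posting. For reference, the standard textbook forms of two of them, which agree with the definitions given earlier in this post, can be written as follows. This is a reconstruction, not a copy of the original slide, and no formula for relevance is attempted here.

```latex
\[
\text{precision} = \frac{\text{relevant records retrieved}}{\text{all records retrieved}},
\qquad
\text{recall} = \frac{\text{relevant records retrieved}}{\text{all relevant records in the database}}
\]
```

With eight interesting records on a screen of ten, precision is 8/10 = 80%, matching the example above.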

In measuring relevance, there are a lot of different algorithms that people use to come up with the percentage. A lot of discussion has been held on relevance but, in the end, it is our confidence in how well we think a particular answer to the query meets your needs.

In a future installment, we’ll look at different kinds of search.

Marjorie M.K. Hlava, President, Access Innovations

 


How Search Works

January 21, 2013  
Posted in Access Insights, Featured, search

Search has many parts.

The parts of search are moving parts, and every system does it a little differently. There’s the search software, based on one of two major camps. Then, there’s the computer network that it’s riding on. Then, there’s the way that the text is parsed, which is not always the same.

There’s also a differentiation between whether the search is working on well-formed or structured text. With XML, they talk about well-formed data, and that essentially means that the data is tagged well. We will also know if the data is structured data or field-formatted data, depending on how far back your vocabulary goes.

There’s the computer software – related, of course – in the network. There’s the hardware. Not everything runs on every piece of hardware, because the operating systems are different. There’s the telecommunications system and how you access that thing.

Then, if you have chosen to go a Bayesian route, there are training sets for the statistical system.

There are a lot of different pieces.

In the search software itself, in the search technology, there are also several pieces. One is the ranking algorithms.

If you look at systems like Autonomy or Google, there is the query language. This is what someone is going to use to get a result. The query language is usually of two levels. There is what the user types into a box, and then that is translated into a command line that is sent to the search software itself.

There are federators, which provide a way to gather information from several repositories and bring it back as a single result – also called ‘federated search’.
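A federator can be sketched in a few lines of Python. Everything here is an assumption for illustration: the two stand-in repositories, their scores, and the idea that scores from different repositories are directly comparable (in practice they often are not, and a real federator has to normalize them).

```python
def federated_search(query, repositories):
    """Send one query to every repository and merge the hits into one ranked list."""
    merged = {}
    for name, search in repositories.items():
        for doc_id, score in search(query):
            # Keep the best score when two repositories return the same document.
            if doc_id not in merged or score > merged[doc_id][0]:
                merged[doc_id] = (score, name)
    return sorted(((doc, s, src) for doc, (s, src) in merged.items()),
                  key=lambda row: row[1], reverse=True)

# Two stand-in repositories; real ones would be separate query servers.
def search_library(query):
    return [("doc-a", 0.9), ("doc-b", 0.4)]

def search_intranet(query):
    return [("doc-b", 0.7), ("doc-c", 0.5)]

results = federated_search("taxonomy",
                           {"library": search_library, "intranet": search_intranet})
for doc_id, score, source in results:
    print(doc_id, score, source)
```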

Then there are the caching algorithms. Caching is what happens behind the scenes. You run a query, and it starts to display the results. It displays the first ten because it is not serving all the one million up into your little bit of memory partition that’s been allocated to you. It is caching them. As you go through those numbers, if you go to the last number in the list, it takes a while to get that information to me. It is queuing the information up in case I want it. It is only going to go a little bit in advance of my query, because it doesn’t want to chew up more memory than it needs to. So, cache memory and the caching of it are aspects of a search system. It’s important to understand how they work.
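The behavior described above, showing the first ten and queuing up only a little more, can be sketched in Python as below. The class name `ResultCache`, the page size, and the one-page look-ahead are illustrative assumptions, not the design of any particular search product.

```python
from itertools import islice

class ResultCache:
    """Serve results page by page, fetching only a little ahead of the reader."""

    def __init__(self, result_iter, page_size=10, pages_ahead=1):
        self._source = iter(result_iter)
        self._cached = []
        self.page_size = page_size
        self.pages_ahead = pages_ahead

    def page(self, number):
        """Return page `number` (0-based), filling the cache just far enough."""
        needed = (number + 1 + self.pages_ahead) * self.page_size
        shortfall = needed - len(self._cached)
        if shortfall > 0:
            # Pull only the missing results from the source, never the whole set.
            self._cached.extend(islice(self._source, shortfall))
        start = number * self.page_size
        return self._cached[start:start + self.page_size]

hits = (f"doc-{i}" for i in range(1_000_000))  # stand-in for a huge result set
cache = ResultCache(hits)
first = cache.page(0)
print(first[0], first[-1], len(cache._cached))
```

Jumping straight to a late page forces the cache to fill all the way up to it, which is why the last page of a long list takes noticeably longer to appear.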

The inverted index is something that is basic to all search systems. It is the alphabetic or alphanumeric listing of every word, and connections with those words for everything that is in the database. So, it is an important thing for practically all searches.

The final thing in search is the presentation layer. That is what you see. On top of all that technical stuff, there is the user interface that you see. That’s the presentation layer of search.

We can see a presentation layer in this diagram of a search system.

You notice that this presentation layer is going two places. So, here I am the user and I am going to the federator, and I am also going to the repository cache at the same time. My federator is taking me out to all the query servers, and I might have data in a lot of different places that I am pulling the information out of. At the same time, I am going to the repository cache, where the information might have come from the source data with some cleanup algorithms to make sure that it presents properly going into the cache. Then, that information is going into the cache builders, then going into both the deploy hub and the index builders, which is where the inverted file is built. All of this information, then, through the query servers and through the federators, is coming back and giving me my answers. There is a whole lot going on behind the scenes.

These things are built in different ways. Below is one example.

This is FAST, which was bought by Microsoft. Here you have all of the information coming in through crawlers and through file traversers and API connectors to other data systems. Maybe it is indexing all of the emails and some other stuff coming from Oracle, Documentum, FileNet, or some other big systems. It is going through a content application programming interface (API). This API is translating all of this information that came in from someplace else, is building a cache, and is loading it into what FAST calls a document processor and what the other guys call a deploy hub.

When the data lands here, it is a really good time to apply the taxonomy terms to the data. This is where we are adding taxonomy terms. Then that information is going out to build the inverted file, the indexed database. It is also going to build another database to support sending out alerts and RSS feeds and the like.

This information – the index – is searchable through a query made using the presentation layer. That query is where the user is coming in from. Whether the users are coming in on a mobile device or regular laptop, they are all coming in through that query processor. If you also apply the taxonomy terms at the query end in the search presentation layer, you get a better search.

These are two places in the search system where you might want to be able to apply the taxonomy terms, if they were not already attached to the record. At a point like that, you want to enrich the data so that you can do the search.
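Here is a minimal Python sketch of those two places, assuming a tiny made-up taxonomy of preferred terms and synonyms. The names `TAXONOMY`, `taxonomy_terms`, `enrich_record`, and `expand_query` are hypothetical; this is not Data Harmony or FAST code.

```python
import re

# Hypothetical taxonomy: preferred term -> synonyms seen in text or queries.
TAXONOMY = {
    "myocardial infarction": {"heart attack"},
    "neoplasms": {"cancer", "tumor", "tumour"},
}

def taxonomy_terms(text):
    """Return preferred terms whose label or any synonym appears as whole words."""
    lowered = text.lower()
    found = []
    for term, synonyms in TAXONOMY.items():
        labels = {term} | synonyms
        if any(re.search(rf"\b{re.escape(label)}\b", lowered) for label in labels):
            found.append(term)
    return sorted(found)

def enrich_record(record):
    """Index time: attach taxonomy terms to the record before it is indexed."""
    return {**record, "subjects": taxonomy_terms(record["text"])}

def expand_query(query):
    """Query time: add preferred terms so the query matches enriched records."""
    return [query] + taxonomy_terms(query)

record = enrich_record({"id": 7, "text": "Patient admitted after a heart attack."})
print(record["subjects"])
print(expand_query("tumor treatment"))
```

Applying the same vocabulary at both ends means a searcher who types a synonym still lands on records that were tagged with the preferred term.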

In the next installment, we’ll examine measuring and achieving accuracy in search.

Marjorie M.K. Hlava, President, Access Innovations

 


Search Engines and Related Tools

January 14, 2013  
Posted in Access Insights, Featured, metadata, search

We’ll discuss search engines, technology related to them (including Web crawlers and search software), and getting those things implemented to put taxonomies to good use.

A search engine is different from search software. People use the terms interchangeably, but they are not the same. Search software is an application, a bunch of code. A search engine is a collection of servers holding a lot of data; it indexes the Internet and delivers results to customers as HTML pages. It does that by spidering around to get the information, usually from the meta name headers and now increasingly from the full text, then storing that information and searching it. You all have some idea of how Google, Lycos, Bing, and all those other ones work, so you know what happens here. Again, a search engine is different from search software.

A spider, particularly a Web spider, is something that crawls across the pages and extracts the information that it needs. It then deposits that information in the search engine application so that it can be searched. What it really stores is the location of the information, the URL, and the keywords and the text of the page itself so that it can be searched.

Search software is looking at a discrete collection, which could be huge. It is the software itself. It is not the search engine that is crawling across the Web. Search software is likely to be applied to just an internal collection or a single website.

Search engines and metadata are connected in that we are crawling across looking for the information and pulling it out of that HTML header, which is the format page for the search software. The metadata is only one part of that header.  It pulls that information, it pulls the URL (Uniform Resource Locator) and then it might also pull the page and cache it someplace so that it can display quickly.
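As an illustration of what a spider pulls from the header, the Python sketch below uses the standard library `html.parser` module to collect the meta name fields and the title from a page. The sample page and the class name `MetaExtractor` are made up, and a real crawler does a great deal more than this.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = """<html><head><title>Facets and Search</title>
<meta name="keywords" content="taxonomy, faceted search, indexing">
<meta name="description" content="How facets narrow a search.">
</head><body>...</body></html>"""

parser = MetaExtractor()
parser.feed(page)
keywords = [k.strip() for k in parser.meta["keywords"].split(",")]
print(parser.title, keywords)
```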

Creating metadata for your website(s) is fairly straightforward. The metadata can help crawlers find appropriate information. Without metadata, structure, and categorization, web crawlers have nothing to work on. We need metadata to sort it all out and to do so in an organized way.

Search software is where somebody in an organization is going to be applying their information. If you don’t have the metadata and the structured information and the keywords applied to your content, the spiders can’t work well. So, your information is not as discoverable as it could be. We need that metadata so it can sort out the information on the Web, in an organized way.

The metadata tags are the first place on a Web page that the crawler looks at. They mine there first. If they don’t find that information they will go deeper down on the page, and the page will be lower-ranked because the information was not as easy to find.

The inverse of that is that if it finds the same keyword used 17 times in the meta name keywords field, the crawler may recognize that tactic and rank the page lower because of it. So, if I say I want to have the word ‘taxonomies’ and I want to own it and I want it ranked really high, I might be tempted to put that word five times in my meta name keywords field, but that would be an error on my part. I should instead repeat the word ‘taxonomies’ in the full text of my page. That would rank me higher.

But, yes, that is where the spiders go first. It depends on the spiders and how they are crawling, but practically all of the spiders go first to the meta name fields. They later go to the full text page or they skip it. If there is no recent date, if it looks like the page has not been updated in a month, the spider in effect says, “I’m out of here.”

So they depend on the information in that field to know whether to mine the page further. They crawl first to the meta name text and they harvest what they can, and then the algorithm makes the decision as to what they are going to do with the page and how they are going to arrange it. Then they decide if they are going to cache the page or not, and whether they are going to cache just the first page or whether they are going to cache the entire site. If they are going to cache the entire site, the spider has to grab the entire site. People can make it available or list it. If it is a listed site and it is properly done, then it will be crawled on some schedule. They can crawl every two weeks; there are some sites that they crawl every 15 minutes. They crawl some of the new sites very frequently. If they have crawled the site before, they have your page, they have cached it, they show that, and while they are showing that page they go out and get the current page. They update the cache based on users’ keywords.

In the next installment, we’ll look at how search works.

Marjorie M.K. Hlava, President, Access Innovations

 


Why is a consultant not a “vendor?”

January 7, 2013  
Posted in Access Insights, Featured

It’s time to consider “services” to be “products,” at least where meetings and conferences are concerned.

Wikipedia says “IBM treats its business as a service business. Although it still manufactures computers, it sees the physical goods as a small part of the ‘business solutions’ industry. They have found that the price elasticity of demand for ‘business solutions’ is much lower than for hardware. There has been a corresponding shift to a subscription pricing model. Rather than receiving a single payment for a piece of manufactured equipment, many manufacturers are now receiving a steady stream of revenue for ongoing contracts.” The change is significant.

Increasingly, then, we live in a service economy. Even large companies—such as Hewlett-Packard and IBM—make the bulk of their money on services, not on the software and hardware products that they sell.  Regardless, they are referred to as “vendors” because they do, in fact, sell products—even though most of their profits are made selling services.

But the old models are dying hard in the meetings business.

Conference season is here again; the calls for papers are out. Yet in our little industry, you are classed as a “vendor” if you have any products to sell and “other” if you do not, even if you are selling “services.”

So, “services” are not “products,” at least in the taxonomy of the conference/meeting industry.

This false dichotomy is illogical and unfair.

Those that make their living selling their “services” (consulting, for example) to conference audiences get a free pass until they have a “product,” at which point they become a “vendor.” “Consultants” can hawk their paid expertise (“services”) from the podium, but it costs vendors to even whisper the name of their products. Even someone who has a lot of experience, has spent time deeply thinking about what it takes to make something into a product that actually works, and has been awarded patents for their thinking underscoring that the ideas are unique and innovative, is still classified as a dreaded “vendor” that cannot be trusted to avoid giving a sales pitch from the podium.

But since “services” aren’t “products,” consultants can give all the sales pitches they like.

If a consultant spends half her talk presenting a “case study” of a project she did, that seems to be okay, when in fact it is a huge sales job to the audience. These same consultants are never held responsible for the messes they make (we have cleaned up quite a few of them), but they continue to move around the speaker circuit with impunity. Patrick Lambe mentioned in his closing talk at the 2012 Taxonomy Boot Camp that consultants should be held accountable and liable for bad advice, which often leaves customers with a “once bitten, twice shy” situation. I believe he is right!  The customers have lost a lot of time, money, and possibly position in the marketplace because they tried going the direction a consultant recommended to them. Many of these consultants have either run through a lot of jobs—not keeping any of them long—or have not actually completed and instituted the types of projects they consult on. They don’t know the pitfalls and how to make something that works.  A vendor has the scars of implementation to show from previous projects completed, perhaps even a working product you can purchase.

Meeting organizers see vendors as targets, but those selling consulting and other services are exempt.  If you have a product—software, for example—they will let you “buy” your way on to the program by being a sponsor of the meeting. This means for a certain dollar amount you can have a 20-minute spot, and for a bunch of dollars more you can get a keynote address. Take a look at the sponsorship button or exhibitor button on any meeting or conference website and you will see what I mean.

But since those selling services are, illogically, not considered “vendors” they don’t have to “sponsor” (read: pay a fee) to be considered for giving a talk.

Who would you rather hear from?

1) Someone who has implemented a project like the one you are planning, such as a case study from a potential colleague?

2) A company that has implemented many projects and knows the variation and challenges in getting something to work?

3) A consultant who advises on projects and reads the literature with no accountability?

4) A professor who researches and writes in the area you are trying to implement?

Many conferences and meetings are becoming watered down by competing needs: the organizers need more cash and the attendees need more information. There is a proliferation of such meetings, but I am seeing significantly smaller exhibit halls and lower attendance at many of them.

The meetings that are growing in popularity and attendance, however, are users’ group meetings. These are clearly based around a particular product—people attending are interested in the product and how the other attendees are using it.  Some of these are like rock concerts with big sound and stage productions; they may also include elaborate parties. Others are nuts-and-bolts presentations—how-to-use training sessions along with case studies and plenty of time for users to interact with each other, the potential users, and company staff.

Which meeting would you rather attend? One where the organizers 1) send out a call for papers and review them based on merit of the paper? 2) solicit sponsors and build the program based on who will pay them for a slot on the program? 3) bring together people who use a like product so they can learn about the actual implementations and lessons learned in a collegial environment?

It’s time to reconsider the old model. “Services” should be considered “products.”  We need to do away with the irrational and false distinction between types of “vendors.”

The Data Harmony Users Group meeting will be held February 18–20.  It is full of case studies, training opportunities and lots of time to interact with those of like minds.  Find out more.

RDF: A Wrapper for Metadata

December 31, 2012  
Posted in Access Insights, Featured, metadata

In information management, a wrapper is a program that translates tagged data into relational form, so that databases can deal with it. We’re interested in this because some of that tagged data includes subject terms, keywords, or whatever you want to call taxonomy terms that have been used to describe documents.

One forerunner of present-day wrappers was an attempt by the Internet Engineering Task Force (IETF) to develop standards for World Wide Web resources. They came up with uniform resource characteristics (URCs). IETF defined a URC as “a set of meta-level information about a resource. Some examples of such meta-information are: owner, encoding, access restrictions (perhaps for particular instances), cost.”

The URCs concisely defined how a URL, a URN, a UID, or any other unique identifier could be described so that everybody would do it the same way. The idea was that with a simple batch of metadata you could connect everything. The effort languished for about 10 years.

As the URCs faded, attention shifted to the Resource Description Framework (RDF), a wrapper that you could put around something like the Dublin Core. It has become much more involved and much fancier over the years.

Originally, it was a self-documenting format. RDF was the framework to describe whatever was there. So, if you sent me an RDF file, what you would be doing is sending me the file and sending me the DTD that went with it so that I could interpret the data.  A nice clean wrapper explained what was in there.

Basically, RDF is a way of putting a wrapper around the information so that you can see and interpret what someone has sent to you. It is a nice framework for doing that. However, it has taken on a different life now.

The syntax is pretty straightforward because it follows the XML syntax.

The schemas were originally tied to something like the Dublin Core, though not necessarily to the Dublin Core itself.

So, an RDF does not have to be Dublin Core; it can be something completely separate. In fact, when W3C came out with the Simple Knowledge Organization System (SKOS), which was an export format for taxonomies, they were using that as a formal language to describe the way that taxonomies or vocabularies were exported. Unfortunately, they did not include everything in the first edition that one might need, so you couldn’t have multiple broader terms, you couldn’t have synonyms, you couldn’t have related terms. Very limiting. However, you could have a broader or narrower term. It was for classification systems, not for taxonomies. The new edition of SKOS does include options so that you can export null information.

Here’s an RDF wrapper on a set of information, just to show you what the format looks like:

  <rdf:RDF
    xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
    xmlns:dc="http://purl.org/dc/elements/1.0/"
    xmlns:dcq="http://purl.org/dc/qualifiers/1.0/">
    <rdf:Description about="urn:issn:1361-3197">
      <dc:Title>Ariadne</dc:Title>
      <dc:Subject>
        journal; magazine; elib; electronic libraries; digital libraries;
        networking; Web; IT; higher education
      </dc:Subject>
      <dc:Description>
        A print magazine of Internet issues for librarians
        and information specialists
      </dc:Description>
      <dc:Publisher>
        Information Services, University of Abertay, Dundee
      </dc:Publisher>
      <dc:Type>Text.Serial.Magazine</dc:Type>
      <dc:Relation>
        <rdf:Description>
          <dcq:RelationType
            rdf:resource="http://purl.org/dc/vocabularies/AgentTypes/v1.0/IsBasisFor"/>
          <rdf:value resource="http://www.ariadne.ac.uk/"/>
        </rdf:Description>
      </dc:Relation>
    </rdf:Description>
  </rdf:RDF>

I won’t dwell on how this works. The example above shows the different pieces of the RDF. It is a very standard export for exchanging vocabularies.
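For readers who do want to see the pieces pulled apart, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It reads a trimmed copy of the Ariadne record (with plain ASCII quotation marks so that it parses) and returns the identifier, title, subject terms, and type. The function name read_record and the choice to split subject terms on semicolons are assumptions made for illustration, not part of RDF itself.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Ariadne record from the post, with plain ASCII quotes.
RDF_XML = """\
<rdf:RDF
  xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
  xmlns:dc="http://purl.org/dc/elements/1.0/">
  <rdf:Description about="urn:issn:1361-3197">
    <dc:Title>Ariadne</dc:Title>
    <dc:Subject>
      journal; magazine; elib; electronic libraries; digital libraries;
      networking; Web; IT; higher education
    </dc:Subject>
    <dc:Type>Text.Serial.Magazine</dc:Type>
  </rdf:Description>
</rdf:RDF>
"""

# Prefix-to-namespace map; the URIs are the ones declared in the record.
NS = {
    "rdf": "http://www.w3.org/TR/WD-rdf-syntax#",
    "dc": "http://purl.org/dc/elements/1.0/",
}


def read_record(xml_text):
    """Hypothetical helper: unwrap the first rdf:Description into a dict."""
    root = ET.fromstring(xml_text)
    desc = root.find("rdf:Description", NS)
    subject_text = desc.findtext("dc:Subject", default="", namespaces=NS)
    return {
        "about": desc.get("about"),
        "title": desc.findtext("dc:Title", default="", namespaces=NS).strip(),
        # Assumption: subject terms are separated by semicolons, as in the example.
        "subjects": [s.strip() for s in subject_text.split(";") if s.strip()],
        "type": desc.findtext("dc:Type", default="", namespaces=NS).strip(),
    }


record = read_record(RDF_XML)
print(record["about"], record["title"])
print(record["subjects"])
```

The wrapper does the work here: the namespace declarations tell the reader which vocabulary each element comes from, so the subject terms can be found without guessing at the layout.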

RDF schemas are similar to XML schemas. A schema defines the meaning, characteristics, and relationships of a set of properties. This may include constraints on potential values and the inheritance of properties from other schemas.

With RDF, a resource is anything that can have a uniform resource identifier (URI). This includes all Web pages, as well as individual segments of an XML document.

An RDF PropertyType is a resource that has a name and can be used as a Property. This can be Author or Title.

An RDF Property is the combination of a Resource, a PropertyType, and a value. For instance, “The Author of http://www.accessinn.com/offkey.htm is Marjorie Hlava.” The value, in this case, is “Marjorie Hlava”. Resources have Properties, and Properties have Values.
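The Resource, PropertyType, and value combination can be sketched as a small data structure. This is only an illustration of the three-part shape; the class name Property and its sentence method are assumptions for this example, not an RDF library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """Illustrative only: one Resource / PropertyType / value combination."""

    resource: str       # anything that can have a URI
    property_type: str  # a named kind of property, such as Author or Title
    value: str

    def sentence(self):
        # Spell the property out the way the post does in prose.
        return f"The {self.property_type} of {self.resource} is {self.value}."


author = Property(
    resource="http://www.accessinn.com/offkey.htm",
    property_type="Author",
    value="Marjorie Hlava",
)
print(author.sentence())
```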

Practical uses of RDF include the following:

Recently, RDF has expanded to include triples for linked data. So, a lot of you will hear about RDF and immediately think of triples and linked data. However, in order to support these triples, RDF first needed to be defined as a wrapper. So, now it is a wrapper for object/predicate/subject or subject/predicate/object. You can show the relationship of terms.

RDF is a really cool way to define related term instances. You can have more than one kind of related term, and you can do it in an RDF export. You can say “Linda was president of ASIS&T.” You can infer other relationships and indicate that “Marjorie Hlava is a member of ASIS&T.” We can talk about the different objects and their relationships, and these can get very complex. The basic point is that RDF was first a wrapper, and that wrapper and its syntax have since been extended so that you can express relationships as triples.
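As a rough sketch of the triple idea, the two statements above can be held as subject/predicate/object tuples and queried by pattern. This uses plain Python tuples rather than any RDF toolkit; the function names match and related_through are hypothetical, and the "inference" shown is the simplest possible kind (two people connected through the same object).

```python
# Each triple is (subject, predicate, object), taken from the examples in the post.
TRIPLES = [
    ("Linda", "was president of", "ASIS&T"),
    ("Marjorie Hlava", "is a member of", "ASIS&T"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def related_through(triples, obj):
    """Subjects linked to the same object, whatever the kind of relationship."""
    return sorted({s for s, _, o in triples if o == obj})


print(match(TRIPLES, predicate="is a member of"))
print(related_through(TRIPLES, "ASIS&T"))
```

Because the predicate is stored explicitly, "was president of" and "is a member of" stay distinct kinds of relationship, which is the point about having more than one kind of related term.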

Marjorie M.K. Hlava, President, Access Innovations

 

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.

Dublin Core: Sort of a Metadata Standard

December 24, 2012  
Posted in Access Insights, Featured, metadata

The Dublin Core metadata guidelines started in 1995 in Dublin, Ohio, which is why they are called the Dublin Core. The beginning was actually in the basement of the OCLC building, in March of 1995. In April of 1996, the Dublin Core advocates went to Warwick, England, where they had another conference, which is why there is a Warwick Framework for the Dublin Core. There’s no mystery about the names; these things are just named after the places where they started.

The conferences led to the standard Z39.85 in 2007, which was adopted as an ISO standard in 2009. It was hotly debated, and the Internet Engineering Task Force published it as a Request for Comments (RFC) document as well. It commanded attention.

The Dublin Core standard has 15 major elements, and people sort of agree on what they mean. Then there are 18 elements in the Qualified Core and the schema, and additional type qualifiers for it at this point.

It is a fairly simple framework. At the time it was developed, a lot of people said that it was the same as the Dialog data set. And it was kind of like MARC cataloging meets online databases. That is really what it was. It was an attempt to make it so that all of the complex library records with their 680 MARC fields could be digested down to a fairly simple list. And it still works that way at the heart of it.

So, in the illustrations below, there are the elements from Version 1.1, and the qualifiers are listed as well.
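As a simple sketch, the 15 elements of the Dublin Core Metadata Element Set, Version 1.1, can be written out and used to sort the field names of a record into recognized and unrecognized. The record below is invented for illustration, and check_record is a hypothetical helper. Note what it does not do: a name check like this cannot say whether the values are being used as the element definitions intend.

```python
# The 15 elements of the Dublin Core Metadata Element Set, Version 1.1.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def check_record(record):
    """Hypothetical helper: sort a record's field names by whether they are DC elements."""
    names = {name.lower() for name in record}
    known = set(DC_ELEMENTS)
    return {
        "recognized": sorted(names & known),
        "unrecognized": sorted(names - known),
    }


# Made-up record; "Shelf" is deliberately not a Dublin Core element.
sample = {
    "Title": "Ariadne",
    "Publisher": "Information Services, University of Abertay, Dundee",
    "Type": "Text.Serial.Magazine",
    "Shelf": "B12",
}
result = check_record(sample)
print(result)
```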

The qualifier properties are fairly definite. The way the wording sequences are written was hotly debated. I think you will find, when you try to apply them, that they are not very specific. If you are trying to apply Dublin Core as a standard, as a ‘check off the pieces’ so that you know that you have it exactly right for Dublin Core, Dublin Core won’t do that for you. It is not a yes or no, black or white standard. It guides you to a way that you can look at the information.

There has been a lot of work with it, and a lot has been added in recent years. It raised a lot of questions at the start, and 17 years later the question remains whether it is really a standard or a guide. That boils down to “Is it measurable?”

Can you say if this database record is Dublin Core or if this is not Dublin Core? You can’t. Should the number of elements be expanded? Are there enough? Are there too many? A lot of different questions but the real question was this: Can people use it reliably? Can we measure it with a yardstick? That’s where Dublin Core has suddenly sprung to life again in that they are writing a lot of functional requirements that serve as yardsticks, which they did not have in the past.

Marjorie M.K. Hlava, President, Access Innovations

 

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.

Taxonomies and Metadata

December 17, 2012  
Posted in Access Insights, Featured, metadata, Taxonomy

Let’s talk about metadata a little bit. This is to give you a broad overview so that you know how to build those taxonomies.

Metadata is data about data, which means it is really information about information. If you look at the metadata standards world, and especially the Dublin Core Metadata Initiative, it has suddenly taken flight again. It’s really interesting. Dublin Core dates from the mid-1990s and has suddenly become more relevant again.

Metadata describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. That is the whole idea. And we are looking at lots of kinds of metadata, and it does not really matter what it is. It is just going to be about the stuff you are working with. A simpler way to say it is that metadata is data that characterizes other data in a reflexive way. It may include descriptive information about the content, quality and condition, or other characteristics of the data.

“Metadata” covers a multitude of sins at this point. So, if you remember our basic list from our discussion of markup languages, that is basic metadata.

If we look at keywords, subject headings, index terms, identifiers, taxonomy terms, controlled descriptors, controlled vocabulary, and so forth, we can see that they are a type of metadata. That’s why taxonomists are interested in metadata.

Keywords (aka subject headings, index terms, identifiers, or subject area) are one type of metadata. A bibliographic database record usually includes keyword information, as well as information regarding the author(s), title, language, and date of creation. So does a traditional library card catalog.

A bibliographic citation is metadata, and so is a library card.

An HTML header can include metadata.

Not all web pages have metadata. A lot of websites don’t, because the people making the pages don’t fill in the headers. This has been one of the big problems for corporations in trying to exercise control over their intranets and being able to search their intranets. If you fill in the headers on your web pages, you can fill them in with your taxonomy terms. That’s where they would go.
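To make that concrete, here is a minimal sketch of how an intranet search tool might read taxonomy terms back out of a page header, using Python's standard html.parser module. The page content, the class name MetaCollector, and the function taxonomy_terms are all made up for illustration, and the semicolon separator is an assumption; many sites separate keywords with commas instead.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta.setdefault(name.lower(), []).append(content)


# Made-up page whose header has been filled in with taxonomy terms.
PAGE = """<html><head>
<title>Indexing services</title>
<meta name="keywords" content="Taxonomies; Metadata; Indexing">
<meta name="author" content="Example Author">
</head><body><p>Page text goes here.</p></body></html>"""


def taxonomy_terms(html_text, field="keywords"):
    """Hypothetical helper: return the terms found in the given meta field."""
    parser = MetaCollector()
    parser.feed(html_text)
    terms = []
    for content in parser.meta.get(field, []):
        # Assumption: terms are separated by semicolons.
        terms.extend(t.strip() for t in content.split(";") if t.strip())
    return terms


print(taxonomy_terms(PAGE))
print(taxonomy_terms("<html><head></head><body></body></html>"))
```

The second call shows the problem described above: when nobody fills in the header, there is nothing for the search system to pick up.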

Metadata standardization has been around for a while. Let’s look at some early metadata initiatives.

MARC, the Machine-Readable Cataloging format standard from the 1960s, was a metadata initiative spearheaded by the Library of Congress. AACR2, the Anglo-American Cataloguing Rules, 2nd edition (1978, revised 1988), was the style sheet for MARC records.

There are quite a few metadata initiatives nowadays. These are some of the more prominent ones:

Dublin Core, you may know about. And maybe the indecs Content Model, which gave rise to ONIX.

ONIX, the ONline Information eXchange, is a set of XML standards that publishers use as the metadata for marketing and shipping their books and other publications. ONIX records tell you everything from how many of a certain book will fit into a box to what kind of display items will come with it. So, if it’s a new Harry Potter book display and it has some big cutout deal that goes with it, it is described using ONIX. ONIX records can also describe all of the CIPI codes and other things – serials, book/item identifier, contribution identifier and all of the additional information that Amazon or Ingram or any of those other guys need in order to ship a book to you or to each other. That is a huge standard and it was a huge effort.

The Text Encoding Initiative guidelines are also useful. And the list goes on. What’s important for us to realize is that there are a lot of different metadata initiatives. Dublin Core is one of them. I’ll discuss Dublin Core in another installment in this series.

Marjorie M.K. Hlava, President, Access Innovations

 

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
