On a quest to reach the holy grail of the Semantic Web. 

September 20, 2010 – What started as a straightforward and elegant model by Eric Miller, the Resource Description Framework (RDF), has become incredibly complicated. When I first came upon this model, in the early days of the Dublin Core standard discussions, it was a framework. That is, it was a self-describing way to transmit an XML file. The RDF formed a wrapper around the XML by including its schema (DTD), so that whoever received the file would know what the elements, attributes, allowed ranges, and so on were, and could use the included data without further hunting for file descriptions, translations of the fields (elements), or how they related to one another. RDF has since grown up and become embroiled in discussions of triples and subject-predicate-object statements, and in its use as the basis for linked data and even as the basis for the final, real implementation of the Semantic Web.
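
To make the triple idea concrete, here is a minimal sketch, written with Python's rdflib library (my choice of tooling, not anything mandated by RDF itself), of a Dublin Core description expressed as subject-predicate-object statements about a hypothetical article; every URI and value below is made up for illustration.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

# A hypothetical article, identified by a URI; it is the subject of every triple.
article = URIRef("http://example.org/articles/12345")

g = Graph()
# Each statement is one subject-predicate-object triple, with Dublin Core
# elements (title, creator, date, subject) serving as the predicates.
g.add((article, DC.title, Literal("A Hypothetical Article on Linked Data")))
g.add((article, DC.creator, Literal("Jane Q. Researcher")))
g.add((article, DC.date, Literal("2010-09-20")))
g.add((article, DC.subject, Literal("Semantic Web")))

# The same triples can be serialized as RDF/XML (the "wrapper around the XML"
# role RDF played in its early days) or in other formats.
print(g.serialize(format="xml"))
```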

At least that was one of the key messages I came away with from the Linked Data meeting in London last week. There were the zealots who thought that linked data could ONLY be implemented via RDF, and that RDF in turn meant the Semantic Web. They also felt that the Dublin Core was key to this implementation. Those assumptions seem a cognitive leap to me. Other speakers took a skeptical view. It was all very refreshing. Perhaps a little more background is in order.

Dublin Core (DC) was developed to address the need to move from the exhaustive coverage of the MARC and AACR cataloging standards to something more streamlined, and a set of 15 elements was created by Stuart Weibel of OCLC to do that. It gathered great momentum in the library community, whose members felt shackled by the arcane and complicated rules of cataloging and classification and were ready to make better use of the approaches taken by the aggregators of principally journal information. These were the secondary publishers like Chemical Abstracts, PsychLit, ERIC, and DOE OSTI. In those days, many of the secondary publishers were hosted by Dialog, BRS (Ovid), SDC, and others.

While the librarians talked among themselves at the American Library Association (ALA) meetings, the secondaries talked among themselves in NFAIS and ASIDIC. The academic and public librarians didn’t talk much with the main users of the secondaries’ online databases, the corporations and associations, which were well represented within the Special Libraries Association (SLA). Sometimes one camp invited someone from the other as a guest speaker. Dublin Core moved to the National Information Standards Organization (NISO) and immediately became a hot potato. The secondaries said, “We already have a basic field set of about 14 items; they work well, and they are very clear.” The DC advocates said, “We have hundreds of fields in MARC and the detailed rules of AACR, and we need a way to group them into general categories flexible enough to apply to many applications.” Because the elements were broad categories, like Creator instead of the more specific Author, they could be considered “metadata.” The term also ties back to the HTML META NAME fields, whose listed options served as something of a prototype.

Because DC was so far advanced and so widely supported, NISO decided to bypass its usual development process and use a brand-new “Fast Track” system. It was needed now. There was momentum in the industry. It was a potential best seller for NISO, they reasoned. The president of NISO was from OCLC and felt the need keenly. Seven NISO members voted NO. Usually a single NO vote would mean the standard did not pass, but under Fast Track a majority was enough. The ballot was accepted, and Dublin Core moved forward to NISO standard status as Z39.85.

Meanwhile, publishers were talking about finding a way to link their data by cross-referencing the cited references in articles, so that people could easily follow the trail of the research, article to article. Each item would need a way to identify it, a handle. Several publishers joined together to establish the International DOI Foundation. A DOI would be an assigned number, registered along with a URL, that could direct the user to the article. The DOI syntax easily passed as a NISO standard, number Z39.84.
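
As an illustration of how that resolution works in practice, the sketch below follows the public DOI proxy's HTTP redirect to whatever URL the publisher registered; the DOI shown is hypothetical, and the requests library is simply my choice of HTTP client.

```python
import requests  # third-party HTTP client library


def resolve_doi(doi: str) -> str:
    """Follow the DOI proxy's redirect and return the URL it resolves to."""
    # The proxy at dx.doi.org looks the DOI up in the Handle System and
    # answers with a redirect to the URL registered for that DOI.
    response = requests.get(f"https://dx.doi.org/{doi}",
                            allow_redirects=True, timeout=10)
    return response.url


# Hypothetical DOI, used only for illustration.
print(resolve_doi("10.1234/example.2010.001"))
```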

The Handle System was developed, along with resolver technology that paired citations with URLs so the researcher could reach the article. What was still needed was a way to confirm that a resolved link really was the “right” article, despite the variations in how a citation might be written. The group latched onto the Dublin Core as a potential set of fields to use for this matching.

The secondary publishers, some of whom were also primary publishers, became alarmed. Several of the DC metadata fields were value-added data. These were the crown jewels of the secondary publishers and the basis of their business models. One of them was Subject (you knew I would get around to taxonomies eventually). What if this new DOI system did away with the need for secondary publishing altogether? The members of NFAIS set up a task force to come up with a solution. The DOI Task Force was made up of representatives from across the information industry: the American Institute of Physics (Tim Inglosby), Chemical Abstracts Service (Michael Dennis), the Institute for Scientific Information (Helen Atkins), Harcourt (Ed Pentz), Palinet (Christine Martiere), and Access Innovations’ NICEM (Marjorie Hlava).

The debates were fascinating and hard-hitting. We came up with a Contributed Metadata Set: a list of fields from the Dublin Core, with definite parameters, that a publisher would contribute along with the DOI and a URL so that the Handle System’s resolver could find the article. The list was large enough to allow the matching, yet short enough not to endanger the secondary publishers’ business model.
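
The exact field list we settled on is not reproduced here; the sketch below is only illustrative of the kind of minimal record involved, with made-up Dublin-Core-style fields deposited alongside the DOI and URL, and a toy matching check of my own rather than the resolver's actual algorithm.

```python
# Illustrative (not the actual) contributed record: a handful of
# Dublin-Core-style fields deposited along with the DOI and the URL.
deposited = {
    "title": "A Hypothetical Article on Linked Data",
    "creator": "Researcher, Jane Q.",
    "date": "2010",
    "identifier": "10.1234/example.2010.001",  # the DOI
    "source": "Journal of Illustrative Examples, 12(3), 45-67",
    "url": "http://publisher.example.org/jie/12/3/45",
}


def matches(citation: dict, record: dict) -> bool:
    """Toy sketch of citation-to-record matching on a few shared fields."""
    same_year = citation.get("date", "")[:4] == record.get("date", "")[:4]
    same_surname = (citation.get("creator", "").split(",")[0].lower()
                    == record.get("creator", "").split(",")[0].lower())
    same_title = citation.get("title", "").lower() == record.get("title", "").lower()
    return same_year and same_surname and same_title


citation = {"title": "A Hypothetical Article on Linked Data",
            "creator": "Researcher, J. Q.",
            "date": "2010"}
print(matches(citation, deposited))  # True: year, surname, and title all agree
```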

Helen Atkins became our spokesperson and presented the suggested Contributed Metadata Set to the DOI Foundation Board. They accepted it. CrossRef was born shortly thereafter, with Ed Pentz as its Executive Director. The list still holds. CrossRef is a very successful reference-linking system.

Okay, enough background. Where does that bring us now? The basics are in place and much has been established. I think it is important to know the background before making forward progress. One speaker mentioned that he was not burdened by a relational database background, and so could work more creatively without that legacy. Perhaps, but then we make those cognitive leaps that miss the steps in between. I agree that RDF is a flexible and useful format with much room for continued growth and application. Dublin Core may be extended to 60 or more fields, but the current basic set is still enough to resolve information. Linking of data, though, can happen in many ways. It does not require either DC or RDF to make the link. They are helpful, but not required.
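
To make that claim concrete, here is a trivial sketch in which two plain records, with entirely made-up data, are linked on a shared identifier (a DOI) without any RDF or Dublin Core in sight.

```python
# Two independent records describing the same article; all data are made up.
citing_reference = {"doi": "10.1234/example.2010.001",
                    "cited_in": "10.5678/another.2010.042"}
abstract_record = {"doi": "10.1234/example.2010.001",
                   "subject": ["Semantic Web", "metadata"]}

# An index keyed on the shared identifier is itself a link: no triples,
# no RDF serialization, and no Dublin Core elements are required.
by_doi = {abstract_record["doi"]: abstract_record}
linked = by_doi.get(citing_reference["doi"])
print(linked["subject"])  # ['Semantic Web', 'metadata']
```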

Another small set of fields may be enough to distinguish the right link or object. Take authors, for example, which I believe to be the next great implementation frontier: can we resolve, as the same person, an author working at several locations, on several different grants for different agencies, while teaching at two institutions and belonging to still other communities? An author who stays put and uses only a single email or corporate affiliation is easy. The hot shots who move around and advance the field from many perspectives are not as easy.
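
To sketch what that resolution might look like, the toy scoring below (my own illustration with invented records, not an established algorithm) weighs the evidence that two author records describe the same person; the settled author scores high, while the mobile one matches on surname alone.

```python
def same_author_score(a: dict, b: dict) -> float:
    """Toy evidence score that two author records describe the same person."""
    score = 0.0
    if a["surname"].lower() == b["surname"].lower():
        score += 0.4
    if a.get("email") and a.get("email") == b.get("email"):
        score += 0.4  # a shared email address is strong evidence
    if set(a.get("affiliations", [])) & set(b.get("affiliations", [])):
        score += 0.2  # any overlapping institution helps
    return score


stay_put = {"surname": "Doe", "email": "jdoe@example.edu",
            "affiliations": ["Example University"]}
hot_shot = {"surname": "Doe", "email": "jane.doe@lab.example.org",
            "affiliations": ["Example National Lab", "Other Institute"]}

print(same_author_score(stay_put, stay_put))  # 1.0: the easy, single-source case
print(same_author_score(stay_put, hot_shot))  # 0.4: surname alone is not enough
```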

I like the idea of the Semantic Web. I look forward to it becoming a reality. But to say that the use of linked data requires RDF is not true. To say that if we use RDF we have arrived at the Semantic Web is also not true. They may all work together to reach that point. But other options and technologies should not be disregarded in our quest to reach the holy grail of the Semantic Web. The simple, clear, and elegant solution may lie in algorithms still being thought of, or even ones already being applied elsewhere. Let's not restrict ourselves to groupthink and be bypassed in the final solution.

Margie Hlava
President, Access Innovations