A Look at Markup Languages

The most frequently used markup languages these days are those of the SGML family: SGML, HTML, and XML. Information professionals are likely to encounter all three fairly often.

Let’s look at SGML first. It has these basic parts:

If you look at an SGML file, you’ll see the declaration, an instruction that associates the document with a document type definition (DTD). There are several standard DTDs in SGML, such as EAD and TEI. As the Wikipedia article on “Document Type Definition” explains, “A DTD uses a terse formal syntax that declares precisely which elements and references may appear where in the document of the particular type, and what the elements’ contents and attributes are.”

Then there is the instance, which is the document itself, and which could be any page that has been marked up. This is what it looks like:

I think you all have probably seen those tags.

We also have attributes, which provide information about elements (such as the identifier for an image). Attributes are nested inside an element’s tag.
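As a rough illustration of how elements and attributes fit together, here is a minimal Python sketch. It uses well-formed XML-style markup, because Python’s standard library parses XML rather than SGML, and the element and attribute names (article, figure, image, id, src) are invented for the example rather than taken from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up instance. The element and attribute names are
# invented for illustration and do not come from a real DTD.
SAMPLE = """
<article id="a1">
  <title>A Look at Markup Languages</title>
  <figure>
    <image id="fig1" src="diagram.png"/>
  </figure>
</article>
"""


def list_attributes(xml_text):
    """Return (element name, attribute dict) pairs for elements that carry attributes."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root and all nested elements in document order.
    return [(el.tag, dict(el.attrib)) for el in root.iter() if el.attrib]


for tag, attrs in list_attributes(SAMPLE):
    print(tag, attrs)
```

The image element carries its identifier as an attribute inside its own tag, while the title element carries its information as content between an opening and a closing tag.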

HTML is a specific kind of SGML. It is limited, and it mostly covers format and display, but there is one little meta name element that is important in taxonomy usage. If you look at the source of a web page, what you see is this:

In the red circle is the declaration, the doctype. Here it is the W3C DTD for HTML 4.0 Frameset, in English. If your browser doesn’t already have this DTD, the declaration tells your browser which URL to get it from, and then it can read the page.

In the green circle, you can see elements that characterize the page or document: title; meta name keywords; meta name description; meta name author; meta name copyright. This is built into the HTML standard. If you want your pages on the Web to be crawled efficiently by Google and other systems, you will fill in this information.

A lot of people got into overloading the keyword field, and so Google changed the way it determines rankings. However, if you are going to have a lot of corporate intranet pages, this is one of the best ways to support search on your pages, because there is a field to put your taxonomy terms in. It’s called meta name keywords. That’s really where the term metadata came from. These big encompassing fields are included here. If you go to ‘View Source’ on a lot of pages, you can find this.

Web “spiders” are programs that are set to capture and bring back the information in HTML headers.
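To make the idea of capturing header information concrete, here is a minimal Python sketch of the kind of extraction a spider might perform. It is an illustration under stated assumptions, not how any particular search engine works: the sample page, its field values, and the names HeaderCapture, capture_header, and keyword_terms are all hypothetical. Only the html.parser module from Python’s standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical page for illustration; the field values are invented.
PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/REC-html40/frameset.dtd">
<html>
<head>
<title>Sample Page</title>
<meta name="keywords" content="taxonomy, thesaurus, indexing">
<meta name="description" content="A hypothetical page used for illustration.">
<meta name="author" content="Example Author">
</head>
<body><p>Body text is ignored here.</p></body>
</html>"""


class HeaderCapture(HTMLParser):
    """Collects the doctype, the title, and the meta name fields of a page."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_decl(self, decl):
        # Called with the text of the <!DOCTYPE ...> declaration.
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        fields = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and fields.get("name") and fields.get("content") is not None:
            self.meta[fields["name"].lower()] = fields["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def capture_header(html_text):
    """Parse a page and return the populated HeaderCapture object."""
    parser = HeaderCapture()
    parser.feed(html_text)
    parser.close()
    return parser


def keyword_terms(header):
    """Split the meta name keywords field into individual terms."""
    raw = header.meta.get("keywords", "")
    return [term.strip() for term in raw.split(",") if term.strip()]


header = capture_header(PAGE)
print(header.doctype)
print(header.title)
print(keyword_terms(header))
```

On an intranet, the list returned by keyword_terms is where taxonomy terms placed in the keywords field would surface for a search system to index.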

Most TaxoDiary readers know what tagging looks like, so I am not going to spend a lot of time on that. The HTML examples below should suffice.

Another “meta language” is eXtensible Markup Language, or XML. With XML, you aren’t really required to have a DTD. Instead of a big formal document type declaration, what people use nowadays is a schema, which just outlines the fields. So a schema is what is required if you are going to have an XML-friendly system. It is simpler, but if you have data in SGML, or data that is governed by a DTD, and it is going to be loaded into a pure XML system, you have to convert the DTD to a schema. That is not a hard thing to do, but it is one of those little pieces that people sometimes overlook.
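To show the difference in form, the sketch below describes the same hypothetical record twice: once as DTD declarations and once as an XML Schema (XSD) document. The record, title, author, and keywords fields are invented for illustration. Because a schema is itself written in XML, an ordinary XML parser can read it and list the fields it outlines, which is not true of the DTD syntax.

```python
import xml.etree.ElementTree as ET

# Namespace used by XML Schema (XSD) documents.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical record described as DTD declarations (not XML syntax).
DTD_TEXT = """
<!ELEMENT record (title, author, keywords)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT keywords (#PCDATA)>
"""

# The same hypothetical record described as an XML Schema.
XSD_TEXT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="keywords" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def schema_fields(xsd_text):
    """Return (name, type) pairs for every element declared in the schema."""
    root = ET.fromstring(xsd_text)
    return [(el.get("name"), el.get("type")) for el in root.iter(XS + "element")]


for name, type_name in schema_fields(XSD_TEXT):
    print(name, type_name)
```

The record element has no type attribute of its own because its structure is given by the nested sequence, so its type is reported as None.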

People are using “DTD” and “schema” interchangeably these days, but they are stated differently. One is a formal declaration (the DTD), and the other is just an outline of the fields (the XML schema). XML is kind of the default language now, and because we now have style sheets, it uses the style sheets that started their life with SGML. XML doesn’t have all the features of SGML, but then apparently it doesn’t need them. You can find a lot on XML from ASIS&T – OASIS Open. That is a really good site for information about XML.

Below is a rough diagram showing how the various markup languages relate to various systems.

There is a set of XML schemas from the National Library of Medicine called JATS – the Journal Article Tag Suite – which is being broadly embraced by the publishing community. Using it, or a variation of it, is nice because it is fairly standard, and a lot of people are looking at the variety of options it offers.

We’ve mentioned the ability of the markup languages to identify metadata. In the next installment in this series, I’ll discuss metadata.

Marjorie M.K. Hlava President Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.
