Let’s look a little at the history of the MLs – the markup languages for computer text processing.

Markup languages started appearing in the 1960s. At that time, if you were a publisher, you would have your pages typeset by a professional typesetter or a typesetting company. The typesetters encoded your data in their own proprietary way, so you couldn’t see the results until they generated the pages. The problem was that once a publisher had a large corpus of publications set with a typesetter, they were handcuffed. They couldn’t leave very easily. Their data was so tied up with one particular typesetting system – the Penta system or whatever they happened to use – that they couldn’t migrate. They couldn’t move. So, they were kind of imprisoned. The price kept going up every year because, well, “Where are you going to go? Hah!” It was very frustrating.

The publishers came up with what they called a standard way of marking up the pages so that no matter which typesetter you went to, they could interpret the data and come out with the same kind of pages. So, the page would look the same if you had a journal and you decided to shop around the typesetting – which cost about $40 per page. You could move it from place to place to place. Not only that, but you had the potential of taking information that had been locked up in a photocomposition system and moving it to this newfangled way of distributing data called online. You could take data from the typesetter and put it up so that people could search it and look at it from distributed computers over acoustic coupler modems or something similar. This was a really big deal, a really big breakthrough.

The Standard Generalized Markup Language – SGML – was developed through the ANSI/NISO standards process and approved by ISO in 1986 (as ISO 8879). SGML was eagerly embraced by the publishing community. Unfortunately, it was incredibly complex. The typography systems had been very complex, and unscrambling photocomposition systems was very complex. I actually took the main typesetter from a typesetting company and got him a bit smashed. I even remember what bar it was. If I told you the city, you might guess which company it was – the offices were under a bridge in a large metropolitan area. As we talked, the typesetter wrote out little code sequences. From those I was able to unscramble their coding system and generate an SGML markup for the publisher. It was that kind of cloak-and-dagger stuff. It was really a horrible time for the publishing industry.

In the early 1990s, Tim Berners-Lee introduced the HyperText Markup Language, or HTML. In combination with new Internet browsing technology (remember WAIS?), HTML made it possible to do really simple formatting: bold, underscore, big headings – H1, little headings – H5 – and mark the data up. It was a very simple, simple, simple implementation of SGML. In fact, it was way too simple, as far as publishing was concerned. Publishers couldn’t quite use just HTML. They needed to have some idea of what an element was and what it contained; more descriptive information. That was there in abundance in SGML, but not there at all in the format-only HTML, particularly in HTML 1. That’s what gave rise to XML. XML is a simplified, extensible subset of SGML that works on the web like HTML, but allows you to get both format and content markup, so you know more about the data.

The reason this is important is that markup has now moved forward so that we have structure, we have content, and we can put in added-value information. The added-value information includes the subject term indexing of the taxonomy terms. We have the capability to do the formatting as well. So we have structure, content, and added value – usually carried as attributes on an element, which is where you would put your taxonomy information. It gives you an extraordinary system.
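As a rough illustration of that structure/content/added-value split (the element and attribute names here are hypothetical, not from any particular DTD or schema), here is a small sketch in Python showing how a taxonomy term carried in an attribute sits alongside the structural element and its textual content:

```python
import xml.etree.ElementTree as ET

# A hypothetical journal-article fragment: the element names carry the
# structure, the element text carries the content, and the "subject"
# attribute carries the added-value taxonomy terms.
record = """
<article>
  <title>Markup Languages in Publishing</title>
  <abstract subject="SGML; XML; taxonomies">
    A short history of markup for text processing.
  </abstract>
</article>
"""

root = ET.fromstring(record)
abstract = root.find("abstract")

print(abstract.get("subject"))   # the taxonomy terms (added value)
print(abstract.text.strip())     # the content itself
```

Because the taxonomy terms live in the markup rather than in a separate typesetting code book, any system that can parse the XML – a formatter, a search engine, an indexing tool – can read them.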

Now we have XHTML, we have Chemical Markup Language (CML), we have MathML, and on it goes. If you know the basics, then doing extensions to learn the other languages is not so hard.

The main markup languages are published and supported standards. It usually takes a long time to do a standard. You have to run it through the whole consensus process: you have to get people to have an idea, then you have to get people to convene a committee, then you have to get the project approved. Then the standards organization has something to develop. Then you have to write it, then it has to bounce around a bunch of people for approval, then you have to vote on it. If there is a “no” vote, you have to resolve all of the no votes, or the standard is dead. It takes years; it could take five years or longer.

The people with the World Wide Web Consortium (W3C) said (in effect), “This is stupid. We are not going to be able to make any progress in the world if we have to wait around for the ISO/NISO/ANSI standards process. So we are going to do something more like the Internet world’s Request for Comments approach. We are going to convene a bunch of people; they are going to do it online. They are going to talk about it. We will make our notes open (most of the time). Then, we will be able to say, okay, we think it is done. Comments? Any of you early implementers can get started.”

That is what they did with XML. In 1998, XML 1.0 came out as a W3C Recommendation, and early implementers did go ahead and start using it. In using it, they could prove whether it worked or not. That approach works really well.

So, we had all of these standards that were approved by the ISO/NISO process, and then, suddenly, you might have noticed that a lot of them have moved over to the W3C. The W3C is moving them a lot faster, and they really don’t care if a version works perfectly, because that was 1.0; we can go to 2.0 now. That’s a really important advance. Let’s call that ___ 2. Oh, you want related terms in taxonomy? Okay, let’s call that ___ 3. You know, they just evolve as needed. So, the process responds in Internet time instead of in the ponderous consensus voting time.

Which reminds me – the cascading style sheets that were supposed to happen for SGML back in the 1980s were finally approved a couple of years ago. So, now they are approved for use with XML instead. They are a great step forward, but they could have been approved 15 years ago. The big problem with SGML, besides its complexity, was that there wasn’t a browser that could support it. Now we have a variety of browsers that can support HTML and XML.

Marjorie M.K. Hlava
President, Access Innovations

Note: The above posting is one of a series based on a presentation, The Theory of Knowledge, given at the Data Harmony Users Group meeting in February of 2011. The presentation covered the theory of knowledge as it relates to search and taxonomies.

Data Harmony is an award-winning semantic suite that leverages explainable AI.