Data Analysis in a Standards-Challenged World

We deal pretty heavily around here in words, what they mean, and how they’re used. It should go without saying, but it’s a fundamental part of what we do and is what makes us so concerned with standards, in both taxonomies and the written word. The two go hand in hand; it’s a whole lot easier to have one when the other is compliant.

Academic publishing, which we deal with the most, and most publishing in general, is pretty good about standards, and so we’re able to easily go in and build a taxonomy or mine the content for data analysis. There has been plenty of talk about how useful that enriched content can be in regard to linked open data, direct consumer advertising, and all that. It’s all well and good, but in places where there aren’t standards, it’s a whole lot more difficult to deal with, at least in a semantic sense.

Only a short time ago, nearly all disseminated content would go through some kind of editing process to make sure that simple things like spelling, grammar, and syntax were correct, but also to be sure that it complied with appropriate standards. Only once that process was complete would the public see anything, at least for the most part. This, of course, made for perfectly readable and understandable content, assuming you were familiar with the language and the jargon.

Then the Internet happened. Now, poorly constructed text is the norm and standards have gone out the window. Much of the Internet is about speed of delivery, and content is given a cursory edit, at best. Compliance with standards and delivery speed rarely make good bedfellows.

I’m not here to argue the rightness of accuracy over speed; the change has happened and there’s no going back. Blogging and especially social media have blown up the old ways. This is just how people communicate today and, to me, this is content that is begging for analysis.

But how does one even begin? The one who worries about grammar or syntax in a tweet is a rare beast indeed, and that doesn’t even take into account how things are spelled. Multiple Z’s instead of a single S, fifteen O’s when writing “love,” numbers in place of letters, all sorts of ridiculous things.

Add to that the multitude of languages that people use online, along with widespread disregard for traditional spelling and grammar, and the number of variations explodes. However, because more people from more cultures are communicating with one another, it seems even more important to find a way to control and structure all this data, for the same reason that we structure vocabularies for scholarly publishing: quick and easy search.
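To make the spelling problem concrete, here is a minimal sketch of how stretched letters and numbers-for-letters might be reined in before any analysis. It is only an illustration: the substitution table, the tiny vocabulary, and the function name normalize_token are all assumptions made up for this example, and a real system would need far larger resources for each language.

```python
import re

# Hypothetical digit-for-letter substitutions; a real table would be larger.
DIGIT_TO_LETTER = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"}
)

# Hypothetical stand-in for a real controlled vocabulary.
VOCABULARY = {"love", "cool", "winning"}


def normalize_token(token, vocabulary=VOCABULARY):
    """Roughly normalize one informally spelled token."""
    token = token.lower()
    # Translate digits only when letters are also present, so "2015" survives.
    if any(c.isalpha() for c in token) and any(c.isdigit() for c in token):
        token = token.translate(DIGIT_TO_LETTER)
    # Runs of three or more identical characters are collapsed two ways:
    # down to a double letter ("cooool" -> "cool") and down to a single
    # letter ("loooove" -> "love").
    doubled = re.sub(r"(.)\1{2,}", r"\1\1", token)
    single = re.sub(r"(.)\1{2,}", r"\1", token)
    for candidate in (doubled, single):
        if candidate in vocabulary:
            return candidate
    # Nothing matched the vocabulary: fall back to the shortest form.
    return single


print(normalize_token("LOOOOOOVE"))
print(normalize_token("cooool"))
```

The vocabulary lookup is the important design choice here. Collapsing repeated letters alone cannot tell "cool" from "col", so even this small sketch depends on some agreed list of terms to check against.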

In theory, that’s what tags on blogs and hashtags on social media are for, but when anybody can come up with their own tags, it’s plain chaos. This anarchy is something that has no place in an information realm that requires at least some degree of standardization. Tags might never be completely standardized, but a system of organizing them into broad concepts may be a solution. There will always be things like #janetiswinning, #imeatingdinner or whatever, so noise is going to be inevitable, but some kind of broader classification could help people find what they’re looking for, given how much new content is produced every hour of every day from every corner of the world.
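As a rough sketch of what organizing tags into broad concepts could look like, the example below files free-form hashtags under a handful of top-level concepts and sets the rest aside as noise. The concept names, the tags listed under them, and the function group_by_concept are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical broad concepts and the free-form tags filed under each.
CONCEPT_TAGS = {
    "Food": {"dinner", "foodie", "imeatingdinner"},
    "Politics": {"election", "vote"},
}

# Invert the mapping so each tag points at its broad concept.
TAG_TO_CONCEPT = {
    tag: concept for concept, tags in CONCEPT_TAGS.items() for tag in tags
}


def group_by_concept(hashtags):
    """Group raw hashtags under broad concepts; unknown tags are kept as noise."""
    groups = defaultdict(list)
    for raw in hashtags:
        # Compare without the leading '#' and without regard to case.
        tag = raw.lstrip("#").lower()
        groups[TAG_TO_CONCEPT.get(tag, "Unclassified")].append(raw)
    return dict(groups)


print(group_by_concept(["#Vote", "#imeatingdinner", "#janetiswinning"]))
```

Unknown tags are kept in an "Unclassified" bucket rather than thrown away, since the noise may still be worth a look later.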

That noise will always exist, but we can’t dismiss it all as trash. Communicating through social media has become too important a part of our lives to pretend that there’s no value in at least some of the millions of tweets, Facebook posts, Instagrams, Vines, and all the rest. If there’s no value from a scholarly standpoint, there still is from anthropological and political ones, and the power that marketing gains using this kind of data analysis is abundantly clear.

This is conceptually pretty simple when we’re talking about data in the form of text, whether that’s a post or a hashtag. They’re all words, after all, even if they’re spelled tragically wrong. What about things that aren’t words, but still convey concepts? Instagram and Vine are currently two of the fastest-growing social media sites and, though they use hashtags, they deliver content visually.

And then there’s the whole new issue of emojis. That might seem like a small thing at first, but they aren’t necessarily used at random, and some are used very specifically. An additional wrinkle with these is that they communicate meaning across languages. It seems to have huge potential for analysis, but is effective analysis even possible given the amount of noise?

I think that the answer is almost certainly yes. This kind of data is too valuable not to mine, especially when the technologies for doing so are already being developed for other purposes. For text, there are developments in sentiment analysis that have already been implemented to analyze social media for political campaigns, and its uses are only going to evolve. Less has been done on this level for imagery, but if a computer algorithm can be built that can accurately identify Jackson Pollock paintings and if a self-driving car can determine spatial proximity and identify objects in real time, certainly the potential exists for use in social media. Almost nothing has been done with analyzing emoji use, though there is Emojitracker, which absolutely fascinates me (and which I will write about at greater length in a later post).

We used to communicate almost exclusively by the written word, but now the technology exists to communicate meaning in a large number of ways. Shouldn’t we analyze and study that meaning? I don’t have answers to these questions, but as we explore these new realms, it seems like time to start thinking about semantics in a slightly broader way. Standards are important and I’m all for them. But I’m also all for people communicating with each other. The least we can do, as people who work in semantics, is to try to find ways to see meaning in the content of social media, even if it seems like a great bog of nonsense some of the time.

Daryl Loomis
Access Innovations
