The bell struck twelve.
The Phantom slowly, gravely, silently, approached. … It was shrouded in a deep black garment, which concealed its head, its face, its form, and left nothing of it visible save one outstretched hand. But for this it would have been difficult to detach its figure from the night, and separate it from the darkness by which it was surrounded.
(From A Christmas Carol in Prose; Being a Ghost Story of Christmas, by Charles Dickens.)
And so it is with emerging concepts, those concepts whose forms we can but vaguely discern at the present point in time, whose true reality lurks in the future.
As taxonomists, we have a responsibility to discern those future concepts, although they may still be invisible to most. We can save the various expressions of those concepts in search logs from being rejected from consideration for a vocabulary simply on account of their as yet infrequent appearance. In a taxonomy or thesaurus, we can provide labels that will consolidate the indexing for a concept for which researchers have not yet settled on a name. In some cases, especially with widely used vocabularies, we can perhaps determine the name by which a concept will be known on a standard basis.
This role in itself is one of the emerging responsibilities for taxonomists, thanks to the rapid advances in science and technology. In “What Next, Taxonomy?” (posted on The Taxonomy Blog on November 4, 2011), taxonomist Marlene Rockmore concludes that taxonomists need to deal with emerging technologies in a variety of ways, including collection of relevant content:
“So what next, taxonomy? What is nice to hear is that more taxonomists are surviving because their organizations understand their core roles. What’s the emerging topics and challenges – how to distribute and decentralize (localize) while having authority and control, how to collect new content on emerging, current topics, visualization, how to be more agile, how to fit in with new technologies like social media, mobile, and big data. Phew! That’s a challenge. Taxonomists have a chance to build relationships not only between terms, but with stakeholders on the way to a compelling, visualized, multidimensional content strategy. Good luck.”
This challenge has been growing in step with the rapid advances in science and technology. One example among the many advances in science is the ability of biologists to recognize new and emerging species, as well as life forms that have existed for a while but were formerly overlooked. The Live Science page Newfound Species observes:
“Science has identified some 2 million species of plants, animals and microbes on Earth, but scientists estimated there are millions more left to discover, and new species are constantly discovered and described. The most commonly discovered new species are typically insects, a type of animal with a high degree of biodiversity. Newly discovered mammal species are rare, but they do occur, typically in remote places that haven’t been well-studied previously. Some animals are found to be new species only when scientists peer at their genetic code, because they look outwardly similar to another species — these are called cryptic species. Some newfound species come from museum collections that haven’t been previously combed through and, of course, from fossils.”
Even the humble hosta has its own emergings, due in part to technological and social advances in communication.
“In past centuries, we used to talk about people “discovering” new species of plants. What this usually meant was that European, English or American plant explorers traveled to remote parts of the world and found plants that were new to them. Now, of course, we know that local people in those other parts of the world were often quite familiar with these plants all along. Many of the so-called new plants, including hostas, have been found in local paintings and documents produced long before the Westerners started poking around. In more recent times, however, with better communications, we more universally share the knowledge of different horticultural communities.”
As far as actually emerging species are concerned, evolutionary biologist Rob DeSalle of the American Museum of Natural History has indicated the continuing nature of species emergence:
“Identifying a new species as it emerges is the holy grail of evolutionary biology. … Species must be emerging someplace on earth. The best places to look would be places with lots of species, like rain forests, and islands, because isolation opens new niches.” (In “Q & A; Emerging Species” by C. Claiborne Ray, published June 17, 2003 in The New York Times)
The ScienceDaily website has a webpage dedicated to news about “new” species of plants and animals. While most of these will escape public awareness, Time Magazine has sifted through the barrage of information to identify the “Top 10 New Species” of 2014. According to author Bryan Walsh, “The collection includes a dragon tree, a skeleton shrimp, a gecko and a microbe that likes to hang out in the clean rooms where spacecraft are assembled.”
Speaking of top things of 2014, and moving on to emerging technologies, the Massachusetts Institute of Technology’s online Technology Review has published a list of “10 Breakthrough Technologies 2014.” The list includes such things as brain mapping, genome editing, and agile robots.
The Wikipedia article “Emerging technologies” emphasizes the role of technology convergence in the emergence of new technologies. The article mentions an acronym of particular interest to those in the information technology world:
“NBIC, an acronym for Nanotechnology, Biotechnology, Information technology and Cognitive science, is currently the most popular term for emerging and converging technologies, and was introduced into public discourse through the publication of Converging Technologies for Improving Human Performance, a report sponsored in part by the U.S. National Science Foundation.”
Wikipedia also has a “List of emerging technologies” containing brief descriptions of “some of the most prominent ongoing developments, advances, and innovations in various fields of modern technology.” More than two hundred emerging technologies are listed.
There are and will continue to be many new and emerging concepts in science, technology, and other fields. Taxonomies can help define the terminology for those concepts. This is perhaps most readily evident for genus-species-subspecies-etc. names, whose designation is the territory of the biological taxonomist, or the biologist temporarily acting as taxonomist. Elsewhere, taxonomists can identify predominant labels and the occasionally used synonyms, and then use that information to add appropriate preferred terms and non-preferred synonyms to a vocabulary. They can also add definitions and scope notes. The skills of the taxonomist can bring clarity to formerly mysterious concepts and nomenclature.
No fog, no mist; clear, bright, jovial, stirring, cold; cold, piping for the blood to dance to; Golden sunlight; Heavenly sky; sweet fresh air; merry bells. Oh, glorious! Glorious!
So don’t be scared of the ghosts of future concepts. Think of them as true spirits of the future, taking flight with the benefit of well-chosen terms and synonyms in a taxonomy or thesaurus.
Every time a new term rings true, an emerging concept gets its wings.
Originally posted December 30, 2013.
We use taxonomies and ontologies to organize document collections. But controlled vocabularies (of a sort) are also used to organize the aisles in a grocery or hardware store, clothes in the closet, and your kitchen. In the last two cases, the subject matter expert (SME) is definitely you!
If you are more comfortable shopping in a Target instead of a Walmart, it is probably due to the way in which they’ve organized their collection of merchandise. Barnes & Noble and Borders bookstores had significantly different ways to organize their books and other offerings. I loved one, but could not easily find my way around the other.
We frequently organize document collections for associations—organizations of learned and scholarly publishers. Occasionally, they ask us to organize the governance layer of the society as well. What they want us to do of course is take all the committees, special interest groups, divisions, chapters, and communities of interest or practice and add them to the taxonomy for use in navigation on their website. To do so, we look at the content to be indexed that’s relevant to that section of the society to ensure proper tagging.
Our philosophical bent at Access Innovations is to build a term record for every term in the taxonomy (or thesaurus or ontology). That means a small (usually) database of terms; their broader, narrower, and related terms; aliases (synonyms); and perhaps definitions, scope notes, or other links. The terms are used to tag the content, whether they are HTML pages, articles, book chapters, memorabilia, meetings, minutes, etc. We are often asked to provide a “full path” export showing exactly where in the taxonomic hierarchy the term itself resides, and we do. But we know that searchers do not ask for the full path; they ask for the term in that tiny little search box. Thus, we tag at the term level—and each term needs to stand on its own as a potential search term. The meaning of the term should not be inferred from its place in the hierarchy, since the searcher (1) usually has no idea where it resides taxonomically, and (2) doesn’t really care; what they want is the appropriate content.
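A term record of this kind can be sketched in a few lines of Python. This is only an illustration: the field names, the helper functions `full_path` and `find_term`, and the sample terms are assumptions made for the example, not the layout of any particular thesaurus software.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TermRecord:
    """One record per vocabulary term (field names are illustrative)."""
    label: str
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    aliases: list[str] = field(default_factory=list)
    scope_note: str = ""


def full_path(label: str, records: dict[str, TermRecord]) -> list[str]:
    """Follow the first broader term upward to build a root-to-term path."""
    path = [label]
    seen = {label}
    while path[0] in records and records[path[0]].broader:
        parent = records[path[0]].broader[0]
        if parent in seen:  # guard against accidental cycles
            break
        path.insert(0, parent)
        seen.add(parent)
    return path


def find_term(query: str, records: dict[str, TermRecord]) -> Optional[str]:
    """Resolve what a searcher typed to a term label, via label or alias."""
    q = query.strip().lower()
    for record in records.values():
        if q == record.label.lower() or q in (a.lower() for a in record.aliases):
            return record.label
    return None


# Made-up sample vocabulary, for illustration only.
records = {
    "Physics": TermRecord("Physics", narrower=["Nuclear physics"]),
    "Nuclear physics": TermRecord(
        "Nuclear physics", broader=["Physics"], narrower=["Cold fusion"]
    ),
    "Cold fusion": TermRecord(
        "Cold fusion", broader=["Nuclear physics"], aliases=["LENR"]
    ),
}

print(" > ".join(full_path("Cold fusion", records)))
print(find_term("lenr", records))
```

The full path is derived on request, while the lookup works on the term and its aliases alone, which mirrors the point that each term has to stand on its own in the search box.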
Along the way, over the course of organizing content for many associations and societies, we are often able to shed an interesting sidelight of information: we learn the organization well, but only from its content. We are not experts in the field, nor are we active members of the organization. We can read the history of how it started, why it is different from other organizations, what is so special about it that made many people come together to form the society in the first place. However, by building the taxonomy we get a snapshot in time—we see the content and organization as it is today. This interesting perspective has led us to see where the society is and what it has become, not what it was. It gives a fresh perspective on how the organization is really organized, what it actually covers, and, based on recent activity, in which direction the association is going. This provides a solid foundation for future scopes and long range planning for organizations.
Visualization of the data provides the present communities of interest and links to the other communities within the organization. Add the time and date of the publication of each piece of content and it also shows the trending directions for each topical area.
As we build out the governance layer (how the organization fits together) we depend on people who know the organization and the published guides about how it works. If we did not, we might organize it in a very different way based on what they actually do today, which would be an uncomfortable experience for those who know and are active in the association. Just like going into a bookstore which is arranged differently than you think it should be, the arrangement of the taxonomy for an organization needs to reflect how the organization thinks of itself. The other way of looking at it (solely from the content data) often does not reflect how the organization wants to be seen; it could, however, be an excellent strategic planning asset to use the taxonomy for this purpose. Sometimes a taxonomy is used exactly that way for a look at the future. If you are a member of an association, how would you go about building a taxonomy for the organization, and then applying it to the governance layer in order to secure a bright future?
Marjorie Hlava, President
The weather professionals have been puffed up and posturing with graphs, maps and all their Doppler mechanisms for several weeks now trying to predict what the winter of 2015-16 will be like. Will there be blizzards like we’ve never seen? Will there be ice storms that shut down cities, airports and interstates? Or will it be a “quiet” winter with tempered temperatures and balmy afternoons?
One thing is for certain: people like to talk about the weather. But when they do, are they using consistent language and terminology? For instance, what does sleet mean to you, and how is it different from freezing rain?
The American Meteorological Society uses the Classification of Precipitation Types during Transitional Winter Weather Using the RUC Model and Polarimetric Radar Retrievals to apply some consistency and standards to weather terminology.
The classification algorithm they use distinguishes between nine classes of precipitation near the surface. The winter precipitation classes provided by this algorithm include:
- Dry snow
- Wet snow
- Ice pellets/sleet
- Freezing rain
- Freezing rain and ice pellets mix
- Heavy rain
Another unofficial term is graupel: snowflakes that have become encrusted with ice. This happens when snowflakes pass through a chilly cloud on their way down and water droplets freeze on them.
Where are snowflakes on this list?
Snowflakes are made of ice crystals. Impressively, each snowflake is made of as many as 200 ice crystals.
Just like in kindergarten when we all cut out our own snowflakes by folding paper and clipping triangles and circles out of the folds, snow crystals are symmetrical and they can form a hexagonal shape because that is how water molecules organize themselves as they freeze. Others are small and irregularly shaped. If they spin like tops as they fall to the ground, they may be perfectly symmetrical when they hit the Earth. But if they fall sideways, they will end up lopsided.
We have all heard that no two snowflakes are identical. This is because even though most have a hexagonal structure, there are so many ways that water molecules can arrange themselves as the water freezes that the options are endless. No two snowflakes have exactly the same arrangement of molecules, but they can look alike.
What combination of these and other factors is required to create a winter wonderland blizzard?
Three things are needed to make a blizzard:
- Cold air (below freezing) is needed to make snow.
- Moisture is needed to form clouds and precipitation.
- Warm, rising air is needed to form clouds and cause precipitation.
Even with this classification, there are events that can alter the expectations of a season. One of those is El Niño. El Niño is the periodic warming of water in the Pacific Ocean every few years. When it occurs, it means more energy is available for storms to form there.
During an El Niño, winters are warmer and drier than average in the Northwest, northern Midwest, and upper Northeast United States, so those regions experience reduced snowfalls. Meanwhile, winters are significantly wetter in northwest Mexico and the southwest United States, including central and southern California, and both cooler and wetter than average in northeast Mexico and the southeastern United States.
The El Niño of 2015-16 is the strongest El Niño in 50 years, which means locations across North America that do not typically see a white-blanketed Christmas holiday just might this year.
Either way, we hope your days will be merry and bright, even if your Christmas is not white.
Melody Smith, Blog Wrangler and Extreme Foodie
Access Innovations is proud to announce the sale of its publishing arm, the National Information Center for Educational Media (NICEM), to Graham Carter-Dimmock. Graham’s expertise and experience include audio and video production, localization and marketing, database design and electronic publishing. He has been involved in NICEM project development on several occasions over the years.
NICEM is an aggregation of non-print educational media including CDs, videos, audio cassettes, kits and many other types of audio and visual offerings used in the K – Adult educational programs as either primary training materials or supplemental educational materials. NICEM collections are available through NICEM.com, AV MARC (offered through the Library Corporation), AV-ONLINE (offered through the Ovid Platform) and the MediaSleuth e-commerce databases. NICEM has 106 print titles and a broadly implemented thesaurus for educational curriculum.
“NICEM has been part of our core since 1984. It has a very well developed production platform and highly automated data feed system. As we have changed our focus to the Data Harmony software and content enrichment services, NICEM is no longer central to what we do as a company,” said Marjorie M.K. Hlava, President of Access Innovations. “We are glad to see it go to a new and innovative home.”
“As the internet world has become more intensely digital, a solid archival bank of the nearly 700,000 items in the NICEM databases provides a perfect base to enter the rapidly evolving marketplace for streaming media,” said Jay Ven Eman, CEO of Access Innovations, Inc. “This dataset has archives of a huge array of materials that go way beyond just the historical educational focus of NICEM.”
Graham added “As an international media producer and distributor my go-to reference was always NICEM. I believe that it can become even more relevant and useful to educators and trainers working with today’s increasingly multi-media and internet-focused curricula.”
Founded in 1978, Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes automatic indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
When I say Thanksgiving, do you immediately think of a basted, golden brown bird on a large platter adorned with oranges, cranberries and sage? Maybe you think of a juicy oven-roasted ham bearing the traditional clove and pineapple scored design baked into its caramelized goodness?
Thanksgiving seems a holiday that’s as American as apple pie, or pumpkin pie for that matter. But actually, there are variants of this holiday all around the globe. Their meanings, dates and customs may vary, but they all revolve around the concept of gratitude and food, of course.
I think of these foods as traditional to the North American Thanksgiving holiday since I am North American. Other countries view traditional holiday foods through their own cultural lens.
For instance, while Americans and Canadians both celebrate Thanksgiving Day, there are several differences between the traditions, practices and foods in the two neighboring countries. While the basic Thanksgiving foods are similar in name, in practice they are quite different.
Canadian pumpkin pie, for example, is spicy, with ginger, nutmeg, cloves and cinnamon, while American pumpkin pie is typically sweet and has custard in it.
The American Thanksgiving holiday has been celebrated nationally, on and off, since 1789, after a proclamation by George Washington, and every year since 1863. The event that Americans commonly call the “First Thanksgiving” was celebrated by the pilgrims after their first harvest in the New World in 1621. This feast lasted three days, and it was attended by 90 Native Americans and 53 pilgrims to offer thanks for their blessings.
The First Thanksgiving 1621, oil on canvas by Jean Leon Gerome Ferris (1899). The painting shows common misconceptions about the event that persist to modern times: Pilgrims did not wear such outfits, and the Wampanoag are dressed in the style of Native Americans from the Great Plains. https://en.wikipedia.org/wiki/Thanksgiving_(United_States)
Despite what we were taught in school plays, for many of the pilgrims, England was just a layover on the way to America. Approximately 40 percent of the adults on the Mayflower were coming from Leiden in the Netherlands. The people of Leiden still celebrate the American settlers who once lived there with a non-denominational church service on the fourth Thursday of November. Afterwards, there’s no turkey, but refreshments of cookies and coffee.
Canada’s Thanksgiving celebrates the harvest and other blessings of the past year and has been an annual Canadian holiday, occurring on the second Monday in October since 1879, when Parliament declared a national day of thanksgiving.
Other countries have their own version of this holiday. Germany sees this celebration as a religious holiday that often takes place on the first Sunday of October. Erntedankfest is essentially a harvest festival that gives thanks for a good year and good fortune. Although turkeys are making inroads, chickens and geese are favored for the feast.
A food decoration for Erntedankfest, a Christian Thanksgiving harvest festival celebrated in Germany. https://en.wikipedia.org/wiki/Thanksgiving
A variation on North America’s Thanksgiving can be found in the West African nation of Liberia. This country was founded in the 19th century by freed slaves from the United States. Liberians take the concept of the cornucopia and fill their churches with baskets of local fruits like bananas, papayas, mangoes, and pineapples. An auction for these is held after the service, and then families retreat to their homes to feast.
Kinrō Kansha no Hi is a national public holiday in Japan that celebrates hard work and community involvement. It is derived from ancient harvest festival rituals named Niinamesai. Today it is celebrated with labor organization-led festivities, and children creating crafts and gifts for local police officers. This is one exception in that food is not central to this holiday and turkey does not have a traditional role.
Tradition has its place in every culture, but more and more new generations are looking to make their mark on the culinary expectations of holidays. “Foodies” like to experiment and cook outside the classification.
What is seen as non-traditional varies by geographical area and history. I have a friend who, through a series of unfortunate events, failed to procure a turkey in time to safely thaw and prepare it before the family feast last year. He instead prepared some stuffed pork tenderloins, and the response from his family was joyous. They have declared this their new “tradition”.
Fusion is a result of mixed cultures, and it is represented in food more and more. The gourmet food magazine Food & Wine offers alternatives to the Thanksgiving menu, and they aren’t referring to just using a Cornish hen vs. a turkey. Out of the kitchen ideas like mushroom lasagna and sausages would make even the most traditional among us give pause.
Wherever you fall in the food spectrum – traditionalist or adventurer – there are many options available both for home preparation and dining out. More restaurants than ever are open on this holiday to give your favorite home chef the day off so everyone can gather and celebrate in their own way — together.
Melody Smith, Blog Wrangler and Extreme Foodie
National Association of Government Web Professionals (NAGW) 2015 Annual Conference held in Albuquerque, NM, September 23-25, 2015
Conveniently held in our hometown of Albuquerque, the program for the National Association of Government Web Professionals (NAGW) 2015 Annual Conference was sufficiently compelling to warrant our participation. Two of us attended sessions and receptions and networked with an enthusiastic group of professionals.
A first observation is in the name. NAGW members prefer “Web Professionals” over webmasters. The difference in meaning (semantics) can have a significant impact on how words are perceived. Master versus professional? I’ll let you draw your own conclusions.
They are a very professional group, in my observation, and the meeting focused on the challenges and triumphs of running an essentially entrepreneurial effort in a highly political, bureaucratic environment. City, county, and state governments and agencies, as well as some federal agencies, were represented. Issues included web site organization, discovery, security, mobile venues, measuring success, Section 508 compliance, look and feel, branding, training, support, dealing with citizens, and a host of issues common to all web professionals. Technical sessions at the coding level were also on the program.
Besides challenges, there were plenty of triumphs chronicled by various presenters as well as NAGW’s annual “Pinnacle Awards”. The Pinnacle Awards are divided by the population size of the government entity – small, medium, large, etc. Some of the award criteria included team size, content, organization, design, performance and flexibility, accessibility, standards, and interactivity. It was nice to see a significant number of entrants in each category. It can be intimidating having your work evaluated by your peers, but it can be very instructive, leading to an improved site.
Delving into the politics of government websites is out of my purview. What gets posted to a government website brings with it an assumed imprimatur. Verifying, checking, and getting approvals (often multiple) of every content item is costly and time consuming. Resisting blatant or even subtle propaganda posting can be hazardous to one’s career! Being responsive to a new mayor with their unfunded mandates requires a great deal of creativity and maneuverings. Government departments are often fiefdoms and getting cooperation on design issues, what to name things, and providing access to important, useful content is not easy.
A challenge that I can address is discovery. Ron Pringle, City of Boulder, gave a great and candid presentation, “Improving Search: Lessons from the Trenches”. His remarks addressed citizen-facing websites versus internal portals. Why do citizens go to their city’s website? To find resources that answer questions like: What can be recycled? What day is trash pickup? Where do I vote? Who is the city council person for my district? Many city websites seem to be geared to wooing tourists. They are awash in pretty pictures, while a simple listing of government services is woefully missing.
Tourism website for Los Angeles, California
Citizens’ website for Los Angeles, California
Search boxes are hard to find. Navigating is often difficult, although some cities’ websites, like that of Los Angeles, California, were highlighted as quite good.
A good place to start is by analyzing search logs. This will tell you what citizens are trying to find. It beats guessing. The most requested resources should be the easiest to find. Simple listings and navigation tabs are helpful.
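As a rough sketch of what that analysis can look like, the Python below tallies queries from a log and ranks them. The log format (one query per line) and the sample queries are assumptions made for illustration; real search logs usually need more cleanup than this.

```python
from collections import Counter


def top_queries(log_lines, n=3):
    """Count normalized queries and return the n most frequent."""
    counts = Counter()
    for line in log_lines:
        # Lowercase and collapse whitespace so trivial variants count together.
        query = " ".join(line.lower().split())
        if query:
            counts[query] += 1
    return counts.most_common(n)


# Made-up log entries, one query per line.
sample_log = [
    "trash pickup",
    "Trash  Pickup",
    "where do i vote",
    "recycling",
    "trash pickup ",
    "recycling",
]

print(top_queries(sample_log))
```

The queries at the top of the list are the candidates for the most prominent spots in the navigation.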
Even a simple listing of a city’s major departments can be difficult to assemble. Do you list an agency by its official name or by what most citizens call it? Should the Solid Waste Management Department be listed as such or should it be called the garbage department or sanitation department on the website listing? Again, what your citizens call a department should provide a clue. Navigation aids should be just that – clues that help citizens find the resources they need. Once the citizen reaches the right resource, the official name of a department can be, and should be, prominently displayed. A drop-down navigation aid on the home page does not have to have the official name or the technical name. Do you want to lead with “HHW” or with household hazardous waste disposal or maybe just waste disposal? Lead with a common, general term and then get more specific. From “waste” a citizen might navigate to “hazardous waste” and “nonhazardous waste”. Under hazardous waste could be a list, but again, use common names and not the scientific name: “antifreeze” not “ethylene glycol”. Under types of antifreeze, you could then list ethylene glycol along with propylene glycol, etc., as each may have different disposal requirements.
Albuquerque’s citizen website shows “Trash & Recycling” under the “Community” tab
Lists are good, but what about the ubiquitous search box? This is where a good taxonomy is invaluable. It is the foundation of your navigation lists and aids. A good taxonomy provides the basis for sound navigation and rapid, accurate discovery. It does this by mapping the language of the citizen to the language of city bureaucrats. What the citizen calls the garbage department, the city calls the sanitation department, or the solid waste department, or… A taxonomy will bridge this gap. Taxonomies can help resolve the hundreds of acronyms that are so prevalent in government. It provides a reliable connection between the vernacular and the formal, or more scientific, terminology.
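One way to picture that bridge is a simple lookup from what citizens type to the preferred term the city uses. The Python below is a minimal sketch: the table `ENTRY_TERMS` and the function `resolve` are hypothetical names, and the mappings are taken from the examples in this post rather than from any real city vocabulary.

```python
# Hypothetical entry-term table: each citizen phrase or acronym
# points to one preferred term.
ENTRY_TERMS = {
    "garbage department": "Solid Waste Management Department",
    "sanitation department": "Solid Waste Management Department",
    "solid waste department": "Solid Waste Management Department",
    "hhw": "Household hazardous waste disposal",
}


def resolve(query, entry_terms=ENTRY_TERMS):
    """Return the preferred term for a query, or the query itself if unmapped."""
    key = " ".join(query.lower().split())
    return entry_terms.get(key, query)


print(resolve("Garbage Department"))
print(resolve("HHW"))
```

A search box that runs each query through a table like this can return the sanitation pages no matter which name the citizen used.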
I encourage you to investigate the rich resources on semantics, thesauri, and taxonomies found at our company website. I also encourage you to investigate NAGW, if you are a web professional in the government arena.
Jay Ven Eman, CEO
Access Innovations is proud to announce that it has reached the testing phase on its latest product, Ontology Master (OM). This adds to their current lineup of Data Harmony software offerings, including Thesaurus Master, MAIstro, and the Data Harmony Suite.
Ontologies are a growing trend in the information science industry. They are intended to provide a language that can be used to describe classes and the relationships between them. They formalize a knowledge domain by defining classes and the properties of those classes, while providing semantic meaning within entities. Ontologies, which are built on Resource Description Framework (RDF) triples, are commonly written in Web Ontology Language (OWL), a format standardized by the W3C in 2004 to formalize a language that would allow for linked data and relational database sharing across the web.
“Ontology Master extends the already powerful Data Harmony software tools by allowing users to create relationships between terms in a vocabulary,” comments Win Hansen, head of project development for Access Innovations. “This will give our users much more detail and richness when working with their content.”
Ontology Master is still in development, but Access Innovations is at a stage in the project where testing is required. They are currently calling for beta testers to use and comment on the software to help them refine it before release.
Jay Ven Eman, CEO of Access Innovations, remarks, “Ontologies are increasingly becoming the norm in information science, because software agents can now make more reliable inferences helping get users to the web resources they need. Our team has worked long and hard to bring Ontology Master to our clients and we’re very excited to have people beta test the software.”
If you are interested in becoming a beta tester for Ontology Master, contact Access Innovations at mailto:email@example.com
Access Innovations has extensive experience with Internet technology applications, master data management, database creation, thesaurus/taxonomy creation, and semantic integration. Access Innovations’ Data Harmony software includes machine aided indexing, thesaurus management, an XML Intranet System (XIS), and metadata extraction for content creation developed to meet production environment needs. Data Harmony is used by publishers, governments, and corporate clients throughout the world.
Words that occur together frequently are likely to encode important concepts. Therefore, simply sorting a list of phrases according to the frequency of occurrence in text is an automatic way of capturing important concepts associated with that subject. Word order is important since meaning tends to be associated with order. Thus, word order must be preserved when creating phrases from text.
The basic idea of N-gram analysis is to count phrases consisting of N sequential words from a document. Sorting a list of all of the phrases of all sizes contained in a corpus of documents by frequency presents a list of candidate phrases. These are likely to encode concepts important in the corpus. Frequency of occurrence will occasionally be correlated with the importance of the concept, though occurrences at the highest frequency level often can be less than helpful. They may well be commonly seen phrases, so they can’t simply be taken on their own; the human element must still come into play.
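The counting step can be sketched in Python as follows. This is a minimal illustration, not a production tool: the function name `ngram_counts`, the simple tokenizer, and the two sample sentences are assumptions, and the sketch lets phrases run across sentence boundaries within a document for simplicity.

```python
import re
from collections import Counter


def ngram_counts(documents, max_n=3):
    """Count every word sequence of length 1..max_n, preserving word order."""
    counts = Counter()
    for text in documents:
        # Crude tokenizer: lowercase runs of letters, digits, and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


# Made-up two-document corpus.
corpus = [
    "Cold fusion was reported in 1989.",
    "Interest in cold fusion faded quickly.",
]

counts = ngram_counts(corpus)
for phrase, freq in counts.most_common(5):
    print(freq, phrase)
```

Sorting by frequency surfaces candidates such as "cold fusion" alongside uninformative ones such as "in", which is why a human still has to review the list.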
Concepts are not necessarily captured by phrases of a particular length. Indeed, some concepts might be complex enough to require several sentences to describe them. Therefore, it is appropriate to explore phrases of various lengths when searching for concepts in text. Of course, human propensity for acronyms and economy of communication tends to drive the representation of important concepts toward shorter words or phrases.
Thus there are two opposing forces at work in the representation of complex ideas: short sequences of characters that need to be supported by a large dictionary of complex concepts, and long sequences of words that can be supported by a smaller dictionary of simpler concepts.
A fine example of this is the comparison of the Windows and Linux operating systems. There is a stark contrast between the point-and-grunt paradigm of Windows, where enormously complex concepts are embodied in the act of pointing to a simple button, and the verbosity of Linux, where paragraphs of text are used to describe the same operation.
New ideas can be discovered by finding combinations of words that have not been seen before or that are occurring with higher (or lower) frequency than in the past. Therefore, having the capability of detecting changes in the frequency of occurrence of phrases can be a path toward discovery of new or evolving concepts. In addition, when starting a new taxonomy, useful groups of words can be selected from the N-grams as a starting point for the taxonomy.
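Detecting a change in phrase frequency can be sketched as a comparison of two sets of N-gram counts, one per time period. The example below is a rough illustration under stated assumptions: the function names, the min_ratio threshold of 2.0, and all of the counts are invented, and a real analysis would need more careful statistics for rare phrases.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw phrase counts to shares of all phrases counted."""
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def frequency_shifts(earlier, later, min_ratio=2.0):
    """Label phrases as new, vanished, rising, or falling between two periods.

    min_ratio is an assumed threshold: a phrase counts as rising (or falling)
    when its share grows (or shrinks) by at least this factor.
    """
    old = relative_frequencies(earlier)
    new = relative_frequencies(later)
    shifts = {}
    for phrase in set(old) | set(new):
        o = old.get(phrase, 0.0)
        n = new.get(phrase, 0.0)
        if o == 0.0:
            shifts[phrase] = "new"
        elif n == 0.0:
            shifts[phrase] = "vanished"
        elif n / o >= min_ratio:
            shifts[phrase] = "rising"
        elif o / n >= min_ratio:
            shifts[phrase] = "falling"
    return shifts


# Invented counts for two periods of the same journal, for illustration only.
counts_earlier = Counter({"cold fusion": 40, "neutron flux": 30, "heavy water": 30})
counts_later = Counter({"cold fusion": 1, "neutron flux": 45,
                        "heavy water": 34, "plasma confinement": 20})

for phrase, label in sorted(frequency_shifts(counts_earlier, counts_later).items()):
    print(label, phrase)
```

Phrases labeled new or rising are candidates for addition to the vocabulary; phrases labeled falling or vanished are candidates for review. Either way, the labels are only a prompt for a taxonomist’s judgment.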
N-grams are not only good for discovering new concepts, though. Equally important is the ability to use N-grams to discover concepts that are no longer being discussed. Take a journal on nuclear physics. As early as the 1920s, scientists were hypothesizing on the concept that would come to be known as “cold fusion.” Papers were published on the topic all the way into the late 1980s, when Martin Fleischmann and Stanley Pons drew wide media attention after reporting that their experiment actually worked. The idea was a cause for celebration in the wake of rising energy costs and the need for cheap, clean energy.
Had N-grams been run on the corpus of that nuclear physics journal in 1988, “cold” and “fusion” might frequently have been seen together, making the phrase an obvious choice for a candidate term in a taxonomy. Just a year later, however, after nobody could repeat the Fleischmann-Pons experiment, the concept was debunked and quickly considered a joke. Afterward, almost nobody wrote on the concept, and it became extremely rare to find anything on the topic in a reputable journal. Running N-grams on the same journal today would reveal it as a concept that may well no longer belong in the vocabulary at all. N-grams have considerable value in understanding the evolution of a single concept or an entire discipline.
They can also be extremely useful in limiting concepts in a taxonomy to things that are useful. Say, for instance, I’m building a taxonomy of food for a website and I come to a branch for “Cheese.” There are thousands of different styles of cheese and I could fairly easily get a list of cheeses and add them all into the branch. That’s simple, but extremely time consuming and, ultimately, not very useful. If there is no content on this website about Abondance, an excellent but relatively unpopular cheese, and nobody is searching for content about it, why would it be in the taxonomy? It’ll just sit there uselessly. The answer, of course, is to run N-grams on the site content and the visitor search logs. The cheeses that appear in the results are the ones that could be considered highly useful in the taxonomy, helping to keep it clean, concise and, especially, relevant to your content.
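The cheese example can be sketched as a simple filter. Note that this is a simplification: rather than running full N-grams, it checks a ready-made candidate list against the site content and the search logs, which gives the same keep-or-drop answer for each candidate. The function names, the min_hits threshold, and the sample data are all invented for illustration.

```python
import re


def phrase_count(phrase, texts):
    """Count whole-phrase, case-insensitive occurrences across a list of texts."""
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    return sum(len(pattern.findall(text.lower())) for text in texts)


def select_terms(candidates, site_content, search_logs, min_hits=1):
    """Keep only the candidates that appear in the content or the search logs."""
    kept = []
    for term in candidates:
        hits = phrase_count(term, site_content) + phrase_count(term, search_logs)
        if hits >= min_hits:
            kept.append(term)
    return kept


# Invented sample data, for illustration only.
candidate_cheeses = ["Cheddar", "Brie", "Abondance"]
site_content = ["Our guide to aged cheddar.", "Baking brie in puff pastry."]
search_logs = ["cheddar recipes", "brie", "cheddar vs gouda"]

print(select_terms(candidate_cheeses, site_content, search_logs))
```

Here Abondance is dropped because nothing on the site mentions it and nobody searched for it. The search for “gouda” shows the other side of the coin: running N-grams on the logs would surface it as a term visitors want even though it was not on the candidate list.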
N-grams may not be perfect, but they’re a great beginning to a controlled vocabulary. They are a fast way to work through large volumes of content, but they still absolutely require the human element to be useful. We at Access Innovations use N-grams in close conjunction with our taxonomists to help bring out the most in our clients’ content.
Daryl Loomis, Business Development
Search for “jats” and you will find two very distinct concepts:
[Image: 14th Murray’s Jat Lancers (Risaldar Major), by A. C. Lovett (1862-1919)]
This post doesn’t discuss the Jats, a race of people, interesting as they are. This post covers information tagging, specifically a specialized list of XML elements for journal articles.
Problem and need
Interoperability has posed an ever greater hurdle for scholarly publishers over the past few years. With multiple organizations publishing journal articles on an open access basis and offering free content for casual readers, scientific and technical journals have banded together to abide by a set of standards that streamlines the sharing of documents.
As more scientific, medical, technical, and engineering writings are pushed out the door of major publishers, the need to structure this data in a robust and interoperable manner greatly increases. Scholarly publishers, universities, and other institutions within the scientific community require access to simple tools for converting content from one format to another.
The Journal Article Tag Suite (JATS) provides a comprehensive list of XML elements and attributes so that published articles can move easily between multiple data repositories and archives. The tag sets defined by JATS cover, among other things, the identification of authors, editors, and reviewers, as well as the institutions with which the authors are affiliated. Regardless of the source that published the content, the tag suite allows publishers and archives to capture the semantic components of each document without additional formatting or processing.
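To make the idea of tagged contributors concrete, here is a small sketch that reads authors, editors, and affiliations out of a JATS-style fragment using Python’s standard library. The fragment is invented and deliberately incomplete (a real JATS article carries journal metadata and much more), and the function name list_contributors is an assumption for this example. The element names used (contrib-group, contrib, name, surname, given-names, aff, xref) are part of the JATS tag set.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed JATS-style fragment, for illustration only.
JATS_FRAGMENT = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Invented Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
        <contrib contrib-type="editor">
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
        </contrib>
        <aff id="aff1">Example University</aff>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def list_contributors(xml_text):
    """Return each contributor's role, name, and affiliation (if linked)."""
    root = ET.fromstring(xml_text.strip())
    # Map affiliation ids to their text so xref links can be resolved.
    affiliations = {aff.get("id"): (aff.text or "").strip()
                    for aff in root.iter("aff")}
    people = []
    for contrib in root.iter("contrib"):
        surname = contrib.findtext("name/surname", default="")
        given = contrib.findtext("name/given-names", default="")
        affiliation = None
        for xref in contrib.findall("xref"):
            if xref.get("ref-type") == "aff":
                affiliation = affiliations.get(xref.get("rid"))
        people.append({
            "role": contrib.get("contrib-type"),
            "name": f"{given} {surname}".strip(),
            "affiliation": affiliation,
        })
    return people


for person in list_contributors(JATS_FRAGMENT):
    print(person)
```

Because the roles and affiliations are carried by the markup itself, any archive that understands the tag set can pull out the same information without knowing anything about the publisher’s in-house formatting.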
In 2003, the National Library of Medicine introduced the NLM DTD v1.0, a set of standardized XML elements used to mark up scientific and medical journal articles.
Prior to 2000, articles were published in SGML, TeX, LaTeX, PDF, or other proprietary formats. The varied and individually rigid nature of these formats, along with their lack of metadata and structure, caused problems with conversion, sharing, loading, findability, and retention. Since then, major revisions have been implemented to meet the need to mark up headers, metadata, full text, formulas, and references. The JATS schema evolved from the NLM DTD v3.0 standard. An NLM 3.1 version was slated for production, but it was superseded by the joint efforts of the publishers, and the new features were added to the JATS 1.0 DTD instead.
After the adoption of JATS, journal publishers from all sectors, from for-profit to open access, began creating repositories for JATS content. Several institutions within the scientific community that share open access journals made use of open access repositories, including PubMed Central and SciELO.
Discussions of additional JATS applications have continued since 2012. Frameworks have been established to incorporate additional keyword descriptors from multiple sources, to flag them within the full body of the articles, and to assign a relevance-based frequency count to these keywords within the metadata fields. Further enhancements of JATS could define additional roles for content creators beyond simply distinguishing between authors, editors, and reviewers. The Contributor Roles Taxonomy (CRediT), developed by CASRAI, aims to add role types to the JATS standard that expose data curators, software used, methodologies, supervisors, and funding sources along with authors, reviewers, and editors.
The growth potential for JATS is immense. Projects to assign unique identifiers to individual contributors, such as ORCID, have begun to develop within the past few years. Since authors may write or appear within multiple journal articles, news articles, or conference proceedings, archives and repositories must accurately link individuals to each of their contributed papers. And since authors share names, locations, and backgrounds, using a single identifier code to disambiguate authors is more important now than in previous years.
Content requires structure. Content regarding emerging scientific fields of study, new medical advancements, and solutions for engineering and design problems requires a high degree of discoverability and ease of access. While converting older articles into newer formats may be a drain on time and resources, publishers must account for the changes their content will undergo within the next decade. Reformatting content into an interchangeable and interoperable format is the only method for success in sharing, hosting, and providing content to end users.
NISO JATS DTD v1.0 is the formal technical specification published by the US-based NISO as Z39.96 (2012-08-22). Discussions have begun on another revision of the specification, NISO JATS v1.1.
JATS-CON is the central conference for those implementing JATS or for those who wish to know more about the standard. http://jats.nlm.nih.gov/jats-con/upcoming.html
Jack Bruce, Senior Taxonomist
In Defense of Taxonomies: In Response to the Recent Scholarly Kitchen Posts about Google Scholar, Indexing, and Content Findability
John Sack’s two posts about Google Scholar, and notably the resulting comments featuring Anurag Acharya, raised several interesting points.
Google Scholar is a wonderful tool and resource, and it is not the goal here to disparage or otherwise belittle its importance or contribution to research. But some of the observations and conclusions are confusing — especially as regards the utility of taxonomic indexing vs. the sort of broad indexing Google Scholar has implemented.
Many scholarly and other society publishers have, as Bruce Gossett pointed out in his comment, invested considerable time, effort, and money to build bespoke taxonomies/thesauri to index the specific corpus of their content. It’s misleading to insinuate (per Anurag’s response to that comment) that this is a wasted effort on their part.
1) I don’t know what taxonomy Anurag is thinking about:
“Taxonomies are often too broad for answering user queries. User queries are usually more specific than taxonomy terms/labels. Full-text matching & ranking matches user expectations better and usually goes a long way towards returning useful results.”
…but scholarly associations often have taxonomies of 3,000-10,000 terms or more — extremely granular subject terms designed specifically to cover their content. Since Google Scholar indexes content from every field, any robust subject-specific thesaurus is almost guaranteed to be more granular with regard to the discipline in question than a generic indexing can provide.
Whether Google Scholar can find a way to leverage this indexing is another matter.
2) Since we don’t know what Google Scholar is using to “index” the papers, it’s very hard to argue that the indexing is “better” than that done with the bespoke thesaurus of a scholarly publisher.
The information at this link …is not very helpful from an indexing perspective.
One suspects that it’s literally a very large inverted index simply using words that appear in the text — with no synonymy or disambiguation (the two lynchpins of good subject categorization). This is subject indexing 101: it’s not the words that are important, it’s the concepts being expressed.
This cannot be stressed enough.
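The difference between indexing words and indexing concepts can be shown with a toy example. This is a sketch of the general technique only; we do not know how Google Scholar’s index is actually built, and nothing here is a claim about it. The three document titles, the small synonym table, and the function names are all invented for illustration.

```python
from collections import defaultdict


def build_word_index(docs):
    """Plain inverted index: each literal word points to the documents using it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(doc_id)
    return index


def word_search(index, query):
    """Return the documents that contain every literal word of the query."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()


def build_concept_index(docs, synonyms):
    """Index by preferred term: every variant label maps to one concept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for variant, preferred in synonyms.items():
            if variant in lowered:  # crude substring match, enough for a sketch
                index[preferred].add(doc_id)
    return index


# Invented documents and a hypothetical thesaurus fragment.
docs = {
    1: "Risk factors for heart attack in adults.",
    2: "Outcomes after myocardial infarction.",
    3: "Heart attacks and diet.",
}
synonyms = {
    "heart attack": "Myocardial infarction",
    "myocardial infarction": "Myocardial infarction",
}

word_index = build_word_index(docs)
concept_index = build_concept_index(docs, synonyms)

print(word_search(word_index, "heart attack"))
print(word_search(word_index, "myocardial infarction"))
print(concept_index["Myocardial infarction"])
```

The word index returns a different single document for each wording of the query and misses the plural entirely, while the concept index gathers all three documents under one preferred term. That consolidation is the job synonymy does in a thesaurus.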
Consider the following two searches for (what are indisputably) the same concept:
3) It’s also hard to argue that, absent any kind of surfaced indexing/subject browse and disambiguation, Google Scholar’s indexing is always helpful.
So…a search for “mercury” (never mind the absence of any kind of disambiguation: what am I looking for? Planets? Silvery metallic substances? Automobiles? A Roman god?) yields over 2.2 million results (“finding is easy!”) to look through, the “most relevant” of which is from 1969? (Based on what? Frequency?) Note, also, that two of the top four results are for a visualization tool called “Mercury” (apparently used for the analysis of crystals).
Naturally, there are advanced search options available in Google Scholar to further curate this result set. But the lack of synonymy and disambiguation persists through Advanced Search as well.
Even simple singular/plural pairs yield different results, which is distressing:
Is there no NLP in the background? Are literally only the words that occur being indexed?
4) Uncontrolled keywords are, basically, useless metadata from an information science perspective. Author-supplied keywords are notoriously inconsistent; further, even a “helpful” keyword considered in the context of a particular discipline becomes ambiguous and unusable in another (see example below). It’s not clear to what extent Google Scholar uses or ignores these keywords, but they seem to come up in searches.
From an information science perspective, this is poor practice. Keywords — unless well-mapped to a central taxonomy of some kind — should be the last thing considered for search indexing (after title, abstract, full text, etc.).
Again, the goal here is not to disparage Google Scholar — but rather to point out the extreme importance of discipline-specific (and, more importantly, content-set specific) taxonomies (and thesauri, ontologies, authority files, etc.) constructed to index specific bodies of content.
That Google Scholar chooses not to map or leverage these important vocabularies is not an indication that this work is fruitless; on the contrary, perhaps the most useful activity Google Scholar could do with regard to indexing would be to gather and map the taxonomies from various large scholarly publishers (to a central ontology? or some other structure?) and leverage them to deliver more focused search results.
If indeed “search is the new browse,” we need far fewer than 2.2 million results to sift through, unless we’re all granted a limitless supply of research assistants.
Bob Kasenchak, Director of Business Development