The following post, by Rachel Drysdale, originally appeared in PLOS BLOGS on April 8, 2014.
Science does not stand still and neither does the PLOS thesaurus. With more than 10,700 Subject Area terms, we use the thesaurus to index our articles and provide useful links to related papers, enhanced search functions, and, for PLOS ONE (more than 90 articles published every day!), customizable Subject Area-based email alerts and Subject Area landing pages.
Sometimes we decide to renovate a sector of the thesaurus to better reflect the make-up of the PLOS corpus. For example, we’ve long had a Subject Area term for “Synthetic biology,” sitting beneath “Biology and life sciences.” We even have a healthy Synthetic Biology Collection. However, the Subject Area term “Synthetic biology” was being applied to only a handful of articles despite the fact that many more PLOS articles were about synthetic biology and should ideally have been indexed accordingly. Why was this?
Part of the explanation is that ‘synthetic biology’ is not a phrase that is frequently used in natural language. So whereas an article about hypertension may use the word ‘hypertension’ 26 times within the text, an article about synthetic biology might use the phrase ‘synthetic biology’ rarely, if at all. This poses a challenge to the Machine Aided Indexing process, which assigns Subject Areas to articles based on the frequency of matches in the text.
The way around this is to introduce a level of abstraction to the rulebase that governs the Machine Aided Indexing. The base rules are very literal: “if I see ‘synthetic biology’ in the text I’m going to use the ‘Synthetic biology’ Subject Area term.” But there are additional words and phrases that are diagnostic of synthetic biology topics, such as “biobricks” and “Registry of Standard Biological Parts.” Adding rules for these terms – for example “if I see ‘Registry of Standard Biological Parts’ in the text I’m going to use ‘Synthetic biology’” – increases the frequency of indexing to “Synthetic biology” and thus the retrieval of relevant articles in our searches.
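To make the idea concrete, here is a minimal sketch of such a rulebase in Python. It is an illustration only, not the actual Machine Aided Indexing software: the `RULES` table, the `index_article` function and the `min_hits` threshold are all hypothetical names invented for this example, and the real rulebase is certainly richer than a phrase-to-term lookup.

```python
import re
from collections import Counter

# Hypothetical rulebase: trigger phrase (lower case) -> Subject Area term.
# The first rule is the literal one; the next two are the extra
# "diagnostic" phrases mentioned in the text above.
RULES = {
    "synthetic biology": "Synthetic biology",
    "biobricks": "Synthetic biology",
    "registry of standard biological parts": "Synthetic biology",
    "hypertension": "Hypertension",
}


def index_article(text, rules=RULES, min_hits=1):
    """Count rule matches in the text and return {Subject Area term: hits}.

    Terms with fewer than `min_hits` matches are dropped (an assumed
    threshold, standing in for frequency-based assignment).
    """
    lowered = text.lower()
    hits = Counter()
    for phrase, term in rules.items():
        # Match the phrase as whole words, case-insensitively.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits[term] += len(re.findall(pattern, lowered))
    return {term: n for term, n in hits.items() if n >= min_hits}


sample = (
    "We assembled the circuit from BioBricks listed in the "
    "Registry of Standard Biological Parts."
)
print(index_article(sample))
```

With only the literal rule, the sample sentence would receive no Subject Area at all, because it never says "synthetic biology". With the two diagnostic rules added, it is indexed to "Synthetic biology" on the strength of two matches.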
A second factor is the hierarchical structure of the thesaurus, which matters all the more because our search functionality is designed to use this hierarchy. For example, a Subject search for “Vascular medicine,” beneath which “Hypertension” sits, retrieves articles indexed specifically with “Hypertension,” even if they have not been explicitly tagged with “Vascular medicine.” In earlier versions of the PLOS thesaurus “Synthetic biology” had no narrower terms, and this was doing it no favours when it came to retrieving relevant articles. We therefore reviewed essays about synthetic biology, scope descriptions from relevant institutional and departmental web sites, and proceedings from synthetic biology conferences, all in light of the content of our articles, and introduced new, narrower terms to sit beneath our existing “Synthetic biology” where that made sense. So we went from having the single “Synthetic biology” term to the new structure of 30 terms in one renovation. Here is what we have now:
Much of the evolution of the PLOS thesaurus is gradual. For example, we realised that “puma” can be an abbreviation for “p53 upregulated modulator of apoptosis” as well as a kind of big cat, and learned that asteroids can be starfish. Dealing with these indexing missteps requires small-scale changes to specific rules. But sometimes the change needs to be more radical. Our new “Synthetic biology” sector was implemented in Ambra 2.9.12 (released March 26th, 2014). Where previously only a handful of articles were indexed with “Synthetic biology,” now a Subject search across all PLOS journals retrieves over 400 “Synthetic biology” articles, which is much more fitting for this important and developing field.
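The hierarchical Subject search described above, where a search on a broader term also retrieves articles indexed only with its narrower terms, can be sketched roughly as follows. This is a simplified illustration, not the Ambra search code: `NARROWER`, `ARTICLE_TERMS`, `expand` and `subject_search` are hypothetical names, the article identifiers are made up, and the thesaurus fragment is cut down to two links (in particular, it places “Hypertension” directly beneath “Vascular medicine,” which is an assumption).

```python
# Hypothetical thesaurus fragment: broader term -> list of narrower terms.
NARROWER = {
    "Vascular medicine": ["Hypertension"],
    "Biology and life sciences": ["Synthetic biology"],
}

# Hypothetical index: article id -> Subject Area terms assigned to it.
ARTICLE_TERMS = {
    "article-1": {"Hypertension"},
    "article-2": {"Vascular medicine"},
    "article-3": {"Synthetic biology"},
}


def expand(term, narrower=NARROWER):
    """Return the term together with all terms beneath it in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting a term
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen


def subject_search(term, article_terms=ARTICLE_TERMS, narrower=NARROWER):
    """Return ids of articles indexed with the term or any narrower term."""
    wanted = expand(term, narrower)
    return sorted(a for a, terms in article_terms.items() if terms & wanted)


print(subject_search("Vascular medicine"))  # includes the Hypertension article
print(subject_search("Hypertension"))       # narrower search, fewer results
```

The sketch also shows why a term with no narrower terms retrieves less: it can only match articles tagged with that exact term, so every narrower term added beneath it widens what a search on the broader term can find.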
For more about the work PLOS is doing with Synthetic biology see “An Invitation to Contribute to the Second Life of the Synthetic Biology Collection.”