In Defense of Taxonomies: In Response to the Recent Scholarly Kitchen Posts about Google Scholar, Indexing, and Content Findability

John Sack’s two posts about Google Scholar, and notably the resulting comments featuring Anurag Acharya, raised several interesting points.

Google Scholar is a wonderful tool and resource, and it is not the goal here to disparage or otherwise belittle its importance or contribution to research. But some of the observations and conclusions are confusing — especially as regards the utility of taxonomic indexing vs. the sort of broad indexing Google Scholar has implemented.

Many scholarly and other society publishers have, as Bruce Gossett pointed out in his comment, invested considerable time, effort, and money to build bespoke taxonomies/thesauri to index the specific corpus of their content. It’s misleading to insinuate (per Anurag’s response to that comment) that this is a wasted effort on their part.

1) I don’t know what taxonomy Anurag is thinking about:

“Taxonomies are often too broad for answering user queries. User queries are usually more specific than taxonomy terms/labels. Full-text matching & ranking matches user expectations better and usually goes a long way towards returning useful results.”

…but scholarly associations often have taxonomies of 3,000-10,000 terms or more: extremely granular subject terms designed specifically to cover their content. Since Google Scholar indexes content from every field, any robust subject-specific thesaurus is almost guaranteed to be more granular with regard to the discipline in question than generic indexing can be.

Whether Google Scholar can find a way to leverage this indexing is another matter.

2) Since we don’t know what Google Scholar is using to “index” the papers, it’s very hard to argue that the indexing is “better” than that done with the bespoke thesaurus of a scholarly publisher.

The information at this link …is not very helpful from an indexing perspective.

One suspects that it’s literally a very large inverted index simply using words that appear in the text, with no synonymy or disambiguation (the two linchpins of good subject categorization). This is subject indexing 101: it’s not the words that are important, it’s the concepts being expressed.

This cannot be stressed enough.

Consider the following two searches for (what are indisputably) the same concept:

Any thesaurus would equate “aluminum” and “aluminium” (the latter of which is the common British spelling; the former the American) and return results for both searches.
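As a minimal sketch of the difference (not a description of how Google Scholar or any publisher's system actually works), the following compares a purely word-based inverted index with one that first maps each word to a preferred term. The documents, the one-entry `THESAURUS`, and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical one-entry thesaurus: variant spelling -> preferred term.
THESAURUS = {"aluminium": "aluminum"}

# Made-up sample documents, for illustration only.
DOCS = {
    1: "Corrosion of aluminum alloys",
    2: "Aluminium smelting and energy use",
    3: "Copper wiring standards",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def as_word(word):
    """Word-based indexing: the literal word is the index key."""
    return word


def as_concept(word):
    """Concept-based indexing: the preferred term is the index key."""
    return THESAURUS.get(word, word)


def build_index(docs, normalize):
    """Build an inverted index mapping each key to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[normalize(word)].add(doc_id)
    return index


def search(index, query, normalize):
    """Look up a single-word query, normalized the same way as the index."""
    return sorted(index.get(normalize(query.lower()), set()))


word_index = build_index(DOCS, as_word)
concept_index = build_index(DOCS, as_concept)

print(search(word_index, "aluminum", as_word))        # [1]
print(search(word_index, "aluminium", as_word))       # [2]
print(search(concept_index, "aluminum", as_concept))   # [1, 2]
print(search(concept_index, "aluminium", as_concept))  # [1, 2]
```

The word-based index treats the two spellings as unrelated strings, while the concept-based index returns the same documents for both, which is the behavior a thesaurus is meant to guarantee.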

3) It’s also hard to argue that, absent any kind of surfaced indexing/subject browse and disambiguation, Google Scholar’s indexing is always helpful.

So a search for “mercury” yields over 2.2 million results (“finding is easy!”) to look through, with no disambiguation of any kind: what am I looking for? Planets? Silvery metallic substances? Automobiles? A Roman god? The “most relevant” result is from 1969. (Based on what? Frequency?) Note, also, that two of the top four results are for a visualization tool called “Mercury” (apparently used for the analysis of crystals).

Naturally, there are advanced search options available in Google Scholar to further curate this result set. But the lack of synonymy and disambiguation persists through Advanced Search as well.
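To make the idea of disambiguation concrete, here is a small sketch of how a vocabulary might record the distinct senses of an ambiguous term so that an interface could ask the user which one is meant. The `SENSES` table and the function names are assumptions for illustration; the senses listed are simply the ones mentioned in this post.

```python
# Hypothetical homograph table: one search string, several distinct concepts.
SENSES = {
    "mercury": [
        "Mercury (planet)",
        "Mercury (chemical element)",
        "Mercury (automobile)",
        "Mercury (Roman god)",
        "Mercury (crystal visualization software)",
    ],
}


def resolve(query):
    """Return the candidate concepts for a query term.

    Terms that are not in the table are returned as a single candidate.
    """
    term = query.strip().lower()
    return SENSES.get(term, [term])


def needs_disambiguation(query):
    """True when the query matches more than one concept."""
    return len(resolve(query)) > 1


if needs_disambiguation("Mercury"):
    print("Did you mean:")
    for concept in resolve("Mercury"):
        print("  -", concept)
```

Each qualified concept can then be indexed separately, so that results about the planet are not interleaved with results about the metal.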

Even simple singular/plural pairs yield different results:

This is a bit distressing. Is there no NLP in the background? Are literally only the words that occur being indexed?
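For illustration, even a very naive normalization step would collapse most regular singular/plural pairs onto one index key. The sketch below is a deliberately simple rule-based example, not a real stemmer or lemmatizer; it mishandles irregular plurals (for example “thesauri”) and some regular ones, and the function names are hypothetical.

```python
def singularize(word):
    """Naively reduce a regular English plural to its singular form."""
    w = word.lower()
    # taxonomies -> taxonomy
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"
    # indexes -> index, searches -> search, classes -> class
    if len(w) > 3 and w.endswith(("sses", "xes", "ches", "shes")):
        return w[:-2]
    # crystals -> crystal, but leave words like "analysis" or "glass" alone
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]
    return w


def same_index_key(a, b):
    """True when two query words normalize to the same key."""
    return singularize(a) == singularize(b)


print(same_index_key("polymer", "polymers"))     # True
print(same_index_key("taxonomy", "taxonomies"))  # True
print(same_index_key("analysis", "analyses"))    # False: irregular plural
```

A production system would rely on a proper morphological analyzer or on the thesaurus itself, but the point stands: with any normalization at all, “polymer” and “polymers” would not return different result sets.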

4) Uncontrolled keywords are, basically, useless metadata from an information science perspective. Author-supplied keywords are notoriously inconsistent; further, even a keyword that is “helpful” in the context of one discipline becomes ambiguous and unusable in another (see example below). It’s not clear to what extent Google Scholar uses or ignores these keywords, but they seem to come up in searches.

From an information science perspective, this is poor practice. Keywords — unless well-mapped to a central taxonomy of some kind — should be the last thing considered for search indexing (after title, abstract, full text, etc.).
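What “well-mapped to a central taxonomy” might look like in practice can be sketched as follows. The `PREFERRED` table, its entries, and the function name are hypothetical; the idea is only that author keywords are kept for indexing when they resolve to a controlled term and are set aside for review when they do not.

```python
# Hypothetical controlled vocabulary: normalized keyword -> preferred term.
PREFERRED = {
    "aluminum": "Aluminum",
    "aluminium": "Aluminum",
    "aluminum alloys": "Aluminum alloys",
    "aluminium alloys": "Aluminum alloys",
}


def map_keywords(keywords):
    """Split author keywords into mapped preferred terms and unmapped leftovers."""
    mapped, unmapped = [], []
    for keyword in keywords:
        # Normalize case and collapse repeated whitespace before lookup.
        key = " ".join(keyword.lower().split())
        term = PREFERRED.get(key)
        if term is None:
            unmapped.append(keyword)
        elif term not in mapped:
            mapped.append(term)
    return mapped, unmapped


mapped, unmapped = map_keywords(
    ["Aluminium", "aluminum  alloys", "Aluminum", "novel approach"]
)
print(mapped)    # ['Aluminum', 'Aluminum alloys']
print(unmapped)  # ['novel approach']
```

Only the mapped terms carry consistent meaning across a corpus; the unmapped keywords are exactly the inconsistent, author-supplied strings described above.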

Again, the goal here is not to disparage Google Scholar — but rather to point out the extreme importance of discipline-specific (and, more importantly, content-set specific) taxonomies (and thesauri, ontologies, authority files, etc.) constructed to index specific bodies of content.

That Google Scholar chooses not to map or leverage these important vocabularies is not an indication that this work is fruitless. On the contrary, perhaps the most useful thing Google Scholar could do with regard to indexing would be to gather and map the taxonomies from various large scholarly publishers (to a central ontology? or some other structure?) and leverage them to deliver more focused search results.

If indeed “search is the new browse,” we need to have far fewer than 2.2 million results to sift through, unless we’re all granted a limitless supply of research assistants.

Bob Kasenchak, Director of Business Development
Access Innovations
