by Marjorie M.K. Hlava
First published in Information Today
So far this column has covered design considerations for the building of a database. Editorial guidelines have been set forth. Fields have been discussed and quantified.
The next steps involve tagging or coding the data in a uniform manner so it can be pulled out of the computer system efficiently. Vocabularies, authority lists, controlled codes, category codes, and other indexing schemes are developed and refined at this time. The general term for applying these controls to the individual units is indexing.
An important consideration in building your database is vocabulary development. Search vocabularies range from highly sophisticated lists of specific terms with a strict hierarchical structure (such as those used in medical databases) to a basic authority list.
With an authority list, your dictionary simply tells you which terms are correct and which are incorrect. It does not give you references to narrower, broader, or related terms. A hierarchical list, on the other hand, gives you a tree structure in which, for example, “home additions” would be a narrower term under “housing,” and “home mortgage” might be a related term.
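The difference between the two kinds of list can be sketched in a few lines of Python. This is only an illustration: the three terms come from the example above, while the data layout and the function names `is_preferred` and `expand` are assumptions, not part of any particular system.

```python
# Authority list: only says whether a term is an approved form.
AUTHORITY_LIST = {"housing", "home additions", "home mortgage"}

# Hierarchical list: each term also carries broader, narrower, and
# related references (hypothetical layout for illustration).
THESAURUS = {
    "housing": {"broader": [], "narrower": ["home additions"],
                "related": ["home mortgage"]},
    "home additions": {"broader": ["housing"], "narrower": [], "related": []},
    "home mortgage": {"broader": [], "narrower": [], "related": ["housing"]},
}

def is_preferred(term):
    """Authority-list check: is this an approved term, yes or no?"""
    return term.lower() in AUTHORITY_LIST

def expand(term):
    """Thesaurus lookup: the term plus its narrower and related terms."""
    entry = THESAURUS.get(term.lower())
    if entry is None:
        return []
    return [term.lower()] + entry["narrower"] + entry["related"]
```

Here `is_preferred("Housing")` answers only yes or no, while `expand("housing")` also returns the narrower and related terms a searcher might want to include.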
In a small private file, a complex hierarchical thesaurus may not be necessary. The hierarchical list is more complete and will facilitate an in-depth search of a large file, but it is expensive and perhaps unnecessarily cumbersome for a small private one. An authority list will give you a consistent source for the correct or preferred terms, company names, and author spellings within the file. Many times, this is enough.
Easy on the user
The design of the vocabulary and the rules for choosing the terms will depend on how relevant you want the vocabulary to be and how easy you want to make searching for your end user. If you have a file that is going to be searched company-wide by a great many people, you might want to pass out a user’s manual with a small structured vocabulary. Once again, it has to be consistent with the system, just as the system must be consistent with itself.
If you decide that you always want the plural form, you’ll have to decide whether to make exceptions, since a few plural forms are strange and not used in everyday language. Is it grapefruits or grapefruit? Is it management by objectives or management by objective? Those kinds of questions sometimes seem ludicrous. Don’t get stuck on minute details, but do consider the impact of each choice on your database, and make sure your treatment of these terms is consistent.
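One way to keep such decisions consistent is to record the exceptions in a table and apply the general rule everywhere else. The sketch below assumes an “always plural” rule with a deliberately naive pluralizer (it only appends “s”); the exception choices and the name `preferred_form` are hypothetical.

```python
# Documented exceptions to the "always plural" rule. Each variant maps
# to the form chosen for this file (the choices here are hypothetical).
PLURAL_EXCEPTIONS = {
    "grapefruit": "grapefruit",
    "grapefruits": "grapefruit",
    "management by objective": "management by objectives",
}

def preferred_form(term):
    """Return the preferred form of a term under the documented rules."""
    key = term.strip().lower()
    if key in PLURAL_EXCEPTIONS:
        return PLURAL_EXCEPTIONS[key]
    if key.endswith("s"):
        return key        # treated as already plural (naive assumption)
    return key + "s"      # naive default: append "s"
```

Because the exceptions live in one table, a change to the rules can be applied to both the data and the list from a single place.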
Another vocabulary question involves multi-part phrases, such as “do-it-yourself.” Are you going to bind these terms with hyphens, run all the words together, or treat each part as a separate word? If you are going to use a phrase-parsed system, you won’t have to worry about it.
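The three treatments of a multi-part phrase can be written as a single rule, so that whichever one you choose is applied the same way every time. This is a minimal sketch; the policy names “bind”, “join”, and “split” and the function `normalize_phrase` are illustrative assumptions.

```python
import re

def normalize_phrase(phrase, policy="bind"):
    """Apply one documented policy to a multi-part phrase.

    policy: "bind"  -> parts joined with hyphens
            "join"  -> parts run together
            "split" -> parts kept as separate words
    """
    # Break the phrase on any run of spaces or hyphens.
    words = [w for w in re.split(r"[\s\-]+", phrase.strip().lower()) if w]
    if policy == "bind":
        return "-".join(words)
    if policy == "join":
        return "".join(words)
    if policy == "split":
        return " ".join(words)
    raise ValueError(f"unknown policy: {policy!r}")
```

With the “bind” policy, both “Do it yourself” and “do-it-yourself” come out as the same indexed form.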
Whatever you finally decide, you need to document every decision very carefully. Good documentation of your vocabulary will make a big difference to your end user. A well-indexed, consistent vocabulary will make it easier and faster to locate information. For all the expense of computerized information storage, you certainly want to make sure the system works efficiently and rapidly. If you change the rules, go back and change both the data and the list.
The organization of the vocabulary will also be a strong indication of the general organization of the file. If the file isn’t well organized and systematic, that will certainly be reflected in the vocabulary. It’s common to find sixteen different ways to write out DuPont in a database. Needless to say, this kind of problem makes effective, thorough searches rather difficult. A controlled vocabulary, particularly an authority list for corporate and personal names, will speed the development of search strategies.
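A name authority list can be as simple as a lookup from every known variant to one preferred spelling. The sketch below is an assumption about how that might look: variants are reduced to a key by dropping case, spaces, and punctuation, and the table entries and function names are illustrative.

```python
import re

# Preferred spelling for each normalized key (illustrative entries).
CANONICAL_NAMES = {
    "dupont": "DuPont",
    "eidupontdenemours": "DuPont",
}

def name_key(raw):
    """Lowercase the name and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())

def standardize_name(raw):
    """Return the preferred spelling, or None if the name is not listed."""
    return CANONICAL_NAMES.get(name_key(raw))
```

Variants such as “Du Pont”, “DUPONT”, and “du-Pont” all reduce to the same key, so they resolve to one spelling. Returning `None` for an unlisted name leaves the decision to the indexer instead of guessing.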
Abstracting and indexing
After the fields in the file have been satisfactorily designed and you have decided which ones should be searchable and sortable, the next step is abstracting. In the abstracting process, your staff takes the original document and fills out abstracting forms that divide the information into the various chosen fields. Selector or descriptor terms are assigned at this time. An abstract is written, or the entire document is marked for inclusion, depending on how you choose to store the material. The bibliographic information is also attached.
After the data has been broken down by the abstracting staff, it needs to be indexed. The indexing requires two steps:
1. All names must be standardized, including corporate names, brand names, generic names of chemicals, etc.
2. Additional descriptors used in the abstracting process are added to the controlled vocabulary, and these are also standardized.
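The two indexing steps might be sketched as follows. The record layout (a dictionary with `names` and `descriptors` lists), the lookup by lowercased name, and the function name `index_record` are all assumptions made for the example.

```python
def index_record(record, name_authority, vocabulary):
    """Index one abstracted record in two steps.

    record:         dict with optional "names" and "descriptors" lists
    name_authority: dict mapping lowercased variants to preferred names
    vocabulary:     set of controlled descriptors, updated in place
    """
    indexed = dict(record)  # leave the abstractor's record untouched

    # Step 1: standardize every name against the authority list.
    # Names that are not listed are kept as entered (trimmed).
    indexed["names"] = [
        name_authority.get(name.strip().lower(), name.strip())
        for name in record.get("names", [])
    ]

    # Step 2: standardize each descriptor and add it to the vocabulary.
    descriptors = []
    for descriptor in record.get("descriptors", []):
        term = " ".join(descriptor.lower().split())  # collapse spacing
        vocabulary.add(term)
        descriptors.append(term)
    indexed["descriptors"] = descriptors
    return indexed
```

In practice, a new descriptor would probably be reviewed before it joins the controlled vocabulary; the sketch adds it directly to keep the two steps visible.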
We generally use a form for each unit entered into the database to be sure that each field is at least considered for inclusion in the database. This makes it easier to start and stop the job of indexing and abstracting at any time. It also makes the treatment of data more uniform from one person to another. (Request a sample “Abstract Form” from the publisher of Information Today if you would like to see how we at Access Innovations have developed ours.)
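The idea that every field is at least considered can be checked mechanically. In this sketch the field names are hypothetical (they are not taken from the Access Innovations form), and a field counts as considered once it is either filled in or explicitly marked not applicable.

```python
# Hypothetical field list; a real form would use the fields of your file.
FORM_FIELDS = ("title", "author", "source", "abstract", "descriptors")
NOT_APPLICABLE = "N/A"  # marks a field that was considered and skipped

def new_form():
    """Start a blank form: every field is still unconsidered (None)."""
    return {field: None for field in FORM_FIELDS}

def unconsidered_fields(form):
    """List the fields that nobody has filled in or marked N/A yet."""
    return [field for field in FORM_FIELDS if form.get(field) is None]
```

A form with an empty result from `unconsidered_fields` is ready for data entry, and a partly finished one shows exactly where the next person should pick up.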
When the data has been completely abstracted and indexed, it goes to the data entry staff.