File Design: Those Important First Steps

December 8, 2010  
Posted in Access Insights, metadata, storage, Technology

By Marjorie M.K. Hlava
First published in Information Today

In my first column, we discussed general editorial and file size considerations. This column discusses the file design questions and opportunities. The next issue will cover vocabulary control, indexing, and abstracting.

Once you’ve set your editorial policy and fleshed out those basic questions on the size and scope of the file, you’re probably ready to deal with file design. You will have to decide what type of system to mount your file on. If the access points are few, the system is used by people in the same location, or few people search the file, you should probably keep a simple card catalog. A card catalog may be all you need, even if you have 75,000 documents. But if you need more than the author/title/subject access points, if your searchers are geographically dispersed, or if security is a major consideration, then you should consider automating your system.

Consider security

If you have a large security problem, keeping your automated files inhouse will help prevent leaks. The database vendors — DIALOG, BRS, CDC, GE, for example — try to keep the systems they use to mount private files very tight. Nonetheless, this is not a foolproof solution. Once I broke into a private file accidentally and sent the print-out from a classified file to the vendor to demonstrate the problem. The leak was fixed, but the incident certainly did indicate that, for tight security of proprietary materials, keeping the file inhouse might be worth considering.

The software connection

If you need numerous access points to ensure that you retrieve as much information as possible in any given search, you should use sophisticated search software. However, sophisticated software requires a large virtual memory to manipulate large searches. This is where the vendor comes in. A time-sharing system with a telecommunications interface to your location is a good option once questions of security have been settled.

When you have decided on a given system, make sure you build your file so that it matches the particular requirements of the chosen system. You still have to maintain enough latitude in your design so that you can change systems with a minimum amount of conversion if the chosen system doesn’t work out satisfactorily. In other words, keep your design portable.

Decisions, decisions

After you have settled on a strong, consistent editorial policy and you have chosen a satisfactory system for your database (hardware and software options will be discussed in a later column), it’s time to make decisions on your database design.

Which fields do you want? A database may have fields of searchable data, such as author and title information; textual fields that include an abstract or perhaps the full text; and contact fields that include all the bibliographic information. When choosing the particular fields, you must also decide, for each one, whether it should be searchable and displayable or displayable only. Again, remember the end user in your design. What is the most likely approach this person will use when trying to locate the material? By subject? By publication date? Design your search capabilities to accommodate this user.

For each field, you must decide whether it is mandatory. Do you require a title every time, for example, or will a document type suffice in lieu of a title? If it is a quarterly report, is that a title or a document type? Do you want the field sortable? Do you want it sortable in ascending order or descending order? Is it worth the expense to make the field sortable? Each of these seemingly minute questions must be answered before compiling the information to be stored.

The design questions go on and on. Do you want each field to be printable? There may be a field you want to search numerically, but it will never need to be printed. Sophisticated searchers know that some fields can be searched numerically, while others can be range searched or cascaded. Do you want to be able to list all of the authors whose last names begin with vex-, or would this be better handled with a Boolean statement? What should your field length be? Do you want to set limits of characters per field or simply go to the system’s logical limit? These decisions must be made for every field in the database, whether it has 6 fields or 109.
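It can help to write these decisions down as a table, one row per field, before any data is compiled. The sketch below is only an illustration of that idea: the `FieldSpec` class, the tags, and the length limits are hypothetical and do not come from any particular vendor's system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Design decisions for one field (all names here are illustrative)."""
    tag: str                          # short mnemonic, e.g. "TI"
    name: str
    mandatory: bool = False
    searchable: bool = True
    displayable: bool = True
    printable: bool = True
    sortable: bool = False
    max_length: Optional[int] = None  # None = use the system's own limit


# A hypothetical three-field design.
DESIGN = [
    FieldSpec("TI", "title", mandatory=True, sortable=True, max_length=250),
    FieldSpec("AU", "author", sortable=True, max_length=100),
    FieldSpec("AN", "accession number", mandatory=True, printable=False),
]


def validate(record: dict, design: list) -> list:
    """Return a list of problems found in a record; empty means it passes."""
    problems = []
    for spec in design:
        value = record.get(spec.tag, "")
        if spec.mandatory and not value:
            problems.append(f"{spec.tag}: mandatory field is missing")
        if spec.max_length is not None and len(value) > spec.max_length:
            problems.append(f"{spec.tag}: longer than {spec.max_length} characters")
    return problems
```

Checking each incoming record against the design catches a missing title or an overlong field at input time, when it is cheapest to fix.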

In addition, you must decide on parsing rules. Parsing is the manner in which information is divided for searching. You will probably use parsing in one of two ways. One way is word-for-word (word parsing) where the computer breaks at every space. For example, if you have a title such as “The Electronic Mail Box,” the computer would break after “The,” “Electronic,” “Mail,” and “Box.” Each word would be searchable. With word parsing systems, the computer can be programmed to ignore words such as the, of, and, but, etc. A hyphenated word is read as a single word by the computer, so the text must be impeccably consistent if the system is to operate effectively.
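Word parsing as described above can be sketched in a few lines. This is a minimal illustration, not the parser of any real system; the stop-word list and the choice to lowercase every term are assumptions made for the example.

```python
# Words the parser is told to ignore (an illustrative list).
STOP_WORDS = {"the", "of", "and", "but"}

# Punctuation trimmed from the ends of each word; hyphens are left alone.
PUNCTUATION = '.,;:!?"\'()\u201c\u201d'


def word_parse(text, stop_words=STOP_WORDS):
    """Break text at every space and return the searchable words."""
    terms = []
    for token in text.split():
        term = token.strip(PUNCTUATION).lower()
        if term and term not in stop_words:
            terms.append(term)
    return terms


title_terms = word_parse("The Electronic Mail Box")
hyphen_terms = word_parse("On-line searching")
```

Here `title_terms` holds "electronic", "mail", and "box", with "the" dropped as a stop word. In `hyphen_terms` the hyphenated "on-line" stays a single term, so a search for "online" would miss it. That is why the text has to be entered consistently.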

A second method is phrase parsing. In this system, the breaks occur only where you indicate a break. The break indicator, or subfield delimiter, determines where each phrase is to be broken. Phrase parsing solves the problem of double-word descriptors. Within these breaks the information must be consistent in order to facilitate searching. Also, a system can be programmed for both word and phrase parsing to make searching more extensive and complete.
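Phrase parsing can be sketched the same way. The semicolon used as the subfield delimiter here is an assumption for the example; each system defines its own break indicator. The second function shows one possible way to combine both methods, indexing each phrase and each of its words.

```python
def phrase_parse(field_value, delimiter=";"):
    """Break a field only at the delimiter and return the phrases."""
    phrases = []
    for chunk in field_value.split(delimiter):
        # Collapse stray spacing so "Electronic  mail" and "electronic mail" match.
        phrase = " ".join(chunk.split()).lower()
        if phrase:
            phrases.append(phrase)
    return phrases


def word_and_phrase_parse(field_value, delimiter=";"):
    """Return every phrase plus every individual word inside the phrases."""
    phrases = phrase_parse(field_value, delimiter)
    terms = set(phrases)
    for phrase in phrases:
        terms.update(phrase.split())
    return terms


descriptors = phrase_parse("Electronic mail; Information retrieval ;Databases")
combined = word_and_phrase_parse("Electronic mail; Databases")
```

With phrase parsing, the double-word descriptor "electronic mail" is kept as one searchable unit. With both methods applied, the searcher can find the record under "electronic mail", "electronic", or "mail".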

Index design

Most files, such as those produced by the large time-sharing vendors, have what is known as a “basic index,” or “default file.” This file index consists of the basic controlled term vocabulary as well as terms preceded by their categorical mnemonics, such as AU for “author,” or TI for “title.” You can search using the mnemonic tags or codes or through general, natural language terms.
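To make the idea concrete, here is a rough sketch of a basic index in which controlled terms are entered as they stand and author and title entries are prefixed with their mnemonics. The `TAG=value` key format, the `DE` tag for descriptors, and the sample records are all assumptions for illustration; actual vendor systems differ in their syntax.

```python
from collections import defaultdict

# Invented sample records; "DE" holds the controlled-vocabulary descriptors.
RECORDS = {
    1: {"AU": "Jones, A.", "TI": "File Design", "DE": ["Databases", "File design"]},
    2: {"AU": "Smith, J.", "TI": "Databases", "DE": ["Databases"]},
}


def build_basic_index(records):
    """Map each index entry to the set of records that contain it."""
    index = defaultdict(set)
    for doc_id, record in records.items():
        for descriptor in record.get("DE", []):
            index[descriptor.lower()].add(doc_id)        # unprefixed entry
        for tag in ("AU", "TI"):
            value = record.get(tag)
            if value:
                index[f"{tag}={value.lower()}"].add(doc_id)  # prefixed entry
    return index


def search(index, query):
    """Look up a plain term, or a tagged term written as TAG=value."""
    tag, sep, value = query.partition("=")
    key = f"{tag.upper()}={value.lower()}" if sep else query.lower()
    return sorted(index.get(key, set()))


basic_index = build_basic_index(RECORDS)
```

A search for "databases" finds both records through the controlled vocabulary, while "TI=Databases" finds only the record with that title.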

If this searching is not as extensive as you would like, e.g., if you feel that this method of searching might not be able to locate peripheral documents you want to retrieve, additional indexing may then be attached. For example, you can have full-text searching in addition to your basic index. A full-text search can increase your number of false drops, but it will enhance the coverage of your search.

For each index you build, you can create an inverted file. If you have a fairly small inhouse system, you can probably avoid the expense of an inverted file by using a linear file, which is simpler but takes longer to search. The advantage of an inverted file is its speed. The disadvantage is that it takes a larger system. If your file is small, linear searching is no real problem, but if you have 30,000 documents, you may spend more than an hour searching for a single phrase on a minicomputer and longer on a microcomputer. So the decision between inverted file and linear file is based on your budget vs. time constraints and on the size of your file.
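The difference between the two approaches is easy to see in miniature. The sketch below is a simplified illustration with invented sample titles: the linear search reads every document for every query, while the inverted file is built once and then answers each query with a single lookup.

```python
from collections import defaultdict

# Invented sample file: document number -> title.
DOCUMENTS = {
    1: "The Electronic Mail Box",
    2: "Electronic publishing and the library",
    3: "Mail handling in large offices",
}


def linear_search(documents, term):
    """Scan every document in turn; simple, but slower as the file grows."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, text in documents.items()
        if term in text.lower().split()
    )


def build_inverted_file(documents):
    """Build the index once: each word points to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def inverted_search(index, term):
    """Answer a query with one lookup instead of a full scan."""
    return sorted(index.get(term.lower(), set()))


inverted = build_inverted_file(DOCUMENTS)
both = inverted["electronic"] & inverted["mail"]  # Boolean AND of two terms
```

Both searches return the same documents. The inverted file trades extra storage and build time for speed, which is the budget-versus-time decision described above.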