by Marjorie M.K. Hlava
Previously published in Information Today

In previous columns, we have discussed many of the basics of preparing the data for a computerized file. We have stressed being sure that the editorial policy and individual rules are set consistently so that the data may be easily retrieved once it is entered into that big black box called the computer. The next and very important step is computerization of the data, or converting already-computerized data to match the specifications of the hardware and software system you are going to run your file on, an exciting but exacting task. The data may arrive on photocomposition tapes, MARC tapes, floppies, hard disks, paper forms, catalog cards, brochures, and any number of other information sources, and each of these must be made to match your machine.

Data entry

Once the data has been tagged, abstracted, and indexed, it is ready for some sort of data conversion process. Sometimes this process will involve taking already machine-readable data and adding to it information or tagging to make it easier to retrieve in a manner that your user will be most likely to want.

These tasks can be done either in-house or sent out-of-house to a jobber, and they can be done in the U.S. or offshore. Data entry and data conversion companies use various combinations of these approaches. Some companies have people do this work in their homes on microcomputers or on remote-entry, dial-up machines. If data is entered on a microcomputer at home, for example, the material is entered into the computer and stored on floppy diskettes, which are delivered to the office as each diskette is completed.

The data entry process is the same whether the data is numeric or bibliographic. The data needs to be entered very carefully: some of the fields can be masked as mandatory, required-entry, or fixed-length fields to cut down the number of errors that might be produced. Another way to cut down on errors is to pass the entered data through a spelling dictionary to make sure that the words in the file are only those words used in the dictionary for the particular database. The spelling dictionaries can be adjusted to match the jargon of a particular file.
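As a minimal sketch of how such field masks and a file-specific spelling dictionary might be applied at entry time, consider the following; the field mnemonics, lengths, and word list are hypothetical and stand in for whatever the file design actually specifies.

```python
# Hypothetical sketch of entry-time validation: mandatory fields,
# fixed-length fields, and a file-specific spelling dictionary.

MANDATORY = {"TI", "AU", "PY"}          # assumed tags: title, author, publication year
FIXED_LENGTH = {"PY": 4}                # publication year must be exactly 4 characters
JARGON_DICTIONARY = {"online", "indexing", "thesaurus", "database"}  # sample word list

def validate_record(record: dict) -> list:
    """Return a list of problems found in one keyed record."""
    problems = []
    for field in MANDATORY:
        if not record.get(field, "").strip():
            problems.append(f"missing mandatory field {field}")
    for field, length in FIXED_LENGTH.items():
        if field in record and len(record[field]) != length:
            problems.append(f"field {field} must be {length} characters")
    # Flag any word not found in the file's own spelling dictionary.
    for field, text in record.items():
        for word in text.lower().split():
            if word.isalpha() and word not in JARGON_DICTIONARY:
                problems.append(f"unrecognized word '{word}' in field {field}")
    return problems

# Example: one record with a short year and a word outside the dictionary.
print(validate_record({"TI": "Online indexing", "AU": "Smith", "PY": "198"}))
```

Words flagged by the dictionary pass are not necessarily wrong; as with any spelling check, they are simply set aside for a human to confirm.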

After the data has been entered, it should be printed out and proofread, or verified, to catch problems such as typographical errors, errors in grammar, errors in content, or missing data in the appropriate fields.

Proofreading or verification

In our company, we double-proofread the material. We proof once for typos and grammatical errors, missing copy, missing fields, missing bibliographic information, missing mnemonics, and so on. We then proofread a second time to make sure all of the data has been faithfully transmitted from the original to the typewritten or computerized form. The second proofreader also double-checks for any missing fields or missing information. After this is done, we check the data for mnemonics to make sure that the mnemonics in the file actually match the tagging protocol in the original computer design.
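The mnemonic check lends itself to automation. A small sketch of that kind of check follows, assuming a hypothetical set of legal field tags taken from the original file design.

```python
# Hypothetical check that the mnemonics (field tags) appearing in a file
# match the tagging protocol laid down in the original computer design.

LEGAL_TAGS = {"TI", "AU", "SO", "PY", "AB", "DE"}   # assumed protocol for an imaginary file

def check_mnemonics(lines: list) -> list:
    """Report any line whose leading tag is not part of the tagging protocol."""
    errors = []
    for number, line in enumerate(lines, start=1):
        tag = line.split(" ", 1)[0]
        if tag not in LEGAL_TAGS:
            errors.append(f"line {number}: unknown mnemonic '{tag}'")
    return errors

sample = ["TI Building databases", "AU Hlava, M.", "XX stray tag"]
print(check_mnemonics(sample))   # flags the XX line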

Another way to assure that the information is keyed accurately is through verification. For statistical data or numerical data, verification is a better and more accepted way of ensuring no mistakes get by. In the verification process, one data entry person enters the information on the keyboard. When the keying is complete, another data entry person enters the same information over that original. The computer system is programmed to ring a bell when a character in the two bodies of information doesn’t match.

The only problem with this type of verification is that, for bibliographic compilations, the system won’t catch the errors in grammar or misspelling that a proofreader would. With statistical data or other kinds of numerical information, verification is the most effective way of eliminating keying errors and is less tedious than a visual scan. Another method of verification is to type all the material twice and then run a computer match to make sure that the two keyings are indeed the same.
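A minimal sketch of that "type it twice and run a computer match" approach is shown below: two independently keyed versions are compared character by character, and every disagreement is reported (the sample strings are, of course, invented).

```python
# Hypothetical double-keying check: compare two independently keyed
# versions of the same material and flag every position where they differ.
from itertools import zip_longest

def compare_keyings(first: str, second: str) -> list:
    """Return a report of every character position where the two keyings disagree."""
    mismatches = []
    for position, (a, b) in enumerate(zip_longest(first, second, fillvalue=""), start=1):
        if a != b:
            mismatches.append(f"position {position}: first keyed {a!r}, second keyed {b!r}")
    return mismatches

keying_one = "The 1984 budget totals $12,437"
keying_two = "The 1984 budget totals $12,487"
for report_line in compare_keyings(keying_one, keying_two):
    print(report_line)   # each mismatch plays the role of the verifier's bell
```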

Delivery to the vendor

Once the information has been keyboarded, verified or proofread, and corrected, it is then ready to be put either on magnetic tape or floppy disk, or delivered over telephone or Internet lines to the host computer. Magnetic tape (although it seems an archaic technology) is a nearly universal standard, and 9-track tapes can be read from one computer to another very easily. Large bibliographic vendors, such as Dialog, SDC, BRS, GE, The Source, and CompuServe, accept their data in 9-track magnetic tape format. If your data is created on a microcomputer on 5-1/4 inch floppies, for example, it will need to be converted to a 9-track tape for delivery to the online vendor.

Telephone lines, using the RS-232 serial interface, are another fairly standard way to deliver data from one vendor or one computer to another. We have been successful, for example, in transferring data from a Wang PC to an Apple PC to a Wang minicomputer, from a TRS-80 to a Wang, or from an IBM PC to an Apple, and so forth. These interfaces convert the data into signals that will travel over a voice-grade telephone line and be accepted by the receiving computer.
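For readers implementing the same idea with current tools, a rough sketch using the third-party pyserial library is given below; the port name, line speed, and lack of any error-checking protocol are assumptions for illustration only.

```python
# Rough sketch of moving a file between machines over a serial (RS-232 style)
# connection, using the third-party pyserial library (pip install pyserial).
# The port name and speed below are assumptions for illustration.
import serial

def send_file(path: str, port: str = "/dev/ttyUSB0", speed: int = 9600) -> None:
    """Send one file, a line at a time, over a serial port."""
    with serial.Serial(port, baudrate=speed, timeout=5) as link, open(path, "rb") as data:
        for line in data:
            link.write(line)   # the receiving computer reads the same byte stream
    # A real transfer would add a file-transfer protocol (XMODEM, Kermit, etc.)
    # on top of the raw serial link for error detection and retransmission.
```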

Record keeping and archiving

Throughout the process of creating a database, you need to keep accurate records of where the materials for the database are at any given time. Without a strict record-keeping system, it is very easy to lose materials as they move through the production pipeline. When you are building a database, either for a client or for yourself, it is important not to lose any of the documents. A logbook should be kept and all materials entered in it. When materials are taken out for abstracting and indexing, data entry, or either of the proofreading cycles, they should be signed out so that the person keeping records or looking for data knows at any time exactly where all materials are located and where they are in the production process.
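Whether the logbook is paper or electronic, the principle is the same: every movement is recorded, and the latest entry tells you where a document is now. A hypothetical electronic version of such a log, with invented document numbers, stations, and names, might look like this.

```python
# Hypothetical sketch of the production logbook: every document is signed in and
# out so its current location in the pipeline is always known.
import datetime

logbook = {}   # document id -> list of (timestamp, station, person) entries

def log_movement(document_id: str, station: str, person: str) -> None:
    """Record that a document has moved to a production station
    (abstracting, data entry, first proof, second proof, ...)."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    logbook.setdefault(document_id, []).append((stamp, station, person))

def current_location(document_id: str):
    """Return the most recent log entry for a document, if any."""
    entries = logbook.get(document_id)
    return entries[-1] if entries else None

log_movement("DOC-0042", "abstracting and indexing", "J. Smith")
log_movement("DOC-0042", "data entry", "R. Jones")
print(current_location("DOC-0042"))   # last known station and custodian
```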

Once the material has been entered and delivered to the vendor, it is extremely important to keep an archive of the system. Most places keep two archives, or backups. One is kept on-site near the computer for use should the system go down for any reason; the second needs to be stored off-site.

We make duplicates of all materials entered on diskettes; during the production process, the duplicate, or archive copy, is stored off-premises so that at least one copy of the entered material is preserved in case of natural disaster, theft, or accidental loss. Most of our clients require that we retain an archive of their data for a specified period after the material has been delivered to the online vendor.
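In modern terms, the same discipline amounts to copying each delivered file into an archive location and confirming that the copy is identical to the original. A minimal sketch follows; the directory layout and use of a checksum are assumptions, not a description of any particular vendor's requirement.

```python
# Hypothetical sketch of the archiving step: duplicate each delivered file into
# an archive directory and verify the copy with a checksum.
import hashlib
import shutil
from pathlib import Path

def archive_copy(source: Path, archive_dir: Path) -> Path:
    """Copy a file into the archive directory and confirm the copy is identical."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    destination = archive_dir / source.name
    shutil.copy2(source, destination)
    original = hashlib.sha256(source.read_bytes()).hexdigest()
    duplicate = hashlib.sha256(destination.read_bytes()).hexdigest()
    if original != duplicate:
        raise IOError(f"archive copy of {source.name} does not match the original")
    return destination
```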

Throughout the process of building a database, from making editorial decisions to entering the material, you must learn to balance the seemingly conflicting goals of quality and cost. The more times a record is proofread, for example, the fewer errors you will have, but each record becomes more expensive to produce. We have found that “quality control groups” within our company for each section (data entry, proofreading, abstracting, indexing, etc.) keep things consistent and running efficiently, while assuring that the file design is solid. These groups do cost personnel time, but they seem to pay off in file production.
