Database Production: Determining the Cost

December 8, 2010  
Posted in Access Insights, metadata, reference

by Marjorie M.K. Hlava
First published as part of Information Today's Private Files Series, Volume 1, Issue 7.

The costs of database production are extremely variable depending on what kind of data you are dealing with and what you want to do with it. So let’s consider the variables from beginning to end and then discuss the cost of each of them.

  • First, there is the research and development — deciding what database you want to build and where it will fit in the overall scheme of databases currently available.
  • Second, there is the data. The data needs to be looked at in terms of what it is like and how it should be treated.
  • Third is the system or procedure that you implement for collecting, tagging, and entering the data, plus the hardware and software cost.
  • Fourth is the information to be abstracted from the data collected — that is, the creation of abstracts or annotations, the indexing, the coding of the data, and so forth.
  • Fifth, there is that important variable: the amount of data. If you have a small amount of data, some things are not going to vary in cost; other factors are directly related to the amount of data.
  • Sixth is the marketing of the database, whether it is marketed internally or externally.
  • Seventh is the distribution of the information to points for customers to use the data. It may be distributed throughout a company or throughout the world.
  • Eighth is the collection of fees or payments for the data you’ve delivered.

In this paper, we will consider only the costs related directly to the production of the database. For that reason, we will assume that the R&D has been done, the data has been identified, the marketing, distribution, and collection of fees have been taken care of, and the hardware and software have been chosen, so what remains is the production of the database itself.

Step by Step

It is important to establish during the design process which data will be covered and which data elements, if found in the stack of available data, will be thrown out. This, of course, implies that the user has been consulted as to which data elements he may be interested in, how he will ask a question, and how he might retrieve the data.

After the data elements are chosen, the next step is selecting editorial guidelines. Once this is done, the backfile of the database can be started. What I mean by a backfile is all that old data that you've already collected that's waiting to be computerized. This backfile expense is usually the largest expense that you have in the creation of a database. Some portion of this data may already be computerized. It may be available from photocomposition tapes, OCLC tapes, an online system from which you can download the data, a tape from a government database, or floppies that you borrow from a friend.

At any rate, if the data is there and available in machine-readable form, it is nearly always cheaper to take the data and convert it to run on the new system than it is to re-keyboard it. If it is not available in machine-readable form, the data can be run through an optical character reader (OCR) to scan it, or keyboarded either in the U.S. or offshore. All of this preparation of data is a labor-intensive task and, therefore, the variables need to be considered on an items-per-hour and hourly-cost basis. For intellectual tasks and keyboarding tasks, most of the charges are quoted per thousand characters. If any translations are done, they are charged at a price per thousand English words, a word averaging five characters in length.
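The five-characters-per-word rule makes it easy to turn a raw character count into billable translation words. A minimal sketch of that conversion, with a hypothetical per-thousand-word rate (the article does not give translation prices):

```python
# Convert a character count into billable English words, using the
# rule of thumb above: an English word averages five characters.
# The $40-per-thousand-words rate below is illustrative only.

def billable_words(char_count, chars_per_word=5):
    """Convert a raw character count to an equivalent word count."""
    return char_count / chars_per_word

def translation_cost(char_count, rate_per_1000_words):
    """Price a translation job charged per thousand English words."""
    return billable_words(char_count) / 1000 * rate_per_1000_words

# A 1,200-character record is roughly 240 words; at a hypothetical
# $40 per thousand words, translating it costs about $9.60.
print(translation_cost(1200, 40.0))
```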

At this point, the first step involved is the creation of a citation to the entry. It might be a bibliographic citation, a shelf location number, a check number, or some other item that would point you to the location of the physical data, whether it’s in the computer or stored on a shelf or in a filing drawer.

The second step is the creation of an annotation if you decide to make one. This may be short, one or two sentences, or may be as long as an entire abstract of five hundred words. In some cases you may like to put in the full text of an article rather than just an annotation. The next step is the coding of individual data elements and/or the addition of descriptors or subject headings to each individual bit of data.

The last item is to make the data into machine readable format or to merge the already machine readable data with additional items that you added during the creation process.

Rules of Thumb

Let’s take each of these options one at a time for cost information. There are rules of thumb for estimating how much data can be processed at any one time.

In the first option, the bibliographic citations can be done at a rate of approximately 10 per hour if the information treated is a book or journal article and the tracings are done in a separate step. If, however, full verification of the document needs to be done, rather than taking it directly off the article or unit in hand, the rate can shrink to as low as two per hour. So, one of the first decisions you must make in production is how thorough a verification job you want done on each item.

The second step, the annotation of the article, depends primarily on the length of each annotation and on the technical nature of the material. A third consideration is whether it’s in English or needs to be translated for abstracting. Assuming that the data is in English, short abstracts can be done at approximately five per hour, and longer and more technical ones at one to two per hour.

The next item is the indexing. Indexing, as you know from searching commercially available online files, can vary from three to over 50 descriptors per item. Obviously, the number of duplications and the number of controls put on each descriptor make a lot of difference in how many items you can do per hour. For example, if every corporate name must be on a corporate authority list and every descriptor term must come from a large and tightly structured vocabulary, it takes considerably longer to add each term than it does if terms are chosen from a list of 200 or fewer words and augmented by free-text terms. An average item, indexed from a controlled list of perhaps 5,000 terms, can be done at a rate of five per hour.

The last item is the computerization of the data. The procedure most often used today is keyboarding the data. Keyboarding cost varies considerably depending on whether the material is typed, handwritten, or read from microfiche or paper copy.

These items aside, there are some other things that determine the cost of the data. One of them is whether the data is proofread or verified. In the verification process, two people keyboard the same data and then the two copies are matched to see if there are any errors. This gives you a character-for-character match and is a very good option.

The second option is proofreading. Generally, if you have data typed and proofread, the proofing should be done twice. The first proofreading would be for typographical errors and format; the second would be for content, matching the item against the original to make sure that a faithful copy has been made. After either proofreading or verification, the data needs to be corrected and then written out in machine-readable format for shipping to the customer.

The rates for keyboarding are per thousand characters. Should an item be verified, you pay the per-thousand-character rate twice or indicate that it is to be double-keyed. Prices for offshore keying vary from $1.40 to $1.90 per thousand characters, including verification. Prices in the U.S. start at about $3.00 per thousand characters for verified or proofread data. On the East Coast, costs are more in the range of $6.50 per thousand characters; in the West and South, they are considerably cheaper.
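The keyboarding pricing above can be sketched as a small function: a per-thousand-character rate, with the rate applied twice when the job is double-keyed for verification. This is one reading of the pricing described here; the function names are my own.

```python
# Sketch of per-thousand-character keyboarding pricing. When data is
# verified (double-keyed), two keying passes are paid for, unless the
# quoted rate already includes verification, as with offshore keying.

def keyboarding_cost(char_count, rate_per_1000, double_keyed=False):
    """Cost in dollars to key char_count characters at rate_per_1000
    dollars per thousand characters; double keying doubles the passes."""
    passes = 2 if double_keyed else 1
    return char_count / 1000 * rate_per_1000 * passes

# 12 million characters (10,000 items of 1,200 characters each) at a
# U.S. rate of $3.00 per thousand, verification included in the rate:
print(keyboarding_cost(12_000_000, 3.00))
```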

Conversion Costs

The last stage or cost factor involves the conversion of the data into something that can be uploaded by your system. Should the data already be machine readable, the cost is only for the conversion. If you’re working on a backfile of data which comes from paper or some other medium, then the steps listed above all must be followed before you can get to the conversion stage.

Data conversion involves reading data from the format it is currently in into a format that can be read by the database management system that will be used. So, to end this month’s discussion, let’s consider costs individually. Next month we’ll consider some additional kinds of cost options that you could follow.

In this example, I’d like to consider a database of 10,000 items with a bibliographic citation, a three-line annotation, and six descriptors, on average, added to each unit. This gives an average unit of about 1,200 characters to be keyboarded. The information has been boxed up and has accession numbers assigned, so it’s ready to be treated in this stage of the production process.

Based on 10,000 items with a bibliographic citation for each:

  • Citations: 10,000 items ÷ 5 per hour × $16 per hour = $32,000.00
  • Annotations at 5 per hour: $32,000.00
  • Descriptors at 5 per hour (from a controlled vocabulary of 5,000 terms): $32,000.00
  • Keyboarding: 1,200 characters × 10,000 items × $3 per 1,000 = $36,000.00
  • Total for backfile: $132,000.00
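The backfile estimate can be reproduced as a short calculation, using the throughputs and rates from the example above (the $16 hourly labor rate and five-per-hour throughput are the figures the example assumes):

```python
# The backfile cost estimate above, as a small calculation.
# Throughputs and rates are the ones used in the example.

ITEMS = 10_000
HOURLY_RATE = 16.00                        # dollars per labor hour

citations   = ITEMS / 5 * HOURLY_RATE      # 5 citations per hour
annotations = ITEMS / 5 * HOURLY_RATE      # 5 annotations per hour
indexing    = ITEMS / 5 * HOURLY_RATE      # 5 items indexed per hour
keyboarding = ITEMS * 1200 / 1000 * 3.00   # 1,200 chars/item at $3/1,000

total = citations + annotations + indexing + keyboarding
print(f"${total:,.2f}")                    # prints $132,000.00
```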