Compliance and Indexing

June 2, 2010 – “Backing Up Corporate Data – How Deduplication Helps” mentions indexing, but its focus is the challenge of deduplication. Indexing can help in certain deduplication operations, and it can also be a problem. In free-form indexing environments, copies of the same document may be assigned different index terms. Is that enough to make one document different from another when the two differ only in their metadata? On the surface, the idea is silly. When two people index with uncontrolled methods, the document is still the “same”; the metadata are not germane. But what happens when two identical documents are indexed with uncontrolled terms and one user assigns the name of a person of interest? The explicit name of the entity does not appear in the document, but the human assigned an entity name to add value to that particular document. Are the documents still identical? Why not retain just the index terms? We think it is important to find out whether the entity tag was accurate. Should inaccurate uncontrolled terms be retained?

The write-up focuses on some broad issues in deduplication. We found this passage interesting:

With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy reducing storage and bandwidth demand to only 1 MB. However, indexing of all data is still retained should that data ever be required. Enterprise data today is more dispersed and diverse than ever. And with over 30% critical corporate data sitting on PCs, administrators can no longer hold the end user responsible for its protection. The best corporate data protection solutions combine source-based data deduplication and continuous data protection.

We think the author is tracking with us, and we understand the need to trim duplicates. But once again, defining “duplicate” is the first step in figuring out what to keep and what to discard.
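
To make the question concrete, here is a minimal sketch of one possible answer, not the approach of any product mentioned in the write-up. It assumes "duplicate" means byte-identical content, stores that content once, and keeps every indexer's terms alongside it. The class name `DedupStore`, the indexer names, and the sample terms are all hypothetical.

```python
import hashlib
from collections import defaultdict


class DedupStore:
    """Toy store: one copy of each document body, all index terms retained."""

    def __init__(self):
        self.blobs = {}                       # content hash -> document bytes
        self.index_terms = defaultdict(dict)  # content hash -> {indexer: set of terms}

    def add(self, content: bytes, indexer: str, terms=()):
        # Assumption: two documents are duplicates when their bytes match.
        # Metadata plays no part in deciding whether to store a new copy.
        key = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(key, content)
        # Terms are kept per indexer so a questionable tag can be traced back.
        self.index_terms[key].setdefault(indexer, set()).update(terms)
        return key


store = DedupStore()
memo = b"Quarterly results attached."
k1 = store.add(memo, indexer="alice", terms={"finance", "Q2"})
# The entity name below does not appear anywhere in the document text.
k2 = store.add(memo, indexer="bob", terms={"finance", "J. Smith"})

print(k1 == k2)          # True: same content, same key
print(len(store.blobs))  # 1: only one copy is stored
print(sorted(store.index_terms[k1]["bob"]))  # ['J. Smith', 'finance']
```

Under this definition the two copies collapse into one, yet the human-assigned entity tag survives and stays attributed to the person who assigned it. Whether that tag is accurate, and whether it should be kept if it is not, remains a policy decision the code cannot make.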

Sponsored by Access Innovations.