Data lakes are storage repositories that hold vast amounts of raw data in its native format until it is needed. They have become very popular as organizations look for a single place to store all of their data. But how has data quality fared in those lakes? This topic came to us from Datanami in their article, “How Databricks Keeps Data Quality High with Delta.”

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata. The invention of the data lake was a critical moment in big data’s history. Instead of processing raw data and then storing the highly structured results in a relational data warehouse to be queried as needed, distributed systems like Apache Hadoop allowed organizations to store huge amounts of raw data and add structure, and therefore value, to it at a later time.
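
As a rough illustration of that schema-later model, here is a minimal PySpark sketch; the lake path and field names are assumptions made up for this example, not details from the article. Raw JSON files sit in the lake as-is, and structure is imposed only when they are queried.

```python
# A minimal schema-on-read sketch; the path and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw JSON events were dumped into the lake untouched; storing them
# required no schema at all.
raw_path = "/lake/events/raw/"

# Structure is declared only now, at query time.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

events = spark.read.schema(schema).json(raw_path)
events.filter(events.event_type == "purchase") \
      .groupBy("event_type").count().show()
```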

Like big data, the term data lake is sometimes dismissed as simply a marketing label for a product that supports Hadoop. Increasingly, however, it is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
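
That deferred-schema freedom is also where quality problems creep in, which is the issue the Datanami article says Delta addresses. As a hedged sketch, not the article’s code, here is one mechanism Delta Lake is known for, schema enforcement on write; the table path, column names, and Spark configuration below are assumptions made for the example.

```python
# A hedged sketch: Delta Lake rejects writes whose schema conflicts with the
# table, rather than silently corrupting the lake. Assumes
# `pip install pyspark delta-spark`; the path and columns are invented.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-schema-enforcement-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "/tmp/lake/events_delta"  # hypothetical table location

# The first write establishes the table schema: event_id string, amount double.
good = spark.createDataFrame([("e1", 9.99)], ["event_id", "amount"])
good.write.format("delta").mode("append").save(table_path)

# A later write with amount as a string no longer matches that schema;
# Delta raises an AnalysisException instead of accepting the bad rows.
bad = spark.createDataFrame([("e2", "not-a-number")], ["event_id", "amount"])
try:
    bad.write.format("delta").mode("append").save(table_path)
except Exception as err:
    print("write rejected by schema enforcement:", err)
```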

Melody K. Smith

Sponsored by Data Harmony, a unit of Access Innovations, the world leader in indexing and making content findable.