History
Traditional data warehousing has been around since the mid-80s and has been adopted as a foundation of modern business intelligence platforms. By incorporating different data architecture fundamentals compared to an operational system, the data warehouse allows businesses to use their data as a basis for reporting. By pre-aggregating, transforming, and contextualizing transactional data, analysis of historical data becomes easier to accomplish, allowing businesses to garner significant value and make data–based decisions promptly. This system of data management comes not without disadvantages. Once a traditional warehouse grows to a certain size and complexity, changes to the architecture require more time to develop and regression test.
Data Warehouse 2.0
An official definition states that a data vault “is a detail–oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3nf) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today‘s enterprise data warehouses.“
Quite verbose, but what does all this mean? To simplify, the data vault provides a flexible framework for adding all of your business‘ data reliably and efficiently. Some of the key advantages that the data vault methodology provides include a simpler data ingestion pipeline, an easy and reliable auditability mechanism, flexibility and ease, and new sources and relationships without impact to the existing design.
What does it look like?
A typical implementation of data vault architecture uses three primary table types:
- Hubs contain a list of unique business keys as well as some related metadata. The business keys provide the backbone of the warehouse as all data revolves around them.
- Links bring business keys together and represent associations or transactions between said business keys (HUBs). LINKs often provide the basis for creating facts in a dimensional model, as they effectively represent a many-to-many relationship between attached HUBs.
- Satellites, which are joined to one hub or one link, contain all contextual data relevant to its parent table. They also contain any historical data that can be used to facilitate record–level auditing. This is accomplished by a record–level Valid Date/Invalid Date that enables data lineage across all business keys for the related hub.
Consider a data vault application that stores a person‘s demographic and employment information. Unique entities would be the Person and the Employer, and data that uniquely identifies individual entities would be stored in the related HUB table (eg. customer_id for the PERSON entity). Demographic information related to an individual entry in the PERSON HUB table (fname, lname, etc.) would be stored in the PERSON SATELLITE table. Establishing a relationship between a person and an employer would involve creating a LINK table that incorporates only the identifying information for related HUBs (the ids from PERSON_HUB and EMPLOYER_HUB).
The real power of the data vault‘s flexibility lies in its ability for fast, on the fly, addition of new relations (LINKs). This principle allows the incorporation of new data sources and relations with much less overhead when compared to a traditional data warehouse implementation. Placing the associations between the new sections and the old sections in LINK tables enables the new relations to be created from new domains without requiring re-engineering the existing data structures. This architectural difference is where the “secret sauce“ of data vaulting applies.
Depending on the use case, a data vault could be right for you. While traditional data warehousing will always be a tried and true method, data vault‘s advantages outweigh the old. Today’s software development teams live and breathe in Agile, and the notion of incremental delivery and cyclical development align strongly to a data vault‘s flexibility. The idea of auditability is baked into the data vault – this includes change tracking and source system tracking through metadata provided within each table. As changes are always incremental and additive, ‘old‘ data is never lost, and an automatic audit trail is baked into any data vault implementation. Take your data to the next level with a data vault!
Further Reading:
https://danlinstedt.com/solutions-2/data-vault-basics/
https://www.talend.com/blog/2015/03/27/what-is-the-data-vault-and-why-do-we-need-it/
https://en.wikipedia.org/wiki/Data_vault_modeling