The call for a Modular Data Warehouse

In those day of huge focus on the Big Data mouvement, it seems that nobody needs a data warehouse anymore but a huge cloud of Hadoop instead. That’s far from the truth. The data warehouse is here to stay as the center of the decision system of any company. Nevertheless, it need to evolve and meet the new challenges. In this post I will introduce a new data warehouse methodology : the modular data warehouse.

What is the modular data warehouse?

At it’s core, it’s a set of what is called module. Each module has a purpose and a SLA. For instance, you can have a module that contains retail sales, another that contains web sales and the last one combine those two. The last one is targeted on reporting usability and implement the business rules. The two first modules are more focused on simplifying the sources data (extracting only the needed part of the source system, normalization a bit, cleaning), i.e. to make you more productive when using the data and adding some data warehouse feature like CDC. Inside the modules, tables are usually following the star schema paradigm but it’s not mandatory. Different modules can be spread on different data stores systems.

Instead of giving more details (maybe it will be done in a further post), I will highlight some core focus of the design.

Change management

No work will last forever. Data warehouse are supposed to be used for decade. Yet I know more example of data warehouse lasting only a few years before a major rewrite than decade old data warehouse. The business can change very quickly and BI should be able to respond to it. At some point some parts of the data warehouse can’t be upgraded to reflect those changes, they become legacy. Just like there is bad banks, handled in extinction and shielded from interfering with goods banks, we should maintain a bad data warehouse that consists of legacy part we want to shut off in the near future.

Software heterogenity and specialization

In the era of big data and NoSQL, who can expect having only one data management system in the BI realm? The data warehouse should expand where customer need it. Some of your data could be in a MOLAP store but it’s still the data warehouse realm. Your website use your precalculated best next action recommendation? Maybe you should put those in a MongoDB data store. Your data scientist team love Hadoop? Clone some part of the data warehouse in their system so they are more eager to leverage it. Finally today columns store are common. You should leverage that.

Agility

BI should keep up with the pace of business. The velocity should be in weeks not in months or years. If the company expand in new territories or acquire other businesses, BI should expand it’s scope as quickly as possible. And yes, that means being messy sometimes. Sometimes you should create a technical debt and yes you will pay the price, maybe event rebuild the whole part from scratch next month. Working with modules allow to isolate the messy part. I wish good to luck to someone saying that the data will be ready in 3 months to a CFO in the middle of a yearly closing period.

Quality Assurance

Showing wrong numbers or empty reports destroy the needed confidence in BI. Test driven development was a good addition to software programming. For the data warehouse we should make the ETL as transactional as possible and running some sanity check before commiting it. That’s challenging because ETL aren’t transaction aware and it not easy to have transactions that span over many hours with huge volume of data and complex queries. We also want to give as much as possible to customers. Accounting doesn’t care is the weblog parsing crashed. Here again the isolation and atomicity of the module can be helpful.

Comparison with other data warehouse methods

Those features are not possible using the current data warehouse methodology as they exists, namely Imnon, Kimball and the data vault. The Imnon way takes way too much thinking before getting anything and is not ready for fast paced structural change in the business. Kimball is more efficient at the start but fail as well to handle legacy work. It’s difficult to migrate a star schema from one model to another, sometimes impossible (if all the data is not present in the first version). Data vault gives you a strong flexible foundation for your data but fail to factor you business rules. And if a change can’t be handled by your initial model well, then you are in trouble.

Let's stay in touch with the newsletter

Data science for data-driven startups

The call for a Modular Data Warehouse

Can I help ?

Subscribe to the mailing list

Search the site

Tags

Search the site

Last posts