Data Warehouses have been (probably) around since the invention of databases, but they are still one of the most difficult data storages to master. Most Data Warehouses would require you to copy the data into their data model, which is normally optimised for certain queries. Most companies, however, tend to think that you can run anything on the data warehouse they have. They would run all of their reports on that warehouse, add so many more columns, even hook up other production systems to receive updates from it. I mean, why not? The data is there, so why not query it, right? … Right?
That’s when the data warehouses would grow out of size, become underperforming, require scaling up and become costly. They would become a black hole in the company’s budget, and most of the time people would try to keep it alive and breathing, rather than getting value out of that investment.
Let’s go through some of the biggest mistakes that people tend to make.
1. Your Warehouse Is Not Your Master Data Storage
This is probably the biggest mistake. People tend to invest quite a lot into warehouses, and the first thing they do is to collate multiple sources together to create master datasets. That can include the reference data like product codes, sales types, customer types, etc. but it also includes your customers, orders, employees, dealerships, shops, etc.
This is all good and well, but when you only keep your master data in your database, it attracts all the predators. Everyone who needs a collated dataset (combined customers, all your shops & dealers, even a list of countries and cities) will come knocking on your door to ask for access to your warehouse.
As a rule of thumb, don’t let your warehouse be your Master Data Store. If you combined all the master data in your warehouse already, fine, push it also to somewhere else where people can read from. Don’t let the expectations derail your warehouse from its purpose.
2. Don’t Treat Your Warehouse Like an ETL Tool
Historically, most integrations are implemented to be system-to-system, and it’s especially true in data projects. And because the warehouse projects tend to be a direct map from source systems to the warehouses, people tend to load it into their favourite warehouse environment and then bring it into their proper warehouse schema. Although this is not an entirely bad approach, this adds another responsibility and complexity to the warehouse, which is already challenging on its own.
Remember, “Just because you can, doesn’t mean you should”. Do your ETL somewhere else, use other staging mediums. Have your warehouse be your last stop in your ETL journey, not the means.
3. Create a Warehouse to Serve a Single Purpose
A warehouse is not a Swiss Army knife, and don’t force it to be. Design it to solve a single problem: A series of reports for your sales department, or analysis of past customer behaviour, or sales order projections for the next 5 years. You can group some of these questions and answer them in one shot, but don’t force it to be an omnipotent entity.
Remember the story of Deep Thought from The Hitchhiker’s Guide to the Galaxy: If you try to build a warehouse and ask it “What’s the meaning of life?”, it’ll give you the answer “42” after several millions of years, which would only point to one thing: You’re not understanding the complexity of what you’re asking.
Design your warehouses to serve a single purpose. It’ll do that and will do it nicely.
4. You Can Have Multiple Warehouses
There’s nothing preventing you from having multiple warehouses. True, it’s a costly endeavour, but only if you overcomplicate them with the mistakes mentioned above. When designed properly for a specific purpose, you’ll realise that warehouses are not that much expensive than other solutions. Because the cost wasn’t necessarily the solution itself; it was the cost of keeping to operational.
If you can keep most of your ETL processes independent of your warehouses and make them publish to multiple targets, you can actually build multiple warehouses easily. If you have kept your master data somewhere accessible, you can have your warehouses import it effortlessly.
Then you can craft multiple warehouses that answer different questions in different ways. Do you need to calculate sales projections? Build it on your sales-optimised warehouse, based on your Sales division’s requirements. Do you need to create specific reports that your operations team need? Build a specific warehouse that can answer their unique questions.
Treat your warehouses as data products, not expensive supercomputers.
5. Don’t Use Your Warehouse for Transactional Integrations
Maybe this could’ve gone without saying, but I couldn’t resist. I’ve seen enough people trying to make their warehouses become part of critical paths of production systems. Using a warehouse to create sales reports that get refreshed hourly is one thing, having a transactional system running queries to get orders received in the last hour is another.
Think of it this way: If you’re Amazon, would you let people come into your warehouses to shop around? Or would you direct them to retailers to do that kind of shopping, and only ship the online orders through your warehouses?
Final Thoughts
I know this article ended up a bit vague to keep it short, but hopefully, it managed to explain some of these mistakes enough. I hope to go into more details in other articles and give more concrete examples, even talk about my experiences and own mistakes.
Don't hesitate to drop a comment below and let’s turn this into a discussion.