Data architecture can be difficult to understand – when you’re new to the subject, it can be very hard to distinguish between hype and concept, between real things (with obscure names) and conceptual-sounding names which are actually marketingy gizmos. Over my time in consulting I keep coming back to a fish-based explanation, which I synthesised on a phone call with someone last year and have finally committed to paper.
What’s the difference between a data lake and a data warehouse?
This is the question I started with, and I find it’s the question everyone else starts with too. What is the actual fucking difference between these things? Is a data lake really unstructured? Computers aren’t actually capable of holding unstructured info, as they organise stuff in a hierarchical way, so how does that work?
Let’s start with the lake. The data lake is like a real lake. There are different kinds of fish, plants, shopping trolleys and old boots in it. If you’re a conservationist, you want to know about all the fish, and you probably want to remove the shopping trolleys and boots and shit. If you’re just going fishing, you’re probably after a particular kind of fish – but the next fisherwoman might be looking for a different fish, and both of you are hoping the conservationist has already taken out the boots and shopping trolleys.
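If it helps to see the lake in code, here is a minimal sketch. Everything in it is made up for illustration: `LAKE`, `clean` and `catch` are hypothetical names, and a real lake would be files in cloud storage rather than a Python list. The point is only that things land in the lake in whatever shape they arrive, and structure gets applied when someone reads them.

```python
import json

# Hypothetical raw "lake": records land as they arrive, in mixed shapes.
LAKE = [
    '{"kind": "fish", "species": "trout", "weight_g": 1200}',
    '{"kind": "fish", "species": "pike", "weight_g": 3400}',
    'old boot, size 9',        # junk: not even valid JSON
    '{"kind": "trolley"}',     # parses fine, but is not a fish
    '{"kind": "fish", "species": "trout", "weight_g": 800}',
]


def clean(lake):
    """The conservationist: keep only records that parse and describe a fish."""
    fish = []
    for raw in lake:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an old boot
        if isinstance(record, dict) and record.get("kind") == "fish":
            fish.append(record)
    return fish


def catch(lake, species):
    """The fisher: only interested in one species."""
    return [f for f in clean(lake) if f["species"] == species]


if __name__ == "__main__":
    print(len(clean(LAKE)), "fish in the lake")
    print(catch(LAKE, "trout"))
```

The conservationist and the fisherwoman read the same lake, they just ask different things of it.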
So, you’ve had a look at the lake and seen a fish you want. The first thing you do isn’t get your rod out and rent a boat. The first thing you do is check the smoke house for that fish, or for information on catching that fish from someone who has already tried. If there is fish available, great. If there isn’t, alright, at least you know which bait to use. The smoke house is the data warehouse – you might find there’s more than one warehouse in an organisation (especially with cloud deployments), because different parts of the organisation have different requirements of their data, even though it’s all from the same place (the lake). Similarly, different fish need to be smoked for different lengths of time, or even not at all.
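The "check the smoke house first" habit can be sketched like this. Again, the names (`WAREHOUSE`, `LAKE`, `get_fish`) and the data are invented for the example, and a real warehouse would be a database you query rather than a dictionary. The shape of the logic is what matters: use the prepared version if someone has already made it, and only go fishing in the lake if they haven’t.

```python
# Hypothetical smoke house: already prepared, one tidy summary per species.
WAREHOUSE = {
    "trout": {"species": "trout", "count": 2, "total_weight_g": 2000},
}

# Hypothetical lake: raw catches, one record per fish.
LAKE = [
    {"species": "trout", "weight_g": 1200},
    {"species": "pike", "weight_g": 3400},
    {"species": "trout", "weight_g": 800},
]


def get_fish(species):
    """Return (source, summary), checking the warehouse before the lake."""
    if species in WAREHOUSE:
        return "warehouse", WAREHOUSE[species]

    # Nothing prepared: do the work ourselves from the raw records.
    catches = [r for r in LAKE if r.get("species") == species]
    summary = {
        "species": species,
        "count": len(catches),
        "total_weight_g": sum(r["weight_g"] for r in catches),
    }
    return "lake", summary


if __name__ == "__main__":
    print(get_fish("trout"))  # already in the smoke house
    print(get_fish("pike"))   # has to be caught from the lake
```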
Let’s say your fish was in the smoke house. It’s smoked, but it is not deboned, and how do you do that? You might work in finance and have no idea how to debone a fish, or whether there are certain bits you should avoid. Or perhaps it’s a new fish (a langoustine or something) and you don’t want to prepare it because it’s still got the head on and you’re used to frozen prawns? Some fish are easy to prepare before cooking, but some aren’t. For the ones that aren’t, that’s where a fishmonger comes in. The fishmonger is a data engineer or analyst, and their tools tend to be special versions of things you have at home – in the way that any low/no code analytics platform often has some element which resembles Excel on steroids.
Once you or the fishmonger has prepared your fish, it’s time to put it in a dish. This might be with other ingredients (e.g. business context) or even other fish (for cross functional analyses). Preparing that recipe of insight and information can have many or few steps, and it can be done in a huge variety of ways – you might be making a spreadsheet, an invoice, a poster or a dashboard – sushi, ceviche, fish pie, or fish fingers. Similarly, there are a huge number of tools that you could use in your analytics kitchen, or that an analyst could use in their more cheffy analytics kitchen if it’s quite complicated. And hey, wouldn’t you rather go to a restaurant and ask an expert?
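As a rough sketch of "putting it in a dish": take the prepared figures, add the business context, and serve it in one particular format (here, a small CSV you could open as a spreadsheet). The cost centres, team names and function names are all invented for the example.

```python
import csv
import io

# Hypothetical prepared fish: cleaned figures from the warehouse.
PREPARED = [
    {"cost_centre": "CC1", "spend": 1200},
    {"cost_centre": "CC2", "spend": 450},
]

# Hypothetical other ingredient: business context the raw data doesn't carry.
CONTEXT = {"CC1": "Marketing", "CC2": "Facilities"}


def make_dish(prepared, context):
    """Combine the prepared figures with business context."""
    return [
        {"team": context.get(row["cost_centre"], "unknown"), "spend": row["spend"]}
        for row in prepared
    ]


def as_csv(rows):
    """Serve the dish as CSV text; a dashboard or invoice would be another recipe."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["team", "spend"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(as_csv(make_dish(PREPARED, CONTEXT)))
```

Same fish, different recipe: swap `as_csv` for something that draws a chart and you have a dashboard instead of a spreadsheet.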
Finally, maybe you’re making a nice dish for yourself, or maybe you’ll be sharing it with others. The last thing you do after preparing your meal is assess how nice the meal needs to look. Are you on Masterchef? Is your housemate a food writer? Our presentation methods reflect and relate back to the purposes our analysis is trying to serve. In the data pipeline, this step tends to happen on visualisation servers such as Tableau Server, PowerBI Server, Mapbox Server and the like, because this is where you’re sharing your analysis with others in the business.
Neither my sushi nor my dashboards ever come out looking as nice as that.