Asked too often #1: Why can't I just work with raw data?
Why can't I just query raw data?
Let’s get one thing straight: we, analytics engineers, love our jobs and solving problems with clients, but some questions we hear day in and day out are just plain exhausting. To save ourselves from answering the same thing five times a day, here’s a handy blog series that answers these questions thoroughly, so we can start referring to it instead.
Mike Kamysz, Data Engineer here at The Data Institute, and Bo Lemmers, Analytics Engineer at Xebia, kick off the series with: “Why can’t I just query the raw data?” Oh, sweet summer child. We get it: you’re eager, you’re curious, and you want your answers now. But diving headfirst into raw data without a plan is like trying to find a needle in a haystack, blindfolded, in a blizzard. So, let’s talk about why this is a bad idea and why we have shiny, structured data models to save us from chaos.
The Appeal of Querying Raw Data: What People Get Right
Before we dive into the big no-no of querying raw data, let’s be fair. We get why people ask this question. It seems like a good idea. You’re sitting on a goldmine of data—why not access it directly, in its rawest form? Here’s the rationale behind the request:
1. Granularity:
Raw data is the most detailed version of your dataset, capturing every event, transaction, and interaction in its purest form. If you want a full, unfiltered view of the data—whether it's transactional data, user clicks, or sensor logs—raw data is the ultimate source of detail.
2. Flexibility:
No predefined structure holds you back. When you query raw data, it’s like having a blank canvas. You can create your own metrics, dimensions, and transformations on the fly, with no predefined logic or business rules to get in the way: just you, the data, and your SQL editor.
3. Speed to Insight:
When you’ve got a burning question, waiting for the data team to add a new column or table to the warehouse can feel like an eternity. Why not cut out the middleman? Querying raw data gives the impression of an immediate path to insights.
The Hidden Downsides of Querying Raw Data
Now, let’s talk about why querying raw data isn’t all it’s cracked up to be. On the surface, it seems empowering, but behind the scenes, it creates all kinds of headaches—headaches that your friendly analytics engineers (us!) have to deal with. Here’s why:
Data Quality: Garbage In, Garbage Out
Raw data is unprocessed. It’s exactly what it sounds like—raw, uncooked, unrefined, full of noise, errors, and missing values. Before any serious analysis can be done, raw data typically needs to go through data cleaning and transformation processes. Here’s the reality:
1. Inconsistent Formats
Dates in different formats, varying units (e.g., pounds vs. kilograms), and inconsistent categorizations (e.g., “NL” vs. “Netherlands”, or "NL" vs. "nl") are just a few of the format issues that crop up in raw data. Every time you query raw data, you risk misinterpreting or misaggregating your results.
2. Duplicate or Missing Records
Raw datasets often have duplicate entries or missing values. Without a proper process to clean and validate the data, you’ll spend most of your time handling these issues manually, leading to incomplete or false analyses.
When different teams query raw data without standardisation, they often handle these inconsistencies differently, leading to contradictory results: one team might treat missing values as zero, another might ignore them entirely, and a third might carry forward the last known value, leaving the organisation with no consistent understanding of its own data.
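To make both problems concrete, here is a minimal sketch of the cleaning a shared staging layer would do once, instead of every analyst improvising it per query. The table and column names (raw.orders, loaded_at, weight_lbs, and so on) are illustrative assumptions, not a reference to any real schema:

```sql
-- Hypothetical staging query: names are illustrative.
-- It shows the kind of cleaning every raw-data query would otherwise repeat.
with deduplicated as (
    select *,
           row_number() over (
               partition by order_id
               order by loaded_at desc
           ) as row_num
    from raw.orders          -- assumed raw landing table
)
select
    order_id,
    -- standardise country codes ("NL", "nl", "Netherlands" all become 'NL')
    case
        when upper(trim(country)) in ('NL', 'NETHERLANDS') then 'NL'
        else upper(trim(country))
    end as country_code,
    -- normalise units: assume weight arrives in pounds, convert to kilograms
    weight_lbs * 0.453592 as weight_kg,
    -- make the missing-value rule explicit instead of leaving it to each analyst
    coalesce(order_amount, 0) as order_amount
from deduplicated
where row_num = 1             -- keep only the latest version of each order
```

Every query against the raw table would otherwise have to re-derive all of these decisions, and each analyst would make them slightly differently.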
Performance and Cost: The Hidden Burden
While querying raw data seems faster, the hidden costs in terms of both performance and expense are significant, especially in cloud-based environments like Snowflake or BigQuery. Here’s why querying raw data can be a performance and cost nightmare:
1. Large, Unoptimized Tables
Raw datasets tend to be big. When you query unprocessed data, you’re often forcing the database to scan through billions of rows and dozens of columns, much of it irrelevant to your actual analysis.
2. Expensive Cloud Costs
In cloud data warehouses, every query has a cost. Since raw data is rarely optimised for efficient querying, each query tends to consume more compute than necessary, and aggregating raw data on the fly, over long periods, is especially resource-intensive. Those inefficiencies make your query bills skyrocket over time.
3. Repeated Transformations
Because raw data lacks pre-built logic or calculations, every query requires you to re-execute complex transformations like joins, filters, and aggregations. Not only does this make each query slower and more expensive, but it also wastes valuable compute resources repeating the same operations over and over again.
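As a rough illustration, compare what an ad-hoc query against raw tables has to do with what a query against a materialised model does. All names here (raw.events, raw.users, analytics.monthly_active_users) are hypothetical:

```sql
-- On the fly: the same join, filter, and aggregation re-executed over the
-- full raw history on every run
select
    date_trunc('month', e.event_timestamp) as month,
    count(distinct e.user_id)              as monthly_active_users
from raw.events e
join raw.users u on u.user_id = e.user_id   -- big join, repeated every time
where u.is_internal = false                  -- filter logic, repeated every time
group by 1;

-- Materialised once in the warehouse, the same answer becomes a cheap scan
-- of a small, pre-aggregated table
select month, monthly_active_users
from analytics.monthly_active_users;
```

The first query pays the full cost of the join and aggregation on every execution; the second pays it once, when the model is built.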
Reinventing the Wheel: Duplicating Efforts, Inconsistent Results
When multiple teams or individuals query raw data, each is effectively reinventing the wheel. This leads to duplicated efforts and inconsistent business logic across the organisation. Without a shared data model, each team has to build its own transformations, metrics, and calculations, which causes a number of issues:
1. Duplicate Work
Every team ends up creating their own transformations to clean and aggregate raw data. For example, one analyst might spend hours creating a transformation to calculate monthly sales, only for another team to do the exact same thing independently. This duplication of effort wastes time and resources across the organisation.
2. Inconsistent Business Logic
Without standardised metrics, different teams will often define key business metrics (e.g., "monthly active users" or "churn rate") in slightly different ways. This leads to confusion and misaligned decision-making, as different reports tell conflicting stories about the same data.
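For example, here is how two teams might plausibly define “monthly active users” against the same raw tables. Neither query is wrong in isolation, but their numbers will never match (table and column names are assumptions for illustration):

```sql
-- Team A: any event counts as activity
select count(distinct user_id) as monthly_active_users
from raw.events
where event_timestamp >= date_trunc('month', current_date);

-- Team B: only logins count, and internal users are excluded
select count(distinct e.user_id) as monthly_active_users
from raw.events e
join raw.users u on u.user_id = e.user_id
where e.event_type = 'login'
  and u.is_internal = false
  and e.event_timestamp >= date_trunc('month', current_date);
```

Both results will be presented as the same metric, and leadership is left to guess which one to trust.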
The Case for a Well-Designed Data Warehouse
So, if querying raw data is such a bad idea, what’s the alternative? Enter the well-designed data warehouse, whether it follows Kimball-style dimensional modelling, Data Vault, or another sound approach. This is where raw data gets transformed into something clean, structured, and optimised for efficient querying. Rather than everyone reinventing the wheel, a data warehouse brings order to the chaos. Here’s why this approach is not just useful, but essential for anyone serious about analytics.
Single Source of Truth
A common data model ensures that there’s a single, standardised version of the truth. Metrics, key performance indicators (KPIs), and business definitions are all pre-defined and consistent across the organisation. No more arguing over whose version of the monthly revenue report is correct—everyone’s working off the same data.
- Predefined Metrics: In a Kimball-style model, important metrics (e.g., revenue, customer counts) are defined and calculated once. This eliminates the risk of contradicting results and ensures that everyone is using the same definitions.
- Centralised Data Logic: All the complex business logic—like how we define "active users" or "churn"—is encoded into the data model. This means analysts don’t have to recreate these calculations every time they query the data.
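Here is a minimal sketch of what centralising that logic can look like, again using the hypothetical raw.events and raw.users tables: the definition of an active user lives in one model, and everyone queries the model rather than re-deriving the rule:

```sql
-- The agreed definition of "active user", encoded once
create or replace view analytics.active_users as
select
    u.user_id,
    date_trunc('month', e.event_timestamp) as activity_month
from raw.events e
join raw.users u on u.user_id = e.user_id
where e.event_type in ('login', 'purchase')  -- the shared activity rule
  and u.is_internal = false                  -- the shared exclusion rule
group by 1, 2;

-- Analysts consume the definition instead of rewriting it
select activity_month, count(*) as monthly_active_users
from analytics.active_users
group by 1;
```

Change the definition in one place and every report downstream picks it up.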
Pre-Aggregated and Optimised for Performance
In a data warehouse, raw data is transformed into tables that are optimised for querying. For instance, in a Kimball-style data warehouse, fact tables store transactional data (e.g., sales transactions), while dimension tables store descriptive information (e.g., customer demographics). This design supports fast, efficient queries.
- Pre-Aggregation: Data that’s commonly used in aggregate (e.g., daily or monthly sales) can be pre-aggregated, meaning analysts don’t have to compute sums or averages over billions of rows every time they run a query.
- Optimised for Queries: Dimension and fact tables are designed to minimise expensive table scans. Queries that would take hours on raw data can be completed in seconds.
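To give a feel for what querying such a model looks like, here is a typical star-schema query with illustrative Kimball-style names (fact_sales, dim_date, dim_customer). The fact table carries the measures; the dimensions carry the descriptions:

```sql
-- Monthly sales per customer segment, straight off the star schema
select
    d.calendar_month,
    c.customer_segment,
    sum(f.sales_amount) as total_sales
from fact_sales f
join dim_date d     on d.date_key = f.date_key
join dim_customer c on c.customer_key = f.customer_key
where d.calendar_year = 2024
group by 1, 2;
```

The same question asked of raw data would mean parsing timestamps, deduplicating transactions, and reconstructing customer attributes on every run.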
Data Quality Assurance
A foundational data model ensures data quality by acting as a gatekeeper for clean, well-structured data. Ideally, it goes through several automated tests to check both code readability and data accuracy. These tests help catch errors, ensure consistency, and maintain reliability, making sure the model performs well and the data remains trustworthy.
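As a rough sketch of what such tests can look like, here are three checks written as plain SQL assertions; tools like dbt generate similar queries from declarative test configurations. The table names are the hypothetical ones used above, and each test passes when it returns zero rows:

```sql
-- Uniqueness: no duplicate keys in the customer dimension
select customer_key, count(*) as occurrences
from dim_customer
group by customer_key
having count(*) > 1;

-- Not null: every fact row must reference a customer
select *
from fact_sales
where customer_key is null;

-- Referential integrity: no orphaned fact rows
select f.*
from fact_sales f
left join dim_customer c on c.customer_key = f.customer_key
where c.customer_key is null;
```

Run on every build, checks like these stop bad data before it ever reaches a dashboard.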
Conclusion
While there are valid cases for querying raw data, think of data science, validation purposes, or other specific needs, for analytics it often leads to more problems than it solves: data quality issues, inconsistent metrics, performance bottlenecks, and unnecessary costs. The solution? A well-structured, governed data model that centralises business logic, ensures data consistency, and optimises performance. Trust us: once you see the benefits of a curated data model, you won’t want to go back to raw data.
Is your organisation facing the challenge of implementing best practices in the area of data modelling? We’re here to help. Get in touch and we’ll get back to you soon.