Cookie Settings

By clicking "Agree," you consent to the storage of cookies on your device to enhance site navigation, analyze site usage, and support our marketing efforts. For more information, please refer to our Privacy Policy.

Blog

Asked too often #1: Why can't I just work with raw data?

Why it is a bad idea to query raw data and how it's done best instead.
von
Mike Kamysz
26.10.2024 13:53
7
minutes to read
Share this post

Why can't I just query raw data?

Let’s get one thing straight: we, analytics engineers,  love our jobs and solving problems with clients, but some questions we hear day in and day out are just plain exhausting. To save ourselves from answering the same thing five times a day, here's a handy blog series that answers these questions thoroughly and to which we can start referring to instead.

Mike Kamysz, Data Engineer here at The Data Institute, and Bo Lemmers, Analytics Engineer at Xebia kick off the series with: “Why can’t I just query the raw data?” Oh, sweet summer child. We get it — you’re eager, you’re curious, and you want your answers now. But diving headfirst into raw data without a plan is like trying to find a needle in a haystack, blindfolded, in a blizzard. So, let’s talk about why this is a bad idea and why we have shiny, structured data models to save us from chaos.

The Appeal of Querying Raw Data: What People Get Right

Before we dive into the big no-no of querying raw data, let’s be fair. We get why people ask this question. It seems like a good idea. You’re sitting on a goldmine of data—why not access it directly, in its rawest form? Here’s the rationale behind the request:

1. Granularity:

Raw data is the most detailed version of your dataset, capturing every event, transaction, and interaction in its purest form. If you want a full, unfiltered view of the data—whether it's transactional data, user clicks, or sensor logs—raw data is the ultimate source of detail.

2. Flexibility:

There’s no perceived form restricting you. When you query raw data, it’s like having a blank canvas. You can create your own metrics, dimensions, and transformations on the fly. No predefined logic or business rules to get in the way—just you, the data, and your SQL editor.

3. Speed to Insight:

When you’ve got a burning question, waiting for the data team to add a new column or table to the warehouse can feel like an eternity. Why not cut out the middleman? Querying raw data gives the impression of an immediate path to insights.    

The Hidden Downsides of Querying Raw Data

Now, let’s talk about why querying raw data isn’t all it’s cracked up to be. On the surface, it seems empowering, but behind the scenes, it creates all kinds of headaches—headaches that your friendly analytics engineers (us!) have to deal with. Here’s why:

Data Quality: Garbage In, Garbage Out

Raw data is unprocessed. It’s exactly what it sounds like—raw, uncooked, unrefined, full of noise, errors, and missing values. Before any serious analysis can be done, raw data typically needs to go through data cleaning and transformation processes. Here’s the reality:

1. Inconsistent Formats

Dates in different formats, varying units (e.g., pounds vs. kilograms), and inconsistent categorizations (e.g., “NL” vs. “Netherlands”, or "NL" vs. "nl") are just a few of the format issues that crop up in raw data. Every time you query raw data, you risk misinterpreting or misaggregating your results.

2. Duplicate or Missing Records

Raw datasets often have duplicate entries or missing values. Without a proper process to clean and validate the data, you’ll spend most of your time handling these issues manually, leading to incomplete or false analyses.

When different teams query raw data without standardisation, they often handle these inconsistencies differently, leading to contradictory results: one team might treat missing values as zero, while another ignores them entirely, and yet others may take the previously known data point, creating an inconsistent understanding of the data.

Performance and Cost: The Hidden Burden

While querying raw data seems faster, the hidden costs in terms of both performance and expense are significant, especially in cloud-based environments like Snowflake, BigQuery etc. Here’s why querying the raw data can be a performance and cost nightmare:

1. Large, Unoptimized Tables

Raw datasets tend to be big. When you query unprocessed data, you’re often forcing the database to scan through billions of rows and columns, much of which may be irrelevant to your actual analysis. 

2. Expensive Cloud Costs

In cloud data warehouses, every query has a cost. Since raw data may not be optimised for efficient querying, each query possibly consumes more computational power than necessary. Aggregating raw data on the fly, over long periods, is resource-intensive, leading to added costs, making your query bills skyrocket over time.

3. Repeated Transformations

Because raw data lacks pre-built logic or calculations, every query requires you to re-execute complex transformations like joins, filters, and aggregations. Not only does this make each query slower and more expensive, but it also wastes valuable compute resources repeating the same operations over and over again.

Reinventing the Wheel: Duplicating Efforts, Inconsistent Results

When multiple teams or individuals query raw data, each is effectively reinventing the wheel. This leads to duplicated efforts and inconsistent business logic across the organisation. Without a shared data model, each team has to build its own transformations, metrics, and calculations, which causes a number of issues:

1. Duplicate Work

Every team ends up creating their own transformations to clean and aggregate raw data. For example, one analyst might spend hours creating a transformation to calculate monthly sales, only for another team to do the exact same thing independently. This duplication of effort wastes time and resources across the organisation.

2. Inconsistent Business Logic

Without standardised metrics, different teams will often define key business metrics (e.g., "monthly active users" or "churn rate") in slightly different ways. This leads to confusion and misaligned decision-making, as different reports tell conflicting stories about the same data.

The Case for a well-designed data warehouse

So, if querying raw data is such a bad idea, what’s the alternative? Enter a Kimball-style, Data Vault, or any well-designed data warehouse. This is where raw data gets transformed into something clean, structured, and optimised for efficient querying. Rather than everyone reinventing the wheel, a data warehouse brings order to the chaos. Here’s why this approach is not just useful, but essential for anyone serious about analytics.

Single Source of Truth

A common data model ensures that there’s a single, standardised version of the truth. Metrics, key performance indicators (KPIs), and business definitions are all pre-defined and consistent across the organisation. No more arguing over whose version of the monthly revenue report is correct—everyone’s working off the same data.

  • Predefined Metrics: In a Kimball-style model, important metrics (e.g., revenue, customer counts) are defined and calculated once. This eliminates the risk of contradicting results and ensures that everyone is using the same definitions.
  • Centralised Data Logic: All the complex business logic—like how we define "active users" or "churn"—is encoded into the data model. This means analysts don’t have to recreate these calculations every time they query the data.

Pre-Aggregated and Optimised for Performance

In a data warehouse, raw data is transformed into tables that are optimised for querying. For instance in a Kimball style data warehouse, fact tables store transactional data (e.g., sales transactions), while dimension tables store descriptive information (e.g., customer demographics). This design supports fast, efficient queries.

  • Pre-Aggregation: Data that’s commonly used in aggregate (e.g., daily or monthly sales) can be pre-aggregated, meaning analysts don’t have to compute sums or averages over billions of rows every time they run a query.
  • Optimised for Queries: Dimension and fact tables are designed to minimise expensive table scans. Queries that would take hours on raw data can be completed in seconds.

Data Quality Assurance

A foundational data model ensures data quality by acting as a gatekeeper for clean, well-structured data. Ideally, it goes through several automated tests to check both code readability and data accuracy. These tests help catch errors, ensure consistency, and maintain reliability, making sure the model performs well and the data remains trustworthy.

Conclusion

While there are valid cases for querying raw data, think of data science, validation purposes, or other specific needs, for analytics it often leads to more problems than it solves. From data quality issues and inconsistent metrics to performance bottlenecks and unnecessary costs. The solution? A well-structured, governed data model that centralises business logic, ensures data consistency, and optimises performance. Trust us—once you see the benefits of a curated data model, you won’t want to go back to raw data.

Is your organisation facing the challenge to implement best practices in the are of data modelling? We're here to help. Get in touch and we'll get back to you soon.

Abstrakte Form eines Pfades

Follow us on LinkedIn

Don't miss out on updates and insights

Data news for pros

Want to know more? Then subscribe to our newsletter! Regular news from the data world about new developments, tools, best practices and events!

Abstrakte Form eines Pfades des Data Institute

Follow us on LinkedIn

Don't miss out on updates and insights

Abstrakter Pfad des Data Institutes

Which services fit this topic
?

<svg width=" 100%" height=" 100%" viewBox="0 0 62 62" fill="none" xmlns="http://www.w3.org/2000/svg"> <g clip-path="url(#clip0_5879_2165)"> <path d="M21.3122 46.5H40.6872V50.375H21.3122V46.5ZM25.1872 54.25H36.8122V58.125H25.1872V54.25ZM30.9997 3.875C25.8611 3.875 20.933 5.91629 17.2995 9.54981C13.666 13.1833 11.6247 18.1114 11.6247 23.25C11.4937 26.0658 12.0331 28.8726 13.1985 31.4392C14.364 34.0059 16.1222 36.2592 18.3285 38.0138C20.266 39.8156 21.3122 40.8425 21.3122 42.625H25.1872C25.1872 39.06 23.0366 37.0644 20.9441 35.1462C19.1332 33.7595 17.69 31.9499 16.7408 29.8759C15.7917 27.802 15.3655 25.5269 15.4997 23.25C15.4997 19.1391 17.1327 15.1967 20.0396 12.2898C22.9464 9.38303 26.8889 7.75 30.9997 7.75C35.1106 7.75 39.0531 9.38303 41.9599 12.2898C44.8667 15.1967 46.4997 19.1391 46.4997 23.25C46.6317 25.5286 46.2025 27.8047 45.2499 29.8788C44.2973 31.9529 42.8504 33.7616 41.036 35.1462C38.9628 37.0837 36.8122 39.0213 36.8122 42.625H40.6872C40.6872 40.8425 41.7141 39.8156 43.671 37.9944C45.8757 36.2428 47.6331 33.9929 48.7986 31.4295C49.964 28.8662 50.5042 26.0628 50.3747 23.25C50.3747 20.7056 49.8736 18.1862 48.8999 15.8355C47.9262 13.4848 46.499 11.3489 44.6999 9.54981C42.9008 7.75067 40.7649 6.32352 38.4142 5.34983C36.0635 4.37615 33.5441 3.875 30.9997 3.875Z" fill="currentColor"/> </g> <defs> <clipPath id="clip0_5879_2165"> <rect width="62" height="62" fill="currentColor"/> </clipPath> </defs> </svg>

Data Strategy

When what happens how and why — that explains the data strategy.

<svg width=" 100%" height=" 100%" viewBox="0 0 62 62" fill="none" xmlns="http://www.w3.org/2000/svg"> <g clip-path="url(#clip0_5994_7571)"> <path d="M52.3125 46.5C51.4494 46.5039 50.5984 46.7026 49.8228 47.0813L41.4916 38.75H34.875V42.625H39.8854L47.0832 49.8228C46.7043 50.5984 46.505 51.4494 46.5 52.3125C46.5 53.4621 46.8409 54.5859 47.4796 55.5418C48.1183 56.4976 49.0261 57.2426 50.0882 57.6826C51.1502 58.1225 52.3189 58.2376 53.4465 58.0133C54.574 57.789 55.6097 57.2355 56.4226 56.4226C57.2355 55.6097 57.789 54.574 58.0133 53.4465C58.2376 52.319 58.1225 51.1503 57.6825 50.0882C57.2426 49.0261 56.4976 48.1183 55.5417 47.4796C54.5859 46.8409 53.4621 46.5 52.3125 46.5ZM52.3125 54.25C51.9293 54.25 51.5547 54.1364 51.2361 53.9235C50.9175 53.7106 50.6691 53.408 50.5225 53.054C50.3758 52.6999 50.3375 52.3104 50.4122 51.9345C50.487 51.5587 50.6715 51.2135 50.9425 50.9425C51.2134 50.6715 51.5587 50.487 51.9345 50.4122C52.3103 50.3375 52.6999 50.3758 53.0539 50.5225C53.408 50.6691 53.7106 50.9175 53.9235 51.2361C54.1364 51.5547 54.25 51.9293 54.25 52.3125C54.25 52.8264 54.0459 53.3192 53.6825 53.6825C53.3192 54.0459 52.8264 54.25 52.3125 54.25ZM52.3125 25.1875C51.1143 25.1911 49.9465 25.5655 48.9696 26.2593C47.9927 26.9531 47.2546 27.9323 46.8565 29.0625H34.875V32.9375H46.8565C47.2134 33.9395 47.8389 34.8242 48.6646 35.4948C49.4903 36.1653 50.4845 36.5961 51.5384 36.7399C52.5923 36.8837 53.6655 36.735 54.6407 36.3101C55.6158 35.8852 56.4554 35.2005 57.0678 34.3307C57.6801 33.4609 58.0416 32.4396 58.1127 31.3783C58.1838 30.317 57.9618 29.2565 57.471 28.3128C56.9802 27.3691 56.2395 26.5785 55.3297 26.0273C54.42 25.4761 53.3762 25.1856 52.3125 25.1875ZM52.3125 32.9375C51.9293 32.9375 51.5547 32.8239 51.2361 32.611C50.9175 32.3981 50.6691 32.0955 50.5225 31.7415C50.3758 31.3874 50.3375 30.9979 50.4122 30.622C50.487 30.2462 50.6715 29.9009 50.9425 29.63C51.2134 29.359 51.5587 29.1745 51.9345 29.0997C52.3103 29.025 52.6999 29.0633 53.0539 29.21C53.408 29.3566 53.7106 29.605 53.9235 29.9236C54.1364 30.2422 54.25 30.6168 54.25 31C54.25 31.5139 54.0459 32.0067 53.6825 32.37C53.3192 32.7334 52.8264 32.9375 52.3125 32.9375ZM52.3125 3.875C50.7714 3.87654 49.2939 4.48942 48.2041 5.57914C47.1144 6.66887 46.5015 8.1464 46.5 9.6875C46.5066 10.6157 46.738 11.5284 47.1742 12.3477L39.9048 19.375H34.875V23.25H41.4702L49.9953 15.0118C50.7872 15.3571 51.6461 15.5215 52.5096 15.493C53.373 15.4644 54.2193 15.2438 54.9867 14.8469C55.7541 14.4501 56.4234 13.8872 56.9458 13.1991C57.4682 12.511 57.8306 11.715 58.0065 10.8692C58.1825 10.0234 58.1677 9.14899 57.9631 8.30963C57.7585 7.47027 57.3694 6.68709 56.8239 6.01711C56.2785 5.34712 55.5905 4.8072 54.8101 4.43664C54.0297 4.06608 53.1764 3.87421 52.3125 3.875ZM52.3125 11.625C51.9293 11.625 51.5547 11.5114 51.2361 11.2985C50.9175 11.0856 50.6691 10.783 50.5225 10.429C50.3758 10.0749 50.3375 9.68535 50.4122 9.30951C50.487 8.93368 50.6715 8.58845 50.9425 8.31748C51.2134 8.04652 51.5587 7.86199 51.9345 7.78723C52.3103 7.71247 52.6999 7.75084 53.0539 7.89749C53.408 8.04413 53.7106 8.29247 53.9235 8.61109C54.1364 8.92971 54.25 9.3043 54.25 9.6875C54.25 10.2014 54.0459 10.6942 53.6825 11.0575C53.3192 11.4209 52.8264 11.625 52.3125 11.625Z" fill="currentColor"/> <path d="M34.875 11.625H38.75V7.75H34.875C33.7709 7.75369 32.6804 7.99469 31.6775 8.45667C30.6747 8.91866 29.7829 9.59082 29.0625 10.4276C28.3421 9.59082 27.4503 8.91866 26.4475 8.45667C25.4446 7.99469 24.3541 7.75369 23.25 7.75H21.3125C16.6894 7.75513 12.257 9.59393 8.98799 12.863C5.71893 16.132 3.88013 20.5644 3.875 25.1875V36.8125C3.88013 41.4356 5.71893 45.868 8.98799 49.137C12.257 52.4061 16.6894 54.2449 21.3125 54.25H23.25C24.3541 54.2463 25.4446 54.0053 26.4475 53.5433C27.4503 53.0813 28.3421 52.4092 29.0625 51.5724C29.7829 52.4092 30.6747 53.0813 31.6775 53.5433C32.6804 54.0053 33.7709 54.2463 34.875 54.25H38.75V50.375H34.875C33.8476 50.374 32.8626 49.9654 32.1361 49.2389C31.4096 48.5124 31.001 47.5274 31 46.5V15.5C31.001 14.4726 31.4096 13.4876 32.1361 12.7611C32.8626 12.0346 33.8476 11.626 34.875 11.625ZM23.25 50.375H21.3125C18.0545 50.3692 14.9073 49.1916 12.4457 47.0572C9.9841 44.9229 8.37242 41.9743 7.905 38.75H11.625V34.875H7.75V27.125H13.5625C15.1036 27.1235 16.5811 26.5106 17.6709 25.4209C18.7606 24.3311 19.3735 22.8536 19.375 21.3125V17.4375H15.5V21.3125C15.5 21.8264 15.2959 22.3192 14.9325 22.6825C14.5692 23.0459 14.0764 23.25 13.5625 23.25H7.905C8.37242 20.0257 9.9841 17.0771 12.4457 14.9428C14.9073 12.8084 18.0545 11.6308 21.3125 11.625H23.25C24.2774 11.626 25.2624 12.0346 25.9889 12.7611C26.7154 13.4876 27.124 14.4726 27.125 15.5V23.25H23.25V27.125H27.125V34.875H23.25C21.7089 34.8765 20.2314 35.4894 19.1416 36.5791C18.0519 37.6689 17.439 39.1464 17.4375 40.6875V44.5625H21.3125V40.6875C21.3125 40.1736 21.5166 39.6808 21.88 39.3175C22.2433 38.9541 22.7361 38.75 23.25 38.75H27.125V46.5C27.124 47.5274 26.7154 48.5124 25.9889 49.2389C25.2624 49.9654 24.2774 50.374 23.25 50.375Z" fill="currentColor"/> </g> <defs> <clipPath id="clip0_5994_7571"> <rect width="62" height="62" fill="currentColor"/> </clipPath> </defs> </svg>

Machine Learning

Not a gimmick, but real added value

Abstrakte Form eines Pfades

Stay up to date

Subscribe to our newsletter

Data news for pros

Want to know more? Then subscribe to our newsletter! Regular news from the data world about new developments, tools, best practices and events!

Abstrakte Form eines Pfades des Data Institute

Stay up to date

Subscribe to our newsletter

Abstrakter Pfad des Data Institutes