Closed Beta: Aggregate Awareness (with PreAggregate)

datbth · May 21, 2024, 5:26am

Hi all,

In this post, we would like to proudly introduce our upcoming feature: Aggregate Awareness.

Background

Imagine you have a fact table called orders that contains data of every order made on your e-commerce platform. Throughout your 5 years in business, the table orders has accumulated billions of rows.

Now, in order to answer questions like “Total number of items ordered in each year”, the database has to crunch through those billions of rows and aggregate the total. This process may take several minutes to complete and consume lots of database resources (typically memory and CPU)!
On the other hand, towards business users, this is a very simple question. In their eyes, they just need 5 numbers, 1 for each year. “Why does it take so long to get 5 numbers?!”

In data engineering and analytics, “pre-aggregation” is a well-known tool/technique to serve data questions like this faster.
Essentially, the data engineer can “pre-aggregate” the orders table into total_quantity_by_month or total_quantity_by_year. Such “pre-aggregation” is done ahead of the business users’ usage, so that when the business users actually ask the question, the system can just process total_quantity_by_month (~ 60 rows) or total_quantity_by_year (~5 rows) and pull out the answer immediately!

Challenges

In the context of self-serve analytics where the business users use the Business Intelligence platform themselves to “query” or “explore” the data, it is actually quite challenging to make sure those users are correctly using the pre-aggregated tables like total_quantity_by_month or total_quantity_by_year instead of orders. This raises the need for the BI platform to be “aggregation-aware” and be able to automatically use the appropriate pre-aggregates for the business users.

Additionally, it requires significant effort and skills for the data engineers to maintain the pre-aggregates, keeping them accurate and up-to-date.

Aggregate Awareness

To solve the above challenges, Holistics is developing Aggregate Awareness that consists of these sets of features:

Aggregate Awareness: automatically identify eligible pre-aggregates and substitute them into queries, making the queries use smaller pre-aggregated tables while still producing accurate results.
Pre-Aggregate Definition: allow analytic engineers to conveniently define the pre-aggregates
Pre-Aggregate Persistence: allow analytic engineers to conveniently and accurately persist/materialize the pre-aggregates
Pre-Aggregate Suggestions (later): automatically suggest relevant pre-aggregates to speed up your explorations and reports.

Please take a first look at the usage of Aggregate Awareness via this tutorial: Using Aggregate Awareness | Holistics Docs (4.0).

Here are some snippets taken from the above tutorial:

pre_aggregate aggregated_quantity_by_month {
  dimension pa_created_at {
    for: ref('demo_orders', 'created_at')
    time_granularity: 'month'
  }
  measure pa_sum_quantity {
    for: ref('demo_order_items', 'quantity')
    aggregation_type: 'sum'
  }
  persistence: FullPersistence {
    schema: 'persisted'
  }
}

Closing Notes

We believe Aggregate Awareness will unlock analytics engineers and allow them to answer data questions much more efficiently and much faster, ultimately delivering better experience to end users.

To receive updates (or even Beta version) early, please fill in this Early Interest Form.
Kindly note that Aggregate Awareness will only be available in Datasets that use AQL engine.

If you have any questions, feel free to let us know!

p/s the above tutorial page and its screenshots are all taken from a functional version in our development environment. So you can expect this to be released soon!

datbth · May 28, 2024, 9:19am

Here is our new Demo video!

0:00 - 0:10: Without Aggregate Awareness
0:10 - 0:32: Setting up and using Aggregate Awareness via PreAggregate

tridvm · July 18, 2024, 3:35am

Hi folks
I have a new update

Aggregate Awareness finally opens requests for Closed Beta users. You can Request Beta Access and try it out.

Aggregate Awareness helps you utilize summarized data (transformed by tools such as dbt, Airflow, BI like Holistics, etc.) to dynamically answer data questions with just enough information. This enhances performance and reduces query costs, avoiding the need to always query the raw tables.

This feature has impacts on everyone in the team:

Analysts can maintain the pre-aggregates better with less effort (improved process).
Business users will get the same insights faster by pre-aggregated tables instead of raw data (improved timeliness).

You can learn more about the feature in the official Public Doc | Aggregate Awareness.

Cheers!