[Upcoming] Aggregate Awareness (aka Pre-aggregations)

Hi all,

In this post, we would like to proudly introduce our upcoming feature: Pre-Aggregation.

Background

Imagine you have a fact table called orders that contains data of every order made on your e-commerce platform. Throughout your 5 years in business, the table orders has accumulated billions of rows.

Now, in order to answer questions like “Total number of items ordered in each year”, the database has to crunch through those billions of rows and aggregate the total. This process may take several minutes to complete and consume lots of database resources (typically memory and CPU)!
On the other hand, towards business users, this is a very simple question. In their eyes, they just need 5 numbers, 1 for each year. “Why does it take so long to get 5 numbers?!”

In data engineering and analytics, “pre-aggregation” is a well-known tool/technique to serve data questions like this faster.
Essentially, the data engineer can “pre-aggregate” the orders table into total_quantity_by_month or total_quantity_by_year. Such “pre-aggregation” is done ahead of the business users’ usage, so that when the business users actually ask the question, the system can just process total_quantity_by_month (~ 60 rows) or total_quantity_by_year (~5 rows) and pull out the answer immediately!

Challenges

In the context of self-serve analytics where the business users use the Business Intelligence platform themselves to “query” or “explore” the data, it is actually quite challenging to make sure those users are correctly using the pre-aggregated tables like total_quantity_by_month or total_quantity_by_year instead of orders. This raises the need for the BI platform to be “aggregation-aware” and be able to automatically use the appropriate pre-aggregations for the business users.

Additionally, it requires significant effort and skills for the data engineers to maintain the pre-aggregations, keeping them accurate and up-to-date.

Aggregate Awareness :tada:

To solve the above challenges, Holistics is developing Aggregate Awareness that consists of these sets of features:

  • :robot: Aggregate awareness: automatically identify eligible pre-aggregations and substitute them into queries, making the queries use smaller pre-aggregated tables while still producing accurate results.
  • :writing_hand: Pre-Aggregation Definition: allow analytic engineers to conveniently define the pre-aggregations
  • :package: Pre-Aggregation Persistence: allow analytic engineers to conveniently and accurately persist/materialize the pre-aggregations
  • :bulb: Pre-Aggregation Suggestions (later): automatically suggest relevant pre-aggregations to speed up your explorations and reports.

Please take a first look at the usage of Aggregate Awareness via this tutorial: Use Pre-Aggregations | Holistics Docs (4.0).

Here are some snippets taken from the above tutorial:

pre_aggregate aggregated_quantity_by_month {
  dimension pa_created_at {
    for: ref('demo_orders', 'created_at')
    time_granularity: 'month'
  }
  measure pa_sum_quantity {
    for: ref('demo_order_items', 'quantity')
    aggregation_type: 'sum'
  }
  persistence: FullPersistence {
    schema: 'persisted'
  }
}

Closing Notes

We believe Aggregate Awareness will unlock analytics engineers and allow them to answer data questions much more efficiently and much faster, ultimately delivering better experience to end users.

To receive updates (or even Beta version) early, please fill in this Early Interest Form.
Kindly note that Pre-Aggregations will only be available in Datasets that use AQL engine.

If you have any questions, feel free to let us know! :heart:

p/s the above tutorial page and its screenshots are all taken from a functional version in our development environment. So you can expect this to be released soon! :wink:

6 Likes

Here is our new Demo video!

  • 0:00 - 0:10: Without Pre-Aggregations
  • 0:10 - 0:32: Defining and using Pre-Aggregations