Setup pre_aggregates and aggregate awareness

stonematt · March 15, 2025, 8:24am

I’m trying to get my first pre_aggregate setup, but I seem to be missing a step. Here’s the scenario:

there is a fact table called warehouse_etl_stack_assignment_examiner_timing
there are dimensions and measures about teams, examiners working various events.
i’m starting with a simple count stacks (rows in this table represent stacks that have been assigned and completed)
i made a simple pre_agg to count by several dimensions.

Here’s the dataset block:

pre_aggregates: {
    agg_event_stacks: PreAggregate  {
      dimension completed_at_day {
        // It references the field `created_at` in model `transactions`
        for: ref('warehouse_etl_stack_assignment_examiner_timing', 'completed')
        time_granularity: "day"
      }
      dimension pa_event {
        for: ref('events', 'event_name')
      }
      dimension pa_calibration {
        for: ref('warehouse_etl_stack_assignment_examiner_timing', 'calibration')
      }
      dimension pa_team {
        for: ref('warehouse_etl_stack_assignment_examiner_timing', 'team_name')
      }
      dimension pa_examiner {
        for: ref('warehouse_etl_stack_assignment_examiner_timing', 'examiner_name')
      }
      measure pa_stacks_count {
        for: ref('warehouse_etl_stack_assignment_examiner_timing', 'stack_assignment_id')
        aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'holistics_persistence'
      }
    }
  }

Results

I manually persisited the pre_aggs
I opend the SQL browser, and see the pre_agg table w/ correct data included

When I go browse the data set, I ran a table with event, calibration, team, examiner, stacks_count (a measure on the fact table) I get this executed SQL (It’s hitting the raw table):

SELECT
  "events"."event_name" AS "e_en_615074",
  "warehouse_etl_stack_assignment_examiner_timing"."calibration" AS "wesaet_c_234501",
  "warehouse_etl_stack_assignment_examiner_timing"."team_name" AS "wesaet_tn_2f4e60",
  "warehouse_etl_stack_assignment_examiner_timing"."examiner_name" AS "wesaet_en_73ee56",
  COUNT("warehouse_etl_stack_assignment_examiner_timing"."stack_assignment_id") AS "c_wesaet_s_9d2667"
FROM
  "xxx"."warehouse_etl_stack_assignment_examiner_timing" "warehouse_etl_stack_assignment_examiner_timing"
  LEFT JOIN "holistics_persistence"."events__H4864d3_T1238755" "events" ON "warehouse_etl_stack_assignment_examiner_timing"."exam_event_id" = "events"."exam_event_id"
GROUP BY
  1,
  2,
  3,
  4
ORDER BY
  5 DESC
LIMIT 10000

NOTE The 'holistics_persistence".events__H4*" is a persisted query model. That is working.

Questions

are my expectations in line?
what did i miss?

datbth · March 17, 2025, 3:31am

Hi Matt,

Thank you for posting the question.

From my first look, what you are doing and expecting seem to be correct.
However, there could be an issue related to how your Holistics workspace or Dataset is being configured.
When Aggregate Awareness is enabled, it would include inline debugging comments in the Executed SQL like this

Would you mind sending an email to [email protected] using the email you are using in Holistics, so that we can check your setup further?

stonematt · March 17, 2025, 3:21pm

I missed a step when creating the new dataset from scratch. I didn’t remember to include:

  __engine__: 'aql' //turn this Dataset to using AQL Engine

I activated it and ran the report again and see the debugging comments wooohooo! But it’s not picking up the pre_agg when I use a measure that’s defined on the table model:

Pre_agg Success

Success SQL

-- [Aggregate Awareness - Hit!] Using agg_event_stacks (persisted at 2025-03-17T14:56:26Z; in place of warehouse_etl_stack_assignment_examiner_timing, events)
SELECT
  "agg_event_stacks"."pa_event" AS "e_en_615074",
  SUM("agg_event_stacks"."pa_stacks_count") AS "c_wesaet_sai_c5bab5"
FROM
  "holistics_persistence"."HPA_5ea7bf6d13ace28dbff4_T1742223375" "agg_event_stacks"
GROUP BY
  1

Pre_agg Miss

Looks like it misses when I try to use the measure that’s defined in the table model. Here’s the definition:

Table Model

  measure stacks {
    label: "stacks_count"
    type: "number"
    definition: @sql {{ stack_assignment_id }};;
    description: "count of stacks"
    aggregation_type: "count"
    format: "#,###"
  }

Then run the report like this:

Miss SQL

-- [Aggregate Awareness - missed] Not using agg_event_stacks (field miss: "c_wesaet_s_9d2667" - warehouse_etl_stack_assignment_examiner_timing.stacks->inner | count)
SELECT
  "events"."event_name" AS "e_en_615074",
  COUNT("warehouse_etl_stack_assignment_examiner_timing"."stack_assignment_id") AS "c_wesaet_s_9d2667"
FROM
  "cfa"."warehouse_etl_stack_assignment_examiner_timing" "warehouse_etl_stack_assignment_examiner_timing"
  LEFT JOIN "holistics_persistence"."events__H4864d3_T1425949" "events" ON "warehouse_etl_stack_assignment_examiner_timing"."exam_event_id" = "events"."exam_event_id"
GROUP BY
  1

datbth · March 17, 2025, 3:48pm

Hi Matt,

Thank you for the update. Happy to hear that you have managed to get started!

To use SQL-based measures, you would need to put the measure in your PreAgg like this:

PreAggregate {
  measure stack_count {
    for: ref('stacks', 'stack_count')
    aggregation_type: 'custom'
  }
}

But we would strongly recommend using AQL-based measures instead.

stonematt · March 17, 2025, 4:58pm

OK! I pointed the pre_agg at my table model during troubleshooting above. I actually want to use a query model that has more interesting measures in general (a bunch of sums and medians ) – they are generally AQL based. I’ll iterate and let you know how it goes!

stonematt · March 19, 2025, 7:33pm

@datbth I thought I was close, but pre_agg is still not doing what I expected. Here’s the update.

Update model to use all @aql measures where possible. (especially count, sum, and median)
update pre_agg to point at query model stack_assignment_timing
Update pre_agg definitions to use the same name stacks as the model according to the Pre Agg FAQ example
Run data view with stacks metric (misses)
Run data view with similar defintion as pre-agg: count(stack_assignment_id) in the data panel (hits) – this is similar to above ^^^

Config Summary:

Model:

Model stack_assignment_timing {
  type: 'query'
  label: 'Stack Assignment Timing'

  // Custom Measures
  measure stacks {
    label: "Stacks"
    type: "number"
    definition: @aql count(stack_assignment_timing.stack_assignment_id);;
    description: "count of stacks"
    format: "#,###"
  }
  query: @sql
    select
       {{ #s.exam_event_id }}
      ,  {{ #s.team_name }}
      , {{ #s.exam_event_id }} || '-' || {{ #s.team_name }} as event_team_name
      ,  {{ #s.examiner_name }}
      ,  {{ #s.stack_assignment_id }}
      ,  {{ #s.calibration }}
      ,  {{ #s.assigned }}
      ,  {{ #s.evaluation_round }}
      ,  {{ #s.completed }}
      ,  {{ #s.last_completed }}
      ,  {{ #s.stack_seconds }}
      ,  {{ #s.stack_interval_seconds }}
      ,  {{ #s.updated }}
      ,  {{ #s.completed_hour }}
      ,  {{ #s.stack_seconds_bins }}
      ,  {{ #s.stack_hours }}
    from
      {{ #all_stack_assignment_timing as s }}
    WHERE
      -- exclude stacks that were left open more than 30min
      {{ #s.stack_seconds }} < 30 * 60  ;;
  models: [
    all_stack_assignment_timing
  ]
}

Dataset

Dataset cfa_event_comparison {
  __engine__: 'aql' //turn this Dataset to using AQL Engine

  models: [
    stack_assignment_timing,
    events,
    team_data,
  ]

  relationships: [
    relationship(stack_assignment_timing.exam_event_id > events.exam_event_id, true),
    relationship(stack_assignment_timing.event_team_name > team_data.event_team_name, true),
  ]

  pre_aggregates: {
    ag_f_event_stacks: PreAggregate  {
      dimension completed_day {
        for: ref('stack_assignment_timing', 'completed')
        time_granularity: "day"
      }
      dimension event_name {
        for: ref('events', 'event_name')
      }
      dimension calibration {
        for: ref('stack_assignment_timing', 'calibration')
      }
      dimension team_name {
        for: ref('stack_assignment_timing', 'team_name')
      }
      dimension examiner_name {
        for: ref('stack_assignment_timing', 'examiner_name')
      }
      measure stacks {
        // this block works
        for: ref('stack_assignment_timing', 'stacks')
        aggregation_type: 'custom'
        // this block doesn't work
        // for: ref('stack_assignment_timing', 'stack_assignment_id')
        // aggregation_type: 'count'
      }
      persistence: FullPersistence {
        schema: 'holistics_persistence'
      }
      // persistence: IncrementalPersistence {
      //   schema: 'holistics_persistence'
      //   incremental_column: 'completed'  // persistence failed to find the completed column ¯\_(ツ)_/¯ <sigh>
      //   primary_key: 'stack_assignment_id'
      // }
    }
  }
}

Results

using aggregation_type: 'count' on the model id column

-- [Aggregate Awareness - missed] Not using ag_f_event_stacks (field miss: "c_sat_s_49aabc" - stack_assignment_timing.stacks->inner | count)

using aggregation_type: 'custom' on the model measure stacks

-- [Aggregate Awareness - Hit!] Using ag_f_event_stacks (persisted at 2025-03-19T19:23:20Z; in place of stack_assignment_timing, events)

Advice?

datbth · March 24, 2025, 2:55am

Hi Matt,

Regarding the PreAgg miss, could you please

Double-check that you are using the Published code of stack_assignment_timing.stacks. There is a chance you changed the measure to @aql but haven’t published it yet, so when you open the Dataset in Reporting tab, it was still using the old measure definition.
Share the full Executed SQL and a screenshot of your Visualization settings
Enabling Support Impersonation so that we can investigate more efficiently

Sorry for the inconvenience.

I also noticed that you put a comment on incremental_column: 'completed' // persistence failed to find the completed column, this is because the column is named completed_day in your pre-aggregate. Please try renaming it to completed (same as model field name) instead.
The primary_key, if specified, also needs to be present as a field in both the model and the pre-aggregate. Though typically, it isn’t needed for pre-aggregates.
We will try to revise the incremental settings to make it more convenient for pre-aggregates.