How to model drill-across relationships?

I’m curious how I should approach modeling a drill-across situation in Holistics. Below describes a simplified version of what I have and the output I want.

My company builds “packages” that generate electricity. Each package may consist of multiple cores, which each consume fuel independently. I have a dimensional model consisting of:

  • dim_package - A dimension containing various attributes describing packages.
  • dim_core - A dimension containing various attributes describing cores.
  • fct_signal_package - A fact table containing telemetry signals - one per package per point in time. Points in time have been put into standardized 15-minute “time bins”, but there can be multiple signals per bin.
  • fct_signal_core - A fact table containing telemetry signals - one per core per point in time. Points in time have been put into standardized 15-minute “time bins”, but there can be multiple signals per bin.
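
For concreteness, here is a rough DDL sketch of the two fact tables (column names as above; the types are my guesses):

-- Grain: one row per package (or core) per recorded signal
create table fct_signal_package (
  package_sk   varchar,    -- FK to dim_package
  bin_start_at timestamp,  -- start of the standardized 15-minute bin
  signal_at    timestamp,  -- when the signal was recorded
  energy_kwh   number
);

create table fct_signal_core (
  core_sk      varchar,    -- FK to dim_core
  package_sk   varchar,    -- FK to dim_package
  bin_start_at timestamp,
  signal_at    timestamp,
  fuel_kbtu    number
);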

I want to create a time-series report showing the energy-to-fuel ratio. So I might have data that looks like this (very simplified):

dim_package - Just one package with a surrogate key and serial number:

package_sk  package_sn
pkg1        P1

dim_core - Two cores belonging to one package with a surrogate key and serial number:

core_sk  core_sn
core1a   C1A
core1b   C1B

fct_signal_package - Signal telemetry measuring energy produced by the package (signal_at is the time the signal was recorded, bin_start_at is the start of a 15-minute interval):

package_sk  bin_start_at         signal_at            energy_kwh
pkg1        2022-01-01 00:00:00  2022-01-01 00:00:01  10.0
pkg1        2022-01-01 00:15:00  2022-01-01 00:15:01  5.0
pkg1        2022-01-01 00:15:00  2022-01-01 00:18:01  5.0
pkg1        2022-01-01 00:30:00  2022-01-01 00:30:01  20.0
pkg1        2022-01-01 00:45:00  2022-01-01 00:45:01  10.0

fct_signal_core - Signal telemetry measuring fuel consumed by the core:

core_sk  package_sk  bin_start_at         signal_at            fuel_kbtu
core1a   pkg1        2022-01-01 00:00:00  2022-01-01 00:00:01  5.0
core1a   pkg1        2022-01-01 00:15:00  2022-01-01 00:15:01  5.0
core1a   pkg1        2022-01-01 00:30:00  2022-01-01 00:30:01  12.0
core1a   pkg1        2022-01-01 00:45:00  2022-01-01 00:45:01  5.0
core1b   pkg1        2022-01-01 00:00:00  2022-01-01 00:00:01  5.0
core1b   pkg1        2022-01-01 00:15:00  2022-01-01 00:15:01  5.0
core1b   pkg1        2022-01-01 00:30:00  2022-01-01 00:30:01  8.0
core1b   pkg1        2022-01-01 00:45:00  2022-01-01 00:45:01  5.0

In my report, I want the average package energy-to-fuel ratio by time (data has been fabricated to give a constant ratio):

package_sn  bin_start_at         energy_to_fuel
P1          2022-01-01 00:00:00  1.0
P1          2022-01-01 00:15:00  1.0
P1          2022-01-01 00:30:00  1.0
P1          2022-01-01 00:45:00  1.0

I know how to write a SQL query for this: I take each of my facts, aggregate them up to the package-bin granularity, then join the results together (voilà, drill-across). Something like this (Snowflake dialect):

with dim_package as (
  select 'pkg1' as package_sk, 'P1' as package_sn
),

dim_core as (
  select 'core1a' as core_sk, 'C1A' as core_sn
  union all
  select 'core1b' as core_sk, 'C1B' as core_sn
),

fct_signal_package as (
  select 'pkg1' as package_sk, '2022-01-01 00:00:00'::timestamp as bin_start_at, '2022-01-01 00:00:01'::timestamp as signal_at, 10.0 as energy_kwh
  union all
  select 'pkg1' as package_sk, '2022-01-01 00:15:00'::timestamp as bin_start_at, '2022-01-01 00:15:01'::timestamp as signal_at, 5.0 as energy_kwh
  union all
  select 'pkg1' as package_sk, '2022-01-01 00:15:00'::timestamp as bin_start_at, '2022-01-01 00:18:01'::timestamp as signal_at, 5.0 as energy_kwh
  union all
  select 'pkg1' as package_sk, '2022-01-01 00:30:00'::timestamp as bin_start_at, '2022-01-01 00:30:01'::timestamp as signal_at, 20.0 as energy_kwh
  union all
  select 'pkg1' as package_sk, '2022-01-01 00:45:00'::timestamp as bin_start_at, '2022-01-01 00:45:01'::timestamp as signal_at, 10.0 as energy_kwh
),

fct_signal_core as (
  select 'core1a' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:00:00'::timestamp as bin_start_at, '2022-01-01 00:00:01'::timestamp as signal_at, 5.0 as fuel_kbtu
  union all
  select 'core1a' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:15:00'::timestamp as bin_start_at, '2022-01-01 00:15:01'::timestamp as signal_at, 5.0 as fuel_kbtu
  union all
  select 'core1a' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:30:00'::timestamp as bin_start_at, '2022-01-01 00:30:01'::timestamp as signal_at, 12.0 as fuel_kbtu
  union all
  select 'core1a' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:45:00'::timestamp as bin_start_at, '2022-01-01 00:45:01'::timestamp as signal_at, 5.0 as fuel_kbtu
  union all
  select 'core1b' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:00:00'::timestamp as bin_start_at, '2022-01-01 00:00:01'::timestamp as signal_at, 5.0 as fuel_kbtu
  union all
  select 'core1b' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:15:00'::timestamp as bin_start_at, '2022-01-01 00:15:01'::timestamp as signal_at, 5.0 as fuel_kbtu
  union all
  select 'core1b' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:30:00'::timestamp as bin_start_at, '2022-01-01 00:30:01'::timestamp as signal_at, 8.0 as fuel_kbtu
  union all
  select 'core1b' as core_sk, 'pkg1' as package_sk, '2022-01-01 00:45:00'::timestamp as bin_start_at, '2022-01-01 00:45:01'::timestamp as signal_at, 5.0 as fuel_kbtu
),

agg_package as (
  select
    dim_package.package_sn,
    fct_signal_package.bin_start_at,
    sum(energy_kwh) as energy_kwh
  from
    fct_signal_package
  inner join
    dim_package
  on
    fct_signal_package.package_sk = dim_package.package_sk
  group by
    1,2
),

agg_core as (
  select
    dim_package.package_sn,
    fct_signal_core.bin_start_at,
    sum(fuel_kbtu) as fuel_kbtu
  from
    fct_signal_core
  inner join
    dim_package
  on
    fct_signal_core.package_sk = dim_package.package_sk
  group by
    1,2
),

final as (
  select
    coalesce(agg_package.package_sn, agg_core.package_sn) as package_sn,
    coalesce(agg_package.bin_start_at, agg_core.bin_start_at) as bin_start_at,
    agg_package.energy_kwh / agg_core.fuel_kbtu as energy_to_fuel
  from
    agg_package
  inner join
    agg_core
  on
    -- Join on common dimensions - note that `bin_start_at` is a degenerate dimension
    agg_package.package_sn = agg_core.package_sn
    and
    agg_package.bin_start_at = agg_core.bin_start_at
)

select * from final

My main question here is: what is the proper way to model this kind of relationship in Holistics? I have attempted to define the facts and dimensions as Holistics models and include relationships between them, but unfortunately I get an error about “Dataset contains ambiguous paths”, which makes me concerned that Holistics does not support this kind of drill-across relationship.

  • Does Holistics support datasets with multiple facts that can be combined via a Kimball-style
    drill-across query, when the facts are at different levels of granularity but share common dimensions?
  • If not, what are some good alternatives? I suppose I could create an aggregate package-core-signal fact table that rolls the base metrics up to a common granularity. But of course if I do that, then I’ll basically be duplicating some of those metrics (e.g., I’ll need energy and fuel metrics in both the detail facts and the aggregated fact; see the sketch after this list).
  • My real situation is actually a bit more complicated because I’ve got a third fact table I need to incorporate: fct_signal_node, which contains measurements of the energy “grid” connected to a group (node) of interconnected packages. Would any of this approach change with 3 fact tables? (If I had to create derived aggregate facts, I’d then need to create another one at the higher node granularity as well.)
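
For concreteness, here is a minimal sketch of that aggregate fact, assuming the agg_package and agg_core results from the query above were materialized as models (the table name is made up):

create table agg_signal_package_bin as
select
  agg_package.package_sn,
  agg_package.bin_start_at,
  agg_package.energy_kwh,  -- duplicated from fct_signal_package
  agg_core.fuel_kbtu       -- duplicated from fct_signal_core
from agg_package
inner join agg_core
  on agg_package.package_sn = agg_core.package_sn
  and agg_package.bin_start_at = agg_core.bin_start_at;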

Hi Sterling,

Thank you for giving us a very detailed question. I’ll try my best to answer some of your concerns here.

Does Holistics support datasets with multiple facts that can be combined via a Kimball-style
drill-across query, when the facts are at different levels of granularity but share common dimensions?

Yes, as long as the common dimensions can connect to each other and to your fact tables without ambiguity. Unfortunately, that is not true of your use-case, where you want an exploration like this:

explore {
  dimensions {
    dim_package.package_sn,
    // fct_signal_package.bin_start_at or fct_signal_core.bin_start_at
  }
  measures {
    energy_to_fuel: // you can put this in a biz cal
      sum(fct_signal_package.energy_kwh) / sum(fct_signal_core.fuel_kbtu)
  }
}

This would not work because, as far as the engine is concerned, fct_signal_package.bin_start_at and fct_signal_core.bin_start_at are not the same dimension. To overcome this, what we usually recommend is to introduce a date_dim model to connect the two models.
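
For example, a date_dim at the 15-minute bin grain could be generated with something like this (a rough Snowflake sketch; the start date and row count are placeholders):

create or replace table date_dim as
select
  dateadd('minute', 15 * seq4(), '2022-01-01'::timestamp) as date_key
from table(generator(rowcount => 35040));  -- one year of 15-minute bins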


But this would not work for your case either, because now the dataset contains ambiguous paths between dim_package and date_dim.

We know about this issue and ideally the experience we want you to have here is:

  1. Allow you to disable a relationship in the dataset to remove ambiguity during normal exploration.
  2. But still allow you to explicitly enable/choose the correct path at measure definition time.

This is what we’re actively working on, but a lot is still work in progress so I cannot share much with you now.

For now, the best work-around is either to do what you said but separate the facts into different datasets, or to create a dim model that merges dim_package with date_dim and relates to the fact tables on (package_sn, bin_start_at).

Obviously, neither of these is ideal, and each has its own issues. We’ll get back to you when we think of any better work-arounds.

Thanks @tan. How do you envision getting the answer I want if you (1) “disable” a relationship or (2) explicitly choose the “correct” path? It seems like you need all of the relationships in order to get the right answer. The algorithm would need to be smart enough to know that if I have two metrics defined in different tables, they need to be aggregated up to the dimensions they share in common before being joined.

Hi Sterling,

Ideally we would add the ability to let you choose the path in the metric definition itself, so you can choose the “correct” path for each definition. That way both paths are used in the exploration, but each one is locally scoped to its metric.

The algorithm would need to be smart enough to know that if I have two metrics defined in different tables, they need to be aggregated up to the dimensions they share in common before being joined.

Our engine already does this; the only thing preventing your use-case here is the ambiguity issue mentioned above.

Hi Sterling,

You can try creating a separate dimension table where each row represents a (package - time bin) pair
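
For example, a quick sketch of how such a package_time_bins table could be derived from the existing facts:

-- One row per (package, 15-minute bin) pair observed in either fact table
create or replace table package_time_bins as
select package_sk, bin_start_at from fct_signal_package
union  -- union (not union all) dedupes the pairs
select package_sk, bin_start_at from fct_signal_core;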

And then create relationships between the cores, packages & package_time_bins like so:

Next, in the cores and package models, you can create separate “Total fuel consumed” & “Total energy produced” measures. These two measures will be used to create the “Energy to fuel” business calculation in the Dataset exploration / report creation view:

Can you try this and see if it works for you?

@tan - I’m still not clear on what you mean by the “path” for a metric. How do you envision applying a path definition to metrics like sum(fct_signal_package.energy_kwh) and sum(fct_signal_core.fuel_kbtu) to solve the problem of being able to create a higher-level metric that is the ratio of the two?

@hapham - I can see how that kind of solution would work in this toy situation. However, I don’t think it would scale well once we start to add a few more facts and dimensions, because then you’d end up having to build these kinds of “bridge” tables for each combination of facts that you’d want to bring into a report. Furthermore, these “bridge” tables would end up being as long as the fact tables, so the joins would likely be computationally expensive, and it would probably be simpler to just create one big pre-joined table. That’s not great either, because then you’ve got a lot of duplication in the dimensions and measures.

Basically the solution would be to create a date dim model like this:

And at the measure level you can define the ratio as:

safe_divide(
  sum(fct_signal_package.energy_kwh),
  use(
    sum(fct_signal_core.fuel_kbtu),
    fct_signal_core.bin_start_at > date_dim.date_key
  )
)

With this, sum(fct_signal_package.energy_kwh) would use the default relationship (fct_signal_package.bin_start_at > date_dim.date_key), while sum(fct_signal_core.fuel_kbtu) would use the explicitly enabled fct_signal_core.bin_start_at > date_dim.date_key relationship. The final query would be similar to what you wrote manually, except that it would have an additional join to the date_dim model:

-- ....
agg_core as (
  select
    dim_package.package_sn,
    date_dim.date_key,
    sum(fuel_kbtu) as fuel_kbtu
  from
    fct_signal_core
  inner join
    dim_package
  on
    fct_signal_core.package_sk = dim_package.package_sk
  left join
    date_dim
  on
    fct_signal_core.bin_start_at = date_dim.date_key
  group by
    1,2
),
-- ....
final as (
-- ....
  on
    agg_package.package_sn = agg_core.package_sn
    and
    -- Same underlying column identity
    agg_package.date_key = agg_core.date_key
)

@tan - Interesting. I see how that would work in this situation because you’ve defined the relationship in the measure. I think I would need to learn more about how relationships at the dataset level interact with relationships defined at the measure level. Are dataset-level relationships just defaults?

What would it mean for other measures if you disabled the relationship you indicated at the dataset level? For example, if I just wanted to look at the sum(fct_signal_core.fuel_kbtu) over time, without involving any measures from the other fact, wouldn’t disabling the relationship at the dataset level prevent this? Or would I have to define the relationship to time for each measure to avoid this issue? If so, then in a model with a few more facts and dimensions, I might end up disabling most dataset-level relationships and just defining the relationships within each measure.

What would it mean for other measures if you disabled the relationship you indicated at the dataset level? For example, if I just wanted to look at the sum(fct_signal_core.fuel_kbtu) over time, without involving any measures from the other fact, wouldn’t disabling the relationship at the dataset level prevent this? Or would I have to define the relationship to time for each measure to avoid this issue? If so, then in a model with a few more facts and dimensions, I might end up disabling most dataset-level relationships and just defining the relationships within each measure.

This is a very valid concern. But I believe you will only have to do this when a “Role playing dimension” is required with multiple measures, which would not be that common.

For example, if I just wanted to look at the sum(fct_signal_core.fuel_kbtu) over time, without involving any measures from the other fact, wouldn’t disabling the relationship at the dataset prevent this?

In this case, you wouldn’t have to use fct_signal_core.bin_start_at through date_dim as a “Role playing dimension”. You could directly use fct_signal_core.bin_start_at, which would not involve the disabled relationship at all.
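
In other words, the generated query for that exploration could stay within the single fact table, roughly:

select
  bin_start_at,
  sum(fuel_kbtu) as fuel_kbtu
from fct_signal_core
group by 1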

In this case, you wouldn’t have to use fct_signal_core.bin_start_at through date_dim as a “Role playing dimension”. You could directly use fct_signal_core.bin_start_at, which would not involve the disabled relationship at all.

Ok, but if I had one or two more facts and dimensions in this model, I’d have to disable more “ambiguous” paths, including ones where falling back to an attribute duplicated on the fact wouldn’t be possible.

@tan - Wondering if there have been any changes in the last year that would help with these scenarios?

Yes, I’ve been waiting to share the progress on this, but the documentation is not there yet.

But I can tell you that we have a way to support these scenarios in the new AQL engine for 4.0.

The gist of the unreleased documentation is that we will support a new with_relationships function that allows you to choose the relationships at the metric/expression level. Here is an example of it in action:

Dataset ecommerce_aql {
  __engine__: 'aql'
  data_source_name: 'demodb'

  models: [
    users,
    orders,
    dim_dates,
  ]

  relationships: [
    relationship(users.sign_up_date > dim_dates.date_key, true),
    relationship(orders.created_date > dim_dates.date_key, false),
    // This is enabled by default, which means by default orders are retrieved 
    // through users
    relationship(orders.user_id > users.id, true),
  ]

  owner: '[email protected]'

  metric total_orders_by_date {
    description: '''
      This metric counts the number of orders created in a period. 
      If you want the orders created by users, please use the default count(orders.id)
    '''
    label: 'Total Orders By Date'
    type: 'number'
    definition: @aql count(orders.id) | with_relationships(orders.created_date > dim_dates.date_key) ;;
  }
}
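
Conceptually, Total Orders By Date would then compile to a query shaped roughly like this, reaching dim_dates through created_date rather than through users:

select
  dim_dates.date_key,
  count(orders.id) as total_orders_by_date
from orders
inner join dim_dates
  on orders.created_date = dim_dates.date_key
group by 1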
