Wrong sum metrics when joining with a model with data from a flattened array

mabr · February 17, 2022, 8:27pm

Hi
I have a somewhat complex dataset. It has both a many-to-many join + several lookup tables with untested data from array columns of the primary table in my dataset.

Here is what I am trying to accomplish:

I am trying to group by a dimension from one of the tables with untested array data and sum a column from the model joined in via the many-to-many relationship.

This gives a to high sum/total as many rows has several values in the array data model.

I then tried to counter this by multiplying by the ratio of the number of duplicate rows:
count( distinct {{ primary_key }} ) / count( {{ primary_key }} )
but this caused a fan-out error due to the many-to-many join that does not allow for that type of custom aggregations.

My request would be that you implement a solution, where I can define my primary key and you then make sure that the total row adds up correctly if included in this type of scenario.

For inspiration take a look at Looker’s primary key solution: primary_key | Looker | Google Cloud

Khai_To · February 18, 2022, 3:42am

Hi @mabr ,

We don’t support many-to-many so could you refer to this doc to see if it solves the use case?

mabr · February 18, 2022, 8:14am

Hi Khai
By many-to-many I mean what you refer to in the doc with a many-to-one join to an id table and a many-to-one to the table I am doing the sum aggregation in.

mabr · February 18, 2022, 10:31am

Hi Khai
I have also tried the same thing with another dataset with simpler joins, and the issue is still the same.

There is no way for me to join in a model with a flattened array and get the sum of distinct rows.

To further explain my use case:

The flattened array I am joining in is product categories, where a product can belong to multiple categories. I then want to look at sales per category, but also include a total row. Right the total row becomes too high, which will lead to confusion.

I would expect the sum in the total row to sum distinct rows based on a primary key.

Khai_To · February 19, 2022, 8:33am

Hi @mabr ,

For now, we don’t have a plan for setting primary_key in the model. I will discuss it internally with my team and get back later.
In the meantime, I’m thinking of one way to set up the relationship like this (below), do let me know what do you think:
Products n - 1 Categories 1 - n Categories unnested

By doing that, the Categories model will have unique id(s) so that you can still count the sales per category (categories.name, SUM(products.sales)).

I will definitely check this with my team but could you have a quick check on the above setup to see if it solves the problem (as the workaround)?

mabr · February 23, 2022, 1:15pm

Hi Khai
I have tried your technique, but it did not have an effect.

I have talked to some of our users about the behavior and we will be okay with the current behavior.