对于互联网公司报表需求中,有很多多维度的组合,大的汇总维度、小的明细维度,精细化的下钻。这样的背景下,我们很多的指标都要进行多维度的聚合。
grouping sets: 多维度组合,组合维度自定义
with cube: 多维度组合,程序自由组合,组合为各个维度的笛卡尔积
with rollup: with cube的子集,以左侧维度为主,即不允许左侧为NULL,右侧为非NULL的情况出现
我们来几个案例:
with cube:
spark-sql> with test1 as
> (select '2021-08' as month_date,'2021-08-11' as day_date,10 as pv
> union all
> select '2021-08' as month_date,'2021-08-10' as day_date,15 as pv
> union all
> select '2021-08' as month_date,'2021-08-08' as day_date,35 as pv
> union all
> select '2021-07' as month_date,'2021-07-08' as day_date,35 as pv
> union all
> select '2021-07' as month_date,'2021-07-06' as day_date,25 as pv
> union all
> select '2021-07' as month_date,'2021-07-03' as day_date,15 as pv)
> select
> month_date,
> day_date,
> sum(pv) as pv
> from test1
> group by month_date,day_date
> with cube
> order by month_date,day_date;
NULL NULL 135
NULL 2021-07-03 15
NULL 2021-07-06 25
NULL 2021-07-08 35
NULL 2021-08-08 35
NULL 2021-08-10 15
NULL 2021-08-11 10
2021-07 NULL 75
2021-07 2021-07-03 15
2021-07 2021-07-06 25
2021-07 2021-07-08 35
2021-08 NULL 60
2021-08 2021-08-08 35
2021-08 2021-08-10 15
2021-08 2021-08-11 10
Time taken: 26.605 seconds, Fetched 15 row(s)
grouping sets:
spark-sql> with test1 as
> (select '2021-08' as month_date,'2021-08-11' as day_date,10 as pv
> union all
> select '2021-08' as month_date,'2021-08-10' as day_date,15 as pv
> union all
> select '2021-08' as month_date,'2021-08-08' as day_date,35 as pv
> union all
> select '2021-07' as month_date,'2021-07-08' as day_date,35 as pv
> union all
> select '2021-07' as month_date,'2021-07-06' as day_date,25 as pv
> union all
> select '2021-07' as month_date,'2021-07-03' as day_date,15 as pv)
> select
> month_date,
> day_date,
> sum(pv) as pv
> from test1
> group by month_date,day_date
> grouping sets
> (
> (),
> (month_date),
> (month_date,day_date),
> (day_date)
> )
> order by month_date,day_date;
NULL NULL 135
NULL 2021-07-03 15
NULL 2021-07-06 25
NULL 2021-07-08 35
NULL 2021-08-08 35
NULL 2021-08-10 15
NULL 2021-08-11 10
2021-07 NULL 75
2021-07 2021-07-03 15
2021-07 2021-07-06 25
2021-07 2021-07-08 35
2021-08 NULL 60
2021-08 2021-08-08 35
2021-08 2021-08-10 15
2021-08 2021-08-11 10
Time taken: 12.854 seconds, Fetched 15 row(s)
总结:
从with cube和grouping sets的案例可以看出,两个结果是一样的。
with cube的维度组合就是groupingsets里面手动添加的维度,即为month_date,day_date两个维度的笛卡尔积。
我们再来看看 with rollup:
spark-sql> with test1 as
> (select '2021-08' as month_date,'2021-08-11' as day_date,10 as pv
> union all
> select '2021-08' as month_date,'2021-08-10' as day_date,15 as pv
> union all
> select '2021-08' as month_date,'2021-08-08' as day_date,35 as pv
> union all
> select '2021-07' as month_date,'2021-07-08' as day_date,35 as pv
> union all
> select '2021-07' as month_date,'2021-07-06' as day_date,25 as pv
> union all
> select '2021-07' as month_date,'2021-07-03' as day_date,15 as pv)
> select
> month_date,
> day_date,
> sum(pv) as pv
> from test1
> group by month_date,day_date
> with rollup
> order by month_date,day_date;
NULL NULL 135
2021-07 NULL 75
2021-07 2021-07-03 15
2021-07 2021-07-06 25
2021-07 2021-07-08 35
2021-08 NULL 60
2021-08 2021-08-08 35
2021-08 2021-08-10 15
2021-08 2021-08-11 10
Time taken: 18.792 seconds, Fetched 9 row(s)
总结:
从结果上可以看出,和with cube的区别是,少了day_date这个单独维度的聚合,因为with rollup是以左侧维度为主,当左侧month_date维度为NULL时,右侧day_date维度就不允许为NULL。
实际生产中,上一篇对grouping sets做了详细的分析,可以点击此处