Currently, we use a pz.GroupBySig to store the specification that defines a group-by operation. This includes:
- The group_by_fields (i.e. the fields uniquely identifying each group)
- The agg_fields (i.e. the fields which we are aggregating over for each unique group)
- The agg_funcs (i.e. the aggregations we are applying to the respective agg_fields for each unique group)
For example, if we had the following data on home prices:
address, price, num_beds, city, state
And we wanted to compute the average home price and maximum number of bedrooms in each city and state, we would construct the following pz.GroupBySig:
home_gby_sig = pz.GroupBySig(
    group_by_fields=['state', 'city'],
    agg_fields=['price', 'num_beds'],
    agg_funcs=['mean', 'max'],
)
We could then supply this to a pz.Dataset as follows:
ds = pz.HomePriceDataset(...)
ds = ds.groupby(home_gby_sig)
...
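For reference, the semantics this specification is meant to express can be sketched in plain Python. This is only an illustration and not the library's implementation: the sample rows, the `AGG_FUNCS` table, and the `group_and_aggregate` helper are all hypothetical names introduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the values are made up purely for illustration.
rows = [
    {"address": "1 A St", "price": 100.0, "num_beds": 2, "city": "Boston", "state": "MA"},
    {"address": "2 B St", "price": 300.0, "num_beds": 4, "city": "Boston", "state": "MA"},
    {"address": "3 C St", "price": 200.0, "num_beds": 3, "city": "Austin", "state": "TX"},
]

# Assumption: aggregation names map to ordinary Python callables.
AGG_FUNCS = {"mean": mean, "max": max}


def group_and_aggregate(rows, group_by_fields, agg_fields, agg_funcs):
    """Group rows by group_by_fields, then apply agg_funcs[i] to agg_fields[i]."""
    groups = defaultdict(list)
    for row in rows:
        # The tuple of group-by values uniquely identifies a group.
        key = tuple(row[field] for field in group_by_fields)
        groups[key].append(row)

    result = {}
    for key, members in groups.items():
        # One aggregate per (agg_field, agg_func) pair, in the order given.
        result[key] = tuple(
            AGG_FUNCS[func]([member[field] for member in members])
            for field, func in zip(agg_fields, agg_funcs)
        )
    return result


result = group_and_aggregate(
    rows,
    group_by_fields=["state", "city"],
    agg_fields=["price", "num_beds"],
    agg_funcs=["mean", "max"],
)
```

Each key of `result` is a `(state, city)` pair, and each value holds the mean price followed by the maximum number of bedrooms for that group.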
The goal for this PR is to change this interface so that we simply specify:
ds = pz.HomePriceDataset(...)
ds = ds.groupby(gby_fields=['state', 'city'], agg_fields=['price', 'num_beds'], agg_funcs=['mean', 'max'])
...
Furthermore, we should supply these arguments (gby_fields, agg_fields, agg_funcs) directly to the GroupByAggregate() logical operator and its corresponding ApplyGroupByOp physical operator. In doing so, we will need to copy some functionality from the GroupBySig into the logical and physical operators (e.g. .validate_schema() in GroupByAggregate and .get_agg_field_names() in ApplyGroupByOp, to name two places).
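As a rough sketch of what that copied functionality might look like once the operators receive the three lists directly, the helpers below check the arguments against a schema and derive output names. Everything here is an assumption for illustration: the function names, the set of supported aggregations, and the `func(field)` naming convention are hypothetical and do not reflect the existing GroupBySig code.

```python
# Assumption: only the two aggregations used in the example are supported.
SUPPORTED_AGG_FUNCS = {"mean", "max"}


def validate_groupby_args(schema_fields, gby_fields, agg_fields, agg_funcs):
    """Hypothetical stand-in for schema validation on the logical operator."""
    if len(agg_fields) != len(agg_funcs):
        raise ValueError("agg_fields and agg_funcs must have the same length")

    # Every referenced field must exist in the input schema.
    missing = [f for f in list(gby_fields) + list(agg_fields) if f not in schema_fields]
    if missing:
        raise ValueError(f"fields not in schema: {missing}")

    unsupported = [f for f in agg_funcs if f not in SUPPORTED_AGG_FUNCS]
    if unsupported:
        raise ValueError(f"unsupported aggregation functions: {unsupported}")


def agg_output_names(agg_fields, agg_funcs):
    """Hypothetical naming scheme for aggregate output columns."""
    return [f"{func}({field})" for field, func in zip(agg_fields, agg_funcs)]


schema_fields = ["address", "price", "num_beds", "city", "state"]
validate_groupby_args(schema_fields, ["state", "city"], ["price", "num_beds"], ["mean", "max"])
names = agg_output_names(["price", "num_beds"], ["mean", "max"])
```

Keeping these as small functions over plain lists means the logical and physical operators could share them without depending on a separate signature object.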