Remove pz.GroupBySig and Move Logic into GroupByAggregate and ApplyGroupByOp #252

@mdr223

Description

Currently, we use a pz.GroupBySig to store the specification that defines a group-by operation. This specification includes:

  1. The group_by_fields (i.e. the fields uniquely identifying each group)
  2. The agg_fields (i.e. the fields which we are aggregating over for each unique group)
  3. The agg_funcs (i.e. the aggregations we are applying to the respective agg_fields for each unique group)

For example, if we had the following data on home prices:

address, price, num_beds, city, state

And we wanted to compute the average home price and maximum number of bedrooms in each city and state, we would construct the following pz.GroupBySig:

home_gby_sig = pz.GroupBySig(
  group_by_fields=['state', 'city'],
  agg_fields=['price', 'num_beds'],
  agg_funcs=['mean', 'max'],
)
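
To make the meaning of the three lists concrete, here is a minimal pure-Python sketch of what this specification computes. It does not use pz at all: the sample rows, the AGG_IMPLS table, and the group_and_aggregate helper are all hypothetical and exist only for illustration. The key point is that agg_fields and agg_funcs are paired by position.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows using the columns named above.
homes = [
    {"address": "1 Main St", "price": 500_000, "num_beds": 3, "city": "Boston", "state": "MA"},
    {"address": "2 Elm St", "price": 700_000, "num_beds": 4, "city": "Boston", "state": "MA"},
    {"address": "3 Oak Ave", "price": 300_000, "num_beds": 2, "city": "Austin", "state": "TX"},
]

# Illustrative name-to-callable table (an assumption, not pz's actual table).
AGG_IMPLS = {"mean": mean, "max": max}


def group_and_aggregate(rows, group_by_fields, agg_fields, agg_funcs):
    """Group rows by group_by_fields, then apply agg_funcs[i] to agg_fields[i]."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in group_by_fields)
        groups[key].append(row)

    result = {}
    for key, members in groups.items():
        # Pair each aggregated field with its function by position.
        result[key] = tuple(
            AGG_IMPLS[func]([m[field] for m in members])
            for field, func in zip(agg_fields, agg_funcs)
        )
    return result


summary = group_and_aggregate(
    homes,
    group_by_fields=["state", "city"],
    agg_fields=["price", "num_beds"],
    agg_funcs=["mean", "max"],
)
```

Each key in summary is a (state, city) tuple, and each value holds the mean price followed by the maximum number of bedrooms for that group.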

We could then supply this to a pz.Dataset as follows:

ds = pz.HomePriceDataset(...)
ds = ds.groupby(home_gby_sig)
...

The goal for this PR is to change this interface so that we simply specify:

ds = pz.HomePriceDataset(...)
ds = ds.groupby(gby_fields=['state', 'city'], agg_fields=['price', 'num_beds'], agg_funcs=['mean', 'max'])
...
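
As a rough sketch of the shape this could take, the toy classes below show a groupby() method that accepts the three lists directly and stores them on a logical operator, with no signature object in between. ToyDataset and ToyGroupByAggregate are hypothetical stand-ins, not the real pz.Dataset or GroupByAggregate, and the real constructors may take different arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGroupByAggregate:
    """Hypothetical stand-in for a logical operator that owns the group-by spec."""
    gby_fields: tuple
    agg_fields: tuple
    agg_funcs: tuple


class ToyDataset:
    """Hypothetical stand-in for a dataset that records a chain of logical operators."""

    def __init__(self, ops=()):
        self.ops = tuple(ops)

    def groupby(self, gby_fields, agg_fields, agg_funcs):
        # The arguments go straight onto the operator; no intermediate signature object.
        op = ToyGroupByAggregate(tuple(gby_fields), tuple(agg_fields), tuple(agg_funcs))
        return ToyDataset(self.ops + (op,))


toy_ds = ToyDataset()
toy_ds = toy_ds.groupby(
    gby_fields=["state", "city"],
    agg_fields=["price", "num_beds"],
    agg_funcs=["mean", "max"],
)
```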

Furthermore, we should supply these arguments (gby_fields, agg_fields, agg_funcs) directly to the GroupByAggregate() logical operator and its corresponding ApplyGroupByOp physical operator. In doing so, we will need to copy some functionality from the GroupBySig into the logical and physical operators (e.g. .validate_schema() in GroupByAggregate and .get_agg_field_names() in ApplyGroupByOp, to name two places).
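
The two helpers mentioned above might look roughly like the standalone functions below once they no longer live on GroupBySig. This is a sketch under stated assumptions: the set of supported aggregation names, the (ok, message) return convention, and the "func(field)" output naming are guesses for illustration and may not match what GroupBySig currently does.

```python
# Assumed set of supported aggregation names (hypothetical).
SUPPORTED_AGG_FUNCS = {"mean", "max", "min", "sum", "count"}


def validate_schema(schema_fields, gby_fields, agg_fields, agg_funcs):
    """Check the group-by arguments against the input field names.

    Returns (True, None) on success, or (False, reason) on the first problem found.
    """
    if len(agg_fields) != len(agg_funcs):
        return False, "agg_fields and agg_funcs must have the same length"
    for name in gby_fields:
        if name not in schema_fields:
            return False, f"group-by field not in schema: {name}"
    for name in agg_fields:
        if name not in schema_fields:
            return False, f"aggregation field not in schema: {name}"
    for func in agg_funcs:
        if func not in SUPPORTED_AGG_FUNCS:
            return False, f"unsupported aggregation function: {func}"
    return True, None


def get_agg_field_names(agg_fields, agg_funcs):
    """Build one output column name per (field, function) pair (assumed format)."""
    return [f"{func}({field})" for field, func in zip(agg_fields, agg_funcs)]


home_fields = ["address", "price", "num_beds", "city", "state"]
ok, reason = validate_schema(
    home_fields, ["state", "city"], ["price", "num_beds"], ["mean", "max"]
)
names = get_agg_field_names(["price", "num_beds"], ["mean", "max"])
```

Keeping both helpers as functions of the three lists, rather than methods on a separate object, means the logical and physical operators can call them with the arguments they already hold.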
