
pandas 3.x compatibility: groupby().apply() drops grouping column #113

@conchoecia

Summary

source/odp_functions.py:reciprocal_best_hits_blastp_or_diamond_blastp raises KeyError: 'qseqid' on pandas 3.x. The CI smoke test (tests/test_odp_basic/test_odp_basic.sh) reproduces the error; the same test passes locally with pandas 2.x.

Root cause

The idiom

fdf = (fdf.groupby("qseqid")
         .apply(lambda group: group.loc[group["evalue"] == group['evalue'].min()])
         .reset_index(drop=True))

worked on older pandas because each group passed to the function still contained the grouping column, so it was carried through into apply's result. On pandas 3.x, groupby().apply() no longer includes the grouping column in the groups it operates on; the key survives only in the result's index. reset_index(drop=True) then discards that index, so the qseqid column is gone and the next line that does fdf["qseqid"] (line 390 in odp_functions.py) raises KeyError.
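A minimal reproduction sketch of the behavior described above. The toy table is made up for illustration (only the column names come from the issue), and the warning filter is there because pandas 2.2 emits a deprecation warning for this idiom.

```python
import warnings

import pandas as pd

# Toy BLAST-like table; column names follow the issue, values are invented.
fdf = pd.DataFrame({
    "qseqid": ["q1", "q1", "q2", "q2"],
    "sseqid": ["s1", "s2", "s3", "s4"],
    "evalue": [1e-50, 1e-10, 1e-5, 1e-30],
})

with warnings.catch_warnings():
    # Silence the include_groups deprecation warning that pandas 2.2 raises here.
    warnings.simplefilter("ignore")
    best = (fdf.groupby("qseqid")
               .apply(lambda group: group.loc[group["evalue"] == group["evalue"].min()])
               .reset_index(drop=True))

# Expected to be True on pandas 2.x and False on pandas 3.x.
has_key = "qseqid" in best.columns
print(pd.__version__, "-> qseqid kept:", has_key)
```

If has_key is False, any later best["qseqid"] lookup fails with the KeyError reported here.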

Workaround in place

CI pins pandas<3 in requirements-dev.txt so the smoke test passes against the API the codebase was written for. Users running odp from source on a fresh pandas 3.x install will hit the same error.

Fix scope

grep -n "groupby" source/*.py scripts/odp scripts/odp_nway_rbh scripts/odp_filechecker shows ~85 groupby sites. The problematic pattern is specifically groupby(col).apply(...).reset_index(drop=True). Hot spots:

  • source/odp_functions.py:312, 321, 379, 384
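Because the problematic chain is usually split across several lines, a line-based grep for "groupby" cannot tell it apart from harmless call sites. A rough heuristic for the audit, assuming a regex scan over each file's text is good enough for a first pass (find_suspect_chains and SUSPECT_CHAIN are hypothetical names, not part of odp):

```python
import re

# Heuristic: .groupby( ... .apply( ... .reset_index(drop=True within a short window.
# The 400-character windows are an arbitrary assumption; false positives are possible.
SUSPECT_CHAIN = re.compile(
    r"\.groupby\(.{0,400}?\.apply\(.{0,400}?\.reset_index\(\s*drop\s*=\s*True",
    re.DOTALL,
)


def find_suspect_chains(text):
    """Return 1-based line numbers where a suspect chain starts."""
    return [text.count("\n", 0, m.start()) + 1 for m in SUSPECT_CHAIN.finditer(text)]


sample = (
    "import pandas as pd\n"
    'fdf = (fdf.groupby("qseqid")\n'
    '         .apply(lambda group: group.loc[group["evalue"] == group["evalue"].min()])\n'
    "         .reset_index(drop=True))\n"
    'totals = fdf.groupby("qseqid").size()\n'
)
print(find_suspect_chains(sample))  # prints [2]
```

To scan real files, the text could come from pathlib.Path(path).read_text(); every hit still needs a manual look, since the pattern does not parse Python.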

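Migration: either pass include_groups=False (available from pandas 2.2) and then call reset_index() without drop=True, which restores qseqid from the index but also turns the original row index into an extra column (e.g. level_1 when the index is unnamed) that has to be dropped, or rewrite as a vectorized form, e.g.: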

fdf = fdf.loc[fdf.groupby("qseqid")["evalue"].transform("min") == fdf["evalue"]]
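The vectorized form above can be wrapped in a small helper so each call site is migrated the same way. This is a sketch, not the project's implementation: keep_min_evalue_hits is a hypothetical name, and the trailing reset_index(drop=True) is an assumption made to mirror the fresh index the original idiom produced.

```python
import pandas as pd


def keep_min_evalue_hits(df, key="qseqid", score="evalue"):
    """Keep every row whose score equals the minimum score within its key group."""
    # transform("min") broadcasts each group's minimum back onto the group's rows,
    # so the grouping column is never removed from the frame.
    group_min = df.groupby(key)[score].transform("min")
    return df.loc[df[score] == group_min].reset_index(drop=True)


# Toy input; values are invented for illustration.
hits = pd.DataFrame({
    "qseqid": ["q2", "q1", "q1", "q2"],
    "sseqid": ["s3", "s1", "s2", "s4"],
    "evalue": [1e-5, 1e-50, 1e-10, 1e-30],
})
best_hits = keep_min_evalue_hits(hits)
print(best_hits)
```

Ties are kept, as in the original idiom. One difference to check during the audit: the result keeps the input row order, whereas groupby().apply() returns rows ordered by group key, so code that relies on that ordering would need an explicit sort_values on the key.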

A follow-up PR should audit all ~85 sites and migrate each affected one to a form that runs cleanly on both pandas 2.x and 3.x.
