chore: Refactor makefile and update dependencies in pyproject.toml by mikita-sakalouski · Pull Request #250 · Nike-Inc/koheesio

mikita-sakalouski · 2026-05-04T21:59:10Z

Summary

Fixes two unrelated import errors that surface when running koheesio on Python 3.12 with PySpark 3.5.x, and tightens the make dev workflow.

`pandas` extra now sources pandas via `pyspark[pandas-on-spark]`

Before:
pandas = ["pandas>=1.3", "setuptools", "numpy<2.0.0", "pandas-stubs"]

After:
pandas = ["pyspark[pandas-on-spark]>=3.2.0"]

koheesio.pandas is implemented on top of pyspark.pandas (PandasStep calls import_pandas_based_on_pyspark_version from koheesio.spark.utils), so the pandas extra always required pyspark to function. Routing it through pyspark[pandas-on-spark] lets pyspark drive pandas/pyarrow/numpy version selection (the extra declares pandas>=1.0.5, pyarrow>=4.0.0, numpy<2,>=1.15) instead of duplicating those bounds in koheesio.
This was triggered by ImportError: cannot import name '_builtin_table' from 'pandas.core.common' in pyspark/pandas/groupby.py:50. PySpark <=3.5.x imports a private symbol that pandas removed in 2.2.0; the PySpark-side fix only landed in Spark 4.0.
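
The version constraint described above (pandas 2.2.0 or newer breaks PySpark releases before 4.0) can be checked before importing `pyspark.pandas`. The following is a minimal sketch, not part of koheesio: the helper names `release_tuple`, `is_known_bad_combo`, and `installed_version` are hypothetical, and the parsing only looks at the leading `major.minor` of each version string.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_known_bad_combo(pyspark_version: str, pandas_version: str) -> bool:
    """True for PySpark < 4.0 paired with pandas >= 2.2 (the case in this PR)."""
    return (
        release_tuple(pyspark_version) < (4, 0)
        and release_tuple(pandas_version) >= (2, 2)
    )


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    spark = installed_version("pyspark")
    pandas_ = installed_version("pandas")
    if spark and pandas_ and is_known_bad_combo(spark, pandas_):
        print(f"pyspark {spark} with pandas {pandas_}: expect the _builtin_table ImportError")
```

Keeping the comparison in a pure function makes it easy to test without either library installed.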

`pyspark` extra installs `setuptools` on Python 3.12+

pyspark = [
  "pyspark>=3.2.0",
  "pyarrow>13",
  "setuptools; python_version >= '3.12'",
]
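
The environment marker `python_version >= '3.12'` means setuptools is installed only on interpreters at or above 3.12. The description does not name the import error this addresses. One plausible reading, stated here as an assumption, is that it involves `distutils`, which was removed from the standard library in Python 3.12 and which setuptools still provides. A rough sketch of both checks, with hypothetical helper names:

```python
import importlib.util
import sys
from typing import Sequence


def marker_applies(version_info: Sequence[int]) -> bool:
    """Mirror the marker `python_version >= '3.12'` using (major, minor)."""
    return tuple(version_info[:2]) >= (3, 12)


def distutils_importable() -> bool:
    """Report whether a `distutils` module can be located.

    Assumption: distutils is the module the setuptools dependency restores.
    """
    return importlib.util.find_spec("distutils") is not None


if __name__ == "__main__":
    print("marker applies:", marker_applies(sys.version_info))
    print("distutils importable:", distutils_importable())
```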

- Updated the `dev` target in the makefile to check if already in the dev hatch shell before executing the command.
- Modified pandas and pyspark dependencies in pyproject.toml to ensure compatibility with Python 3.12 and to streamline dependency management.

These changes enhance the development experience and maintain compatibility with the latest Python standards.
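
The Makefile change itself is not shown here. As an illustration of the "already in the dev shell" check, the sketch below assumes Hatch exposes the active environment name through the `HATCH_ENV_ACTIVE` environment variable. The function names are hypothetical, and the real `dev` target may detect the shell differently.

```python
import os
from typing import Mapping, Optional


def active_hatch_env(environ: Mapping[str, str]) -> Optional[str]:
    """Return the active Hatch environment name, or None when unset or empty.

    Assumption: Hatch sets HATCH_ENV_ACTIVE inside an activated environment.
    """
    return environ.get("HATCH_ENV_ACTIVE") or None


def should_enter_shell(environ: Mapping[str, str], target: str = "dev") -> bool:
    """True when the target environment is not already the active one."""
    return active_hatch_env(environ) != target


if __name__ == "__main__":
    if should_enter_shell(os.environ):
        print("not in the dev environment; a new shell would be started")
    else:
        print("already in the dev environment; nothing to do")
```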

- Added notes regarding compatibility between pyspark versions and pandas, specifically addressing the need for pandas<2.2 when using pyspark<4.0.
- Updated extra-dependencies to include pandas<2.2 for specific pyspark versions to ensure compatibility and prevent import issues.

These changes enhance dependency management and clarify version constraints for users.

mikita-sakalouski requested a review from a team as a code owner May 4, 2026 21:59

mikita-sakalouski enabled auto-merge (squash) May 6, 2026 13:46

mikita-sakalouski disabled auto-merge May 6, 2026 13:48

mikita-sakalouski merged commit 676d98c into main May 6, 2026
26 of 28 checks passed

mikita-sakalouski deleted the hotfix/pandas branch May 6, 2026 13:48

mikita-sakalouski merged 2 commits into main from hotfix/pandas