
chore: Refactor makefile and update dependencies in pyproject.toml #250

Merged: mikita-sakalouski merged 2 commits into main from hotfix/pandas on May 6, 2026

@mikita-sakalouski
Contributor

Summary

Fixes two unrelated import errors that surface when running koheesio on Python 3.12 with PySpark 3.5.x, and tightens the make dev workflow.

pandas extra now sources pandas via pyspark[pandas-on-spark]

Before:
pandas = ["pandas>=1.3", "setuptools", "numpy<2.0.0", "pandas-stubs"]

After:
pandas = ["pyspark[pandas-on-spark]>=3.2.0"]

  • koheesio.pandas is implemented on top of pyspark.pandas (PandasStep calls import_pandas_based_on_pyspark_version from koheesio.spark.utils), so the pandas extra always required pyspark to function. Routing it through pyspark[pandas-on-spark] lets pyspark drive pandas/pyarrow/numpy version selection (the extra declares pandas>=1.0.5, pyarrow>=4.0.0, numpy<2,>=1.15) instead of duplicating those bounds in koheesio.
  • The change was prompted by ImportError: cannot import name '_builtin_table' from 'pandas.core.common', raised in pyspark/pandas/groupby.py:50. PySpark <=3.5.x imports a private symbol that pandas removed in 2.2.0; the PySpark-side fix only landed in Spark 4.0.
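
For anyone who wants to verify an environment before hitting the ImportError, here is a minimal sketch of the version rule described above (pandas must be below 2.2 whenever pyspark is below 4.0). The helper names `parse_version` and `pandas_on_spark_compatible` are hypothetical and not part of koheesio; the thresholds are taken from this PR description.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '3.5.1' -> (3, 5, 1)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric component such as 'dev0'.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def pandas_on_spark_compatible(pyspark_version: str, pandas_version: str) -> bool:
    """Hypothetical helper: False for the pairing this PR reports as broken."""
    if parse_version(pyspark_version) >= (4, 0):
        # Per the PR description, the PySpark-side fix landed in Spark 4.0.
        return True
    # PySpark below 4.0 needs a pandas that still ships _builtin_table.
    return parse_version(pandas_version) < (2, 2)
```

To check the packages actually installed, the two version strings can be read with `importlib.metadata.version("pyspark")` and `importlib.metadata.version("pandas")` from the standard library.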

pyspark extra installs setuptools on Python 3.12+

pyspark = [
  "pyspark>=3.2.0",
  "pyarrow>13",
  "setuptools; python_version >= '3.12'",
]
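
The description does not spell out why setuptools is needed. A likely reason, stated here as an assumption, is that Python 3.12 removed `distutils` from the standard library and setuptools ships a replacement that PySpark 3.x can import. The sketch below mirrors the `python_version >= '3.12'` marker and checks whether `distutils` can be resolved; both function names are hypothetical.

```python
import importlib.util
import sys


def marker_applies(version_info=sys.version_info) -> bool:
    """Mirror the environment marker python_version >= '3.12'."""
    return tuple(version_info[:2]) >= (3, 12)


def distutils_available() -> bool:
    """True if 'distutils' resolves, from the stdlib or via setuptools."""
    return importlib.util.find_spec("distutils") is not None


if __name__ == "__main__":
    # Assumption: on 3.12+ a missing distutils points at a missing setuptools.
    if marker_applies() and not distutils_available():
        print("distutils not found; installing setuptools may be required")
```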

- Updated the `dev` target in the makefile to check if already in the dev hatch shell before executing the command.
- Modified pandas and pyspark dependencies in pyproject.toml to ensure compatibility with Python 3.12 and to streamline dependency management.

These changes avoid re-entering the dev shell when it is already active and keep the pandas and pyspark extras installable on Python 3.12.
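
The Makefile itself is not shown in this conversation, so the following is only a sketch of the "already in the dev shell" check, written in Python. It assumes Hatch exposes the active environment name through the `HATCH_ENV_ACTIVE` environment variable and that the target would otherwise run `hatch shell dev`; the function names are hypothetical.

```python
import os


def already_in_hatch_env(env_name: str = "dev", environ=None) -> bool:
    """True if the given Hatch environment is already active (assumed variable)."""
    environ = os.environ if environ is None else environ
    return environ.get("HATCH_ENV_ACTIVE") == env_name


def dev_command(environ=None) -> list[str]:
    """Return the command to run, or an empty list if nothing needs to be done."""
    if already_in_hatch_env("dev", environ):
        # Already inside the dev shell: do not spawn a nested one.
        return []
    return ["hatch", "shell", "dev"]
```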
@mikita-sakalouski mikita-sakalouski requested a review from a team as a code owner May 4, 2026 21:59
- Added notes regarding compatibility between pyspark versions and pandas, specifically addressing the need for pandas<2.2 when using pyspark<4.0.
- Updated extra-dependencies to include pandas<2.2 for specific pyspark versions to ensure compatibility and prevent import issues.

These changes enhance dependency management and clarify version constraints for users.
@mikita-sakalouski mikita-sakalouski enabled auto-merge (squash) May 6, 2026 13:46
@mikita-sakalouski mikita-sakalouski disabled auto-merge May 6, 2026 13:48
@mikita-sakalouski mikita-sakalouski merged commit 676d98c into main May 6, 2026
26 of 28 checks passed
@mikita-sakalouski mikita-sakalouski deleted the hotfix/pandas branch May 6, 2026 13:48