chore: Refactor makefile and update dependencies in pyproject.toml #250
Merged
- Updated the `dev` target in the makefile to check if already in the dev hatch shell before executing the command.
- Modified pandas and pyspark dependencies in pyproject.toml to ensure compatibility with Python 3.12 and to streamline dependency management.

These changes enhance the development experience and maintain compatibility with the latest Python standards.

- Added notes regarding compatibility between pyspark versions and pandas, specifically addressing the need for pandas<2.2 when using pyspark<4.0.
- Updated extra-dependencies to include pandas<2.2 for specific pyspark versions to ensure compatibility and prevent import issues.

These changes enhance dependency management and clarify version constraints for users.
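
The version rule in the notes above (pandas<2.2 whenever pyspark<4.0) can be written as a small check. This is only an illustrative sketch, not code from this PR: the helper names `parse_version` and `pandas_is_compatible` are hypothetical, and the parser deliberately handles only plain dotted numeric versions.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '3.5.1' into (3, 5, 1), stopping at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            # e.g. 'dev0' or '0rc1': ignore this and everything after it
            break
        parts.append(int(piece))
    return tuple(parts)


def pandas_is_compatible(pyspark_version: str, pandas_version: str) -> bool:
    """Hypothetical helper: reject pyspark<4.0 combined with pandas>=2.2."""
    if parse_version(pyspark_version) < (4, 0):
        return parse_version(pandas_version) < (2, 2)
    # No upper bound on pandas is assumed here for pyspark>=4.0.
    return True
```

In practice the two version strings could come from `importlib.metadata.version("pyspark")` and `importlib.metadata.version("pandas")`; the actual enforcement in this PR is done through the dependency pins in pyproject.toml, not a runtime check.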
Summary
Fixes two unrelated import errors that surface when running koheesio on Python 3.12 with PySpark 3.5.x, and tightens the `make dev` workflow.

`pandas` extra now sources pandas via `pyspark[pandas-on-spark]`

Before: `pandas = ["pandas>=1.3", "setuptools", "numpy<2.0.0", "pandas-stubs"]`

After: `pandas = ["pyspark[pandas-on-spark]>=3.2.0"]`

`koheesio.pandas` is implemented on top of `pyspark.pandas` (`PandasStep` calls `import_pandas_based_on_pyspark_version` from `koheesio.spark.utils`), so the `pandas` extra always required pyspark to function. Routing it through `pyspark[pandas-on-spark]` lets pyspark drive pandas/pyarrow/numpy version selection (the extra declares `pandas>=1.0.5`, `pyarrow>=4.0.0`, `numpy<2,>=1.15`) instead of duplicating those bounds in koheesio.

`ImportError: cannot import name '_builtin_table' from 'pandas.core.common'` in `pyspark/pandas/groupby.py:50`. PySpark <=3.5.x imports a private symbol that pandas removed in 2.2.0; the PySpark-side fix only landed in Spark 4.0.

`pyspark` extra installs `setuptools` on Python 3.12+