Replace many Pandas operations with NumPy #198

Open
JCGoran wants to merge 4 commits into stefmolin:main from JCGoran:jelic/feature/vectorize

Conversation

@JCGoran
Contributor

@JCGoran JCGoran commented Jul 15, 2024

Describe your changes

  • Use NumPy instead of Pandas to avoid the overhead.

Perf before:

320361688 function calls (316213771 primitive calls) in 116.715 seconds

Perf after:

79419311 function calls (78769517 primitive calls) in 43.085 seconds

which is roughly a 2.7x speedup, more or less in line with the circular shapes.
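
The totals above have the shape of cProfile summary lines. As a minimal sketch of how such a measurement could be collected, the snippet below profiles a placeholder workload; `run_morph` and `profile_summary` are hypothetical names for illustration and are not part of this PR.

```python
import cProfile
import io
import pstats


def run_morph(iterations: int = 10_000) -> float:
    """Hypothetical stand-in for the morphing workload being profiled."""
    total = 0.0
    for i in range(iterations):
        total += i ** 0.5
    return total


def profile_summary(func, *args, **kwargs) -> str:
    """Run func under cProfile and return the text report as a string."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats('cumulative').print_stats(5)  # show the top 5 entries
    return buffer.getvalue()


report = profile_summary(run_morph)

# Ratio of the two wall-clock totals quoted in the PR description.
speedup = 116.715 / 43.085
```

The first line of `report` is the "function calls ... in ... seconds" summary, which is the part quoted in the description.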

Checklist

  • Test cases have been modified/added to cover any code changes.
  • Docstrings have been modified/created for any code changes.
  • All linting and formatting checks pass (see the contributing guidelines for more information).

@github-actions github-actions bot added the testing, shapes, data, plotting, and morpher labels Jul 15, 2024

@github-actions github-actions bot left a comment

Congratulations on making your first pull request to Data Morph! Please familiarize yourself with the contributing guidelines, if you haven't already.

@stefmolin
Owner

Thanks for the PR, @JCGoran! As I'm sure you've seen, I have a backlog to get through 😄 I hope to get to this in the next few weeks.

@codecov

codecov bot commented Jul 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.53%. Comparing base (e440ee7) to head (0a25272).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #198   +/-   ##
=======================================
  Coverage   98.53%   98.53%           
=======================================
  Files          58       58           
  Lines        1907     1915    +8     
  Branches      114      114           
=======================================
+ Hits         1879     1887    +8     
  Misses         25       25           
  Partials        3        3           
Files with missing lines Coverage Δ
src/data_morph/data/dataset.py 74.07% <100.00%> (+0.65%) ⬆️
src/data_morph/data/stats.py 100.00% <100.00%> (ø)
src/data_morph/morpher.py 100.00% <100.00%> (ø)
src/data_morph/plotting/static.py 100.00% <100.00%> (ø)
tests/data/test_stats.py 100.00% <100.00%> (ø)
tests/test_morpher.py 100.00% <100.00%> (ø)

@stefmolin stefmolin added this to the 0.3.0 milestone Jul 16, 2024
Owner

@stefmolin stefmolin left a comment

Let's start by pulling the LineCollection changes into a separate PR.

@JCGoran JCGoran mentioned this pull request Jul 22, 2024
3 tasks
@JCGoran JCGoran force-pushed the jelic/feature/vectorize branch from 298870b to e708f6c Compare July 22, 2024 22:08
@github-actions github-actions bot removed the shapes label Jul 22, 2024
@JCGoran JCGoran changed the title from "Refactor to use more numpy functions internally" to "Replace many Pandas operations with NumPy" Jul 22, 2024
@JCGoran JCGoran requested a review from stefmolin July 30, 2024 18:04
@JCGoran
Contributor Author

JCGoran commented Sep 24, 2024

Bump, this is more or less ready for review as-is.

@stefmolin
Owner

I haven't forgotten 😄 I'm going to work through the PyCon Taiwan sprint PRs first since I couldn't get to them all at the event, and I want to think more about the design of the internals here. I'm traveling right now and will have very limited time for the next couple of weeks.

A dataset with columns x and y.
x : Iterable[Number]
The ``x`` value of the dataset.

Owner

Suggested change

Comment on lines +482 to +485
x, y = (
    start_shape.df['x'].to_numpy(copy=True),
    start_shape.df['y'].to_numpy(copy=True),
)
Owner

Can't we use the _x and _y from the Dataset.__init__() changes?

Suggested change
- x, y = (
-     start_shape.df['x'].to_numpy(copy=True),
-     start_shape.df['y'].to_numpy(copy=True),
- )
+ x, y = start_shape._x, start_shape._y

Owner

I'm also wondering if we need to copy here, when we copy in the loop.
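
One way to reason about this is sketched below, using plain NumPy arrays in place of `start_shape._x`. The `perturb` helper is hypothetical and assumes, as the comment above describes, that each iteration copies before modifying anything. Under that assumption the source array is never written to, so an up-front copy adds nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
source_x = np.array([1.0, 2.0, 3.0])  # stands in for start_shape._x (assumption)
original = source_x.copy()  # kept only to verify that nothing was mutated


def perturb(values: np.ndarray) -> np.ndarray:
    """Hypothetical perturbation: copy first, then nudge one point in place."""
    new_values = values.copy()
    idx = rng.integers(len(new_values))
    new_values[idx] += rng.normal(scale=0.1)
    return new_values


x = source_x  # no up-front copy
for _ in range(100):
    x = perturb(x)  # rebinding x never writes into source_x

unchanged = np.array_equal(source_x, original)
```

If the loop ever wrote into `x` in place before copying, the up-front copy would be needed again.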

Comment on lines +53 to +54
self._x = self.df['x'].to_numpy()
self._y = self.df['y'].to_numpy()
Owner

Suggested change
- self._x = self.df['x'].to_numpy()
- self._y = self.df['y'].to_numpy()
+ self._x, self._y = self.df[['x', 'y']].to_numpy().T
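
A small sketch of why the one-line form gives the same arrays as the two separate conversions, using a toy DataFrame (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# Two separate conversions, one per column.
x_a = df['x'].to_numpy()
y_a = df['y'].to_numpy()

# Single conversion: shape (n, 2), transposed to (2, n), then the two rows unpacked.
x_b, y_b = df[['x', 'y']].to_numpy().T

same = np.array_equal(x_a, x_b) and np.array_equal(y_a, y_b)
```

One difference to keep in mind: when the two columns have different dtypes, the combined conversion produces a single array with a common dtype, whereas the per-column form keeps each column's own dtype.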

y1 : Iterable[Number]
The original value of ``y``.
x2 : Iterable[Number]
The perturbed value of ``x``.
Owner

Extra space:

Suggested change
The perturbed value of ``x``.
The perturbed value of ``x``.

Comment on lines +55 to +56
self._x = self.df['x'].to_numpy()
self._y = self.df['y'].to_numpy()
Owner

Should these be properties? If we change the DataFrame, these will no longer match.
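
A minimal sketch of the property-based alternative, assuming a pared-down class; `DatasetSketch` and its `x`/`y` properties are hypothetical names and not the actual `Dataset` API. Because the arrays are derived from `df` on access, they cannot drift out of sync with it.

```python
import numpy as np
import pandas as pd


class DatasetSketch:
    """Hypothetical, pared-down stand-in for Dataset."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    @property
    def x(self) -> np.ndarray:
        # Derived from the current DataFrame on every access.
        return self.df['x'].to_numpy()

    @property
    def y(self) -> np.ndarray:
        return self.df['y'].to_numpy()


dataset = DatasetSketch(pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]}))
dataset.df = dataset.df.assign(x=dataset.df['x'] * 10)  # swap in a changed DataFrame
x_after = dataset.x
```

The trade-off is that each access repeats the conversion, so code in a hot loop would want to read the property once and hold on to the result.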

morphed_data = perturbed_data
if self._is_close_enough(x, y, *perturbed_data):
x, y = perturbed_data
morphed_data = pd.DataFrame({'x': x, 'y': y})
Owner

Creating the DataFrame isn't necessary inside the loop now that we've switched to NumPy. We can have _record_frames() only make the DataFrame if we need to save the CSV. The plot() function can be reworked to use NumPy. Since this method has to return a DataFrame at the end, we can build it once outside of this loop instead of doing it thousands of times.
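
A rough sketch of that loop shape, assuming simplified stand-ins for the real methods: `perturb`, `is_close_enough`, and `morph` here are hypothetical helpers written for illustration and do not reproduce the actual perturbation or statistics checks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)


def perturb(x: np.ndarray, y: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical perturbation: jitter every point slightly."""
    return (
        x + rng.normal(scale=0.01, size=x.size),
        y + rng.normal(scale=0.01, size=y.size),
    )


def is_close_enough(x1, y1, x2, y2, tol: float = 0.05) -> bool:
    """Hypothetical acceptance test that compares the means only."""
    return bool(
        abs(x1.mean() - x2.mean()) < tol and abs(y1.mean() - y2.mean()) < tol
    )


def morph(x: np.ndarray, y: np.ndarray, iterations: int = 1000) -> pd.DataFrame:
    for _ in range(iterations):
        candidate = perturb(x, y)
        if is_close_enough(x, y, *candidate):
            x, y = candidate  # stay in NumPy inside the loop
    return pd.DataFrame({'x': x, 'y': y})  # built once, after the loop


result = morph(np.linspace(0, 1, 50), np.linspace(1, 2, 50))
```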

@stefmolin stefmolin removed this from the 0.3.0 milestone Feb 17, 2025