
DONT MERGE GSOC26: Add error in dataset generator #454

Draft
OmBiradar wants to merge 7 commits into lincc-frameworks:main from OmBiradar:add_error_to_generator

Conversation

@OmBiradar
Contributor

Change Description

Closes #156

Solution Description

Added a new column which has a constant value of 1.

Deciding on what kind of tests to write

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

@OmBiradar
Contributor Author

OmBiradar commented Feb 27, 2026

About the function

The generator could be extended to create complex, n-depth nested layers.

Is this something that can be worked on?

Tests

I was unsure what the tests should be

The current tests check one case where an int is passed and one case where a dict is passed as input.

Also, the 2nd test passes when given a dict like this as input:

data = generate_data(20, {"A": {"C": 20}, "B": 30})

but this call actually raises an error (below). This needs to be considered.

TypeError: unsupported operand type(s) for *: 'dict' and 'int'
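
For readers following along, here is a minimal sketch of how that message can arise. The function `layer_sizes` is a hypothetical stand-in, not the library's code; it only assumes that each layer size gets multiplied by the base row count, which is what the error text suggests.

```python
def layer_sizes(n_base, n_layers):
    """Hypothetical stand-in: total nested rows per layer."""
    if isinstance(n_layers, int):
        n_layers = {"nested": n_layers}
    # Each value is assumed to be an int; a nested dict breaks the multiplication.
    return {name: size * n_base for name, size in n_layers.items()}


# A flat dict of ints works as expected.
flat = layer_sizes(20, {"A": 20, "B": 30})

# A nested dict reaches the multiplication with a dict on the left-hand side.
try:
    layer_sizes(20, {"A": {"C": 20}, "B": 30})
    message = None
except TypeError as exc:
    message = str(exc)

print(flat)
print(message)
```

A test for this case could either assert that the error is raised, or the generator could validate its input up front and raise a clearer message; which one is preferable is a design choice for the maintainers.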

I guess the testing can be made better?

Could I freely try my own way to improve the test?
@nevencaplar @hombit @dougbrn

@github-actions

github-actions Bot commented Feb 27, 2026

Before [1f834d0]  After [bee8b3f]  Ratio  Benchmark (Parameter)
91.9±10ms         75.8±10ms        ~0.82  benchmarks.ReadFewColumnsHTTPS.time_run
138M              148M             1.07   benchmarks.CountNestedBy.peakmem_run
9.61±0.1ms        10.0±0.04ms      1.04   benchmarks.NestedFrameQuery.time_run
10.5±0.1ms        10.8±0.2ms       1.03   benchmarks.NestedFrameAddNested.time_run
110M              113M             1.03   benchmarks.NestedFrameQuery.peakmem_run
109M              111M             1.02   benchmarks.NestedFrameReduce.peakmem_run
46.6±0.2ms        47.6±0.3ms       1.02   benchmarks.ReassignHalfOfNestedSeries.time_run
63.8±0.8ms        64.6±0.7ms       1.01   benchmarks.CountNestedBy.time_run
1.09±0.02ms       1.10±0.02ms      1.01   benchmarks.NestedFrameReduce.time_run
1.2G              1.2G             1.00   benchmarks.ReadFewColumnsS3.peakmem_run


@OmBiradar
Contributor Author

Many pytest tests are failing because the error column is new and several tests do not expect it.

I will go through the code that is testing the generate dataset feature.

Collaborator

@dougbrn dougbrn left a comment


Hi there, I think this is a PR related to Google Summer of Code, so I will just be providing some comments here:

The main change seems correct (with a slight preference on column name listed below)

For tests, this change actually breaks quite a bit of the existing unit tests, which is expected, as many of our tests rely on this function to create input datasets, and now that the input has changed the outputs are all carrying this additional column that will cause a good chunk of the validation checks to fail. See the CI failures on this PR for details.

On a proper PR that would merge this, we'd want to see all of those existing tests be updated. For this PR, I would just ask to write a small test that runs generate_data and verifies that the error nested sub-column is present in the resulting nestedframe.
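
A rough sketch of the kind of presence check described above, for illustration only. To stay self-contained it does not call the real generate_data; `make_nested_rows` is a hypothetical stand-in that returns plain dicts, and the sub-column name `flux_error` is an assumption, since the thread does not state the final name. The constant value of 1 follows the PR description.

```python
import random


def make_nested_rows(n_base, n_layer, seed=None):
    """Hypothetical stand-in generator: one dict of sub-columns per base row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_base):
        flux = [rng.uniform(0.0, 100.0) for _ in range(n_layer)]
        rows.append(
            {
                "t": [rng.uniform(0.0, 20.0) for _ in range(n_layer)],
                "flux": flux,
                # Assumed name; constant error of 1, as in the PR description.
                "flux_error": [1.0] * n_layer,
            }
        )
    return rows


def test_error_subcolumn_present():
    rows = make_nested_rows(10, 5, seed=1)
    for nested in rows:
        # The error sub-column exists and lines up with the flux values.
        assert "flux_error" in nested
        assert len(nested["flux_error"]) == len(nested["flux"])
        assert all(err == 1.0 for err in nested["flux_error"])


test_error_subcolumn_present()
```

In the actual test suite the same assertions would run against the nested column of the frame returned by generate_data rather than against plain dicts.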

Comment thread src/nested_pandas/datasets/generation.py Outdated
@OmBiradar OmBiradar changed the title Add error in dataset generator DONT MERGE GSOC26: Add error in dataset generator Feb 27, 2026
@OmBiradar
Contributor Author

For tests, this change actually breaks quite a bit of the existing unit tests, which is expected, as many of our tests rely on this function to create input datasets, and now that the input has changed the outputs are all carrying this additional column that will cause a good chunk of the validation checks to fail. See the CI failures on this PR for details.

Yes sure

On a proper PR that would merge this, we'd want to see all of those existing tests be updated.

This would take time, though I believe I can get it done. Since I am exploring the project anyway, the tests will also help me.

For this PR, I would just ask to write a small test that runs generate_data and verifies that the error nested sub-column is present in the resulting nestedframe.

Sure this seems easy

@OmBiradar
Contributor Author

For this PR, I would just ask to write a small test that runs generate_data and verifies that the error nested sub-column is present in the resulting nestedframe.

I have completed this in the latest commit.

@OmBiradar
Contributor Author

OmBiradar commented Feb 28, 2026

On a proper PR that would merge this, we'd want to see all of those existing tests be updated.

All the tests are now updated! Yay!

Some thoughts:
@dougbrn @hombit

I feel that the doctests should be skipped until a release is planned. I updated around 30 doctests, and it was a bit hectic.

When a release is about to be made, the doctests can be updated all at once.

Why is this good? I think docstrings are mostly read by users, so keeping their examples in sync during development would unnecessarily hinder progress.

@hombit
Collaborator

hombit commented Mar 2, 2026

I agree that it is a pain to update, but I also believe that CI should be green for a successful PR.

@codecov

codecov Bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.30%. Comparing base (1f834d0) to head (53637d6).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #454   +/-   ##
=======================================
  Coverage   97.30%   97.30%           
=======================================
  Files          19       19           
  Lines        2156     2156           
=======================================
  Hits         2098     2098           
  Misses         58       58           


@OmBiradar
Contributor Author

I agree that it is a pain to update, but I also believe that CI should be green for a successful PR.

True,

The only check that failed was a formatting error, so I expect CI will be green after a simple reformat.

Could you also review my approaches to generating Poisson-like errors in the main issue, #156?
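
For context, one common way to get a Poisson-like error is sketched below. This is not necessarily any of the approaches proposed in #156, which are not shown in this thread. It relies on the fact that a Poisson-distributed count N has standard deviation sqrt(N), and it assumes the flux values can be treated as counts; the helper name `poisson_like_errors` is hypothetical.

```python
import math
import random


def poisson_like_errors(flux):
    """Return sqrt(flux) per point; negative values are clamped to zero first."""
    return [math.sqrt(max(value, 0.0)) for value in flux]


# Illustrative flux values only, drawn uniformly with a fixed seed.
rng = random.Random(42)
flux = [rng.uniform(0.0, 100.0) for _ in range(5)]
errors = poisson_like_errors(flux)

for value, err in zip(flux, errors):
    print(f"flux={value:.2f} error={err:.2f}")
```

Unlike a constant error of 1, this makes the error grow with the signal, so brighter points get larger absolute uncertainties but smaller relative ones.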

@hombit

@OmBiradar OmBiradar force-pushed the add_error_to_generator branch from 5ee1856 to fd4f487 Compare March 3, 2026 06:58
@delucchi-cmu delucchi-cmu added the GSOC26: WIP In-progress PRs for Google Summer of Code 2026 applicants label Mar 3, 2026
@hombit hombit marked this pull request as draft March 4, 2026 15:39
@hombit
Collaborator

hombit commented Mar 5, 2026

@OmBiradar I think we are almost done here! Could you please fix the pre-commit failure?

@OmBiradar
Contributor Author

OmBiradar commented Mar 5, 2026

@hombit
I was stuck in a clash between ruff and pytest.

I have figured it out now and am running local tests to confirm it works. I will push the changes soon.

@hombit
Collaborator

hombit commented Mar 5, 2026

Oh, right, sorry for that. Let us know if you spend too much time on that

@OmBiradar
Contributor Author

@hombit
I could solve them! YAY!!

Also, any time I run pytest, these 2 files are created:

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        nestedframe.parquet
        uv.lock

Could I add them to the .gitignore, please?

It has been a hassle not being able to use git add .

Signed-off-by: OmBiradar <ombiradar04@gmail.com>
This makes it easily understandable

Signed-off-by: OmBiradar <ombiradar04@gmail.com>
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
@OmBiradar OmBiradar force-pushed the add_error_to_generator branch from 5af2d0b to 53637d6 Compare March 5, 2026 18:31
@OmBiradar OmBiradar marked this pull request as ready for review March 5, 2026 18:31
@hombit
Collaborator

hombit commented Mar 5, 2026

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        nestedframe.parquet
        uv.lock

Could I add them to the .gitignore, please?

It has been a hassle not being able to use git add .

Please go ahead and open a PR to add uv.lock to our team's Python template, here.

For nestedframe.parquet, I think we should fix the test that produces it. Feel free to open an issue.

FYI, you can use git add -u instead of git add .

@hombit hombit marked this pull request as draft March 5, 2026 18:32
@hombit
Collaborator

hombit commented Mar 5, 2026

@OmBiradar We are keeping GSOC26 PRs as drafts; please read the canvas on Slack.

@OmBiradar
Contributor Author

@hombit

I am sorry, I forgot. Since all tests were passing, I just meant to notify you.

Sorry again.

@hombit
Collaborator

hombit commented Mar 5, 2026

@hombit

I am sorry, I forgot. Since all tests were passing, I just meant to notify you.

Sorry again.

No problem!

@hombit
Collaborator

hombit commented Mar 5, 2026

@OmBiradar Thanks for your time, I'm considering this as done!

@hombit hombit added GSOC26: Done PRs for Google Summer of Code 2026 applicants, that have completed code review. and removed GSOC26: WIP In-progress PRs for Google Summer of Code 2026 applicants labels Mar 5, 2026
@OmBiradar
Contributor Author

@hombit Thank you


Labels

GSOC26: Done PRs for Google Summer of Code 2026 applicants, that have completed code review.

Development

Successfully merging this pull request may close these issues.

Add error to generated dataset

4 participants