fix: bugs in sample pages new script #42

Merged

ninpnin merged 6 commits into dev from sample-pages-fix

Mar 13, 2026

Contributor

ninpnin commented Feb 27, 2026

Also switched to tqdm and added the ability to concatenate all samples into one CSV.
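
For readers who want a concrete picture of the single-CSV option, here is a minimal sketch. It is not the code from `src/sample_pages_new.py`: the function name `write_samples`, the column names, and the output filenames are all hypothetical, and it uses the standard-library `csv` module so that it runs on its own. `tqdm` is used for the progress bar when it is installed.

```python
import csv
import io

try:
    from tqdm import tqdm  # progress bar, as mentioned in the description
except ImportError:
    def tqdm(iterable, **kwargs):  # no-op fallback so the sketch still runs
        return iterable

# Hypothetical column layout; the real script's columns may differ.
FIELDS = ["decade", "file", "page"]


def _to_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def write_samples(samples_by_decade, single_file=True):
    """Return {filename: csv_text} for the sampled pages.

    `samples_by_decade` maps a decade (e.g. 1920) to a list of dicts with
    "file" and "page" keys. With `single_file=True` all decades are
    concatenated into one CSV; otherwise one CSV per decade is produced.
    """
    outputs = {}
    combined = []
    for decade in tqdm(sorted(samples_by_decade), desc="decades"):
        rows = [{"decade": decade, **row} for row in samples_by_decade[decade]]
        if single_file:
            combined.extend(rows)
        else:
            outputs[f"sample_{decade}.csv"] = _to_csv(rows)
    if single_file:
        outputs["sample.csv"] = _to_csv(combined)
    return outputs
```

Returning the CSV text instead of writing to disk keeps the sketch easy to test; a real script would write each entry to a file.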

ninpnin added 3 commits

February 27, 2026 12:09


          fix: docDate parsing; refactor: remove dependency on deprecated function

b9dd448


          refactor: remove unnecessary function

b049899


          refactor: enable writing output to a single file instead of one per decade

30eeaf1

ninpnin requested a review from mandlilaast

February 27, 2026 10:53

mandlilaast reviewed

src/sample_pages_new.py (outdated, resolved)

mandlilaast reviewed

src/sample_pages_new.py (outdated, resolved)

mandlilaast reviewed

src/sample_pages_new.py (outdated, resolved)

mandlilaast reviewed

src/sample_pages_new.py (resolved)

mandlilaast reviewed

src/sample_pages_new.py (resolved)

mandlilaast reviewed

src/sample_pages_new.py

                   parser = etree.XMLParser(remove_blank_text=True)
                   rows = []
                   for _, row in df.iterrows():

Contributor

mandlilaast Mar 11, 2026

Maybe we could merge this and the next for loop together? Otherwise it seems we are duplicating work we have already done (they appear to be the same lines).

Contributor Author

ninpnin Mar 11, 2026

Yeah, it looks messy, but it works and is also minuscule in terms of resource use compared to get_page_counts. Also, I would need to look into how this script works in more detail to refactor it, which would take a lot of time.

Contributor

mandlilaast Mar 12, 2026

Haha, okay, thanks! :)

mandlilaast requested changes

Contributor

mandlilaast left a comment

Small comments here and there; I proposed them mainly as suggestions.

But thank you for the code, and if you agree with my comments, please feel free to make the changes :)

ninpnin added 2 commits

March 11, 2026 15:01


          refactor: cleaner imports

ce57c55


          refactor: address review

c9fbe3a

ninpnin requested a review from mandlilaast

March 11, 2026 14:13


          refactor: better printouts

4a01df6

mandlilaast approved these changes

Contributor

mandlilaast left a comment

Yup, looks good! :)
Green light from me!

ninpnin merged commit fbf3737 into dev

1 check passed
