
Script for building parallel paragraph corpus#47

Merged
laurejt merged 4 commits into develop from feature/parallel-pars
Mar 10, 2026


@laurejt laurejt commented Mar 6, 2026

Associated Issue(s): resolves #41

Changes in this PR

  • Adds a one-off script for building the MTO parallel paragraph corpus

Notes

  • This script does not use the MTO webscrape, only the data in the side-by-side translations spreadsheet.
  • The input CSVs (exported from the side-by-side translations spreadsheet) are available here
  • The language mapping for the articles is hard-coded.
  • This script extracts only the parallel texts (i.e., rows) that correspond to paragraphs, so it ignores section titles.
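
The notes above can be illustrated with a minimal sketch. This is not the code in build_paragraph.py; the column names (`kind`, `source`, `target`), the output field names, and the `LANGS` mapping are all hypothetical placeholders chosen for illustration. It only shows the general shape: a hard-coded language mapping, a filter that keeps paragraph rows and skips section titles, and a 0-based counter for record IDs.

```python
import csv
import io
from typing import Iterator, TextIO

# Hypothetical hard-coded mapping: article ID -> (source, target) language codes.
LANGS = {"example-article": ("en", "fr")}


def paragraph_records(article_id: str, csv_file: TextIO) -> Iterator[dict]:
    """Yield one record per paragraph row, skipping section titles."""
    src_lang, tgt_lang = LANGS[article_id]
    count = 0
    for row in csv.DictReader(csv_file):
        # "kind", "source", and "target" are assumed column names.
        if row["kind"] != "paragraph":
            continue
        yield {
            "id": count,
            "article": article_id,
            "src_lang": src_lang,
            "tgt_lang": tgt_lang,
            "src": row["source"],
            "tgt": row["target"],
        }
        count += 1


# Small in-memory CSV standing in for one exported spreadsheet tab.
sample = io.StringIO(
    "kind,source,target\n"
    "title,Introduction,Introduction\n"
    "paragraph,First paragraph.,Premier paragraphe.\n"
    "paragraph,Second paragraph.,Second paragraphe.\n"
)
records = list(paragraph_records("example-article", sample))
```

In this sketch the title row is dropped before the counter advances, so IDs stay contiguous over paragraph rows only.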

Reviewer Checklist

  • Check that build_paragraph.py runs locally using the MTO-parallel-data files (expected input is a directory), and confirm that the local output matches mto-parallel-pars.jsonl
  • Confirm that the output JSONL has the fields specified in data-design.md
  • Confirm that the output JSONL's contents look reasonable
  • Confirm that the hard-coded language mapping is correct
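
One way a reviewer might automate the field check is sketched below. The required field names are passed in as a parameter because the actual list lives in data-design.md and is not reproduced here; the names used in the example call are placeholders, not the real schema.

```python
import json


def missing_fields(jsonl_text: str, required: set[str]) -> list[tuple[int, set[str]]]:
    """Return (line index, missing field names) for each incomplete record."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines()):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = required - set(record)
        if missing:
            problems.append((line_no, missing))
    return problems


# Placeholder field names; substitute the ones listed in data-design.md.
sample = '{"id": 0, "text": "a"}\n{"id": 1}\n'
problems = missing_fields(sample, {"id", "text"})
```

An empty result means every record carries all required fields. Comparing a local run against mto-parallel-pars.jsonl can then be a plain line-by-line equality check on the two files.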

@laurejt laurejt requested a review from tanhaow March 6, 2026 19:11

@tanhaow tanhaow left a comment


I tested the code and the output looks good. This branch is ready to merge.


Yields parallel paragraph records
"""
count = 0

Here the ID starts at 0, but the ID in build_sentence.py starts at 1. We may want to unify them.

Author


Thanks for flagging this. I must have missed it when reviewing build_sentence. I expected IDs to start at 0, since the ID really serves as an index rather than a meaningfully unique identifier.

I find that 1-indexing always leads to trouble in 0-indexed languages.
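
A small illustration of the point, using made-up data rather than anything from the corpus: when IDs come from `enumerate` with its default start of 0, an ID can be used directly as a list position, while 1-based IDs need an offset at every lookup.

```python
paragraphs = ["first", "second", "third"]

# 0-based IDs: the ID doubles as the position in the list.
zero_based = [{"id": i, "text": t} for i, t in enumerate(paragraphs)]
zero_ok = all(paragraphs[r["id"]] == r["text"] for r in zero_based)

# 1-based IDs: every lookup has to remember to subtract 1.
one_based = [{"id": i, "text": t} for i, t in enumerate(paragraphs, start=1)]
one_ok = all(paragraphs[r["id"] - 1] == r["text"] for r in one_based)
```

Both conventions work, but forgetting the `- 1` in the second case either raises `IndexError` on the last record or silently returns the wrong paragraph for the others.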

Co-authored-by: Hao Tan <tanhao@princeton.edu>
@laurejt laurejt merged commit 0bff9d6 into develop Mar 10, 2026
1 check passed
@laurejt laurejt deleted the feature/parallel-pars branch March 10, 2026 13:08