Script for building parallel paragraph corpus by laurejt · Pull Request #47 · Princeton-CDH/muse

laurejt · 2026-03-06T19:11:49Z

Associated Issue(s): resolves #41

Changes in this PR

Adds a one-off script for building the MTO parallel paragraph corpus.

Notes

This script does not use the MTO webscrape, only the data in the side-by-side translations spreadsheet.
The input CSVs (exported from the side-by-side translations spreadsheet) are available here
The language mapping for the articles is hard-coded.
This script extracts only the parallel texts (i.e., rows) that correspond to paragraphs, so section titles are ignored.
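
As a rough illustration of the approach described in these notes (not the actual contents of build_paragraph.py), the sketch below reads every CSV in an input directory, keeps only the rows marked as paragraphs, and yields one record per row. The column names (`type`, `text_a`, `text_b`), the `LANG_MAP` entry, the function name `iter_paragraphs`, and the record fields are all hypothetical; the real spreadsheet export and data-design.md may differ.

```python
import csv
from pathlib import Path

# Hypothetical hard-coded mapping from article (CSV file stem) to its language pair.
LANG_MAP = {"article1": ("en", "de")}


def iter_paragraphs(input_dir):
    """Yield parallel paragraph records from every CSV in input_dir."""
    count = 0  # 0-based running index across all files
    for csv_path in sorted(Path(input_dir).glob("*.csv")):
        lang_a, lang_b = LANG_MAP[csv_path.stem]
        with csv_path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Skip section titles and any other non-paragraph rows.
                if row["type"] != "paragraph":
                    continue
                yield {
                    "id": count,
                    "article": csv_path.stem,
                    lang_a: row["text_a"],
                    lang_b: row["text_b"],
                }
                count += 1
```

Each yielded dict could then be written as one line of JSONL with `json.dumps`.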

Reviewer Checklist

Check that build_paragraph.py runs locally using the MTO-parallel-data files (expected input is a directory). Confirm that the local output matches mto-parallel-pars.jsonl.
Confirm that the output JSONL has the fields specified in data-design.md
Confirm that the output JSONL's contents look reasonable
Confirm that the hard-coded language mapping is correct
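
The first two checklist items could be partly automated with something like the sketch below, which compares a locally built JSONL file against a reference copy and checks each record for a set of required fields. The helper names and the `required_fields` argument are assumptions for illustration; the actual field list lives in data-design.md and is not reproduced here.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_output(local_path, reference_path, required_fields):
    """Return a list of problems found; an empty list means all checks passed."""
    local = load_jsonl(local_path)
    reference = load_jsonl(reference_path)
    problems = []
    if len(local) != len(reference):
        problems.append(f"record count differs: {len(local)} vs {len(reference)}")
    for i, rec in enumerate(local):
        missing = set(required_fields) - set(rec)
        if missing:
            problems.append(f"record {i} is missing fields: {sorted(missing)}")
    # zip stops at the shorter file; the count check above reports any length mismatch.
    for i, (a, b) in enumerate(zip(local, reference)):
        if a != b:
            problems.append(f"record {i} differs from the reference")
    return problems
```

A reviewer would call `check_output` with the local output path, the path to mto-parallel-pars.jsonl, and the field names taken from data-design.md.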

tanhaow

I tested the code and the output looks good. This branch is ready to merge.

src/muse/parallel_corpus/build_paragraph.py

tanhaow · 2026-03-10T12:27:42Z

+
+    Yields parallel paragraph records
+    """
+    count = 0


Here the ID starts with 0, but the ID in build_sentence.py starts with 1. We may want to unify them.

Thanks for flagging this. I must have missed it when reviewing build_sentence. I did expect IDs to start at 0, since the ID really serves as an index rather than a meaningfully unique identifier.

I find that 1-indexing always leads to trouble in 0-indexed languages.
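
To make the 0-based versus 1-based distinction concrete, here is a small sketch (not taken from either script, with made-up data): `enumerate` counts from 0 by default, so the resulting IDs double as list positions, while `start=1` produces IDs that are off by one from them.

```python
paragraphs = ["first", "second", "third"]

# 0-based IDs double as list indices.
zero_based = [{"id": i, "text": t} for i, t in enumerate(paragraphs)]

# 1-based IDs need an explicit offset whenever they are used to look up a row.
one_based = [{"id": i, "text": t} for i, t in enumerate(paragraphs, start=1)]
```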

laurejt added 3 commits March 5, 2026 16:21

Add script for building parallel paragraph corpus

1586f0f

Ruff fixes

be75784

Update docs

b0f3f87

laurejt requested a review from tanhaow March 6, 2026 19:11

tanhaow approved these changes Mar 10, 2026

Update src/muse/parallel_corpus/build_paragraph.py

36e9ecd

Co-authored-by: Hao Tan <tanhao@princeton.edu>

laurejt merged commit 0bff9d6 into develop Mar 10, 2026
1 check passed

laurejt deleted the feature/parallel-pars branch March 10, 2026 13:08

laurejt merged 4 commits into develop from feature/parallel-pars
