
Feature/feature/term annotation #40

Merged
laurejt merged 4 commits into develop from feature/feature/term-annotation
Feb 25, 2026

Conversation


@laurejt laurejt commented Feb 25, 2026

Associated Issue(s): #37, #38

Changes in this PR

  • Prodigy recipe for Notion concept evaluation task
  • Script for building input for Notion concept evaluation task

Notes

  • You will need to install prodigy using a developer key. Consider creating a separate environment for running prodigy.

Reviewer Checklist

  • Confirm that build_notion_concept_tasks.py runs successfully locally. For this you can use the MT corpora in the project drive [here].
  • Confirm that the output of build_notion_concept_tasks.py has the following fields: tr_id, pair_id, model, scr_lang, tr_lang, src_text, ref_text, text, term.
  • Confirm that the output of build_notion_concept_tasks.py only includes English machine translations.
  • Check that the prodigy recipe runs locally using the output from build_notion_concept_tasks.py.
  • Confirm that the concept evaluation task's web interface looks good enough for now.
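The field check in the second item could be scripted rather than eyeballed. A minimal sketch (a hypothetical helper, not part of this PR; the field list is copied verbatim from the checklist, including "scr_lang", which may be intended as "src_lang"):

```python
import json

# Fields listed in the reviewer checklist. "scr_lang" is copied verbatim
# from the checklist and may be a typo for "src_lang" in the real output.
EXPECTED_FIELDS = {
    "tr_id", "pair_id", "model", "scr_lang", "tr_lang",
    "src_text", "ref_text", "text", "term",
}

def records_missing_fields(path):
    """Yield (line_number, missing_fields) for JSONL records lacking any expected field."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            missing = EXPECTED_FIELDS - json.loads(line).keys()
            if missing:
                yield lineno, missing
```

An empty result over the script output would satisfy this checklist item.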

@laurejt laurejt requested a review from rlskoeser February 25, 2026 14:40
@laurejt laurejt self-assigned this Feb 25, 2026
@laurejt laurejt added 👇this sprint Add Issue to ZenHub and removed 👇this sprint Add Issue to ZenHub labels Feb 25, 2026
@laurejt laurejt removed their assignment Feb 25, 2026

@rlskoeser rlskoeser left a comment


I wasn't sure which file in the google drive I was supposed to use. I tried running the script like this and I get an error about a missing id column:

python src/muse/annotation/build_notion_concept_tasks.py out.jsonl notion-sent-translations-madlad.jsonl --mt-corpus notion-sent-translations-madlad.jsonl

Am I giving it the wrong file?

Co-authored-by: Rebecca Sutton Koeser <rlskoeser@users.noreply.github.com>
Author

laurejt commented Feb 25, 2026

I wasn't sure which file in the google drive I was supposed to use. I tried running the script like this and I get an error about a missing id column:

python src/muse/annotation/build_notion_concept_tasks.py out.jsonl notion-sent-translations-madlad.jsonl --mt-corpus notion-sent-translations-madlad.jsonl

Am I giving it the wrong file?

Ah, sorry, I forgot about the full input the script needs. The parallel sentence corpus notion-parallel-sent.jsonl (which has the term information) must be provided first, with the series of machine translation corpora (notion-sent-translations-*.jsonl) passed after the --mt-corpus flag.
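The argument order described here could be sketched as an argparse layout. This is a hypothetical reconstruction from the commands quoted in this thread, not the actual parser in build_notion_concept_tasks.py:

```python
import argparse

# Hypothetical sketch of the CLI described in this thread; the real
# build_notion_concept_tasks.py parser may differ.
def build_parser():
    parser = argparse.ArgumentParser(
        description="Build input for the Notion concept evaluation task"
    )
    # the output file comes first in the commands quoted above
    parser.add_argument("output", help="output JSONL file, e.g. out.jsonl")
    # the parallel sentence corpus (with term information) comes next
    parser.add_argument(
        "parallel_corpus",
        help="parallel sentence corpus, e.g. notion-parallel-sent.jsonl",
    )
    # one or more machine translation corpora follow the --mt-corpus flag
    parser.add_argument(
        "--mt-corpus",
        action="append",
        dest="mt_corpora",
        help="MT corpus, e.g. notion-sent-translations-madlad.jsonl (repeatable)",
    )
    return parser
```

Under this layout, passing an MT corpus in the parallel-corpus position would produce exactly the kind of missing-column error reported above.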

@laurejt laurejt requested a review from rlskoeser February 25, 2026 16:35

rlskoeser commented Feb 25, 2026

oops, I thought I did give it the parallel sentence file first, but based on the command I cut and pasted, I obviously did not. 🤦‍♀️ (The filenames are very similar and long.)

I'm able to run it now.

Confirmed the translation language with jq; the output only includes "en":

jq '.tr_lang' out.jsonl | uniq
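One caveat: uniq only collapses adjacent duplicates, so sorting first (or jq -r '.tr_lang' out.jsonl | sort -u) is the safer variant. The same check as a Python sketch, assuming out.jsonl is the script output as in this thread:

```python
import json

def translation_langs(path):
    """Collect the distinct tr_lang values from a JSONL file."""
    langs = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            langs.add(json.loads(line)["tr_lang"])
    return langs

# the checklist item passes if translation_langs("out.jsonl") == {"en"}
```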


@rlskoeser rlskoeser left a comment


Was able to run the script and run prodigy with the script output and the custom recipe. Looks good to me. The disclosure element is working well and I don't think it takes up too much space.

You might want to add a task to do a quick check of getting the annotation data out of the database just to make sure it's exportable and structured the way you want.

@laurejt laurejt merged commit bf51907 into develop Feb 25, 2026
1 check passed
@laurejt laurejt deleted the feature/feature/term-annotation branch February 25, 2026 16:53