different p-values compared to other tools #25

Hi, 

I typically run multeval for BLEU and TER but haven't assessed statistical significance so far. Now that I actually need it, I find it difficult (1) to grasp what exactly multeval computes (I checked issue #8, which clarifies it somewhat) and (2) to run it 'correctly'.

1. According to Koehn's paper (https://www.aclweb.org/anthology/W04-3250), I would assume that you repeatedly draw samples, score sys1 and sys2 on each sample w.r.t. the reference, and assess the differences. If in 95% of the cases the scores differ in favour of one of the systems, then the difference is statistically significant. Or am I getting it wrong? Furthermore, I compared multeval to mteval for the same number of samples and shuffles, and the scores are completely different.
2. Maybe this all comes from me not running multeval correctly. I have one reference and the output of two MT systems. As multeval doesn't like it when there is only one variant for system 1 and for the baseline, I use copies: e.g. for system 1 I pass `sys1.test.out` and `sys1.test.out.copy` (the two files are identical). Is this a good way to invoke multeval?
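To make point 1 concrete, here is a minimal sketch of paired bootstrap resampling as I understand it from the paper. It is not multeval's or mteval's implementation: the function names are made up for illustration, and `unigram_precision` is only a toy stand-in for BLEU/TER.

```python
import random
from collections import Counter


def unigram_precision(hyps, refs):
    """Toy corpus-level metric (a stand-in for BLEU): clipped unigram precision."""
    matched = 0
    total = 0
    for hyp, ref in zip(hyps, refs):
        hyp_counts = Counter(hyp.split())
        ref_counts = Counter(ref.split())
        # A hypothesis word is credited at most as often as it occurs in the reference.
        matched += sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0


def paired_bootstrap(sys1, sys2, refs, metric=unigram_precision,
                     n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which sys1 beats sys2."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; the SAME indices are used
        # for both systems, which is what makes the comparison "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score1 = metric([sys1[i] for i in idx], sample_refs)
        score2 = metric([sys2[i] for i in idx], sample_refs)
        if score1 > score2:
            wins += 1
    return wins / n_samples
```

Under my reading, a return value of at least 0.95 would mean that sys1 is better than sys2 at the 95% level. The inputs are assumed to be three lists of sentences of equal length, aligned line by line (sys1 output, sys2 output, reference).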

Thanks.
Cheers,
Dimtiar
