Noisy original documents in Corpus3.csv #16

@st143575


Dear authors,

When checking the Mocheg dataset downloaded here, I noticed that 61351 of the 91822 examples in the Corpus3.csv file contain noisy strings in the Origin Document column, such as

  • Please complete the security check to access Why do I have to complete a CAPTCHA? Completing the CAPTCHA proves you are a human and gives you temporary access to the web property. What can I do to prevent this in the future? If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware. If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.

  • �y����lo��k�q_�Fa*�옱�l����^c��=ƪ��K���GmE�GLe��[���g(��m'�)d�&p�0Ӑ7���@��b �Y-U�7��;���sq�K�(�=C�����>�1X ��fd��,im��.~7,���/NE:�Z�eNJn������9�X'�*�M���O�d+� y�==�y�yxIܵWk+�2���=��=˺@^2�f¬iF��¢ 7�ꃥ��s[���'�̼�!>�wp�7!�il)!.�'�'V�u-�V�H s.�5S�����!窵� ���D��eDyJ��ߣ��5���J�Vy�X��l8 �L�K__�g9z�f�C�<3��܍ީα0�̈́����f��-� K�Ⱦ�O��g�D��w%K����O���عc� ����'o%)�A����p����\\�TI� ���H��E��f$�xũ;c��1�����[�J߀���Cr���w.Ǭ�&'���) � �5��hvf�.r��:=;tU7>��� �^ddѕ endstream endobj startxref 1793055 %%EOF 595 0 obj <>stream 2020-02-10T12:35:11-05:00

Here are the screenshots for the loaded Corpus3.csv file and its noisy examples:

Image Image

Because such examples make up a large proportion of the file (about 67%), they may significantly affect the system's performance. How should these examples be handled in the experiments?
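For reference, here is a minimal sketch of how such rows might be flagged. It is only a heuristic built from the two examples above, not a method from the Mocheg authors: the marker lists, the `MAX_REPLACEMENT_RATIO` threshold, the helper names `is_noisy` and `count_noisy`, and the exact column header string `"Origin Document"` are all assumptions that may need adjusting.

```python
import csv

# Markers taken from the two noisy examples above (assumed, not exhaustive).
CAPTCHA_MARKERS = (
    "please complete the security check",
    "why do i have to complete a captcha",
)
PDF_MARKERS = ("endstream", "endobj", "startxref", "%%EOF")
MAX_REPLACEMENT_RATIO = 0.05  # assumed threshold; tune on a labeled sample


def is_noisy(text: str) -> bool:
    """Return True if text looks like a CAPTCHA page or raw binary/PDF debris."""
    if not text:
        return False  # empty cells are not classified here; handle separately
    lowered = text.lower()
    if any(marker in lowered for marker in CAPTCHA_MARKERS):
        return True
    if any(marker in text for marker in PDF_MARKERS):
        return True
    # U+FFFD is what undecodable bytes turn into when read as text.
    return text.count("\ufffd") / len(text) > MAX_REPLACEMENT_RATIO


def count_noisy(csv_path: str, column: str = "Origin Document") -> tuple[int, int]:
    """Return (noisy_rows, total_rows) for the given column of a CSV file."""
    # Documents can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    noisy = total = 0
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f):
            total += 1
            noisy += is_noisy(row.get(column) or "")
    return noisy, total
```

Calling `count_noisy("Corpus3.csv")` on a local copy would give a rough count to compare against the figures above. Because the check is purely string-based, it may miss other kinds of noise or flag legitimate documents that happen to quote these phrases.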

Thank you very much for your help!
