Description
Dear authors,
While checking the Mocheg dataset downloaded here, I noticed that 61351 of the 91822 examples (roughly 67%) in the Corpus3.csv file contain noisy strings in the Origin Document column, such as:
-
Please complete the security check to access Why do I have to complete a CAPTCHA? Completing the CAPTCHA proves you are a human and gives you temporary access to the web property. What can I do to prevent this in the future? If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware. If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
-
�y����lo��k�q_�Fa*�옱�l����^c��=ƪ��K���GmE�GLe��[���g(��m'�)d�&p�0Ӑ7���@��b �Y-U�7��;���sq�K�(�=C�����>�1X ��fd��,im��.~7,���/NE:�Z�eNJn������9�X'�*�M���O�d+� y�==�y�yxIܵWk+�2���=��=˺@^2�f¬iF��¢ 7�ꃥ��s[���'�̼�!>�wp�7!�il)!.�'�'V�u-�V�H s.�5S�����!窵� ���D��eDyJ��ߣ��5���J�Vy�X��l8 �L�K__�g9z�f�C�<3��܍ީα0�̈́����f��-� K�Ⱦ�O��g�D��w%K����O���عc� ����'o%)�A����p����\\�TI� ���H��E��f$�xũ;c��1�����[�J߀���Cr���w.Ǭ�&'���) � �5��hvf�.r��:=;tU7>��� �^ddѕ endstream endobj startxref 1793055 %%EOF 595 0 obj <>stream 2020-02-10T12:35:11-05:00
Here are the screenshots for the loaded Corpus3.csv file and its noisy examples:
Because such examples make up a large proportion of the corpus, the system's performance might be significantly affected. Could you please advise how these examples should be handled in the experiments?
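For discussion, one possible way to flag the two kinds of noise shown above (CAPTCHA pages and raw PDF bytes) is a simple heuristic filter. This is only a rough sketch under stated assumptions: the marker lists, the 10% threshold, and the function names `is_noisy` and `count_noisy` are hypothetical, and it has not been checked against the count reported above.

```python
import csv

# Hypothetical marker lists, taken from the two examples quoted above;
# extend them after inspecting more rows.
CAPTCHA_MARKERS = ("complete the security check", "complete a captcha")
PDF_MARKERS = ("endstream", "endobj", "startxref", "%%eof")


def is_noisy(text, max_bad_ratio=0.1):
    """Heuristically flag CAPTCHA pages and undecoded PDF content."""
    lowered = text.lower()
    if any(marker in lowered for marker in CAPTCHA_MARKERS):
        return True
    if any(marker in lowered for marker in PDF_MARKERS):
        return True
    if not text:
        return False
    # U+FFFD is the replacement character left behind by failed decoding.
    bad = text.count("\ufffd")
    return bad / len(text) > max_bad_ratio  # threshold is an assumption


def count_noisy(path, column="Origin Document"):
    """Return (noisy_rows, total_rows) for one column of a CSV file."""
    # Documents can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    noisy = total = 0
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f):
            total += 1
            if is_noisy(row.get(column) or ""):
                noisy += 1
    return noisy, total
```

Calling `count_noisy("Corpus3.csv")` would give a rough estimate of how many rows such a filter would drop. Whether to drop these rows or to re-crawl the source pages is the open question here.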
Thank you very much for your help!