Duplicate rows found in the parent dataset #12

@iosonopersia

Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).

I found some duplicate rows (approximately two thousand per Parquet partition file), meaning rows that share the same 'id' and the same 'citations' value. Because of how this project's workflow builds the rows, these duplicates are identical in every column.

These duplicate rows should be removed from the next edition of the dataset.
As a suggestion, these lines of code could be used at some point during the workflow.
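
A minimal sketch of such a deduplication step, assuming the partitions are processed with pandas and that the 'id' and 'citations' columns hold hashable values such as integers and strings. The function name `drop_duplicate_citations` and the sample rows are hypothetical and only illustrate the idea; the project's actual workflow may use a different engine.

```python
import pandas as pd


def drop_duplicate_citations(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first row of each ('id', 'citations') pair."""
    # Assumption: both columns contain hashable values (e.g. ints, strings),
    # which drop_duplicates needs in order to compare rows.
    deduped = df.drop_duplicates(subset=["id", "citations"], keep="first")
    # Renumber the index so it stays contiguous after rows are dropped.
    return deduped.reset_index(drop=True)


# Tiny in-memory frame standing in for one partition (made-up values).
sample = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "citations": [
            "{{cite web|a}}",
            "{{cite web|a}}",  # exact duplicate of the row above
            "{{cite web|b}}",
            "{{cite web|c}}",  # same id, different citation: kept
        ],
    }
)
deduplicated = drop_duplicate_citations(sample)
```

For a real partition, the frame could be loaded with `pd.read_parquet` and written back with `DataFrame.to_parquet` once the duplicates are dropped. Since the rows are reported to be identical in every column, keeping the first occurrence loses no information.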

Labels: bug