Figure out the sources of dataset gap between catalog and beta #5606

@tdlowden

Description

There are 353,000 datasets on Catalog.data.gov and around 220,000 currently showing on Catalog-beta.data.gov. We need to figure out what is causing this discrepancy so we can fix it.

Feature/what we're after

The number of datasets on Catalog-beta needs to be roughly the same as the number we see on Catalog.

Anticipated/hypothesized benefits

  • [benefit A]
  • [benefit B]

Measurements/metrics

  • In our H2.0 harvest record table, how many unique (source_id, identifier) pairs are there where the most recent record has a "success" status? These are records that we have harvested that could be datasets. If we have many more than 353,000 of these, then the problem is not with the raw harvesting.
  • In our H2.0 dataset table, how many rows are there? We know that this number is not as large as it "should" be.
  • For any record candidates that are in the harvest record table but NOT in the dataset table, what features do they share? What harvest sources are they from? What organizations own those harvest sources? When were they harvested? What other patterns can we find?
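The three measurements above could be sketched roughly as follows. This is only an illustration under stated assumptions: the table names `harvest_record` and `dataset`, the `date_created` column, and the idea that a dataset can be matched back to a record on (source_id, identifier) are all hypothetical and may not match the real H2.0 schema. It runs against an in-memory SQLite database with made-up rows so the logic can be checked; the SQL itself is plain enough that it should port to other databases with little change.

```python
import sqlite3

# Hypothetical schema: harvest_record(source_id, identifier, status, date_created)
# and dataset(source_id, identifier). Real H2.0 names may differ.
LATEST_SUCCESS = """
    SELECT DISTINCT r.source_id, r.identifier
    FROM harvest_record AS r
    WHERE r.status = 'success'
      AND r.date_created = (
          SELECT MAX(r2.date_created)
          FROM harvest_record AS r2
          WHERE r2.source_id = r.source_id
            AND r2.identifier = r.identifier
      )
"""


def gap_report(conn):
    """Return (candidate_count, dataset_count, missing_by_source)."""
    # Metric 1: unique (source_id, identifier) pairs whose latest record succeeded.
    candidates = conn.execute(
        f"SELECT COUNT(*) FROM ({LATEST_SUCCESS})"
    ).fetchone()[0]
    # Metric 2: rows in the dataset table.
    datasets = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
    # Metric 3: candidates with no matching dataset row, grouped by harvest source.
    missing = conn.execute(f"""
        SELECT c.source_id, COUNT(*) AS n
        FROM ({LATEST_SUCCESS}) AS c
        LEFT JOIN dataset AS d
          ON d.source_id = c.source_id AND d.identifier = c.identifier
        WHERE d.identifier IS NULL
        GROUP BY c.source_id
        ORDER BY n DESC, c.source_id
    """).fetchall()
    return candidates, datasets, missing


def demo_connection():
    """Build an in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE harvest_record "
        "(source_id TEXT, identifier TEXT, status TEXT, date_created TEXT)"
    )
    conn.execute("CREATE TABLE dataset (source_id TEXT, identifier TEXT)")
    conn.executemany(
        "INSERT INTO harvest_record VALUES (?, ?, ?, ?)",
        [
            ("src-a", "ds-1", "success", "2025-01-01"),
            ("src-a", "ds-1", "error", "2025-02-01"),    # latest is an error
            ("src-a", "ds-2", "success", "2025-01-01"),
            ("src-b", "ds-1", "error", "2025-01-01"),
            ("src-b", "ds-1", "success", "2025-02-01"),  # latest is a success
            ("src-b", "ds-3", "success", "2025-01-05"),
        ],
    )
    conn.execute("INSERT INTO dataset VALUES ('src-a', 'ds-2')")
    conn.commit()
    return conn


if __name__ == "__main__":
    print(gap_report(demo_connection()))
```

Grouping the missing candidates by harvest source is only the first cut. The same left join could be grouped by owning organization or by harvest date to look for the other patterns mentioned above.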

References/background

  • [notes about desirable attributes or context for the epic]
  • [links to previous research, possible routes to take, things that multiple stories will likely need to refer to]

Metadata

  • Assignees: No one assigned
  • Status: Done