Description
There are 353,000 datasets on Catalog.data.gov and around 220,000 showing right now on Catalog-beta.data.gov, a gap of roughly 133,000. We need to determine what is causing this discrepancy so we can fix it.
Feature/what we're after
The number of datasets on Catalog-beta needs to roughly match the number we see on Catalog.
Anticipated/hypothesized benefits
- [benefit A]
- [benefit B]
Measurements/metrics
- In our H2.0 harvest record table, how many unique (source_id, identifier) pairs are there whose most recent record has a "success" status? These are records that we have harvested that could be datasets. If we have many more than 353,000 of these, then the problem is not with the raw harvesting.
- In our H2.0 dataset table, how many rows are there? We know that this number is not as large as it "should" be.
- For any record candidates that are in the harvest record table but NOT in the dataset table, what features do they share? What harvest sources are they from? What organizations own those harvest sources? When were they harvested? What other patterns can we find?
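The three measurements above could be sketched roughly as follows. This is a minimal illustration against an in-memory SQLite database, not the actual H2.0 schema: the table names `harvest_record` and `dataset`, the columns `status` and `date_created`, and the presence of `source_id` and `identifier` on the dataset table are all assumptions made for the example, as are the helper function names and the demo rows.

```python
import sqlite3
from collections import Counter

# Hypothetical schema: the real H2.0 tables and columns may differ.
# A pair qualifies when its newest record (by date_created) is a success.
LATEST_SUCCESS_SQL = """
SELECT DISTINCT r.source_id, r.identifier
FROM harvest_record AS r
WHERE r.status = 'success'
  AND NOT EXISTS (
      SELECT 1
      FROM harvest_record AS newer
      WHERE newer.source_id = r.source_id
        AND newer.identifier = r.identifier
        AND newer.date_created > r.date_created
  )
"""


def latest_success_pairs(conn):
    """Measurement 1: unique (source_id, identifier) pairs whose latest record succeeded."""
    return set(conn.execute(LATEST_SUCCESS_SQL).fetchall())


def dataset_row_count(conn):
    """Measurement 2: number of rows in the dataset table."""
    return conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]


def missing_by_source(conn):
    """Measurement 3: candidates absent from the dataset table, counted per source."""
    in_dataset = set(conn.execute("SELECT source_id, identifier FROM dataset").fetchall())
    missing = latest_success_pairs(conn) - in_dataset
    return Counter(source_id for source_id, _ in missing)


def build_demo_db():
    """Create a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE harvest_record "
        "(source_id TEXT, identifier TEXT, status TEXT, date_created TEXT)"
    )
    conn.execute("CREATE TABLE dataset (source_id TEXT, identifier TEXT)")
    conn.executemany(
        "INSERT INTO harvest_record VALUES (?, ?, ?, ?)",
        [
            ("src-a", "ds-1", "success", "2025-01-01"),
            ("src-a", "ds-1", "error", "2025-02-01"),  # latest is an error
            ("src-a", "ds-2", "success", "2025-01-01"),
            ("src-b", "ds-3", "error", "2025-01-01"),
            ("src-b", "ds-3", "success", "2025-03-01"),  # latest is a success
            ("src-b", "ds-4", "success", "2025-01-05"),
        ],
    )
    conn.execute("INSERT INTO dataset VALUES ('src-a', 'ds-2')")
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print("candidate pairs:", len(latest_success_pairs(conn)))
    print("dataset rows:", dataset_row_count(conn))
    print("missing per source:", dict(missing_by_source(conn)))
```

Grouping the missing candidates by source is only a first cut. The same set difference could be joined to harvest source and organization tables, or bucketed by harvest date, to look for the other patterns mentioned above.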
References/background
- [notes about desirable attributes or context for the epic]
- [links to previous research, possible routes to take, things that multiple stories will likely need to refer to]