Figure out the sources of the dataset gap between catalog and beta #5606

There are 353,000 datasets on Catalog.data.gov and around 220,000 currently showing on Catalog-beta.data.gov, a gap of roughly 133,000. We need to find out what is causing this discrepancy so we can fix it.

## Feature/what we're after

The number of datasets on Catalog-beta needs to roughly match the number we see on Catalog.

## Anticipated/hypothesized benefits

- [benefit A]
- [benefit B]

## Measurements/metrics

- In our H2.0 harvest record table, how many unique (source_id, identifier) pairs have a most recent record with a "success" status? These are harvested records that could be datasets. If we have many more than 353,000 of these, then the problem is not with the raw harvesting.
- In our H2.0 dataset table, how many rows are there? We know that this number is not as large as it "should" be.
- For any record candidates that are in the harvest record table but NOT in the dataset table, what features do they share? What harvest sources are they from? What organizations own those harvest sources? When were they harvested? What other patterns can we find?
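
The three measurements above could be sketched roughly as follows. This is only an illustration against an in-memory SQLite database, not the real H2.0 schema: the table names `harvest_record` and `dataset`, the `date_created` and `status` columns, and the assumption that datasets can be matched back to records on (source_id, identifier) are all hypothetical and would need to be swapped for whatever the actual tables use.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for this sketch,
# not the real H2.0 harvest database layout.
SCHEMA = """
CREATE TABLE harvest_record (
    id INTEGER PRIMARY KEY,
    source_id TEXT NOT NULL,
    identifier TEXT NOT NULL,
    status TEXT NOT NULL,
    date_created TEXT NOT NULL
);
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    source_id TEXT NOT NULL,
    identifier TEXT NOT NULL
);
"""

# One row per (source_id, identifier) whose most recent record is a success.
# Ties on date_created are broken by the higher id.
LATEST_SUCCESS = """
SELECT r.source_id, r.identifier
FROM harvest_record r
WHERE r.status = 'success'
  AND NOT EXISTS (
      SELECT 1 FROM harvest_record newer
      WHERE newer.source_id = r.source_id
        AND newer.identifier = r.identifier
        AND (newer.date_created > r.date_created
             OR (newer.date_created = r.date_created AND newer.id > r.id))
  )
"""


def gap_report(conn):
    """Return candidate count, dataset count, and missing candidates per source."""
    candidates = conn.execute(
        f"SELECT COUNT(*) FROM ({LATEST_SUCCESS})"
    ).fetchone()[0]
    datasets = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
    # Candidates with no matching dataset row, grouped by harvest source.
    missing_by_source = conn.execute(
        f"""
        SELECT c.source_id, COUNT(*) AS missing
        FROM ({LATEST_SUCCESS}) c
        LEFT JOIN dataset d
          ON d.source_id = c.source_id AND d.identifier = c.identifier
        WHERE d.id IS NULL
        GROUP BY c.source_id
        ORDER BY missing DESC, c.source_id
        """
    ).fetchall()
    return {
        "candidates": candidates,
        "datasets": datasets,
        "missing_by_source": missing_by_source,
    }


def build_demo():
    """Create a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO harvest_record VALUES (?, ?, ?, ?, ?)",
        [
            (1, "src-a", "ds-1", "success", "2025-01-01"),
            (2, "src-a", "ds-1", "error", "2025-02-01"),  # latest is an error
            (3, "src-a", "ds-2", "success", "2025-01-01"),
            (4, "src-b", "ds-3", "error", "2025-01-01"),
            (5, "src-b", "ds-3", "success", "2025-02-01"),  # latest is a success
            (6, "src-b", "ds-4", "success", "2025-01-01"),
        ],
    )
    conn.execute("INSERT INTO dataset VALUES (1, 'src-a', 'ds-2')")
    return conn


if __name__ == "__main__":
    print(gap_report(build_demo()))
```

Grouping the missing candidates by source is just a first cut. The same left join could be grouped by owning organization or harvest date to look for the other patterns mentioned above, assuming those columns are available.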

## References/background

- [notes about desirable attributes or context for the epic]
- [links to previous research, possible routes to take, things that multiple stories will likely need to refer to]
