Merged
Conversation
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited. That said, there are no regressions in our acceptance tests, and quite a few improvements.
3eadc2e to
0040864
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
this PR is branched off of #1606, since we'll likely merge that one first. comparison for only this PR is here
Background
In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the
localityandlocaladminlayer have themselves as parents in that layer.Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names.
Hierarchy checks
This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the lower record.
So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin.
Geonames localadmins
Geonames seems to have a penchant for having both a
localityand alocaladminrecord for a given city, even when the local administrative divisions don't really support such nuance.These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results.
Deduplication
Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records.
Generally, there will be one of two scenarios:
Concordances (from #1606) generally don't help either, since there often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin.
Adding a hierarchy exception
This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible.
Impact
Of the 314 cities in our
top_us_citiesfuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint.With this PR, there are only 15 cases.
Potential regressions
Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a
county, so it feels like the chance for error islimited.
That said, there are no regressions in our acceptance tests, and quite a few improvements.