Deduplicate Geonames 'City of' prefixes #1609
Closed
orangejulius wants to merge 1 commit into master
Member
FYI there is some similar logic and IIRC tests too here
Member
Author
Ah, very nice. That logic is quite a bit simpler, so I'll bring it into this PR. I think it's OK to deduplicate across all of those differences in name: things like county and locality will (generally) not be deduped because they have different layers (unless they hit one of the exceptions, like one being a parent of the other).
5619a12 to 9eb9c98
A common cause of missed deduplication is Geonames locality/localadmin records that start with 'City of'.
Our name comparison logic is fairly conservative: it only ignores differences in things like punctuation and diacritics. Otherwise, we have to treat differing names as a sign that the underlying records represent genuinely different places.
Getting too far away from this general stance could be dangerous, but we can handle specific outliers just fine.
Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames `locality` record with just the name (like 'New York'), and then a Geonames `localadmin` record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606.
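To make the idea concrete, here is a minimal sketch in Python. It is not the code from this PR: the record shape (a dict with `name`, `source`, and `layer` keys) and the helper names `normalize_name`, `comparable_name`, and `names_match` are assumptions made up for illustration. It keeps the comparison conservative (only case, punctuation, and diacritics are ignored) and strips the 'City of' prefix only for Geonames `locality`/`localadmin` records.

```python
import unicodedata

# Hypothetical constants for this sketch, not taken from the PR.
PREFIX = "city of "
PREFIX_LAYERS = {"locality", "localadmin"}


def normalize_name(name: str) -> str:
    """Conservative normalization: ignore case, punctuation, and diacritics."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that 'Québec' compares equal to 'Quebec'.
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Turn punctuation into spaces, then collapse runs of whitespace.
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in without_marks)
    return " ".join(kept.lower().split())


def comparable_name(record: dict) -> str:
    """Name used for comparison, with the narrow 'City of' exception applied."""
    name = normalize_name(record["name"])
    if (
        record["source"] == "geonames"
        and record["layer"] in PREFIX_LAYERS
        and name.startswith(PREFIX)
    ):
        name = name[len(PREFIX):]
    return name


def names_match(a: dict, b: dict) -> bool:
    """True when two records' names are equal after normalization."""
    return comparable_name(a) == comparable_name(b)


# Example: the two Geonames records described above now compare as equal.
locality = {"name": "New York", "source": "geonames", "layer": "locality"}
localadmin = {"name": "City of New York", "source": "geonames", "layer": "localadmin"}
print(names_match(locality, localadmin))
```

Scoping the exception to one source and two layers keeps the general rule intact: any other difference in name still counts as a different place.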
9eb9c98 to 3f72bb7
Member
Author
Closing in favor of #1371
A common cause of missed deduplication is Geonames locality/localadmin records that start with 'City of'.
Our name comparison logic is fairly conservative: it only ignores differences in things like punctuation and diacritics. Otherwise, we have to treat differing names as a sign that the underlying records represent genuinely different places.
Getting too far away from this general stance could be dangerous, but we can handle specific exceptions just fine.
Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames `locality` record with just the name (like 'New York'), and then a Geonames `localadmin` record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606.