Problem
(All examples from EPSA 2019 names.)
Variations in family names:
Ahmed Mohamed
Ahmed Mohammed
Shortened first names:
Ambiguities:
Aina Gallego
Anna Gallego
Alina Vrânceanu
Alina Vranceau
Middle initials:
Andreas C. Goldberg
Andreas Goldberg
Accents and special characters:
More complex case:
"Ferran Martinez i Coma"
"Ferran M i Coma"
… all of which cause some authors to appear under multiple names (and thus to receive multiple participant UIDs).
Partial solution
Use stringdist to detect simple cases:
https://cran.r-project.org/web/packages/stringdist/vignettes/RJournal_6_111-122-2014.pdf
library(dplyr)
library(purrr)
library(stringdist)

# unique full names (d is the assembled participants data)
n <- unique(d$full_name)

# Levenshtein distance between each name and every other name
lv <- tibble::tibble(
  x = n,
  y = map(x, ~ n[ !n %in% .x ]),
  lv = map(x, ~ stringdist(.x, n[ !n %in% .x ], method = "lv"))
) %>%
  tidyr::unnest(c(y, lv))

# distance < 3: finds many true positives
filter(lv, lv < 3) %>%
  arrange(x, lv)

# distance == 3: finds "Alex Smith" and "Alex C. Smith"
filter(lv, lv == 3) %>%
  arrange(x, lv)

# distance == 4: finds mostly false positives
filter(lv, lv == 4) %>%
  arrange(x, lv)
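One way to shrink the distances for the accent and middle-initial cases might be to normalize names before comparing them. The sketch below is only an illustration: `normalize_name()` is a hypothetical helper, not something in this repo, and it assumes the stringi package is installed (for `stri_trans_general()` with the "Latin-ASCII" transliterator).

```r
library(stringdist)
library(stringi)

# Hypothetical helper (not part of the repo): strip accents and
# drop middle initials such as "C." before computing distances.
normalize_name <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")            # e.g. "â" becomes "a"
  x <- gsub("\\s+[A-Z]\\.(?=\\s)", "", x, perl = TRUE)  # drop " C." when followed by a space
  trimws(gsub("\\s+", " ", x))                          # collapse repeated spaces
}

# Name pairs taken from the examples above
raw <- c("Andreas C. Goldberg", "Alina Vr\u00e2nceanu")
ref <- c("Andreas Goldberg", "Alina Vranceau")

# Levenshtein distances before and after normalization
dist_raw  <- stringdist(raw, ref, method = "lv")
dist_norm <- stringdist(normalize_name(raw), ref, method = "lv")
```

Normalization of this kind does not help with the ambiguous cases ("Aina Gallego" vs. "Anna Gallego"), which stay one edit apart and would still need a manual check.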
Question to self
Fix names here, once all names are assembled, or in the source 2019, 2020, 2021 repos?
Cleanup in this repo makes more sense because it allows all names to be treated at once, which avoids fixing recurring participants several times.
If cleanup happens here, UIDs need to be regenerated here, after applying the fixes (not a huge hassle).
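A minimal sketch of what "apply the fixes, then regenerate UIDs" could look like. Everything here is an assumption for illustration: the toy `d`, the `fixes` lookup vector, and the integer `uid` scheme are hypothetical, and the actual UID generation used in this repo may differ.

```r
library(dplyr)

# Hypothetical assembled data: one row per (year, participant)
d <- tibble::tibble(
  year      = c(2019, 2020, 2021),
  full_name = c("Andreas C. Goldberg", "Andreas Goldberg", "Aina Gallego")
)

# Hypothetical manual corrections: name variant -> canonical name
fixes <- c("Andreas C. Goldberg" = "Andreas Goldberg")

d <- d %>%
  mutate(
    # replace known variants, keep every other name unchanged
    full_name = coalesce(unname(fixes[full_name]), full_name),
    # assumed UID scheme: one integer per distinct cleaned name
    uid = match(full_name, unique(full_name))
  )
```

With the corrections kept in a single lookup, a recurring participant is fixed once and gets the same UID in every year.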