Automate non-UTF-8 character conversion #90

@lawalter

Description

We receive raw data files that contain bytes that are not valid UTF-8. R cannot interpret these bytes as text, so they must be fixed manually before we contact HIP registrants. For example, to fix the city value "CA\xd1ON CITY" in a file from Colorado, we replace \xd1 with N to get the human-readable value "CANON CITY". This means opening the raw file, making the change, saving the file, and re-running read_hip().
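
As a starting point, the affected values could be located after reading the data instead of by inspecting the raw file. The sketch below uses base R's `validUTF8()`; the `city` vector is hypothetical example data, with only "CA\xd1ON CITY" taken from this issue.

```r
# Hypothetical example data; the second value contains the raw byte 0xD1,
# which is not a valid UTF-8 sequence on its own.
city <- c("DENVER", "CA\xd1ON CITY", "PUEBLO")

# validUTF8() is base R (>= 3.3.0). It returns FALSE for elements that
# contain invalid UTF-8 byte sequences.
bad <- !validUTF8(city)

# Positions of the values that need attention.
which(bad)
```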

It is not always obvious what a non-UTF-8 character should be changed to. First names, last names, and street names are particularly variable. To make sure each change is correct, every hexadecimal escape sequence (such as \xd1) must be checked and replaced manually, which is time-consuming.
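
One way to reduce the manual checking might be to tabulate which byte values actually occur, so that each distinct byte is reviewed once rather than once per record. This is a sketch only: it assumes `iconv()` with `sub = "byte"` rewrites each invalid byte as a `<xx>` marker on the platform in use, and the input values other than "CA\xd1ON CITY" are hypothetical.

```r
# Hypothetical example values; "MU\xd1OZ" is invented for illustration.
x <- c("CA\xd1ON CITY", "DENVER", "MU\xd1OZ")

# Keep only the values that are not valid UTF-8.
invalid <- x[!validUTF8(x)]

# sub = "byte" replaces each non-convertible byte with its hex code in the
# form <xx>, e.g. "CA<d1>ON CITY".
marked <- iconv(invalid, from = "UTF-8", to = "UTF-8", sub = "byte")

# Pull out every <xx> marker and count how often each byte value occurs.
markers <- unlist(regmatches(marked, gregexpr("<[0-9a-f]{2}>", marked)))
byte_counts <- table(markers)
byte_counts
```

Because only the invalid values are passed to `iconv()`, literal text such as `<ab>` in otherwise valid records is not counted.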

Create a function that can automatically replace non-UTF-8 characters after reading in raw data, so that:

  • Raw data files do not need to be manually edited
  • Non-UTF-8 character conversion is automated and fully reproducible
  • A full list of non-UTF-8 character replacements can be reviewed
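
A minimal sketch of what such a function could look like is below. The name `fix_non_utf8()`, its arguments, and the shape of the returned log are all hypothetical and not part of the existing package; only the \xd1 to N replacement comes from this issue. It relies on the same assumption about `iconv(sub = "byte")` producing `<xx>` markers.

```r
# Hypothetical lookup table: byte marker -> replacement text.
# Only the <d1> -> "N" entry is taken from the issue text.
replacements <- c("<d1>" = "N")

# Hypothetical helper: replaces invalid bytes using `lookup` and returns both
# the cleaned values and a log of every change for review.
fix_non_utf8 <- function(x, lookup) {
  bad <- which(!validUTF8(x))

  # Rewrite each invalid byte as a <xx> marker so it can be matched as text.
  marked <- iconv(x[bad], from = "UTF-8", to = "UTF-8", sub = "byte")

  # Apply each replacement from the lookup table literally (no regex).
  fixed <- marked
  for (marker in names(lookup)) {
    fixed <- gsub(marker, lookup[[marker]], fixed, fixed = TRUE)
  }
  x[bad] <- fixed

  # One row per changed value; `resolved` is FALSE if a marker with no
  # lookup entry is still present and needs a manual decision.
  log <- data.frame(
    index = bad,
    marked = marked,
    fixed = fixed,
    resolved = !grepl("<[0-9a-f]{2}>", fixed),
    stringsAsFactors = FALSE
  )
  list(values = x, log = log)
}

res <- fix_non_utf8(c("DENVER", "CA\xd1ON CITY"), replacements)
res$values
res$log
```

Keeping the lookup table in code (or in a versioned file) leaves the raw data untouched and makes the conversion reproducible, while the log provides the reviewable list of replacements. If the source files were known to be Latin-1 or Windows-1252, `iconv(x, from = "latin1", to = "UTF-8")` would instead convert \xd1 to Ñ; that is a different choice from the N replacement described above.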

Labels

enhancement: New feature or request
workflow: Improvement to processing speed or methodology
