@gorcha gorcha commented Dec 23, 2025

From tidyverse/haven#788: the Ragel parser in the new MR set changes only accepts ASCII characters in the set name, but SPSS writes these names in the file's code page or UTF-8.

This PR updates the parser to allow for non-ASCII characters, and runs text from the MR set through readstat_convert() to make sure the character encoding comes in correctly.

}

-nc = (alnum | '_' | '.' ); # name character (including dots)
+nc = ([^ =]); # name character (all characters except space and equals)
Contributor

This seems excessively lax?

Contributor Author

My thinking was that we're only using this for pattern matching, to extract the components of the field, rather than actually validating the contents. So it doesn't matter too much whether it lines up with the characters that are actually allowed.

The attempt to peg it to the actual allowed characters in the current version is what's causing the error in tidyverse/haven#788, and it would be a massive pain to build a character class that matches all potentially valid characters from UTF-8 and other encodings. I can tighten it up a bit if there are particular things you're worried about.
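
To make the difference concrete, here is a rough sketch in plain C (not the Ragel source, and not ReadStat code) that models the two `nc` classes as byte predicates. The function names `is_name_char_strict`, `is_name_char_lax`, and `scan_name` are made up for illustration, and the sample input is a hypothetical UTF-8 name, not a real MR set record.

```c
#include <assert.h>
#include <stddef.h>

/* Models the old class: ASCII letters and digits, plus '_' and '.'.
 * Bytes >= 0x80 (e.g. UTF-8 continuation bytes) are rejected. */
static int is_name_char_strict(unsigned char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '_' || c == '.';
}

/* Models the new class: any byte except space and '='. */
static int is_name_char_lax(unsigned char c) {
    return c != ' ' && c != '=';
}

/* Count the leading bytes of s that the given predicate accepts.
 * This only locates the end of the name; it does not validate it. */
static size_t scan_name(const char *s, int (*is_name_char)(unsigned char)) {
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0' && is_name_char((unsigned char)s[n])) {
        n++;
    }
    return n;
}
```

With the UTF-8 bytes of `café=x` as input (`"caf\xc3\xa9=x"`), the strict predicate stops after 3 bytes, in the middle of the name, while the lax predicate consumes all 5 name bytes and stops at the `=` delimiter. That is the extraction-versus-validation trade-off described above: the lax class relies on the delimiters rather than on knowing which characters are legal in every encoding.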
