Add option to serialize UTF-8 strings directly #17

Open
michal42 wants to merge 1 commit into maekitalo:master from michal42:serializer-utf8
Conversation

@michal42

The JSON serializer assumes that std::string and char* values are
encoded as Latin1, while cxxtools::String values are Unicode strings
(using characters from the BMP). This is impractical when serializing
UTF-8 data. Instead of converting from UTF-8 to cxxtools::String and then
using the \uXXXX notation, add a flag inputUtf8 to tell the serializer
that std::string and char* values are UTF-8 encoded and can be inserted
into the UTF-8 JSON verbatim.

michal42 added a commit to michal42/cxxtools that referenced this pull request Dec 12, 2017
This is a backport of maekitalo#17. The
UTF-8 codec in jsonserializer.cpp is disabled, to mimic the behavior of
9669f65 ("Optimize json serializer").
jimklimov pushed a commit to 42ity/cxxtools that referenced this pull request Dec 12, 2017
@maekitalo

Owner

The JSON serializer does not assume anything about std::string. A std::string is just a sequence of byte values, and it is not really suitable for holding UTF-8 encoded data. You should really use cxxtools::String if you want to store anything other than ASCII. I also prefer not to handle UTF-8 differently from other encodings. We have a codec, which handles it better. You can convert your UTF-8 data to cxxtools::String in the serialization operator of your class if you want. I know this is not the most performant way to do it, but it is much cleaner.

@michal42

Author

It actually does assume that the input is Latin1. Here, at https://github.com/maekitalo/cxxtools/blob/master/src/jsonformatter.cpp#L402, it encodes bytes 0x80-0xff in the UTF-16 \uXXXX notation, which obviously only works for code points up to U+00FF, which is Latin1. And there is nothing wrong with using std::string to store a UTF-8 string; it's definitely more sensible than converting to wide strings and back...
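
To illustrate the point about per-byte escaping, here is a minimal sketch. It is not the cxxtools code at the linked line: `escapeBytewise` is a hypothetical helper that only assumes the behaviour described above (each byte from 0x80 to 0xff becomes one \u00XX escape), and it ignores quotes and control characters.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not the cxxtools implementation.
// Every byte >= 0x80 is written as \u00XX, i.e. each byte is treated as
// one Latin1 code point. ASCII bytes are copied unchanged.
std::string escapeBytewise(const std::string& in)
{
    std::string out;
    for (unsigned char c : in)
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

With Latin1 input, the single byte 0xE9 becomes \u00e9, which is é. With the UTF-8 encoding of the same character (0xC3 0xA9), the result is \u00c3\u00a9, which a JSON parser reads as the two characters Ã©.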

@maekitalo

Owner

OK, if you like, it assumes Latin1. It actually just assumes that every char in the string is a character with a value, and it encodes every character separately. That's how std::string is defined.
Storing UTF-8 in a std::string is not really a good idea. If you access the nth element of the std::string, you cannot be sure that it is really the nth character. You can't even be sure that it is a complete character at all. Many methods of std::string are useless if it contains UTF-8 data. For good reason, std::string is actually a std::basic_string<char>. Of course it is technically possible and sometimes convenient, but it is not really good style. Even cxxtools::Utf8Codec::encode/decode store UTF-8 data in a std::string; that is a somewhat pragmatic solution.
I really suggest using cxxtools::String or the new string classes of C++11 to store Unicode data.
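
The indexing concern can be shown with a small sketch that uses only the standard library. `countCodePoints` is a hypothetical helper, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold
// well-formed UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
    {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the word "naïve" stored as UTF-8 (`"na\xC3\xAF" "ve"`), `size()` returns 6 while `countCodePoints` returns 5, and element 2 is the lead byte 0xC3 rather than a complete character.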

@maekitalo

Owner

This discussion was a long time ago, but I'm just looking through old stuff.
You can use the static method cxxtools::Utf8Codec::decode to convert a std::string that holds UTF-8 encoded data to a Unicode cxxtools::String, so that it is formatted correctly, and cxxtools::Utf8Codec::encode to convert it back from a Unicode cxxtools::String. It is quite convenient and does not pollute the JSON serializer with a specific codec. In the same way you can use e.g. cxxtools::Iso8859_15Codec::encode/decode if you have a different encoding, or Win1252Codec.
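
The decode-first route can be sketched without cxxtools as follows. This is only a stand-in under stated assumptions: `decodeUtf8` and `escapeCodePoints` are hypothetical names, `std::u32string` takes the place of cxxtools::String, the input is assumed to be well-formed UTF-8, and only BMP characters are escaped correctly (no surrogate pairs). In real code the decode step would be the cxxtools::Utf8Codec::decode call mentioned above.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the decode step. Assumes well-formed UTF-8
// and performs no validation.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (c < 0x80)      { cp = c;        extra = 0; }
        else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }
        else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }
        else               { cp = c & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out += cp;
    }
    return out;
}

// Hypothetical stand-in for the escaping step: one \uXXXX escape per
// decoded code point (correct for BMP characters only).
std::string escapeCodePoints(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in)
    {
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else
        {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Here `escapeCodePoints(decodeUtf8("\xC3\xA9"))` gives \u00e9 and the euro sign (`"\xE2\x82\xAC"`) gives \u20ac, because the escape is produced per decoded character rather than per byte.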
