Add option to serialize UTF-8 strings directly #17
The JSON serializer assumes that std::string and char* values are encoded as Latin1, while cxxtools::String values are Unicode strings (using characters from the BMP). This is impractical when serializing UTF-8 data. Instead of converting from UTF-8 to cxxtools::String and then using the \uXXXX notation, add a flag inputUtf8 that tells the serializer that std::string and char* values are UTF-8 encoded and can be inserted into the UTF-8 JSON verbatim.
This is a backport of maekitalo#17. The UTF-8 codec in jsonserializer.cpp is disabled, to mimic the behavior of 9669f65 ("Optimize json serializer").
The JSON serializer does not assume anything about std::string. A std::string is just a string of byte values, and it is not really suitable for holding UTF-8 encoded data. You should really use cxxtools::String if you want to store anything other than ASCII. I also prefer not to handle UTF-8 differently from other encodings. We have a codec, which handles it better. You can convert your UTF-8 data to cxxtools::String in the serialization operator of your class if you want. I know this is not the most performant way to do it, but it is much cleaner.
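To make the decode-then-escape route described above concrete, here is a minimal stand-alone sketch. It does not use cxxtools at all: `jsonEscapeDecoded` and `appendUnit` are hypothetical names, the decoder is hand-written and assumes well-formed UTF-8, and the surrogate-pair branch for code points above the BMP is an addition that goes beyond what this thread discusses.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: append one UTF-16 code unit in JSON \uXXXX notation.
static void appendUnit(std::string& out, unsigned long unit)
{
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\u%04lx", unit);
    out += buf;
}

// Hypothetical helper, not cxxtools code: decode well-formed UTF-8 into
// code points and write every non-ASCII code point as \uXXXX.
// Malformed input is not detected.
std::string jsonEscapeDecoded(const std::string& utf8)
{
    std::string out = "\"";
    for (std::size_t i = 0; i < utf8.size(); )
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // The lead byte determines the sequence length and its payload bits.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
        unsigned long cp = len == 1 ? lead : lead & (0xff >> (len + 1));
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3f);
        i += len;

        if (cp == '"' || cp == '\\')
        {
            out += '\\';
            out += static_cast<char>(cp);
        }
        else if (cp < 0x20 || (cp >= 0x80 && cp < 0x10000))
        {
            appendUnit(out, cp);   // control character or BMP code point
        }
        else if (cp >= 0x10000)
        {
            cp -= 0x10000;         // above the BMP: emit a surrogate pair
            appendUnit(out, 0xd800 + (cp >> 10));
            appendUnit(out, 0xdc00 + (cp & 0x3ff));
        }
        else
        {
            out += static_cast<char>(cp);   // printable ASCII
        }
    }
    out += '"';
    return out;
}
```

With this sketch, the UTF-8 bytes `C3 A9` (é) come out as `\u00e9`, so the output is pure ASCII at the price of one decode pass and a longer string.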
It actually does assume that the input is Latin1. Here, https://github.com/maekitalo/cxxtools/blob/master/src/jsonformatter.cpp#L402, it encodes bytes 0x80 - 0xff in the UTF-16 \uXXXX notation, which obviously only works for code points up to U+00FF, which is Latin1. And there is nothing wrong with using std::string to store UTF-8 strings; it's definitely more sensible than converting to wide strings and back...
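The difference between the two interpretations can be shown with a small stand-alone sketch. This is not the cxxtools formatter: `jsonEscape` is a hypothetical function, and its `inputUtf8` parameter only borrows the name of the flag proposed in this pull request to switch between per-byte Latin1 escaping and verbatim pass-through.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not the cxxtools implementation: quote a byte string
// for JSON. Bytes >= 0x80 are either treated as Latin1 code points and
// written as \u00XX, or copied unchanged when the input is declared UTF-8.
std::string jsonEscape(const std::string& in, bool inputUtf8)
{
    std::string out = "\"";
    for (unsigned char ch : in)
    {
        if (ch == '"' || ch == '\\')
        {
            out += '\\';
            out += static_cast<char>(ch);
        }
        else if (ch < 0x20 || (ch >= 0x80 && !inputUtf8))
        {
            // Control characters always need escaping; high bytes only
            // when each byte is interpreted as one Latin1 character.
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(ch));
            out += buf;
        }
        else
        {
            out += static_cast<char>(ch);   // ASCII, or UTF-8 passed through
        }
    }
    out += '"';
    return out;
}
```

In this sketch, the Latin1 byte `E9` (é) becomes `\u00e9` as intended, but the UTF-8 encoding of the same character, `C3 A9`, becomes `\u00c3\u00a9` when `inputUtf8` is false, which a JSON parser reads back as the two characters Ã©. With `inputUtf8` set to true the two bytes are copied into the output unchanged.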
OK, if you want, it assumes Latin1. It actually just assumes that every character in the string is a character with a value. It encodes every character separately. That's how std::string is defined.
This discussion was a long time ago, but I'm just looking into old stuff.