ServiceStack's JsonSerializer does not seem to encode control characters correctly.
For example, this C# expression....
JsonSerializer.SerializeToString(new { Text = "\u0010" })
... evaluates to this...
{"Text":"?"}
... where the "?" is the literal control character.
Instead, according to http://www.json.org it should evaluate to this:
{"Text":"\u0010"}
Is this a known bug or am I missing something?
The bad JSON output by my services is causing errors during deserialization by my service consumers.
You need to tell the serializer to escape unicode characters.
JsConfig.EscapeUnicode = true;
JsonSerializer.SerializeToString(new{Text = "\u0010"});
The above evaluates to this:
{"Text":"\u0010"}
Thanks Mike, that works. But I think this approach escapes ALL non-ASCII Unicode characters in addition to control characters.
I'm expecting to have a lot of foreign language characters in my data (Arabic, for example) so this will cause significant size bloat versus just including those unescaped unicode characters in the JSON (which is still standard-compliant).
I imagine the purpose of EscapeUnicode = true is to produce JSON that can be stored or transmitted with simple ASCII encoding, which is certainly useful. And it apparently also encodes ASCII control characters as a side-effect which does solve my problem.
But in my opinion, JsonSerializer should escape control characters regardless of the EscapeUnicode setting since the standard requires it. I consider this a bug.
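For comparison, here is a small sketch in JavaScript (not ServiceStack) of the behavior described above. It assumes Node's built-in JSON.stringify, which escapes control characters but leaves other non-ASCII text such as Arabic unescaped. The property names are made up for illustration.

```javascript
// Illustrative only: JavaScript's JSON.stringify, not ServiceStack's JsonSerializer.
// Control characters (U+0000 to U+001F) are escaped; other non-ASCII text is left as-is.
const payload = { Text: "\u0010", Name: "مرحبا" };
const json = JSON.stringify(payload);

console.log(json);
// The control character appears as the six-character escape \u0010,
// while the Arabic text is emitted unescaped.
```

The output stays compact for foreign-language data and is still valid JSON, which is the combination the question is asking for.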
Since this is primarily a problem for me within my Service Stack services I also found this solution:
SetConfig(new EndpointHostConfig
{
UseBclJsonSerializers = true
});
This tells Service Stack to use .NET's built-in DataContractJsonSerializer instead of Service Stack's JsonSerializer. I have verified that DataContractJsonSerializer does escape control characters correctly.
So it appears that I need to choose between JsonSerializer with EscapeUnicode = true (faster but with bloated output) and DataContractJsonSerializer (slower but with compact Unicode output).
I was playing with algorithms using Dart and as I actually followed TDD, I realized that my code has some limitations.
I was trying to reverse strings as part of an interview problem, but I couldn't get the surrogate pairs correctly reversed.
const simple = 'abc';
const emoji = '๐๐๐';
const surrogate = '👮🏽‍♀️👩🏿‍💻';
String rev(String s) {
return String.fromCharCodes(s.runes.toList().reversed);
}
void main() {
print(simple);
print(rev(simple));
print(emoji);
print(rev(emoji));
print(surrogate);
print(rev(surrogate));
}
The output:
abc
cba
๐๐๐
๐๐๐
👮🏽‍♀️👩🏿‍💻
💻‍🏿👩️♀‍🏽👮
You can see that the simple emojis are correctly reversed as I'm using the runes instead of just simply executing s.split('').toList().reversed.join(''); but the surrogate pairs are reversed incorrectly.
How can I reverse a string that might contain surrogate pairs using the Dart programming language?
When reversing strings, you must operate on grapheme clusters, not code points or code units. Use grapheme_splitter.
Dart 2.7 introduced a new package that supports grapheme cluster-aware operations. The package is called characters. characters is a package for characters represented as Unicode extended grapheme clusters.
Dart's standard String class uses the UTF-16 encoding. This is a common choice in programming languages, especially those that offer support for running both natively on devices, and on the web.
UTF-16 strings usually work well, and the encoding is transparent to the developer. However, when manipulating strings, and especially when manipulating strings entered by users, you may experience a difference between what the user perceives as a character, and what is encoded as a code unit in UTF-16.
Source: "Announcing Dart 2.7: A safer, more expressive Dart" by Michael Thomsen, section "Safe substring handling"
The package will also help you reverse strings containing emoji the way a user would expect.
Using plain Strings, you run into issues:
String hi = 'Hi 🇩🇰';
print('String.length: ${hi.length}');
// Prints 7; would expect 4
With characters:
import 'package:characters/characters.dart';
String hi = 'Hi 🇩🇰';
print(hi.characters.length);
// Prints 4
print(hi.characters.last);
// Prints 🇩🇰
It's worth taking a look at the source code of the characters package. It's far from simple, but it looks easier to digest and better documented than grapheme_splitter. The characters package is also maintained by the Dart team.
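The same idea, reversing by grapheme cluster instead of by code unit or code point, can be sketched outside Dart as well. The snippet below is a JavaScript analogue, not the Dart characters API. It assumes a runtime with Intl.Segmenter (for example Node 16 or later), and reverseGraphemes is a hypothetical helper name.

```javascript
// Hypothetical helper: reverse a string one grapheme cluster at a time.
// Assumes Intl.Segmenter is available (e.g. Node 16+ with its default ICU data).
function reverseGraphemes(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // Each item yielded by segment() has a .segment string holding one grapheme cluster.
  const clusters = Array.from(segmenter.segment(s), (part) => part.segment);
  return clusters.reverse().join("");
}

const flag = "\u{1F1E9}\u{1F1F0}"; // two regional indicators forming one flag
console.log(reverseGraphemes("abc"));
console.log(reverseGraphemes("Hi " + flag));
```

Because the flag is treated as a single cluster, its two code points stay in order and only its position in the string changes.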
JSON is not a subset of JavaScript. I need my output to be 100% valid JavaScript; it will be evaluated as such -- i.e., JSON.stringify will not (always) work for my needs.
Is there a JavaScript stringifier for Node?
As a bonus, it would be nice if it could stringify objects.
You can use JSON.stringify and afterwards replace the remaining U+2028 and U+2029 characters. As the article linked states, the characters can only occur in the strings, so we can safely replace them by their escaped versions without worrying about replacing characters where we should not be replacing them:
JSON.stringify('ro\u2028cks').replace(/\u2028/g,'\\u2028').replace(/\u2029/g,'\\u2029')
From the last paragraph in the article you linked:
The solution
Luckily, the solution is simple: If we look at the JSON specification we see that the only place where a U+2028 or U+2029 can occur is in a string. Therefore we can simply replace every U+2028 with \u2028 (the escape sequence) and U+2029 with \u2029 whenever we need to send out some JSONP.
It's already been fixed in Rack::JSONP and I encourage all frameworks or libraries that send out JSONP to do the same. It's a one-line patch in most languages and the result is still 100% valid JSON.
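Since the question also asks about stringifying objects, the replacement above can be wrapped in a small helper. This is only a sketch: toJsLiteral is a hypothetical name, and it assumes the value is JSON-serializable (JSON.stringify returns undefined for functions and for undefined itself, which this helper does not handle).

```javascript
// Hypothetical helper: serialize a JSON-serializable value so that no raw
// U+2028 or U+2029 remains in the output.
function toJsLiteral(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const literal = toJsLiteral({ text: "ro\u2028cks", items: ["a\u2029b"] });
console.log(literal);
```

The result is still valid JSON, so JSON.parse reads it back to the original value.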
Within Node.js, I am using querystring.stringify() to encode an object into a query string for usage in a URL. Values that have spaces are encoded as %20.
I'm working with a particularly finicky web service that will only accept spaces encoded as +, as used to be commonly done prior to RFC3986.
Is there a way to set an option for querystring so that it encodes spaces as +?
Currently I am simply doing a .replace() to replace all instances of %20 with +, but this is a bit tedious if there is an option I can set ahead of time.
If anyone is still facing this issue, the "qs" npm package has an option to encode spaces as +:
qs.stringify({ a: 'b c' }, { format : 'RFC1738' })
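If adding a dependency is not desirable, Node's built-in querystring.stringify accepts an options object with an encodeURIComponent function, which allows the %20 replacement to be set up once. The sketch below assumes a CommonJS module; plusEncode is a hypothetical helper name. URLSearchParams is shown as an alternative because it serializes spaces as + by default.

```javascript
const querystring = require("querystring");

// Hypothetical encoder: percent-encode as usual, then turn encoded spaces into "+".
// A literal "+" in the input is already encoded as %2B, so it is not confused with a space.
const plusEncode = (s) => encodeURIComponent(s).replace(/%20/g, "+");

const viaQuerystring = querystring.stringify(
  { a: "b c", q: "x+y" },
  "&",
  "=",
  { encodeURIComponent: plusEncode }
);
console.log(viaQuerystring); // a=b+c&q=x%2By

// Alternative: URLSearchParams uses form encoding, where a space becomes "+".
const viaSearchParams = new URLSearchParams({ a: "b c" }).toString();
console.log(viaSearchParams); // a=b+c
```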
I can't think of any library doing that by default, and unfortunately, I'd say your implementation may be the most efficient way to do this, since any other option would probably either do what you're already doing, or use slower non-compiled pure JavaScript code.
What about asking the web service provider to follow the RFC?
https://github.com/kvz/phpjs is a node.js package that provides all the php functions. The http_build_query implementation at the time of writing this only supports urlencode (the query string includes + instead of spaces), but hopefully soon will include the enc_type parameter / rawurlencode (%20's for spaces).
See http://php.net/http_build_query.
RFC1738 (+'s) will be the default enc_type either way, so you can use it immediately for your purposes.
I have something like
<s:link view="/member/index.xhtml" value="My&#160;News" propagation="none"/>
<s:link view="/member/index.xhtml" value="#{msg.myText}" propagation="none"/>
where the value of myText in the messages.properties is
myText=My&#160;News
The first line of the example works fine and renders the text as "My News", but the second, which takes its value from the resource bundle, escapes the ampersand too and renders "My&#160;News".
I also tried Unicode escape sequences for the ampersand and/or hash, with My\u0026\u0023160;News, My\u0026#160;News and My\u0026nbsp;News in the properties file, without success.
(I used CSS no-wrap instead of the previously used XML encoding, but I would be interested in an answer anyway.)
EDIT - Answer to clarified question
The first is obviously inline, so the interpreter knows that it is safe.
The second one comes from an external source (you are using Expression Language) and as such is not considered safe and needs to be escaped. The result of escaping is what you describe: it shows you the exact text of the HTML entity.
This is related to security (XSS, for example) and not necessarily to i18n.
Previous attempt
I don't quite know what you are asking for but I believe it is "how to display it?".
Most of the standard JSF controls have an escape attribute that, if set to false, prevents the text from being escaped. Unfortunately, it seems that you are using something like SeamTools, which does not have this attribute.
Well, in this case there is not much to be done. Unless you can use a standard control, maybe you should try saving your properties file as Unicode (UTF-16 big-endian, in fact) and simply put in a valid Unicode non-breaking space character. Theoretically that should work; Unicode-encoded properties files are supported in the latest versions of Java (although I cannot recall if it was Java SE 5 or Java SE 6)...
I'm working on a PInvoke wrapper for a library that does not support Unicode strings, but does support multi-byte ANSI strings. While investigating FxCop reports on the library, I noticed that the string marshaling being used had some interesting side effects. The PInvoke method was using "best fit" mapping to create a single-byte ANSI string. For illustration, this is what one method looked like:
[DllImport("thedll.dll", CharSet=CharSet.Ansi)]
public static extern int CreateNewResource(string resourceName);
The result of calling this function with a string that contains non-ASCII characters is that Windows finds a "close" character; generally this looks like it ends up being "???". If we pretend that 'a' is a non-ASCII character, then passing "cat" as a parameter would create a resource named "c?t".
If I follow the guidelines in the FxCop rule, I end up with something like this:
[DllImport("thedll.dll", CharSet=CharSet.Ansi, BestFitMapping = false, ThrowOnUnmappableChar = true)]
public static extern int CreateNewResource([MarshalAs(UnmanagedType.LPStr)] string resourceName);
This introduces a change in behavior; now when a character cannot be mapped an exception is thrown. This concerns me because this is a breaking change, so I'd like to try and marshal the strings as multi-byte ANSI but I cannot see a way to do so. UnmanagedType.LPStr is specified to be a single-byte ANSI string, LPTStr will be Unicode or ANSI depending on the system, and LPWStr is not what the library expects.
How would I tell PInvoke to marshal the string as a multibyte string? I see there's a WideCharToMultiByte() API function, could I change the signature to expect an IntPtr to a string I create in unmanaged memory? It seems like this still has many of the problems that the current implementation has (it still might have to drop or substitute characters), so I'm not sure if this is an improvement. Is there another method of marshaling that I'm missing?
ANSI is multi-byte, and ANSI strings are encoded according to the codepage currently enabled on the system. WideCharToMultiByte works the same way as P/Invoke.
Maybe what you're after is conversion to UTF-8. Although WideCharToMultiByte supports this, I don't think P/Invoke does, since it's not possible to adopt UTF-8 as the system-wide ANSI code page. At this point you'd be looking at passing the string as an IntPtr instead, although if you're doing that, you may as well use the managed Encoding class to do the conversion, rather than WideCharToMultiByte.
Here is the best way I've found to accomplish this. Instead of marshalling as a string, marshal as a byte[]. Put the responsibility on the caller of the P/Invoke function to convert the string to a byte array in the most appropriate fashion, most likely by using one of the Text.Encoding classes.
If you end up having to call WideCharToMultiByte manually, I would get rid of the p/invoke and manually marshal this using WideCharToMultiByte in a C++/CLI wrapper function. Managed C++ is much better at these interop scenarios than C# is.
Though, if this is the only p/invoke you have, it's probably not worth it.