How to reverse strings that contain surrogate pairs in Dart?

I was playing with algorithms in Dart, and since I was following TDD, I realized that my code had some limitations.
I was trying to reverse strings as part of an interview problem, but I couldn't get surrogate pairs reversed correctly.
const simple = 'abc';
const emoji = '🐎🐝🐛';
const surrogate = '👮🏽‍♂️👩🏿‍💻';
String rev(String s) {
  return String.fromCharCodes(s.runes.toList().reversed);
}

void main() {
  print(simple);
  print(rev(simple));
  print(emoji);
  print(rev(emoji));
  print(surrogate);
  print(rev(surrogate));
}
The output:
abc
cba
๐ŸŽ๐Ÿ๐Ÿ›
๐Ÿ›๐Ÿ๐ŸŽ
๐Ÿ‘ฎ๐Ÿฝโ€โ™‚๏ธ๐Ÿ‘ฉ๐Ÿฟโ€๐Ÿ’ป
๐Ÿ’ปโ€๐Ÿฟ๐Ÿ‘ฉ๏ธโ™‚โ€๐Ÿฝ๐Ÿ‘ฎ
You can see that the simple emoji are reversed correctly, since I'm using the runes instead of just executing s.split('').toList().reversed.join(''), but the surrogate pairs are reversed incorrectly.
How can I reverse a string that might contain surrogate pairs using the Dart programming language?

When reversing strings, you must operate on graphemes, not on characters or code units. Use grapheme_splitter.
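A minimal sketch of that approach (assuming the package exposes a GraphemeSplitter class with a splitGraphemes method; check the package docs for the exact API):
import 'package:grapheme_splitter/grapheme_splitter.dart';

String rev(String s) {
  // Hypothetical API: split into grapheme clusters, then reverse and rejoin.
  final splitter = GraphemeSplitter();
  return splitter.splitGraphemes(s).toList().reversed.join();
}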

Dart 2.7 introduced a new package that supports grapheme-cluster-aware operations: characters, a package for characters represented as Unicode extended grapheme clusters.
Dartโ€™s standard String class uses the UTF-16 encoding. This is a common choice in programming languages, especially those that offer support for running both natively on devices, and on the web.
UTF-16 strings usually work well, and the encoding is transparent to the developer. However, when manipulating strings, and especially when manipulating strings entered by users, you may experience a difference between what the user perceives as a character, and what is encoded as a code unit in UTF-16.
Source: "Announcing Dart 2.7: A safer, more expressive Dart" by Michael Thomsen, section "Safe substring handling"
The package will also help to reverse your strings with emoji the way one would expect.
Using simple Strings, you find issues:
String hi = 'Hi 🇩🇰';
print('String.length: ${hi.length}');
// Prints 7; would expect 4
With characters:
String hi = 'Hi 🇩🇰';
print(hi.characters.length);
// Prints 4
print(hi.characters.last);
// Prints 🇩🇰
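The same package makes the reversal from the question straightforward. A minimal sketch, using the characters extension getter the package adds to String:
import 'package:characters/characters.dart';

String rev(String s) {
  // Characters iterates extended grapheme clusters, so skin-tone modifiers,
  // ZWJ sequences, and surrogate pairs survive the reversal intact.
  return s.characters.toList().reversed.join();
}

void main() {
  print(rev('👮🏽‍♂️👩🏿‍💻')); // 👩🏿‍💻👮🏽‍♂️
}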
It's worth taking a look at the source code of the characters package: it's far from simple, but it looks easier to digest and better documented than grapheme_splitter. The characters package is also maintained by the Dart team.

Related

Enhancing String Literals Delimiters to Support Raw Text

I recently found this code snippet in the Swift 5 book.
print(#"Write an interpolated string in Swift using \(multiplier)."#)
// Prints "Write an interpolated string in Swift using \(multiplier).โ€
print(#"6 times 7 is \#(6 * 7)."#)
// Prints "6 times 7 is 42.โ€
I learnt it was an accepted proposal in Swift 5 for enhancing string literal delimiters to support raw text, with many examples given.
My question is when and how it is used in practical cases, because from the examples given above I could clearly achieve what I want even without the # signs!
To give just one example where it is very useful: writing regexes. Previously this was a nightmare, as you had to escape all the special characters. E.g.
let regex1 = "\\\\[A-Z]+[A-Za-z]+\\.[a-z]+"
Can now be replaced with
let regex2 = #"\\[A-Z]+[A-Za-z]+\.[a-z]+"#
Much easier to write. Now when you find a regex online, you can just copy and paste it in without having to spend ages escaping special characters.
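As a further sketch (not from the original answer): if the literal itself ever contains the "# sequence, you can add more # signs to the delimiters, and interpolation then takes a matching number of #s:
let answer = 42
print(##"Raw "#" text with interpolation: \##(answer)"##)
// Prints "Raw "#" text with interpolation: 42"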
Edit: you can read more here:
https://www.hackingwithswift.com/articles/162/how-to-use-raw-strings-in-swift

Erlang Terms to Unicode String

I have a list of tuples, produced by some function, which looks like:
[{"a","ฤ…"},
{"ฤ…","a"},
{"a","o"},
{"o","e"}]
But when I print it, I see in terminal:
[{"a",[261]},
{[261],"a"},
{"a","o"},
{"o","e"}]
I usually print it with this command:
io:format("~p~n", [functionThatGeneratesListOfTuples()]),
So far I found that you need to use ~ts when printing Unicode strings, so I tried this:
Pairs = functionThatGeneratesListOfTuples(),
PairsStr = io_lib:format("~p", [Pairs]),
io:format("~ts~n", [PairsStr]),
Is there any way to get the Unicode strings represented appropriately?
The heuristics for detecting lists-of-integers as strings only recognize Latin-1 characters by default, so [65,66,67] is printed as "ABC" but [665,666,667] is printed as "[665,666,667]" even if you use ~tp. You have to start Erlang as erl +pc unicode to make it accept printable Unicode code points above 255. In that mode, [665,666,667] is printed as "ʙʚʛ" with ~tp (but not with ~p).
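For example, a shell-session sketch of the difference (prompts and exact formatting are illustrative):
%% started with: erl +pc unicode
1> io:format("~tp~n", [[{"a",[261]},{[261],"a"}]]).
[{"a","ą"},{"ą","a"}]
ok
2> io:format("~p~n", [[{"a",[261]},{[261],"a"}]]).
[{"a",[261]},{[261],"a"}]
ok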
See http://erlang.org/doc/man/io.html#printable_range-0 for more info, and also this recent improvement of the documentation, which will be included in OTP 21: https://github.com/erlang/otp/pull/1737/files

Converting Unicode in Swift

I currently have a string as follows which I received through an API call:
\n\nIt\U2019s a great place to discover Berlin and a comfortable place
to come home to.
And I want to convert it into something like this which is more readable:
It's a great place to discover Berlin and a comfortable place to come
home to.
I've taken a look at this post, but that manually writes down every conversion, and more of these Unicode scalars may appear.
What I understand is that \u{2019} is a Unicode scalar, but the format here is \U2019, and I'm quite confused. Are there any built-in methods to do this conversion?
This answer suggests using the NSString method stringByFoldingWithOptions.
The Swift String class has a concept called a "view" which lets you operate on the string under different encodings. It's pretty neat, and there are some views that might help you.
If you're dealing with strings in Swift, read this excellent post by Mike Ash. He discusses the idea of what a string really is with great detail and has some helpful hints for Swift 2.
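As a quick sketch of the views idea (current Swift syntax, not the Swift 2 API the post discusses):
let s = "cafe\u{301}" // "café" written with a combining accent
print(s.count) // 4 – grapheme clusters, what a user perceives
print(s.unicodeScalars.count) // 5 – the accent is a separate scalar
print(s.utf8.count) // 6
print(s.utf16.count) // 5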
Assuming you are already splitting the string and can get the offending format separately:
func convertFormat(stringOrig: String) -> Character {
    // Take the part after the "U" (e.g. "2019").
    let subString = String(stringOrig.characters.split("U").map({$0})[1])
    // The digits are hexadecimal, so parse with radix 16
    // (plain Int(subString) would wrongly read 2019 as decimal).
    let scalarValue = Int(subString, radix: 16)
    let scalar = UnicodeScalar(scalarValue!)
    return Character(scalar)
}
This will convert the String "\U2019" to the Character represented by "\u{2019}".
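For completeness, here is a sketch in current Swift that walks the whole string; the \UXXXX pattern and the four-hex-digit width are assumptions about the API's escape format:
import Foundation

func decodeEscapes(_ s: String) -> String {
    var result = s
    // Match a literal backslash, "U", and four hex digits, e.g. \U2019.
    let pattern = #"\\U([0-9A-Fa-f]{4})"#
    while let range = result.range(of: pattern, options: .regularExpression) {
        let hex = String(result[range].dropFirst(2)) // drop the \U prefix
        guard let value = UInt32(hex, radix: 16),
              let scalar = UnicodeScalar(value) else { break }
        result.replaceSubrange(range, with: String(Character(scalar)))
    }
    return result
}
// decodeEscapes(#"It\U2019s a great place"#) returns "It’s a great place"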

ServiceStack does not escape control characters in JSON

ServiceStack's JsonSerializer does not seem to encode control characters correctly.
For example, this C# expression....
JsonSerializer.SerializeToString(new { Text = "\u0010" })
... evaluates to this...
{"Text":"?"}
... where the "?" is the literal control character.
Instead, according to http://www.json.org it should evaluate to this:
{"Text":"\u0010"}
Is this a known bug or am I missing something?
The bad JSON output by my services is causing errors during deserialization by my service consumers.
You need to tell the serializer to escape Unicode characters.
JsConfig.EscapeUnicode = true;
JsonSerializer.SerializeToString(new{Text = "\u0010"});
The above evaluates to this:
{"Text":"\u0010"}
Thanks Mike, that works. But I think this approach escapes ALL non-ASCII Unicode characters in addition to control characters.
I'm expecting to have a lot of foreign-language characters in my data (Arabic, for example), so this will cause significant size bloat versus just including those unescaped Unicode characters in the JSON (which is still standard-compliant).
I imagine the purpose of EscapeUnicode = true is to produce JSON that can be stored or transmitted with simple ASCII encoding, which is certainly useful. And it apparently also encodes ASCII control characters as a side effect, which does solve my problem.
But in my opinion, JsonSerializer should escape control characters regardless of the EscapeUnicode setting since the standard requires it. I consider this a bug.
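To make the trade-off concrete, a small sketch (the exact escaped output shape is illustrative):
JsConfig.EscapeUnicode = false;
JsonSerializer.SerializeToString(new { Text = "مرحبا" });
// {"Text":"مرحبا"} – compact UTF-8, but control characters pass through raw

JsConfig.EscapeUnicode = true;
JsonSerializer.SerializeToString(new { Text = "مرحبا" });
// {"Text":"\u0645\u0631\u062d\u0628\u0627"} – pure ASCII, but much larger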
Since this is primarily a problem for me within my ServiceStack services, I also found this solution:
SetConfig(new EndpointHostConfig
{
    UseBclJsonSerializers = true
});
This tells ServiceStack to use .NET's built-in DataContractJsonSerializer instead of ServiceStack's JsonSerializer. I have verified that DataContractJsonSerializer does escape control characters correctly.
So it appears that I need to choose between JsonSerializer with EscapeUnicode = true (faster but with bloated output) and DataContractJsonSerializer (slower but with compact Unicode output).

How to convert between bytes and strings in Python 3?

This is a Python 101 type question, but it had me baffled for a while when I tried to use a package that seemed to convert my string input into bytes.
As you will see below I found the answer for myself, but I felt it was worth recording here because of the time it took me to unearth what was going on. It seems to be generic to Python 3, so I have not referred to the original package I was playing with; it does not seem to be an error (just that the particular package had a .tostring() method that was clearly not producing what I understood as a string...)
My test program goes like this:
import mangler # spoof package
stringThing = """
<Doc>
 <Greeting>Hello World</Greeting>
 <Greeting>你好</Greeting>
</Doc>
"""
# print out the input
print('This is the string input:')
print(stringThing)
# now make the string into bytes
bytesThing = mangler.tostring(stringThing) # pseudo-code again
# now print it out
print('\nThis is the bytes output:')
print(bytesThing)
The output from this code gives this:
This is the string input:
<Doc>
 <Greeting>Hello World</Greeting>
 <Greeting>你好</Greeting>
</Doc>
This is the bytes output:
b'\n<Doc>\n <Greeting>Hello World</Greeting>\n <Greeting>\xe4\xbd\xa0\xe5\xa5\xbd</Greeting>\n</Doc>\n'
So, there is a need to be able to convert between bytes and strings, to avoid ending up with non-ASCII characters being turned into gobbledegook.
The 'mangler' in the above code sample was doing the equivalent of this:
bytesThing = stringThing.encode(encoding='UTF-8')
There are other ways to write this (notably using bytes(stringThing, encoding='UTF-8')), but the above syntax makes it obvious what is going on, and also what to do to recover the string:
newStringThing = bytesThing.decode(encoding='UTF-8')
When we do this, the original string is recovered.
Note, using str(bytesThing) just transcribes all the gobbledegook without converting it back into Unicode, unless you specifically request UTF-8, viz., str(bytesThing, encoding='UTF-8'). No error is reported if the encoding is not specified.
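A quick illustration of that last point:
data = '你好'.encode(encoding='UTF-8')
print(str(data)) # b'\xe4\xbd\xa0\xe5\xa5\xbd' – just the repr, still gobbledegook
print(str(data, encoding='UTF-8')) # 你好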
In Python 3, there is a bytes() constructor that takes the same arguments as encode().
str1 = b'hello world'
str2 = bytes("hello world", encoding="UTF-8")
print(str1 == str2) # Returns True
I didn't read anything about this in the docs, but perhaps I wasn't looking in the right place. This way you can explicitly turn strings into byte streams, and it's more readable than using encode and decode, without having to prefix b in front of quotes.
This is a Python 101 type question,
It's a simple question but one where the answer is not so simple.
In Python 3, a bytes object represents a sequence of bytes, and a str object represents a sequence of Unicode code points.
To convert from bytes to str and from str back to bytes, you use the bytes.decode and str.encode methods. These take two parameters: an encoding and an error-handling policy.
Sadly, there are an awful lot of cases where sequences of bytes are used to represent text but it is not necessarily well defined which encoding is being used. Take, for example, filenames on Unix-like systems: as far as the kernel is concerned, they are a sequence of bytes with a handful of special values. On most modern distros most filenames will be UTF-8, but there is no guarantee that all filenames will be.
If you want to write robust software then you need to think carefully about those parameters. You need to think carefully about what encoding the bytes are supposed to be in, and how you will handle the case where they turn out not to be a valid sequence of bytes for the encoding you thought they should be in. Python defaults to UTF-8 and to erroring out on any byte sequence that is not valid UTF-8.
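A short sketch of those error-handling policies in action:
raw = b'\xff' + 'abc'.encode(encoding='UTF-8') # 0xff is never valid UTF-8
print(raw.decode('UTF-8', errors='replace')) # '\ufffd' then 'abc' – lossy but safe
print(raw.decode('UTF-8', errors='ignore')) # 'abc' – bad bytes silently dropped
try:
    raw.decode('UTF-8') # the default policy is errors='strict'
except UnicodeDecodeError as e:
    print(e)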
print(bytesThing)
Python uses repr as a fallback conversion to string. repr attempts to produce Python code that will recreate the object. In the case of a bytes object this means, among other things, escaping bytes outside the printable ASCII range.
Try this:
StringVariable = ByteVariable.decode('UTF-8', 'ignore')
To check the type:
print(type(StringVariable))
Here StringVariable ends up as a str and ByteVariable is a bytes object; the variable names themselves are not relevant to the question.
