How to represent this utf-8 encoded string in Rust?

How to represent this utf-8 encoded string in Rust? - rust

On this RFC: https://www.rfc-editor.org/rfc/rfc7616#page-19 at page 19, there's this example of a text encoded in UTF-8:
J U+00E4 s U+00F8 n D o e
4A C3A4 73 C3B8 6E 20 44 6F 65
How do I represent it in a Rust String?
I tried https://mothereff.in/utf-8 and doing J\00E4s\00F8nDoe but it didn't work.

"Jäsøn Doe" should work fine. Rust source files are always UTF-8 encoded and a string literal may contain any Unicode scalar value (that is, any code point except surrogates, which must not be encoded in UTF-8).
If your editor does not support UTF-8 encoding, but supports ASCII, you can use Unicode code point escapes, which are documented in the Rust reference:
A 24-bit code point escape starts with U+0075 (u) and is followed by up to six hex digits surrounded by braces U+007B ({) and U+007D (}). It denotes the Unicode code point equal to the provided hex value.
suggesting the correct syntax should be "J\u{E4}s\u{F8}n Doe".

You can refer to Rust By Example as everything is not covered in rust eBook
(https://doc.rust-lang.org/stable/rust-by-example/std/str.html#literals-and-escapes)
You can use the syntax \u{your_unicode}
let unicode_str = String::from("J\u{00E4}s\u{00F8}nDoe");
println!("{}", unicode_str);

Related

Buffers in Node

Consider following Node code
const bufA = Buffer.from('tést');
bufA :
<Buffer 74 c3 a9 73 74>
Why four characters in input translated to 5 hexadecimal bytes ?

When you call Buffer.from(string), a couple of things happen:
The encoding is defaulted to utf-8
The JS string, which is internally stored as an array of UCS-2 characters, is encoded into UTF-8
In UTF-8, é is a multi-byte character, like most accented characters from Latin scripts. Here's more information on how the character is encoded in different systems: https://www.fileformat.info/info/unicode/char/e9/index.htm
As you can see, the UTF-8 representation of this character is 0xC3 0xA9, which corresponds to the second and third bytes c3 a9 in your Buffer.
This also means that, when decoding Buffers (e.g. when concatenating data coming in from a Stream), some characters may fall on the buffer boundary and it may be impossible to decode the string until you have the remainder of the character (0xC3 on its own would be invalid). This is why code examples you find on the Web which do:
let result = '';
stream.on('data', function(buf) {
// BUG! Does not account for multi-byte characters.
result += buf.toString();
});
are almost always wrong - unless the Stream is already set up for handling encoding itself (then, you'll get strings, not buffers, when reading).

ByteArray convert to String with UTF-8 charset in kotlin problem

i have a little confused
// default charset utf8
val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)
val str = String(bytes)
// there i got array [78, 23, 41, 51, -17, -65, -67, 42]
val weird = str.toByteArray()
i put random value into the bytes property, for some reason. why is it inconsistent???

The issue here is that your bytes aren't a valid UTF-8 sequence.
Any sequence of bytes can be interpreted as valid ISO Latin-1, for example.  (There may be issues with bytes having values 0–31, but those generally don't stop the characters being stored and processed.)  Similar applies to most other 8-bit character sets.
But the same isn't true of UTF-8.  While all sequences of bytes in the range 1–127 are valid UTF-8 (and interpreted the same as they are in ASCII and most 8-bit encodings), bytes in the range 128–255 can only appear in certain well-defined combinations.  (This has several very useful properties: it lets you identify UTF-8 with a very high probability; it also avoids issues with synchronisation, searching, sorting, &c.)
In this case, the sequence in the question (which is 4E 17 29 33 E0 2A in unsigned hex) isn't valid UTF-8.
So when you try to convert it to a string using the default encoding (UTF-8), the JVM substitutes the replacement character — value U+FFFD, which looks like this: � — in place of each invalid character.
Then, when you convert that back to UTF-8, you get the UTF-8 encoding of the replacment character, which is EF BF BD.  And if you interpret that as signed bytes, you get -17 -65 -67 — as in the question.
So Kotlin/JVM is handling the invalid input as best it can.

Capital text with space. What kind of text format is this?

I've got a list of songs that look like this.
ＧＩＦＴ〜白〜（冬恋／君の歌をうたう）【完全生産限定盤】
The latin letters GIFT here looks odd and I can't figure out how to make it read like a normal text. For example if you copy this word, it doesn't have a space in-between the letters or anything but seems to be in a different text format.
Can someone help me how I can convert this into normal text?

These are Unicode characters.
For example, the 'G' is
Unicode Character 'FULLWIDTH LATIN CAPITAL LETTER G' (U+FF27)
UTF-8 (hex) 0xEF 0xBC 0xA7 (efbca7)
see here
You can copy the string to Notepad++ and then convert it into Hex code (Extensions/Converter/ASCII->HEX)
and get EFBCA7EFBCA9EFBCA6EFBCB4 for the word 'GIFT'
Then serach for "Unicode EFBCA7" to find the above information.
This can be converted into normal latin characters. For example in .Net there is the Normalize function:
using System;
using System.Text;
public class Program
{
public static void Main()
{
Console.WriteLine("Unicode:");
String text = "ＧＩＦＴ";
Console.WriteLine(text);
byte[] bytes = Encoding.UTF8.GetBytes(text);
foreach(var b in bytes)
Console.Write("{0:X} ", b);
Console.WriteLine("\nASCII:");
String text2 = text.Normalize(NormalizationForm.FormKC);
Console.WriteLine(text2);
bytes = Encoding.UTF8.GetBytes(text2);
foreach(var b in bytes)
Console.Write("{0:X} ", b);
}
}
try it on .Net Fiddle
This will print out:
Unicode:
ＧＩＦＴ
EF BC A7 EF BC A9 EF BC A6 EF BC B4
ASCII:
GIFT
47 49 46 54
Probably there are similar functions in other languages too. Now you know what you're looking for.
The search term "convert unicode FULLWIDTH LATIN" will help you.
See also here
When you can't find a function for that, you can also just do your own conversion, after all the character code is just an offset to the normal ASCII/UTF-8 latin character set. See examples here.

String Index Error (Julia)

I'm a Julia newbie. When I was testing out the language, I got this error.
First of all, I'm defining String b to "he§y".
Julia seems behaving strangely when I have "special" characters in a String...
When I'm trying to get the third character of b (it's supposed to be '§'), everything is OK
However when I'm trying to get the fourth character of b (it's supposed to be 'y'), a "StringIndexError" is thrown.

I don't believe the compiler could throw you the error. Do you mean a runtime error?
I know nothing about Julian language but the symptoms seems to be related to indexing of string is not based on code point, but to some encoding.
The document from Julia lang seems supporting my hypothesis:
https://docs.julialang.org/en/stable/manual/strings/
The built-in concrete type used for strings (and string literals) in Julia is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)
...
Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
Edit: Quoted from Julia document, which is an example demonstrating exact "problem" you are facing.
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
Whether these Unicode characters are displayed as escapes or shown as
special characters depends on your terminal's locale settings and its
support for Unicode. String literals are encoded using the UTF-8
encoding. UTF-8 is a variable-width encoding, meaning that not all
characters are encoded in the same number of bytes. In UTF-8, ASCII
characters – i.e. those with code points less than 0x80 (128) – are
encoded as they are in ASCII, using a single byte, while code points
0x80 and above are encoded using multiple bytes – up to four per
character. This means that not every byte index into a UTF-8 string is
necessarily a valid index for a character. If you index into a string
at such an invalid byte index, an error is thrown:
julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> s[2]
ERROR: StringIndexError("∀ x ∃ y", 2)
[...]
julia> s[3]
ERROR: StringIndexError("∀ x ∃ y", 3)
Stacktrace:
[...]
julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

Trouble Converting a Hex String (in Lua)

I've been trying different ways to read this hex string, and I cannot figure out how. Each method only converts part of it. The online converters don't do it, and this is the method I tried:
function string.fromhex(str)
return (str:gsub('..', function (cc)
return string.char(tonumber(cc, 16))
end))
end
packedStr = "1b4c756151000104040408001900000040746d702f66756e632e416767616d656e696f6e2e6c756100160000002b0000000000000203000000240000001e0000011e008000000000000100000000000000170000002900000000000008410000000500000006404000068040001a400000160000801e0080000a00800041c0000022408000450001004640c1005a40000016800c80458001008500000086404001868040018600420149808083458001008580020086c04201c5000000c640c001c680c001c600c301014103009c80800149808084450001004980c38245c003004600c4005c408000458002004640c40085800400860040018640400186c0440186004201c14003005c00810116000280450105004641c50280010000c00100025c8180015a01000016400080420180005e010001614000001600fd7f4580050081c005005c400001430080004700060043008000478001004300800047c003001e008000190000000405000000676d63700004050000004368617200040700000053746174757300040b000000416767616d656e696f6e000404000000746d7000040f000000636865636b65645f7374617475730004030000007477000409000000636861724e616d6500040900000066756c6c6e616d6500040a0000006775696c644e616d65000407000000737472696e670004060000006d617463680004060000006775696c6400040400000025772b0001010403000000756900040d00000064726177456c656d656e7473000407000000676d617463680004030000005f470004050000004e616d650004060000007461626c6500040900000069734d656d6265720004060000006572726f7200044e0000005468697320786d6c206973206e6f742070726f7065726c7920636f6e6669677572656420666f722074686973206368617261637465722e20506c6561736520636f6e74616374204b616575732100040900000071756575654163740000000000410000001800000018000000180000001800000018000000180000001900000019000000190000001a0000001a0000001a0000001a0000001b0000001b0000001b0000001b0000001b0000001b0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001c0000001d0000001d0000001e0000001e0000001e0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000001f0000002000000020000000200000002000000020000000200000002000000021000000210000001f000000220000002400000024000000240000002500000025000000260000002600000027000000270000002900000005000000020000006e0009000000400000001000000028666f722067656e657261746f7229002b000000370000000c00000028666f7220737461746529002b000000370000000e00000028666f7220636f6e74726f6c29002b000000370000000200000074002c000000350000000000000003000000290000002a0000002b000000010000000800000072657466756e6300010000000200000000000000"
local f = assert(io.open("unsquished.lua", "w+"));
f:write(packedStr:fromhex());
f:close()
This simply gives me a bunch of gibberish surrounded by a few readable strings.
Could someone please tell me how to convert the entirety of this string into readable format? Thank you!

Break your packedStr in parts of 2
1b = 27
4c = 76
75 = 117
61 = 97
and so forth. When you use string.char() with the resulting decimal output, it converts them to equivalent ASCII values. Of the total possible 256 ASCII values in extended ASCII table, only 95 are printable characters.
Thus, you'll always receive the gibberish text. Here's what you'd receive when trying to print each of the character separately: http://codepad.org/orM7pmAb and that is the only possible "readable" output.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string