I'm a little confused.
// default charset utf8
val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)
val str = String(bytes)
val weird = str.toByteArray()
// here I get the array [78, 23, 41, 51, -17, -65, -67, 42]
I put random values into the byte array, just for testing. Why is the round trip inconsistent???
The issue here is that your bytes aren't a valid UTF-8 sequence.
Any sequence of bytes can be interpreted as valid ISO Latin-1, for example. (There may be issues with bytes having values 0–31, but those generally don't stop the characters being stored and processed.) Similar applies to most other 8-bit character sets.
But the same isn't true of UTF-8. While all sequences of bytes in the range 1–127 are valid UTF-8 (and interpreted the same as they are in ASCII and most 8-bit encodings), bytes in the range 128–255 can only appear in certain well-defined combinations. (This has several very useful properties: it lets you identify UTF-8 with a very high probability; it also avoids issues with synchronisation, searching, sorting, &c.)
In this case, the sequence in the question (which is 4E 17 29 33 E0 2A in unsigned hex) isn't valid UTF-8.
So when you try to convert it to a string using the default encoding (UTF-8), the JVM substitutes the replacement character — value U+FFFD, which looks like this: � — in place of each invalid character.
Then, when you convert that back to UTF-8, you get the UTF-8 encoding of the replacement character, which is EF BF BD. And if you interpret those as signed bytes, you get -17 -65 -67 — as in the question.
So Kotlin/JVM is handling the invalid input as best it can.
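The round trip can be reproduced outside the JVM. Here's a quick Python sketch of the same decode/encode cycle (Python's 'replace' error handler mirrors the JVM's default replacement behaviour here):

```python
# Decode the invalid bytes with the replacement-character policy, then re-encode.
data = bytes([78, 23, 41, 51, 0xE0, 42])        # -32 as an unsigned byte is 0xE0
text = data.decode('utf-8', errors='replace')    # 0xE0 is invalid here -> U+FFFD
out = text.encode('utf-8')
print(list(out))                                 # [78, 23, 41, 51, 239, 191, 189, 42]

# As signed bytes (Kotlin's Byte is signed): 239-256 = -17, 191-256 = -65, 189-256 = -67
signed = [b - 256 if b > 127 else b for b in out]
print(signed)                                    # [78, 23, 41, 51, -17, -65, -67, 42]
```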
Related
In this RFC, https://www.rfc-editor.org/rfc/rfc7616#page-19, on page 19 there's this example of text encoded in UTF-8:
J U+00E4 s U+00F8 n D o e
4A C3A4 73 C3B8 6E 20 44 6F 65
How do I represent it in a Rust String?
I tried https://mothereff.in/utf-8 and doing J\00E4s\00F8nDoe but it didn't work.
"Jäsøn Doe" should work fine. Rust source files are always UTF-8 encoded and a string literal may contain any Unicode scalar value (that is, any code point except surrogates, which must not be encoded in UTF-8).
If your editor does not support UTF-8 encoding, but supports ASCII, you can use Unicode code point escapes, which are documented in the Rust reference:
A 24-bit code point escape starts with U+0075 (u) and is followed by up to six hex digits surrounded by braces U+007B ({) and U+007D (}). It denotes the Unicode code point equal to the provided hex value.
suggesting the correct syntax should be "J\u{E4}s\u{F8}n Doe".
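You can verify that this string produces exactly the bytes the RFC lists; here's a small check sketched in Python (using `bytes.hex()` with a separator, available since Python 3.8):

```python
# Encode "Jäsøn Doe" and compare with the hex bytes in RFC 7616:
# 4A C3A4 73 C3B8 6E 20 44 6F 65
s = "J\u00E4s\u00F8n Doe"
print(s.encode('utf-8').hex(' '))  # 4a c3 a4 73 c3 b8 6e 20 44 6f 65
```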
You can also refer to Rust By Example, since not everything is covered in the Rust book:
(https://doc.rust-lang.org/stable/rust-by-example/std/str.html#literals-and-escapes)
You can use the syntax \u{your_unicode}
let unicode_str = String::from("J\u{00E4}s\u{00F8}nDoe");
println!("{}", unicode_str);
Consider following Node code
const bufA = Buffer.from('tést');
bufA :
<Buffer 74 c3 a9 73 74>
Why are four input characters translated to five bytes?
When you call Buffer.from(string), a couple of things happen:
The encoding defaults to utf-8
The JS string, which is stored internally as a sequence of UTF-16 code units, is encoded into UTF-8
In UTF-8, é is a multi-byte character, like most accented characters from Latin scripts. Here's more information on how the character is encoded in different systems: https://www.fileformat.info/info/unicode/char/e9/index.htm
As you can see, the UTF-8 representation of this character is 0xC3 0xA9, which corresponds to the second and third bytes c3 a9 in your Buffer.
This also means that, when decoding Buffers (e.g. when concatenating data coming in from a Stream), some characters may fall on the buffer boundary and it may be impossible to decode the string until you have the remainder of the character (0xC3 on its own would be invalid). This is why code examples you find on the Web which do:
let result = '';
stream.on('data', function(buf) {
// BUG! Does not account for multi-byte characters.
result += buf.toString();
});
are almost always wrong - unless the Stream is already set up for handling encoding itself (then, you'll get strings, not buffers, when reading).
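To illustrate the boundary problem concretely (sketched here in Python rather than Node, using the same 'tést' bytes): a naive per-chunk decode corrupts the character split across the boundary, while an incremental decoder buffers the partial sequence until the rest arrives.

```python
import codecs

data = 'tést'.encode('utf-8')     # b't\xc3\xa9st' -- 5 bytes for 4 characters
chunks = [data[:2], data[2:]]     # split in the middle of the 2-byte é

# Naive per-chunk decoding (the bug above): each chunk decoded in isolation.
naive = ''.join(c.decode('utf-8', errors='replace') for c in chunks)
print(naive)                      # 't\ufffd\ufffdst' -- the é is lost

# An incremental decoder holds the dangling 0xC3 until the 0xA9 arrives.
dec = codecs.getincrementaldecoder('utf-8')()
correct = ''.join(dec.decode(c) for c in chunks) + dec.decode(b'', final=True)
print(correct)                    # 'tést'
```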
I'm writing a hex viewer on python for examining raw packet bytes. I use dpkt module.
I assumed that one byte has a value between 0x00 and 0xFF. However, I've noticed that Python's bytes representation looks different:
b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
I don't understand what these symbols mean. How can I translate them back to the original 1-byte values to show in a hex viewer?
The \xhh indicates a hex value of hh. i.e. it is the Python 3 way of encoding 0xhh.
See https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
The b at the start of the string is an indication that the variables should be of bytes type rather than str. The above link also covers that. The \n is a newline character.
You can use bytearray to store and access the data. Here's an example using the byte string in your question.
example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
encoded_array = bytearray(example_bytes)
print(encoded_array)
>>> bytearray(b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1')
# Print the value of \x8a which is 138 in decimal.
print(encoded_array[0])
>>> 138
# Encode value as Hex.
print(hex(encoded_array[0]))
>>> 0x8a
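For a hex viewer, you don't even need to index byte by byte: `bytes.hex()` (with a separator argument, available since Python 3.8) gives the whole dump at once. A short sketch using the same byte string:

```python
example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
# One hex pair per byte; '\n' is 0a, '+' is 2b, 'V' is 56, '$' is 24.
print(example_bytes.hex(' '))  # 8a 0a 1e 2b 1f 84 56 f2 ca 24 b1
```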
Hope this helps.
I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequences of markers. Each Marker starts with FF xx, where xx is a specific Marker-ID.
So my idea, was to simply read in the image in binary format, and seek for the corresponding character-combinations in the binary stream. This should enable me to split the header in the corresponding marker-fields.
For instance, this is what I receive when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)?
Something like this is what I would expect:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are not characters of the hex representation I expect from a byte string (only 0-9, a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question summarizes to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (except a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does Python not return nice chunks of 2 bytes (in hex format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two-bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each pair of bytes, or use unpack from the struct module (there are actually several ways):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because JPEG's 2-byte markers and lengths are big-endian (network byte order). Check the struct documentation if you want to deviate from this.
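To answer the "how do I parse it" part: since each segment after the SOI marker carries a big-endian length field, you can walk the markers directly instead of scanning for FF bytes. A minimal sketch (it assumes a well-formed baseline JPEG with no fill bytes; `list_markers` is just an illustrative name):

```python
import struct

def list_markers(path):
    # Walk the segments of a well-formed baseline JPEG, yielding
    # (marker, length) pairs until the SOS marker (FFDA), after which
    # entropy-coded image data follows.
    with open(path, 'rb') as f:
        if f.read(2) != b'\xff\xd8':
            raise ValueError('not a JPEG: missing SOI marker')
        while True:
            marker, = struct.unpack('>H', f.read(2))
            if marker == 0xFFDA:            # SOS: the header ends here
                yield marker, None
                return
            length, = struct.unpack('>H', f.read(2))
            yield marker, length
            f.seek(length - 2, 1)           # length counts its own 2 bytes
```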
Can someone please explain how does this code work:
[...Buffer('abc')]
the result is:
[ 97, 98, 99 ]
First of all, consider this piece of code:
console.log([...[1, 2, 3]]); //[1, 2, 3]
In an array literal, the spread syntax expands an iterable into its individual elements.
Node.js's Buffer objects are essentially arrays of bytes, a way to represent characters and deal with binary data at the same time. You can read more about them at https://nodejs.org/api/buffer.html.
Now, since 'abc' consists of three ASCII characters, each character takes up exactly one byte, and that byte's value is its ASCII code.
You can get this by doing: myString.charCodeAt(pos), in your case 'abc'.charCodeAt(0) will return 97.
So,
[...Buffer('abc')]
will actually return an array containing the ASCII codes of each character of 'abc', that is [97, 98, 99].
Since buffers' encoding is by default UTF-8, things will get more exciting when you are dealing with unicode.
console.log([...Buffer('漢字')]); //[230, 188, 162, 229, 173, 151]
Sorry for potential typos and hope this helps.
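The same behaviour can be reproduced in Python, where iterating over a bytes object likewise yields integer byte values:

```python
# list() over an encoded string plays the role of [...Buffer(...)] in Node.
print(list('abc'.encode('utf-8')))    # [97, 98, 99]
# Each of these two CJK characters takes three bytes in UTF-8.
print(list('漢字'.encode('utf-8')))   # [230, 188, 162, 229, 173, 151]
```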