Using hexdump, how do I find the associated character? - linux

I execute hexdump on a data file and it prints out the following :
> hexdump myFile.data
a4c3
After switching byte order I have the following :
c3a4
Do I assume those HEX values are actual Unicode values?
If so, the values are :
Ã and ¤
Or do I take the c3a4 and treat it as UTF-8 data (since my Putty session is set to UTF-8) then convert it to Unicode?
If so, it results in E4, which then is ä
Which is the proper interpretation?

You cannot assume those hex values are Unicode values. In fact, hexdump will never (well, see below...) give you Unicode values.
Those hex values represent the binary data as it was written to disk when the file was created. But in order to translate that data back to any specific characters/symbols/glyphs, you need to know what specific character encoding was used when the file was created (ASCII, UTF-8, and so on).
Also, I recommend using hexdump with the -C option (that's the uppercase C) to give the so-called "canonical" representation of the hex data:
c3 a4 0a
In my case, there is also a 0a representing a newline character.
So, in the above example we have 0xc3 followed by 0xa4 (I added the 0x part to indicate we are dealing with hex values). I happen to know that this file used UTF-8 when it was created. I can therefore determine that the character in the file is ä (Unicode code point U+00E4).
But the key point is: you must know how the file was encoded, to know with certainty how to interpret the bytes provided by hexdump.
Unicode is (amongst other things) an abstract numbering system for characters, separate from any specific encoding. That is one of the reasons why it is so useful. But it just so happens that its designers assigned the first 128 code points to the same characters as ASCII, which is why the ASCII letter a has the same code value as the Unicode a. As the example above shows, Unicode code points and UTF-8 byte sequences are not the same once you get beyond that initial ASCII range.
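To see in practice why the encoding matters, here is a minimal Python sketch that decodes the same two bytes from the hexdump above with two different encodings:
>>> data = b'\xc3\xa4'          # the two bytes reported by hexdump
>>> data.decode('utf-8')        # one two-byte character
'ä'
>>> data.decode('latin-1')      # two one-byte characters, U+00C3 and U+00A4
'Ã¤'
Same bytes, two different texts, which is why you must know the encoding first.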

Related

How to save bytes to file as binary mode

I have a bytes-like object something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it to a file in binary mode. The code is below, but it does not work well:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file looks like this:
[screenshot of the file contents]
but I want it to be in this format, with the Unicode characters as binary data:
[screenshot of the desired file contents]
How can I convert the bytes to save them in the file?
\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of characters, and do not straightforwardly map to a serialization format (consisting of bytes, or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
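Putting that together, a minimal sketch of the corrected script (the filename database.bin is taken from the question; the choice of UTF-16-BE is an assumption based on the byte layout you want):
# encode the text to UTF-16 big-endian bytes, then write them in binary mode
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
with open('database.bin', 'wb') as datafile:
    datafile.write(aa)

# reading the data back requires knowing which encoding was used
with open('database.bin', 'rb') as datafile:
    print(datafile.read().decode('utf-16-be'))   # abc混合def.mp3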

Is the [0xff, 0xfe] prefix required on utf-16 encoded strings?

Rewritten question!
I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII based, so I thought this would be the way to translate my string into the vendor's string:
>>> b1 = 'abc'.encode('utf-16')
But examining the result, I see that there's a leading [0xff, 0xfe] on the bytearray:
>>> [hex(b) for b in b1]
['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
Since the vendor's device is not expecting the [0xff, 0xfe], I can strip it off...
>>> b2 = 'abc'.encode('utf-16')[2:]
>>> [hex(b) for b in b2]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
... which is what I want.
But what really surprises me that I can decode b1 and b2 and they both reconstitute to the original string:
>>> b1.decode('utf-16') == b2.decode('utf-16')
True
So my two intertwined questions:
What is the significance of the 0xff, 0xfe on the head of the encoded bytes?
Is there any hazard in stripping off the 0xff, 0xfe prefix, as with b2 above?
This observation
... what really surprises me that I can decode b1 and b2 and they both reconstitute to the original string:
b1.decode('utf-16') == b2.decode('utf-16')
True
suggests there is a built-in default, because there are two possible arrangements for the 16-bit wide UTF-16 codes: Big and Little Endian.
Normally, Python deduces the endianness to use from the BOM when reading – and so it always adds one when writing. If you want to force a specific endianness, you can use the explicit encodings utf-16-le and utf-16-be:
… when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
(https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data)
But if you do not use a specific ordering, then what default gets used? The original Unicode proposal, PEP 100, warns
Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.
(https://www.python.org/dev/peps/pep-0100/, my emph.)
Yet it works for you. If we look up in the Python source code how this is managed, we find this comment in _codecsmodule.c:
/* This version provides access to the byteorder parameter of the
builtin UTF-16 codecs as optional third argument. It defaults to 0
which means: use the native byte order and prepend the data with a
BOM mark.
*/
and deeper, in unicodeobject.c,
/* Check for BOM marks (U+FEFF) in the input and adjust current
byte order setting accordingly. In native mode, the leading BOM
mark is skipped, in all other modes, it is copied to the output
stream as-is (giving a ZWNBSP character). */
So initially, the byte order is set to the default for your system, and when you start decoding UTF-16 data and a BOM follows, the byte order gets set to whatever the BOM specifies. The "native mode" in this last comment means that no byte order has been explicitly declared and none has been established by a BOM; when that is the case, your system's endianness is used.
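A small sketch of that behaviour (the output shown is from a little-endian machine; the BOM bytes and the no-BOM fallback would differ on a big-endian one):
>>> '€'.encode('utf-16').hex()                   # BOM prepended, native byte order used
'fffeac20'
>>> bytes.fromhex('fffeac20').decode('utf-16')   # BOM consumed, order taken from it
'€'
>>> bytes.fromhex('ac20').decode('utf-16')       # no BOM: falls back to native order
'€'
>>> hex(ord(bytes.fromhex('ac20').decode('utf-16-be')))   # explicit big-endian: a different code point
'0xac20'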
This is the byte order mark. It's a prefix to a UTF document that indicates what endianness the document uses. It does this by encoding the code point 0xFEFF in the byte order - in this case, little endian (less significant byte first). Anything trying to read it the other way around, in big endian (more significant byte first), will read the first character as 0xFFFE, which is a code point that is specifically not a valid character, informing the reader it needs to error or switch endianness for the rest of the document.
It is the byte order mark, a.k.a. BOM: see https://en.wikipedia.org/wiki/UTF-16 (look at the subheading "Byte order encoding schemes").
Its purpose is to allow the decoder to detect whether the encoding is little-endian or big-endian.
It is the Unicode byte order mark encoded in UTF-16. Its purpose is to communicate the byte order to a reader expecting text encoded with a Unicode character encoding.
You can omit it if the reader otherwise knows or comes to know the byte order.
'abc'.encode('utf-16-le')
The answers, and especially the comment from usr2564301, are helpful: the 0xff 0xfe prefix is the byte order mark, and it carries the endianness information along with the byte string. If you know which endianness you want, you can specify utf-16-le or utf-16-be as part of the encoding.
This makes it clear:
>>> 'abc'.encode('utf-16').hex()
'fffe610062006300'
>>> 'abc'.encode('utf-16-le').hex()
'610062006300'
>>> 'abc'.encode('utf-16-be').hex()
'006100620063'

How are Linux shells and filesystem Unicode-aware?

I understand that the Linux filesystem stores file names as byte sequences, which is meant to be Unicode-encoding-independent.
But encodings other than UTF-8 or Enhanced UTF-8 may very well use the 0 byte as part of a multibyte representation of a Unicode character that can appear in a file name. And everywhere in Linux filesystem C code, strings are terminated with a 0 byte. So how does the Linux filesystem support Unicode? Does it assume all applications that create filenames use only UTF-8? But that is not true, is it?
Similarly, the shells (such as bash) use * in patterns to match any number of filename characters. I can see in shell C code that it simply uses the ASCII byte for * and goes byte-by-byte to delimit the match. Fine for UTF-8 encoded names, because it has the property that if you take the byte representation of a string, then match some bytes from the start with *, and match the rest with another string, then the bytes at the beginning in fact matched a string of whole characters, not just bytes.
But other encodings do not have that property, do they? So again, do shells assume UTF-8?
It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).
It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:
The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be at least 128 characters that are encoded using only a single byte, namely:
The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)
UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.
Some programs may also require an additional property:
The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.
UTF-8 has this property, but (I think) EUC-JP doesn't.
It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)
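A small illustration of ASCII-compatibility in Python (the encodings chosen are just examples; EUC-JP and ISO-8859-1 stand in here for "other ASCII-compatible encodings"):
>>> text = 'some/ascii-only_path.txt'
>>> text.encode('utf-8') == text.encode('ascii') == text.encode('iso-8859-1') == text.encode('euc-jp')
True
>>> 'ä/ö'.encode('utf-8')     # non-ASCII characters become multi-byte sequences,
b'\xc3\xa4/\xc3\xb6'          # but the byte 0x2f ('/') still only ever means '/'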
There is no encoding of filenames in Linux (in the ext family of filesystems, at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else. The filesystem doesn't care.
POSIX stipulates that the shell obeys the locale environment variables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much, as such encodings are not commonly supported by existing locales. UTF-8, on the other hand, seems to be supported well: in my experiments bash correctly matches the ? pattern against a single Unicode character, rather than a single byte, in the filename (given a UTF-8 locale), as prescribed by POSIX.
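You can see the byte-oriented view of filenames directly from Python (a sketch; the filename is just an example). Passing bytes to os functions bypasses any decoding:
import os

# create a file whose name contains a non-ASCII character
with open('ä.txt'.encode('utf-8'), 'w') as f:
    f.write('hello')

print(os.listdir(b'.'))   # raw bytes from the kernel, e.g. [b'\xc3\xa4.txt', ...]
print(os.listdir('.'))    # the same names, decoded using the filesystem encoding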

What is the difference between a unicode and binary string?

I am on Python 3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concept of Unicode versus binary strings is confusing. How can I change b'\\u4f60' into b'\x4f\x60'?
First - there is no difference between unicode literals and string literals in python 3. They are one and the same - you can drop the u up front. Just write strings. So instantly you should see that the literal u'4f60' is just like writing actual '4f60'.
A bytes literal - aka b'some literal' - is a series of bytes. Bytes between 32 and 126 (the printable ASCII range) are displayed as their corresponding glyph; the rest are displayed as the \x escaped version. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
A string literal is a string literal. It can contain Unicode code points. There is far too much to cover to explain how Unicode works here, but basically a code point represents a character (which is rendered on screen as a glyph, a graphical shape for a letter/digit/symbol); it does not specify how the machine needs to represent it. In fact there are a great many different ways.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
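A minimal round trip between the two domains, using the character from the question:
>>> '你'.encode('utf-8')                # str -> bytes: you must pick an encoding
b'\xe4\xbd\xa0'
>>> b'\xe4\xbd\xa0'.decode('utf-8')     # bytes -> str: decoded with the same encoding
'你'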
How can I change b'\\u4f60' into b'\x4f\x60'?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60' #note, unicode-escaped
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph. \x4f\x60 is, if we represent it in ascii (or utf-8, actually), the letter O (\x4f) followed by backtick.
I can ask Python to turn that unicode-escaped bytes sequence into a valid string with the corresponding Unicode character:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can I change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that Unicode code point. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60': utf-16-be.
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: b'O`'
The reason this works is that UTF-16 is a variable-length encoding. For code points that fit in 16 bits (the Basic Multilingual Plane) it directly uses the code point value as the 2-byte encoding, and for code points above U+FFFF it uses something called "surrogate pairs", which I won't get into.
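You can see the surrogate-pair behaviour with a code point above U+FFFF (a small sketch; the emoji is just an example):
>>> '你'.encode('utf-16-be').hex()      # BMP code point: the code point itself, 2 bytes
'4f60'
>>> '😀'.encode('utf-16-be').hex()      # U+1F600: a surrogate pair, 4 bytes
'd83dde00'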

Is there a unicode range that is a copy of the first 128 characters?

I would like to be able to put " and other characters into a text without it being interpreted by the computer. So I was wondering: is there a range that is defined as mapping to the same glyphs etc. as the range 0-0x7F (the ASCII range)?
Please note I state that the range 0-0x7F is the same as ASCII, so the question is not what range maps to ASCII.
I am asking whether there is another range that also maps to the same glyphs, i.e. when rendered it will look the same, but when interpreted it may be seen as a different code.
so I can write
print "hello "world""
where the characters in bold (the inner quotation marks) avoid the 0-0x7F (ASCII) range.
Additional:
I was meaning homographic and behaviourally the same, well, everything the same except a different code point. I was hoping for the whole ASCII 128-character set, directly mapped (an offset added to them all).
The reason: to avoid interpretation by any language that uses some of the ASCII characters as part of its syntax but allows any Unicode character in literal strings, e.g. (when UTF-8 encoded) C, HTML, CSS, …
I was trying to retrofit the idea of “no reserved words” / “word colours” (string literals one colour, keywords another, variables another, numbers another, etc.) so that a string literal or variable name (though not in this case) can contain any character.
I interpret the question to mean "is there a set of code points which are homographic with the low 7-bit ASCII set". The answer is no.
There are some code points which are conventionally rendered homographically (e.g. Cyrillic uppercase А U+0410 looks identical to ASCII 65 in many fonts, and quite similar in most fonts which support this code point) but they are different code points with different semantics. Similarly, there are some code points which basically render identically, but have a specific set of semantics, like the non-breaking space U+00A0 which renders identically to ASCII 32 but is specified as having a particular line-breaking property; or the RIGHT SINGLE QUOTATION MARK U+2019 which is an unambiguous quotation mark, as opposed to its twin ASCII 39, the "apostrophe".
But in summary, there are many symbols in the basic ASCII block which do not coincide with a homograph in another code block. You might be able to find homographs or near-homographs for your sample sentence, though; I would investigate the IPA phonetic symbols and the Greek and Cyrillic blocks.
The answer to the question asked is “No”, as #tripleee described, but the following note might be relevant if the purpose is trickery or fun of some kind:
The printable ASCII characters excluding the space have been duplicated at U+FF01 to U+FF5E, but these are fullwidth characters intended for use in CJK texts. Their shape is (and is meant to be) different: ｈｅｌｌｏ　ｗｏｒｌｄ. (Your browser may be unable to render them.) So they are not really homographic with ASCII characters but could be used for some special purposes. (I have no idea of what the purpose might be here.)
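If the fullwidth forms happen to be acceptable for the purpose, the mapping is a fixed offset of 0xFEE0 from the printable ASCII characters. A sketch (the helper name to_fullwidth is made up for illustration):
def to_fullwidth(s):
    # map printable ASCII (excluding the space) to its fullwidth twin at U+FF01..U+FF5E
    return ''.join(chr(ord(c) + 0xFEE0) if 0x21 <= ord(c) <= 0x7E else c for c in s)

print(to_fullwidth('print "hello world"'))   # ｐｒｉｎｔ ＂ｈｅｌｌｏ ｗｏｒｌｄ＂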
Depends on the Unicode encoding you use.
In UTF-8, the first 128 characters are encoded with exactly the same single-byte values as in ASCII. In UTF-16, the first 128 ASCII characters occupy code units 0x0000 through 0x007F (two bytes each).
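For example:
>>> 'A'.encode('utf-8')        # the same single byte as in ASCII
b'A'
>>> 'A'.encode('utf-16-be')    # two bytes, 0x00 0x41
b'\x00A'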
