How to save bytes to a file in binary mode - python-3.x

I have a bytes-like object something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it to a file in binary mode. The code is below, but it does not work as expected:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file looks like this:
(screenshot of the actual file contents omitted)
but the format I want is like this, with the Unicode characters encoded as binary data:
(screenshot of the desired file contents omitted)
How can I convert the bytes so they are saved to the file in that form?

\uNNNN escapes do not make sense in byte strings because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of strings, and do not straightforwardly map to a serialization format (consisting of bytes, or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
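Putting it together, a minimal sketch of the corrected script (still assuming database.bin and a UTF-16-BE target, as above):
if __name__ == "__main__":
    # Start from text (str) and encode it explicitly; the result is bytes.
    aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
    # Binary mode writes those bytes to disk unchanged.
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)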
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
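As a quick illustration of that difference (a sketch; utf-8-sig is Python's BOM-prefixed UTF-8 variant, and the utf-16 comment assumes a little-endian machine):
text = 'abc\u6df7\u5408def.mp3'
print(text.encode('utf-8').hex())      # plain UTF-8: no BOM at the front
print(text.encode('utf-8-sig').hex())  # same bytes prefixed with efbbbf, the UTF-8 BOM
print(text.encode('utf-16').hex())     # starts with fffe here: the BOM in native (little-endian) order
print(text.encode('utf-16-be').hex())  # explicit byte order, so no BOM is written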

Related

Using hexdump and how to find associated character?

I execute hexdump on a data file and it prints out the following:
> hexdump myFile.data
a4c3
After switching the byte order I have the following:
c3a4
Do I assume those HEX values are actual Unicode values?
If so, the values would be the characters at code points U+A4C3 and U+C3A4?
Or do I take the c3a4, treat it as UTF-8 data (since my PuTTY session is set to UTF-8), and then convert it to Unicode?
If so, it results in E4, which then is ä?
Which is the proper interpretation?
You cannot assume those hex values are Unicode values. In fact, hexdump will never (well, see below...) give you Unicode values.
Those hex values represent the binary data as it was written to disk when the file was created. But in order to translate that data back to any specific characters/symbols/glyphs, you need to know what specific character encoding was used when the file was created (ASCII, UTF-8, and so on).
Also, I recommend using hexdump with the -C option (that's the uppercase C) to give the so-called "canonical" representation of the hex data:
c3 a4 0a
In my case, there is also a 0a representing a newline character.
So, in the above example we have 0xc3 followed by 0xa4 (I added the 0x part to indicate we are dealing with hex values). I happen to know that UTF-8 was used when this file was created. I can therefore determine that the character in the file is ä (also referred to as Unicode U+00E4).
But the key point is: you must know how the file was encoded, to know with certainty how to interpret the bytes provided by hexdump.
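A quick way to confirm that interpretation from Python (a sketch using the two bytes shown by hexdump above):
data = bytes.fromhex('c3a4')
print(data.decode('utf-8'))    # ä (U+00E4): correct, because the file was written as UTF-8
print(data.decode('latin-1'))  # Ã¤: same bytes, wrong encoding, wrong characters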
Unicode is (amongst other things) an abstract numbering system for characters, separate from any specific encoding. That is one of the reasons why it is so useful. But it just so happens that its designers reused the ASCII code values for the initial set of characters, which is why the ASCII letter a has the same code value as the Unicode a. As the ä example above shows (code point U+00E4 versus the UTF-8 bytes c3 a4), the encodings are not the same once you get beyond that initial ASCII code range.

Is the [0xff, 0xfe] prefix required on utf-16 encoded strings?

Rewritten question!
I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII based, so I thought this would be the way to translate my string into the vendor's string:
>>> b1 = 'abc'.encode('utf-16')
But examining the result, I see that there's a leading [0xff, 0xfe] on the bytearray:
>>> [hex(b) for b in b1]
['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
Since the vendor's device is not expecting the [0xff, 0xfe], I can strip it off...
>>> b2 = 'abc'.encode('utf-16')[2:]
>>> [hex(b) for b in b2]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']
... which is what I want.
But what really surprises me is that I can decode b1 and b2 and they both reconstitute to the original string:
>>> b1.decode('utf-16') == b2.decode('utf-16')
True
So my two intertwined questions:
What is the significance of the 0xff, 0xfe on the head of the encoded bytes?
Is there any hazard in stripping off the 0xff, 0xfe prefix, as with b2 above?
This observation
... what really surprises me is that I can decode b1 and b2 and they both reconstitute to the original string:
b1.decode('utf-16') == b2.decode('utf-16')
True
suggests there is a built-in default, because there are two possible arrangements for the 16-bit wide UTF-16 codes: Big and Little Endian.
Normally, Python deduces the endianness to use from the BOM when reading – and so it always adds one when writing. If you want to force a specific endianness, you can use the explicit encodings utf-16-le and utf-16-be:
… when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
(https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data)
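A short sketch of the difference the quote describes (the BOM survives when the byte order is given explicitly):
data = b'\xff\xfe\x61\x00'             # BOM followed by 'a', little-endian
print(repr(data.decode('utf-16')))     # 'a': the BOM is detected and dropped
print(repr(data.decode('utf-16-le')))  # '\ufeffa': the BOM is kept as a ZWNBSP character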
But if you do not use a specific ordering, then what default gets used? The original Unicode proposal, PEP 100, warns
Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.
(https://www.python.org/dev/peps/pep-0100/, my emph.)
Yet it works for you. If we look up in the Python source code how this is managed, we find this comment in _codecsmodule.c:
/* This version provides access to the byteorder parameter of the
builtin UTF-16 codecs as optional third argument. It defaults to 0
which means: use the native byte order and prepend the data with a
BOM mark.
*/
and deeper, in unicodeobject.c,
/* Check for BOM marks (U+FEFF) in the input and adjust current
byte order setting accordingly. In native mode, the leading BOM
mark is skipped, in all other modes, it is copied to the output
stream as-is (giving a ZWNBSP character). */
So initially the byte order is set to your system's default, and when you start decoding UTF-16 data and a BOM is found, the byte order is set to whatever the BOM specifies. The "native order" in this last comment covers the case where no byte order has been explicitly declared and none has been established by a BOM; when neither is available, your system's endianness is used.
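A small sketch of that behaviour (the result of the last line depends on your machine's native byte order):
# BOM + 'a' encoded both ways; plain 'utf-16' detects the order from the BOM and drops it
print(b'\xff\xfe\x61\x00'.decode('utf-16'))   # 'a' (little-endian data)
print(b'\xfe\xff\x00\x61'.decode('utf-16'))   # 'a' (big-endian data)
# No BOM at all: 'utf-16' falls back to the machine's native byte order
print(b'\x61\x00'.decode('utf-16'))           # 'a' on a little-endian machine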
This is the byte order mark. It's a prefix to a UTF document that indicates what endianness the document uses. It does this by encoding the code point 0xFEFF in the byte order - in this case, little endian (less significant byte first). Anything trying to read it the other way around, in big endian (more significant byte first), will read the first character as 0xFFFE, which is a code point that is specifically not a valid character, informing the reader it needs to error or switch endianness for the rest of the document.
It is the byte order mark, a.k.a. BOM: see https://en.wikipedia.org/wiki/UTF-16 (look at the subheading "Byte order encoding schemes").
Its purpose is to allow the decoder to detect whether the encoding is little-endian or big-endian.
It is the Unicode byte order mark encoded in UTF-16. Its purpose is to communicate the byte order to a reader expecting text encoded with a Unicode character encoding.
You can omit it if the reader otherwise knows or comes to know the byte order.
'abc'.encode('utf-16-le')
The answers, and especially the comment from usr2564301, are helpful: the 0xff 0xfe prefix is the byte order mark (BOM), and it carries the endianness information along with the byte string. If you know which endianness you want, you can specify utf-16-le or utf-16-be as part of the encoding.
This makes it clear:
>>> 'abc'.encode('utf-16').hex()
'fffe610062006300'
>>> 'abc'.encode('utf-16-le').hex()
'610062006300'
>>> 'abc'.encode('utf-16-be').hex()
'006100620063'
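And on the decoding side, stripping the BOM is safe as long as the reader is told the byte order explicitly; a sketch (the last line only works because a missing BOM defaults to the native, here little-endian, order):
b2 = 'abc'.encode('utf-16-le')    # no BOM written
print(b2.decode('utf-16-le'))     # 'abc': byte order stated explicitly, always unambiguous
print(b2.decode('utf-16'))        # 'abc' here, but only because the native order happens to match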

What exactly does "encoding-independent" mean?

While reading the Strings and Characters chapter of the official Swift documentation, I found the following sentence:
"Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations"
Question: What exactly does "encoding-independent" mean?
From my reading of Advanced Swift by Chris and other experience, what this sentence is trying to convey is twofold.
First, what are the various Unicode representations:
UTF-8: compatible with ASCII
UTF-16
UTF-32
The number on the right-hand side is the size, in bits, of one code unit in that representation.
A UTF-8 code unit is 8 bits, while a UTF-32 code unit is 32 bits.
However, a Chinese character that fits in one UTF-32 code unit might not fit in one UTF-16 code unit. If the character does not fit in 16 bits, then in UTF-8 it will have a count of 4.
Then comes the storing part. When you store a character in a String, it doesn't matter how you want to read it back later.
For example:
Every string is composed of encoding-independent Unicode characters, and provide support for accessing those characters in various Unicode representations
This means you can compose a String any way you like, and that won't affect how it is presented when read through the various Unicode encoding forms such as UTF-8, UTF-16, or UTF-32.
This can be seen with a concrete example (see the sketch below): when I load a Japanese character that takes 24 bits (three bytes) to store in UTF-8, the same character is displayed regardless of my choice of encoding.
However, the count value will differ. There are other concepts, such as code units and code points, that make up these strings.
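The same effect can be sketched outside Swift; here is a small Python illustration (the character is just an example, and Swift's utf8, utf16, and unicodeScalars views report the analogous counts):
ch = '\u3042'                           # Hiragana 'あ', one abstract character
print(len(ch.encode('utf-8')))          # 3  -> three 8-bit code units
print(len(ch.encode('utf-16-le')) // 2) # 1  -> one 16-bit code unit
print(len(ch.encode('utf-32-le')) // 4) # 1  -> one 32-bit code unit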
For more on the Unicode encoding variants, I would highly recommend reading this article, which goes much deeper into the String API in Swift:
Detail View of String API in swift

File in Base64 string occupies more space than original file

I have this kind of... problem... I'm adding resources to my program by encoding files (images, videos, and audio) as Base64 and adding them to a String. What I do is read the file, convert the bytes to a Base64 string, and write it to a txt file, but the txt file occupies slightly MORE space than the original file. The same happens when I add the string to my program code: the compiled executable occupies a lot of space. For example:
An MP3 file occupies 2.3 MB
The Base64 string in a txt file occupies 3.19 MB
Is there any solution or way to optimize the space taken by the Base64 string?
P.S. This is just something I'm trying to do for fun. Do not comment below asking "WHY" or for what reason I want this. The answer is: just for fun.
That's inherent to Base64.
Base64 uses 4 octets to encode 3 octets, because it is a reasonably efficient way of encoding arbitrary binary data using only bytes that correspond to printable ASCII characters, while also avoiding many characters that are special in many contexts. It is more compact than, say, hexadecimal strings (2 octets to encode each octet), but always larger than raw binary. Its value lies only in contexts where raw binary won't work, so the extra size is worth it.
(Strictly, it's 4 characters to encode 3 octets, so if that text were then encoded in UTF-16 or UTF-32 it could be 8 or 16 octets per 3 octets encoded.)
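A minimal sketch of that overhead (the payload here is arbitrary; any binary data shows the same ratio):
import base64

raw = bytes(range(256)) * 100     # 25,600 bytes of arbitrary binary data
encoded = base64.b64encode(raw)   # 4 output bytes for every 3 input bytes (plus padding)
print(len(raw), len(encoded))     # 25600 34136 -> roughly a 4/3 expansion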

What is the difference between a unicode and binary string?

I am using Python 3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concepts of Unicode and binary strings are confusing to me. How can I change b'\\u4f60' into b'\x4f\x60'?
First - there is no difference between unicode literals and string literals in python 3. They are one and the same - you can drop the u up front. Just write strings. So instantly you should see that the literal u'4f60' is just like writing actual '4f60'.
A bytes literal - aka b'some literal' - is a series of bytes. Bytes in the printable ASCII range (32 through 126) are displayed as their corresponding glyph; the rest are displayed as the \x escaped version. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
A string literal is a string literal. It can contain unicode codepoints. There is far too much to cover to explain how unicode works here, but basically a codepoint represents a glyph (essentially, a character - a graphical representation of a letter/digit), it does not specify how the machine needs to represent it. In fact there are a great many different ways.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
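As a tiny illustration of that mapping (the character here is the one from the question):
s = '\u4f60'                   # str: the abstract character 你
b = s.encode('utf-8')          # bytes: one concrete machine representation, b'\xe4\xbd\xa0'
print(b.decode('utf-8') == s)  # True: decoding maps the bytes back to the str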
How can I change b'\\u4f60' into b'\x4f\x60'?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60' # note: the unicode-escaped form; \u is not processed in a bytes literal
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph, while \x4f\x60 is, if we represent it in ASCII (or UTF-8, actually), the letter O (\x4f) followed by a backtick (\x60).
I can ask python to turn that unicode-escaped bytes sequence into a valid string with the according unicode glyph:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can I change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that Unicode code point. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60' - utf-16-be.
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: b'O`'
The reason this works is that UTF-16 is a variable-length encoding. For code points that fit in 16 bits it uses the code point value directly as the 2-byte encoding, and for code points above U+FFFF it uses something called "surrogate pairs", which I won't get into.
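For completeness, a quick sketch of the case above U+FFFF (the emoji is just an example):
s = '\U0001F600'                    # U+1F600, a code point above U+FFFF
print(s.encode('utf-16-be').hex())  # 'd83dde00': a surrogate pair, two 16-bit code units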

Resources