File in Base64 string occupies more space than original file

I have this kind of... problem... I'm adding resources (images, videos and audio) to my program by encoding the files as Base64 and storing them in a String. What I do is read the file, convert the bytes to a Base64 string and write it to a txt file, but the txt file occupies MORE space than the original file. The same happens when I add the string to my program code: the compiled executable occupies a lot of space. Ex:
An MP3 file occupies 2.3 MB
The Base64 string in a txt file occupies 3.19 MB
Is there any solution, or any way to optimize the space a Base64 string takes up?
P.S. This is just something I'm trying to do for fun. Do not ask "WHY" or "FOR WHAT" I want this in the comments. The answer is: just for fun.

That's inherent to Base64.
Base64 uses 4 octets to encode 3 octets, because it's a reasonably efficient way of encoding arbitrary binary data using only bytes that mean something printable in ASCII, while also avoiding many characters that are special in many contexts. It's more compact than, say, hexadecimal strings (2 octets to encode each octet), but always larger than raw binary. Its value lies only in contexts where raw binary won't work, so the extra size is worth it.
(Strictly, it's 4 characters to encode 3 octets, so if that text were then encoded in UTF-16 or UTF-32, it could take 8 or 16 octets per 3 octets encoded.)
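As a quick illustration (my addition, not part of the original answer), here is the 4:3 growth measured directly in Python; line breaks, if you add them, push the ratio slightly higher:

import base64
import os

raw = os.urandom(3 * 1024 * 1024)   # 3 MiB of arbitrary binary data
encoded = base64.b64encode(raw)     # 4 output bytes for every 3 input bytes

print(len(raw), len(encoded))       # 3145728 4194304
print(len(encoded) / len(raw))      # 1.333...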

Related

Python 3 str to bytes conversion problem

In Python 3.8.5 I try to convert some bytes to a string and then the string back to bytes:
>>> a=chr(128)
>>> a
'\x80'
>>> type(a)
<class 'str'>
But when I try to convert back:
>>> a.encode()
b'\xc2\x80'
What is the \xc2 byte? Why does it appear?
Thanks for any response!
This is UTF-8 encoding at work; that's where the \xc2 comes from.
In a Python string, \x80 means Unicode code point #128 (Padding Character). When we encode that code point in UTF-8, it takes two bytes.
The original ASCII encoding only had 128 different characters, but there are many thousands of Unicode code points, and a single byte can only represent 256 different values. A lot of computing is based on ASCII, and we'd like that stuff to keep working, but we need non-English-speakers to be able to use computers too, so we need to be able to represent their characters.
The answer is UTF-8, a scheme that encodes the first 128 Unicode code points (0-127, the ASCII characters) as a single byte, so text that only uses those characters is completely compatible with ASCII. The next 1920 code points, containing the most common non-English characters (U+0080 up to U+07FF), are encoded as two bytes each.
So, in exchange for being slightly less efficient with some characters that could fit in a one-byte encoding (such as \x80), we gain the ability to represent every character from every written language.
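To see those boundaries directly, here is a small sketch (my addition, not part of the original answer) that prints the UTF-8 encoded length of code points around the 1-byte/2-byte/3-byte cutoffs:

# UTF-8 encoded length of code points at the range boundaries
for cp in (0x7F, 0x80, 0x7FF, 0x800):
    print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} byte(s)")
# U+007F -> 1 byte(s)
# U+0080 -> 2 byte(s)
# U+07FF -> 2 byte(s)
# U+0800 -> 3 byte(s)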
For example, if you want to get rid of the \xc2, try encoding your string as Latin-1:
a = chr(128)
print(repr(a))
# '\x80'
print(a.encode())
# b'\xc2\x80'
print(a.encode('latin-1'))
# b'\x80'

How to know the encoding of this file?

I thought this was Base64 encoding, so I tried to decode it that way, but it seems this is not Base64. I want to decode this:
O7hrHYO5UUFHFPVILQPc6A==:hEnb3PVrxgHbEL1VT+cu8ic4ocIOfoaWkJ2b2MCrVy4=:jXB0R2OctZ6i1K3s2DlLNS5D/PSdhzKM7GX7gVh6AvXbWrA5i/4j3maFlgk1X2BpmOXYoZab2hAJS4lCBtWi6WnE3zDLhBvWJWFyAN93fIvS66PXJiINmaEhKi8mBIjc
I am learning about reverse engineering and I got this file from a simple quiz app (Android). In its database file it has questions with strings encoded like the one above; I put the first one here. There are many more questions like this.
The colon character : cannot appear in base64 output, and also = can only appear at the end of base64 output, so this string seems to be composed of 3 parts, each individually encoded in base64:
O7hrHYO5UUFHFPVILQPc6A==
hEnb3PVrxgHbEL1VT+cu8ic4ocIOfoaWkJ2b2MCrVy4=
jXB0R2OctZ6i1K3s2DlLNS5D/PSdhzKM7GX7gVh6AvXbWrA5i/4j3maFlgk1X2BpmOXYoZab2hAJS4lCBtWi6WnE3zDLhBvWJWFyAN93fIvS66PXJiINmaEhKi8mBIjc
These don't decode to anything meaningful, so my guess is that some encryption scheme has been applied. After decoding, the lengths of the parts are all multiples of 16 bytes, which hints at a block cipher with 16-byte (128-bit) blocks.
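A short sketch (my addition) reproducing that length check; the three parts are taken verbatim from the question:

import base64

parts = (
    "O7hrHYO5UUFHFPVILQPc6A==",
    "hEnb3PVrxgHbEL1VT+cu8ic4ocIOfoaWkJ2b2MCrVy4=",
    "jXB0R2OctZ6i1K3s2DlLNS5D/PSdhzKM7GX7gVh6AvXbWrA5i/4j3maFlgk1X2Bp"
    "mOXYoZab2hAJS4lCBtWi6WnE3zDLhBvWJWFyAN93fIvS66PXJiINmaEhKi8mBIjc",
)
for p in parts:
    raw = base64.b64decode(p)
    print(len(raw), len(raw) % 16)
# 16 0
# 32 0
# 96 0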

How to save bytes to a file in binary mode

I have a bytes-like object, something like:
aa = b'abc\u6df7\u5408def.mp3'
I want to save it to a file in binary mode. The code is below, but it does not work well:
if __name__ == "__main__":
    aa = b'abc\u6df7\u5408def.mp3'
    print(aa.decode('unicode-escape'))
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)
The data in the file comes out as the literal text abc\u6df7\u5408def.mp3 (the escape sequences are written out verbatim), but the right format I want is the Unicode characters themselves encoded as bytes in the binary data.
How can I convert the bytes to save them in the file?
\uNNNN escapes do not make sense in byte strings, because they do not specify a sequence of bytes. Unicode code points are conceptually abstract representations of characters, and do not straightforwardly map to a serialization format (consisting of bytes or, in principle, any other sort of concrete symbolic representation).
There are well-defined serialization formats for Unicode; these are known as "encodings". You seem to be looking for the UTF-16 big-endian encoding of these characters.
aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
With that out of the way, I believe the rest of your code should work as expected.
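Put together, a corrected version of the question's script would look like this (a sketch; the file name is taken from the question):

if __name__ == "__main__":
    # Encode the text as UTF-16 big-endian bytes, instead of putting \u escapes
    # inside a bytes literal (where they have no special meaning)
    aa = 'abc\u6df7\u5408def.mp3'.encode('utf-16-be')
    with open('database.bin', "wb") as datafile:
        datafile.write(aa)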
Unicode on disk is always encoded but you obviously have to know the encoding in order to read it correctly. An optional byte-order mark (BOM) is sometimes written to the beginning of serialized Unicode text files to help the reader discover the encoding; this is a single non-printing character whose sole purpose is to help disambiguate the encoding, and in particular its byte order (big-endian vs little-endian).
However, many places are standardizing on UTF-8 which doesn't require a BOM. The encoding itself is byte-oriented, so it is immune to byte order issues. Perhaps see also https://utf8everywhere.org/
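As an illustration (my addition), here is how the encodings mentioned above differ at the byte level; note that Python's plain 'utf-16' codec writes a BOM, while the explicit '-be' variant does not:

text = 'abc\u6df7\u5408def.mp3'

print(text.encode('utf-16-be')[:6])   # b'\x00a\x00b\x00c' - big-endian, no BOM
print(text.encode('utf-16')[:2])      # b'\xff\xfe' - BOM first (little-endian here)
print(text.encode('utf-8')[:3])       # b'abc' - byte-oriented, no BOM needed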

Proper encoding for fixed-length storage of Unicode strings?

I'm going to be working on software (in C#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects to store each string as a fixed-length byte array in one binary file so he can quickly access any string using an index (index * length = starting offset, then read the fixed-length number of bytes). I understand that .NET internally uses UTF-16 encoding, which I believe is technically a variable-length encoding (depending on the Unicode code point). I'm fairly certain that English, German and Spanish would all use two bytes per character when encoded using UTF-16, but I'm not so sure about Arabic. It looks like there might be some Arabic characters that could possibly require three bytes each in UTF-16, and that would seem to break the firmware developer's plan to store the strings at a fixed length.
First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!
Unicode terminology:
Each entry in the Unicode character set is a code point
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8 bit code units; UTF-16 uses 16 bit code units)
The user-visible grapheme might consist of a sequence of code points
So:
A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be less than the number of code points
So, if you want to support the whole Unicode range, you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding (I'm assuming unused bytes will be set to 0x0, and that these will be appended or trimmed during I/O).
In terms of communicating length restrictions via a user interface, you'll probably want to decide on some compromise based on a code unit size and the typical customer, rather than try to find the width of the most complicated grapheme you can build.
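To make the variable-width behavior concrete, here is an illustrative sketch (my addition; in Python rather than C#, since the widths are a property of the encodings, not of the language):

# Encoded width in bytes of sample characters under each UTF
samples = ['A', 'ß', 'ع', '€', '😀']   # ASCII, German, Arabic, symbol, emoji
for ch in samples:
    print(f"U+{ord(ch):06X}",
          len(ch.encode('utf-8')),
          len(ch.encode('utf-16-be')),
          len(ch.encode('utf-32-be')))
# Only UTF-32 is a fixed 4 bytes per code point. Arabic letters such as
# ع sit in the Basic Multilingual Plane, so they take 2 bytes in UTF-16
# (UTF-16 code units are 2 or 4 bytes, never 3).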

Compress bytes into a readable string (no null or end-of-line chars)

I'm searching for the most appropriate encoding or method to compress bytes into characters that can be read with a ReadLine-like command that only recognizes readable chars and terminates on an end-of-line char. There is probably a common practice to achieve this, but I don't know a lot about encodings.
Currently, I'm outputting bytes as a string of hex, so I need 2 bytes to represent 1 byte. It works well, but it is slow. Ex: a byte with the value 255 is represented as 'FF'.
I'm sure it could be 3 or 4 times smaller, though there's a limit since I'm outputting MP3 data, but I don't know how. Should I just ZIP my string, or would there be too much overhead on it?
Will ASCII85 contain random null bytes and end-of-line chars, or am I safe with it?
Don't zip MP3 files; that will not gain much (or anything at all).
I'm a bit disappointed that you did not read up on Ascii85 before asking, as I think the Wikipedia article explains fairly clearly that it uses only printable ASCII characters; so, no line endings or null bytes. It is efficient, and the conversion is also fairly simple and quick: split your data into 4-byte ints; you then convert each of these into just five Ascii85 digits by repeatedly dividing the int value by 85 and taking the ASCII value of the remainder + 33.
You can also consider using Base64 or UUEncode. These are fairly popular (e.g. used in email attachments), so you will find many libraries that produce them. But they are less efficient.
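A minimal sketch of the group conversion described above (my addition; it ignores Ascii85's 'z' shortcut for all-zero groups and the padding rules for partial groups, which the stdlib's base64.a85encode handles for you):

import base64
import struct

def a85_group(four_bytes: bytes) -> str:
    # Encode one 4-byte group as five Ascii85 digits (offset 33, '!')
    n = struct.unpack('>I', four_bytes)[0]   # the group as a 32-bit big-endian int
    digits = []
    for _ in range(5):
        n, rem = divmod(n, 85)               # least-significant digit first
        digits.append(chr(rem + 33))
    return ''.join(reversed(digits))

print(a85_group(b'\xff\xff\xff\xff'))          # s8W-!
print(base64.a85encode(b'\xff\xff\xff\xff'))   # b's8W-!' - the stdlib agrees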
