Is an empty string valid base64 encoded data of zero bytes length? - base64

One of my colleagues was telling me that the empty string is not a valid base64-encoded data string. I don't think this is true (he is too lazy to parse it), but after googling around a bit and even checking the RFC, I have not found any documentation that explicitly states how to properly encode a blob of zero bytes in base64.
So, the question is: Do you have a link to some official documentation that explicitly states how zero bytes should be encoded in base64?

According to RFC 4648 Section 10, Test Vectors,
BASE64("") = ""
I would assume the inverse must hold as well.

My thought on this is that there are two possible base64 values that an empty string could produce; either an empty string, or a string that consists entirely of pad characters ('==='). Any other valid base64 string contains information. With the second case, we can apply the following rule from the RFC:
If more than the allowed number of pad characters are found at the end
of the string, e.g., a base 64 string terminated with "===", the
excess pad characters could be ignored.
As they can be ignored, they can be dropped from the resultant encoded string without consequence, once again leaving us with an empty string as the base64 representation of an empty string.
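As a quick sanity check of the test vector quoted above, here is a minimal sketch using Python's standard `base64` module. It only shows how one implementation behaves; the RFC remains the authority.

```python
import base64

# RFC 4648, Section 10: BASE64("") = ""
encoded_empty = base64.b64encode(b"")
decoded_empty = base64.b64decode(b"")

print(encoded_empty)  # b''
print(decoded_empty)  # b''

# The next test vectors from the same section, for comparison.
print(base64.b64encode(b"f"))    # b'Zg=='
print(base64.b64encode(b"fo"))   # b'Zm8='
print(base64.b64encode(b"foo"))  # b'Zm9v'
```

Both directions yield an empty value, which matches the assumption that the inverse holds as well.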

Related

How to remove the b or bytes object prefix after packing hex in little endian?

I currently pack a hex number in little endian with struct.pack() or p32() from pwnlib, and I always get a bytes object as output.
b'\xde\xad\xbe\xef'
I tried str.decode('utf-8'), but in some cases it raises an error.
Is there a way to decode this?
I'm using Python 3 and pwntools 4.3.
The byte object prefix is important, it is wrong to delete it.
It is just Python's representation of the object. If you need to write a C string containing the same bytes, you should write a function for it, or encode it using hex escapes or octal escapes.
In Python 3, bytes is not text. It is a sequence of octets, an array of bytes.
Text is a sequence of Unicode code points, like the unicode type in Python 2.
'\xaa' is just a shorthand for '\u00aa', and the shorthand is only creating confusion, so avoid it if possible. Use bytes objects where you mean binary data and unicode string text objects where you mean text.
See https://github.com/Gallopsled/pwntools-tutorial/blob/master/bytes.md
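A minimal sketch of the options mentioned above, using only the standard library (`struct` stands in for `p32()` here, on the assumption that both produce the same four bytes). The variable names are illustrative.

```python
import struct

# Little-endian unsigned 32-bit integer.
packed = struct.pack("<I", 0xDEADBEEF)
print(packed)  # b'\xef\xbe\xad\xde'

# Plain hex digits, with no b prefix and no escapes.
hex_text = packed.hex()

# C-style hex escapes as ordinary text, e.g. for pasting into C source.
c_escaped = "".join(f"\\x{byte:02x}" for byte in packed)

# One code point per byte; this never fails, unlike decode('utf-8').
as_text = packed.decode("latin-1")

print(hex_text)   # efbeadde
print(c_escaped)  # \xef\xbe\xad\xde
```

Which form is appropriate depends on where the value goes next: keep the bytes object for sockets and files opened in binary mode, and convert to text only for display.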

python translate bytecode to utf-8 using a variable

I have the following problem:
From a SQL Server database I am reading data using python module pypyodbc and ODBC Driver 13 for SQL Server and writing to txt files.
Database contains all kinds of special characters and they read as:
'PR\xc3\x86KVAL'
The '\xc3\x86' part is bytecode and should be interpreted that way. The other characters should be interpreted as shown. UTF8 would translate '\xc3\x86' to Æ.
If I type the value in b'PR\xc3\x86KVAL' , python recognizes it as bytecode and I can translate it to PRÆKVAL. See below:
s = b'PR\xc3\x86KVAL'
print(s)
bb = s.decode('utf-8')
print(bb)
The problem is that I don't know how I can turn 'PR\xc3\x86KVAL' to be recognized as a bytecode object.
I want the value that has to be decoded to be a variable so that all data from the database can flow through it.
I also tried ast.literal_eval(r"b'PR\xc3\x86KVAL'"), but variables won't work in this way.
Since you start out with PR\xc3\x86KVAL as a text string and decode indeed expects a raw byte sequence, you need to convert the text string into a bytes object. But when converting from one "encoding" value to another, Python needs to know what encoding it is starting with!
The easiest way to do so is to explicitly encode the string, using an encoding that does not change the special characters. You must be careful, because it is quite possible that a character code gets translated to something else, destroying its meaning.
You can see that with a simple example: attempting to tell Python this should be plain ASCII fails, for an obvious reason.
>>> s = 'PR\xc3\x86KVAL'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
Even though there are more than 1,000 questions on Stack Overflow about this, the reason for the failure should be easy to understand. All an encoder/decoder pair does is translate each character from 'source' to 'destination'. This can only work if the character in question actually exists in both the 'source' and 'destination' encodings. Suppose you want to translate a Greek character β to a Russian б, then the source must be able to decode the Greek character (because that is what you entered it in) and the destination must be able to encode the Russian character.
So you must be careful to choose an encoding which does not change the character \x86 in your input string into Ж (which it would do when using cp866, for example).
Fortunately, as quoted from https://stackoverflow.com/a/2617930/2564301, there is an encoding that does not mess up things:
Pass data.decode('latin1') to the codec. latin1 maps bytes 0-255 to Unicode characters 0-255, which is kinda elegant.
and so this should work:
>>> s = 'PR\xc3\x86KVAL'.encode('latin1')
>>> print(s)
b'PR\xc3\x86KVAL'
Now s is a properly encoded byte object, so you can decode it at will:
>>> bb = s.decode('utf-8')
>>> print(bb)
PRÆKVAL
Done!
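Since the question asks for something that works on a variable, the two steps can be wrapped in a small helper. This is a minimal sketch; the function name `repair_utf8_mojibake` is made up for illustration, and it assumes every value really is UTF-8 that was misread one byte per character.

```python
def repair_utf8_mojibake(text: str) -> str:
    """Re-interpret a str whose code points are really UTF-8 byte values."""
    # latin-1 maps code points 0-255 straight back to bytes 0-255.
    raw = text.encode("latin-1")
    # Now decode those bytes with the encoding they were actually written in.
    return raw.decode("utf-8")


value = "PR\xc3\x86KVAL"  # e.g. a value read from the database
print(repair_utf8_mojibake(value))  # PRÆKVAL
```

The helper raises UnicodeEncodeError if a value contains a character above U+00FF, and UnicodeDecodeError if the resulting bytes are not valid UTF-8, so values that are already correct text may need to be skipped.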

In Python 3, how can I convert ascii to string, *without encoding/decoding*

Python 3.6
I converted a string from utf8 to this:
b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au'
I now want that chunk of ascii back into string form, so there is no longer the little b for bytes at the beginning.
BUT I don't want it converted back to UTF8, I want that same sequence of characters that you see above in my Python string.
How can I do so? All I can find are ways of converting bytes to string along with encoding or decoding.
The (wrong) answer is quite simple:
chr(asciiCode)
In your special case:
myString = ""
for char in b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au':
    myString += chr(char)
print(myString)
gives:
ææ²¡æçµ#xn--ssdcsrs-2e1xt16k.com.au
Maybe you are also interested in the right answer? It will probably not please you, because it says you have ALWAYS to deal with encoding/decoding ... because myString is now both UTF-8 and ASCII at the same time (exactly as it already was before you have "converted" it to ASCII).
Notice that how myString shows up when you print it will depend on the implicit encoding/decoding used by print.
In other words ...
there is NO WAY to avoid encoding/decoding
but there is a way of doing it in a non-explicit way.
I suppose that reading my answer provided HERE: Converting UTF-8 (in literal) to Umlaute will help you much in understanding the whole encoding/decoding thing.
What you have there is not ASCII, as it contains for instance the byte \xe6, which is higher than 127. It's still UTF8.
The representation of the string (with the 'b' at the start, then a ', then a '\', ...), that is ASCII. You get it with repr(yourstring). But the contents of the string that you're printing is UTF8.
I don't think you need to turn that back into a UTF-8 string, but that may depend on the rest of your code.
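To make the distinction concrete, here is a minimal sketch of two ways to get a str from those bytes. Both are still decode calls, which is the point made above. The first keeps the escape sequences as literal text (the `backslashreplace` error handler is available for decoding since Python 3.5); the second recovers the original characters.

```python
data = b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au'

# Keep what you see on screen: non-ASCII bytes become literal "\xNN" text.
escaped = data.decode("ascii", errors="backslashreplace")
print(escaped)

# Interpret the bytes as what they are: UTF-8.
text = data.decode("utf-8")
print(text)
```

`escaped` is a plain ASCII str containing backslashes, while `text` holds the four Chinese characters followed by the rest of the address.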

Encode a String, given a decoder

Given the following decoder, write the encoder. (The encoder should be written to compress whenever possible):
p14a8xkpq -> p14akkkkkkkkpq
(8xk gets decoded to kkkkkkkk. The only other requirement is that encodings be unambiguous)
Note that the String can have any possible ascii character
My approach would be to find sequences of repeating characters and replace them. For example, kkkkkkkk will be replaced by 8xk. However, the problem with this solution is that it is ambiguous: "8xk" may appear in the uncompressed string itself. I was thinking of using some special character to distinguish it, but the string can contain any possible character, so that does not really help.
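One possible way around the ambiguity, sketched here under stated assumptions rather than given as the expected answer: assume the decoder treats "digits, then x, then one character" as a run and everything else as a literal, and have the encoder write every digit of the input as a run (for example a literal 8 becomes 1x8). Digits then appear in the encoded text only as counts or as the character after x. The names `encode` and `decode` and the decoder's exact parsing rule are assumptions, not part of the original problem statement.

```python
from itertools import groupby

DIGITS = "0123456789"


def decode(encoded: str) -> str:
    """Assumed reference decoder: <digits>x<char> expands to a run."""
    out = []
    i, n = 0, len(encoded)
    while i < n:
        j = i
        while j < n and encoded[j] in DIGITS:
            j += 1
        if j > i and j + 1 < n and encoded[j] == "x":
            out.append(encoded[j + 1] * int(encoded[i:j]))
            i = j + 2
        else:
            out.append(encoded[i])  # literal character
            i += 1
    return "".join(out)


def encode(text: str) -> str:
    """Run-length encode; digits are always written as runs to stay unambiguous."""
    out = []
    for char, group in groupby(text):
        count = len(list(group))
        token = f"{count}x{char}"
        if char in DIGITS or len(token) < count:
            out.append(token)         # escape digits, compress long runs
        else:
            out.append(char * count)  # short non-digit runs stay literal
    return "".join(out)


print(decode("p14a8xkpq"))  # p14akkkkkkkkpq
print(encode("8xk"))        # 1x8xk
```

The trade-off is that input containing many digits grows instead of shrinking, and the encoder's output for the sample string is p1x11x4a8xkpq rather than p14a8xkpq, although both decode to the same text.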

Converting bits to string (data)

I have file which contains some data (text copied and pasted from the "What You Will Learn" portion of this PDF). Firstly, I have converted the contents in the file to bits successfully. However, when I try to convert it back to the original format, some of the characters are not correctly converted, as shown below:
Cisco has
developed the Cisco Open Network Environment (ONE)
architecture as a multifaceted approach to network
programmability delivered across three pillars:
??)É¥ Í?н??ÁÁ±¥?Ñ¥½¸ÁɽÉ?µµ¥¹?¥¹Ñ?É???Ì?¡A%̤?)?áÁ½Í??¥É?ѱ佸Íݥѡ?Ì?¹É½ÕÑ?ÉÌѼ?Õµ?¹Ð?)?á¥ÍÑ¥¹?=Á?¹±½ÜÍÁ?¥?¥?Ñ¥½¹Ì* ¤&öGV7F?öâ×&VG?÷VäfÆ÷r6öçG&öÆÆW"æB÷VäfÆ÷r ¦vVçG0¨?HÝZ]HÙ??ÙXÝÈÈ[]?\??\X[Ý?\?^\Ë?\X[?Ù\?XÙ\Ë[??\ÛÝ\?ÙHÜ?Ú\Ý?][Û?Ø\X?[]Y\È[?H?]HÙ[
As you can see here some characters are converted successfully, others are not.
My code is below:
file = open("test.txt",'r')
myfile = ''.join(map(str,file))
l = []
for i in myfile:
    asc11 = ord(i)
    b = "{0:08b}".format(asc11)
    l.extend(int(y) for y in b)
string_bin = ''.join(map(str,l))
mydata = ''.join(chr(int(string_bin[i:i+8], 2)) for i in range(0,len(string_bin), 8))
print(mydata)
What is wrong with my code? What do I need to change to make it work properly?
What's Going On?
You are running into an encoding issue because some characters in the PDF are non-ASCII characters. For example, the bullet points are U+2022 which require 3 bytes of storage.
When Python reads from your file, it doesn't know what encoding you used to write that data. Thus it reads bytes from the file and uses a character encoding to translate them into strs, which are stored using Python's own internal Unicode format. (This differs from Python 2, where open() returned raw bytes stored in a str which you could then manually decode to unicode.)
Thus, in Python 3, open() accepts a named encoding parameter, for example open("test.txt", 'r', encoding='ascii'). Because you don't specify the encoding when you call open(), you end up using your system's default encoding. For instance, on my laptop, the default encoding is CP1252 (Windows-1252, a close relative of Latin-1). Yours may differ.
Whatever encoding Python uses to interpret your file, it then internally uses its own Unicode format to store your string. This means that your string may internally use multi-byte characters even if the original encoding did not. For example, my laptop uses CP1252 to interpret the three UTF-8 bytes of U+2022 (E2 80 A2) as â€¢, which is internally stored as U+00E2, U+20AC and U+00A2; € is stored using a multi-byte character even though it was just one byte in the original file.
Let's assume your computer is sane and uses UTF-8 by default (this explanation is similar for many multi-byte characters). When you reach a bullet point, it is stored as U+2022. When you call ord('\u2022') the result is 8226. When you then call "{0:08b}".format(8226) this returns "10000000100010". That's a 14-character string. Your parsing code assumes all of the ordinals will generate 8-character strings. Because of this, the "binary" output becomes misaligned. This means that when you then parse the binary string in 8-character segments, it gets thrown off and starts interpreting things as control characters and all sorts of foreign-language characters.
If you call open(..., encoding='ascii'), Python will actually throw an exception because it reads non-valid ASCII characters.
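The misalignment described above can be reproduced without a file. This is a minimal sketch; the sample string is made up for illustration.

```python
bullet = "\u2022"
code_point = ord(bullet)
bits = "{0:08b}".format(code_point)
print(code_point, bits, len(bits))  # 8226 10000000100010 14

# One bullet between two ASCII letters: 8 + 14 + 8 = 30 bits,
# which no longer splits evenly into 8-bit chunks.
sample = "a\u2022b"
bit_string = "".join("{0:08b}".format(ord(ch)) for ch in sample)
print(len(bit_string), len(bit_string) % 8)  # 30 6

# Decoding the bullet's UTF-8 bytes as ASCII fails, as open(..., encoding='ascii') would.
try:
    bullet.encode("utf-8").decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)  # True
```

The "08" in the format string is only a minimum width, so any code point above 255 produces more than eight digits and shifts every chunk after it.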
Possible Solutions
I'm not sure why exactly you are converting the input string into the representation that you are using. It's not binary, as your question title would suggest. Rather, you've converted the data into a textual representation of its binary encoding.
Technically speaking, when you store encoded text to a file, it's stored using a binary representation. Python, and any text editor, has to decode those bytes into its internal character representation before it can display them as text. Thus, calling open("test.txt", "r", encoding="utf-8") reads the binary data out of your text file and converts it into Python's internal Unicode format. Similarly, calling myfile.encode('utf-8') will return the UTF-8 encoded bytes, which can then be written to a file, network socket, etc.
If, however, you do need to use a format similar to what you are currently using, first, I still recommend you specify an encoding when you call open() (I recommend UTF-8). Then you can consider these options:
Detect and omit non-ASCII characters. They will have an ordinal >= 128.
Mimic UTF-16 or UTF-32 and output multi-byte output for all characters. For example, use "{0:032b}".format(asc11) and then parse the result in 32-character chunks. It's memory and storage inefficient, but it will preserve multi-byte characters.
Regardless, I highly recommend reading the Dive Into Python 3 chapter about strings.
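A third possibility, building on the encode('utf-8') call mentioned above rather than on the two options listed: turn the text into UTF-8 bytes first, so that every unit is guaranteed to fit in 8 bits. This is a minimal sketch; the helper names `text_to_bits` and `bits_to_text` are made up for illustration.

```python
def text_to_bits(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of text as a string of '0' and '1'."""
    return "".join(f"{byte:08b}" for byte in text.encode(encoding))


def bits_to_text(bits: str, encoding: str = "utf-8") -> str:
    """Inverse of text_to_bits: parse 8-bit chunks back into bytes, then decode."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


original = "three pillars:\n\u2022 APIs"
bits = text_to_bits(original)
restored = bits_to_text(bits)
print(restored == original)  # True
```

Because each byte is always in the range 0 to 255, the fixed 8-character chunks stay aligned, and multi-byte characters such as the bullet survive the round trip.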

Resources