Is there a way to convert between text and 8-bit binary (extended ASCII) in Python?

So I'm working on a program and, as part of it, I need to convert text to binary and binary to text. However, the binary must be 8-bit, meaning 8 bits for every character.
I have looked on the internet and only found examples such as:
a = 'text'
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))
As I explained earlier, I need the binary in 8-bit/extended ASCII format but this and all other internet examples seem to convert to 7-bit/regular ASCII binary! I'm using Python 3.9.7 if it helps!
I have tried just adding a zero in front but it is then awkward to convert back to text!
Can somebody please help!
I tried:
a = 'text'
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))
It gave:
1110100 1100101 1111000 1110100
I expected:
01110100 01100101 01111000 01110100
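One possible approach, as a minimal sketch: the bytes are already 8 bits wide, and only the "{0:b}" format drops the leading zeros, so padding each value with the "08b" format spec gives fixed-width groups that are easy to parse back. This assumes "extended ASCII" means Latin-1 (ISO 8859-1), which may not be the code page you have in mind, and the helper names text_to_bits and bits_to_text are made up for illustration.

```python
def text_to_bits(text, encoding="latin-1"):
    """Return the text as space-separated 8-bit groups."""
    # "08b" pads every byte to exactly 8 binary digits.
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


def bits_to_text(bits, encoding="latin-1"):
    """Invert text_to_bits: parse each group as base 2, then decode."""
    return bytes(int(group, 2) for group in bits.split()).decode(encoding)


bits = text_to_bits("text")
print(bits)                # 01110100 01100101 01111000 01110100
print(bits_to_text(bits))  # text
```

Because every group has the same width, the conversion back to text needs no special handling for the leading zero.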

Related

Converting bytes data to hex with special character '\'

Given this example of bytes retrieved from a packet capture:
b'\x18\x05'
How can I hexlify it properly, considering the special character '\'?
When I hexlify it with Python I get b'1805', but when I manually remove the special character '\' (b'x18x05') I get the proper value b'783138783035'.
According to online hex encoders (for example, https://www.hexator.com/), the result for b'\x18\x05' is 62275c7831385c78303527.
Thank you in advance
b'\x18\x05' is the two bytes 0x18 and 0x05. That's the "proper value". b'' is just the default display notation for a Python bytes object. \xnn is an escape code representing the hexadecimal value of a single byte.
For display you can use:
data = b'\x18\x05'
print(data)
print(data.hex())
print(data.hex(sep=' '))
Output:
b'\x18\x05'
1805
18 05
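To make the difference concrete, here is a small sketch comparing the hex of the two byte values with the hex of the characters you type to write the literal. The second is presumably what an online text-to-hex encoder computes when the literal is pasted in as text; that is an assumption about the tool, not something checked against it.

```python
import binascii

data = b'\x18\x05'  # two bytes: 0x18 and 0x05

# Hex of the actual byte values; both spellings give the same digits.
print(binascii.hexlify(data))  # b'1805'
print(data.hex())              # 1805

# Hex of the literal as typed, with the b, quotes and backslashes included.
typed = repr(data).encode('ascii')
print(typed.hex())             # 62275c7831385c78303527

# Round trip from hex digits back to the original bytes.
assert bytes.fromhex('1805') == data
```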

How to Turn string into bytes?

I'm using Python 3 and I've got a string which is displayed as bytes:
strategyName=\xe7\x99\xbe\xe5\xba\xa6
I need to turn it into readable Chinese characters by decoding it:
orig=b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result=orig.decode('UTF-8')
print(result)
This shows the following, which is what I want:
strategyName=百度
But if I store it in a str instead, it behaves differently:
str0='strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte=str0.encode('UTF-8')
result_str=result_byte.decode('UTF-8')
print(result_str)
strategyName=ç¾åº¦
Please help me understand why this is happening and how I can fix it.
Thanks a lot
Your problem is that you are using a str literal to hold what are really the UTF-8 encoded bytes of your text. You should just use a bytes literal, but if the str form is unavoidable, the correct approach is to encode it as latin-1 (which maps every code point below 256 to the byte with the same value) to recover the UTF-8 encoded bytes, then decode those as UTF-8:
str0 = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'
result_byte = str0.encode('latin-1') # Only changed line
result_str = result_byte.decode('UTF-8')
print(result_str)
Of course, the other approach could be to just type the Unicode escapes you wanted in the first place instead of byte level escapes that correspond to a UTF-8 encoding:
result_str = 'strategyName=\u767e\u5ea6'
No rigmarole needed.
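As a quick check of why the two encodings behave differently here, the following sketch compares them on the string from the question. The variable names are illustrative only.

```python
mojibake = 'strategyName=\xe7\x99\xbe\xe5\xba\xa6'

# UTF-8 encodes each code point from 0x80 to 0xFF as two bytes,
# so encoding the str as UTF-8 changes the data.
assert '\xe7'.encode('utf-8') == b'\xc3\xa7'

# Latin-1 maps code point N to the single byte N, so the intended bytes come back.
raw = mojibake.encode('latin-1')
assert raw == b'strategyName=\xe7\x99\xbe\xe5\xba\xa6'

# Those bytes are valid UTF-8 for the two Chinese characters.
fixed = raw.decode('utf-8')
assert fixed == 'strategyName=\u767e\u5ea6'
print(fixed)
```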

Python bytes representation

I'm writing a hex viewer in Python for examining raw packet bytes. I use the dpkt module.
I assumed that a byte can have any value between 0x00 and 0xFF. However, I've noticed that Python's bytes representation looks different:
b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
I don't understand what these symbols mean. How can I translate them back to the original 1-byte values so they can be shown in a hex viewer?
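A \xhh escape denotes a single byte with the hex value hh, i.e. it is how a Python 3 literal writes 0xhh.
See https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
The b at the start of the literal indicates that the value is of type bytes rather than str. The above link also covers that. The \n is a newline character (byte 0x0a). The remaining symbols (+, V and $) are bytes whose values happen to be printable ASCII characters, so Python shows the character instead of an escape; for example, + is 0x2b.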
You can use bytearray to store and access the data. Here's an example using the byte string in your question.
example_bytes = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
encoded_array = bytearray(example_bytes)
print(encoded_array)
# Output: bytearray(b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1')
# Print the value of \x8a, which is 138 in decimal.
print(encoded_array[0])
# Output: 138
# Show the same value as hex.
print(hex(encoded_array[0]))
# Output: 0x8a
Hope this helps.
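For the hex viewer itself, a rough sketch along these lines may be enough. The function name hex_dump and the layout (offset, hex column, printable-ASCII column) are my own choices for illustration, not anything dpkt provides.

```python
def hex_dump(data, width=8):
    """Format bytes as lines of: offset, hex values, printable ASCII."""
    pad = width * 3 - 1  # width of a full hex column, spaces included
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Iterating over bytes yields ints in range(256).
        hex_part = ' '.join(f'{byte:02x}' for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append(f'{offset:04x}  {hex_part:<{pad}}  {text_part}')
    return '\n'.join(lines)


packet = b'\x8a\n\x1e+\x1f\x84V\xf2\xca$\xb1'
print(hex_dump(packet))
```

Every byte, whether Python displays it as \x8a, \n or +, comes out as exactly two hex digits.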

Reading-in a binary JPEG-Header (in Python)

I would like to read in a JPEG-Header and analyze it.
According to Wikipedia, the header consists of a sequence of markers. Each marker starts with FF xx, where xx is a specific marker ID.
So my idea was to simply read in the image in binary format and search for the corresponding byte combinations in the binary stream. This should enable me to split the header into the corresponding marker fields.
For instance, this is, what I receive, when I read in the first 20 bytes of an image:
binary_data = open('picture.jpg','rb').read(20)
print(binary_data)
b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'
My questions are now:
1) Why does Python not return nice chunks of 2 bytes (in hex format)?
I would expect something like this:
b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.
2) Why are there symbols like -, M, * in the returned string? Those are not characters I expect in the hex representation of a byte string (only 0-9 and a-f, I think).
Both observations hinder me in writing a simple parser.
So ultimately my question boils down to:
How do I properly read-in and parse a JPEG Header in Python?
You seem overly worried about how your binary data is represented on your console. Don't worry about that.
The default built-in string-based representation that print(..) applies to a bytes object is just "printable ASCII characters as such (with a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!
>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1
See how the entire object is printed 'as if' it's a string, but its individual elements are still perfectly normal bytes?
If you have a byte array and you don't like the appearance of this default, then you can write your own. But – for clarity – this still doesn't have anything to do with parsing a file.
>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
Why does python not return me nice chunks of 2 bytes (in hex-format)?
Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want chunks of two bytes, transform it after reading.
The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each pair of bytes, or use struct.unpack (there are several ways to do this):
>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']
I'm using the big-endian specifier > and unsigned short H in unpack, because (I assume) these are the conventional ways to represent JPEG 2-byte codes. Check the struct documentation if you want to deviate from this.
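To go from 2-byte words to actual marker fields, a simplified sketch could walk the segments as below. It assumes the common layout of 0xFF, a marker ID, then a big-endian 2-byte length that counts itself but not the marker. It does not handle fill bytes or markers without a length field, and iter_jpeg_segments is a name made up for this example.

```python
import struct


def iter_jpeg_segments(data):
    """Yield (marker_id, payload) for each segment after the SOI marker."""
    if data[:2] != b'\xff\xd8':
        raise ValueError('data does not start with the SOI marker FF D8')
    pos = 2
    while pos + 4 <= len(data):
        prefix, marker, length = struct.unpack('>BBH', data[pos:pos + 4])
        if prefix != 0xFF:
            break  # not positioned on a marker, so stop (simplification)
        # The length covers its own two bytes plus the payload.
        yield marker, data[pos + 4:pos + 2 + length]
        if marker == 0xDA:
            break  # SOS: compressed image data follows, not more headers
        pos += 2 + length


# The 20 bytes shown in the answer above.
header = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
segments = list(iter_jpeg_segments(header))
for marker, payload in segments:
    print(f'marker ff{marker:02x}, {len(payload)} payload bytes: {payload!r}')
```

In a real file you would read more than 20 bytes, since a segment such as Exif can be several kilobytes long.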

ValueError: invalid literal for int() with base 2 using Python 3

I have created my own (baby) version of AES and everything is working correctly, except for one thing.
Some binary numbers somehow pick up a 'b' within them, for example: b1b10101
I am not very clued up on how Python handles binary conversions, but when I try to convert to decimal using pepee = int(pepe, 2), it throws the error mentioned in the title whenever the string contains a 'b'.
I found one other answer for this error on here, but the solution does not work for me: using format(pepe, 'b') throws an error.
I suspect it was written for Python 2.
I need to know how I can prevent these b's from occurring in my binary strings, or how I can convert them back to the original bit value.
Sample code:
subList2 = ['b1', 'b1', '00', '00']
subStr = ''.join(subList2)  # 'b1b10000'
subDec = int(subStr, 2)  # raises ValueError: invalid literal for int() with base 2
Please note that I did not intend these b's to appear in the string; they appear at runtime.
Have you tried making a small code snippet that just converts a binary string? Where do you get the binary strings from? If, for example, you make a binary string using bin(), the string will contain a 'b' character.
print(bin(10))
# Outputs: 0b1010
But if you use format(int, 'b') instead, it will not contain the 'b'.
# Set test to a binary string and print it
test = '101001'
print(test)
# Convert test from binary string to int and print it
test = int(test, 2)
print(test)
# Convert test from int to binary string and print it
test = format(test, 'b')
print(test)
Ok,
I got it working.
I had made an arithmetic error in my code that was producing negative numbers for binary conversion, which created these 'b' characters in place of the negative numbers. Now it is fixed.
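The original conversion code is not shown, so the following is only a guess at how a negative number can turn into 'b1': slicing the first two characters off the result of bin() assumes a '0b' prefix, but for negative numbers the prefix is '-0b'. Both helper names are hypothetical.

```python
def bits_by_slicing(n):
    # Fragile: assumes bin() always starts with the two characters '0b'.
    return bin(n)[2:]


def bits_checked(n, width=2):
    # Reject negative values early instead of emitting a malformed string.
    if n < 0:
        raise ValueError(f'expected a non-negative value, got {n}')
    return format(n, f'0{width}b')


print(bin(-1))              # -0b1
print(bits_by_slicing(-1))  # b1
print(bits_by_slicing(0))   # 0
print(bits_checked(0))      # 00
```

Using format() with a fixed width also keeps every chunk the same length, which matters when the chunks are joined and passed to int(..., 2).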
