How to convert bytestream to string? - python-3.x

How can I convert my byte stream (from socket.recv()) to a string, or how should I decode it?
I want to connect to a server, but when I try to decode the bytes returned by recv() I get a UnicodeDecodeError. I can suppress the error by passing "ignore" to decode(), but then UTF-8 gives me a string I can't read. Maybe UTF-8 is the wrong codec, but how do I know which one to use?
This is what I get from the recv()
b'HTTP/1.1 200 OK\r\nDate: Tue, 14 May 2019 17:28:11 GMT\r\nServer: Apache/2.2.22 (Debian)\r\nAccept-Encoding\r\nContent-Encoding: gzip\r\nContent-Length: 5934\r\nKeep-Alive: timeout=2, max=5\r\nConnection: Keep-Alive\r\nContent-Type: text/html;charset=utf-8\r\n\r\n\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed]\
..
.a little bit more bytes
...
#3\x81\x15ntT (at the end)
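Judging from the headers in the dump above, the body is not text yet: the response declares Content-Encoding: gzip, and the bytes after the blank line start with \x1f\x8b, the gzip magic number. A minimal sketch, assuming the whole response has already been received into one bytes object (the function name decode_http_response and the sample response are made up for illustration), decodes the headers, decompresses the body, and only then applies the charset from Content-Type:

```python
import gzip


def decode_http_response(raw: bytes, charset: str = "utf-8"):
    """Split a complete HTTP response into header text and decoded body text."""
    head, _, body = raw.partition(b"\r\n\r\n")
    # Header bytes are ASCII-compatible; iso-8859-1 never raises on decode.
    header_text = head.decode("iso-8859-1")
    headers = {}
    for line in header_text.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:  # ignore malformed lines that have no colon
            headers[name.strip().lower()] = value.strip()
    if headers.get("content-encoding", "").lower() == "gzip":
        body = gzip.decompress(body)
    return header_text, body.decode(charset)


# Hypothetical response built locally so the example runs without a server.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Type: text/html;charset=utf-8\r\n"
    b"\r\n"
) + gzip.compress("<p>héllo</p>".encode("utf-8"))

header_text, text = decode_http_response(sample)
print(text)
```

In practice an HTTP client library such as requests does this decompression and decoding for you; the sketch only shows why decoding the raw bytes as UTF-8 fails.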

Related

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b
using PySpark.
import array
from io import StringIO
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()
def byte_mapper(bytes):
    return str(bytes)
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just product_id is selected from the rdd using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on s3.
Each record in the file has the first 10 bytes for product_id and the next 4096 bytes as image_features.
I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.
EDIT:
In the end, the problem comes from the recordLength. Each feature is a 4-byte float, so the record length is not 4096 + 10 but 4096*4 + 10 = 16394. Changing the call to:
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
should work.
You can actually see this in the reader code provided on the website you downloaded the binary file from:
for i in range(4096):
    feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
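As a rough illustration of the 4096*4 + 10 layout, the sketch below parses one fixed-length record into a product_id string and a list of floats. The names parse_record, ID_BYTES, N_FEATURES and RECORD_LENGTH are hypothetical, the record is synthetic, and little-endian byte order is an assumption (the site's reader uses the native 'f' format):

```python
import struct

ID_BYTES = 10
N_FEATURES = 4096
RECORD_LENGTH = ID_BYTES + 4 * N_FEATURES  # 16394 bytes per record


def parse_record(record: bytes):
    """Return (product_id, features) for one fixed-length record."""
    if len(record) != RECORD_LENGTH:
        raise ValueError("expected %d bytes, got %d" % (RECORD_LENGTH, len(record)))
    product_id = record[:ID_BYTES].decode("ascii")
    # "<" = little-endian is an assumption; "f" = 4-byte float.
    features = list(struct.unpack("<%df" % N_FEATURES, record[ID_BYTES:]))
    return product_id, features


# Synthetic record: a 10-byte id followed by 4096 floats.
fake = b"1582480311" + struct.pack("<%df" % N_FEATURES, *([0.5] * N_FEATURES))
pid, feats = parse_record(fake)
print(pid, len(feats))
```

In the PySpark job, something like img_embedding_file.map(parse_record) could then replace the two separate mapper functions, provided the RDD was created with the 16394-byte record length.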
Old answer:
I think the issue comes from your byte_mapper function.
Calling str() on a bytes object does not decode it; it returns the repr, including the b'...' wrapper. Use decode instead:
data = b'1582480311'
print(str(data))
# output: b'1582480311'
print(data.decode("utf-8"))
# output: 1582480311
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
that means the product_id bytes are not valid UTF-8. If you don't know the input encoding, you can't reliably convert the bytes into a string.
However, you can drop the undecodable bytes by passing the "ignore" error handler to decode:
data.decode("utf-8", "ignore")

MicroPython client does not receive a text file larger than 4 KB (4096 bytes) from a Python server

I have a MicroPython client on an ESP32 board and a Python server on Linux. I am trying to send a 5.5 KB text file from the Python server to the MicroPython client. The send succeeds, but the MicroPython client does not receive all of the data. The code is as follows.
Python server:
with open('downloads/%s' % (request_path), 'rb') as f:
    data = f.read()
    self.wfile.write(data) # data is 5.5 KB
MicroPython client:
recvData = sock.read(4096).decode('utf-8').split("\r\n")
print("Response_Received:: %s" % recvData)
sock.close()
Response_Received:: ['HTTP/1.0 200 OK', 'Server: SimpleHTTP/0.6 Python/3.5.3', 'Date: Sat, 09 Jun 2018 09:29:41 GMT', '', '# Ity: asdasd\n# ksduygfkhsgdkjfksjdhfg\n kjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy98\n 47y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhs\n gdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349rio\n t34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r3\n 49riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkv\n nvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogijiksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweu\n oiruy9847y397r349riot34jt;o\n giji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbs\n djkvbjcxbvhweioufhoiweuoiruy9847y397\n r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufh\n oiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduyg\n fkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhwei\n oufhoiweuoiruy9847y397r349riot\n 34jt;ogiji4vuijo4vjlkvnvl;kksduyg\n fkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhw\n eioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjd\n 
hfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiwe\n uoiruy9847y397r349riot34jt;ogiji4vuij\n o4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhwe\n ioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfk\n hsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;o\n giji4vuijo4vjlkvnvl;kksduygfkhsgdk\n jfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvh\n weioufhoiweuoiruy9847y397r349riot34jt;ogiji\n 4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiru\n y9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;k4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduyg\n fkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcx\n bvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhs\n gdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdj\n nvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogijiksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9bfkjcbsdjkvbjcxbvhweioufhoi847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweu\nnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogijiksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweu\nnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogijiksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufhoiweuoiruy9847y397r349riot34jt;ogiji4vuijo4vjlkvnvl;kksduygfkhsgdkjfksjdhfgkjdhsbfkjdhsbfkjcbsdjkvbjcxbvhweioufnvl;k']
The client receives only 4140 bytes of the data because of the buffer size (4096), so the 4th element of recvData is cut off. MicroPython does not accept more than this buffer size. How can I receive all of my data (5.5 KB) in the 4th element of the recvData array without any loss?
I have tried to collect the received data in fragments, but it was not successful:
fragments = []
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    fragments.append(chunk)
Since your goal is to write the file to the filesystem, the simplest solution is to stop trying to hold the entire file in memory. Instead of building up your fragments array, just write the received chunks to a file:
with open('datafile', 'wb') as fd:  # binary mode: recv() returns bytes
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        fd.write(chunk)
This requires a constant amount of memory and can be used to receive files of arbitrary size.
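The chunked approach can be extended to skip the HTTP headers before writing. The following is a sketch for CPython, untested on MicroPython; it assumes the server closes the connection after the response (as the HTTP/1.0 status line in the output suggests). The name save_body and the socketpair demo are illustrative only:

```python
import os
import socket
import tempfile


def save_body(sock, path, bufsize=4096):
    """Read a response until EOF and write only the body to path."""
    buf = b""
    # Accumulate until the blank line that ends the headers has arrived.
    while b"\r\n\r\n" not in buf:
        chunk = sock.recv(bufsize)
        if not chunk:
            raise ConnectionError("connection closed before headers ended")
        buf += chunk
    headers, _, rest = buf.partition(b"\r\n\r\n")
    written = len(rest)
    with open(path, "wb") as fd:
        fd.write(rest)  # body bytes that arrived together with the headers
        while True:
            chunk = sock.recv(bufsize)
            if not chunk:  # empty result means the peer closed the socket
                break
            fd.write(chunk)
            written += len(chunk)
    return headers, written


# Demo with a local socket pair standing in for the real server.
server, client = socket.socketpair()
payload = b"x" * 5500
server.sendall(b"HTTP/1.0 200 OK\r\nServer: demo\r\n\r\n" + payload)
server.close()

path = os.path.join(tempfile.mkdtemp(), "datafile")
headers, written = save_body(client, path)
client.close()
print(written)
```

Only one chunk of at most bufsize bytes is held in memory at a time once the headers have been parsed, which is what matters on a small board.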

How to decode the byte into string in python3

After receiving bytes from the server, I need to convert them into a string. When I try the code below, it does not work as expected.
a
Out[140]: b'NC\x00\x00\x00'
a.decode()
Out[141]: 'NC\x00\x00\x00'
a.decode('ascii')
Out[142]: 'NC\x00\x00\x00'
a.decode('ascii').strip()
Out[143]: 'NC\x00\x00\x00'
a.decode('utf-8').strip()
Out[147]: 'NC\x00\x00\x00'
# I need the Output as 'NC'
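This is not an encoding issue: the trailing bytes are all NUL bytes, so it looks like your server is padding the value with NULs. strip() without arguments only removes whitespace, which is why it leaves them in place. To remove them, strip them explicitly and then decode:
a.strip(b'\x00').decode('ascii')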
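If the server sends fixed-width, NUL-padded fields, a small helper can keep only the text before the first NUL. This is a sketch under that assumption; field_to_str is a made-up name:

```python
def field_to_str(raw: bytes, encoding: str = "ascii") -> str:
    """Decode a NUL-padded field, keeping only the bytes before the first NUL."""
    return raw.split(b"\x00", 1)[0].decode(encoding)


a = b"NC\x00\x00\x00"
# str.strip() with no arguments removes whitespace only, so the NULs survive.
print(repr(a.decode("ascii").strip()))
# Stripping the NUL bytes explicitly, or cutting at the first NUL, removes them.
print(repr(a.rstrip(b"\x00").decode("ascii")))
print(repr(field_to_str(a)))
```

Cutting at the first NUL also discards anything that follows the padding, which is usually what is wanted for C-style strings but not for binary fields that legitimately contain zero bytes.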

How to write string as unicode byte in python?

When I write '你' in agend and save it as test-unicode.txt in Unicode mode, then open it with xxd g:\\test-unicode.txt, I get:
0000000: fffe 604f ..`O
1. fffe stands for little endian.
2. The Unicode code point of 你 is \x4f\x60.
I want to write 你 as 604f or 4f60 in the file.
output=open("g://test-unicode.txt","wb")
str1="你"
output.write(str1)
output.close()
error:
TypeError: 'str' does not support the buffer interface
When I change it to the following, there is no error:
output=open("g://test-unicode.txt","wb")
str1="你"
output.write(str1.encode())
output.close()
When I open it with xxd g:\\test-unicode.txt, I get:
0000000: e4bd a0 ...
How can I write 604f or 4f60 into my file the same way Microsoft aengda does (save in Unicode format)?
"Unicode" as an encoding is actually UTF-16LE.
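with open("g:/test-unicode.txt", "w", encoding="utf-16le") as output:
    output.write(str1)
Note that the "utf-16le" codec does not write the fffe byte order mark seen in the first dump; write "\ufeff" first if you need it.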
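To see how the three dumps in the question relate, the short check below encodes the same character with each codec and prints the bytes as hex. It only uses the standard library; the variable names are arbitrary:

```python
import codecs

ch = "你"  # code point U+4F60

print(hex(ord(ch)))                  # the code point itself
print(ch.encode("utf-8").hex())      # what str.encode() produces by default
print(ch.encode("utf-16-le").hex())  # low byte first
print(ch.encode("utf-16-be").hex())  # high byte first

# A little-endian byte order mark followed by the UTF-16LE bytes.
with_bom = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")
print(with_bom.hex())
```

The UTF-8 result corresponds to the e4bd a0 dump, and the BOM-prefixed UTF-16LE result to the fffe 604f dump.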

Can anyone tell me what the encoding of this string is? It's meant to be base64

cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG
This is a string from an XML file from my mass spectrometer. I am trying to write a program to load two such files, subtract one set of values from another, and write the results to a new file. According to the specification file for the .mzML format, the numerical data is supposed to be base64-encoded. I can't convert this data string to anything legible using any of the many online base64 converters, or using Notepad++ and the MIME toolkit's base64 converter.
The string, in the context of the results file, looks like this:
<binaryDataArray encodedLength="224">
<cvParam cvRef="MS" accession="MS:1000515" name="intensity array" unitAccession="MS:1000131" unitName="number of counts" unitCvRef="MS"/>
<cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<binary>cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG</binary>
I can't proceed until I can work out what format this encoding is meant to be!
Thanks in advance for any replies.
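The base64 is fine; it decodes to raw binary 32-bit floats rather than text, which is why the converters show nothing legible. You can use this trivial program to convert the decoded bytes to plain text:
#include <stdio.h>
int main(void)
{
    float f;
    /* read one 4-byte float at a time in the machine's native byte order */
    while (fread(&f, 1, 4, stdin) == 4)
        printf("%f\n", f);
}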
I compiled this to "floatdecode" and used this command:
echo "cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG" | base64 -d | ./floatdecode
Output is:
53916.445312
194461.593750
451397.031250
771329.250000
1062809.875000
1283866.750000
1448337.375000
1535085.750000
1452893.125000
1150371.625000
729577.000000
387898.875000
248514.218750
285922.906250
402376.906250
524166.562500
618908.875000
665179.312500
637168.312500
523052.562500
353709.218750
197642.328125
112817.929688
106072.507812
142898.609375
187123.531250
223422.937500
246822.531250
244532.171875
202255.109375
128694.109375
58230.625000
21125.322266
19125.541016
33440.023438
46877.441406
53692.062500
54966.843750
51473.445312
42190.324219
28009.101562
14090.161133
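The same decoding can be sketched in Python with only the standard library. It assumes what the cvParam entries in the question state ("32-bit float", "no compression") and little-endian byte order; other mzML arrays may be compressed or use 64-bit floats, which this sketch does not handle:

```python
import base64
import struct

encoded = "cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG"

raw = base64.b64decode(encoded)
count = len(raw) // 4  # each value occupies 4 bytes

# "<" = little-endian, "f" = 32-bit IEEE-754 float
values = struct.unpack("<%df" % count, raw)

print(count)
for v in values[:3]:
    print("%f" % v)
```

Once both files are decoded this way, the subtraction is an element-wise loop over two tuples, and struct.pack followed by base64.b64encode reverses the process for writing the result.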
Yet another Java Base64 decode with options to uncompress should you need it.
Vendor spec indicated "32-bit float" = IEEE-754 and specified little-endian.
Schmidt's converter shows the bit pattern for IEEE-754.
One more Notepad++ step to look at the hex codes:
Notepad++ TextFX plugin (after the Base64 decode you already did)
select the text
TextFX > TextFX Convert > Convert text to Hex-32
lets you look at the hex codes:
"000000000 72 9C 52 47 66 E7 3D 48- ... 6E 63 BB 49 |rœRGfç=H¡hÜHP
Little-endian: 47529C72 converts (via Schmidt) as shown above by David.
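The byte-order step can be checked for a single value. The sketch below takes the first four bytes from the hex dump above, reads them as a little-endian integer to get the bit pattern, and then as a little-endian 32-bit float:

```python
import struct

raw = bytes.fromhex("72 9C 52 47")  # first four bytes of the hex dump

# Little-endian: the last byte in the file is the most significant one.
as_int = int.from_bytes(raw, "little")
value = struct.unpack("<f", raw)[0]

print(hex(as_int))
print("%f" % value)
```

Reading the same bytes with ">f" would give a very different number, which is a quick way to tell that the byte order has been guessed wrongly.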
You can access such data from mzML files in Python through pymzML, a Python interface to mzML files.
http://pymzml.github.com/

Resources