How to write an image to a string in Julia?

I want to encode an image in my directory, "x.png", to a String or Array{UInt8, 1}.
I am writing code in Julia to serialize an image using protobufs, which requires the image to be in an encoded String format.
In Python, it is done as follows; I am looking for similar functionality in Julia.
import io
from PIL import Image

img = Image.open('x.png')
output = io.BytesIO()
img.save(output, 'PNG')
img_string_data = output.getvalue()
output.close()
The output may be a String object or an Array{UInt8, 1}.

In Julia you can achieve this by writing:
img_string_data = read("x.png")
img_string_data is now a Vector{UInt8}. You could also write read("x.png", String) to get a String (which is not that useful, though, as it will mostly contain invalid characters).
There is one difference between the Julia solution and your Python solution. The Julia approach stores in img_string_data contents identical to what "x.png" holds at the binary level, while your Python solution stores an identical image that may differ at the binary level (i.e. PIL might change some bytes in your file).
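For comparison, the byte-for-byte equivalent in Python skips PIL entirely and reads the raw file, just as Julia's read does (a minimal sketch):
# Read the file exactly as stored on disk; nothing re-encodes the pixels
with open('x.png', 'rb') as f:
    img_string_data = f.read()  # bytes, identical to "x.png" at the binary level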

Related

Converting b-string to png in Python 3.9.6

I have been trying to convert this b-string to a PNG image.
Here is the byte string for a barcode received from an API, the Cloudmersive 1D barcode generator API.
I have tried using base64.b64decode() and then writing the binary to an image file, but it does not work. I also tried using BytesIO, but that does not work either.
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00h\x00\x00\x00d\x08\x02\x00\x00\x00\xe5\xbc\xe2\x8d\x00\x00\x00\x01sRGB\x00\xae\xce\x1c\xe9\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f\x0b\xfca\x05\x00\x00\x00\tpHYs\x00\x00\x0e\xc3\x00\x00\x0e\xc3\x01\xc7o\xa8d\x00\x00\x0c\x8aIDATx^ ... \x11\xda\xa3\xaefM\x89\xbf\x00\x00\x00\x00IEND\xaeB`\x82'
[several kilobytes of repeated IDAT scan-line data elided; the string begins with the \x89PNG signature and ends with the IEND chunk]
There's no need to use b64decode or any other operation on that byte string; it's ready to write to a file as is.
with open(r'c:\temp\temp.png', 'wb') as f:
    f.write(b_str)
This produces the barcode image.
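If you want a quick sanity check before writing, every PNG starts with the same fixed 8-byte signature (a minimal sketch; b_str is the byte string above):
# \x89PNG\r\n\x1a\n is the mandatory 8-byte PNG signature
assert b_str[:8] == b'\x89PNG\r\n\x1a\n'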

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte while accessing csv file

I am trying to access a CSV file from an AWS S3 bucket and getting the error 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. The code is below; I am using Python 3.7.
from io import StringIO

import boto3
import pandas as pd

s3 = boto3.client('s3', aws_access_key_id='######',
                  aws_secret_access_key='#######')
response = s3.get_object(Bucket='#####', Key='raw.csv')
# print(response)
s3_data = StringIO(response.get('Body').read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Kindly help me out here: how can I resolve this issue?
Using gzip worked for me (note that \x1f\x8b is the two-byte gzip magic number, and the error complains about byte 0x8b at position 1, so the object was gzip-compressed rather than plain CSV):
import gzip

import boto3
import pandas as pd

client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
csv_obj = client.get_object(Bucket=####, Key=###)
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)
The error you're getting means the CSV file you're getting from this S3 bucket is not encoded using UTF-8.
Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.
If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).
So maybe try .decode('windows-1252')?
If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.
Finally, I suggest that, instead of using an explicit decode() and using a StringIO object for the contents of the file, store the raw bytes in an io.BytesIO and have pd.read_csv() decode the CSV by passing it an encoding argument.
import io
s3_data = io.BytesIO(response.get('Body').read())
data = pd.read_csv(s3_data, encoding='windows-1252')
As a general practice, you want to delay decoding as much as you can. In this particular case, having access to the raw bytes can be quite useful, since you can use that to write a copy of them to a local file (that you can then inspect with a text editor, or on Excel.)
Also, if you want to do detection of the encoding (using chardet, for example), you need to do so before you decode it, so again in that case you need the raw bytes, so that's yet another advantage to using the BytesIO here.
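For instance, a minimal detection sketch with chardet (assuming response is the boto3 get_object result from the question):
import chardet  # third-party: pip install chardet

raw = response.get('Body').read()  # keep the raw bytes, no decoding yet
guess = chardet.detect(raw)        # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
print(guess['encoding'], guess['confidence'])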

Difference between io.StringIO and a string variable in python

I am new to Python.
Can anybody explain the difference between a string variable and io.StringIO? In both we can save characters.
e.g.
String variable:
k = 'RAVI'
io.StringIO:
string_out = io.StringIO()
string_out.write('A sample string which we have to send to server as string data.')
string_out.getvalue()
If we print k or string_out.getvalue(), both will print the text:
print(k)
print(string_out.getvalue())
They are similar in that both str and StringIO represent strings; they just do it in different ways:
str: Immutable
StringIO: Mutable, file-like interface, which stores strs
A text-mode file handle (as produced by open("somefile.txt")) is also very similar to StringIO (both are "Text I/O"), with the latter allowing you to avoid using an actual file for file-like operations.
You can use io.StringIO() to simulate files. Since Python is dynamic about variable types, anything that accepts a file object can usually also take an io.StringIO(), meaning you can have a "file" in memory whose contents you control without actually writing any temporary files to disk.
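For example, csv.reader expects a file-like object, and an in-memory StringIO works just as well as a real file (a minimal sketch):
import csv
import io

fake_file = io.StringIO('name,age\nRavi,30\n')  # a "file" that lives entirely in memory
for row in csv.reader(fake_file):
    print(row)  # ['name', 'age'], then ['Ravi', '30']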

Decoding/Encoding using sklearn load_files

I'm following the tutorial here
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb
to learn about machine learning and text.
In my case, I'm using tweets I downloaded, with positive and negative tweets in the exact same directory structure they are using (trying to learn sentiment analysis).
Here in the IPython notebook I load my data just like they do:
tweets_train = load_files('Path to my training Tweets')
And then I try to fit them with CountVectorizer:
text_train = tweets_train.data
vect = CountVectorizer().fit(text_train)
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte
Is this because my tweets have all sorts of non-standard text in them? I didn't do any cleanup of my tweets (I assume there are libraries that help with that in order to make a bag of words work?)
EDIT:
The code I use with Twython to download tweets:
def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  # iterate through all tweets
        # tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user,
                                                  count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  # append tweet ids
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()
You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding.
If this means nothing to you, make sure you understand the basics of Unicode and text encoding, eg. with the official Python Unicode HOWTO.
First, you need to find out what encoding was used to store the tweets on disk.
When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. Check this, for example, in an interactive session:
>>> f = open('/tmp/foo', 'a')
>>> f
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
Here you can see that in my local environment the default encoding is set to UTF-8. You can also directly inspect the default encoding with
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
There are other ways to find out what encoding was used for the files.
For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:
tweets_train = load_files('path to tweets', encoding='latin-1')
... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.
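To avoid the mismatch at the source, you could also pass an explicit encoding when writing the tweets in the download script above, so you know exactly what to give load_files() later (a minimal tweak of the question's loop):
# Write with a known encoding instead of the platform default
text_file = open(user, "a", encoding="utf-8")
text_file.write(text)
text_file.close()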

namelist() from ZipFile returns strings with an invalid encoding

The problem is that for some archives or files uploaded to the Python application, ZipFile's namelist() returns badly decoded strings.
from zipfile import ZipFile

for name in ZipFile('zipfile.zip').namelist():
    print('Listing zip files: %s' % name)
How can I fix this code so it always decodes file names as Unicode (so Chinese, Russian and other languages are supported)?
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.
How can I fix this code so it always decodes file names as Unicode (so Chinese, Russian and other languages are supported)?
Automatically? You can't. Filenames in a basic ZIP file are strings of bytes with no attached encoding information, so unless you know what the encoding was on the machine that created the ZIP you can't reliably get a human-readable filename back out.
There is an extension to the flags on modern ZIP files to tell you that the filename is UTF-8. Unfortunately, files you receive from Windows users typically don't have it, so you'll be left guessing with inherently unreliable methods like chardet.
I've seen some samples for Python 2, but since the nature of strings changed in Python 3, I have no clue how to re-encode the names or apply chardet to them.
Python 2 would just give you raw bytes back. In Python 3 the new behaviour is:
if the UTF-8 flag is set, it decodes the filenames using UTF-8 and you get the correct string value back
otherwise, it decodes the filenames using DOS code page 437, which is pretty unlikely to be what was intended. However, you can re-encode the string back to the original bytes, and then try to decode again using the code page you actually want, e.g. name.encode('cp437').decode('cp1252').
Unfortunately (again, because the unfortunatelies never end where ZIP is concerned), ZipFile does this decoding silently without telling you what it did. So if you want to switch and only do the transcode step when the filename is suspect, you have to duplicate the logic for sniffing whether the UTF-8 flag was set:
import chardet

ZIP_FILENAME_UTF8_FLAG = 0x800

for info in ZipFile('zipfile.zip').infolist():
    filename = info.filename
    if info.flag_bits & ZIP_FILENAME_UTF8_FLAG == 0:
        filename_bytes = filename.encode('cp437')
        guessed_encoding = chardet.detect(filename_bytes)['encoding'] or 'cp1252'
        filename = filename_bytes.decode(guessed_encoding, 'replace')
    ...
Here's the code that decodes filenames in zipfile.py according to the zip spec, which supports only the cp437 and utf-8 character encodings:
if flags & 0x800:
    # UTF-8 file names extension
    filename = filename.decode('utf-8')
else:
    # Historical ZIP filename encoding
    filename = filename.decode('cp437')
As you can see, if the 0x800 flag is not set, i.e., if utf-8 is not used in your input zipfile.zip, then cp437 is used, and therefore the result for "Chinese, Russian and other languages" is likely to be incorrect.
In practice, ANSI or OEM Windows codepages may be used instead of cp437.
If you know the actual character encoding, e.g., cp866 (the OEM (console) codepage used on Russian Windows), then you could re-encode the filenames to get the originals back:
filename = corrupted_filename.encode('cp437').decode('cp866')
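Putting it together, a hedged end-to-end sketch (assuming a flat archive whose names really were cp866, as above):
import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    for info in zf.infolist():
        if info.flag_bits & 0x800 == 0:  # UTF-8 flag not set: name was decoded as cp437
            fixed = info.filename.encode('cp437').decode('cp866')
        else:
            fixed = info.filename
        with zf.open(info) as src, open(fixed, 'wb') as dst:
            dst.write(src.read())  # extract under the repaired name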
The best option is to create the zip archive using utf-8 so that you can support multiple languages in the same archive:
c:\> 7z.exe a -tzip -mcu archive.zip <files>..
or
$ python -m zipfile -c archive.zip <files>..
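Or from Python itself; CPython's zipfile stores a name as ASCII when it can and otherwise falls back to UTF-8 with the 0x800 flag set (a minimal sketch with a hypothetical Russian filename):
import zipfile

with zipfile.ZipFile('archive.zip', 'w') as zf:
    # non-ASCII name: zipfile falls back to UTF-8 and sets flag 0x800
    zf.writestr('файл.txt', 'contents')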
I got the same problem, but with a known language (Russian).
The simplest solution is just to convert the archive with this utility: https://github.com/vlm/zip-fix-filename-encoding
For me it works on 98% of archives (it failed on 317 files from a corpus of 11,388).
A more complex solution: use the Python module chardet together with zipfile. It depends on the Python version (2 or 3) you use, since zipfile differs between them. For Python 3 I wrote this code:
import chardet

original_name = name
try:
    name = name.encode('cp437')
except UnicodeEncodeError:
    name = name.encode('utf8')
encoding = chardet.detect(name)['encoding']
name = name.decode(encoding)
This code tries to handle old-style zips (where the name was decoded as CP437 and is simply broken); if that fails, the archive appears to be new-style (UTF-8). After determining the proper encoding, you can extract the files with code like:
from shutil import copyfileobj
fp = archive.open(original_name)
fp_out = open(name, 'wb')
copyfileobj(fp, fp_out)
In my case, this resolved last 2% of failed files.
