Cannot save byte literals to csv file with Python - python-3.x

I am trying to write a program that encrypts text data input by the user and saves the encrypted information to a csv file. To use the stream cipher I first convert the string data to bytes literals, then try to save it in this format. The problem is that when I re-read the csv file the next time I open the program, the data I saved as bytes type has been converted to string type, including the b''. Please refer to the code below.
IN:
from Crypto.Cipher import Salsa20
import pandas as pd
df = pd.DataFrame({'col1': ['secret info', 'more secret info'], 'col2': ['top secret stuff', 'hide from prying eyes']})
key = b'*Thirty-two byte (256 bits) key*'
nonce = b'*8 byte*'
cipher = Salsa20.new(key=key, nonce=nonce)
for col in df.columns:
    df[col] = df[col].apply(lambda a: a.encode('utf-8'))
    df[col] = df[col].apply(lambda a: cipher.encrypt(a))
print(f"Format of data in dataframe pre saving: {type(df.iloc[0, 0])}")
df.to_csv('my_data.csv', encoding='utf-8')
encrypted_df = pd.read_csv('my_data.csv', encoding='utf-8', index_col=0)
print(f"Format of data in re-read dataframe: {type(encrypted_df.iloc[0, 0])}")
OUT:
Format of data in dataframe pre saving: <class 'bytes'>
Format of data in re-read dataframe: <class 'str'>
Is there a way to read the csv file so that the data is of bytes type and not a string so that I can easily decrypt it?
I have tried:
Decoding the data back to strings prior to writing to the csv file; however, this raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 0: invalid start byte (the ciphertext is arbitrary bytes, not valid UTF-8).
Stripping the b'' from the strings and then encoding back to bytes type; however, the encoder escapes the string, adding loads of backslashes, so I then can't decrypt the text.
I'm relatively new to coding and very new to encryption so simple answers would be highly appreciated.

As you have found, CSV format can only deal with strings. A common way to convert bytes to a string is Base64. Change your bytes into a Base64 string, which you can put into your CSV file.
When you read the CSV file you will get your Base64 string back; you then need to convert the Base64 back into the original bytes.
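For example, a minimal sketch of the round trip (the ciphertext below is placeholder bytes standing in for the output of cipher.encrypt(), and an in-memory buffer stands in for the csv file):

```python
import base64
from io import StringIO

import pandas as pd

ciphertext = b'\x9c\x01\xfe some encrypted bytes'  # placeholder for cipher.encrypt() output

# bytes -> Base64 text before saving
df = pd.DataFrame({'col1': [base64.b64encode(ciphertext).decode('ascii')]})
buf = StringIO()
df.to_csv(buf, index=False)

# Base64 text -> bytes after re-reading
buf.seek(0)
recovered = base64.b64decode(pd.read_csv(buf).iloc[0, 0])
print(recovered == ciphertext)  # True
```

The same round trip can be applied column-wise with df[col].apply(...) before calling to_csv and after read_csv.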

Related

Reading NULL bytes using pandas.read_csv()

I have a file in CSV format which contains NULL bytes (maybe 0x84) in each line. I need to read this file using the C engine of pd.read_csv().
These values cause an error while reading: 'utf-8' codec can't decode byte 0x84 in position 14.
Is there any way to fix it without changing the file?
Try one of these options:
Option 1:
Set the engine as python.
pd.read_csv(filename, engine='python')
Option 2:
Try utf-16 encoding, because the error could also mean the file is encoded in UTF-16. Or set the encoding to whatever the file's actual encoding is, for example:
encoding = "cp1252"
encoding = "ISO-8859-1"
Option 3:
Read the file as bytes
with open(filename, 'rb') as f:
    data = f.read()
The b in the mode specifier passed to open() states that the file shall be treated as binary, so the contents will remain a bytes object. No decoding attempt will happen this way.
Alternatively you can use open method from the codecs module to read in the file:
import codecs
with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    data = f.read()
This will strip out (ignore) the undecodable characters and return the string without them. Only use this if you need to strip them, not convert them.
https://docs.python.org/3/howto/unicode.html#the-string-type
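For instance, given a file containing a stray 0x84 byte (written here just for the demonstration), the undecodable byte is silently dropped:

```python
import codecs

# Create a file containing an invalid UTF-8 byte (0x84) for demonstration.
with open('sample.csv', 'wb') as f:
    f.write(b'col1,col2\nabc\x84,def\n')

with codecs.open('sample.csv', 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()

print(repr(text))  # 'col1,col2\nabc,def\n' - the 0x84 byte is gone
```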

Parse string representation of binary loaded from CSV [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])
But it doesn't decode the text. I cannot use .decode('utf-8'), as the csv reader reads data as strings, i.e. type(row[3]) is 'str', and I can't seem to convert it into bytes: the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
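Iterating the resulting reader then yields the parsed rows; for example, with placeholder byte content:

```python
import csv
from io import StringIO

byte_content = b'a,b,c\n1,2,3\n'  # placeholder; in practice read from a file opened in 'rb' mode
csv_data = csv.reader(StringIO(byte_content.decode()), delimiter=',')
rows = list(csv_data)
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```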
If your input file really contains strings with Python syntax b prefixes on them, one way to workaround it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval() function as #Ry suggested — although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
    """Convert a string represented in Python byte-string literal b'' syntax
    into a decoded character string - otherwise return it unchanged.
    """
    result = field
    try:
        result = ast.literal_eval(field)
    finally:
        return result.decode() if isinstance(result, bytes) else result

def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]

reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
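Applied to a field like the one in the question, ast.literal_eval turns the b'...' text back into real bytes, which can then be decoded:

```python
import ast

field = "b'Lorem Ipsum\\xc2\\xa0Assignment '"  # the string as csv.reader returns it
result = ast.literal_eval(field)               # -> bytes, escape sequences resolved
print(result.decode('utf-8'))                  # Lorem Ipsum Assignment (with a non-breaking space)
```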

Decode a Python string

Sorry for the generic title.
I am receiving a string from an external source: txt = external_func()
I am copying/pasting the output of various commands to make sure you see what I'm talking about:
In [163]: txt
Out[163]: '\\xc3\\xa0 voir\\n'
In [164]: print(txt)
\xc3\xa0 voir\n
In [165]: repr(txt)
Out[165]: "'\\\\xc3\\\\xa0 voir\\\\n'"
I am trying to transform that text to UTF-8 (?) to have txt = "à voir\n", and I can't see how.
How can I do transformations on this variable?
You can encode your txt to a bytes-like object using the encode-method of the str class.
Then this byte-like object can be decoded again with the encoding unicode_escape.
Now you have your string with all escape sequences parsed, but latin-1 decoded. You still have to encode it with latin-1 and then decode it again with utf-8.
>>> txt = '\\xc3\\xa0 voir\\n'
>>> txt.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'à voir\n'
The codecs module also has an undocumented function called escape_decode:
>>> import codecs
>>> codecs.escape_decode(bytes('\\xc3\\xa0 voir\\n', 'utf-8'))[0].decode('utf-8')
'à voir\n'

Python 3: Persist strings without b'

I am confused. This talk explains that you should only use unicode strings in your code. When strings leave your code, you should turn them into bytes. I did this for a csv file:
import csv
with open('keywords.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p.encode("utf-8"), ', '.join(keywords).encode("utf-8")])
This leads to an annoying effect, where b' is added in front of every string; this didn't happen for me in Python 2.7. When I don't encode the strings before writing them to the csv file, the b' is not there, but don't I need to turn them into bytes when persisting? How do I write bytes into a file without this b' annoyance?
Stop trying to encode the individual strings, instead you should specify the encoding for the entire file:
import csv
with open('keywords.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p, ', '.join(keywords)])
The reason your code goes wrong is that writerow expects strings but you are passing bytes, so it uses the repr() of the bytes, which has the extra b'...' around it. If you pass it strings and use the encoding parameter when you open the file, the strings will be encoded correctly for you.
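A short sketch of that effect, writing one bytes field to an in-memory file:

```python
import csv
from io import StringIO

buf = StringIO()
csv.writer(buf).writerow(['text'.encode('utf-8')])
print(buf.getvalue())  # b'text' - the repr of the bytes, not the text itself
```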
See the csv documentation examples. One of these shows you how to set the encoding.

csv to pandas.DataFrame while keeping data original encoding

I have a csv file with some utf8 unicode characters in it, which I want to load into a pandas.DataFrame while keeping the unicode characters as is, not escaping them.
Input .csv:
letter,unicode_primary,unicode_alternatives
8,\u0668,"\u0668,\u06F8"
Code:
df = pd.DataFrame.from_csv("file.csv")
print(df.loc[0].unicode_primary)
Result:
> \\u0668
Desired Result:
> \u0668
or
> 8
Please use read_csv instead of from_csv (which is deprecated) as follows. Note that read_csv already returns a DataFrame, so there is no need to wrap it in pd.DataFrame:
df = pd.read_csv("file.csv", encoding='utf_8')
print(df.loc[0].unicode_primary)
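A self-contained sketch showing that read_csv preserves actual UTF-8 characters in the file (backslash escape sequences like \u0668, by contrast, are left as literal text; they are not interpreted):

```python
from io import StringIO

import pandas as pd

# CSV containing the literal Arabic-Indic digit (U+0668), not the six characters "\u0668"
csv_text = 'letter,unicode_primary\n8,\u0668\n'
df = pd.read_csv(StringIO(csv_text))
print(df.loc[0, 'unicode_primary'])  # prints the digit itself
```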
