Python 3: Persist strings without b' - string

I am confused. This talk explains that you should only use Unicode strings in your code. When strings leave your code, you should turn them into bytes. I did this for a CSV file:
import csv
with open('keywords.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p.encode("utf-8"), ', '.join(keywords).encode("utf-8")])
This has an annoying effect: b' is added in front of every string, which did not happen for me in Python 2.7. If I don't encode the strings before writing them to the CSV file, the b' is not there, but don't I need to turn them into bytes when persisting? How do I write bytes to a file without this b' annoyance?

Stop trying to encode the individual strings. Instead, specify the encoding for the entire file:
import csv
with open('keywords.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p, ', '.join(keywords)])
The reason your code goes wrong is that writerow expects strings, but you're passing bytes. The writer converts non-string values with str(), which for bytes gives the repr, including the extra b'...' around it. If you pass it strings and use the encoding parameter when you open the file, the strings will be encoded correctly for you.
See the csv documentation examples. One of these shows you how to set the encoding.
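To see the difference without touching a file, here is a minimal sketch that writes one row to an in-memory buffer, first with bytes and then with str. The helper name write_row and the sample values are made up for illustration; they are not from the question.

```python
import csv
import io

def write_row(fields):
    """Write one tab-delimited row to an in-memory text buffer and return the text."""
    buffer = io.StringIO(newline='')
    csv.writer(buffer, delimiter='\t').writerow(fields)
    return buffer.getvalue()

# Passing bytes: the writer stringifies them, so the b'...' form ends up in the output.
as_bytes = write_row(['café'.encode('utf-8'), 'a, b'.encode('utf-8')])

# Passing str: the text is written unchanged; a real file opened with
# encoding='utf-8' would turn it into bytes at write time.
as_text = write_row(['café', 'a, b'])

print(repr(as_bytes))
print(repr(as_text))
```

The second call is the shape the answer recommends: keep str inside the program and let the file object do the encoding.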

Related

Parse string representation of binary loaded from CSV [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though it's not really a valid format for CSV data to contain) would be to use Python's ast.literal_eval() function, as @Ry suggested, although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
    """Convert string represented in Python byte-string literal b'' syntax into
    a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        # Not a Python literal at all: pass the field through as-is.
        return field
    # Decode only byte-string literals; any other literal stays a plain string.
    return result.decode() if isinstance(result, bytes) else field
def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        # Rows are processed one at a time, so the file is never fully in memory.
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
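As a rough illustration of the round trip, the sketch below parses one line shaped like the sample from the question and recovers the text. The line is shortened and held in a string rather than read from data.csv, and the variable names are made up for the example.

```python
import ast
import csv
import io

# One line in the shape of the question's sample (shortened). The doubled
# backslashes put a literal backslash into the string, as stored in the file.
sample = "3415844,1450443669.0,b'Closes After Backlash in\\xe2\\x80\\xa6 | #abcde',52,18\n"

row = next(csv.reader(io.StringIO(sample)))

raw = ast.literal_eval(row[2])   # the str "b'...'" becomes a real bytes object
text = raw.decode('utf-8')       # the bytes become the original text

print(text)
```

No data has to be downloaded again: the escaped bytes in the file still carry everything needed to rebuild the text.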

Python 3 - correct way to write to .html without TypeError

# save-webpage.py (To write first 100 characters of html source into 'simple.html')
import urllib.request, io, sys
f = urllib.request.urlopen('https://news.google.com')
webContent = f.read(100)
#g = io.open('simple.html', 'w', encoding='UTF-8')
g = io.open('simple.html', 'w')
#g.write(webContent)
g.write(webContent.decode("UTF-8"))
g.close()
2019-01-11: See above for corrected working code after answers were received. Thanks guys.
Original question:
Upon execution, the file, simple.html, is created with 0 bytes.
Along with an error:
TypeError: must be str, not bytes.
Please help. I've gone about this several ways but to no avail. Thank you in advance!
g.write(webContent.decode("utf-8"))
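File objects opened in text mode require you to write Unicode text (str), not bytes.
This line opens the file in text mode, so the file object itself encodes whatever you write to UTF-8 bytes:
g = io.open('simple.html', 'w', encoding='UTF-8')
You can either skip that encoding step by opening the file in binary mode ('wb') and writing the bytes as they are, or decode the bytes to str before writing, as above.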
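A small sketch of both options, using a hard-coded bytes value in place of the network read so that it runs offline. The file names and the web_content value are placeholders, not part of the question.

```python
import os
import tempfile

# Stand-in for f.read(100): urlopen(...).read() returns bytes like this.
web_content = '<!doctype html><p>caf\u00e9</p>'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'text.html')
    binary_path = os.path.join(tmp, 'binary.html')

    # Option 1: text mode. Decode to str; the file object encodes on write.
    with open(text_path, 'w', encoding='utf-8') as g:
        g.write(web_content.decode('utf-8'))

    # Option 2: binary mode. Write the bytes unchanged.
    with open(binary_path, 'wb') as g:
        g.write(web_content)

    # Both files end up with identical bytes on disk.
    with open(text_path, 'rb') as a, open(binary_path, 'rb') as b:
        same_bytes = a.read() == b.read()

print(same_bytes)
```

Binary mode also sidesteps a decode error in the case where a fixed-size read such as read(100) cuts a multi-byte UTF-8 character in half.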

Python3 CSV encoding problems

I'm trying to add some information to a row in a CSV-file, but in the updated CSV-file the special characters (², ±) aren't encoded well. Let me clarify:
So basically I have a CSV-file with the following example row:
row = ["Some info", "Cell with special characters (², ±)", "Some more info"]
I'm trying to add some text to the second column, so I wrote the following code
import csv
new_rows = []
# Open CSV file
with open('file_1.csv', 'r', encoding='UTF-8') as f1:
    reader = csv.reader(f1, delimiter=',')
    for row in reader:
        # Add text to cell, save row to memory
        row[1] = row[1] + " is causing trouble"
        new_rows.append(row)
# Write rows in new CSV-file
with open('file_2.csv', 'w', encoding='UTF-8', newline='') as of:
    writer = csv.writer(of, delimiter=',')
    writer.writerows(new_rows)
The output I expected in the second column is as follows:
"Cell with special characters (², ±) is causing trouble"
But instead I get the following output:
"Cell with special characters (??, ??) is causing trouble"
I tried to solve the problem by changing the encoding to Latin-1, or even by not mentioning the encoding in the code at all, but nothing seems to work!
What am I missing or doing wrong here?
Your source file is probably Latin-1, and you are encoding mojibake to UTF-8 in the second file. There is nothing wrong with your code itself.
You just need to decide whether you want to produce a Latin-1 or a UTF-8 CSV as the destination (and maybe add a BOM, see https://github.com/pyexcel/pyexcel-io/issues/30) after reading the source with the correct encoding. Maybe you can check with a hexadecimal dump how the source is encoded and give us some clues.
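One way to test a guess about the source encoding is to try UTF-8 first and fall back to Latin-1 if the bytes are not valid UTF-8. The sketch below works on in-memory bytes built from the example cell; decode_guess is a hypothetical helper, and the fallback to Latin-1 is an assumption that only makes sense if the file can be nothing other than one of these two encodings.

```python
def decode_guess(raw):
    """Try UTF-8 first; fall back to Latin-1, which accepts any byte sequence."""
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return raw.decode('latin-1'), 'latin-1'

cell = 'Cell with special characters (\u00b2, \u00b1)'   # the (², ±) example cell

latin1_bytes = cell.encode('latin-1')   # one byte per special character
utf8_bytes = cell.encode('utf-8')       # two bytes per special character

print(decode_guess(latin1_bytes))
print(decode_guess(utf8_bytes))

# Decoding with the wrong codec is what produces mojibake:
mojibake = utf8_bytes.decode('latin-1')
print(mojibake)
```

In a real script the bytes would come from reading the file in binary mode, for example open('file_1.csv', 'rb').read(), and the detected name would be passed as the encoding argument when reopening the file in text mode.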

Python 3 split('\n')

How do I split a text string according to an explicit newline ('\n')?
Unfortunately, instead of a properly formatted csv file, I am dealing with a long string of text with "\n" where the newline would be. (example format: "A0,B0\nA1,B1\nA2,B2\nA3,B3\n ...") I thought a simple bad_csv_list = text.split('\n') would give me a list of the two-valued cells (example split ['A0,B0', 'A1,B1', 'A2,B2', 'A3,B3', ...]). Instead I end up with one cell and "\n" gets converted to "\\n". I tried copy-pasting a section of the string and using split('\n') and it works as I had hoped. The print statement for the file object tells me the following:
<_io.TextIOWrapper name='stats.csv' mode='r' encoding='cp1252'>
...so I suspect the problem is with the cp1252 encoding? Of note, though: Notepad++ says the file I am working with is "UTF-8 without BOM"... I've looked in the docs and around SO and tried importing io and codecs and prepending the open statement and declaring encoding='utf8' but I am at a loss and I don't really grok text encoding. Maybe there is a better solution?
from sys import argv
# import io, codecs
filename = argv[1]
file_object = open(filename, 'r')
# file_object = io.open(filename, 'r', encoding='utf8')
# file_object = codecs.open(filename, 'r', encoding='utf8')
file_contents = file_object.read()
file_list = file_contents.split('\n')
print("1.) Here's the name of the file: {}".format(filename))
print("2.) Here's the file object info: {}".format(file_object))
print("3.) Here's all the files contents:\n{}".format(file_contents))
print("4.) Here's a list of the file contents:\n{}".format(file_list))
Any help would be greatly appreciated, thank you.
If it helps to explain what I am dealing with, here's the contents of the stats.csv file:
Albuquerque,749\nAnaheim,371\nAnchorage,828\nArlington,503\nAtlanta,1379\nAurora,425\nAustin,408\nBakersfield,542\nBaltimore,1405\nBoston,835\nBuffalo,1288\nCharlotte-Mecklenburg,647\nCincinnati,974\nCleveland,1383\nColorado Springs,455\nCorpus Christi,658\nDallas,675\nDenver,615\nDetroit,2122\nEl Paso,423\nFort Wayne,362\nFort Worth,587\nFresno,543\nGreensboro,563\nHenderson,168\nHouston,992\nIndianapolis,1185\nJacksonville,617\nJersey City,734\nKansas City,1263\nLas Vegas,784\nLexington,352\nLincoln,397\nLong Beach,575\nLos Angeles,481\nLouisville Metro,598\nMemphis,1750\nMesa,399\nMiami,1172\nMilwaukee,1294\nMinneapolis,992\nMobile,522\nNashville,1216\nNew Orleans,815\nNew York,639\nNewark,1154\nOakland,1993\nOklahoma City,919\nOmaha,594\nPhiladelphia,1160\nPhoenix,636\nPittsburgh,752\nPlano,130\nPortland,517\nRaleigh,423\nRiverside,443\nSacramento,738\nSan Antonio,503\nSan Diego,413\nSan Francisco,704\nSan Jose,363\nSanta Ana,401\nSeattle,597\nSt. Louis,1776\nSt. Paul,722\nStockton,1548\nTampa,616\nToledo,1171\nTucson,724\nTulsa,990\nVirginia Beach,169\nWashington,1177\nWichita,742
And the result from the split('\n'):
['Albuquerque,749\\nAnaheim,371\\nAnchorage,828\\nArlington,503\\nAtlanta,1379\\nAurora,425\\nAustin,408\\nBakersfield,542\\nBaltimore,1405\\nBoston,835\\nBuffalo,1288\\nCharlotte-Mecklenburg,647\\nCincinnati,974\\nCleveland,1383\\nColorado Springs,455\\nCorpus Christi,658\\nDallas,675\\nDenver,615\\nDetroit,2122\\nEl Paso,423\\nFort Wayne,362\\nFort Worth,587\\nFresno,543\\nGreensboro,563\\nHenderson,168\\nHouston,992\\nIndianapolis,1185\\nJacksonville,617\\nJersey City,734\\nKansas City,1263\\nLas Vegas,784\\nLexington,352\\nLincoln,397\\nLong Beach,575\\nLos Angeles,481\\nLouisville Metro,598\\nMemphis,1750\\nMesa,399\\nMiami,1172\\nMilwaukee,1294\\nMinneapolis,992\\nMobile,522\\nNashville,1216\\nNew Orleans,815\\nNew York,639\\nNewark,1154\\nOakland,1993\\nOklahoma City,919\\nOmaha,594\\nPhiladelphia,1160\\nPhoenix,636\\nPittsburgh,752\\nPlano,130\\nPortland,517\\nRaleigh,423\\nRiverside,443\\nSacramento,738\\nSan Antonio,503\\nSan Diego,413\\nSan Francisco,704\\nSan Jose,363\\nSanta Ana,401\\nSeattle,597\\nSt. Louis,1776\\nSt. Paul,722\\nStockton,1548\\nTampa,616\\nToledo,1171\\nTucson,724\\nTulsa,990\\nVirginia Beach,169\\nWashington,1177\\nWichita,742']
Why does it ADD a \ ?
D'oh! Royal face palm! I just wrote all this out and then realized that all I needed to do was escape the backslash, so that split looks for a literal backslash followed by n instead of a real newline character:
file_list = file_contents.split('\\n')
I'm gonna post this anyways so y'all can have a chuckle ^_^
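For anyone landing here with the same puzzle, here is a short sketch of what is going on, using the first three entries of the file. Nothing adds a backslash: the file holds the two characters backslash and n, and printing a list shows each string's repr, which displays a single backslash as \\.

```python
# What the file actually contains: a backslash followed by the letter n,
# not a newline character. A raw string keeps the backslash literal.
file_contents = r'Albuquerque,749\nAnaheim,371\nAnchorage,828'

no_split = file_contents.split('\n')    # searches for a real newline: none found
rows = file_contents.split('\\n')       # searches for backslash + n

# Each row can then be split into its two cells.
pairs = [row.split(',') for row in rows]

print(no_split)   # one element, shown with doubled backslashes
print(rows)
print(pairs)
```

So the cp1252 encoding was never the issue; the separator in the data simply is not a newline.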

How do I write a map object to a csv?

I have a map object x such that
list(x)
produces a list. I am presently trying:
def writethis(name, mapObject):
    import csv
    with open(name, 'wb') as f:
        csv.writer(f, delimiter = ',').writerows(list(mapObject))
but I either get an error saying the list produced isn't a valid sequence or, if I have another function pass writethis a map object or list in the interpreter, a blank CSV file.
I'm not very familiar with the csv module. What's going on?
First, note that you open files for CSV writing differently in Python 3. You want
with open(name, 'w', newline='') as f:
instead. Second, writerows expects a sequence of sequences, and it seems like you're only passing it a sequence. If you only have one row to write, then
import csv
def writethis(name, mapObject):
    with open(name, 'w', newline='') as f:
        # writerow writes the whole list as a single row
        csv.writer(f, delimiter = ',').writerow(list(mapObject))
should work.
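A minimal sketch of the two shapes, written to an in-memory buffer so it can be run as-is. The helper to_csv_text and the sample map objects are invented for the example. It also shows that a map object can be iterated only once, which is one possible (assumed, not confirmed) reason for ending up with an empty file when the same object has already been consumed elsewhere.

```python
import csv
import io

def to_csv_text(rows):
    """Write rows (an iterable of row sequences) and return the CSV text."""
    buffer = io.StringIO(newline='')
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()

# A map of plain values is ONE row: wrap it in a list of rows.
squares = map(lambda n: n * n, range(4))
one_row = to_csv_text([list(squares)])

# The map is now exhausted; iterating it again yields nothing.
leftover = list(squares)

# A map whose items are themselves sequences is already a series of rows.
pairs = map(lambda n: (n, n * n), range(3))
many_rows = to_csv_text(pairs)

print(repr(one_row))
print(leftover)
print(repr(many_rows))
```

If the data is needed more than once, converting the map to a list a single time up front and passing that list around avoids the exhaustion problem.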

Resources