I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
reader = csv.reader(f,delimiter=',')
for row in reader:
print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python syntax b prefixes on them, one way to workaround it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval() function as #Ry suggested — although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
"""Convert string represented in Python byte-string literal b'' syntax into
a decoded character string - otherwise return it unchanged.
"""
result = field
try:
result = ast.literal_eval(field)
finally:
return result.decode() if isinstance(result, bytes) else result
def my_csv_reader(filename, /, **kwargs):
with open(filename, 'r', newline='') as file:
for row in csv.reader(file, **kwargs):
yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
result = ast.literal_eval(bytes_repr)
if not isinstance(result, bytes):
raise ValueError("Malformed bytes repr")
return result
# save-webpage.py (To write first 100 characters of html source into 'simple.html')
import urllib.request, io, sys
f = urllib.request.urlopen('https://news.google.com')
webContent = f.read(100)
#g = io.open('simple.html', 'w', encoding='UTF-8')
g = io.open('simple.html', 'w')
#g.write(webContent)
g.write(webContent.decode("UTF-8"))
g.close()
2019-01-11: See above for corrected working code after answers were received. Thanks guys.
Original question:
Upon execution, the file, simple.html, is created with 0 bytes.
Along with an error:
TypeError: must be str, not bytes.
Please help. I've gone about this several ways but to no avail. Thank you in advance!
g.write(webContent.decode("utf-8"))
File objects opened in text mode require you to write Unicode Text.
In this line you encoded to UTF-8 bytes
g = io.open('simple.html', 'w', encoding='UTF-8')
You could either not encode or try decoding it after.
I'm trying to add some information to a row in a CSV-file, but in the updated CSV-file the special characters (², ±) aren't encoded well. Let me clarify:
So basically I have a CSV-file with the following example row:
row = ["Some info", "Cell with special characters (², ±)", "Some more info"]
I'm trying to add some text to the second column, so I wrote the following code
import csv
new_rows = []
# Open CSV file
with open('file_1.csv', 'r', encoding='UTF-8') as f1
reader = csv.reader(f1, delimiter=',')
for row in reader:
# Add text to cell, save row to memory
row[1] = row[1] + " is causing trouble"
new_rows.append(row)
# Write rows in new CSV-file
with open('file_2.csv', 'w', encoding='UTF-8', newline='') as of:
writer = csv.writer(of, delimiter=',')
writer.writerows(new_rows)
The output in the second column I expected would be as follows:
"Cell with special characters (², ±) is causing troubles"
But instead I get the following output:
"Cell with special characters (??, ??) is causing troubles"
I tried to solve the problem by changing the encoding to Latin-1, or even by not mentioning the encoding in the code at all, but nothing seems to work!
What am I missing or doing wrong here?
Your source file is probably latin-1, you are encoding mojibakes to utf-8 in the second file. Your code has nothing wrong.
You just need to decide if you want to produce a latin1 or utf-8 csv ( and maybe add a BOM https://github.com/pyexcel/pyexcel-io/issues/30 ) as a destination after reading the source in the good encoding ( maybe you can check with a hexadecimal dump how it is encoded and give us some clues ).
How do I split a text string according to an explicit newline ('\n')?
Unfortunately, instead of a properly formatted csv file, I am dealing with a long string of text with "\n" where the newline would be. (example format: "A0,B0\nA1,B1\nA2,B2\nA3,B3\n ...") I thought a simple bad_csv_list = text.split('\n') would give me a list of the two-valued cells (example split ['A0,B0', 'A1,B1', 'A2,B2', 'A3,B3', ...]). Instead I end up with one cell and "\n" gets converted to "\\n". I tried copy-pasting a section of the string and using split('\n') and it works as I had hoped. The print statement for the file object tells me the following:
<_io.TextIOWrapper name='stats.csv' mode='r' encoding='cp1252'>
...so I suspect the problem is with the cp1252 encoding? Of note tho: Notepad++ says the file I am working with is "UTF-8 without BOM"... I've looked in the docs and around SO and tried importing io and codec and prepending the open statement and declaring encoding='utf8' but I am at a loss and I don't really grok text encoding. Maybe there is a better solution?
from sys import argv
# import io, codec
filename = argv[1]
file_object = open(filename, 'r')
# file_object = io.open(filename, 'r', encoding='utf8')
# file_object = codec.open(filename, 'r', encoding='utf8')
file_contents = file_object.read()
file_list = file_contents.split('\n')
print("1.) Here's the name of the file: {}".format(filename))
print("2.) Here's the file object info: {}".format(file_object))
print("3.) Here's all the files contents:\n{}".format(file_contents))
print("4.) Here's a list of the file contents:\n{}".format(file_list))
Any help would be greatly appreciated, thank you.
If it helps to explain what I am dealing with, here's the contents of the stats.csv file:
Albuquerque,749\nAnaheim,371\nAnchorage,828\nArlington,503\nAtlanta,1379\nAurora,425\nAustin,408\nBakersfield,542\nBaltimore,1405\nBoston,835\nBuffalo,1288\nCharlotte-Mecklenburg,647\nCincinnati,974\nCleveland,1383\nColorado Springs,455\nCorpus Christi,658\nDallas,675\nDenver,615\nDetroit,2122\nEl Paso,423\nFort Wayne,362\nFort Worth,587\nFresno,543\nGreensboro,563\nHenderson,168\nHouston,992\nIndianapolis,1185\nJacksonville,617\nJersey City,734\nKansas City,1263\nLas Vegas,784\nLexington,352\nLincoln,397\nLong Beach,575\nLos Angeles,481\nLouisville Metro,598\nMemphis,1750\nMesa,399\nMiami,1172\nMilwaukee,1294\nMinneapolis,992\nMobile,522\nNashville,1216\nNew Orleans,815\nNew York,639\nNewark,1154\nOakland,1993\nOklahoma City,919\nOmaha,594\nPhiladelphia,1160\nPhoenix,636\nPittsburgh,752\nPlano,130\nPortland,517\nRaleigh,423\nRiverside,443\nSacramento,738\nSan Antonio,503\nSan Diego,413\nSan Francisco,704\nSan Jose,363\nSanta Ana,401\nSeattle,597\nSt. Louis,1776\nSt. Paul,722\nStockton,1548\nTampa,616\nToledo,1171\nTucson,724\nTulsa,990\nVirginia Beach,169\nWashington,1177\nWichita,742
And the result from the split('\n'):
['Albuquerque,749\\nAnaheim,371\\nAnchorage,828\\nArlington,503\\nAtlanta,1379\\nAurora,425\\nAustin,408\\nBakersfield,542\\nBaltimore,1405\\nBoston,835\\nBuffalo,1288\\nCharlotte-Mecklenburg,647\\nCincinnati,974\\nCleveland,1383\\nColorado Springs,455\\nCorpus Christi,658\\nDallas,675\\nDenver,615\\nDetroit,2122\\nEl Paso,423\\nFort Wayne,362\\nFort Worth,587\\nFresno,543\\nGreensboro,563\\nHenderson,168\\nHouston,992\\nIndianapolis,1185\\nJacksonville,617\\nJersey City,734\\nKansas City,1263\\nLas Vegas,784\\nLexington,352\\nLincoln,397\\nLong Beach,575\\nLos Angeles,481\\nLouisville Metro,598\\nMemphis,1750\\nMesa,399\\nMiami,1172\\nMilwaukee,1294\\nMinneapolis,992\\nMobile,522\\nNashville,1216\\nNew Orleans,815\\nNew York,639\\nNewark,1154\\nOakland,1993\\nOklahoma City,919\\nOmaha,594\\nPhiladelphia,1160\\nPhoenix,636\\nPittsburgh,752\\nPlano,130\\nPortland,517\\nRaleigh,423\\nRiverside,443\\nSacramento,738\\nSan Antonio,503\\nSan Diego,413\\nSan Francisco,704\\nSan Jose,363\\nSanta Ana,401\\nSeattle,597\\nSt. Louis,1776\\nSt. Paul,722\\nStockton,1548\\nTampa,616\\nToledo,1171\\nTucson,724\\nTulsa,990\\nVirginia Beach,169\\nWashington,1177\\nWichita,742']
Why does it ADD a \ ?
dOh!!! ROYAL FACE PALM! I just wrote all this out an then realized that all I needed to do was put an escape slash before the \newline:
file_list = file_contents.split('\\n')
I'm gonna post this anyways so y'all can have a chuckle ^_^
I have a map object x such that
list(x)
produces a list. I am presently trying:
def writethis(name, mapObject):
import csv
with open(name, 'wb') as f:
csv.writer(f, delimiter = ',').writerows(list(mapObject))
but get either get an error saying the list produced isn't a valid sequence, or, if I have another function pass writethis a map object or list in the interpreter, a blank CSV file.
I'm not very familiar with the csv module. What's going on?
First, note that you open files for CSV writing differently in Python 3. You want
with open(name, 'w', newline='') as f:
instead. Second, writerows expects a sequence of sequences, and it seems like you're only passing it a sequence. If you only have one row to write, then
def writethis(name, mapObject):
with open(name, 'w', newline='') as f:
csv.writer(f, delimiter = ',').writerow(list(mapObject))
should work.