Error reading an ASCII-encoded CSV file? - python-3.x

I have a csv file named Qid-NamedEntityMapping.csv having data like this:
Q1000070 b'Myron V. George'
Q1000296 b'Fred (footballer, born 1979)'
Q1000799 b'Herbert Greenfield'
Q1000841 b'Stephen A. Northway'
Q1001203 b'Buddy Greco'
Q100122 b'Kurt Kreuger'
Q1001240 b'Buddy Lester'
Q1001867 b'Fyodor Stravinsky'
The second column is 'ascii' encoded, but when I read the file with the following code, it is still not read properly:
import chardet
import pandas as pd
def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc
my_encoding = find_encoding('datasets/KGfacts/Qid-NamedEntityMapping.csv')
df = pd.read_csv('datasets/KGfacts/Qid-NamedEntityMapping.csv', error_bad_lines=False, encoding=my_encoding)
But the output looks like this:
I also tried encoding='UTF-8', but the output is still the same.
What can be done to read it properly?

Looks like you have an improperly saved TSV file. Once you circumvent the TAB problem (as suggested in my comment), you can convert the column with names to a more suitable representation.
Let's assume that the second column of the dataframe is called "names". The b'XXX' thing is probably a bytes [mis]representation of a string. Convert it to a bytes object with ast.literal_eval and then decode to a string:
import ast
df["names"].apply(ast.literal_eval).apply(bytes.decode)
#0 Myron...
#1 Fred...
Last but not least, your problem has almost nothing to do with encodings or charsets.

Your issue looks like the CSV is actually tab separated; so you need to have sep='\t' in the read_csv function. It's reading everything else as a single column, except "born 1979" in the first row, as that is the only cell with a comma in it.
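Putting the two answers together, a minimal sketch might look like the following. It assumes the file has no header row and exactly two tab-separated columns; the column names "qid" and "names" and the in-memory sample are made up for illustration.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the file on disk: two tab-separated columns,
# no header row. The column names "qid" and "names" are assumptions.
raw = (
    "Q1000070\tb'Myron V. George'\n"
    "Q1000296\tb'Fred (footballer, born 1979)'\n"
    "Q1000799\tb'Herbert Greenfield'\n"
)

# sep="\t" keeps the comma inside "Fred (footballer, born 1979)" in one cell.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["qid", "names"])

# Turn the textual b'...' literal into real bytes, then decode it to str.
df["names"] = df["names"].apply(ast.literal_eval).apply(bytes.decode)

print(df)
```

With a real file, the io.StringIO(raw) argument would be replaced by the file path.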

Related

Store 2 different encoded data in a file in python

I have 2 types of encoded data
ibm037 encoded - a single delimiter variable - value is ###
UTF8 encoded - a pandas dataframe with 100s of columns.
Example dataframe:
Date Time
1 2
My goal is to write this data into a python file. The format should be:
### 1 2
In this way I need to have all the rows of the dataframe in a python file where the 1st character for every line is ###.
I tried to store this delimiter as a new first column in the pandas dataframe and then write it to the file, but it throws an error saying that two different encodings can't be written to a file.
Tried another way to write it:
# df_orig_data = pandas dataframe
# Record_Header = encoded delimiter
f = open("_All_DelimiterOfRecord.txt", "a")
for row in df_orig_data.itertuples(index=False):
    f.write(Record_Header)
    f.write(str(row))
f.close()
It also doesn't work.
Is this kind of data write even possible? How can I write these 2 encoded data in 1 file?
Edit:
from io import StringIO
import pandas as pd
StringData = StringIO(
    """Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
f = open("_All_DelimiterOfRecord.txt", "a")
for index, row in df_orig_data.iterrows():
    f.write(
        "\t".join(
            [
                str(Record_Header.encode("ibm037")),
                str(row["Date"]),
                str(row["Time"]),
            ]
        )
    )
f.close()
I would suggest doing the encoding yourself and writing a bytes object to the file. This isn't a situation where you can rely on the file's built-in encoding to do it.
That means that the program opens the file in binary mode (ab), all of the constants are byte-strings, and it works with byte-strings whenever possible.
The question doesn't say, but I assumed you probably wanted a UTF8 newline after each line, rather than an IBM newline.
I also replaced the file handling with a context manager, since that makes it impossible to forget to close a file after you're done.
import io
import pandas as pd
StringData = io.StringIO(
    """Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
with open("_All_DelimiterOfRecord.txt", "ab") as f:
    for index, row in df_orig_data.iterrows():
        f.write(Record_Header.encode("ibm037"))
        row_bytes = [str(cell).encode('utf8') for cell in row]
        f.write(b'\t'.join(row_bytes))
        # Note: this is a UTF-8 newline, not an IBM newline.
        f.write(b'\n')
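To check that such a mixed-encoding line can be read back, here is a rough round-trip sketch. The helper names encode_record and decode_record are hypothetical, and it assumes the header has a known, fixed byte length and that no cell contains a tab or newline.

```python
HEADER_CODEC = "ibm037"  # codec named in the question
BODY_CODEC = "utf8"


def encode_record(header, cells):
    """Build one line: EBCDIC header, then tab-separated UTF-8 cells."""
    body = b"\t".join(str(cell).encode(BODY_CODEC) for cell in cells)
    return header.encode(HEADER_CODEC) + body + b"\n"


def decode_record(line, header_len):
    """Split a line back into (header, cells); header_len is counted in bytes."""
    header = line[:header_len].decode(HEADER_CODEC)
    body = line[header_len:].rstrip(b"\n").decode(BODY_CODEC)
    return header, body.split("\t")


record = encode_record("2 ", [1, 2])
print(record)
print(decode_record(record, header_len=2))
```

Because ibm037 is a single-byte codec, the header length in bytes equals its length in characters, which is what lets the reader slice it off before decoding the rest as UTF-8.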

How do you maintain whitespace formatting within CSV column to be expressed with Selenium send_keys?

Python 3.7
Windows
CSV row data looks like this.
data,data,data,some text\n some {0} more data\n even more data\n,data
How do you keep the newlines and use format when using selenium?
import csv
payloads = []
with open(filepath, newline='') as _file:
    dgroups = csv.reader(_file, delimiter=',')
    for row in dgroups:
        bpost = {
            'name': row[1],
            'text': row[3],
        }
        ...
# Selenium section to send the formatted text to the browser.
Textbox.send_keys(payloads[i]['text'].format(payloads[i]['name']))
Expected
some text
some MYNAME more data
even more data
Actual
some text\n some {0} more data\n even more data\n
This happens because the CSV file contains the two literal characters backslash and n rather than a real newline, and csv.reader returns them as-is. In the string's repr, this shows up as a doubled backslash:
"some text\\n some {0} more data\\n even more data\\n"
To solve this, replace the literal \n sequences with real newlines before formatting:
Textbox.send_keys(payloads[i]['text'].replace("\\n", "\n").format(payloads[i]['name']))
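The replacement can be tried without a browser. This is a small sketch using an in-memory row; the name MYNAME in the second column is a made-up value, and the resulting string is what would be passed to send_keys.

```python
import csv
import io

# Hypothetical one-row CSV; the raw string keeps \n as backslash + n,
# exactly as it would sit in the file.
raw = r"data,MYNAME,data,some text\n some {0} more data\n even more data\n,data"

row = next(csv.reader(io.StringIO(raw), delimiter=","))
payload = {"name": row[1], "text": row[3]}

# Replace literal backslash-n with a real newline, then fill in the placeholder.
message = payload["text"].replace("\\n", "\n").format(payload["name"])
print(message)
```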

Remove double quotes while printing string in dataframe to text file

I have a dataframe which contains one column with multiple strings. Here is what the data looks like:
Value
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1
There are almost 100,000 such rows in the dataframe. I want to write this data into a text file.
For this, I tried the following:
df.to_csv(filename, header=None,index=None,mode='a')
But I am getting the entire string in quotes when I do this. The output I obtain is:
"EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
But what I want is:
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1 -> No Quotes
I also tried this:
df.to_csv(filename, header=None, index=None, mode='a', quoting=csv.QUOTE_NONE)
However, I get an error that an escapechar is required. If I add escapechar='/' to the code, I get '/' in multiple places (but no quotes). I don't want the '/' either.
Is there any way I can remove the quotes while writing to a text file WITHOUT adding any other escape characters?
Based on OP's comment, I believe the semicolon is messing things up. When I use tabs to delimit the CSV, I no longer get the unwanted \ characters.
import pandas as pd
import csv
df = pd.DataFrame(columns=['col'])
df.loc[0] = "EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
df.to_csv("out.csv", sep="\t", quoting=csv.QUOTE_NONE, quotechar="", escapechar="")
Original Answer:
According to this answer, you need to specify escapechar="\\" to use csv.QUOTE_NONE.
Have you tried:
df.to_csv("out.csv", sep=",", quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\")
I was able to write a df to a CSV using a single space as the separator, and to get rid of the quotes around strings, by replacing the existing in-string spaces in the dataframe with non-breaking spaces before writing it as a CSV.
df = df.applymap(lambda x: str(x).replace(' ', u"\u00A0"))
df.to_csv(outpath+filename, header=True, index=None, sep=' ', mode='a')
I couldn't use a tab-delimited file for what I was writing output for, though that solution also works with these additional keywords to df.to_csv(): quoting=csv.QUOTE_NONE, quotechar="", escapechar="".
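When the dataframe has a single column, as in the question, another option is to skip the CSV writer entirely and write each value as a plain line. This is only a sketch of that alternative, not something proposed in the answers above; the helper name write_lines is hypothetical.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"Value": ["EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"]}
)


def write_lines(frame, column, handle):
    """Write each value of one column as its own line, with no CSV quoting."""
    for value in frame[column]:
        handle.write(f"{value}\n")


# io.StringIO stands in for open(filename, "a") so the sketch has no side effects.
buffer = io.StringIO()
write_lines(df, "Value", buffer)
print(buffer.getvalue())
```

Since nothing is parsed or quoted, commas and semicolons in the values are written exactly as stored.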

Parse string representation of binary loaded from CSV [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
with open('data.csv','rt',encoding='utf-8') as f:
    reader = csv.reader(f,delimiter=',')
    for row in reader:
        print(row[3])
But, it doesn't decode the text. I cannot use .decode('utf-8') as the csv reader reads data as strings i.e. type(row[3]) is 'str' and I can't seem to convert it into bytes, the data gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though it's not really a valid format for CSV data to contain) would be to use Python's ast.literal_eval() function as @Ry suggested, although I would use it in a slightly different manner, as shown below.
This will provide a safe way to parse strings in the file which are prefixed with a b indicating they are byte-strings. The rest will be passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv
def _parse_bytes(field):
    """Convert string represented in Python byte-string literal b'' syntax into
    a decoded character string - otherwise return it unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        # Not a Python literal at all: pass the field through untouched.
        return field
    # Only decode real bytes; numeric and other literals stay as the original text.
    return result.decode() if isinstance(result, bytes) else field
def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]
reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast
def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
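As a quick check against the sample line from the question, the following sketch parses it from memory and decodes only the b'...' field. The helper name parse_bytes_field is hypothetical, and UTF-8 is assumed because the question says the text was encoded that way.

```python
import ast
import csv
import io


def parse_bytes_field(field):
    """Decode a field written as a b'...' literal; return anything else unchanged."""
    if not field.startswith(("b'", 'b"')):
        return field
    try:
        value = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        return field
    return value.decode("utf-8") if isinstance(value, bytes) else field


# Sample line from the question, kept raw so the \x escapes stay as text.
sample = (
    r"67783591545656656999,3415844,1450443669.0,"
    r"b'Virginia School District Closes After Backlash Over Arabic Assignment: "
    r"The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18"
)

for row in csv.reader(io.StringIO(sample)):
    decoded = [parse_bytes_field(field) for field in row]
    print(decoded[3])
```

The numeric columns stay as the strings csv.reader produced, so nothing needs to be downloaded or re-encoded.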

Reading tsv files with byte strings with pandas

I have a tsv file with a column containing utf-8 encoded byte strings (e.g., b'La croisi\xc3\xa8re'). I am trying to read this file with the pandas method read_csv, but what I get is a column of strings, not byte strings (e.g., "b'La croisi\xc3\xa8re'").
How can I read that column as byte strings instead of regular strings in Python 3? I tried to use dtype={'my_bytestr_col': bytes} in read_csv with no luck.
Another way to put it: How can I go from something like "b'La croisi\xc3\xa8re'" to b'La croisi\xc3\xa8re'?
sample file:
First Name Last Name bytes
0 foo bar b'La croisi\xc3\xa8re'
then try this:
import pandas as pd
import ast
df = pd.read_csv('file.tsv', sep='\t')
df['bytes'].apply(ast.literal_eval)
Out:
0 b'La croisi\xc3\xa8re'
Name: bytes, dtype: object
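The conversion can also be done while reading, using the converters parameter of read_csv. The sketch below assumes the column is named "bytes" as in the sample file and uses an in-memory stand-in for file.tsv.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for file.tsv; the doubled backslashes keep the
# \x escapes as plain text, the way they appear in the file.
raw = "First Name\tLast Name\tbytes\nfoo\tbar\tb'La croisi\\xc3\\xa8re'\n"

# converters applies ast.literal_eval to every cell of the "bytes" column.
df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    converters={"bytes": ast.literal_eval},
)

value = df["bytes"].iloc[0]
print(type(value), value)
```

If a decoded string is wanted later, value.decode("utf-8") turns the bytes object into text.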
