csv to pandas.DataFrame while keeping data original encoding - python-3.x

I have a csv file with some utf8 unicode characters in it, which I want to load into a pandas.DataFrame while keeping the unicode characters as is, not escaping them.
Input .csv:
letter,unicode_primary,unicode_alternatives
8,\u0668,"\u0668,\u06F8"
Code:
df = pd.DataFrame.from_csv("file.csv")
print(df.loc[0].unicode_primary)
Result:
> \\u0668
Desired Result:
> \u0668
or
> 8

Please use read_csv instead of from_csv as follows.
df = pd.DataFrame(pd.read_csv("file.csv", encoding = 'utf_8'))
print(df.loc[0].unicode_primary)

Related

Store 2 different encoded data in a file in python

I have 2 types of encoded data
ibm037 encoded - a single delimiter variable - value is ###
UTF8 encoded - a pandas dataframe with 100s of columns.
Example dataframe:
Date Time
1 2
My goal is to write this data into a python file. The format should be:
### 1 2
In this way I need to have all the rows of the dataframe in a python file where the 1st character for every line is ###.
I tried to store this character at the first location in the pandas dataframe as a new column and then write to the file but it throws error saying that two different encodings can't be written to a file.
Tried another way to write it:
df_orig_data = pandas dataframe,
Record_Header = encoded delimiter
f = open("_All_DelimiterOfRecord.txt", "a")
for row in df_orig_data.itertuples(index=False):
f.write(Record_Header)
f.write(str(row))
f.close()
It also doesn't work.
Is this kind of data write even possible? How can I write these 2 encoded data in 1 file?
Edit:
StringData = StringIO(
"""Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
f = open("_All_DelimiterOfRecord.txt", "a")
for index, row in df_orig_data.iterrows():
f.write(
"\t".join(
[
str(Record_Header.encode("ibm037")),
str(row["Date"]),
str(row["Time"]),
]
)
)
f.close()
I would suggest doing the encoding yourself, and writing a bytes object to the file. This isn't a situation where you can rely on the built-in encoding do it.
That means that the program opens the file in binary mode (ab), all of the constants are byte-strings, and it works with byte-strings whenever possible.
The question doesn't say, but I assumed you probably wanted a UTF8 newline after each line, rather than an IBM newline.
I also replaced the file handling with a context manager, since that makes it impossible to forget to close a file after you're done.
import io
import pandas as pd
StringData = io.StringIO(
"""Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
with open("_All_DelimiterOfRecord.txt", "ab") as f:
for index, row in df_orig_data.iterrows():
f.write(Record_Header.encode("ibm037"))
row_bytes = [str(cell).encode('utf8') for cell in row]
f.write(b'\t'.join(row_bytes))
# Note: this is an UTF8 newline, not an IBM newline.
f.write(b'\n')

Python: How to Remove range of Characters \x91\x87\xf0\x9f\x91\x87 from File

I have this file with some lines that contain some unicode literals like:
"b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
I want to remove those xe2\x80\x99 like characters.
I can remove them if I declare a string that contains these characters but my solutions don't work when reading from a CSV file. I used pandas to read the file.
SOLUTIONS TRIED
1.Regex
2.Decoding and Encoding
3.Lambda
Regex Solution
line = "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
code = (re.sub(r'[^\x00-\x7f]',r'', line))
print (code)
LAMBDA SOLUTION
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)
ENCODING SOLUTION
code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)
HOW FILE WAS READ
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
print(stripped(row['text']))
print(re.sub(r'[^\x00-\x7f]',r'', row['text']))
print(row['text'].encode('ascii', 'ignore')).decode("utf-8"))
SUGGESTED METHOD
df = pandas.read_csv('file.csv',encoding = "utf-8")
for index, row in df.iterrows():
en = row['text'].encode()
print(type(en))
newline = en.decode('utf-8')
print(type(newline))
print(repr(newline))
print(newline.encode('ascii', 'ignore'))
print(newline.encode('ascii', 'replace'))
Your string is valid utf-8. Therefore it can be directly converted to a python string.
You can then encode it to ascii with str.encode(). It can ignore non-ascii characters with 'ignore'.
Also possible: 'replace'
line_raw = b'Who\xe2\x80\x99s he?'
line = line_raw.decode('utf-8')
print(repr(line))
print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))
'Who’s he?'
b'Whos he?'
b'Who?s he?'
To come back to your original question, your 3rd method was correct. It was just in the wrong order.
code3 = line.decode("utf-8").encode('ascii', 'ignore')
print(code3)
To finally provide a working pandas example, here you go:
import pandas
df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
print(row['text'].encode('ascii', 'ignore'))
There is no need to do decode('utf-8'), because pandas does that for you.
Finally, if you have a python string that contains non-ascii characters, you can just strip them by doing
text = row['text'].encode('ascii', 'ignore').decode('ascii')
This converts the text to ascii bytes, strips all the characters that cannot be represented as ascii, and then converts back to text.
You should look up the difference between python3 strings and bytes, that should clear things up for you, I hope.

Error in reading a ascii encoded csv file?

I have a csv file named Qid-NamedEntityMapping.csv having data like this:
Q1000070 b'Myron V. George'
Q1000296 b'Fred (footballer, born 1979)'
Q1000799 b'Herbert Greenfield'
Q1000841 b'Stephen A. Northway'
Q1001203 b'Buddy Greco'
Q100122 b'Kurt Kreuger'
Q1001240 b'Buddy Lester'
Q1001867 b'Fyodor Stravinsky'
The second column is 'ascii' encoded, and when I am reading the file using the following code, then also it not being read properly:
import chardet
import pandas as pd
def find_encoding(fname):
r_file = open(fname, 'rb').read()
result = chardet.detect(r_file)
charenc = result['encoding']
return charenc
my_encoding = find_encoding('datasets/KGfacts/Qid-
NamedEntityMapping.csv')
df = pd.read_csv('datasets/KGfacts/Qid-
NamedEntityMapping.csv',error_bad_lines=False, encoding=my_encoding)
But the output looks like this:
Also, I tried to use encoding='UTF-8'. but still, the output is the same.
What can be done to read it properly?
Looks like you have an improperly saved TSV file. Once you circumvent the TAB problem (as suggested in my comment), you can convert the column with names to a more suitable representation.
Let's assume that the second column of the dataframe is called "names". The b'XXX' thing is probably a bytes [mis]representation of a string. Convert it to a bytes object with ast.literal_eval and then decode to a string:
import ast
df["names"].apply(ast.literal_eval).apply(bytes.decode)
#0 Myron...
#1 Fred...
Last but not least, your problem has almost nothing to do with encodings or charsets.
Your issue looks like the CSV is actually tab separated; so you need to have sep='\t' in the read_csv function. It's reading everything else as a single column, except "born 1979" in the first row, as that is the only cell with a comma in it.

Reading tsv files with byte strings with pandas

I have a tsv file with a column containing utf-8 encoded byte strings (e.g., b'La croisi\xc3\xa8re'). I am trying to read this file with the pandas method read_csv, but what I get is a column of strings, not byte strings (e.g., "b'La croisi\xc3\xa8re'").
How can I read that column as byte strings instead of regular strings in Python 3? I tried to use dtype={'my_bytestr_col': bytes} in read_csv with no luck.
Another way to put it: How can I go from something like "b'La croisi\xc3\xa8re'" to b'La croisi\xc3\xa8re'?
sample file:
First Name Last Name bytes
0 foo bar b'La croisi\xc3\xa8re'
then try this:
import pandas as pd
import ast
df = pd.read_csv('file.tsv', sep='\t')
df['bytes'].apply(ast.literal_eval)
Out:
0 b'La croisi\xc3\xa8re'
Name: bytes, dtype: object

How to convert a tab delimited text file to a csv file in Python

I have the following problem:
I want to convert a tab delimited text file to a csv file. The text file is the SentiWS dictionary which I want to use for a sentiment analysis ( https://github.com/MechLabEngineering/Tatort-Analyzer-ME/tree/master/SentiWS_v1.8c ).
The code I used to do this is the following:
txt_file = r"SentiWS_v1.8c_Positive.txt"
csv_file = r"NewProcessedDoc.csv"
in_txt = csv.reader(open(txt_file, "r"), delimiter = '\t')
out_csv = csv.writer(open(csv_file, 'w'))
out_csv.writerows(in_txt)
This code writes everything in one row but I need the data to be in three rows as normally intended from the file itself. There is also a blank line under each data and I don´t know why.
I want the data to be in this form:
Row1 Row2 Row3
Word Data Words
Word Data Words
instead of
Row1
Word,Data,Words
Word,Data,Words
Can anyone help me?
import pandas
It will convert tab delimiter text file into dataframe
dataframe = pandas.read_csv("SentiWS_v1.8c_Positive.txt",delimiter="\t")
Write dataframe into CSV
dataframe.to_csv("NewProcessedDoc.csv", encoding='utf-8', index=False)
Try this:
import csv
txt_file = r"SentiWS_v1.8c_Positive.txt"
csv_file = r"NewProcessedDoc.csv"
with open(txt_file, "r") as in_text:
in_reader = csv.reader(in_text, delimiter = '\t')
with open(csv_file, "w") as out_csv:
out_writer = csv.writer(out_csv, newline='')
for row in in_reader:
out_writer.writerow(row)
There is also a blank line under each data and I don´t know why.
You're probably using a file created or edited in a Windows-based text editor. According to the Python 3 csv module docs:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.

Resources