Reading tsv files with byte strings with pandas - python-3.x

I have a tsv file with a column containing utf-8 encoded byte strings (e.g., b'La croisi\xc3\xa8re'). I am trying to read this file with the pandas method read_csv, but what I get is a column of strings, not byte strings (e.g., "b'La croisi\xc3\xa8re'").
How can I read that column as byte strings instead of regular strings in Python 3? I tried to use dtype={'my_bytestr_col': bytes} in read_csv with no luck.
Another way to put it: How can I go from something like "b'La croisi\xc3\xa8re'" to b'La croisi\xc3\xa8re'?

sample file:
First Name Last Name bytes
0 foo bar b'La croisi\xc3\xa8re'
then try this:
import pandas as pd
import ast
df = pd.read_csv('file.tsv', sep='\t')
df['bytes'].apply(ast.literal_eval)
Out:
0 b'La croisi\xc3\xa8re'
Name: bytes, dtype: object
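The core of the conversion can be tried without a file. The following is a minimal sketch, assuming every cell holds a valid Python bytes literal; the helper name parse_bytes_literal is hypothetical and not part of pandas.

```python
import ast


def parse_bytes_literal(cell: str) -> bytes:
    """Turn the text "b'...'" into the bytes object it spells out."""
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(value, bytes):
        raise ValueError(f"not a bytes literal: {cell!r}")
    return value


# The cell text as read_csv returns it: a str that merely looks like bytes.
raw_cell = "b'La croisi\\xc3\\xa8re'"

as_bytes = parse_bytes_literal(raw_cell)
as_text = as_bytes.decode("utf-8")  # optional: decode to a regular str
```

A function like this could also be passed to read_csv through its converters argument, for example converters={'bytes': parse_bytes_literal}, so that the column is converted while the file is being read.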

Related

Store 2 different encoded data in a file in python

I have 2 types of encoded data
ibm037 encoded - a single delimiter variable - value is ###
UTF8 encoded - a pandas dataframe with 100s of columns.
Example dataframe:
Date Time
1 2
My goal is to write this data into a python file. The format should be:
### 1 2
In this way I need to have all the rows of the dataframe in a python file where the 1st character for every line is ###.
I tried to store this character at the first location in the pandas dataframe as a new column and then write it to the file, but it throws an error saying that two different encodings can't be written to a file.
Tried another way to write it:
df_orig_data = pandas dataframe,
Record_Header = encoded delimiter
f = open("_All_DelimiterOfRecord.txt", "a")
for row in df_orig_data.itertuples(index=False):
    f.write(Record_Header)
    f.write(str(row))
f.close()
It also doesn't work.
Is this kind of data write even possible? How can I write these 2 encoded data in 1 file?
Edit:
StringData = StringIO(
"""Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
f = open("_All_DelimiterOfRecord.txt", "a")
for index, row in df_orig_data.iterrows():
    f.write(
        "\t".join(
            [
                str(Record_Header.encode("ibm037")),
                str(row["Date"]),
                str(row["Time"]),
            ]
        )
    )
f.close()
I would suggest doing the encoding yourself and writing bytes objects to the file. This isn't a situation where you can rely on the file's built-in encoding to do it for you.
That means that the program opens the file in binary mode (ab), all of the constants are byte-strings, and it works with byte-strings whenever possible.
The question doesn't say, but I assumed you probably wanted a UTF8 newline after each line, rather than an IBM newline.
I also replaced the file handling with a context manager, since that makes it impossible to forget to close a file after you're done.
import io
import pandas as pd
StringData = io.StringIO(
"""Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
with open("_All_DelimiterOfRecord.txt", "ab") as f:
    for index, row in df_orig_data.iterrows():
        f.write(Record_Header.encode("ibm037"))
        row_bytes = [str(cell).encode('utf8') for cell in row]
        f.write(b'\t'.join(row_bytes))
        # Note: this is a UTF8 newline, not an IBM newline.
        f.write(b'\n')
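To check that such a mixed file can be read back, the same idea can be tried in memory. This is a small sketch, assuming the delimiter is the "###" from the question and that a reader knows the header is exactly three bytes long; encode_record and the sample rows are made up for illustration.

```python
import io

# "cp037" is Python's canonical name for the codec also reachable as "ibm037".
HEADER = "###".encode("cp037")


def encode_record(cells):
    """Return one line: EBCDIC header, UTF-8 tab-separated cells, UTF-8 newline."""
    body = b"\t".join(str(cell).encode("utf-8") for cell in cells)
    return HEADER + body + b"\n"


buf = io.BytesIO()  # stands in for a file opened with mode "ab"
for row in [(1, 2), (1, 2)]:
    buf.write(encode_record(row))

# Reading back: slice off the fixed-size header, decode each part separately.
first_line = buf.getvalue().split(b"\n")[0]
header_text = first_line[: len(HEADER)].decode("cp037")
body_text = first_line[len(HEADER):].decode("utf-8")
```

Because the two parts use different encodings, the file as a whole cannot be decoded in one pass; a reader has to know where the header ends.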

Read CSV using pandas

I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')
However, I just need the dataset with the two columns measure_code and measure_text. As this dataset has a title, BP measure, I also tried to read it like this:
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)
But in this case it returns a dataset with just one column, and I am not able to split it:
>>> df.columns
Index([' measure_code measure_text'], dtype='object')
Any suggestion/idea on a better approach to get this dataset?
It's definitely possible, but the format has a few quirks.
1. As you noted, the column headers start on line 2, so you need skiprows=1.
2. The file is space-separated, not tab-separated.
3. Column values are continued across multiple lines.
Issues 1 and 2 can be fixed using skiprows and sep. Problem 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix problem 3, and give the file back to pandas.
Here's the code:
import requests
import re
import io
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url).text.replace("\r\n", "\n")
# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)
# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))
# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the title line above the column headers
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")
print(df)
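The two regular expressions can be tried offline on a small string. This is only a sketch: the sample text below is invented to mimic the layout described above and is not the real contents of the BLS file.

```python
import re

# Made-up sample: a title line, a header line, and one value that wraps
# onto a continuation line indented by more than 7 spaces.
sample = (
    "BP measure\r\n"
    "    measure_code    measure_text\r\n"
    "    1    First part of a long\r\n"
    "           description\r\n"
    "    2    Short description\r\n"
)

text = sample.replace("\r\n", "\n")   # DOS to Unix line breaks
text = re.sub("\n {7,}", " ", text)   # glue continuation lines to the previous line

lines = text.splitlines()[1:]         # drop the title line, like skiprows=1
header, *rows = [re.split(" {4,}", line.strip()) for line in lines]
```

Here re.split plays the role of sep=" {4,}" in read_csv: any run of four or more spaces is treated as a column boundary, while single spaces inside a description are left alone.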

How to write a list of floats to csv in columns?

I am searching everywhere for a method to write a list of floats to a csv file, but it must be in column format.
My code for writing the csv is as follows:
csvfile=open('Test.csv','w', newline='')
obj=csv.writer(csvfile)
obj.writerow(list_dis_B1_avg)
csvfile.close()
It turns out that the floats are written in a row.
I have a list of floats stored under "list_dis_B1_avg".
How can I write it as a column instead?
You don't need the csv module to do that:
with open("Test.csv", "w") as f:  # use with to close the file in any case
    # str.join() only accepts strings, so convert each float first
    f.write("\n".join(str(x) for x in list_dis_B1_avg))  # newline between the elements
More about the with keyword: https://www.geeksforgeeks.org/with-statement-in-python/
More about str.join(): https://www.programiz.com/python-programming/methods/string/join
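If you would rather keep csv.writer, as in the question, a different approach is to hand it one single-element row per value. The sketch below writes to an in-memory buffer instead of Test.csv, and the sample values are made up.

```python
import csv
import io

list_dis_B1_avg = [1.5, 2.25, 3.0]  # made-up sample values

buf = io.StringIO(newline="")       # stands in for open('Test.csv', 'w', newline='')
writer = csv.writer(buf)
# Each value becomes its own row, so the output is a single column.
writer.writerows([value] for value in list_dis_B1_avg)

column_lines = buf.getvalue().splitlines()
```

The csv module converts the floats to text itself, so no explicit str() call is needed here.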

Error in reading a ascii encoded csv file?

I have a csv file named Qid-NamedEntityMapping.csv having data like this:
Q1000070 b'Myron V. George'
Q1000296 b'Fred (footballer, born 1979)'
Q1000799 b'Herbert Greenfield'
Q1000841 b'Stephen A. Northway'
Q1001203 b'Buddy Greco'
Q100122 b'Kurt Kreuger'
Q1001240 b'Buddy Lester'
Q1001867 b'Fyodor Stravinsky'
The second column is 'ascii' encoded, and when I read the file using the following code, it is still not read properly:
import chardet
import pandas as pd
def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc
my_encoding = find_encoding('datasets/KGfacts/Qid-NamedEntityMapping.csv')
df = pd.read_csv('datasets/KGfacts/Qid-NamedEntityMapping.csv', error_bad_lines=False, encoding=my_encoding)
But the output looks like this:
I also tried encoding='UTF-8', but the output is still the same.
What can be done to read it properly?
Looks like you have an improperly saved TSV file. Once you circumvent the TAB problem (as suggested in my comment), you can convert the column with names to a more suitable representation.
Let's assume that the second column of the dataframe is called "names". The b'XXX' thing is probably a bytes [mis]representation of a string. Convert it to a bytes object with ast.literal_eval and then decode to a string:
import ast
df["names"].apply(ast.literal_eval).apply(bytes.decode)
#0 Myron...
#1 Fred...
Last but not least, your problem has almost nothing to do with encodings or charsets.
Your issue looks like the CSV is actually tab separated; so you need to have sep='\t' in the read_csv function. It's reading everything else as a single column, except "born 1979" in the first row, as that is the only cell with a comma in it.
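Both points, the tab separator and the bytes literals, can be checked on two of the sample rows without pandas. This is a sketch that assumes the columns really are separated by a tab character; it uses the standard csv module in place of read_csv(sep='\t').

```python
import ast
import csv
import io

# Two rows from the question, assuming a tab between the columns.
sample = (
    "Q1000070\tb'Myron V. George'\n"
    "Q1000296\tb'Fred (footballer, born 1979)'\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
records = [
    # literal_eval rebuilds the bytes object, decode turns it into a str
    (qid, ast.literal_eval(raw_name).decode("ascii"))
    for qid, raw_name in reader
]
```

With a tab as the delimiter, the comma in "Fred (footballer, born 1979)" no longer splits the cell.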

csv to pandas.DataFrame while keeping data original encoding

I have a csv file with some utf8 unicode characters in it, which I want to load into a pandas.DataFrame while keeping the unicode characters as is, not escaping them.
Input .csv:
letter,unicode_primary,unicode_alternatives
8,\u0668,"\u0668,\u06F8"
Code:
df = pd.DataFrame.from_csv("file.csv")
print(df.loc[0].unicode_primary)
Result:
> \\u0668
Desired Result:
> \u0668
or
> 8
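Use read_csv instead of from_csv; from_csv was deprecated and has since been removed from pandas. read_csv already returns a DataFrame, so no extra wrapping is needed:
df = pd.read_csv("file.csv", encoding="utf_8")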
print(df.loc[0].unicode_primary)
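If the file stores the escape sequence itself (a backslash, the letter u, and four hex digits) rather than the character, reading it as UTF-8 will not turn it into the character. The following is a sketch under that assumption; decode_escapes is a hypothetical helper, not a pandas function.

```python
def decode_escapes(cell: str) -> str:
    """Interpret literal \\uXXXX sequences in an ASCII-only string."""
    # Assumption: the cell contains only ASCII characters. A cell that
    # already holds real non-ASCII characters would make encode() fail.
    return cell.encode("ascii").decode("unicode_escape")


cell = "\\u0668"                 # six characters, as stored in the file
decoded = decode_escapes(cell)   # a single character, U+0668

alternatives = decode_escapes("\\u0668,\\u06F8")
```

A helper like this could be applied to the affected columns after loading, for example with df["unicode_primary"].apply(decode_escapes).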
