Seaborn Error: 'ascii' codec can't encode character '\xda' in position 4710: ordinal not in range(128) - python-3.x

I'm trying to create a bar graph similar to this example https://seaborn.pydata.org/examples/grouped_barplot.html
My code is as follows:
sns.set(style="whitegrid")
df_noshow = sns.load_dataset(df)
g = sns.catplot(x="Noshow", y="SMS_received", hue="Gender",
data=df_noshow, height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("Text Message Received")
I'm getting this error: UnicodeEncodeError: 'ascii' codec can't encode character '\xda' in position 4710: ordinal not in range(128)
Additionally, I'm not 100% sure I did the following code correctly:
df_noshow = sns.load_dataset(df)
I did create df earlier with
import pandas as pd
df= pd.read_csv('noshow2016.csv')
All the previous code has been working, and I can't imagine the Unicode error having anything to do with the CSV file not loading correctly, but I wanted to include it just in case. Thank you.

Related

Read data with multi delimiter in pyspark

I have an input file that looks like this and uses "|" as a multi-character delimiter:
162300111000000000106779"|"2005-11-16 14:12:32.860000000"|"1660320"|"0"|"2005-11-16 14:12:32.877000000"|""|""|""|""|""|""|""|"False"|"120600111000000000106776
I can read this type of record with UDF as below :
inputDf = (glueContext.sparkSession.read.option("delimiter", input_file_delimiter)
    .csv("s3://" + landing_bucket_name + "/" + input_file_name))
udf = UserDefinedFunction(lambda x: re.sub('"', '', str(x)))
new_df = inputDf.select(*[udf(column).alias(column) for column in inputDf.columns])
But when I get an input file like this:
000/00"|"AE71501"|"Complaint for Attachment of Earnings Order"|"In accordance with section test of the Attachment of Test Act Test."|"Non-Test"|"Other non-test offences"|"N"|"Other Non-Test"|"Non-Test
I get the exception below while reading it with the same UDF. My code fails at the exact location where I have my UDF:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 66: ordinal not in range(128)
Any help with the following would be great:
- Optimized code to read both types of files, treating "|" as the separator.
- How my existing UDF can handle the second type of input record.
This is likely caused by running in Python 2.x which has two separate types for string-like objects (unicode strings and non-unicode strings, which are nowadays simply byte sequences).
Spark will read in your data (which are bytes, as there is no such thing as plain text), and decode the lines as a sequence of Unicode strings. When you call str on a Unicode string that has a codepoint that is not in the ASCII range of codepoints, Python 2 will produce an error:
# python2.7
>>> unicode_str = u"ú"
>>> type(unicode_str)
<type 'unicode'>
>>> str(unicode_str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 0: ordinal not in range(128)
The recommended approach is to work with Unicode strings (the default string type in Python 3) throughout your program, with two exceptions: the point where you read or receive data, where you supply a suitable encoding to decode the raw bytes, and the point where you write or send data, where you use an encoding to turn the text back into bytes. This is called “the Unicode sandwich”.
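A minimal Python 3 sketch of that sandwich, assuming the input bytes are UTF-8. The sample bytes and the helper name `strip_quotes` are made up for illustration:

```python
# Hypothetical input: raw bytes as they might arrive from a file or socket.
raw = b'"Non-Test"|"caf\xc3\xa9"'

# Bottom slice: decode bytes to text once, at the input boundary.
text = raw.decode("utf-8")

def strip_quotes(s):
    # Work on text only; no str()/bytes conversion in the middle.
    return s.replace('"', "")

fields = [strip_quotes(f) for f in text.split("|")]

# Top slice: encode text to bytes once, at the output boundary.
out = "|".join(fields).encode("utf-8")
print(fields)
```

Everything between the two boundaries only ever sees text, so no implicit ASCII conversion can sneak in.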
Many libraries, including Spark, already decode bytes and encode unicode strings for you. If you simply remove the call to str in your user defined function, your code will likely work:
#pyspark shell using Python 2.7
>>> spark.sparkContext.setLogLevel("OFF") # to hide the big Py4J traceback that is dumped to the console, without modifying the log4j.properties file
>>> from py4j.protocol import Py4JJavaError
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> df = spark.read.csv("your_file.csv", sep="|")
>>> def strip_double_quotes_after_str_conversion(s):
...     import re
...     return re.sub('"', '', str(s))
...
>>> def strip_double_quotes_without_str_conversion(s):
...     import re
...     return re.sub('"', '', s)
...
>>> df.select(*[udf(strip_double_quotes_without_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
| _c0| _c1| _c2| _c3| _c4| _c5|_c6| _c7| _c8|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
|037/02|TH68150|Aggravated vehicl...|Contrary to secti...|Theft of motor ve...|Vehicle offences| Y|Aggravated Vehicl...|37.2|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
>>> try:
...     df.select(*[udf(strip_double_quotes_after_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
... except Py4JJavaError as e:
...     print("That failed. Root cause: %s" % e.java_exception.getCause().getMessage().rsplit("\n", 2)[-2])
...
That failed. Root cause: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 78: ordinal not in range(128)
So, the solution to the experienced problem is simple: don’t use str in your UDF.
Note that Python 2.x will no longer be maintained as of January 1st 2020. You’d do well transitioning to Python 3.x before that. In fact, had you executed this in a Python 3 interpreter, you would not have experienced the issue at all.

Trying to convert CSV to Excel file in python

I have tried different approaches and searched online for a solution, but the code below still does not work.
df_new = pd.read_csv(path+'output.csv')
writer = pd.ExcelWriter(path+'output.xlsx')
df_new.to_excel(writer, index = False)
writer.save()
I get the error below when I run it. I tried setting the encoding to latin, but that did not work. With ignore_error it runs, but does not produce any result. Please guide me.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
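A byte 0xff at position 0 is consistent with a UTF-16 byte-order mark (FF FE), which some tools write when exporting CSV. That is an assumption about this file, not something the question confirms. A minimal sketch of checking for a BOM before choosing an encoding, using only the standard library (the helper name `sniff_bom_encoding` and the demo file are hypothetical):

```python
import codecs
import os
import tempfile

def sniff_bom_encoding(path, default="utf-8"):
    """Guess a text encoding from a leading byte-order mark, if any."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    # Check UTF-32 first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default

# Demo: write a small UTF-16 file (Python adds the BOM) and sniff it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-16") as fh:
        fh.write("a,b\n1,2\n")
    detected = sniff_bom_encoding(demo_path)
print(detected)
```

The detected name could then be passed as the `encoding` argument of `pd.read_csv`. If no BOM is found, the helper only returns the default, so the file may still be in some other encoding.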

'charmap' codec can't encode characters in position XX

I have a simple script that attempts to extract multiple JSON objects from a single file and store them as a list:
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
with open(URL, 'r', encoding="utf-8") as handle:
    json_data = [json.loads(line) for line in handle]
print(json_data) # Can't .encode() because it's a list
Even after specifying utf-8 encoding, I'm still running into a codec error. If possible, I would also like to change this object into a dictionary, but this is as far as I've got.
The exact error reads:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 394-395: character maps to <undefined>
Thanks in advance.
I was able to solve this issue by removing the one Unicode character that was producing "<undefined>", the string '\ufeff'; after that, the rest displayed nicely. This required me to iterate over the dictionaries in the list and replace the character in each value as necessary.
import json
URL = r"C:\Users\Kenneth\Youtube_comment_parser\Testing.txt"
# Read the whole file as UTF-8 text, one JSON object per line
with open(URL, encoding='utf-8') as json1_file:
    json1_str = [d.strip() for d in json1_file.read().splitlines()]
json1_data = [json.loads(i) for i in json1_str]
# Remove the stray byte-order mark (U+FEFF) from every value
json1_data = [{key: value.replace('\ufeff', '') for key, value in d.items()}
              for d in json1_data]
print(json1_data[1]['text'].encode('utf-8'))
Still not sure why I have to open with utf-8 and then encode again with my print statement, but it produced the string nicely.
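On why both steps were needed: `open(..., encoding='utf-8')` decodes the file's bytes into `str`, while `print` encodes that `str` again for the console. On Windows the console encoding is often a legacy code page such as cp1252; that is an assumption about this setup, though the 'charmap' wording in the error points that way. A small sketch of both halves, with a made-up input line:

```python
import json

# Hypothetical line of input that begins with a UTF-8 byte-order mark.
raw = b'\xef\xbb\xbf{"text": "hello"}'

# Decoding with "utf-8" keeps the BOM as the character U+FEFF ...
with_bom = raw.decode("utf-8")
# ... while "utf-8-sig" drops a leading BOM during decoding.
without_bom = raw.decode("utf-8-sig")
record = json.loads(without_bom)

# U+FEFF has no mapping in cp1252, so encoding it for such a console fails.
try:
    with_bom.encode("cp1252")
    console_ok = True
except UnicodeEncodeError:
    console_ok = False

print(record, console_ok)
```

The same codec name works with `open(path, encoding="utf-8-sig")`. It only strips a BOM at the very start of the data, so BOM characters embedded elsewhere still need the `replace` call.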

Request() with scandinavian letters

When I am trying to request following URL that has the special letter ä:
req = Request(https://www.booli.se/slutpriser/frosunda/874692/?objectType=Lägenhet&page=1)
I get the following error when i run it:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 45: ordinal not in range(128)
Is there a special encode I should import in order to solve this error?
That's not valid code, and not a terribly clear use of the Requests class (assuming you're using requests).
Perhaps this will do what you want:
#!/usr/bin/python3
import requests
req = requests.get("https://www.booli.se/slutpriser/frosunda/874692/?objectType=Lägenhet&page=1")
print(req.text)
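If the `Request` in the question is `urllib.request.Request` from the standard library (an assumption; the question does not show its imports), the URL has to be percent-encoded before it is sent, because the request line must be ASCII. A minimal sketch, where the helper name `to_ascii_url` is hypothetical:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_ascii_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # Keep existing separators and already-encoded sequences untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

url = "https://www.booli.se/slutpriser/frosunda/874692/?objectType=Lägenhet&page=1"
ascii_url = to_ascii_url(url)
print(ascii_url)
```

`quote` encodes the text as UTF-8 by default, so `ä` becomes `%C3%A4`. The sketch does not handle non-ASCII host names, which would need IDNA encoding instead.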

Python3.x,pandas,csv,utf-8 error

I am trying to import a dataset using pandas and getting following error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10: invalid start byte
I read about encoding and tried to use it as
df=pd.read_csv("file.csv",encoding="ISO-xxxx")
It raised an invalid syntax error.
I am sharing the link to my data if you guys want to have a look: https://www.kaggle.com/venkatramakrishnan/india-water-quality-data
import pandas as pd
df = pd.read_csv('IndiaAffectedWaterQualityAreas.csv',encoding = 'latin-1')
The code above is one solution, written with Python 3.6 and pandas 0.20.1.
Why does this problem occur?
The file contains byte sequences that are not valid UTF-8 (here, 0xa0 where a start byte is expected), and UTF-8 is the encoding pandas uses by default. If you have the raw data, try creating the CSV with pandas using the following code:
df.to_csv('IndiaAffectedWaterQualityAreas.csv',encoding = 'latin-1')
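To see why the byte 0xa0 trips the UTF-8 decoder but not Latin-1, here is a small sketch with a made-up byte string. Note that Latin-1 maps every byte to some character, so it never raises; that also means it cannot tell you whether the file was really written in another encoding such as cp1252.

```python
# Hypothetical CSV fragment: "Area" + byte 0xa0 + "Name", as a Latin-1 file would store it.
raw = b"Area\xa0Name"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc.reason)

# In Latin-1, byte 0xa0 is U+00A0 (no-break space).
text = raw.decode("latin-1")
print(repr(text))
```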
