I am trying to import a dataset using pandas and I am getting the following error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10: invalid start byte
I read about encodings and tried passing one, as in:
df = pd.read_csv("file.csv", encoding="ISO-xxxx")
but that failed with an invalid syntax error.
I am sharing the link to my data if you guys want to have a look: https://www.kaggle.com/venkatramakrishnan/india-water-quality-data
import pandas as pd
df = pd.read_csv('IndiaAffectedWaterQualityAreas.csv',encoding = 'latin-1')
The above code is one solution, written with Python 3.6 and pandas 0.20.1.
Why does this problem occur?
The file contains some special characters that the default utf-8 codec cannot decode. If you have the raw data, try writing the CSV with pandas using the following code:
df.to_csv('IndiaAffectedWaterQualityAreas.csv', encoding='latin-1')
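If you are not sure which encoding a file uses, one option (a sketch, assuming the third-party chardet package is installed) is to guess it from a sample of the raw bytes:
import chardet
import pandas as pd

# Guess the encoding from the first ~100 KB of raw bytes (a heuristic, not a guarantee)
with open('IndiaAffectedWaterQualityAreas.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))

df = pd.read_csv('IndiaAffectedWaterQualityAreas.csv', encoding=guess['encoding'])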
I'm trying to load a pickle file with importlib.resources, but I'm getting the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The bit that is raising the error is:
with importlib.resources.open_text("directory_with_pickle_file", "pickle_file.pkl") as f:
    data = pickle.load(f)
I'm certain that the file (pickle_file.pkl) was created with pickle.dump.
What am I doing wrong?
Through lots of trial and error I figured out that importlib.resources has a read_binary function which can be used to read pickled files like so:
raw = importlib.resources.read_binary("directory_with_pickle_file", "pickle_file.pkl")  # returns bytes
data = pickle.loads(raw)
Here, data is the unpickled object.
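For comparison, importlib.resources also has an open_binary function (Python 3.7+, deprecated since 3.11 in favour of importlib.resources.files), which returns a binary file object that pickle.load can read from directly:
import importlib.resources
import pickle

# Open the resource in binary mode and let pickle read from the file object
with importlib.resources.open_binary("directory_with_pickle_file", "pickle_file.pkl") as f:
    data = pickle.load(f)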
I have an input file which looks like this and uses "|" as a multi-character delimiter:
162300111000000000106779"|"2005-11-16 14:12:32.860000000"|"1660320"|"0"|"2005-11-16 14:12:32.877000000"|""|""|""|""|""|""|""|"False"|"120600111000000000106776
I can read this type of record with a UDF as below:
import re
from pyspark.sql.functions import UserDefinedFunction

inputDf = glueContext.sparkSession.read.option("delimiter", input_file_delimiter) \
    .csv("s3://" + landing_bucket_name + "/" + input_file_name)
udf = UserDefinedFunction(lambda x: re.sub('"', '', str(x)))
new_df = inputDf.select(*[udf(column).alias(column) for column in inputDf.columns])
but when I get the input file as
000/00"|"AE71501"|"Complaint for Attachment of Earnings Order"|"In accordance with section test of the Attachment of Test Act Test."|"Non-Test"|"Other non-test offences"|"N"|"Other Non-Test"|"Non-Test
I get the exception below while reading it with the same UDF; my code fails at the exact location where the UDF is applied:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 66: ordinal not in range(128)
Any help on the following would be great:
- Optimized code to read both types of files, considering "|" as the separator.
- How my existing UDF can handle the second type of input record.
This is likely caused by running in Python 2.x, which has two separate types for string-like objects (unicode strings and non-unicode strings, which are nowadays simply byte sequences).
Spark will read in your data (which are bytes, as there is no such thing as plain text), and decode the lines as a sequence of Unicode strings. When you call str on a Unicode string that has a codepoint that is not in the ASCII range of codepoints, Python 2 will produce an error:
# Python 2.7
>>> unicode_str = u"ú"
>>> type(unicode_str)
<type 'unicode'>
>>> str(unicode_str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 0: ordinal not in range(128)
The recommended path is that you work with Unicode strings (which is the default string object in Python 3) all throughout your program, except at the point where you either read/receive data (where you should provide a suitable encoding scheme, so that you can decode the raw bytes) and at the point where you write/send data (again, where you use an encoding to encode the data as a series of bytes). This is called “the Unicode sandwich”.
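A minimal sketch of that sandwich in Python 2, with illustrative file names:
# decode once at the input boundary
with open("input.txt", "rb") as f:
    text = f.read().decode("latin-1")

# work exclusively with unicode strings in the middle
processed = text.upper()

# encode once at the output boundary
with open("output.txt", "wb") as f:
    f.write(processed.encode("utf-8"))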
Many libraries, including Spark, already decode bytes and encode unicode strings for you. If you simply remove the call to str in your user defined function, your code will likely work:
#pyspark shell using Python 2.7
>>> spark.sparkContext.setLogLevel("OFF") # to hide the big Py4J traceback that is dumped to the console, without modifying the log4j.properties file
>>> from py4j.protocol import Py4JJavaError
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> df = spark.read.csv("your_file.csv", sep="|")
>>> def strip_double_quotes_after_str_conversion(s):
... import re
... return re.sub('"', '', str(s))
...
>>> def strip_double_quotes_without_str_conversion(s):
... import re
... return re.sub('"', '', s)
...
>>> df.select(*[udf(strip_double_quotes_without_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
| _c0| _c1| _c2| _c3| _c4| _c5|_c6| _c7| _c8|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
|037/02|TH68150|Aggravated vehicl...|Contrary to secti...|Theft of motor ve...|Vehicle offences| Y|Aggravated Vehicl...|37.2|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
>>> try:
... df.select(*[udf(strip_double_quotes_after_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
... except Py4JJavaError as e:
... print("That failed. Root cause: %s" % e.java_exception.getCause().getMessage().rsplit("\n", 2)[-2])
...
That failed. Root cause: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 78: ordinal not in range(128)
So, the solution to the experienced problem is simple: don’t use str in your UDF.
Note that Python 2.x will no longer be maintained as of January 1st 2020. You’d do well transitioning to Python 3.x before that. In fact, had you executed this in a Python 3 interpreter, you would not have experienced the issue at all.
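For comparison, the same conversion in Python 3, where str is already a Unicode string:
# python3
>>> unicode_str = "ú"
>>> type(unicode_str)
<class 'str'>
>>> str(unicode_str)
'ú'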
I'm trying to create a bar graph similar to this example https://seaborn.pydata.org/examples/grouped_barplot.html
My code is as follows:
sns.set(style="whitegrid")
df_noshow = sns.load_dataset(df)
g = sns.catplot(x="Noshow", y="SMS_received", hue="Gender",
                data=df_noshow, height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("Text Message Received")
I'm getting this error: UnicodeEncodeError: 'ascii' codec can't encode character '\xda' in position 4710: ordinal not in range(128)
Additionally, I'm not 100% sure I did the following code correctly:
df_noshow = sns.load_dataset(df)
I did create df earlier with
import pandas as pd
df = pd.read_csv('noshow2016.csv')
and all the previous code has been working. I can't imagine the Unicode error has anything to do with the CSV file not loading correctly; however, I wanted to include it just in case. Thank you.
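One thing worth checking: sns.load_dataset expects the name of one of seaborn's bundled example datasets, not a DataFrame, so passing df to it is probably the wrong call here. Since df already is a DataFrame, a sketch of the same plot without load_dataset (assuming noshow2016.csv loads correctly):
import pandas as pd
import seaborn as sns

sns.set(style="whitegrid")
df_noshow = pd.read_csv('noshow2016.csv')  # already a DataFrame; no load_dataset needed
g = sns.catplot(x="Noshow", y="SMS_received", hue="Gender",
                data=df_noshow, height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("Text Message Received")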
I have tried different approaches and checked online for a solution, but I am not having any success with the code below.
import pandas as pd

df_new = pd.read_csv(path + 'output.csv')
writer = pd.ExcelWriter(path + 'output.xlsx')
df_new.to_excel(writer, index=False)
writer.save()
I am getting the error below when I try to execute it. I tried adding the encoding as latin, but it did not work. Please guide me. When I ignore errors, it runs but does not produce any result.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
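For what it's worth, a 0xff byte at position 0 is often the first byte of a UTF-16 byte-order mark (0xff 0xfe), so the file may not be UTF-8 at all. A sketch, assuming the file really is UTF-16-encoded:
import pandas as pd

# Assumption: the 0xff at position 0 is a UTF-16 LE byte-order mark
df_new = pd.read_csv(path + 'output.csv', encoding='utf-16')
writer = pd.ExcelWriter(path + 'output.xlsx')
df_new.to_excel(writer, index=False)
writer.save()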
I am trying to read and convert binary into text that anyone could read. I am having trouble with the error message:
'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte
I have gone through Reading binary file and looping over each byte, trying multiple versions of opening and reading the binary file in some way. After reading about this error message, most people either had trouble with .csv files or had to change utf-8 to utf-16. But reading up on https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes , it seems Python does not use UTF-16 anymore.
Also, if I add encoding='utf-16' (or 'utf-32'), the error states: binary mode doesn't take an encoding argument
Here is my code:
with open(b"P:\Projects\2018\1809-0068-R\Bin_Files\snap-pac-eb1-R10.0d.bin", "rb") as f:
byte = f.read(1)
while byte != b"":
byte = f.read(1)
print(f)
I am expecting to be able to read and write to the binary file. I would like to translate it to Hex and then to text (or to legible text somehow), but I think I have to go through this step before. If anyone could help with what I am missing, that would be greatly appreciated! Any way to open and read a binary file would be accepted. Thank you for your time!
I am not sure but this might help:
import binascii

with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)  # first six bytes of the file

b = bytearray(header)
binary = [bin(i)[2:].zfill(8) for i in b]           # each byte as an 8-bit binary string
n = int('0b' + ''.join(binary), 2)                  # the header as one big integer
nn = binascii.unhexlify('%x' % n)                   # back to bytes via the hex form
nnn = nn.decode("ascii")[0:-1]                      # decode as ASCII, dropping the last character
result = '.'.join(str(ord(c)) for c in nnn[0:-1])   # dotted decimal of the remaining characters
print(result)
Output:
16.0.8.0
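For what it's worth, a simpler sketch that should produce the same dotted output for this header, since in Python 3 each byte of a bytes object is already an integer:
with open('snap-pac-eb1-R10.0d.bin', 'rb') as f:
    header = f.read(6)

# No binary-string/hex round trip needed: index the bytes directly
print('.'.join(str(b) for b in header[:4]))  # e.g. 16.0.8.0
print(header.hex())                          # the whole header as a hex string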