UnicodeDecodeError when running a test in PyCharm - python-3.x

I have a test in my project that involves opening and reading a file. The chunk of code is basically something like this:
import pandas as pd

with open('path/to/the/file') as file:
    df = pd.read_csv(file, comment='#')
When I run the test in PyCharm I get this error:
File "/Users/andrea.marchini/.local/share/virtualenvs/customer-ontology-OqrlJ3HG/lib/python3.5/site-packages/pandas/io/parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/andrea.marchini/.local/share/virtualenvs/customer-ontology-OqrlJ3HG/lib/python3.5/site-packages/pandas/io/parsers.py", line 411, in _read
data = parser.read(nrows)
File "/Users/andrea.marchini/.local/share/virtualenvs/customer-ontology-OqrlJ3HG/lib/python3.5/site-packages/pandas/io/parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "/Users/andrea.marchini/.local/share/virtualenvs/customer-ontology-OqrlJ3HG/lib/python3.5/site-packages/pandas/io/parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2173, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28589)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 202538: ordinal not in range(128)
However, if I run the same code from the Python console, everything works fine.

Adding this line to the .bash_profile file solves the issue
if [ -z "$LANG" ]; then export LANG="$(defaults read -g AppleLocale | sed 's/#.*$//g').UTF-8"; fi
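Alternatively, passing the encoding explicitly in the call avoids depending on the environment's locale at all. A self-contained sketch (the temp file is hypothetical data standing in for the real file from the question):

```python
import tempfile

import pandas as pd

# Build a small UTF-8 CSV with a comment line and a non-ASCII character,
# standing in for the real file from the question (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as f:
    f.write('# a comment line\na,b\n1,caf\u00e9\n')
    path = f.name

# Passing encoding= explicitly makes the read independent of whatever
# locale (LANG/LC_ALL) the PyCharm test runner happens to inherit.
df = pd.read_csv(path, comment='#', encoding='utf-8')
```

This way the test passes whether or not PyCharm inherits a UTF-8 locale from the shell.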

Related

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3965: invalid start byte" when using Pyinstaller

I am trying to create an executable from two python scripts. One script defines the GUI for the other backend script. The backend is reading in excel files, creating DataFrames with them for manipulation, then outputting a new excel file. This is the code that reads in the excel file, where "user_path, userAN, userRev1, userRev2" are grabbed as user input from the GUI:
import pandas as pd
import numpy as np
import string
from tkinter import messagebox
import os
def generate_BOM(user_path, userAN, userRev1, userRev2):
    ## Append filepath with '/' if it does not include directory separator
    if not (user_path.endswith('/') or user_path.endswith('\\')):
        user_path = user_path + '/'
    ## Set filepath to current directory if user inputted path does not exist
    if not os.path.exists(user_path):
        user_path = '.'
    fileFormat1 = userAN + '_' + userRev1 + '.xls'
    fileFormat2 = userAN + '_' + userRev2 + '.xls'
    for file in os.listdir(path=user_path):
        if file.endswith(fileFormat1):
            df1 = pd.read_excel(user_path + file, index_col=None)
        if file.endswith(fileFormat2):
            df2 = pd.read_excel(user_path + file, index_col=None)
When running the two scripts through Spyder, everything works perfectly. To create the exe, I am using Pyinstaller with the following command:
pyinstaller --onefile Delta_BOM_Creator.py
This results in the following error:
Traceback (most recent call last):
File "c:\users\davhar\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\davhar\anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\davhar\Anaconda3\Scripts\pyinstaller.exe\__main__.py", line 7, in <module>
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\__main__.py", line 114, in run
run_build(pyi_config, spec_file, **vars(args))
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\__main__.py", line 65, in run_build
PyInstaller.building.build_main.main(pyi_config, spec_file, **kwargs)
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 737, in main
build(specfile, kw.get('distpath'), kw.get('workpath'), kw.get('clean_build'))
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 684, in build
exec(code, spec_namespace)
File "C:\Users\davhar\.spyder-py3\DELTA_BOM_Creator\Delta_BOM_Creator.spec", line 7, in <module>
a = Analysis(['Delta_BOM_Creator.py'],
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 242, in __init__
self.__postinit__()
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\datastruct.py", line 160, in __postinit__
self.assemble()
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\building\build_main.py", line 414, in assemble
priority_scripts.append(self.graph.run_script(script))
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\depend\analysis.py", line 303, in run_script
self._top_script_node = super(PyiModuleGraph, self).run_script(
File "c:\users\davhar\anaconda3\lib\site-packages\PyInstaller\lib\modulegraph\modulegraph.py", line 1411, in run_script
contents = fp.read() + '\n'
File "c:\users\davhar\anaconda3\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3965: invalid start byte
I've tried everything I could find that seemed related to this issue. To list just a few:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 15: invalid start byte
https://www.dlology.com/blog/solution-pyinstaller-unicodedecodeerror-utf-8-codec-cant-decode-byte/
Pandas read _excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
I've never used Pyinstaller, or created an executable from python at all, so apologies for being a big time noob.
SOLUTION: I found a solution. I went into the codecs.py file mentioned in the error and added 'ignore' to line 322:
(result, consumed) = self._buffer_decode(data, 'ignore', final)
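Patching codecs.py affects every Python program on the machine. A less invasive alternative (a sketch, assuming the script was saved in cp1252, since byte 0x93 is a left curly quote in that codec) is to re-encode the offending source file to UTF-8 once:

```python
import os
import tempfile

# Simulate a source file saved by a Windows editor in cp1252: byte 0x93
# is a left curly quote there, but an invalid UTF-8 start byte.
fd, src = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'wb') as f:
    f.write(b'msg = \x93hello\x94\n')

# Re-encode from cp1252 to UTF-8 so tools that assume UTF-8 source
# (PyInstaller's modulegraph) can read it, without touching the stdlib.
with open(src, encoding='cp1252') as f:
    text = f.read()
with open(src, 'w', encoding='utf-8') as f:
    f.write(text)

with open(src, encoding='utf-8') as f:
    roundtrip = f.read()
```

In the real case you would run the re-encode step on Delta_BOM_Creator.py itself (after backing it up), then rerun pyinstaller.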

python pandas, unicode decode error on read_csv

When importing a csv file I am getting an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 15: invalid start byte
traceback:
Traceback (most recent call last):
File "<ipython-input-2-99e71d524b4b>", line 1, in <module>
runfile('C:/AppData/FinRecon/py_code/python3/DataJoin.py', wdir='C:/AppData/FinRecon/py_code/python3')
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 500, in <module>
M5()
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 221, in M5
s3 = pd.read_csv(working_dir+"S3.csv", sep=",") #encode here encoding='utf-16
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 15: invalid start byte
What I've tried:
`s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='utf-16')`
I get the error UnicodeError: UTF-16 stream does not start with BOM
What can be done to get this file to be read properly?
Try using s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='Latin-1')
Encoding issues mostly arise from the characters within the data. UTF-8 can represent all languages, but it has a multi-byte structure that must be respected: bytes like 0xa0 are continuation bytes and can never start a character, so a file containing them as standalone bytes is not valid UTF-8. In Latin-1, by contrast, every single byte from 0x00 to 0xff maps to a character (0xa0, for example, is a non-breaking space), so decoding never fails. Hence your error with UTF-8 and the fix with Latin-1.
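When the real encoding is unknown, one common pattern is to try UTF-8 first and fall back to Latin-1, which can decode any byte sequence. A self-contained sketch (the temp file stands in for S3.csv):

```python
import tempfile

import pandas as pd

# A file containing byte 0xa0 (non-breaking space in Latin-1/cp1252),
# which is an invalid UTF-8 start byte -- like S3.csv in the question.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as f:
    f.write(b'a,b\n1,x\xa0y\n')
    path = f.name

# Try strict UTF-8 first; Latin-1 is a lossless fallback because it
# maps every byte value to some character, so it never raises.
try:
    df = pd.read_csv(path, encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv(path, encoding='latin-1')
```

Note that the fallback always "succeeds" even when the data is really in some other single-byte codec, so spot-check the decoded strings.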

How can I convert a .arff file to CSV in PyCharm on a Mac?

I'm new to data analysis and looking for help. I converted a .arff file to CSV using PyCharm, but I am not sure I did it the right way. Is that possible? If so, how? Also, when I tried this code:
import pandas as pd
data_frame = pd.read_csv("train.csv")
I got an error.
Traceback (most recent call last):
File "/Users/cristinamulas/PycharmProjects/exam/exam_p.py", line 53, in <module>
data_frame = pd.read_csv("train.csv")
File "/Users/cristinamulas/exam/lib/python3.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/cristinamulas/exam/lib/python3.7/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Users/cristinamulas/exam/lib/python3.7/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/Users/cristinamulas/exam/lib/python3.7/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/cristinamulas/exam/lib/python3.7/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'train.csv' does not exist
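The FileNotFoundError means the relative path 'train.csv' is being resolved against the current working directory, which differs between a PyCharm run configuration and a terminal. Anchoring the path to the script's own directory removes that dependency. A sketch (the temp directory stands in for the script's folder; in a real script you would use os.path.dirname(os.path.abspath(__file__)) instead):

```python
import os
import tempfile

import pandas as pd

# Stand-in for the directory the script lives in; in a real script use
# os.path.dirname(os.path.abspath(__file__)) instead of a temp directory.
script_dir = tempfile.mkdtemp()
with open(os.path.join(script_dir, 'train.csv'), 'w', encoding='utf-8') as f:
    f.write('a,b\n1,2\n')

# Resolve relative to the script's directory, not the working directory.
csv_path = os.path.join(script_dir, 'train.csv')
data_frame = pd.read_csv(csv_path)
```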

CParserError when reading csv file into python Spyder

I am trying to read a big CSV file (around 17 GB) into Spyder using the pandas module. Here is my code:
data =pd.read_csv('example.csv', encoding = 'ISO-8859-1')
But I keep getting a CParserError error message:
Traceback (most recent call last):
File "<ipython-input-3-3993cadd40d6>", line 1, in <module>
data =pd.read_csv('newsall.csv', encoding = 'ISO-8859-1')
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
return parser.read()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
I am aware there are some discussions about this issue, but they seem quite specific and vary from case to case. Can anyone help me out here?
I'm using Python 3 on a Windows system. Thanks in advance.
EDIT:
As suggested by ResMar, I tried the following code
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
    data.append(chunk, ignore_index=True)
But it returns nothing with
data.shape
Out[12]: (0, 0)
Then I tried the following code
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
    data = data.append(chunk, ignore_index=True)
It again shows run out of memory error, here is the trackback
Traceback (most recent call last):
File "<ipython-input-23-ee9021fcc9b4>", line 3, in <module>
for chunk in reader:
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 795, in __next__
return self.get_chunk()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 836, in get_chunk
return self.read(nrows=size)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9208)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
It seems to me to be pretty obvious what your error is: the computer runs out of memory. The file itself is 17GB, and as a rule of thumb pandas will take up roughly double that much space when it reads the file. So you'd need around 34GB of RAM to read this data in directly.
Most computers these days have 4, 8, or 16 GB; a few have 32. Your computer just runs out of memory and C kills the process when you do.
You can get around this by reading in your data in chunks, doing whatever you want to do on it with each segment in turn. See the chunksize parameter to pd.read_csv for more details on that, but you'll basically want something that looks like:
for chunk in pd.read_csv("...", chunksize=10000):
    do_something(chunk)
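Putting that together as a runnable sketch (the temp file stands in for newsall.csv, and the per-chunk sum is a placeholder for whatever processing you actually need). Reducing each chunk before keeping it is the point: appending raw chunks back into one DataFrame just rebuilds the 34 GB problem, and the first attempt above also failed because data.append(chunk) returns a new frame rather than modifying data in place.

```python
import tempfile

import pandas as pd

# A tiny stand-in for the 17 GB newsall.csv from the question.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='latin-1') as f:
    f.write('a,b\n' + ''.join(f'{i},{i * 2}\n' for i in range(25)))
    path = f.name

# Process each chunk and keep only the (small) per-chunk result, so
# memory use is bounded by chunksize rather than by the file size.
pieces = []
for chunk in pd.read_csv(path, encoding='ISO-8859-1', chunksize=10):
    pieces.append(chunk['b'].sum())  # per-chunk reduction, not raw rows
total = sum(pieces)
```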

Error when reading avro files in python

I installed Apache Avro successfully in Python. Then I tried to read Avro files into Python, following the instructions below.
https://avro.apache.org/docs/1.8.1/gettingstartedpython.html
I have a bunch of Avros in a directory which has already been set as the right path in Python. Here is my code:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print(user)
reader.close()
However it returns this error:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
self._read_header()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
META_SCHEMA, META_SCHEMA, self.raw_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
return self.read_fixed(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
return decoder.read(writer_schema.size)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
input_bytes = self.reader.read(n)
File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>
I am indeed aware that in the example in the instructions, a schema is created first. But what is an .avsc file? How should I create it, and the corresponding schema, in my case? Ideally, I would like to read Avro files into Python and save them to CSV on disk, or into a DataFrame/list in Python for further analysis. I'm using Python 3 on Windows 7.
EDITED
I tried Stephane's code, and it returns a new error
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'
EDITED2: Stephane's code works in most cases, but sometimes it incurs an AssertionError like this:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n 1. KEY INFORMATION \n \n (a) Full name of discloser: BlackRock, Inc. \n------------------------------------------------------------------------------------------------- ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe
You're using Windows and Python 3.
In Python 3, open opens files in text mode by default. That means that when further read operations happen, Python will try to decode the content of the file from some charset to Unicode.
You did not specify a charset, so Python tries to decode the content as if it were encoded with charmap (the default on Windows).
Obviously your Avro file is not encoded in charmap, so the decoding fails with an exception.
As far as I remember, Avro headers are binary content, not textual (not sure about that). So first you should try NOT to decode the file in open:
reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())
(notice 'rb', binary mode)
EDIT: For the next problem (AttributeError), you've been hit by a known bug that's not fixed in 1.8.1. Until the next version is out, you could do something like:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter, DataFileException, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io

class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid (treat a missing avro.codec as "null"
        # instead of calling .decode() on None, which is the 1.8.1 bug)
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            avro.schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))

reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
It is very strange that such a silly bug could make it into a release, though; that's not a sign of code maturity!
