CParserError when reading csv file into python Spyder - python-3.x

I am trying to read a big CSV file (around 17GB) into Python (Spyder) using the pandas module. Here is my code:
data = pd.read_csv('example.csv', encoding='ISO-8859-1')
But I keep getting a CParserError error message:
Traceback (most recent call last):
File "<ipython-input-3-3993cadd40d6>", line 1, in <module>
data =pd.read_csv('newsall.csv', encoding = 'ISO-8859-1')
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
return parser.read()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
I am aware there are some discussions about the issue, but it seems quite specific and varies from case to case. Can anyone help me out here?
I'm using Python 3 on a Windows system. Thanks in advance.
EDIT:
As suggested by ResMar, I tried the following code
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
    data.append(chunk, ignore_index=True)
But it returns nothing with
data.shape
Out[12]: (0, 0)
Then I tried the following code
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding = 'ISO-8859-1', chunksize = 10000)
for chunk in reader:
    data = data.append(chunk, ignore_index=True)
It again shows an out-of-memory error; here is the traceback:
Traceback (most recent call last):
File "<ipython-input-23-ee9021fcc9b4>", line 3, in <module>
for chunk in reader:
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 795, in __next__
return self.get_chunk()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 836, in get_chunk
return self.read(nrows=size)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9208)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory

It seems to me to be pretty obvious what your error is: the computer runs out of memory. The file itself is 17GB, and as a rule of thumb pandas will take up roughly double that much space when it reads the file. So you'd need around 34GB of RAM to read this data in directly.
Most computers these days have 4, 8, or 16 GB of RAM; a few have 32. Your computer simply runs out of memory, and C kills the process when it does.
You can get around this by reading your data in chunks and doing whatever you want to do with each segment in turn. See the chunksize parameter of pd.read_csv for more details, but you'll basically want something that looks like:
for chunk in pd.read_csv("...", chunksize=10000):
    do_something()
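For example, here is a minimal sketch of that pattern, assuming you only need a small per-chunk aggregate (a hypothetical 'category' column is counted here) rather than all 17GB in memory at once:
import pandas as pd

# Accumulate a small summary per chunk instead of concatenating every chunk.
# 'category' is a hypothetical column name; substitute one from your file.
counts = None
for chunk in pd.read_csv('newsall.csv', encoding='ISO-8859-1', chunksize=100000):
    chunk_counts = chunk['category'].value_counts()
    counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)

print(counts)
The summary stays small no matter how large the file is, which is what keeps this within memory.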

Related

Not able to parse csv file from pandas

I am writing a Python script in which I generate two different CSV files and then read them using pandas. I am able to read file1 with pandas, but I get an error while reading file2, which is in the same format (same column names) as file1 but with different/same values. Please find below the error that I am getting and the sample code that I am using.
Error:
Traceback (most recent call last):
File "MSReport.py", line 168, in <module>
fail = pd.read_csv('/home/cisapp/msLogFailure.csv', sep=',')
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/cisapp/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Code:
df = pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', engine='python')
f_output = df.groupby('MSISDN').last()
#print(df)
print(f_output)
fail = pd.read_csv(BASE_LOCATION+'/msLogFailure.csv', engine='python')
fail = fail['MSISDN']
fail = fail.tolist()
for i in fail:
    succ = f_output[f_output.MSISDN != i]
In the above sample code there is no error while reading the first file (df = pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', engine='python')), but while reading the second file (fail = pd.read_csv(BASE_LOCATION+'/msLogFailure.csv', engine='python')) I am facing the error mentioned above. Please help me resolve it.
Note: I am running the code using python3.
I faced the same problem and resolved it, so you can check it using the idea below.
Check the delimiter and specify it, as in the examples below:
pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', encoding='utf-16', sep='\t')
pd.read_csv(BASE_LOCATION+'/msLog_Success.csv', delim_whitespace=True)
You can also add 'r' before the file path.
Otherwise, share an image of the file.
Your sample of the msLogFailure file looks OK - 6 column names and 6 data fields.
I looked for posts concerning just this error message and found advice to:
read the input file into a string variable,
read_csv from this string, e.g. pd.read_csv(io.StringIO(txt),...).
Maybe this will help.
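A minimal sketch of that suggestion, assuming the question's BASE_LOCATION and a comma-separated file (adjust the encoding if needed):
import io
import pandas as pd

BASE_LOCATION = '/home/cisapp'  # as in the question

# Read the file into a string first, then hand it to read_csv via io.StringIO.
with open(BASE_LOCATION + '/msLogFailure.csv', encoding='utf-8') as f:
    txt = f.read()

fail = pd.read_csv(io.StringIO(txt), sep=',')
Note that this reads the whole file into memory first, which is fine for small log files like this one.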

How do I append rows to an excel spreadsheet with openpyxl and then save the excel file?

I'm writing a script that adds new data to an existing Excel spreadsheet. Currently, the spreadsheet is 500k+ rows long. I've been using openpyxl to open the spreadsheet, as xlsxwriter doesn't currently have any editing capabilities. However, when I use the provided append() method as explained in this answer to a similar problem, saving the file fails.
I'm currently running Python 3.7.3 with openpyxl 2.6.2 on a Windows 7 computer.
from openpyxl import load_workbook, Workbook

records = [object list]  # this is just a list of objects
file_name = 'existing_excel_file.xlsx'
excel_workbook = load_workbook(file_name, read_only=False)
worksheet = excel_workbook.active
row_list = []
row_number = 0
for record in records:
    row_number += 1
    row_list.append([
        str(record.weekno),
        str(record.date1),
        str(record.code),
        str(record.customer),
        str(record.date2)
    ])
for row in row_list:
    worksheet.append(row)
excel_workbook.save(file_name)
Obviously, it's supposed to save the file with the appended lines.
append() is working alright, but when I try to execute the save() method, I receive this error:
ValueError: I/O operation on closed file.
EDIT: At the suggestion of @CharlieClark, I grabbed the full traceback. I also noticed that there is a MemoryError that I simply didn't notice before (careless, I know), which might be the source of my issue; until this is resolved, I'm researching how to increase the memory available to openpyxl, as I'm sure that's probably the key. Regardless, here's the dump. Warning: it's a big hairy traceback.
Traceback (most recent call last):
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 836, in _get_writer
yield file.write
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 777, in write
short_empty_elements=short_empty_elements)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
short_empty_elements=short_empty_elements)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
short_empty_elements=short_empty_elements)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
short_empty_elements=short_empty_elements)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 935, in _serialize_xml
write(" %s=\"%s\"" % (qnames[k], v))
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "manage.py", line 21, in <module>
main()
File "manage.py", line 17, in main
execute_from_command_line(sys.argv)
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\__init__.py", line 381, in execute_from_command_line
utility.execute()
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\base.py", line 316, in run_from_argv
self.execute(*args, **cmd_options)
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\django\core\management\base.py", line 353, in execute
output = self.handle(*args, **options)
File "C:\Users\davidm\projects\web_admin\web_admin\ezcorp\management\commands\codes.py", line 183, in handle
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\openpyxl\workbook\workbook.py", line 397, in save
save_workbook(self, filename)
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 294, in save_workbook
writer.save()
File "C:\Users\davidm\projects\web_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 276, in save
self.write_data()
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 76, in write_data
self._write_worksheets()
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 216, in _write_worksheets
self.write_worksheet(ws)
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\writer\excel.py", line 201, in write_worksheet
writer.write()
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\worksheet\_writer.py", line 358, in write
self.close()
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\worksheet\_writer.py", line 366, in close
self.xf.close()
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\openpyxl\worksheet\_writer.py", line 297, in get_stream
pass
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 119, in __exit__
next(self.gen)
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\et_xmlfile\xmlfile.py", line 50, in element
self._write_element(el)
File "C:\Users\davidm\projects\rdm_admin\venv\lib\site-packages\et_xmlfile\xmlfile.py", line 77, in _write_element
xml = tostring(element)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 1136, in tostring
short_empty_elements=short_empty_elements)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 777, in write
short_empty_elements=short_empty_elements)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\xml\etree\ElementTree.py", line 836, in _get_writer
yield file.write
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 511, in __exit__
raise exc_details[1]
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 496, in __exit__
if cb(*exc_details):
File "C:\Users\davidm\AppData\Local\Programs\Python\Python37-32\Lib\contextlib.py", line 383, in _exit_wrapper
callback(*args, **kwds)
ValueError: I/O operation on closed file.
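Since the MemoryError looks like the real culprit, one lower-memory workaround (a sketch, not the asker's code) is to stream the existing sheet with openpyxl's read_only mode into a new write_only workbook and append the new rows there, instead of holding all 500k+ rows in a normal workbook:
from openpyxl import load_workbook, Workbook

# Stream the existing rows into a fresh write-only workbook, then append the
# new rows; both read_only and write_only modes keep memory usage low.
src_wb = load_workbook('existing_excel_file.xlsx', read_only=True)
src_ws = src_wb.active

out_wb = Workbook(write_only=True)
out_ws = out_wb.create_sheet(title=src_ws.title)

for row in src_ws.iter_rows(values_only=True):
    out_ws.append(row)

for row in row_list:  # row_list built from `records`, as in the question
    out_ws.append(row)

out_wb.save('existing_excel_file_new.xlsx')  # write to a new file, then swap
src_wb.close()
Note that read_only/write_only mode only carries cell values, so this fits only if the sheet is plain data without formatting.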

Getting `EOFError: Compressed file ended before the end-of-stream marker was reached` error

I wrote a Python script to download a file, extract it, and train the AI. The code is given below:
def maybe_download_and_extract(data_url):
  dest_directory = FLAGS.model_dir
  if not os.path.exists(dest_directory):
    os.makedirs(dest_directory)
  filename = data_url.split('/')[-1]
  filepath = os.path.join(dest_directory, filename)
  if not os.path.exists(filepath):
    def _progress(count, block_size, total_size):
      sys.stdout.write('\r>> Downloading %s %.1f%%' %
                       (filename, float(count * block_size) / float(total_size) * 100.0))
      sys.stdout.flush()
    filepath, _ = urllib.request.urlretrieve(data_url, filepath, _progress)
    print()
    statinfo = os.stat(filepath)
    tf.logging.info('Successfully downloaded', filename, statinfo.st_size, 'bytes.')
  tarfile.open(filepath, 'r:gz').extractall(dest_directory)
When I run it, I get this error:
Traceback (most recent call last):
File "scripts/retrain.py", line 1326, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "scripts/retrain.py", line 982, in main
maybe_download_and_extract(model_info['data_url'])
File "scripts/retrain.py", line 340, in maybe_download_and_extract
tarfile.open(filepath, 'r:gz').extractall(dest_directory)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\tarfile.py", line 2010, in extractall
numeric_owner=numeric_owner)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\tarfile.py", line 2052, in extract
numeric_owner=numeric_owner)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\tarfile.py", line 2122, in _extract_member
self.makefile(tarinfo, targetpath)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\tarfile.py", line 2171, in makefile
copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\tarfile.py", line 249, in copyfileobj
buf = src.read(bufsize)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "C:\Users\kulkaa\AppData\Local\conda\conda\envs\mlcc\lib\gzip.py", line 482, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
It seems that the file was only partially downloaded. Where is that file? I deleted the contents of the tmp folder and ran the program again, but got the same error.
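One guess (not from the asker): the script saves the archive in dest_directory = FLAGS.model_dir, not in /tmp, so a truncated copy there would be reused on every run. A small sketch for checking and removing it, with the directory and archive name as assumptions:
import os

model_dir = '/path/to/model_dir'  # whatever FLAGS.model_dir points to (assumption)
archive = os.path.join(model_dir, 'inception-2015-12-05.tgz')  # hypothetical archive name

if os.path.exists(archive):
    print('size on disk:', os.path.getsize(archive), 'bytes')
    os.remove(archive)  # delete the truncated file, then re-run the download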

How can I read the csv file in pandas which is separated with ";"?

I started working with pandas in Python 3.4 a couple of days ago. I chose to work on the Book-Crossing data set.
The book information table is like this:
The Book rating table is like this:
I want to grab the "ISBN" and "Book-Title" columns from the book information table, merge them with the book-rating table on matching "ISBN", and then write the results to another csv file.
I used the code below:
udata = pd.read_csv('1', names = ('User_ID', 'ISBN', 'Book-Rating'), encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
uitem = pd.read_csv('2', names = ('ISBN', 'Book-Title'), encoding="ISO-8859-1", sep=';', usecols=[0,1])
ratings = pd.merge(udata, uitem, on='ISBN')
ratings.to_csv('ratings.csv', index=False)
Unfortunately it doesn't work and it gives an error:
Traceback (most recent call last):
File "C:\Users\masoud\Desktop\Dataset\data2\a.py", line 2, in <module>
udata = pd.read_csv('2.csv', names = ('User_ID', 'ISBN', 'Book-Rating'),encoding="ISO-8859-1", sep=';', usecols=[0,1,2])
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 491, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 278, in _read
return parser.read()
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 740, in read
ret = self._engine.read(nrows)
File "C:\WinPython-64bit-3.4.3.6\python-3.4.3.amd64\lib\site-packages\pandas\io\parsers.py", line 1187, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 758, in pandas.parser.TextReader.read (pandas\parser.c:7919)
File "pandas\parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8175)
File "pandas\parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas\parser.c:8868)
File "pandas\parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8736)
File "pandas\parser.pyx", line 1732, in pandas.parser.raise_parser_error (pandas\parser.c:22105)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 8 fields in line 6452, saw 9
I was wondering if anybody could fix the error?
In the first and second rows, change sep to ';':
sep=';'
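If the separator is already ';' and the "Expected 8 fields in line 6452, saw 9" error persists, the extra field usually comes from stray quotes or semicolons inside a value. A hedged sketch of options that often help with the Book-Crossing files (the file name here is an assumption; error_bad_lines is the spelling in pandas of that era, newer versions use on_bad_lines='skip'):
import pandas as pd

udata = pd.read_csv('BX-Book-Ratings.csv',      # hypothetical file name
                    sep=';', encoding='ISO-8859-1',
                    quotechar='"', escapechar='\\',
                    error_bad_lines=False)      # skip the malformed lines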

Error when reading avro files in python

I installed Apache Avro successfully in Python. Then I try to read Avro files into Python following the instructions below.
https://avro.apache.org/docs/1.8.1/gettingstartedpython.html
I have a bunch of Avro files in a directory, which has already been set as the right path in Python. Here is my code:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print(user)
reader.close()
However it returns this error:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
self._read_header()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
META_SCHEMA, META_SCHEMA, self.raw_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg \avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
return self.read_fixed(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
return decoder.read(writer_schema.size)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
input_bytes = self.reader.read(n)
File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>
I am indeed aware that in the example in the instructions a schema is created first. But what is an avsc file? How shall I create it and the corresponding schema in my case? Ideally, I would like to read the Avro files into Python and save them in csv format on disk, or as a dataframe/list in Python for further analysis. I'm using Python 3 on Windows 7.
EDITED
I tried Stephane's code, and it returns a new error
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'
EDITED2: Stephane's code works in most cases, but sometimes it raises an AssertionError like this
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n 1. KEY INFORMATION \n \n (a) Full name of discloser: BlackRock, Inc. \n------------------------------------------------------------------------------------------------- ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe
You're using Windows and Python 3.
In Python 3, open opens files in text mode by default. That means that when further read operations happen, Python will try to decode the content of the file from some charset to Unicode.
You did not specify a charset, so Python tries to decode the content as if it were encoded with charmap (the default on Windows).
Obviously your Avro file is not encoded in charmap, so the decode fails with an exception.
As far as I remember, Avro headers are binary content anyway, not textual (not sure about that). So maybe you should first try NOT to decode the file with open:
reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())
(notice 'rb', binary mode)
EDIT: For the next problem (the AttributeError), you've been hit by a known bug that is not fixed in 1.8.1. Until the next version is out, you could just do something like:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter, DataFileException, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io

class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid (this is the part the 1.8.1 bug gets wrong)
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            avro.schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))

reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
It is very strange that such a bug could make it into a release, though; that's not a sign of code maturity!
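Once the records can be read, the asker's goal of getting them into CSV or a dataframe can be sketched roughly like this, assuming each Avro record is a flat dict (nested records would need flattening first):
import csv
from avro.datafile import DataFileReader
from avro.io import DatumReader

# Read all records (each a dict) and dump them to CSV for further analysis.
with open("part-00000-of-01733.avro", "rb") as fo:       # binary mode, as above
    reader = DataFileReader(fo, DatumReader())           # or MyDataFileReader from above, on 1.8.1
    records = list(reader)
    reader.close()

if records:
    with open("part-00000-of-01733.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)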
