I installed Apache Avro successfully in Python, and then tried to read Avro files into Python following the instructions below.
https://avro.apache.org/docs/1.8.1/gettingstartedpython.html
I have a number of Avro files in a directory, which has already been set as the correct path in Python. Here is my code:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
    print(user)
reader.close()
However it returns this error:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
self._read_header()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
META_SCHEMA, META_SCHEMA, self.raw_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg \avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
return self.read_fixed(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
return decoder.read(writer_schema.size)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
input_bytes = self.reader.read(n)
File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>
I am indeed aware that in the example in the instructions, a schema is created first. But what is an .avsc file? How should I create it, and the corresponding schema, in my case? Ideally, I would like to read Avro files into Python and save them to CSV format on disk, or to a dataframe/list in Python for further analysis. I'm using Python 3 on Windows 7.
EDITED
I tried Stephane's code, and it returns a new error:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 5, in <module>
reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'
EDITED2: Stephane's code works in most cases, but sometimes it raises an AssertionError like this:
Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n 1. KEY INFORMATION \n \n (a) Full name of discloser: BlackRock, Inc. \n------------------------------------------------------------------------------------------------- ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe
You're using Windows and Python 3.
In Python 3, open opens files in text mode by default. That means that when subsequent read operations happen, Python tries to decode the content of the file from some charset to Unicode.
You did not specify a charset, so Python tries to decode the content as if it were encoded in charmap (the default on Windows).
Obviously your Avro file is not encoded in charmap, and the decode fails with an exception.
As far as I remember, Avro headers are binary content anyway, not textual (not sure about that). So first, try NOT to decode the file with open:
reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())
(notice 'rb', binary mode)
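To double-check that the file really is binary Avro before going further, you can peek at its first bytes (a quick sketch; per the Avro spec, data files start with the magic bytes b'Obj\x01'):

with open("part-00000-of-01733.avro", "rb") as f:
    print(f.read(4))  # expect b'Obj\x01' -- anything else means it's not an Avro data file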
EDIT: For the next problem (the AttributeError), you've been hit by a known bug that is not fixed in 1.8.1. Until the next version is out, you could just do something like:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter, DataFileException, VALID_CODECS, SCHEMA_KEY
from avro.io import DatumReader, DatumWriter
from avro import io as avro_io


class MyDataFileReader(DataFileReader):
    def __init__(self, reader, datum_reader):
        """Initializes a new data file reader.

        Args:
          reader: Open file to read from.
          datum_reader: Avro datum reader.
        """
        self._reader = reader
        self._raw_decoder = avro_io.BinaryDecoder(reader)
        self._datum_decoder = None  # Maybe reset at every block.
        self._datum_reader = datum_reader

        # read the header: magic, meta, sync
        self._read_header()

        # ensure codec is valid (this is the buggy part in 1.8.1: a missing
        # 'avro.codec' entry should mean the "null" codec, not a crash)
        avro_codec_raw = self.GetMeta('avro.codec')
        if avro_codec_raw is None:
            self.codec = "null"
        else:
            self.codec = avro_codec_raw.decode('utf-8')
        if self.codec not in VALID_CODECS:
            raise DataFileException('Unknown codec: %s.' % self.codec)

        self._file_length = self._GetInputFileLength()

        # get ready to read
        self._block_count = 0
        self.datum_reader.writer_schema = (
            avro.schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))


reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
It is very strange that such a bug could make it into a release, though; that's not a sign of code maturity!
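Since your end goal is a CSV or a dataframe: once the reader works, the records come out as plain dicts, so you can hand them to pandas directly. A minimal sketch (assuming pandas is installed and the file fits in memory):

import pandas as pd

with open("part-00000-of-01733.avro", "rb") as f:
    reader = MyDataFileReader(f, DatumReader())
    records = list(reader)  # each record is a dict keyed by field name

df = pd.DataFrame.from_records(records)
df.to_csv("part-00000-of-01733.csv", index=False)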
Related
I want to stream data generated by Python to a webpage.
I came up with the following example, put together using examples from
https://holoviews.org/user_guide/Streaming_Data.html
and
http://holoviews.org/user_guide/Deploying_Bokeh_Apps.html
However, I get a document lock error:
'_pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes'
This is my example:
import numpy as np
import holoviews as hv
import holoviews.plotting.bokeh
import streamz
import streamz.dataframe

renderer = hv.renderer('bokeh')

from holoviews import opts
from holoviews.streams import Pipe, Buffer

hv.extension('bokeh')

source_df = streamz.dataframe.Random(freq='5ms', interval='100ms')
sdf = (source_df - 0.5).cumsum()
raw_dmap = hv.DynamicMap(hv.Curve, streams=[Buffer(sdf.x)])
smooth_dmap = hv.DynamicMap(hv.Curve, streams=[Buffer(sdf.x.rolling('50ms').mean())])

fig = (raw_dmap.relabel('raw') * smooth_dmap.relabel('smooth')).opts(
    opts.Curve(width=500, show_grid=True))

server = renderer.app(fig, show=True, new_window=True)
A page opens and the figure shows up, but it does not update. In my notebook I get the following error:
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x00000234E3CB9400>, <Future finished exception=RuntimeError('_pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes')>)
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\ioloop.py", line 758, in _run_callback
ret = callback()
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\ioloop.py", line 779, in _discard_future_result
future.result()
File "C:\ProgramData\Anaconda3\lib\site-packages\tornado\gen.py", line 1147, in run
yielded = self.gen.send(value)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\dataframe\core.py", line 802, in _cb
yield source._emit((last, now, freq))
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 298, in _emit
r = downstream.update(x, who=self)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 563, in update
return self._emit(result)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 298, in _emit
r = downstream.update(x, who=self)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 563, in update
return self._emit(result)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 298, in _emit
r = downstream.update(x, who=self)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 747, in update
return self._emit(result)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 298, in _emit
r = downstream.update(x, who=self)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 563, in update
return self._emit(result)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 298, in _emit
r = downstream.update(x, who=self)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 563, in update
return self._emit(result)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 298, in _emit
r = downstream.update(x, who=self)
File "C:\ProgramData\Anaconda3\lib\site-packages\streamz\core.py", line 516, in update
result = self.func(x, *self.args, **self.kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\streams.py", line 436, in send
self.event(data=data)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\streams.py", line 375, in event
self.trigger([self])
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\streams.py", line 156, in trigger
subscriber(**dict(union))
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\plot.py", line 615, in refresh
self._trigger_refresh(stream_key)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\plot.py", line 624, in _trigger_refresh
self.update(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\plot.py", line 596, in update
item = self.__getitem__(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\plot.py", line 261, in __getitem__
self.update_frame(frame)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\bokeh\element.py", line 1944, in update_frame
self._update_ranges(element, ranges)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\bokeh\element.py", line 657, in _update_ranges
self._shared['x'], self.logx, streaming)
File "C:\ProgramData\Anaconda3\lib\site-packages\holoviews\plotting\bokeh\element.py", line 702, in _update_range
axis_range.trigger(k, old, new)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\model.py", line 599, in trigger
super(Model, self).trigger(attr, old, new, hint=hint, setter=setter)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\util\callback_manager.py", line 143, in trigger
self._document._notify_change(self, attr, old, new, hint, setter, invoke)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\document.py", line 1004, in _notify_change
self._trigger_on_change(event)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\document.py", line 1099, in _trigger_on_change
self._with_self_as_curdoc(invoke_callbacks)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\document.py", line 1112, in _with_self_as_curdoc
return f()
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\document.py", line 1098, in invoke_callbacks
cb(event)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\document.py", line 668, in <lambda>
self._callbacks[receiver] = lambda event: event.dispatch(receiver)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\events.py", line 244, in dispatch
super(ModelChangedEvent, self).dispatch(receiver)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\document\events.py", line 126, in dispatch
receiver._document_patched(self)
File "C:\ProgramData\Anaconda3\lib\site-packages\bokeh\server\session.py", line 214, in _document_patched
raise RuntimeError("_pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes")
RuntimeError: _pending_writes should be non-None when we have a document lock, and we should have the lock when the document changes
Any clues what I'm doing wrong?
Kind regards
I changed the last line to renderer.server_doc(fig) and saved everything as a notebook named test.ipynb. In the command prompt, I ran bokeh serve --show .\test.ipynb. The server came up and the streaming data is shown in the browser as expected.
import numpy as np
import holoviews as hv
import holoviews.plotting.bokeh
import streamz
import streamz.dataframe

renderer = hv.renderer('bokeh')

from holoviews import opts
from holoviews.streams import Pipe, Buffer

hv.extension('bokeh')

source_df = streamz.dataframe.Random(freq='5ms', interval='100ms')
sdf = (source_df - 0.5).cumsum()
raw_dmap = hv.DynamicMap(hv.Curve, streams=[Buffer(sdf.x)])
smooth_dmap = hv.DynamicMap(hv.Curve, streams=[Buffer(sdf.x.rolling('50ms').mean())])

fig = (raw_dmap.relabel('raw') * smooth_dmap.relabel('smooth')).opts(
    opts.Curve(width=500, show_grid=True))

renderer.server_doc(fig)
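If you would rather launch everything from Python instead of the bokeh command line, wrapping the figure in a Bokeh application and starting a Server should also work. A sketch based on the Deploying_Bokeh_Apps guide (the port and route are arbitrary choices here):

from bokeh.server.server import Server

app = renderer.app(fig)        # wrap the overlay as a Bokeh application
server = Server({'/': app}, port=5006)
server.start()
server.show('/')               # open a browser tab pointing at the app
server.io_loop.start()         # block and serve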
I have a file global_vars.py that contains file paths saved as variables:
from pandas import Timestamp
final_vol_path = 'datasets/final_vols.csv'
final_price_path = 'datasets/final_prices.csv'
final_start_date = Timestamp('2017-01-01')
with other variables written in a similar fashion. However, the functions that I'm using to read in the data throw a FileNotFoundError when attempting to do the following in file1.py:
import scripts.global_vars as gv
read_data(gv.final_vol_path, gv.final_price_path) # throws FileNotFoundError
read_data('datasets/final_vols.csv', 'datasets/final_prices.csv') # this passes
Additionally, I've checked the file paths, and have gotten the following:
gv.final_vol_path == 'datasets/final_vols.csv' # returns True
gv.final_price_path == 'datasets/final_prices.csv' # returns True
Moreover, the pandas Timestamp object is processed without any problems.
Is there any explanation for why the FileNotFoundError is being thrown when attempting to access the file path as a variable from global_vars.py, but is not thrown when the actual string is passed in?
EDIT: The overall directory structure is as follows:
working_dir
├── file1.py
├── scripts
│   └── global_vars.py
└── datasets
    ├── final_vols.csv
    └── final_prices.csv
EDIT 2: I added a try/except block to ensure the rest of the function doesn't break; not sure if that has affected the traceback, but here's what I get:
Traceback (most recent call last):
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\runpy.py", line
184, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Ananth\Anaconda3\envs\analytics-cpu\Scripts\nose2.exe\__main__.py", line 9, in <module>
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\main.py", line 306, in discover
return main(*args, **kwargs)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\main.py", line 100, in __init__
super(PluggableTestProgram, self).__init__(**kw)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\unittest\main.py", line 93, in __init__
self.parseArgs(argv)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\main.py", line 133, in parseArgs
self.createTests()
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\main.py", line 258, in createTests
self.testNames, self.module)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\loader.py", line 69, in loadTestsFromNames
for name in event.names]
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\loader.py", line 69, in <listcomp>
for name in event.names]
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\loader.py", line 84, in loadTestsFromName
result = self.session.hooks.loadTestsFromName(event)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\events.py", line 224, in __call__
result = getattr(plugin, self.method)(event)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\plugins\loader\testcases.py", line 56, in loadTestsFromName
result = util.test_from_name(name, module)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\util.py", line 106, in test_from_name
parent, obj = object_from_name(name, module)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\nose2\util.py", line 117, in object_from_name
module = __import__('.'.join(parts_copy))
File "C:\Users\Ananth\Desktop\Modules\PortfolioVARModule\tests\test_simulation.py", line 24, in <module>
gv.test_start_date)
File "C:\Users\Ananth\Desktop\Modules\PortfolioVARModule\scripts\prep_data.py", line 119, in read_data
priceDF = pd.read_csv(pricepath).dropna()
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\pandas\io\parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\pandas\io\parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\pandas\io\parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "c:\users\ananth\anaconda3\envs\analytics-cpu\lib\site-packages\pandas\io\parsers.py", line 1390, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas\parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4184)
File "pandas\parser.pyx", line 667, in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8449)
FileNotFoundError: File b'datasets/corn_price.csv' does not exist
The problem is the letter b in front of your file's path: the path is being treated as a bytes object rather than a str. You get the b because the path was encoded to UTF-8.
Try decoding it back to str:
read_data(str(gv.final_vol_path, 'utf-8'), str(gv.final_price_path, 'utf-8'))
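If in doubt about which type you are actually passing, a one-line check before the call will tell you (a quick diagnostic sketch):

print(type(gv.final_vol_path))  # <class 'str'> is fine; <class 'bytes'> would explain the b'' prefix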
I am trying to read a big CSV file (around 17GB) into Python (Spyder) using the pandas module. Here is my code:
data = pd.read_csv('example.csv', encoding='ISO-8859-1')
But I keep getting a CParserError:
Traceback (most recent call last):
File "<ipython-input-3-3993cadd40d6>", line 1, in <module>
data =pd.read_csv('newsall.csv', encoding = 'ISO-8859-1')
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
return parser.read()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
I am aware there are some discussions about this issue, but it seems quite specific and varies from case to case. Can anyone help me out here?
I'm using Python 3 on a Windows system. Thanks in advance.
EDIT:
As suggested by ResMar, I tried the following code
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding='ISO-8859-1', chunksize=10000)
for chunk in reader:
    data.append(chunk, ignore_index=True)
But data remains empty:
data.shape
Out[12]: (0, 0)
Then I tried the following code
data = pd.DataFrame()
reader = pd.read_csv('newsall.csv', encoding='ISO-8859-1', chunksize=10000)
for chunk in reader:
    data = data.append(chunk, ignore_index=True)
It again raises the out-of-memory error; here is the traceback:
Traceback (most recent call last):
File "<ipython-input-23-ee9021fcc9b4>", line 3, in <module>
for chunk in reader:
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 795, in __next__
return self.get_chunk()
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 836, in get_chunk
return self.read(nrows=size)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "I:\Program Files\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)
File "pandas\parser.pyx", line 839, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9208)
File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)
File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)
CParserError: Error tokenizing data. C error: out of memory
It seems to me to be pretty obvious what your error is: the computer runs out of memory. The file itself is 17GB, and as a rule of thumb pandas will take up roughly double that much space when it reads the file. So you'd need around 34GB of RAM to read this data in directly.
Most computers these days have 4, 8, or 16 GB; a few have 32. Your computer simply runs out of memory, and the C parser kills the process when it does.
You can get around this by reading your data in chunks and doing whatever you want with each segment in turn. See the chunksize parameter of pd.read_csv for more details, but you'll basically want something that looks like:
for chunk in pd.read_csv("...", chunksize=10000):
    do_something(chunk)
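Note that appending every chunk back into a single in-memory DataFrame, as in your edit, defeats the purpose: the whole 17GB still ends up in RAM. A pattern that stays within memory is to reduce or filter each chunk and stream the result straight to disk. A sketch (the filtering line is a placeholder for whatever analysis you actually need):

import pandas as pd

first = True
for chunk in pd.read_csv('newsall.csv', encoding='ISO-8859-1', chunksize=10000):
    result = chunk  # placeholder: filter/aggregate the chunk here
    result.to_csv('filtered.csv', mode='w' if first else 'a', header=first, index=False)
    first = False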
The code below worked when using Python 2.7, but raises a StatementError when using Python 3.5. I haven't found a good explanation for this online yet.
Why doesn't sqlalchemy accept simple Python 3 string objects in this situation? Is there a better way to insert rows into a table?
from sqlalchemy import Table, MetaData, create_engine
import json

def add_site(site_id):
    engine = create_engine('mysql+pymysql://root:password@localhost/database_name', encoding='utf8', convert_unicode=True)
    metadata = MetaData()
    conn = engine.connect()
    table_name = Table('table_name', metadata, autoload=True, autoload_with=engine)
    site_name = 'Buffalo, NY'
    p_profile = {"0": 300, "1": 500, "2": 100}
    conn.execute(table_name.insert().values(updated=True,
                                            site_name=site_name,
                                            site_id=site_id,
                                            p_profile=json.dumps(p_profile)))

add_site(121)
EDIT The table was previously created with this function:
from sqlalchemy import (Table, MetaData, Column, Integer, Sequence,
                        Boolean, BLOB, SMALLINT, create_engine)

def create_table():
    engine = create_engine('mysql+pymysql://root:password@localhost/database_name')
    metadata = MetaData()
    # Create table for updating sites.
    table_name = Table('table_name', metadata,
                       Column('id', Integer, Sequence('user_id_seq'), primary_key=True),
                       Column('updated', Boolean),
                       Column('site_name', BLOB),
                       Column('site_id', SMALLINT),
                       Column('p_profile', BLOB))
    metadata.create_all(engine)
EDIT Full error:
>>> scd.add_site(121)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1073, in _execute_context
context = constructor(dialect, self, conn, *args)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in _init_compiled
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in <genexpr>
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/sqltypes.py", line 834, in process
return DBAPIBinary(value)
File "/usr/local/lib/python3.5/dist-packages/pymysql/__init__.py", line 79, in Binary
return bytes(x)
TypeError: string argument without an encoding
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user1/Desktop/server_algorithm/database_tools.py", line 194, in add_site
failed_acks=json.dumps(p_profile)))
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 914, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1078, in _execute_context
None, None)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
exc_info
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/util/compat.py", line 185, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1073, in _execute_context
context = constructor(dialect, self, conn, *args)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in _init_compiled
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in <genexpr>
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/sqltypes.py", line 834, in process
return DBAPIBinary(value)
File "/usr/local/lib/python3.5/dist-packages/pymysql/__init__.py", line 79, in Binary
return bytes(x)
sqlalchemy.exc.StatementError: (builtins.TypeError) string argument without an encoding [SQL: 'INSERT INTO table_name (updated, site_name, site_id, p_profile) VALUES (%(updated)s, %(site_name)s, %(site_id)s, %(p_profile)s)']
As univerio mentioned, the solution was to encode the string as follows:
conn.execute(table_name.insert().values(updated=True,
                                        site_name=site_name,
                                        site_id=site_id,
                                        p_profile=bytes(json.dumps(p_profile), 'utf8')))
BLOBs require binary data, so we need bytes in Python 3 and str in Python 2, since Python 2 strings are sequences of bytes.
If we want to use Python 3 str, we need to use TEXT instead of BLOB.
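For illustration, here is what the earlier create_table function would look like with TEXT columns instead of BLOB (a sketch; with TEXT the insert can pass plain Python 3 str values without manual encoding):

from sqlalchemy import (Table, MetaData, Column, Integer, Sequence,
                        Boolean, SMALLINT, TEXT, create_engine)

def create_table():
    engine = create_engine('mysql+pymysql://root:password@localhost/database_name')
    metadata = MetaData()
    table_name = Table('table_name', metadata,
                       Column('id', Integer, Sequence('user_id_seq'), primary_key=True),
                       Column('updated', Boolean),
                       Column('site_name', TEXT),   # TEXT columns accept str directly
                       Column('site_id', SMALLINT),
                       Column('p_profile', TEXT))
    metadata.create_all(engine)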
You simply need to convert your string to a byte string, e.g.:
site_name=str.encode(site_name),
site_id=site_id,
p_profile=json.dumps(p_profile).encode('utf8')))
or
site_name = b'Buffalo, NY'
from jira.client import JIRA

jira_options = {'server': 'https://abcjira.atlassian.net/login?dest-url=%2Fsecure%2FMyJiraHome.jspa'}
jira = JIRA(options=jira_options, basic_auth=('user', 'password'))

I want to do basic authentication in JIRA with Python. I wrote the code above, but it gives me the traceback below. Could anyone please tell me what the problem is here?
Traceback (most recent call last):
File "lab.py", line 5, in <module>
jira=JIRA(options=jira_options,basic_auth=('user','password'))
File "C:\Python34\lib\site-packages\jira\client.py", line 261, in __init__
si = self.server_info()
File "C:\Python34\lib\site-packages\jira\client.py", line 1619, in server_info
return self._get_json('serverInfo')
File "C:\Python34\lib\site-packages\jira\client.py", line 2040, in _get_json
raise e
File "C:\Python34\lib\site-packages\jira\client.py", line 2037, in _get_json
r_json = json_loads(r)
File "C:\Python34\lib\site-packages\jira\utils.py", line 81, in json_loads
return json.loads(r.text)
File "C:\Python34\lib\json\__init__.py", line 318, in loads
return _default_decoder.decode(s)
File "C:\Python34\lib\json\decoder.py", line 343, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python34\lib\json\decoder.py", line 361, in raw_decode
raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 4 column 1 (char 3)