OverflowError while saving large Pandas df to hdf - python-3.x

I have a large Pandas dataframe (~15 GB, 83m rows) that I want to save as an h5 (or feather) file. One column contains long ID strings made up of digits, which should have string/object dtype. But even when I ensure that pandas parses all columns as object:
df = pd.read_csv('data.csv', dtype=object)
print(df.dtypes) # sanity check
df.to_hdf('df.h5', 'df')
> client_id object
> event_id object
> account_id object
> session_id object
> event_timestamp object
> # etc...
I get this error:
File "foo.py", line 14, in <module>
df.to_hdf('df.h5', 'df')
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
f(store)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
self.write_array('block%d_values' % i, blk.values, items=blk_items)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2730, in write_array
vlarr.append(value)
File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/vlarray.py", line 547, in append
self._append(nparr, nobjects)
File "tables/hdf5extension.pyx", line 2032, in tables.hdf5extension.VLArray._append
OverflowError: value too large to convert to int
Apparently it is trying to convert this to an int anyway, and failing.
When running df.to_feather() I have a similar issue:
df.to_feather('df.feather')
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
to_feather(self, fname)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
feather.write_dataframe(df, path)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
writer.write(df)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
table = Table.from_pandas(df, preserve_index=False)
File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
convert_fields))
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 487, in convert_column
raise e
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 481, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 1542852887489 with type str: tried to convert to double', 'Conversion failed for column session_id with type object')
So:
Is anything that looks like a number forcibly converted to a number in storage?
Could the presence of NaNs affect what's happening here?
Is there an alternative storage solution? What would be the best?

Having done some reading on this topic, it seems the issue is how string-type columns are handled. My string columns contain a mixture of all-number strings and strings with characters. Pandas has the flexible option of keeping strings as object, without a declared type, but when serializing to hdf5 or feather the contents of a column are converted to a single type (str or double, say) and cannot be mixed. Both of these libraries fail when confronted with a sufficiently large column of mixed type.
Force-converting my mixed column to strings allowed me to save it in feather, but in HDF5 the file ballooned and the process ended when I ran out of disk space.
Here is an answer in a comparable case where a commenter notes (2 years ago) "This problem is very standard, but solutions are few".
Some background:
String types in Pandas are called object, but this obscures that they may be either pure strings or mixed dtypes (numpy has built-in string types, but Pandas never uses them for text). So the first thing to do in a case like this is to enforce string dtype on all string columns (with df[col].astype(str)). But even so, with a large enough file (16GB, with long strings) this still failed. Why?
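(For reference, that enforcement step looks something like the following sketch; selecting the object columns is just one reasonable way to pick which columns to convert.)
# force every object column to plain Python strings before serializing
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype(str)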
The reason I was encountering this error was that I had long, high-entropy (many different unique values) strings. (With low-entropy data, it might have been worthwhile to switch to a categorical dtype.) In my case, I realised that I only needed these strings in order to identify rows - so I could replace them with unique integers!
df[col] = df[col].map(dict(zip(df[col].unique(), range(df[col].nunique()))))
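A more compact way to do the same mapping (assuming the values only need to act as opaque row identifiers) is pandas' factorize:
# pd.factorize returns (integer codes, unique values); missing values are coded as -1
df[col], _ = pd.factorize(df[col])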
Other Solutions:
For text data, there are other recommended solutions than hdf5/feather, including:
json
msgpack (note that in Pandas 0.25 read_msgpack is deprecated)
pickle (which has known security issues, so be careful - but it should be OK for internal storage/transfer of dataframes)
parquet, part of the Apache Arrow ecosystem.
Here is an answer from Matthew Rocklin (one of the dask developers) comparing msgpack and pickle. He wrote a broader comparison on his blog.
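For instance, a minimal parquet round trip looks like this (a sketch; it assumes a parquet engine such as pyarrow or fastparquet is installed):
# write and read back a single dataframe as parquet
df.to_parquet('df.parquet')
df = pd.read_parquet('df.parquet')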

HDF5 is not a suitable solution for this use case. hdf5 is a better choice if you have many dataframes you want to store in a single structure. It has more overhead when opening the file, but it then allows you to efficiently load each dataframe and also to easily load slices of them. It should be thought of as a file system that stores dataframes.
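As an illustration of that usage pattern, here is a sketch using pandas' HDFStore (the keys and the df_events/df_accounts dataframes are hypothetical placeholders):
import pandas as pd

# one HDF5 file acting as a container for several dataframes
with pd.HDFStore('frames.h5') as store:
    store.put('events', df_events, format='table')     # df_events: placeholder dataframe
    store.put('accounts', df_accounts, format='table') # df_accounts: placeholder dataframe
    # read back only a slice of one of them
    recent = store.select('events', start=0, stop=100000)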
In the case of a single dataframe of time-series events, the recommended formats would be one of the Apache Arrow project formats, i.e. feather or parquet. One should think of those as column-based (compressed) csv files. The particular trade-off between the two is laid out nicely under What are the differences between feather and parquet?.
One particular issue to consider is data types. Since feather is not designed to optimize disk space by compression, it can support a larger variety of data types. Parquet, on the other hand, tries to provide very efficient compression and therefore supports only a more limited subset of types, which lets it handle the compression better.

Related

Trouble obtaining data from multiple *.root files...but no problems using only one

I am using Python version 3.6.5 and have a jagged TTree with a multi-dimensional structure. This data is spread over more than 1000 files, all with the identical TTree structure.
Suppose I have just two files, which I'll call
fname1.root
fname2.root
The following code has no problem opening either of these by itself:
import uproot as upr
import awkward
import boost_histogram as bh
import math
import matplotlib.pyplot as plt
#
# define a plotting program
# def plotter(h)
#
# preparing the file location for files
pth = '/fullpathName/'
fname1 = 'File755.root'
fname2 = 'File756.root'
fileList = [pth+fname1, pth+fname2]
#
# print out the path and filename that I've made to show the user
for file in fileList:
    print(file)
print('\n')
#
# Let's make a histogram This one has 50 bins, starts at zero and ends at 1000.0
# It will be a histogram of Jet pT's.
jhist = bh.histogram(bh.axis.regular(50,0.0,1000.0))
#
#show what you've just done
print(jhist)
#
# does not work, only fills first file!
for chunk in upr.iterate(fileList,"bTag_AntiKt4EMTopoJets",["jet_pt"]):
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten()*0.001)
#
#
# what does my histogram look like?
ptHist = plt.bar(jhist.axes[0].centers, jhist.view(), width=jhist.axes[0].widths)
plt.show()
As I said, the above code works if I put only ONE file in 'fileList'.
The naive thing to do doesn't work.
If I create a 'list' of files using
files = [pth+fname1, pth+fname2]
and re-run that code, I get the following error, which is very much the same error I have been getting all along.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 48, in <module>
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 116, in iterate
for tree, branchesinterp, globalentrystart, thispath, thisfile in _iterate(path, treepath, branches, awkward, localsource, xrootdsource, httpsource, **options):
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 163, in _iterate
file = uproot.rootio.open(path, localsource=localsource, xrootdsource=xrootdsource, httpsource=httpsource, **options)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 54, in open
return ROOTDirectory.read(openfcn(path), **options)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 51, in <lambda>
openfcn = lambda path: MemmapSource(path, **kwargs)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/source/memmap.py", line 21, in __init__
self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
File "/cvmfs/sft.cern.ch/lcg/views/LCG_94python3/x86_64-slc6-gcc8-opt/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory
Lazy arrays are just an interface for convenience—i.e. you can transform them with one function call, rather than iterating in an explicit loop over chunks. Internally, lazy arrays contain an implicit loop over chunks, so if you're running out of memory one way, you would run out the other way, too.
Your problem is not closing files (they're memory-mapped, so "closing" doesn't have a clear meaning—they're a view into memory that the operating system is allocating for itself, anyway)—your problem is with deleting arrays. That's the only thing that can use up all the memory on your computer.
There are a few things you can do here: one is to
for chunk in uproot.iterate(files, "bTag_AntiKt4EMTopoJets", ["jet_pt", "jet_eta"]):
    # fill with chunk[b"jet_pt"] and chunk[b"jet_eta"], which correspond
    # to the same sets of events, one-to-one.
to explicitly loop over the chunks ("explicit" because you see and control the loop here, and because you have to specify which branches you want to load into the dict chunk). You can control the size of the chunks with entrysteps. The other is to
cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(files, "bTag_AntiKt4EMTopoJets", cache=cache)
to keep the loop implicit. The ArrayCache will throw out chunks of arrays, so that they have to be loaded again, if it gets to the 1 GB limit. If you make that limit too small, it won't be able to hold one chunk, but if you make it too large, you'll run out of memory.
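For instance, a sketch of the explicit loop with the chunk size set via entrysteps (the value here is arbitrary; it reuses fileList and jhist from the question):
import uproot

for chunk in uproot.iterate(fileList, "bTag_AntiKt4EMTopoJets", ["jet_pt"],
                            entrysteps=500000):  # at most 500k entries held in memory per chunk
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten() * 0.001)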
By the way, although you're reporting a memory issue, there's another major performance issue with your code: you're looping over events in Python. Instead of
events.jet_pt[i][:2]*0.001
to get the jet pT for event i, do
events.jet_pt[:, :2]*0.001
for the jet pT of all events as a single array. You might then need to .flatten() that array to fit the histogram's fill method.
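Putting that together with the lazy-array approach, a sketch (untested, reusing the names from the question and the snippets above):
cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(fileList, "bTag_AntiKt4EMTopoJets", cache=cache)

# leading two jets of every event, filled in one vectorized call
jhist.fill(events.jet_pt[:, :2].flatten() * 0.001)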

Parsing xml file from url to a astropy votable without downloading

From http://svo2.cab.inta-csic.es/theory/fps/ you can get the transmission curves for many filters used in astronomical observations. I would like to get these data by opening the url of the corresponding xml file (for each filter) and parsing it into astropy's votable, which makes it easy to read the table data.
I have managed to do this by opening the file, converting it to a UTF-8 string and saving it locally as an xml. Opening the local file then works fine, as is obvious from the following example.
However, I do not want to save the file and open it again. When I tried that by doing votable = parse(xml_file), it raised an OSError: File name too long, because it treats the whole string as a file name.
from urllib.request import urlopen
from astropy.io.votable import parse
fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
url = urlopen(fltr).read()
xml_file = url.decode('UTF-8')
with open('tmp.xml','w') as out:
    out.write(xml_file)
votable = parse('tmp.xml')
data = votable.get_first_table().to_table(use_names_over_ids=True)
print(votable)
print(data["Wavelength"])
The output in this case is:
<VOTABLE>... 1 tables ...</VOTABLE>
Wavelength
AA
----------
12890.0
13150.0
...
18930.0
19140.0
Length = 58 rows
Indeed, according to the API documentation, votable.parse's first argument is either a filename or a readable file-like object. It doesn't state this explicitly, but apparently the file also has to be seekable, meaning that it can be read with random access.
The HTTPResponse object returned by urlopen is indeed a file-like object with a .read() method, so in principle it might be possible to pass directly to parse(), but this is how I found out it has to be seekable:
fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
u = urlopen(fltr)
>>> parse(u)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "astropy/io/votable/table.py", line 135, in parse
_debug_python_based_parser=_debug_python_based_parser) as iterator:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 157, in get_xml_iterator
with _convert_to_fd_or_read_function(source) as fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 63, in _convert_to_fd_or_read_function
with data.get_readable_fileobj(fd, encoding='binary') as new_fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/data.py", line 210, in get_readable_fileobj
fileobj.seek(0)
io.UnsupportedOperation: seek
So you need to wrap the data in a seekable file-like object. Along the lines of what @keflavich wrote, you can use io.BytesIO (io.StringIO won't work, as explained below).
It turns out that there's no reason to explicitly decode the UTF-8 data to unicode. I'll spare the example, but after trying it myself it turns out parse() works on raw bytes (which I find a bit odd, but okay). So you can read the entire contents of the URL into an io.BytesIO which is just an in-memory file-like object that supports random access:
>>> u = urlopen(fltr)
>>> s = io.BytesIO(u.read())
>>> v = parse(s)
WARNING: W42: None:2:0: W42: No XML namespace specified [astropy.io.votable.tree]
>>> v.get_first_table().to_table(use_names_over_ids=True)
<Table masked=True length=58>
Wavelength Transmission
AA
float32 float32
---------- ------------
12890.0 0.0
13150.0 0.0
... ...
18930.0 0.0
19140.0 0.0
This is, in general, the way in Python to do something with some data as though it were a file, without writing an actual file to the filesystem.
Note, however, that this won't work if the entire file can't fit in memory. In that case you might still need to write it out to disk. But if it's just for some temporary processing and you don't want to litter your disk with files like the tmp.xml in your example, you can always use the tempfile module to, among other things, create temporary files that are automatically deleted once they're no longer in use.
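For example, a sketch of that tempfile variant (the .xml suffix is just cosmetic):
import tempfile
from urllib.request import urlopen
from astropy.io.votable import parse

fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'

# the temporary file is removed automatically when the context exits
with tempfile.NamedTemporaryFile(suffix='.xml') as tmp:
    tmp.write(urlopen(fltr).read())
    tmp.flush()
    votable = parse(tmp.name)

data = votable.get_first_table().to_table(use_names_over_ids=True)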

Does IPOPT not support Pyomo's quicksum function? ValueError for unsupported expression type

I am trying to solve a nonlinear feasibility problem in Pyomo using the ipopt solver. The problem has 2 RangeSet declarations of a combined size 28, 4 Param declarations of a combined size 68, and 5 Var declarations of a combined size 88. There are also 90 Constraint declarations (2 redundant), of which some are linear and some are non-linear.
This model is supposed to be simulating a chemical system. Calling model.pprint() gives all the information that it must: all the declarations as stated above. This is the error output I received:
Traceback (most recent call last):
File "sample.py", line 420, in <module>
main()
File "sample.py", line 416, in main
org_n, aq_n, org, aq = _equilibrium_solver(inputfile, T, org_n, aq_n, org, aq)
File "sample.py", line 370, in _equilibrium_solver
opt.solve(model, tee=True)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/opt/base/solvers.py", line 569, in solve
self._presolve(*args, **kwds)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/opt/solver/shellcmd.py", line 200, in _presolve
OptSolver._presolve(self, *args, **kwds)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/opt/base/solvers.py", line 669, in _presolve
**kwds)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/opt/base/solvers.py", line 740, in _convert_problem
**kwds)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/opt/base/convert.py", line 105, in convert_problem
problem_files, symbol_map = converter.apply(*tmp, **tmpkw)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/solvers/plugins/converter/model.py", line 191, in apply
io_options=io_options)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/core/base/block.py", line 1716, in write
io_options)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/repn/plugins/ampl/ampl_.py", line 378, in __call__
include_all_variable_bounds=include_all_variable_bounds)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/repn/plugins/ampl/ampl_.py", line 1528, in _print_model_NL
wrapped_repn.repn.nonlinear_expr)
File "$HOME/.local/lib/python3.6/site-packages/pyomo/repn/plugins/ampl/ampl_.py", line 527, in _print_nonlinear_terms_NL
self._print_nonlinear_terms_NL(exp.arg(1))
File "$HOME/.local/lib/python3.6/site-packages/pyomo/repn/plugins/ampl/ampl_.py", line 637, in _print_nonlinear_terms_NL
% (exp_type))
ValueError: Unsupported expression type (<class 'pyomo.core.expr.expr_pyomo5.LinearExpression'>) in _print_nonlinear_terms_NL
I thought it was a fairly simple calculation that shouldn't have any problems, but now I am out of ideas as to what went wrong. I am not sure what it means by the ValueError it raises. Am I asking it to print some linear expression in nonlinear terms? There is only one thing I can think of: I used quicksum twice, which uses the linear_expression object. Should I replace it with some other sum expression like summation (not sure if summation uses the same object)?
Edit:
I traced the error back to this specific constraint. The constraint gives the relationship between mole fractions and moles.
def x2n_org(m,i):
    return model.org[i]*sum_product(model.org_n) == model.org_n[i]
model.xton_org = Constraint(model.nc, rule=x2n_org)
Somehow, sum_product or summation is the reason for the raised ValueError. It would be nice if someone could see what is wrong with the expression.
If I disable this constraint, the solver returns a different error:
ValueError: Cannot load a SolverResults object with bad status: error
However, this error at least tells me that the solver is trying to solve the model even though it couldn't find a solution.
I solved it by reformulating the constraint. Basically just rearranged the terms to have the sum_product on one side and the two variables (model.org[i] and model.org_n[i]) on the other side:
def x2n_org(m,i):
    return sum_product(model.org_n) == model.org_n[i]/model.org[i]
model.xton_org = Constraint(model.nc, rule=x2n_org)
Now IPOPT says the problem failed the restoration phase, but that's a different problem. So this question can be marked as solved.
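If the reformulation had not worked, another workaround to try (hypothetical and untested here) would be to build the total with a plain Python generator expression instead of sum_product, which avoids creating the LinearExpression node that the NL writer rejected:
def x2n_org(m, i):
    # ordinary nested Pyomo sum expression instead of sum_product's LinearExpression
    total = sum(m.org_n[j] for j in m.org_n)
    return m.org[i] * total == m.org_n[i]
model.xton_org = Constraint(model.nc, rule=x2n_org)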

Pandas ast.literal_eval crashes with non-basic datatypes while reading lists from csv

I have a Pandas dataframe that was saved as a csv by using command:
file.to_csv(filepath + name + ".csv", encoding='utf-8', sep=";", quoting=csv.QUOTE_NONNUMERIC)
I know that while saving, all of the columns in the dataframe are converted to string format, but I've managed to convert them back using
raw_table['synset'] = raw_table['synset'].map(ast.literal_eval)
This seems to work fine when the lists in the columns contain numbers or text, but not when there's a different datatype. The problem arises when I try to open a column "synset" with the following values (where every line represents a different row in the column; empty lists are also included):
[Synset('report.n.03')]
[]
[Synset('application.n.04')]
[]
[Synset('legal_profession.n.01')]
[Synset('demeanor.n.01')]
[Synset('demeanor.n.01')]
[Synset('outgrowth.n.01')]
These values come from the nltk package and are Synset objects.
Trying to evaluate these using ast.literal_eval() causes the following crash:
File "/.../site-packages/pandas/core/series.py", line 2158, in map
new_values = map_f(values, arg)
File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
File "/.../ast.py", line 85, in literal_eval
return _convert(node_or_string)
File "/.../ast.py", line 61, in _convert
return list(map(_convert, node.elts))
File "/.../ast.py", line 84, in _convert
raise ValueError('malformed node or string: ' + repr(node))
ValueError: malformed node or string: <_ast.Call object at 0x7f1918146ac8>
Any help on how I could resolve this problem?
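For context, the failure can be reproduced outside pandas: ast.literal_eval only accepts Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None), so a constructor call such as Synset(...) parses to an ast.Call node and is rejected:
import ast

ast.literal_eval("[1, 2, 'text']")           # fine: contains only literals
ast.literal_eval("[Synset('report.n.03')]")  # ValueError: malformed node or string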

How does one add string to tarfile in Python3

I have a problem adding a str to a tar archive in Python. In Python 2 I used the following method:
fname = "archive_name"
params_src = "some arbitrarty string to be added to the archive"
params_sio = io.StringIO(params_src)
archive = tarfile.open(fname+".tgz", "w:gz")
tarinfo = tarfile.TarInfo(name="params")
tarinfo.size = len(params_src)
archive.addfile(tarinfo, params_sio)
It's essentially the same as what can be found here. It worked well. However, moving to Python 3 it broke and results in the following error:
File "./translate_report.py", line 67, in <module>
main()
File "./translate_report.py", line 48, in main
archive.addfile(tarinfo, params_sio)
File "/usr/lib/python3.2/tarfile.py", line 2111, in addfile
copyfileobj(fileobj, self.fileobj, tarinfo.size)
File "/usr/lib/python3.2/tarfile.py", line 276, in copyfileobj
dst.write(buf)
File "/usr/lib/python3.2/gzip.py", line 317, in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface
To be honest, I have trouble understanding where it comes from, since I do not feed any str to the tarfile module beyond the point where I construct the StringIO object.
I know the meanings of StringIO, str, bytes and such changed a bit from Python 2 to 3, but I do not see a mistake and cannot come up with better logic to solve this task.
I create the StringIO object precisely to provide buffer methods around the string I want to add to the archive, yet it strikes me that some str does not provide them. On top of that, the exception is raised around lines that seem to be responsible for checksum calculations.
Can someone please explain what I am misunderstanding, or at least give an example of how to add a simple str to a tar archive without creating an intermediate file on the file-system?
When writing to a file, you need to encode your unicode data to bytes explicitly; StringIO objects do not do this for you, as they are in-memory text files. Use io.BytesIO() instead and encode:
params_sio = io.BytesIO(params_src.encode('utf8'))
Adjust your encoding to your data, of course.
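For completeness, a minimal Python 3 sketch of the whole snippet with that change applied (reusing the names from the question; note that tarinfo.size must be the byte length of the encoded data, not the character length of the str):
import io
import tarfile

fname = "archive_name"
params_src = "some arbitrary string to be added to the archive"
params_bytes = params_src.encode('utf8')   # tar archives store bytes, not text
params_bio = io.BytesIO(params_bytes)

with tarfile.open(fname + ".tgz", "w:gz") as archive:
    tarinfo = tarfile.TarInfo(name="params")
    tarinfo.size = len(params_bytes)       # byte length, not len(params_src)
    archive.addfile(tarinfo, params_bio)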
