Parsing an xml file from a url into an astropy votable without downloading - python-3.x

From http://svo2.cab.inta-csic.es/theory/fps/ you can get the transmission curves for many filters used in astronomical observations. I would like to fetch these data by opening the url of the corresponding xml file (for each filter) and parsing it into astropy's votable, which makes the table data easy to read.
I have managed to do this by opening the url, decoding the content as UTF-8 and saving it locally as an xml file. Opening that local file then works fine, as the following example shows.
However, I do not want to save the file and open it again. When I tried to skip that step by doing votable = parse(xml_file), it raised an OSError: File name too long, because parse treats the whole XML string as a filename.
from urllib.request import urlopen
fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
url = urlopen(fltr).read()
xml_file = url.decode('UTF-8')
with open('tmp.xml','w') as out:
    out.write(xml_file)
votable = parse('tmp.xml')
data = votable.get_first_table().to_table(use_names_over_ids=True)
print(votable)
print(data["Wavelength"])
The output in this case is:
<VOTABLE>... 1 tables ...</VOTABLE>
Wavelength
AA
----------
12890.0
13150.0
...
18930.0
19140.0
Length = 58 rows

Indeed, according to the API documentation, votable.parse's first argument is either a filename or a readable file-like object. The docs don't say so explicitly, but apparently the file-like object also has to be seekable, meaning that it can be read with random access.
The HTTPResponse object returned by urlopen is indeed a file-like object with a .read() method, so in principle it might be possible to pass it directly to parse(); this is how I found out it also has to be seekable:
>>> fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
>>> u = urlopen(fltr)
>>> parse(u)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "astropy/io/votable/table.py", line 135, in parse
_debug_python_based_parser=_debug_python_based_parser) as iterator:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 157, in get_xml_iterator
with _convert_to_fd_or_read_function(source) as fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 63, in _convert_to_fd_or_read_function
with data.get_readable_fileobj(fd, encoding='binary') as new_fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/data.py", line 210, in get_readable_fileobj
fileobj.seek(0)
io.UnsupportedOperation: seek
So you need to wrap the data in a seekable file-like object. Along the lines of what @keflavich wrote, you can use io.BytesIO (io.StringIO won't work, as explained below).
It turns out that there's no reason to explicitly decode the UTF-8 data to unicode. I'll spare you the example, but after trying it myself it turns out parse() works on raw bytes (which I find a bit odd, but okay). So you can read the entire contents of the URL into an io.BytesIO, which is just an in-memory file-like object that supports random access:
>>> u = urlopen(fltr)
>>> s = io.BytesIO(u.read())
>>> v = parse(s)
WARNING: W42: None:2:0: W42: No XML namespace specified [astropy.io.votable.tree]
>>> v.get_first_table().to_table(use_names_over_ids=True)
<Table masked=True length=58>
Wavelength Transmission
AA
float32 float32
---------- ------------
12890.0 0.0
13150.0 0.0
... ...
18930.0 0.0
19140.0 0.0
This is, in general, the way in Python to do something with some data as though it were a file, without writing an actual file to the filesystem.
Note, however, that this won't work if the entire file can't fit in memory. In that case you still might need to write it out to disk. But if it's just for some temporary processing and you don't want to litter your disk with files like the tmp.xml from your example, you can always use the tempfile module to, among other things, create temporary files that are automatically deleted once they're no longer in use.
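For example, here is a minimal sketch of that route (untested; it reuses the same filter URL as above, relies on NamedTemporaryFile's default delete-on-close behaviour, and passes the open file object to parse, although passing tmp.name should also work):
import tempfile
from urllib.request import urlopen
from astropy.io.votable import parse

fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'

# NamedTemporaryFile is opened in binary mode and deleted on close by default
with tempfile.NamedTemporaryFile(suffix='.xml') as tmp:
    tmp.write(urlopen(fltr).read())
    tmp.flush()   # make sure everything is actually on disk
    tmp.seek(0)   # rewind; unlike the HTTP response, this object is seekable
    votable = parse(tmp)

data = votable.get_first_table().to_table(use_names_over_ids=True)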

Related

Trouble obtaining data from multiple *.root files...but no problems using only one

I am using python version 3.6.5 and have a jagged TTree with a multi-dimensional structure. The data is spread over more than 1000 files, all with the same identical TTree structure.
For the sake of the example, suppose I have just two files and call them
fname1.root
fname2.root
The following code has no problem opening either of these by itself:
import uproot as upr
import awkward
import boost_histogram as bh
import math
import matplotlib.pyplot as plt
#
# define a plotting program
# def plotter(h)
#
# preparing the file location for files
pth = '/fullpathName/'
fname1 = 'File755.root'
fname2 = 'File756.root'
fileList = [pth+fname1, pth+fname2]
#
# print out the path and filename that I've made to show the user
for file in fileList:
    print(file)
print('\n')
#
# Let's make a histogram. This one has 50 bins, starts at zero and ends at 1000.0.
# It will be a histogram of Jet pT's.
jhist = bh.histogram(bh.axis.regular(50,0.0,1000.0))
#
#show what you've just done
print(jhist)
#
# does not work, only fills first file!
for chunk in upr.iterate(fileList,"bTag_AntiKt4EMTopoJets",["jet_pt"]):
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten()*0.001)
#
#
# what does my histogram look like?
ptHist = plt.bar(jhist.axes[0].centers, jhist.view(), width=jhist.axes[0].widths)
plt.show()
As I said, the above code works if I put only ONE file in fileList.
The naive thing to do doesn't work. If I create a list of files using
files = [pth+fname1, pth+fname2]
and re-run that code, I get the following error, which is very much the same error I have been getting all along:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 48, in <module>
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 116, in iterate
for tree, branchesinterp, globalentrystart, thispath, thisfile in _iterate(path, treepath, branches, awkward, localsource, xrootdsource, httpsource, **options):
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/tree.py", line 163, in _iterate
file = uproot.rootio.open(path, localsource=localsource, xrootdsource=xrootdsource, httpsource=httpsource, **options)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 54, in open
return ROOTDirectory.read(openfcn(path), **options)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/rootio.py", line 51, in <lambda>
openfcn = lambda path: MemmapSource(path, **kwargs)
File "/home/huffman/.local/lib/python3.6/site-packages/uproot/source/memmap.py", line 21, in __init__
self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
File "/cvmfs/sft.cern.ch/lcg/views/LCG_94python3/x86_64-slc6-gcc8-opt/lib/python3.6/site-packages/numpy/core/memmap.py", line 264, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 12] Cannot allocate memory
Lazy arrays are just an interface for convenience: you can transform the data with one function call, rather than iterating over chunks in an explicit loop. Internally, lazy arrays contain an implicit loop over chunks, so if you're running out of memory one way, you would run out the other way too.
Your problem is not closing files (they're memory-mapped, so "closing" doesn't have a clear meaning; they're a view into memory that the operating system is allocating for itself anyway). Your problem is with deleting arrays: that's the only thing that can use up all the memory on your computer.
There are a few things you can do here. One is to loop over the chunks explicitly:
for chunk in uproot.iterate(files, "bTag_AntiKt4EMTopoJets", ["jet_pt", "jet_eta"]):
    # fill with chunk[b"jet_pt"] and chunk[b"jet_eta"], which correspond
    # to the same sets of events, one-to-one.
"Explicit" because you see and control the loop here, and because you have to specify which branches you want to load into the dict chunk. You can control the size of the chunks with entrysteps. The other option is to keep the loop implicit:
cache = uproot.ArrayCache("1 GB")
events = uproot.lazyarrays(files, "bTag_AntiKt4EMTopoJets", cache=cache)
The ArrayCache will throw out chunks of arrays, so that they have to be loaded again, if it gets to the 1 GB limit. If you make that limit too small, it won't be able to hold one chunk; if you make it too large, you'll run out of memory.
By the way, although you're reporting a memory issue, there's another major performance issue with your code: you're looping over events in Python. Instead of
events.jet_pt[i][:2]*0.001
to get the jet pT for event i, do
events.jet_pt[:, :2]*0.001
for the jet pT of all events as a single array. You might then need to .flatten() that array to fit the histogram's fill method.
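Putting the two suggestions together, here is a minimal sketch of the explicit-loop route (the entrysteps value is an arbitrary assumption, and the file list, tree name, branch name and histogram mirror the ones in the question; newer boost_histogram releases spell the class bh.Histogram rather than bh.histogram):
import uproot
import boost_histogram as bh

# same files and histogram as in the question
fileList = ['/fullpathName/File755.root', '/fullpathName/File756.root']
jhist = bh.histogram(bh.axis.regular(50, 0.0, 1000.0))

# entrysteps bounds how many events are resident in memory per chunk;
# 100000 is just an illustrative value
for chunk in uproot.iterate(fileList, "bTag_AntiKt4EMTopoJets", ["jet_pt"],
                            entrysteps=100000):
    # vectorized: take the two leading jets of every event in the chunk,
    # flatten to a plain array and convert MeV -> GeV
    jhist.fill(chunk[b"jet_pt"][:, :2].flatten() * 0.001)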

OverflowError while saving large Pandas df to hdf

I have a large Pandas dataframe (~15 GB, 83m rows) that I am interested in saving as an h5 (or feather) file. One column contains long ID strings consisting of digits, which should have string/object type. But even when I ensure that pandas parses all columns as object:
df = pd.read_csv('data.csv', dtype=object)
print(df.dtypes) # sanity check
df.to_hdf('df.h5', 'df')
> client_id object
event_id object
account_id object
session_id object
event_timestamp object
# etc...
I get this error:
File "foo.py", line 14, in <module>
df.to_hdf('df.h5', 'df')
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
f(store)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
self.write_array('block%d_values' % i, blk.values, items=blk_items)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2730, in write_array
vlarr.append(value)
File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/vlarray.py", line 547, in append
self._append(nparr, nobjects)
File "tables/hdf5extension.pyx", line 2032, in tables.hdf5extension.VLArray._append
OverflowError: value too large to convert to int
Apparently it is trying to convert this to an int anyway, and failing.
When running df.to_feather() I have a similar issue:
df.to_feather('df.feather')
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
to_feather(self, fname)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
feather.write_dataframe(df, path)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
writer.write(df)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
table = Table.from_pandas(df, preserve_index=False)
File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
convert_fields))
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 487, in convert_column
raise e
File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 481, in convert_column
result = pa.array(col, type=type_, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 1542852887489 with type str: tried to convert to double', 'Conversion failed for column session_id with type object')
So:
Is anything that looks like a number forcibly converted to a number in storage?
Could the presence of NaNs affect what's happening here?
Is there an alternative storage solution? What would be the best?
Having done some reading on this topic, it seems like the issue is dealing with string-type columns. My string columns contain a mixture of all-number strings and strings with characters. Pandas has the flexible option of keeping strings as object, without a declared type, but when serializing to hdf5 or feather the content of the column is converted to a single type (str or double, say) and cannot be mixed. Both of these libraries fail when confronted with a sufficiently large column of mixed type.
Force-converting my mixed column to strings allowed me to save it in feather, but in HDF5 the file ballooned and the process ended when I ran out of disk space.
Here is an answer in a comparable case where a commenter notes (2 years ago) "This problem is very standard, but solutions are few".
Some background:
String types in Pandas are called object, but this obscures the fact that they may be either pure strings or mixed dtypes (numpy has built-in string types, but Pandas never uses them for text). So the first thing to do in a case like this is to force all string columns to string type (with df[col].astype(str)). But even so, in a large enough file (16 GB, with long strings) this still failed. Why?
The reason I was encountering this error was that my data contained long, high-entropy (many different unique values) strings. (With low-entropy data, it might have been worthwhile to switch to a categorical dtype.) In my case, I realised that I only needed these strings in order to identify rows, so I could replace them with unique integers!
df[col] = df[col].map(dict(zip(df[col].unique(), range(df[col].nunique()))))
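An equivalent way to build that integer mapping, for what it's worth, is pandas' factorize; a small self-contained sketch (the column name and values are just examples):
import pandas as pd

df = pd.DataFrame({'session_id': ['1542852887489', 'abc123', '1542852887489']})

# factorize assigns a distinct integer code to each unique value,
# equivalent to the dict/zip mapping above but implemented in C
df['session_id'], uniques = pd.factorize(df['session_id'])
print(df['session_id'].tolist())   # [0, 1, 0]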
Other Solutions:
For text data, there are recommended solutions other than hdf5/feather, including:
json
msgpack (note that in Pandas 0.25 read_msgpack is deprecated)
pickle (which has known security issues, so be careful - but it should be OK for internal storage/transfer of dataframes)
parquet, part of the Apache Arrow ecosystem.
Here is an answer from Matthew Rocklin (one of the dask developers) comparing msgpack and pickle. He wrote a broader comparison on his blog.
HDF5 is not a suitable solution for this use case. hdf5 is a better fit if you have many dataframes you want to store in a single structure. It has more overhead when opening the file, but it then allows you to efficiently load each dataframe and also to easily load slices of them. It should be thought of as a file system that stores dataframes.
For a single dataframe of time-series events, the recommended formats would be one of the Apache Arrow project formats, i.e. feather or parquet. One should think of those as column-based (compressed) csv files. The particular trade-off between the two is laid out nicely under What are the differences between feather and parquet?.
One particular issue to consider is data types. Since feather is not designed to optimize disk space through compression, it can support a larger variety of data types. Parquet, on the other hand, aims for very efficient compression and therefore supports only a more limited subset of types that it can compress well.
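For concreteness, a minimal sketch of writing (and reading back) a small dataframe in both formats; the engine and compression arguments below are just reasonable defaults, and pyarrow is assumed to be installed:
import pandas as pd

df = pd.DataFrame({'session_id': ['1542852887489', 'abc123'],
                   'value': [1.0, 2.0]})

# feather: fast and light-weight, little compression
df.to_feather('df.feather')

# parquet: columnar and compressed, better suited to long-term storage
df.to_parquet('df.parquet', engine='pyarrow', compression='snappy')

# reading back
df_feather = pd.read_feather('df.feather')
df_parquet = pd.read_parquet('df.parquet')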

Doesn't python's ffmpy work with temporary files made using tempfile.TemporaryFile?

I am writing a function whose purpose is to take an mp3 file and analyse and process it. So, taking help from this SO answer, I am creating a temporary wav file and then, using the python ffmpy library, trying to convert the mp3 (the actual given file) to wav. The catch is that I am passing the temporary wav file created above as the output file for ffmpy to store its result in, i.e. I am doing this:
import ffmpy
import tempfile
from scipy.io import wavfile
# audioFile variable is known here
tempWavFile = tempfile.TemporaryFile(suffix="wav")
ff_obj = ffmpy.FFmpeg(
    global_options="hide_banner",
    inputs={audioFile: None},
    outputs={tempWavFile: " -acodec pcm_s16le -ac 1 -ar 44000"}
)
ff_obj.run()
[fs, frames] = wavfile.read(tempWavFile)
print(" fs is: ", fs)
print(" frames is: ", frames)
But on line ff_obj.run() this error occurs:
File "/home/tushar/.local/lib/python3.5/site-packages/ffmpy.py", line 95, in run
stderr=stderr
File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.5/subprocess.py", line 1490, in _execute_child
restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert '_io.TextIOWrapper' object to str implicitly
So, my questions are:
When I replace tempWavFile = tempfile.TemporaryFile(suffix="wav") with tempWavFile = tempfile.mktemp('.wav'), no error occurs. Why is that?
What does this error mean, what causes it, and how can it be corrected?
tempfile.TemporaryFile returns a file-like object:
>>> tempWavFile = tempfile.TemporaryFile(suffix="wav")
>>> tempWavFile
<_io.BufferedRandom name=12>
On the other hand, tempfile.mktemp returns a string: the path of a new temporary file name (note that mktemp only generates the name, it does not actually create the file, which is part of why it is considered unsafe):
>>> tempWavFile = tempfile.mktemp('.wav')
>>> tempWavFile
'/var/folders/f1/9b4sf0gx0dx78qpkq57sz4bm0000gp/T/tmpf2117fap.wav'
After creating tempWavFile, you pass it to ffmpy.FFmpeg, which will aggregate input and output files and parameters in a single command, to be passed to subprocess. The command-line takes the form of a list, and probably looks something like the following: ["ffmpeg", "-i", "input.wav", "output.wav"].
Finally, ffmpy passes this list to subprocess.Popen and that's where it explodes when you use tempfile.TemporaryFile. This is normal because subprocess does not know anything about your arguments and expects all of them to be strings. When it sees the _io.BufferedRandom object returned by tempfile.TemporaryFile, it doesn't know what to do.
So, to fix it, just use tempfile.mkstemp, which is safer anyway than tempfile.mktemp.
From the Python docs:
tempfile.mkstemp(suffix=None, prefix=None, dir=None, text=False)
Creates a temporary file in the most secure manner possible.
...
Unlike TemporaryFile(), the user of mkstemp() is responsible for deleting the temporary file when done with it.
You originally mentioned mktemp, which is deprecated since Python 2.3 (see docs) and should be replaced by mkstemp.
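A rough sketch of how that could look in your code (untested; audioFile and the ffmpeg options are taken from the question, -y is added so ffmpeg may overwrite the empty file that mkstemp has already created, and the cleanup at the end is an assumption about what you want):
import os
import tempfile

import ffmpy
from scipy.io import wavfile

# audioFile is assumed to hold the path of the input mp3, as in the question
fd, tempWavPath = tempfile.mkstemp(suffix=".wav")
os.close(fd)  # ffmpeg reopens the file by name, so the descriptor is not needed

try:
    ff_obj = ffmpy.FFmpeg(
        global_options="-y -hide_banner",
        inputs={audioFile: None},
        outputs={tempWavPath: "-acodec pcm_s16le -ac 1 -ar 44000"},
    )
    ff_obj.run()
    fs, frames = wavfile.read(tempWavPath)
finally:
    os.remove(tempWavPath)  # mkstemp does not delete the file for you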

Storing the Output to a FASTA file

from Bio import SeqIO
from Bio import SeqRecord
from Bio import SeqFeature
for rec in SeqIO.parse("C:/Users/Siva/Downloads/sequence.gp","genbank"):
    if rec.features:
        for feature in rec.features:
            if feature.type == "Region":
                seq1 = feature.location.extract(rec).seq
                print(seq1)
                SeqIO.write(seq1,"region_AA_output1.fasta","fasta")
I am trying to write the output to a FASTA file but I am getting an error. Can anybody help me?
This is the error I got:
Traceback (most recent call last):
File "C:\Users\Siva\Desktop\region_AA.py", line 10, in <module>
SeqIO.write(seq1,"region_AA_output1.fasta","fasta")
File "C:\Python34\lib\site-packages\Bio\SeqIO\__init__.py", line 472, in write
count = writer_class(fp).write_file(sequences)
File "C:\Python34\lib\site-packages\Bio\SeqIO\Interfaces.py", line 211, in write_file
count = self.write_records(records)
File "C:\Python34\lib\site-packages\Bio\SeqIO\Interfaces.py", line 196, in write_records
self.write_record(record)
File "C:\Python34\lib\site-packages\Bio\SeqIO\FastaIO.py", line 190, in write_record
id = self.clean(record.id)
AttributeError: 'str' object has no attribute 'id'
First, you're trying to write a plain sequence as a fasta record. A fasta record consists of a sequence plus an ID line (prepended by ">"). You haven't provided an ID, so the fasta writer has nothing to write. You should either write the whole record, or turn the sequence into a fasta record by adding an ID yourself.
Second, even if your approach wrote anything, it's continually overwriting each new record into the same file. You'd end up with just the last record in the file.
A simpler approach is to store everything in a list, and then write the whole list once you're done with the loop. For example:
new_fasta = []
for rec in SeqIO.parse("C:/Users/Siva/Downloads/sequence.gp","genbank"):
    if rec.features:
        for feature in rec.features:
            if feature.type == "Region":
                seq1 = feature.location.extract(rec).seq
                # Use an appropriate string for id
                new_fasta.append('>%s\n%s' % (rec.id, seq1))
with open('region_AA_output1.fasta', 'w') as f:
    f.write('\n'.join(new_fasta))
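If you'd rather stay within Biopython instead of formatting the FASTA text by hand, a sketch of the "write the whole record" route could look like this (the choice of id and description here is an assumption; adapt them to your data):
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

records = []
for rec in SeqIO.parse("C:/Users/Siva/Downloads/sequence.gp", "genbank"):
    for feature in rec.features:
        if feature.type == "Region":
            seq1 = feature.location.extract(rec).seq
            # wrap the plain sequence in a SeqRecord so it carries an ID line
            records.append(SeqRecord(seq1, id=rec.id, description="Region"))

# one call writes all records; count is the number of records written
count = SeqIO.write(records, "region_AA_output1.fasta", "fasta")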

How does one add a string to a tarfile in Python 3

I have a problem adding a str to a tar archive in python. In python 2 I used the following method:
fname = "archive_name"
params_src = "some arbitrarty string to be added to the archive"
params_sio = io.StringIO(params_src)
archive = tarfile.open(fname+".tgz", "w:gz")
tarinfo = tarfile.TarInfo(name="params")
tarinfo.size = len(params_src)
archive.addfile(tarinfo, params_sio)
It's essentially the same as what can be found here.
It worked well. However, after moving to python 3 it broke with the following error:
File "./translate_report.py", line 67, in <module>
main()
File "./translate_report.py", line 48, in main
archive.addfile(tarinfo, params_sio)
File "/usr/lib/python3.2/tarfile.py", line 2111, in addfile
copyfileobj(fileobj, self.fileobj, tarinfo.size)
File "/usr/lib/python3.2/tarfile.py", line 276, in copyfileobj
dst.write(buf)
File "/usr/lib/python3.2/gzip.py", line 317, in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface
To be honest, I have trouble understanding where it comes from, since I do not feed any str to the tarfile module past the point where I construct the StringIO object.
I know the meanings of StringIO, str, bytes and such changed a bit from python 2 to 3, but I do not see a mistake and cannot come up with better logic to solve this task.
I create the StringIO object precisely to provide buffer methods around the string I want to add to the archive, yet it strikes me that some str does not provide them. On top of that, the exception is raised around lines that seem to be responsible for checksum calculations.
Can someone please explain what I am misunderstanding, or at least give an example of how to add a simple str to a tar archive without creating an intermediate file on the file system?
When writing to a file, you need to encode your unicode data to bytes explicitly; StringIO objects do not do this for you, since they are in-memory text files. Use io.BytesIO() instead and encode:
params_sio = io.BytesIO(params_src.encode('utf8'))
Adjust your encoding to your data, of course.
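Putting it together with the original snippet, a complete sketch for Python 3 might look like this (note that tarinfo.size must be the length of the encoded bytes, not of the unicode string; the archive and member names are the same illustrative ones as above):
import io
import tarfile

fname = "archive_name"
params_src = "some arbitrary string to be added to the archive"

params_bytes = params_src.encode('utf8')
params_bio = io.BytesIO(params_bytes)

with tarfile.open(fname + ".tgz", "w:gz") as archive:
    tarinfo = tarfile.TarInfo(name="params")
    tarinfo.size = len(params_bytes)   # size in bytes, not in characters
    archive.addfile(tarinfo, params_bio)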
