Automated creation of multiple datasets in Python-Pytables - python-3.x

In my script, I create several datasets manually:
import tables

# assumption for illustration: the file handle is opened somewhere earlier in the script
f = tables.open_file("my_data.h5", "w")
dset1 = f.create_earray(f.root, "dataset1", atom=tables.Float64Atom(), shape=(0, 2))
dset2 = f.create_earray(f.root, "dataset2", atom=tables.Float64Atom(), shape=(0, 2))
dset3 = f.create_earray(f.root, "dataset3", atom=tables.Float64Atom(), shape=(0, 2))
...
I want to achieve two things:
1. Automate the statements above so they run in a loop and create any desired number (N) of datasets.
2. Automate the sequential .append() calls (shown below) in the same way:
dset1.append(np_array1)
dset2.append(np_array2)
dset3.append(np_array3)
...
I will appreciate any assistance.

It's hard to provide specific advice without more details. If you already have the NumPy arrays, you can create each EArray with its data in a single call (using the obj= parameter). Here's a little code snippet that shows how to do this in a loop.
import tables as tb
import numpy as np

with tb.File('SO_64397597.h5', 'w') as h5f:
    arr1 = np.ones((10, 2))
    arr2 = 2.*np.ones((10, 2))
    arr3 = 3.*np.ones((10, 2))
    arr_list = [arr1, arr2, arr3]
    for cnt in range(1, 4):
        h5f.create_earray("/", "dataset"+str(cnt), obj=arr_list[cnt-1])
The code above doesn't create dataset objects. If you need them, you can access them programmatically with this call:
# input where as path to node, name not required
ds = h5f.get_node("/dataset1")
# or
# input where as path to group, and name as dataset name
ds = h5f.get_node("/","dataset1")
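If you need handles to all of the datasets after creating them, get_node can be called in a loop. A minimal sketch, assuming the file handle h5f is still open and the datasets follow the dataset<n> naming pattern:
dsets = {}
for cnt in range(1, 4):
    name = "dataset" + str(cnt)
    dsets[name] = h5f.get_node("/", name)
# dsets["dataset1"] now refers to the same EArray created above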
If you don't have the arrays when you create the datasets, you can create the EArrays in the first loop, then add the np.array data in a second loop. See below:
with tb.File('SO_64397597.h5', 'w') as h5f:
    for cnt in range(1, 4):
        h5f.create_earray("/", "dataset"+str(cnt), atom=tb.Float64Atom(), shape=(0, 2))
    # get array data...
    arr_list = [arr1, arr2, arr3]
    # add array data
    for cnt in range(1, 4):
        h5f.get_node("/", "dataset"+str(cnt)).append(arr_list[cnt-1])
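The same pattern generalizes to any number of datasets. A minimal sketch, where the value of N and the contents of np_arrays are assumptions for illustration:
N = 5  # assumed number of datasets
np_arrays = [np.full((10, 2), cnt, dtype=np.float64) for cnt in range(1, N+1)]  # placeholder data

with tb.File('SO_64397597.h5', 'w') as h5f:
    # create N extendable datasets in one loop
    for cnt in range(1, N+1):
        h5f.create_earray("/", "dataset"+str(cnt), atom=tb.Float64Atom(), shape=(0, 2))
    # append one array per dataset in a second loop
    for cnt in range(1, N+1):
        h5f.get_node("/", "dataset"+str(cnt)).append(np_arrays[cnt-1])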

Related

Reading a set of HDF5 files and then slicing the resulting datasets without storing them in the end

I think some of my question is answered here, but the difference in my case is that I'm wondering whether it is possible to do the slicing step without having to re-write the datasets to another file first.
Here is the code that reads in a single HDF5 file that is given as an argument to the script:
with h5py.File(args.H5file, 'r') as df:
    print('Here are the keys of the input file\n', df.keys())
    # interesting point here: you need the [:] behind each of these, and we didn't need it when
    # creating datasets not using the 'with' formalism above. Adding that even handled the cases
    # in 'hits' and 'truth_hadrons' where there are additional dimensions...go figure.
    jetdset = df['jets'][:]
    haddset = df['truth_hadrons'][:]
    hitdset = df['hits'][:]
Then later I do some slicing operations on these datasets.
Ideally I'd be able to pass a wild-card into args.H5file and then the whole set of files, all with the same data formats, would end up in the three datasets above.
I do not want to store or make persistent these three datasets at the end of the script as the output are plots that use the information in the slices.
Any help would be appreciated!
There are at least 2 ways to access multiple files:
1. If all files follow a naming pattern, you can use the glob module. It uses wildcards to find files. (Note: I prefer glob.iglob; it is an iterator that yields values without creating a list. glob.glob creates a list, which you frequently don't need.)
2. Alternatively, you could input a list of filenames and loop over the list.
Example of iglob:
import glob
import h5py

for fname in glob.iglob('img_data_0?.h5'):
    with h5py.File(fname, 'r') as h5f:
        print('Here are the keys of the input file\n', h5f.keys())
Example with a list of names:
filenames = ['img_data_01.h5', 'img_data_02.h5', 'img_data_03.h5']
for fname in filenames:
    with h5py.File(fname, 'r') as h5f:
        print('Here are the keys of the input file\n', h5f.keys())
Next, your code mentions using [:] when you access a dataset. Whether or not you need to add indices depends on the object you want returned.
If you include an index like [()], it returns the entire dataset as a NumPy array. Note that [()] is now preferred over [:]. You can use any valid slice notation, e.g., [0,0,:] for a slice of a 3-axis array.
If you don't include an index, it returns an h5py dataset object, which behaves like a NumPy array. (For example, you can get its dtype and shape, and slice the data.) The advantage? It has a smaller memory footprint. I use h5py dataset objects unless I specifically need an array (for example, when passing image data to another package).
Examples of each method:
jets_dset = h5f['jets'] # w/out [()] returns a h5py dataset object
jets_arr = h5f['jets'][()] # with [()] returns a numpy array object
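To illustrate the smaller footprint, a dataset object can be sliced so that only the requested rows are read from disk into a NumPy array (the slice bounds below are arbitrary):
jets_dset = h5f['jets']        # dataset object; no data read yet
first_rows = jets_dset[0:10]   # reads only the first 10 rows as a numpy array
print(jets_dset.shape, jets_dset.dtype, first_rows.shape)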
Finally, if you want to create a single array that merges values from 3 datasets, you have to create an array big enough to hold the data, then load it with slice notation. Alternatively, you can use np.concatenate(). (However, be careful, as concatenating a lot of data can be slow.)
A simple example of each method is shown below. It assumes you know the shape of the dataset and that it is the same for all 3 files (a0, a1 are the axis lengths for one dataset). If you don't know them, you can get them from the .shape attribute.
Example for method 1 (pre-allocating array jets3x_arr):
a0, a1 = 100, 100
jets3x_arr = np.empty(shape=(a0, a1, 3))  # add dtype= if not float
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
    with h5py.File(fname, 'r') as h5f:
        jets3x_arr[:, :, cnt] = h5f['jets']
Example for method 2 (using np.concatenate()):
a0, a1 = 100, 100
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
    with h5py.File(fname, 'r') as h5f:
        if cnt == 0:
            jets3x_arr = h5f['jets'][()].reshape(a0, a1, 1)
        else:
            jets3x_arr = np.concatenate(
                (jets3x_arr, h5f['jets'][()].reshape(a0, a1, 1)), axis=2)
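If repeated concatenation turns out to be slow, a common variant (a sketch under the same shape assumptions) is to collect the per-file arrays in a list and stack them once at the end:
jets_list = []
for fname in glob.iglob('img_data_0?.h5'):
    with h5py.File(fname, 'r') as h5f:
        jets_list.append(h5f['jets'][()])  # read each dataset fully into memory
jets3x_arr = np.stack(jets_list, axis=2)   # single allocation instead of repeated concatenation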

Issues appending arrays generated in a loop to an initial empty array

I have the following code, aiming to concatenate arrays generated within a loop after initializing an empty array before entering the loop. For illustration purposes, I make the loop run only one iteration. However, I found that result_array still includes the initial empty array. I am not clear how to fix this issue; is it possible to append the arrays generated within the loop without setting up an initial empty array? To reproduce the problem, I am including both the code and a screenshot of a run.
import numpy as np
import hdmedians as hd

x = np.random.rand(5, 10, 2)
result_array = np.empty((1, 2), float)
for i in range(1):  # x.shape[1]
    print('i----', i)
    x1 = x[:, i, :]
    print(x1)
    x2 = hd.medoid(x1, axis=0)
    x2 = x2.reshape((1, 2))
    print(x2)
    print('---------')
    result_array = np.append(result_array, x2, axis=0)
I don't know the hdmedians API, so I can't really suggest vectorized code (if that is even possible), but you can replace your append-in-a-loop (which is inefficient) with:
np.vstack([hd.medoid(x[:,i,:], axis=0)
for i in range(x.shape[1])])
Output:
array([[0.3595079 , 0.43703195],
[0.60276338, 0.54488318],
[0.4236548 , 0.64589411],
[0.52324805, 0.09394051],
[0.52184832, 0.41466194],
[0.57019677, 0.43860151],
[0.45615033, 0.56843395],
[0.20887676, 0.16130952],
[0.65310833, 0.2532916 ],
[0.46631077, 0.24442559]])
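If you do want to keep the append-in-a-loop pattern from the question, the usual fix for the leftover first row is to start from an array with zero rows, so there is nothing spurious to carry along. A small sketch of that change:
result_array = np.empty((0, 2), float)  # zero rows instead of one uninitialized row
for i in range(x.shape[1]):
    x2 = hd.medoid(x[:, i, :], axis=0).reshape((1, 2))
    result_array = np.append(result_array, x2, axis=0)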

How to load a big numpy array from a text file with Dask?

I have a text file containing data that I read to memory with numpy.genfromtxt enforcing a custom numpy.dtype. Although the text file is smaller than the available RAM, I often get a MemoryError (which I don't understand, but it is not the point of this question). When looking for ways to resolve it, I came across dask. In the API I found methods for data loading but none of them reads from text files, not to mention my need to support converters in genfromtxt().
I see there is a dask.dataframe.read_csv() method, but in my case I don't use pandas, but rather a plain numpy.array with custom dtypes and column names, as mentioned above. The text file I have is not CSV anyway (thus the above-mentioned use of converters in genfromtxt()).
Any ideas on how I could handle this will be appreciated.
You should use the function dask.bytes.read_bytes with delimiter="\n" to read your file(s) and split them into blocks at line-endings. You get back a set of dask.delayed objects, which you can pass to numpy. Unfortunately, numpy wants a file-like, so you must pack the bytes again:
import io
import numpy
import dask
import dask.array as da

_, blocks = dask.bytes.read_bytes(files, delimiter="\n")

@dask.delayed
def parse(block):
    return numpy.genfromtxt(io.BytesIO(block), ...)

arrays = [da.from_delayed(parse(block), ...) for block in blocks]
arr = da.stack(arrays)  # or da.concatenate(arrays)
SO editors rejected my edit to @mdurant's answer, so I post the working code (based on that answer) here:
import numpy
import dask
import dask.array as da
import io

fname = 'data.txt'
# data.txt is:
# 1 2
# 3 4
# 5 6
files = [fname]
_, blocks = dask.bytes.read_bytes(files, delimiter="\n")

my_type = numpy.dtype([
    ('f1', numpy.float64),
    ('f2', numpy.float64)
])

native_type = float  # the numpy.float alias was removed in newer NumPy releases
used_type = numpy.float64
# If the below line is uncommented, then creating the dask array will work, but it won't
# be possible to perform any operations on it
# used_type = my_type

# Debug
# print('blocks', blocks)
# print('type(blocks)', type(blocks))
# print('blocks[0]', blocks[0])
# print('type(blocks[0])', type(blocks[0]))

@dask.delayed
def parse(block):
    r = numpy.genfromtxt(io.BytesIO(block[0]))
    print('parse() about to return:\n', r, '\n')
    return r

# Below I added shape, which seems compulsory, the reason for which I don't understand
arrays = [da.from_delayed(value=parse(block), shape=(3,), dtype=used_type) for block in blocks]

# da.concat did not work for me
arr = da.stack(arrays)

# The below will not work if used_type is set to my_type
arr += 1

# Neither would the below work; it raises NotImplementedError
# arr['f1'] += 1

arr_np = arr.compute()
print('numpy array incremented by one: \n', arr_np)
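Since the question specifically mentions needing converters in genfromtxt(), the same parse() pattern accepts them. A minimal sketch, where the converter and the column it applies to are made up for illustration (float() accepts both bytes and str field values):
scale_first = lambda s: 2.0 * float(s)  # hypothetical converter for column 0

@dask.delayed
def parse_with_converters(block):
    return numpy.genfromtxt(io.BytesIO(block[0]),
                            dtype=used_type,
                            converters={0: scale_first})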

What to pass to clf.predict()?

I started playing with Decision Trees lately and I wanted to train my own simple model with some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel of how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Docs state:
clf.predict(X)
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
But when trying to pass np.array, np.ndarray, list, tuple or DataFrame it just throws an error. Can you help me understand why please?
Code below:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree
pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150
lenght = 50000
miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(lenght)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(lenght)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(lenght)]
DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})
DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')
target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)
clf.predict(?????) #### <===== What should go here?
clf.predict([30,4000,1])
ValueError: Expected 2D array, got 1D array instead:
array=[3.e+01 4.e+03 1.e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
clf.predict(np.array(30,4000,1))
ValueError: only 2 non-keyword arguments accepted
Where is your "mock data" that you want to predict?
Your data should be of the same shape that you used when calling fit(). From the code above, I see that your X has three columns ['CommuteInMiles','Salary','FullTimeEmployee']. You need to have those many columns in your prediction data, number of rows can be arbitrary.
Now when you do
clf.predict([30,4000,1])
The model is not able to tell whether these are columns of the same row or data from different rows.
So you need to convert that into a 2-d array, where each inner array represents a single row.
Do this:
clf.predict([[30,4000,1]]) #<== Observe the two square brackets
You can have multiple rows to be predicted, each in inner list. Something like this:
X_test = [[30,4000,1],
[35,15000,0],
[40,2000,1],]
clf.predict(X_test)
Now as for your last error, clf.predict(np.array(30,4000,1)): this has nothing to do with predict(). You are using np.array() wrong.
According to the documentation, the signature of np.array is:
(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
Leaving aside the first (object), all the others are keyword arguments, so they need to be used as such. But when you do np.array(30,4000,1), each value is considered as input to a separate param: object=30, dtype=4000, copy=1. This is not allowed, hence the error. If you want to make a numpy array from a list, you need to pass a list.
Like this: np.array([30,4000,1])
Now this will be considered correctly as input to object param.
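As a side note, since the model was fit on a pandas DataFrame, you can also build the prediction input as a DataFrame with the same column names. A small sketch (the sample values are arbitrary):
X_test = pd.DataFrame([[30, 4000, 1], [5, 1300, 0]],
                      columns=['CommuteInMiles', 'Salary', 'FullTimeEmployee'])
print(clf.predict(X_test))  # one predicted label per row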

I could not combine 2 lists into dictionary using zip()

I just learned about zip() from Stack Overflow, but it does not work properly.
def diction():
    import random
    import string
    import itertools
    dictionary_key = {}
    upper_list = []
    string_dictionary_upper = string.ascii_uppercase
    for n in string_dictionary_upper:
        upper_list.append(n)
    upper_list_new = list(random.shuffle(upper_list))
    dictionary_key = dict(zip(upper_list, upper_list_new))
diction()
The error is "'NoneType' object is not iterable", but I could not figure out why.
random.shuffle() shuffles a list in place and returns None, which is why list(random.shuffle(upper_list)) raises "'NoneType' object is not iterable". If you want to create a shuffled copy of a list, do so in two steps:
1) Copy the list.
2) Shuffle the copy:
upper_list_new = upper_list[:] #create a copy
random.shuffle(upper_list_new) #shuffle the copy
The result can then be zipped with other lists.
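Applied to the function from the question, a minimal sketch of the corrected body looks like this:
import random
import string

def diction():
    upper_list = list(string.ascii_uppercase)
    upper_list_new = upper_list[:]      # copy the list
    random.shuffle(upper_list_new)      # shuffle the copy in place
    return dict(zip(upper_list, upper_list_new))

print(diction())  # e.g. {'A': 'Q', 'B': 'C', ...}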
