I want to do the following with Dask:
Load a matrix from an HDF5 file
Parallelize the calculation of each entry
Here is my code:
import h5py
import numpy as np
import dask.array as da

def blocked_func(x):
    return np.random.random()

with h5py.File(file_path) as f:
    d = f['/data']
    arr = da.from_array(d, chunks=(chunks_row, chunks_col))
    arr2 = arr.map_blocks(blocked_func, dtype='float32').compute()
But the code throws the following error:
File ".../remote_fr_thinkpad/test_big_data.py", line 43, in <module>
arr2 = arr.map_blocks(blocked_func, dtype='float32').compute()
File ".../anaconda3/lib/python3.7/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File ".../anaconda3/lib/python3.7/site-packages/dask/base.py", line 399, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File ".../anaconda3/lib/python3.7/site-packages/dask/base.py", line 399, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 779, in finalize
return concatenate3(results)
File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 3497, in concatenate3
chunks = chunks_from_arrays(arrays)
File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 3327, in chunks_from_arrays
result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
File ".../anaconda3/lib/python3.7/site-packages/dask/array/core.py", line 3327, in <listcomp>
result.append(tuple([shape(deepfirst(a))[dim] for a in arrays]))
IndexError: tuple index out of range
I googled around and also tried dask's gufunc, but that threw the same error.
Thanks for your help.
map_blocks expects blocked_func to return an array of the same shape as its input, (chunks_row, chunks_col), while it actually just returns a float.
Try either with
1) a function which preserves shape, e.g.:
def blocked_func(x):
    return x * 2
or
2) tell map_blocks that the shape of the output will be different:
arr2 = arr.map_blocks(blocked_func, chunks=(1,1), dtype='float32').compute()
but keep the dimensionality of the input array in blocked_func, e.g.:
def blocked_func(x):
    # the output block must still be 2-D to match chunks=(1, 1)
    return np.random.random((1, 1))
    # or, equivalently:
    # return np.array([[np.random.random()]])
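Putting option 2 together, a minimal end-to-end sketch (file_path, chunks_row and chunks_col are the question's own placeholders; note that compute() must run while the HDF5 file is still open):

import h5py
import numpy as np
import dask.array as da

def blocked_func(x):
    # collapse each (chunks_row, chunks_col) block to a single value,
    # returned as a 2-D array to match the declared output chunks
    return np.random.random((1, 1))

with h5py.File(file_path) as f:
    arr = da.from_array(f['/data'], chunks=(chunks_row, chunks_col))
    arr2 = arr.map_blocks(blocked_func, chunks=(1, 1), dtype='float32').compute()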
I have a function:
def create_variables(name, probabilities, labels):
    print('function called')
    model = Metrics(probabilities, labels)
    prec_curve = model.precision_curve()
    kappa_curve = model.kappa_curve()
    tpr_curve = model.tpr_curve()
    fpr_curve = model.fpr_curve()
    pr_auc = auc(tpr_curve, prec_curve)
    roc_auc = auc(fpr_curve, tpr_curve)
    auk = auc(fpr_curve, kappa_curve)
    return [name, prec_curve, kappa_curve, tpr_curve, fpr_curve, pr_auc, roc_auc, auk]
I have the following variables:
svm = pd.read_csv('SVM.csv')
svm_prob_1 = svm.probability[svm.fold_number == 1]
svm_prob_2 = svm.probability[svm.fold_number == 2]
svm_label_1 = svm.true_label[svm.fold_number == 1]
svm_label_2 = svm.true_label[svm.fold_number == 2]
I want to execute the following lines:
svm1 = create_variables('svm_fold1', svm_prob_1, svm_label_1)
svm2 = create_variables('svm_fold2', svm_prob_2, svm_label_2)
Python works as expected for svm1. However, when it starts processing svm2, I receive the following error:
svm2 = create_variables('svm_fold2', svm_prob_2, svm_label_2)
function called
Traceback (most recent call last):
File "<ipython-input-742-702cfac4d100>", line 1, in <module>
svm2 = create_variables('svm_fold2', svm_prob_2, svm_label_2)
File "<ipython-input-741-b8b5a84f0298>", line 6, in create_variables
prec_curve = model.precision_curve()
File "<ipython-input-734-dd9c309be961>", line 59, in precision_curve
self.tp, self.tn, self.fp, self.fn = self.confusion_matrix(self.preds)
File "<ipython-input-734-dd9c309be961>", line 72, in confusion_matrix
if pred == self.labels[i]:
File "C:\Users\20200016\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 1068, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\20200016\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
svm_prob_1 and svm_prob_2 are both of the same shape and contain non-zero values. svm_label_2 contains 0's and 1's and has the same length as svm_prob_2.
Furthermore, the error seems to be in svm_label_2. After swapping in svm_label_1, the following line does work:
svm2 = create_variables('svm_fold2', svm_prob_2, svm_label_1)
Based on the checks below, there seems to be no difference between svm_label_1 and svm_label_2, though.
type(svm_label_1)
Out[806]: pandas.core.series.Series
type(svm_label_2)
Out[807]: pandas.core.series.Series
min(svm_label_1)
Out[808]: 0
min(svm_label_2)
Out[809]: 0
max(svm_label_1)
Out[810]: 1
max(svm_label_2)
Out[811]: 1
sum(svm_label_1)
Out[812]: 81
sum(svm_label_2)
Out[813]: 89
len(svm_label_1)
Out[814]: 856
len(svm_label_2)
Out[815]: 856
Does anyone know what's going wrong here?
I don't know why it works, but converting svm_label_2 into a list worked:
svm_label_2 = list(svm.true_label[svm.fold_number == 2])
Since svm_label_1 and svm_label_2 are of the same type, I don't understand why the latter raised an error and the former did not. I therefore still welcome any explanation of this phenomenon.
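For what it's worth, a likely explanation (my reading, not from the original thread): boolean filtering preserves the original row index, so svm_label_2 starts at the row number where fold 1 ended; labels[i] inside confusion_matrix then performs a label lookup, and label 0 does not exist, hence KeyError: 0. Converting to a list (or resetting the index) restores positional access. A minimal sketch:

import pandas as pd

s = pd.Series([1, 0, 1], index=[856, 857, 858])  # shaped like svm_label_2 after filtering
# s[0] raises KeyError: 0 -- with an integer index, s[0] is a label lookup
print(s.iloc[0])                    # positional access works
print(list(s)[0])                   # so does list(), as in the fix above
print(s.reset_index(drop=True)[0])  # or resetting the index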
I have a custom Keras layer that takes in multiple vectors of the same size (e.g. a list of 3 input vectors, each of length 10; in Keras, the shape of each input vector will be (?, 10)).
In the layer's call method, I first stack the 3 vectors to form a tensor of shape (?, 3, 10), where each vector becomes a row vector and the 3 vectors combine to form a matrix, x (excluding the batch dimension).
Then, x is multiplied by a weight matrix, w, of size (3, 3) with no batch dimension. The weight matrix is defined in the build part of the custom layer.
The result y is permuted to make the batch dimension the first dimension again.
Lastly, the layer must output 3 vectors of the same length as the original inputs, so I slice along axis=1 to give 3 tensors, each of shape (?, 10).
I tried out a test case and it seems to work. But when I build the model and include a line for model.summary(), it gives the following error:
ValueError: Dimensions must be equal, but are 3 and 0 for 'add' (op: 'Add') with input shapes: [3,3], [0].
I have tried various solutions, including K.batch_dot(), but I could not get that to work either, again due to dimension errors...
Thank you for your help!
Solved. Replace
self.trainable_weights = self._w
with
self.trainable_weights.append(self._w)
Keras expects trainable_weights to be a list of variables; assigning the variable directly means that self.trainable_weights + self.non_trainable_weights (visible in the traceback below) builds a TensorFlow Add op between the (3, 3) weight variable and an empty list converted to a shape-[0] tensor, which is exactly the reported error. Phew.
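For reference, a sketch of the more idiomatic route (my suggestion, not from the thread): let add_weight create and register the variable, so trainable_weights is managed by Keras:

from keras.initializers import Constant

def build(self, input_shape):
    # input_shape is a list: one shape tuple per incoming task tensor
    self._input_count = len(input_shape)
    w = np.identity(self._input_count) * 0.9
    w[~np.identity(self._input_count, dtype=bool)] = 0.1 / (self._input_count - 1)
    # add_weight creates the variable and appends it to trainable_weights
    self._w = self.add_weight(name='cross_stitch',
                              shape=(self._input_count, self._input_count),
                              initializer=Constant(w),
                              trainable=True)
    super(CrossStitchLayer, self).build(input_shape)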
# Test Case
import numpy as np
import keras.backend as K

a = K.variable(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # shape (3, 3)
b = K.variable(np.repeat(np.array([[1, 1, 10, 1, 1],
                                   [2, 2, 20, 2, 2],
                                   [3, 3, 30, 3, 3]])[np.newaxis, :],
                         repeats=10, axis=0))  # shape (10, 3, 5)
c = K.dot(a, b)  # shape (3, 10, 5)
c = K.permute_dimensions(c, pattern=(1, 0, 2))
y = K.eval(c)
print(y)
print(c.shape)  # (10, 3, 5)
# Custom layer build part
def build(self, input_shape):
    # input_shape should be a list, since cross stitch must take in inputs from all the individual tasks.
    self._input_count = len(input_shape)
    w = np.identity(self._input_count) * 0.9
    inverse_diag_mask = np.invert(np.identity(self._input_count, dtype=np.bool))
    off_value = 0.1 / (self._input_count - 1)
    w[inverse_diag_mask] = off_value
    self._w = K.variable(np.array(w))
    self.trainable_weights = self._w  # <- the offending line (see the fix above)
    super(CrossStitchLayer, self).build(input_shape)
# Custom layer call part
def call(self, x, **kwargs):
    temp = x  # to show shape
    x = K.stack(x, axis=1)  # (?, 3, 10)
    y1 = K.dot(self._w, x)  # (3, ?, 10)
    y = K.permute_dimensions(y1, pattern=(1, 0, 2))  # (?, 3, 10)
    results = []
    for idx in range(self._input_count):
        results.append(y[:, idx, :])  # 3 tensors, each (?, 10)
    return results
Full Error Message:
Traceback (most recent call last):
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\framework\ops.py", line 1628, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimensions must be equal, but are 3 and 0 for 'add' (op: 'Add') with input shapes: [3,3], [0].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/limka/Desktop/Python/strain_sensor/run_cross_validation.py", line 10, in <module>
k_folds=10, k_shuffle=True, save_model=False, save_model_name=None, save_model_dir='./save/models/')
File "C:\Users\limka\Desktop\Python\strain_sensor\own_package\cross_validation.py", line 57, in run_skf
model = MTmodel(fl=ss_fl, mode=model_mode, hparams=hparams, labels_norm=True)
File "C:\Users\limka\Desktop\Python\strain_sensor\own_package\models.py", line 217, in __init__
cs_model = cross_stitch(self.features_dim, self.labels_dim, self.hparams)
File "C:\Users\limka\Desktop\Python\strain_sensor\own_package\models.py", line 193, in cross_stitch
model.summary()
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\keras\engine\network.py", line 1260, in summary
print_fn=print_fn)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\keras\utils\layer_utils.py", line 166, in print_summary
print_layer_summary_with_connections(layers[i])
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\keras\utils\layer_utils.py", line 153, in print_layer_summary_with_connections
layer.count_params(),
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\keras\engine\base_layer.py", line 1129, in count_params
return count_params(self.weights)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\keras\engine\base_layer.py", line 1022, in weights
return self.trainable_weights + self.non_trainable_weights
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\ops\variables.py", line 856, in _run_op
return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\ops\math_ops.py", line 878, in binary_op_wrapper
return func(x, y, name=name)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 300, in add
"Add", x=x, y=y, name=name)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
op_def=op_def)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\framework\ops.py", line 1792, in __init__
control_input_ops)
File "C:\Users\limka\Anaconda3\envs\my-rdkit-env\lib\site-packages\tensorflow\python\framework\ops.py", line 1631, in _create_c_op
raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 3 and 0 for 'add' (op: 'Add') with input shapes: [3,3], [0].
I have a dataset with 322,098 observations and 3,868 variables. To test the script, I generated a sub-sample with 50 observations and 3,868 variables. When I run the script on the sub-sample, it works perfectly. However, when I try to run it on the complete dataset (322,098 observations), deleting the trade variable from the dataframe raises an error.
The following is the script:
import pandas as pd

## Load External DataSet
mydata = pd.read_csv('C:\\Users\\Inspiron\\Desktop\\policies.csv', sep=',', na_values='.')
## Normalized Data
mydata['normalized'] = (mydata['trade'] - mydata['trade'].min()) / (mydata['trade'].max() - mydata['trade'].min())
## Descriptive Statistics for a Single Variable
mydata['normalized'].describe()
## Drop Columns
mydata = mydata.drop(['trade'], axis=1)
The following is the error:
Traceback (most recent call last):
File "C:\Users\Inspiron\OneDrive\academic\articles\2018\non-discriminatory\script-dofile\mfn.py", line 31, in <module>
mydata = mydata.drop (['trade'], axis = 1)
File "C:\Python36\lib\site-packages\pandas\core\generic.py", line 2530, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "C:\Python36\lib\site-packages\pandas\core\generic.py", line 2563, in _drop_axis
dropped = self.reindex(**{axis_name: new_axis})
File "C:\Python36\lib\site-packages\pandas\util\_decorators.py", line 127, in wrapper
return func(*args, **kwargs)
File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2935, in reindex
return super(DataFrame, self).reindex(**kwargs)
File "C:\Python36\lib\site-packages\pandas\core\generic.py", line 3004, in reindex
self._consolidate_inplace()
File "C:\Python36\lib\site-packages\pandas\core\generic.py", line 3677, in _consolidate_inplace
self._protect_consolidate(f)
File "C:\Python36\lib\site-packages\pandas\core\generic.py", line 3666, in _protect_consolidate
result = f()
File "C:\Python36\lib\site-packages\pandas\core\generic.py", line 3675, in f
self._data = self._data.consolidate()
File "C:\Python36\lib\site-packages\pandas\core\internals.py", line 3826, in consolidate
bm._consolidate_inplace()
File "C:\Python36\lib\site-packages\pandas\core\internals.py", line 3831, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Python36\lib\site-packages\pandas\core\internals.py", line 4853, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Python36\lib\site-packages\pandas\core\internals.py", line 4876, in _merge_blocks
new_values = new_values[argsort]
MemoryError
Can anybody help me?
I would try to use sklearn.preprocessing.MinMaxScaler instead:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
# use `.to_frame()` to prevent `ValueError: Expected 2D array, got 1D array instead:`
mydata['normalized'] = mms.fit_transform(mydata.pop('trade').to_frame())
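If memory is still the bottleneck, a further sketch (assuming all columns are numeric): read the data as float32 and use pop, which removes the column and returns it in one step, so the later drop that triggered the MemoryError is never needed:

import numpy as np
import pandas as pd

mydata = pd.read_csv('C:\\Users\\Inspiron\\Desktop\\policies.csv', sep=',',
                     na_values='.', dtype=np.float32)  # half the memory of float64
t = mydata.pop('trade')  # removes 'trade' from the frame and returns it
mydata['normalized'] = (t - t.min()) / (t.max() - t.min())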
Instead of creating the normalized variable from the trade variable, normalize the trade variable directly.
Follow the command:
## Normalized Data
mydata['trade'] = (mydata['trade'] - mydata['trade'].min()) / (mydata['trade'].max() - mydata['trade'].min())
So I have a set of logs with random IP information and other things. I'm loading these logs into data frames and then trying to read the IP information and return whether each address is public or private by running it through IPy. I'm getting back an error and can't figure it out.
My code looks like this:
import pandas as pd
from IPy import IP

df2 = pd.read_csv('p0f.txt', sep='|', header=0, engine='python')
df2.columns = ['time', 'client_ip', 'server_ip', 'd', 'connection', 'mtu', 'g', 'raw_ip']
df2['client_ip'] = df2['client_ip'].map(lambda x: str(x)[4:])
df2['server_ip'] = df2['server_ip'].map(lambda x: str(x)[4:])
df2['mtu'] = df2['mtu'].map(lambda x: x.lstrip('raw_mtu=langEnglishnonefreqdistHz').rstrip('=Hz'))
df2['time'] = df2['time'].map(lambda x: x.lstrip('').rstrip('mod=mtu,cli=mod=syn+ackmod=uptimemod=httpreqmod= httpmod=host chang'))
df2_client_ip = df2[['client_ip']]
df_client_ip_check = df.applymap(lambda x: IP(x).iptype())
When I run it, I get this error:
Traceback (most recent call last):
File "/Users/mykelxknight/CodeProjects/chiron/BroHttpParse.py", line 44, in <module>
df_client_ip_check = df.applymap(lambda x: IP(x).iptype())
File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 4453, in applymap
return self.apply(infer)
File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 4451, in infer
return lib.map_infer(x.asobject, func)
File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66645)
File "/Users/mykelxknight/CodeProjects/chiron/BroHttpParse.py", line 44, in <lambda>
df_client_ip_check = df.applymap(lambda x: IP(x).iptype())
File "/usr/local/lib/python3.6/site-packages/IPy.py", line 246, in __init__
(self.ip, parsedVersion) = parseAddress(ip)
File "/usr/local/lib/python3.6/site-packages/IPy.py", line 1417, in parseAddress
raise ValueError("%r: single byte must be 0 <= byte < 256" % (ipstr))
ValueError: ("'1433937432.495592': single byte must be 0 <= byte < 256", 'occurred at index #fields')
This is my code in Theano:
import theano
import theano.tensor as T

max_max = 200
beReplaced = T.matrix()
toReplace = T.matrix()
timeArray = T.arange(max_max)

def f(v, k, w):
    return T.concatenate([w[:k], v, w[k+1:]], axis=0)

result, _ = theano.scan(f,
                        sequences=[toReplace, timeArray],
                        outputs_info=beReplaced)
What I am trying to do is replace beReplaced with toReplace line by line. I do this by concatenating the upper part of w, v, and the lower part of w, where v is a line of toReplace.
Here is the error report:
Traceback (most recent call last):
File "/Users/qiansteven/Desktop/NLP/RNN/my.py", line 20, in <module>
outputs_info=np.zeros((5,5),dtype=np.float64))
File "/usr/local/lib/python2.7/site-packages/theano/scan_module/scan.py", line 745, in scan
condition, outputs, updates = scan_utils.get_updates_and_outputs(fn(*args))
File "/Users/qiansteven/Desktop/NLP/RNN/my.py", line 16, in f
return T.concatenate([a,b,c],axis=0)
File "/usr/local/lib/python2.7/site-packages/theano/tensor/basic.py", line 4225, in concatenate
return join(axis, *tensor_list)
File "/usr/local/lib/python2.7/site-packages/theano/gof/op.py", line 611, in __call__
node = self.make_node(*inputs, **kwargs)
File "/usr/local/lib/python2.7/site-packages/theano/tensor/basic.py", line 3750, in make_node
axis, tensors, as_tensor_variable_args, output_maker)
File "/usr/local/lib/python2.7/site-packages/theano/tensor/basic.py", line 3816, in _make_node_internal
raise TypeError("Join() can only join tensors with the same "
TypeError: Join() can only join tensors with the same number of dimensions.
What's wrong?
Put toReplace into non_sequences; otherwise each timestep only takes a slice of it, and Theano reports an error when it tries to concatenate a vector with a matrix.
def f(k, w, v):  # NOTE the argument order change
    return T.concatenate([w[:k], v, w[k+1:]], axis=0)

result, _ = theano.scan(f,
                        sequences=timeArray,
                        outputs_info=beReplaced,
                        non_sequences=toReplace)
Alternatively, concatenating v.dimshuffle('x', 0) instead of v also solves the dimensionality problem while keeping toReplace as a sequence.
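For completeness, a sketch of that dimshuffle variant (keeping the original scan setup from the question):

def f(v, k, w):
    # v is a 1-D row of toReplace; dimshuffle('x', 0) lifts it to shape (1, n)
    # so that all three arguments to concatenate are matrices
    return T.concatenate([w[:k], v.dimshuffle('x', 0), w[k+1:]], axis=0)

result, _ = theano.scan(f,
                        sequences=[toReplace, timeArray],
                        outputs_info=beReplaced)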