dask array map_blocks, with differently shaped dask array as argument - python-3.x

I'm trying to use dask.array.map_blocks to process a dask array, using a second dask array with different shape as an argument. The use case is firstly running some peak finding on a 2-D stack of images (4-dimensions), which is returned as a 2-D dask array of np.objects. Ergo, the two first dimensions of the two dask arrays are the same. The peaks are then used to extract intensities from the 4-dimensional dataset. In the code below, I've omitted the peak finding part. Dask version 1.0.0.
import numpy as np
import dask.array as da
def test_processing(data_chunk, position_chunk):
output_array = np.empty(data_chunk.shape[:-2], dtype='object')
for index in np.ndindex(data_chunk.shape[:-2]):
islice = np.s_[index]
intensity_list = []
data = data_chunk[islice]
positions = position_chunk[islice]
for x, y in positions:
intensity_list.append(data[x, y])
output_array[islice] = np.array(intensity_list)
return output_array
data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
positions = np.empty(data.shape[:-2], dtype='object')
for index in np.ndindex(positions.shape):
positions[index] = np.arange(10).reshape(5, 2)
data_output = da.map_blocks(test_processing, data, positions, dtype=np.object,
chunks=(2, 2), drop_axis=(2, 3))
data_output.compute()
This gives the error ValueError: Can't drop an axis with more than 1 block. Please useatopinstead., which I'm guessing is due to positions having 3 dimensions, while data has 4 dimensions.
The same function, but without the positions dask array works fine.
import numpy as np
import dask.array as da
def test_processing(data_chunk):
output_array = np.empty(data_chunk.shape[:-2], dtype='object')
for index in np.ndindex(data_chunk.shape[:-2]):
islice = np.s_[index]
intensity_list = []
data = data_chunk[islice]
positions = [[5, 2], [1, 3]]
for x, y in positions:
intensity_list.append(data[x, y])
output_array[islice] = np.array(intensity_list)
return output_array
data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
data_output = da.map_blocks(test_processing, data, dtype=np.object,
chunks=(2, 2), drop_axis=(2, 3))
data_computed = data_output.compute()

This has been fixed in more recent versions of dask: running the same code on version 2.3.0 of dask works fine.

Related

Tensorflow map_fn Out of Memory Issues

I am having issues with my code running out of memory on large data sets. I attempted to chunk the data to feed it into the calculation graph but I eventually get an out of memory error. Would setting it up to use the feed_dict functionality get around this problem?
My code is set up like the following, with a nested map_fn function due to a result of the tf_itertools_product_2D_nest function.
tf_itertools_product_2D_nest function is from Cartesian Product in Tensorflow
I also tried a variation where I made a list of tensor-lists which was significantly slower than doing it purely in tensorflow so I'd prefer to avoid that method.
import tensorflow as tf
import numpy as np
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess = tf.Session()
sess.run(tf.global_variables_initializer())
tensorboard_log_dir = "../log/"
def tf_itertools_product_2D_nest(a,b): #does not work on nested tensors
a, b = a[ None, :, None ], b[ :, None, None ]
#print(sess.run(tf.shape(a)))
#print(sess.run(tf.shape(b)))
n_feat_dimension_in_common = tf.shape(a)[-1]
c = tf.concat( [ a + tf.zeros_like( b ), tf.zeros_like( a ) + b ], axis = 2 )
return c
def do_calc(arr_pair):
arr_1 = arr_pair[0]
arr_binary = arr_pair[1]
return tf.reduce_max(tf.cumsum(arr_1*arr_binary))
def calc_row_wrapper(row):
return tf.map_fn(do_calc,row)
for i in range(0,10):
a = tf.constant(np.random.random((7,10))*10,tf.float64)
b = tf.constant(np.random.randint(2, size=(3,10)),tf.float64)
a_b_itertools_product = tf_itertools_product_2D_nest(a,b)
'''Creates array like this:
[ [[arr_a0,arr_b0], [arr_a1,arr_b0],...],
[[arr_a0,arr_b1], [arr_a1,arr_b1],...],
[[arr_a0,arr_b2], [arr_a1,arr_b2],...],
...]
'''
with tf.summary.FileWriter(tensorboard_log_dir, sess.graph) as writer:
result_array = sess.run(tf.map_fn(calc_row_wrapper,a_b_itertools_product),
options=run_options,run_metadata=run_metadata)
writer.add_run_metadata(run_metadata,"iteration {}".format(i))
print(result_array.shape)
print(result_array)
print("")
# result_array should be an array with 3 rows (1 for each binary vector in b) and 7 columns (1 for each row in a)
I can imagine that is unnecessarily consuming memory due to the extra dimension added. Is there a way to mimic the outcome of the standard itertools.product() function to output 1 long list of every possible combination of items in the 2 input iterables? Like the result of:
itertools.product([[1,2],[3,4]],[[5,6],[7,8]])
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
That would eliminate the need to call map_fn twice.
When map_fn is called within a loop as my code shows, will it keep spawning graphs for every iteration? There appears to be a big "map_" node for every iteration cycle in this code's Tensorboardgraph.
Tensorboard Default View (not enough reputation yet)
When I select a particular iteration based on the tag in Tensorboard, only the map node corresponding to the iteration is highlighted with all the others grayed out. Does that mean that for that cycle only the map node for that cycle is present (and the others no longer, if from a previous cycle , exist in memory)?
Tensorboard 1 iteration view

Issue when trying to plot after applying PCA on a dataset

I am trying to plot the results of PCA of the dataset pima-indians-diabetes.csv. My code shows a problem only in the plotting piece:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
Kindly, how to fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
(768, 1)
Which leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
as the first array is already multidimensional, therefore the error is indicating that you used too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
principalComponents[(y == i).reshape(-1), 1],
color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of the shape (R, 1)and (R,) this question on StackOverflow provides a nice starting point.

How can I fill an array with sets

Say I have an image of 2x2 pixels named image_array, each pixel color is identified by a tuple of 3 entries (RGB), so the shape of image_array is 2x2x3.
I want to create an np.array c which has the shape 2x2x1 and which last coordinate is an empty set.
I tried this:
import numpy as np
image = (((1,2,3), (1,0,0)), ((1,1,1), (2,1,2)))
image_array = np.array(image)
c = np.empty(image_array.shape[:2], dtype=set)
c.fill(set())
c[0][1].add(124)
print(c)
I get:
[[{124} {124}]
[{124} {124}]]
And instead I would like the return:
[[{} {124}]
[{} {}]]
Any idea ?
The object array has to be filled with separate set() objects. That means creating them individually, as I do with a list comprehension:
In [279]: arr = np.array([set() for _ in range(4)]).reshape(2,2)
In [280]: arr
Out[280]:
array([[set(), set()],
[set(), set()]], dtype=object)
That construction should highlight that fact that this array is closely related to a list, or list of lists.
Now we can do a set operation on one of those elements:
In [281]: arr[0,1].add(124) # more idiomatic than arr[0][1]
In [282]: arr
Out[282]:
array([[set(), {124}],
[set(), set()]], dtype=object)
Note that we cannot operate on more than one set at a time. The object array offers few advantages compared to a list.
This is a 2d array; the sets don't form a dimension. Contrast that with
In [283]: image = (((1,2,3), (1,0,0)), ((1,1,1), (2,1,2)))
...: image_array = np.array(image)
...:
In [284]: image_array
Out[284]:
array([[[1, 2, 3],
[1, 0, 0]],
[[1, 1, 1],
[2, 1, 2]]])
While it started with tuples, it made a 3d array of integers.
Try this:
import numpy as np
x = np.empty((2, 2), dtype=np.object)
x[0, 0] = set(1, 2, 3)
print(x)
[[{1, 2, 3} None]
[None None]]
For non-number types in numpy you should use np.object.
whenever you do fill(set()), this will fill the array with exactly same set, as they refer to the same set. To fix this, just make a set if there isnt one everytime you need to add to the set
c = np.empty(image_array.shape[:2], dtype=set)
if not c[0][1]:
c[0,1] = set([124])
else:
c[0,1].add(124)
print (c)
# [[None {124}]
# [None None]]
Try changing your line c[0][1].add to this.
c[0][1] = 124
print(c)

numpy matrix not functioning as intended

This is my code:
import random
import numpy as np
import math
populacao = 5
x_min = -10
x_max = 10
nbin = 4
def fitness(xy, populacao, resultado):
fit = np.matrix(resultado)
xy_fit = np.append(xy, fit.T, axis = 1)
xy_fit_sorted = xy_fit[np.argsort(xy_fit[:,-1].T),:]
return xy_fit_sorted
def codifica(x, x_min, x_max,n):
x = float(x)
xdec = round((x-x_min)/(x_max-x_min)*(2**n-1))
xbin = int(bin(xdec)[2:])
return(xbin)
xy = np.array([[1, 2],[3,4],[0,0],[-5,-1],[9,-2]])
resultado = np.array([5, 25, 0, 26, 85])
print(xy)
xy_fit_sorted = np.array(fitness(xy, populacao, resultado))
print(xy_fit_sorted)
parents = (xy_fit_sorted[:,:2])
print(parents)
the problem i'm having is that to select the 2 rows of "xy_fit_sorted", i'm doing this strange thing:
parents = (xy_fit_sorted[:,:2])
Intead of what makes sense in my mind:
parents = (xy_fit_sorted[:1,:])
it's like the whole matrix is in one line.
I'm not sure what most of your code is doing, so here's just a guess: are you thrown off by the shape of xy_fit_sorted being (1, 5, 3), having an extra zero axis?
That could be fixed e.g. by constructing xy_fit without the use of np.matrix:
xy_fit = np.append(xy, resultado[:, np.newaxis], axis=1)
Then xy_fit_sorted comes out with a shape of (5, 3).
The underlying issue was that np.matrix is always a 2-D array. When indexing xy_fit[...] you intend to index with a vector. But using np.matrix for xy_fit, xy_fit[:,-1].T is not a vector, but a 2-D array as well (of shape (1,5)). This leads to xy_fit_sorted having an extra dimension as well.
Note that the numpy doc says about np.matrix anyhow:
It is no longer recommended to use this class, even for linear algebra. Instead use regular arrays. The class may be removed in the future.

How to plot accuracy bars for each feature of an array

I have a data set "x" and its label vector "y". I want to plot the accuracy for each attribute (for each column of "x") after applying NaiveBayes and cross-validation. I want a bar graph.
So at the end I need to have 3 bars, because "x" has 3 columns. And the classification has to run 3 times. 3 different accuracies for each feature.
Whenever I execute my code it shows:
ValueError: Found arrays with inconsistent numbers of samples: [1 3]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
What am I doing wrong?
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108], [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117], [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])
scores = list()
scores_std = list()
for i in range(x.shape[1]):
xA=x[:, i]
scoresKF2 = cross_validation.cross_val_score(clf, xA, y, cv=2)
scores.append(np.mean(scoresKF2))
scores_std.append(np.std(scoresKF2))
plt.bar(x[:,i], scores)
plt.show()
Checking the shape of your input data, xA, shows us that it is 1-dimensional -- specifically, it is (7,) shape. As the warning tells us, you are not allowed to pass in a 1d array here. The key to solving this in the warning that was returned Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. Therefore, since it is just a single feature, do this xA = x[:,i].reshape(-1, 1) instead of xA = x[:,i].
I think there is another issue with the plotting. I'm not completely sure what you are expecting to see but you should probably replace plt.bar(x[:,i], scores) with plt.bar(i, np.mean(scoresKF2)).

Resources