How to create 2-D array from 3-D Numpy array? - python-3.x

I have a 3 dimensional Numpy array corresponding to an RGB image. I need to create a 2 dimensional Numpy array from it such that if any pixel in the R, G, or B channel is 1, then the corresponding pixel in the 2-D array is 255.
I know how to use something like a list comprehension on a Numpy array, but the result is the same shape as the original array. I need the new shape to be 2-D.

Assuming you want the output pixel to be 0 wherever it is not 255, and that your input has shape M x N x 3:
RGB = RGB == 1 # you can skip this if your original (RGB) contains only 0's and 1's anyway
out = np.where(np.logical_or.reduce(RGB, axis=-1), 255, 0)

One approach is to use any() along the third dimension and then multiply by 255, so that the booleans are automatically upcast to an integer type, like so -
(img==1).any(axis=2)*255
Sample run -
In [19]: img
Out[19]:
array([[[1, 8, 1],
        [2, 4, 7]],
       [[4, 0, 6],
        [4, 3, 1]]])
In [20]: (img==1).any(axis=2)*255
Out[20]:
array([[255,   0],
       [  0, 255]])
Runtime test -
In [45]: img = np.random.randint(0,5,(1024,1024,3))
# @Paul Panzer's soln
In [46]: %timeit np.where(np.logical_or.reduce(img==1, axis=-1), 255, 0)
10 loops, best of 3: 22.3 ms per loop
# @nanoix9's soln
In [47]: %timeit np.apply_along_axis(lambda a: 255 if 1 in a else 0, 0, img)
10 loops, best of 3: 40.1 ms per loop
# Posted soln here
In [48]: %timeit (img==1).any(axis=2)*255
10 loops, best of 3: 19.1 ms per loop
Additionally, we could convert to np.uint8 before multiplying by 255 for a small further performance boost -
In [49]: %timeit (img==1).any(axis=2).astype(np.uint8)*255
100 loops, best of 3: 18.5 ms per loop
Working with the individual slices along the third dimension is faster still -
In [68]: %timeit ((img[...,0]==1) | (img[...,1]==1) | (img[...,2]==1))*255
100 loops, best of 3: 7.3 ms per loop
In [69]: %timeit ((img[...,0]==1) | (img[...,1]==1) | (img[...,2]==1)).astype(np.uint8)*255
100 loops, best of 3: 5.96 ms per loop
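
As a quick consistency check, here is a small sketch that reruns the sample array from above and confirms that the three masks discussed in this thread agree. The variable names (mask_any, mask_reduce, mask_slices, out) are made up for illustration and are not from the original answers.

```python
import numpy as np

# Sample M x N x 3 array from the sample run above.
img = np.array([[[1, 8, 1],
                 [2, 4, 7]],
                [[4, 0, 6],
                 [4, 3, 1]]])

# Three equivalent ways to ask "is any channel equal to 1?" per pixel.
mask_any = (img == 1).any(axis=2)
mask_reduce = np.logical_or.reduce(img == 1, axis=-1)
mask_slices = (img[..., 0] == 1) | (img[..., 1] == 1) | (img[..., 2] == 1)

# Cast to uint8 before scaling so the result is a compact 8-bit image.
out = mask_any.astype(np.uint8) * 255
print(out)
# [[255   0]
#  [  0 255]]
```

All three masks have shape (M, N), so whichever one you pick, the result is already 2-D.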

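Use np.apply_along_axis. Note that this example has the channels along axis 0; for an M x N x 3 image, pass axis=-1 instead. E.g.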
In [28]: import numpy as np
In [29]: np.random.seed(10)
In [30]: img = np.random.randint(2, size=12).reshape(3, 2, 2)
In [31]: img
Out[31]:
array([[[1, 1],
        [0, 1]],
       [[0, 1],
        [1, 0]],
       [[1, 1],
        [0, 1]]])
In [32]: np.apply_along_axis(lambda a: 255 if 1 in a else 0, 0, img)
Out[32]:
array([[255, 255],
       [255, 255]])
See the NumPy documentation for details.
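
For an image stored channels-last (M x N x 3, as in the question), the same idea would run along the last axis. The following is only a sketch on made-up random data; the names rng, slow and fast are hypothetical. It also checks the result against the any()-based mask.

```python
import numpy as np

# Hypothetical 4 x 5 x 3 "image" with values in {0, 1, 2}.
rng = np.random.default_rng(0)
img = rng.integers(0, 3, size=(4, 5, 3))

# Apply the per-pixel test along the channel axis (the last one).
slow = np.apply_along_axis(lambda px: 255 if 1 in px else 0, -1, img)

# Vectorized equivalent, which avoids a Python-level call per pixel.
fast = (img == 1).any(axis=-1) * 255

print(slow.shape)                   # (4, 5)
print(np.array_equal(slow, fast))   # True
```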

Related

Pytorch, retrieving values from a tensor using several indices. Most computationally efficient solution

If I have an example 2-D tensor
a = [[4, 2, 1, 6],[1, 2, 3, 8], [92, 4, 23, 54]]
tensor_a = torch.tensor(a)
I can get 2 of the 1D tensors along the first dimension using
tensor_a[[0, 1]]
tensor([[4, 2, 1, 6],
[1, 2, 3, 8]])
But how about using several indices?
So I have something like this
list_indices = [[0, 0], [0,2], [1, 2]]
I could do something like
combos = []
for indi in list_indices:
    combos.append(tensor_a[indi])
But since there's a for loop, I'm wondering if there's a more computationally efficient way to do this, perhaps also using PyTorch.
It is more computationally efficient to use the built-in PyTorch function torch.index_select to select tensor elements using a list of indices:
a = [[4, 2, 1, 6],[1, 2, 3, 8], [92, 4, 23, 54]]
tensor_a = torch.tensor(a)
list_indices = [[0, 0], [0,2], [1, 2]]
#convert list_indices to Tensor
indices = torch.tensor(list_indices)
#get elements from tensor_a using indices.
tensor_a=torch.index_select(tensor_a, 0, indices.view(-1))
print(tensor_a)
If you want the result to be a list rather than a tensor, you can convert tensor_a to a list:
tensor_a_list = tensor_a.tolist()
To test the computational efficiency, I created 1000000 indices and compared the execution times. The loop takes more time than the PyTorch approach suggested above:
import time
import torch
start_time = time.time()
a = [[4, 2, 1, 6],[1, 2, 3, 8], [92, 4, 23, 54]]
tensor_a = torch.tensor(a)
indices = torch.randint(0, 2, (1000000,)).tolist()
combos = []  # collect the selected rows
for indi in indices:
    combos.append(tensor_a[indi])
print("--- %s seconds ---" % (time.time() - start_time))
--- 3.3966853618621826 seconds ---
start_time = time.time()
indices = torch.tensor(indices)
tensor_a=torch.index_select(tensor_a, 0, indices)
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.10641193389892578 seconds ---
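
One thing to be aware of: index_select with indices.view(-1) returns the selected rows flattened into a single (6, 4) tensor, whereas the loop in the question produces one (2, 4) tensor per index pair. A minimal sketch of how the pairing could be restored, assuming the same tensor_a and list_indices as above (the names flat, grouped, direct and looped are made up for this example):

```python
import torch

tensor_a = torch.tensor([[4, 2, 1, 6], [1, 2, 3, 8], [92, 4, 23, 54]])
indices = torch.tensor([[0, 0], [0, 2], [1, 2]])

# index_select needs 1-D indices, so the pair structure is flattened away.
flat = torch.index_select(tensor_a, 0, indices.view(-1))   # shape (6, 4)

# Restore the (pair, element, column) layout.
grouped = flat.view(*indices.shape, -1)                    # shape (3, 2, 4)

# Indexing directly with the 2-D index tensor gives the same layout.
direct = tensor_a[indices]                                 # shape (3, 2, 4)

# Reference: the loop from the question, stacked into one tensor.
looped = torch.stack([tensor_a[pair] for pair in indices.tolist()])

print(grouped.shape)                 # torch.Size([3, 2, 4])
print(torch.equal(grouped, direct))  # True
print(torch.equal(direct, looped))   # True
```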

Vector similarity with multiple dtypes (string, int, floats etc.)?

I have the following 2 rows in my dataframe:
[1, 1.1, -19, "kuku", "lulu"]
[2.8, 1.1, -20, "kuku", "lilu"]
I want to calculate their similarity by comparing each dimension (1 if equal, otherwise 0) to get the following vector: [0, 1, 0, 1, 0]. Is there any function that takes a vector, performs such a "similarity" comparison against all rows, and calculates the mean? In our case it would be 2/5 = 0.4.
I would just use a simple == on NumPy arrays, cast the result to int for the vector, and use numpy.mean() for the mean of the vector:
import numpy as np
a = [1, 1.1, -19, "kuku", "lulu"]
b = [2.8, 1.1, -20, "kuku", "lilu"]
res = (np.array(a) == np.array(b)).astype(int)
print(res)
# [0 1 0 1 0]
v = res.mean()
print(v)
# 0.4
To compare every row against every other row, if you do not mind computing everything twice and you can afford the potentially large intermediate temporary objects:
import numpy as np
arr = np.array([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
    [2.8, 1.1, -20, "kuku", "lulu"]])
corr = arr[None, :, :] == arr[:, None, :]
score = corr.mean(-1)
print(score)
# [[1. 0.4 0.6]
# [0.4 1. 0.8]
# [0.6 0.8 1. ]]
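
A caveat with both snippets above: np.array() on a mixed list converts every element to a string, so the comparison is really a string comparison (for example, 1 and 1.0 become '1' and '1.0' and no longer match). A possible workaround, sketched here under the assumption that plain Python equality is what you want, is to build the arrays with dtype=object. The helper name row_similarity is hypothetical.

```python
import numpy as np

def row_similarity(rows, query):
    """Fraction of positions where each row equals `query` (Python ==)."""
    rows = np.array(rows, dtype=object)    # keep the original Python objects
    query = np.array(query, dtype=object)
    return (rows == query).mean(axis=1)    # broadcast query against every row

rows = [[1, 1.1, -19, "kuku", "lulu"],
        [2.8, 1.1, -20, "kuku", "lilu"]]
scores = row_similarity(rows, [1, 1.1, -19, "kuku", "lulu"])
print(scores)   # [1.  0.4]

# The string-coercion pitfall: 1 and 1.0 differ once both become strings.
as_strings = np.array([1, "a"]) == np.array([1.0, "a"])
as_objects = (np.array([1, "a"], dtype=object)
              == np.array([1.0, "a"], dtype=object))
print(as_strings)   # [False  True]
print(as_objects)   # [ True  True]
```

Object arrays are slower than native dtypes, so for large frames it may be worth comparing the numeric and string columns separately instead.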

What does the ordering/index of cluster_centers_ represent in KMeans clustering SKlearn

I have implemented the following code
k_mean = KMeans(n_clusters=5,init=centroids,n_init=1,random_state=SEED).fit(X_input)
k_mean.cluster_centers_.shape
>>
(5, 50)
I have 5 clusters of the data.
How are the clusters ordered? Are the indices of the clusters centres representing the labels?
In other words, does the cluster center at index 0 correspond to label 0 or not?
In the docs you have a similar example:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])
Yes, the indices are ordered: cluster_centers_[i] is the center of the cluster labelled i. By the way, k_mean.cluster_centers_.shape only returns the shape of the array, not the values. So in your case you have 5 clusters, and your features have 50 dimensions.
To get the nearest point, you can have a look here.
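
If you want to convince yourself of the index/label correspondence, one way (a rough sketch using the data from the docs example; n_init=10 and the variable names are my own choices) is to compare each center with the mean of the points carrying that label, and to predict the label of each center:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Center i should coincide with the mean of the points labelled i.
for label, center in enumerate(kmeans.cluster_centers_):
    members = X[kmeans.labels_ == label]
    print(label, center, members.mean(axis=0))

# Each center is its own nearest center, so this gives [0 1].
center_labels = kmeans.predict(kmeans.cluster_centers_)
print(center_labels)
```

Which of the two groups ends up as label 0 depends on the initialization, but the row order of cluster_centers_ always follows the label numbering.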

numpy apply_along_axis vectorisation

I am trying to implement a function that takes each column of a NumPy 2-D array and returns a scalar result of a certain calculation. My current code looks like the following:
img = np.array([
[0, 5, 70, 0, 0, 0 ],
[10, 50, 4, 4, 2, 0 ],
[50, 10, 1, 42, 40, 1 ],
[10, 0, 0, 6, 85, 64],
[0, 0, 0, 1, 2, 90]]
)
def get_y(stride):
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5*stride_vals.std()
    return np.argwhere(stride>pix_thresh).mean()
np.apply_along_axis(get_y, 0, img)
>> array([ 2. , 1. , 0. , 2. , 2.5, 3.5])
It works as expected; however, performance isn't great: in the real dataset there are ~2k rows and ~20-50 columns per frame, arriving 60 times a second.
Is there a way to speed-up the process, perhaps by not using np.apply_along_axis function?
Here's one vectorized approach: set the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while skipping the zeros (this matches stride > 0 as long as the image has no negative values), like so -
imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
Runtime test -
In [94]: img = np.random.randint(-100,100,(2000,50))
In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop
In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop
Thus, we are seeing a 3x+ speedup.
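
To check that the vectorized version reproduces the original output, here is a small self-contained sketch on the sample array from the question. The wrapper name get_y_vectorized is made up; its body is the snippet above.

```python
import numpy as np

def get_y(stride):
    # Original per-column function from the question.
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5 * stride_vals.std()
    return np.argwhere(stride > pix_thresh).mean()

def get_y_vectorized(img):
    # Zeros become NaN so that nanmax/nanstd ignore them.
    imgn = np.where(img == 0, np.nan, img)
    mx = np.nanmax(imgn, axis=0)
    st = np.nanstd(imgn, axis=0)
    mask = img > mx - 1.5 * st
    # Mean row index of the True entries in each column.
    return np.arange(mask.shape[0]).dot(mask) / mask.sum(axis=0)

img = np.array([
    [0,  5,  70, 0,  0,  0],
    [10, 50, 4,  4,  2,  0],
    [50, 10, 1,  42, 40, 1],
    [10, 0,  0,  6,  85, 64],
    [0,  0,  0,  1,  2,  90]])

expected = np.apply_along_axis(get_y, 0, img)
result = get_y_vectorized(img)
print(result)                          # [2.  1.  0.  2.  2.5 3.5]
print(np.allclose(result, expected))   # True
```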

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script that creates the model also calculates the attributes into a NumPy array (a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict) but wrong shape x_predict.shape[1] which is still the original length. My classifier then throws an error due to wrong shape
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
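Index the columns rather than the rows, i.e. x_predict_all[:, indices] instead of x_predict_all[indices]. You can do it like this: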
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0, 2, 0, 3],
[0, 1, 4, 3],
[0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
[1, 4],
[1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
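
For the prediction data itself, the selector fitted on the training set can be reused. Below is a sketch under the assumption that the prediction matrix has the same columns as the training matrix; X_new is made-up data standing in for x_predict_all.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_train = np.array([[0, 2, 0, 3],
                    [0, 1, 4, 3],
                    [0, 1, 1, 3]])
# Hypothetical prediction data with the same 4 feature columns.
X_new = np.array([[5, 6, 7, 8],
                  [9, 10, 11, 12]])

# Fit on the training data only; columns 0 and 3 have zero variance.
selector = VarianceThreshold().fit(X_train)
idxs = selector.get_support(indices=True)

by_index = X_new[:, idxs]                 # select columns, not rows
by_transform = selector.transform(X_new)  # same selection via the selector

print(idxs)       # [1 2]
print(by_index)
# [[ 6  7]
#  [10 11]]
```

Using transform() keeps the selection tied to the fitted selector, so the indices do not have to be passed around separately.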

Resources