Change int array into string array based on its values - python-3.x

I have a NumPy array that contains 0 and 1 values:
k = np.array([0, 1, 1, 0, 1])
I want to transform it into an array that contains 'blue' where the value is 0 and 'red' where the value is 1. I'd like the fastest way possible.

You can use np.take to index into an array/list of 2 elements with those k values as indices, like so -
np.take(['blue','red'],k)
Sample run -
In [19]: k = np.array([0, 1, 1, 0 ,1])
In [20]: np.take(['blue','red'],k)
Out[20]:
array(['blue', 'red', 'red', 'blue', 'red'],
dtype='|S4')
With the explicit indexing method -
In [23]: arr = np.array(['blue','red'])
In [24]: arr[k]
Out[24]:
array(['blue', 'red', 'red', 'blue', 'red'],
dtype='|S4')
Or initialize with one string and then assign the other one -
In [41]: out = np.full(k.size, 'blue')
In [42]: out[k==1] = 'red'
In [43]: out
Out[43]:
array(['blue', 'red', 'red', 'blue', 'red'],
dtype='|S4')
Runtime test
Approaches -
def app1(k):
    return np.take(['blue','red'], k)

def app2(k):
    arr = np.array(['blue','red'])
    return arr[k]

def app3(k):
    out = np.full(k.size, 'blue')
    out[k==1] = 'red'
    return out
Timings -
In [46]: k = np.random.randint(0,2,(100000))
In [47]: %timeit app1(k)
...: %timeit app2(k)
...: %timeit app3(k)
...:
1000 loops, best of 3: 413 µs per loop
10000 loops, best of 3: 103 µs per loop
1000 loops, best of 3: 908 µs per loop
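For completeness, np.where gives another one-liner; a minimal sketch using the same k (not benchmarked above):
# pick 'blue' where k is 0, 'red' elsewhere
np.where(k == 0, 'blue', 'red')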

Related

Optimization using pulp python

Trying to write an optimization code using pulp.
From the given dataset I want to pick 5 items that maximize the total value, with the constraints that 2 items are blue, 2 items are yellow, and the remaining item can be any color.
But with the attached code I am getting only 3 items; please refer to the output section.
Please suggest the changes that need to be made to the existing code.
import pandas as pd
import pulp
import re
import sys
sys.setrecursionlimit(10000)
data = [['A', 'blue', 'circle', 0.454],
        ['B', 'yellow', 'square', 0.570],
        ['C', 'red', 'triangle', 0.789],
        ['D', 'red', 'circle', 0.718],
        ['E', 'red', 'square', 0.828],
        ['F', 'orange', 'square', 0.709],
        ['G', 'blue', 'circle', 0.696],
        ['H', 'orange', 'square', 0.285],
        ['I', 'orange', 'square', 0.698],
        ['J', 'orange', 'triangle', 0.861],
        ['K', 'blue', 'triangle', 0.658],
        ['L', 'yellow', 'circle', 0.819],
        ['M', 'blue', 'square', 0.352],
        ['N', 'orange', 'circle', 0.883],
        ['O', 'yellow', 'triangle', 0.755]]
df = pd.DataFrame(data, columns = ['item', 'color', 'shape', 'value'])
BlueMatch = lambda x: 1 if x=='blue' else 0
YellowMatch = lambda x: 1 if x=='yellow' else 0
RedMatch = lambda x: 1 if x=='red' else 0
OrangeMatch = lambda x: 1 if x=='orange' else 0
df['color'] = df['color'].astype(str)
df['isBlue'] = df.color.apply(BlueMatch)
df['isYellow'] = df.color.apply(YellowMatch)
df['isRed'] = df.color.apply(RedMatch)
df['isOrange'] = df.color.apply(OrangeMatch)
prob = pulp.LpProblem("complex_napsack", pulp.LpMaximize)
x = pulp.LpVariable.dicts( "x", indexs = df.index, lowBound=0, cat='Integer')
prob += pulp.lpSum([x[i]*df.value[i] for i in df.index ])
prob += pulp.lpSum([x[i]*df.isBlue[i] for i in df.index])==2
prob += pulp.lpSum([x[i]*df.isYellow[i] for i in df.index])==2
prob += pulp.lpSum([x[i] for i in df.index ])==10
prob.solve()
for v in prob.variables():
    if v.varValue != 0.0:
        mystring = re.search('([0-9]*$)', v.name)
        print(v.name, "=", v.varValue)
        ind = int(mystring.group(1))
        print(df.item[ind])
output:
x_11 = 2.0
L
x_13 = 6.0
N
x_6 = 2.0
G
You just need to declare your variables as Binary instead of Integer, like so:
x = pulp.LpVariable.dicts("x", indexs=df.index, cat=pulp.LpBinary)

Compare panda data frame indices and update the rows

I have two Excel files that I read with pandas. I compare the index in file 1 with the index in file 2 (they are not the same length, e.g. 10 and 100); if an index matches, the corresponding row should become zeros, and otherwise it stays unchanged. I am using nested for and if loops for this, but the more data I want to process (1e3, 5e3 rows), the longer the run time becomes. So, is there a better way to perform such a comparison? Here's an example of what I am using.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
index=[4, 5, 6], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([['w'], ['y' ], ['z']],
index=[4, 5, 1])
for j in df1.index:
    for i in df.index:
        if i == j:
            df.loc[i, :] = 0
        else:
            df.loc[i, :] = df.loc[i, :]
print(df)
Loops are not necessary here; you can set whole rows to 0 with DataFrame.mask combined with Series.isin (converting the index to a Series is necessary to avoid ValueError: Array conditional must be same shape as self):
df = df.mask(df.index.to_series().isin(df1.index), 0)
Or with Index.isin and numpy.where, if you want to improve performance:
arr = np.where(df.index.isin(df1.index)[:, None], 0, df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
A B C
4 0 0 0
5 0 0 0
6 10 20 30
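A simpler boolean-mask assignment should also work; a sketch using the same frames (note this modifies df in place rather than returning a new frame):
# zero out the rows of df whose index appears in df1's index
df.loc[df.index.isin(df1.index)] = 0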

creating a dictionary with two tensors tensorFlow

I have two tensors
top_k_values = [[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]]
top_k_indices = [[1, 3, 5],
                 [2, 5, 3]]
I want to take the indices and the values and create a dictionary like
dict[1] = 0.1
dict[2] = 0.4
dict[3] = 0.2 + 0.6
dict[5] = 0.3 + 0.5
I then want to order this dictionary by key and select the top 3 indices.
Could someone please help me?
I have been trying to use map_fn, but this does not seem to be working.
Is the above problem solvable with TensorFlow?
You can use a Counter to accumulate the values for each index. It comes from the Python standard library; I don't know if you can do the same with the TensorFlow library.
>>> from collections import Counter
>>> d = Counter()
>>> for indice_list, value_list in zip(top_k_indices, top_k_values):
...     for indice, value in zip(indice_list, value_list):
...         d[indice] += value
>>> d
Counter({3: 0.8, 5: 0.8, 2: 0.4, 1: 0.1})
# this is your expected result
# a counter is a kind of dict, but if you need a real dict:
>>> dict(d)
{1: 0.1, 3: 0.8, 5: 0.8, 2: 0.4}
# 3 indices with maximum values
>>> d.most_common(3)
[(3, 0.8), (5, 0.8), (2, 0.4)]
>>> sorted([indice for indice, value in d.most_common(3)])
[2, 3, 5]
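If you do want to stay inside TensorFlow, one possible sketch (assuming TF 2.x with eager execution) is to flatten both tensors, accumulate per-index sums with tf.math.unsorted_segment_sum, and then take the largest entries with tf.math.top_k:
import tensorflow as tf

top_k_values = tf.constant([[0.1, 0.2, 0.3],
                            [0.4, 0.5, 0.6]])
top_k_indices = tf.constant([[1, 3, 5],
                             [2, 5, 3]])

flat_vals = tf.reshape(top_k_values, [-1])
flat_idx = tf.reshape(top_k_indices, [-1])
# sums[i] holds the accumulated value for index i
sums = tf.math.unsorted_segment_sum(flat_vals, flat_idx,
                                    num_segments=tf.reduce_max(flat_idx) + 1)
# 3 indices with maximum accumulated values, e.g. [3, 5, 2] here
top_vals, top_idx = tf.math.top_k(sums, k=3)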

numpy apply_along_axis vectorisation

I am trying to implement a function that takes each column of a numpy 2d array and returns a scalar result of a certain calculation (the code below applies it along axis 0). My current code looks like the following:
img = np.array([
[0, 5, 70, 0, 0, 0 ],
[10, 50, 4, 4, 2, 0 ],
[50, 10, 1, 42, 40, 1 ],
[10, 0, 0, 6, 85, 64],
[0, 0, 0, 1, 2, 90]]
)
def get_y(stride):
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5*stride_vals.std()
    return np.argwhere(stride > pix_thresh).mean()
np.apply_along_axis(get_y, 0, img)
>> array([ 2. , 1. , 0. , 2. , 2.5, 3.5])
It works as expected; however, performance isn't great, as the real dataset has ~2k rows and ~20-50 columns per frame, arriving 60 times a second.
Is there a way to speed-up the process, perhaps by not using np.apply_along_axis function?
Here's one vectorized approach that sets the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while ignoring the zeros, like so -
imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0) # per-column mean of the row indices where mask is True
Runtime test -
In [94]: img = np.random.randint(-100,100,(2000,50))
In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop
In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop
Thus, we are seeing a 3x+ speedup.
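As a quick sanity check, the two should agree on the sample img above (its values are nonnegative, so NaN-ing the zeros matches the stride > 0 filter in get_y):
out_loop = np.apply_along_axis(get_y, 0, img)
assert np.allclose(out, out_loop)  # both give [2., 1., 0., 2., 2.5, 3.5]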

How to create 2-D array from 3-D Numpy array?

I have a 3 dimensional Numpy array corresponding to an RGB image. I need to create a 2 dimensional Numpy array from it such that if any pixel in the R, G, or B channel is 1, then the corresponding pixel in the 2-D array is 255.
I know how to use something like a list comprehension on a Numpy array, but the result is the same shape as the original array. I need the new shape to be 2-D.
Ok, assuming you want the output pixel to be 0 where it shouldn't be 255 and your input is MxNx3.
RGB = RGB == 1 # you can skip this if your original (RGB) contains only 0's and 1's anyway
out = np.where(np.logical_or.reduce(RGB, axis=-1), 255, 0)
One approach could be to use any() along the third dim and then multiply by 255, so that the booleans are automatically upscaled to int type, like so -
(img==1).any(axis=2)*255
Sample run -
In [19]: img
Out[19]:
array([[[1, 8, 1],
[2, 4, 7]],
[[4, 0, 6],
[4, 3, 1]]])
In [20]: (img==1).any(axis=2)*255
Out[20]:
array([[255, 0],
[ 0, 255]])
Runtime test -
In [45]: img = np.random.randint(0,5,(1024,1024,3))
# @Paul Panzer's soln
In [46]: %timeit np.where(np.logical_or.reduce(img==1, axis=-1), 255, 0)
10 loops, best of 3: 22.3 ms per loop
# @nanoix9's soln
In [47]: %timeit np.apply_along_axis(lambda a: 255 if 1 in a else 0, 0, img)
10 loops, best of 3: 40.1 ms per loop
# Posted soln here
In [48]: %timeit (img==1).any(axis=2)*255
10 loops, best of 3: 19.1 ms per loop
Additionally, we could convert to np.uint8 and then multiply it with 255 for some further performance boost -
In [49]: %timeit (img==1).any(axis=2).astype(np.uint8)*255
100 loops, best of 3: 18.5 ms per loop
And more, if we work with individual slices along the third dim -
In [68]: %timeit ((img[...,0]==1) | (img[...,1]==1) | (img[...,2]==1))*255
100 loops, best of 3: 7.3 ms per loop
In [69]: %timeit ((img[...,0]==1) | (img[...,1]==1) | (img[...,2]==1)).astype(np.uint8)*255
100 loops, best of 3: 5.96 ms per loop
Use apply_along_axis, e.g.
In [28]: import numpy as np
In [29]: np.random.seed(10)
In [30]: img = np.random.randint(2, size=12).reshape(3, 2, 2)
In [31]: img
Out[31]:
array([[[1, 1],
[0, 1]],
[[0, 1],
[1, 0]],
[[1, 1],
[0, 1]]])
In [32]: np.apply_along_axis(lambda a: 255 if 1 in a else 0, 0, img)
Out[32]:
array([[255, 255],
[255, 255]])
See the numpy documentation for details.
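Note that the example above uses a channels-first (3, 2, 2) array, which is why axis 0 is the channel axis; for a typical (M, N, 3) image (here called img_mxnx3, a hypothetical name) you would pass axis 2 instead:
# reduce along the last (channel) axis of an (M, N, 3) array
np.apply_along_axis(lambda a: 255 if 1 in a else 0, 2, img_mxnx3)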
