I have two tensors:
top_k_values = [[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]]
top_k_indices = [[1, 3, 5],
                 [2, 5, 3]]
I want to take the indices and the values and create a dictionary that accumulates the values for each index, like this:
dict[1] = 0.1
dict[2] = 0.4
dict[3] = 0.2 + 0.6
dict[5] = 0.3 + 0.5
I then want to order this dictionary by key and select the top 3 indices.
Could someone please help me? I have been trying to use map_fn, but this does not seem to be working. Is the above problem solvable with TensorFlow?
You can use a Counter to accumulate the values for each index. It comes from the Python standard library; I don't know whether you can do the same with the TensorFlow library.
>>> from collections import Counter
>>> d = Counter()
>>> for index_list, value_list in zip(top_k_indices, top_k_values):
...     for index, value in zip(index_list, value_list):
...         d[index] += value
>>> d
Counter({3: 0.8, 5: 0.8, 2: 0.4, 1: 0.1})
# this is your expected result
# a counter is a kind of dict, but if you need a real dict:
>>> dict(d)
{1: 0.1, 3: 0.8, 5: 0.8, 2: 0.4}
# 3 indices with maximum values
>>> d.most_common(3)
[(3, 0.8), (5, 0.8), (2, 0.4)]
>>> sorted([index for index, value in d.most_common(3)])
[2, 3, 5]
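For the TensorFlow part of the question, here is a minimal sketch (assuming TF 2.x with eager execution; the variable names are mine) that does the same accumulation with tf.math.unsorted_segment_sum and then takes the 3 largest sums with tf.math.top_k:
import tensorflow as tf

top_k_values = tf.constant([[0.1, 0.2, 0.3],
                            [0.4, 0.5, 0.6]])
top_k_indices = tf.constant([[1, 3, 5],
                             [2, 5, 3]])

# Flatten both tensors so every (index, value) pair lines up.
flat_idx = tf.reshape(top_k_indices, [-1])
flat_val = tf.reshape(top_k_values, [-1])

# Sum the values that share the same index (the same accumulation the Counter does).
num_segments = tf.reduce_max(flat_idx) + 1
sums = tf.math.unsorted_segment_sum(flat_val, flat_idx, num_segments)

# Indices that never occur (0 and 4 here) get a sum of 0, which is harmless
# as long as the real values are positive.
top_vals, top_idx = tf.math.top_k(sums, k=3)
print(top_idx.numpy())   # e.g. [3 5 2]; the tie between 3 and 5 may order either way
print(top_vals.numpy())  # e.g. [0.8 0.8 0.4]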
I am trying to take the two arrays returned by a function and iterate over them as value pairs.
import numpy as np
a = np.zeros(10).astype(np.uint8)
a[0:4] = 1
hist = np.zeros(4)
values, counts = np.unique(a, return_counts=True)
for u, c in zip(values, counts):
    hist[u] += c
# This works. hist: [6. 4. 0. 0.]
# for u, c in zip(np.unique(a, return_counts=True)):  # ValueError: not enough values to unpack (expected 2, got 1)
#     hist[u] += c
# for u, c in np.unique(a, return_counts=True):  # IndexError: index 6 is out of bounds for axis 0 with size 4
#     hist[u] += c
The code works if I first unpack the result into two arrays and then use for k, v in zip(arr1, arr2).
Is it possible to write for k, v in function_returning_two_arrays(args) as a one-line statement?
Update: Both zip(*arg) and [arg] work. Could you please elaborate on this syntax? A link to an article would be enough; then I can accept the answer. I understand that * unpacks a tuple, but what does [some_tuple] do?
Other than the unique step, this is just basic Python.
In [78]: a = np.zeros(10).astype(np.uint8)
    ...: a[0:4] = 1
    ...: ret = np.unique(a, return_counts=True)
unique returns a tuple of arrays, which can be used as is, or unpacked into 2 variables. I think unpacking makes the code clearer.
In [79]: ret
Out[79]: (array([0, 1], dtype=uint8), array([6, 4]))
In [80]: values, counts = ret
In [81]: values
Out[81]: array([0, 1], dtype=uint8)
In [82]: counts
Out[82]: array([6, 4])
The following just makes a list with 1 item - the tuple
In [83]: [ret]
Out[83]: [(array([0, 1], dtype=uint8), array([6, 4]))]
That's different from making a list of the two arrays - which just changes the tuple "wrapper" to a list:
In [84]: [values, counts]
Out[84]: [array([0, 1], dtype=uint8), array([6, 4])]
zip takes multiple items (it has a *args signature)
In [85]: list(zip(*ret)) # same as zip(values, counts)
Out[85]: [(0, 6), (1, 4)]
In [86]: [(i,j) for i,j in zip(*ret)] # using that in an iteration
Out[86]: [(0, 6), (1, 4)]
In [87]: [(i,j) for i,j in zip(values, counts)]
Out[87]: [(0, 6), (1, 4)]
So it pairs the nth element of values with the nth element of counts.
Iterating on the [ret] list does something entirely different, or rather it does nothing; compare with Out[83]:
In [88]: [(i,j) for i,j in [ret]]
Out[88]: [(array([0, 1], dtype=uint8), array([6, 4]))]
I think of list(zip(*arg)) as a list version of transpose:
In [90]: np.transpose(ret)
Out[90]:
array([[0, 6],
[1, 4]])
In [91]: [(i,j) for i,j in np.transpose(ret)]
Out[91]: [(0, 6), (1, 4)]
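Putting it together, the one-line form the question asked about is just zip(*...) applied directly to the function's return value; a minimal sketch with the same a and hist as above:
import numpy as np

a = np.zeros(10).astype(np.uint8)
a[0:4] = 1
hist = np.zeros(4)

# zip(*...) unpacks the (values, counts) tuple into zip's arguments,
# so each iteration yields one (value, count) pair.
for u, c in zip(*np.unique(a, return_counts=True)):
    hist[u] += c

print(hist)  # [6. 4. 0. 0.]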
I have the following 2 rows in my dataframe:
[1, 1.1, -19, "kuku", "lulu"]
[2.8, 1.1, -20, "kuku", "lilu"]
I want to calculate their similarity by comparing each dimension (1 if equal, otherwise 0) to get the following vector: [0, 1, 0, 1, 0]. Is there any function that takes a vector, performs such a "similarity" comparison against all rows, and calculates the mean? In our case it would be 2/5 = 0.4.
I would just use a simple == on NumPy arrays, cast to int for the vector, and numpy.mean() for the mean of the vector:
import numpy as np
a = [1, 1.1, -19, "kuku", "lulu"]
b = [2.8, 1.1, -20, "kuku", "lilu"]
res = (np.array(a) == np.array(b)).astype(int)
print(res)
# [0 1 0 1 0]
v = res.mean()
print(v)
# 0.4
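If you want the literal form of the question, one vector scored against every row of the DataFrame at once, broadcasting does it; a minimal sketch (the reference vector here is my own example, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
])
ref = np.array([2.8, 1.1, -20, "kuku", "lulu"], dtype=object)

# Broadcasting compares the reference vector against every row at once;
# the mean over axis=1 gives one similarity score per row.
scores = (df.to_numpy() == ref).mean(axis=1)
print(scores)  # [0.6 0.8]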
If you do not mind computing everything twice and you can afford the potentially large intermediate temporary objects:
import numpy as np
arr = np.array([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
    [2.8, 1.1, -20, "kuku", "lulu"]])
corr = arr[None, :, :] == arr[:, None, :]
score = corr.mean(-1)
print(score)
# [[1. 0.4 0.6]
# [0.4 1. 0.8]
# [0.6 0.8 1. ]]
I would like to calculate, by group, the mean of one column and the weighted mean of another column in a dataset using the .agg() function within pandas. I am aware of a few solutions, but they aren't very concise.
One solution has been posted here (pandas and groupby: how to calculate weighted averages within an agg), but it still doesn't seem very flexible because the weights column is hard-coded in the lambda function definition. I'm looking to create a syntax closer to this:
(
    df
    .groupby(['group'])
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', 'weighted_mean', weights='weight'))
)
Here is a fully worked example with code that seems needlessly complicated:
import pandas as pd
import numpy as np

# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>>   group  x  y  weights
#>>> 0     a  1  5     0.75
#>>> 1     a  2  6     0.25
#>>> 2     b  3  7     0.75
#>>> 3     b  4  8     0.25
# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)

# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})

# final output
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a   1.50      5.25
#>>> 1     b   3.50      7.25
Using the .apply() method on the entire DataFrame was the simplest solution I could arrive at that does not hardcode the column name inside the function definition.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)

# final output
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a   1.50      5.25
#>>> 1     b   3.50      7.25
How about this:
grouped = df.groupby('group')

def wavg(group):
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    return group

grouped.apply(wavg)
Try:
df["weights"]=df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
df["y"]=df["y"].mul(df["weights"])
res=df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})
Outputs:
  group    x     y
0     a  1.5  5.25
1     b  3.5  7.25
Since your weights sum to 1 within groups, you can assign a new column and groupby as usual:
(df.assign(wt_avg_y=df['y'] * df['weights'])
   .groupby('group')
   .agg({'x': 'mean', 'wt_avg_y': 'sum', 'weights': 'sum'})
   .assign(wt_avg_y=lambda x: x['wt_avg_y'] / x['weights'])
)
Output:
         x  wt_avg_y  weights
group
a      1.5      5.25      1.0
b      3.5      7.25      1.0
Steven M. Mortimer's solution is clean and easy to read. Alternatively, one could use dict notation inside pd.Series() such that the index= argument is not needed. This provides slightly better readability in my opinion.
summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series({
            'avg_x': np.mean(x['x']),
            'wt_avg_y': np.average(x['y'], weights=x['weights'])
        }))
    .reset_index()
)
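If you want the weights column to be configurable rather than hard-coded, one option is a small factory that bakes the column names into the aggregator; a hedged sketch (the weighted_mean helper below is my own, not a pandas API):
import numpy as np
import pandas as pd

def weighted_mean(value_col, weights_col):
    """Return a per-group aggregator with the column names baked in."""
    return lambda g: np.average(g[value_col], weights=g[weights_col])

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

summary = (
    df
    .groupby('group')
    .apply(lambda g: pd.Series({
        'avg_x': g['x'].mean(),
        'wt_avg_y': weighted_mean('y', 'weights')(g)
    }))
    .reset_index()
)
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a    1.5      5.25
#>>> 1     b    3.5      7.25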
Please consider the following code:
import numpy as np
import tensorflow as tf

x = tf.constant([[[1, np.nan, np.nan], [4, 3, -1]],
                 [[10, np.nan, 3], [20, 5, -7]],
                 [[5, np.nan, 3], [np.nan, 15, -17]]])
x_max = tf.reduce_max(x, reduction_indices=[0])
with tf.Session() as sess:
    print(np.shape(sess.run(x)))
    print(sess.run(x))
    print(sess.run(x_max))
The output is as follows:
(3, 2, 3)
[[[  1.  nan  nan]
  [  4.   3.  -1.]]

 [[ 10.  nan   3.]
  [ 20.   5.  -7.]]

 [[  5.  nan   3.]
  [ nan  15. -17.]]]
[[ 10. -inf   3.]
 [ 20.  15.  -1.]]
Now my question is: how does TensorFlow deal with np.nan? Is there something like numpy.nanmax or similar?
Quoting this link (credit goes to Yaroslav Bulatov):
Different parts of TensorFlow treat them differently:
* Float computations (usually?) propagate them.
* Int conversion treats them as 0.
* Int computations fail with an error.
* Python parts of TensorFlow often raise an error on NaN, i.e., trying to add a NaN summary to a histogram will fail with a Python exception.
Here is an example for some float operations:
a = tf.constant([1.0, np.nan])
b = tf.constant(np.nan)
r = tf.reduce_min(a)
m = a * b
with tf.Session() as sess:
    print(sess.run(r))  # prints 1.0
    print(sess.run(m))  # array([nan, nan], dtype=float32)
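As far as I know TensorFlow has no built-in nanmax, but you can emulate one by masking the NaNs before reducing; a minimal sketch in the question's TF 1.x graph style (tf.where plus tf.is_nan):
import numpy as np
import tensorflow as tf

x = tf.constant([[[1, np.nan, np.nan], [4, 3, -1]],
                 [[10, np.nan, 3], [20, 5, -7]],
                 [[5, np.nan, 3], [np.nan, 15, -17]]])

# Replace every NaN with -inf so it can never win the max,
# then reduce as usual (a slot that is all NaN still yields -inf).
masked = tf.where(tf.is_nan(x), tf.fill(tf.shape(x), float('-inf')), x)
nan_max = tf.reduce_max(masked, axis=0)

with tf.Session() as sess:
    print(sess.run(nan_max))
    # [[ 10. -inf   3.]
    #  [ 20.  15.  -1.]]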
I want to construct a 1d numpy array a, and I know each a[i] has several possible values. Of course, the numbers of possible values for any two elements of a can be different. For each a[i], I want to set it to the minimum of all its possible values.
For example, I have two arrays:
idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
The array I want to construct is the following:
a = np.array([0.1, 0.5, 0.6, 0.1])
So is there any function in numpy that can do this?
Here's one approach -
def groupby_minimum(idx, val):
    sidx = idx.argsort()
    sorted_idx = idx[sidx]
    cut_idx = np.r_[0, np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1]
    return np.minimum.reduceat(val[sidx], cut_idx)
Sample run -
In [36]: idx = np.array([0, 1, 0, 2, 3, 3, 3])
    ...: val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
    ...:
In [37]: groupby_minimum(idx, val)
Out[37]: array([ 0.1, 0.5, 0.6, 0.1])
Here's another using pandas -
import pandas as pd
def pandas_groupby_minimum(idx, val):
    df = pd.DataFrame({'ID': idx, 'val': val})
    return df.groupby('ID')['val'].min().values
Sample run -
In [66]: pandas_groupby_minimum(idx, val)
Out[66]: array([ 0.1, 0.5, 0.6, 0.1])
You can also use binned_statistic:
from scipy.stats import binned_statistic

idx_list = np.append(np.unique(idx), np.max(idx) + 1)
stats = binned_statistic(idx, val, statistic='min', bins=idx_list)
a = stats.statistic
I think, in older scipy versions, statistic='min' was not implemented, but you can use statistic=np.min instead. Intervals are half open in binned_statistic, so this implementation is safe.
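Another short option is np.minimum.at on a pre-filled array; a minimal sketch, assuming the indices are small non-negative integers and every index in the range occurs (missing indices would stay at +inf):
import numpy as np

idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])

# Start from +inf and fold each value into its slot with an unbuffered minimum.
out = np.full(idx.max() + 1, np.inf)
np.minimum.at(out, idx, val)
print(out)  # [0.1 0.5 0.6 0.1]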