removing nested list elements with all zero entries - python-3.x

In a nested list (list of lists), how can I remove the elements whose entries are all zero?
For instance:
values = [[1.1, 3.0], [2.5, 5.2], [4.7, 8.2], [69.2, 36.6], [0.7, 0.0], [0.0, 0.0], [0.4, 17.9], [14.7, 29.1], [6.8, 0.0], [0.0, 0.0]]
should change to
[[1.1, 3.0], [2.5, 5.2], [4.7, 8.2], [69.2, 36.6], [0.7, 0.0], [0.4, 17.9], [14.7, 29.1], [6.8, 0.0]]
Note: nested list could have n number of elements, not just 2.
I'm trying to use this to crop two other lists in lockstep, something like:
for label, color, value in zip(labels, colors, values):
    if any(value) in values:  # this check needs updating
        new_labels.append(label)
        new_colors.append(color)

Take advantage of the fact that 0.0 is falsy and filter using any():
result = [sublist for sublist in values if any(sublist)]
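To also crop the labels and colors lists in lockstep, as the question's loop attempts, the same any() test can drive all three lists at once. A minimal sketch, assuming labels and colors have the same length as values:
new_labels, new_colors, new_values = [], [], []
for label, color, value in zip(labels, colors, values):
    if any(value):  # keep rows with at least one nonzero entry
        new_labels.append(label)
        new_colors.append(color)
        new_values.append(value)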

Related

How to define custom function for scipy's binned_statistic_2d?

The documentation for scipy's binned_statistic_2d function gives an example for a 2D histogram:
from scipy import stats
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
Makes sense, but I'm now trying to implement a custom function. The custom function description is given as:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
I wasn't sure exactly how to implement this, so I thought I'd check my understanding by writing a custom function that reproduces the count option. I tried
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, custom_func, bins=[binx, biny])
but this generates an error like so:
556 # Make sure `values` match `sample`
557 if(statistic != 'count' and Vlen != Dlen):
558 raise AttributeError('The number of `values` elements must match the '
559 'length of each `sample` dimension.')
561 try:
562 M = len(bins)
AttributeError: The number of `values` elements must match the length of each `sample` dimension.
How is this custom function supposed to be defined?
The reason for this error is that, when using a custom statistic function (or any statistic other than 'count'), you have to pass an actual array (or list of arrays) as the values parameter, with the number of elements matching the number in x. You can't leave it as None as in your example, even though the values are irrelevant and never used when counting the data points in each bin.
So, to match the results, you can just pass the same x object to the values parameter:
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, x, custom_func, bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
The result matches that of the count statistic:
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
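To see the values parameter actually being used, rather than merely satisfying the length check, here is a small sketch with made-up vals data that reproduces the built-in 'mean' statistic. Note that the empty bin causes custom_mean([]) to raise ZeroDivisionError, so the result there is reported as NaN, exactly as the documentation quoted above describes:
def custom_mean(values):
    return sum(values) / len(values)

vals = [10.0, 20.0, 30.0, 40.0]  # made-up data, one value per (x, y) point
ret = stats.binned_statistic_2d(x, y, vals, custom_mean, bins=[binx, biny])
print(ret.statistic)
# [[20. 20.]
#  [40. nan]]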

How can I retrieve elements in a multidimensional pytorch tensor by a list of indices?

I have two tensors: scores and lists.
scores is of shape (x, 8) and lists is of shape (x, 8, 4). I want to select the max value of each row in scores and the corresponding element from lists.
Take the following as an example (shape dimension 8 was reduced to 2 for simplicity):
scores = torch.tensor([[0.5, 0.4], [0.3, 0.8], ...])
lists = torch.tensor([[[0.2, 0.3, 0.1, 0.5],
                       [0.4, 0.7, 0.8, 0.2]],
                      [[0.1, 0.2, 0.1, 0.3],
                       [0.4, 0.3, 0.2, 0.5]], ...])
Then I would like to filter these tensors to:
scores = torch.tensor([0.5, 0.8, ...])
lists = torch.tensor([[0.2, 0.3, 0.1, 0.5], [0.4, 0.3, 0.2, 0.5], ...])
NOTE:
So far, I have tried to retrieve the indices from the original scores tensor and use them as an index vector to filter lists:
# PSEUDO-CODE
indices = scores.argmax(dim=1)
for list, idx in zip(lists, indices):
    list = list[idx]
That is also where the question title comes from.
I imagine you tried something like
indices = scores.argmax(dim=1)
selection = lists[:, indices]
This does not work because the indices are selected for every element in dimension 0, so the final shape is (x, x, 4).
To perform the correct selection, you need to replace the slice with a range:
indices = scores.argmax(dim=1)
selection = lists[range(indices.size(0)), indices]
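Putting it together with the example data (a runnable sketch; torch.arange plays the role of range on the tensor side):
import torch

scores = torch.tensor([[0.5, 0.4], [0.3, 0.8]])
lists = torch.tensor([[[0.2, 0.3, 0.1, 0.5],
                       [0.4, 0.7, 0.8, 0.2]],
                      [[0.1, 0.2, 0.1, 0.3],
                       [0.4, 0.3, 0.2, 0.5]]])

indices = scores.argmax(dim=1)        # tensor([0, 1])
rows = torch.arange(scores.size(0))   # one row index per sample
best_scores = scores[rows, indices]   # tensor([0.5000, 0.8000])
best_lists = lists[rows, indices]     # rows [0.2, 0.3, 0.1, 0.5] and [0.4, 0.3, 0.2, 0.5]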

How to convert list's values to dict's values

I have a list; each entry is actually a word index.
lst = [[1, 2, 3],
       [4, 5],
       [6]]
and I have a dictionary. The dictionary's values are word vectors (word2vec), and every vector has the same dimension (of course).
from numpy import array

dic = {1: array([0.1, 0.2, 0.3]),
       2: array([0.4, 0.5, 0.6]),
       3: array([0.7, 0.8, 0.9]),
       4: array([1.0, 1.1, 1.2]),
       5: array([1.3, 1.4, 1.5]),
       6: array([1.6, 1.7, 1.8])}
I want to replace each word index in the list with its paired word vector from the dictionary, like this:
lst = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]],
[[1.6, 1.7, 1.8]]]
Can you help me?
One can use the helper function below, which replaces the indices in place:
def word2vec(list_param, dict_param):
    for i in range(len(list_param)):
        for j in range(len(list_param[i])):
            list_param[i][j] = dict_param[list_param[i][j]]
The list will then hold the updated values as required. I would also strongly recommend not shadowing built-in names like list and dict with your own variables.
Using map because I love it :)
from numpy import array

ll = [[1, 2, 3], [4, 5], [6]]
dd = {1: array([0.1, 0.2, 0.3]),
      2: array([0.4, 0.5, 0.6]),
      3: array([0.7, 0.8, 0.9]),
      4: array([1.0, 1.1, 1.2]),
      5: array([1.3, 1.4, 1.5]),
      6: array([1.6, 1.7, 1.8])}

res = []
for item in ll:
    res.append(list(map(lambda x: list(dd[x]), item)))
print(res)
Gives
[
[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]],
[[1.6, 1.7, 1.8]]
]
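The same transformation can also be written as a nested list comprehension, which avoids the explicit loop:
res = [[list(dd[i]) for i in item] for item in ll]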

difflib: comparing a list of keywords with another list and returning ratio

I am trying to compare a list of words with a whole list of sentences using 'difflib'.
import pandas as pd
from difflib import SequenceMatcher
s1 = ['okay', 'bye', 'what is'] # reference keywords
s2 = ['okay', 'what', 'dont worry', 'what is my name', 'is', 'my', 'name', 'bye'] #actual list
SequenceMatcher(a = s1, b = s2).ratio() # returns 0.36
The above snippet returns 0.36 as a single overall result. Instead, I need one score per entry in the actual list, where a reference keyword that appears verbatim scores 1.0. So in the above case the result could be, for example (I am putting random scores here), [1.0, 0.2, 0.0, 0.5, 0.1, 0.0, 0.0, 1.0]: exact match = 1.0, no match = 0.0, and partial matches score accordingly.
Maybe you're looking for something like this:
[max([SequenceMatcher(None, x, y).ratio() for y in s1]) for x in s2]
>>> [1.0, 0.7272727272727273, 0.2857142857142857, 0.6363636363636364, 0.4444444444444444, 0.4, 0.2857142857142857, 1.0]
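If you also need to know which word produced each score, a small variation pairs every entry of s2 with its best ratio against the reference keywords:
best = {x: max(SequenceMatcher(None, x, y).ratio() for y in s1) for x in s2}
# {'okay': 1.0, 'what': 0.7272727272727273, 'dont worry': 0.2857142857142857, ...}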

How to efficiently deal with nested data in PySpark?

I ran into a situation where I found that collect_list in Spark is not efficient when the item is already a list.
Basically, I am trying to compute the mean of nested lists (the size of each list is guaranteed to be the same). When the data set grows to, say, 10 M rows, it can produce out-of-memory errors. Originally I thought the udf (which calculates the mean) had something to do with it, but I found that the aggregation part (collect_list of lists) is the real problem.
What I am doing now is to divide the 10 M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end. Any better suggestion on efficiently dealing with nested data?
For example, the toy example is like this:
import pyspark.sql.functions as f

data = [('user1', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
        ('user1', 'place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
        ('user2', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
        ('user2', 'place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
        ('user3', 'place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
       ]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])
data_agg = data_df.groupBy('user') \
                  .agg(f.collect_list('place').alias('place_list'),
                       f.first('places').alias('places'),
                       f.collect_list('data1').alias('data1'),
                       f.collect_list('data2').alias('data2'),
                      )
import numpy as np
from pyspark.sql.types import ArrayType, DoubleType

def average_values(sim_vectors):
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()

avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
                       .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+----------------+
| user|      place_list|              places|            data1|           data2|
+-----+----------------+--------------------+-----------------+----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]| [0.3, 0.2, 0.3]|
|user3|        [place2]|[place1, place2, ...|  [0.3, 0.0, 0.4]| [0.2, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.15, 0.2, 0.2]|
+-----+----------------+--------------------+-----------------+----------------+
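One direction worth exploring, as a sketch rather than a tested solution (it assumes Spark 2.4+ for array_sort): instead of collecting whole arrays per user and averaging them in a Python udf, explode each vector with its position, let Spark average natively per (user, position), and reassemble the array afterwards. Shown here for data1 only:
# explode each vector into (position, value) rows
exploded = data_df.select('user', f.posexplode('data1').alias('pos', 'val'))
# average natively per user and position
means = exploded.groupBy('user', 'pos').agg(f.avg('val').alias('mean_val'))
# reassemble one ordered array per user; struct sorting orders by pos first
data1_mean = (means.groupBy('user')
                   .agg(f.array_sort(f.collect_list(f.struct('pos', 'mean_val'))).alias('tmp'))
                   .withColumn('data1', f.col('tmp.mean_val'))
                   .drop('tmp'))
This keeps the heavy aggregation in the JVM and avoids serializing lists of lists through Python, at the cost of an extra shuffle.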
