Fastest way of updating a Python list based on indices

I have a Python dictionary like this -
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
'Attendance':[False, False, False, False, False]}
I also have a Python list of flags giving the indices that need to be changed to True in my_dict['Attendance'] -
flag_list = [0, 2, 3]
Based on the flag_list, my_dict needs to be changed to -
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
'Attendance':[True, False, True, True, False]}
What would be the fastest way of achieving this? Can it be done without a loop? Thank you for any guidance.

Using a loop
for index in flag_list:
    my_dict['Attendance'][index] = True
A micro optimization would be to fetch the list from the dict only once:
attendance_list = my_dict['Attendance']
for index in flag_list:
    attendance_list[index] = True
But unless flag_list is thousands of elements long I wouldn't worry about it.
Using vectorization
If you are willing to take advantage of vectorization you can use a numpy array:
import numpy as np
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
'Attendance': np.array([False, False, False, False, False])}
flag_list = [0, 2, 3]
my_dict['Attendance'][flag_list] = True
But again, unless your data is very big I wouldn't worry about optimizing this piece of code very much.
Example timings
import random
from timeit import Timer
import numpy as np
ATTENDANCE_LIST_SIZE = 100000
FLAG_LIST_SIZE = 60000
dict_with_numpy = {'Attendance': np.random.choice([False, True],
                                                  ATTENDANCE_LIST_SIZE)}
dict_without_numpy = {'Attendance': random.choices([False, True],
                                                   k=ATTENDANCE_LIST_SIZE)}
flag_list = random.choices(range(ATTENDANCE_LIST_SIZE), k=FLAG_LIST_SIZE)
def using_numpy():
    dict_with_numpy['Attendance'][flag_list] = True
def no_numpy_pre_fetching_list():
    attendance_list = dict_without_numpy['Attendance']
    for index in flag_list:
        attendance_list[index] = True
def no_numpy():
    for index in flag_list:
        dict_without_numpy['Attendance'][index] = True
print(f'no_numpy\t\t\t\t\t\t{min(Timer(no_numpy).repeat(3, 3))}')
print(f'no_numpy_pre_fetching_list\t\t{min(Timer(no_numpy_pre_fetching_list).repeat(3, 3))}')
print(f'using_numpy\t\t\t\t\t\t{min(Timer(using_numpy).repeat(3, 3))}')
For this amount of data, the output is (on my machine)
no_numpy 0.009737916999999985
no_numpy_pre_fetching_list 0.0048406370000000365
using_numpy 0.009164470000000036
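So for this amount of data, vectorization is not the most efficient option: it is roughly on par with the plain loop, and pre-fetching the list is about twice as fast as either.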
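One thing worth keeping in mind when reading these timings: using_numpy indexes the array with a plain Python list, so NumPy has to convert that list into an index array on every call. The sketch below (with small made-up data, and the names flag_array and attendance_list introduced only for illustration) shows how the indices could be converted once up front. Whether this changes the ranking on your machine is something you would need to measure yourself.

```python
import numpy as np

attendance = np.zeros(5, dtype=bool)
flag_list = [0, 2, 3]

# Convert the indices once, outside any repeated or timed code.
flag_array = np.asarray(flag_list, dtype=np.intp)

# Fancy-index assignment: sets every listed position to True.
attendance[flag_array] = True

# Convert back to a plain list if the rest of the code expects one.
attendance_list = attendance.tolist()
print(attendance_list)
```

If the data has to live in a plain list anyway, the cost of converting to and from an array is part of the total and should be included in any comparison.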

Related

Return a matrix by applying a boolean mask (a boolean matrix of same size) in python

I have generated a square matrix of size 4 and a boolean matrix of same size by:
import numpy as np
A = np.random.randn(4,4)
B = np.full((4,4), True, dtype = bool)
B[[0],:] = False
B[:,[0]] = False
This code produces two 4x4 matrices: A holds the random numbers, and B holds boolean values where the entire first row and first column are False:
B = [[False, False, False, False],
[False, True, True, True],
[False, True, True, True],
[False, True, True, True]]
What I want is to apply the boolean matrix B to A, so that I get the 3 by 3 matrix of the elements of A where B is True.
Is there any logical operator in numpy to perform this operation? Or do I have to go through each element of A and B, compare them, and then assign the result to a new matrix?

In [214]: A = np.random.randn(4,4)
...: B = np.full((4,4), True, dtype = bool)
...: B[[0],:] = False
...: B[:,[0]] = False
In [215]: A
Out[215]:
array([[-0.80676817, -0.20810386, 1.28448594, -0.52667651],
[ 0.6292733 , -0.05575997, 0.32466482, -0.23495175],
[-0.70896794, -1.60571282, -1.43718839, -0.42032337],
[ 0.01541418, -2.00072652, -1.54197002, 1.2626283 ]])
In [216]: B
Out[216]:
array([[False, False, False, False],
[False, True, True, True],
[False, True, True, True],
[False, True, True, True]])
Boolean indexing (with matching size array) always produces a 1d array. In this case it did not select any values for A[0,:]:
In [217]: A[B]
Out[217]:
array([-0.05575997, 0.32466482, -0.23495175, -1.60571282, -1.43718839,
-0.42032337, -2.00072652, -1.54197002, 1.2626283 ])
But because the other 3 rows all have 3 True, reshaping the result does produce a reasonable result:
In [218]: A[B].reshape(3,3)
Out[218]:
array([[-0.05575997, 0.32466482, -0.23495175],
[-1.60571282, -1.43718839, -0.42032337],
[-2.00072652, -1.54197002, 1.2626283 ]])
Whether the reshape makes sense depends on the total number of elements, and your own interpretation of the data.

If you are looking to remove any rows/columns that are entirely False, you can use np.any to find the rows and columns that contain at least one True and then use np.ix_ to index the 2D array with those row/column indices (rows first, columns second):
A=A[np.ix_(*np.where(np.any(B, axis=1)), *np.where(np.any(B, axis=0)))]
This works for any 2D numpy array and a boolean mask of the same shape. You can extend it to arrays of higher dimension by passing one index array per dimension to np.ix_.
sample A:
[[-0.36027839 -1.54588632 0.1607951 1.68865218]
[ 0.20959185 0.13962857 1.97189081 -0.7686762 ]
[ 0.03868048 -0.36612182 0.77802273 0.23195807]
[-1.26148984 0.44672696 0.45970364 -1.58457129]]
Masked A with B:
[[ 0.13962857 1.97189081 -0.7686762 ]
[-0.36612182 0.77802273 0.23195807]
[ 0.44672696 0.45970364 -1.58457129]]
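As a hedged illustration of the np.ix_ approach on a non-square array (the array, the mask, and the names rows, cols and sub are made up for this sketch): np.ix_ expects the row indices first and the column indices second, which matters as soon as the mask is not symmetric.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

# Hypothetical mask: row 0 and column 1 are entirely False.
B = np.ones((3, 4), dtype=bool)
B[0, :] = False
B[:, 1] = False

rows = np.where(B.any(axis=1))[0]  # rows containing at least one True
cols = np.where(B.any(axis=0))[0]  # columns containing at least one True

# np.ix_ builds an open mesh, so this selects the full rows x cols sub-grid.
sub = A[np.ix_(rows, cols)]
print(sub)
```

Note that this keeps the whole sub-grid spanned by the surviving rows and columns, so it matches A[B] reshaped only when the True values in B form such a grid.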

torch find indices of matching rows in 2 2D tensors

I have two 2D tensors of different lengths. Both are different subsets of the same original 2D tensor, and I would like to find all the matching "rows",
e.g.
A = [[1,2,3],[4,5,6],[7,8,9],[3,3,3]]
B = [[1,2,3],[7,8,9],[4,4,4]]
torch.2dintersect(A,B) -> [0,2] (the indices of the rows of A that B also has)
I've only seen numpy solutions, which use dtype as dicts and do not work for pytorch.
Here is how I do it in numpy
arr1 = edge_index_dense.numpy().view(np.int32)
arr2 = edge_index2_dense.numpy().view(np.int32)
arr1_view = arr1.view([('', arr1.dtype)] * arr1.shape[1])
arr2_view = arr2.view([('', arr2.dtype)] * arr2.shape[1])
intersected = np.intersect1d(arr1_view, arr2_view, return_indices=True)

This answer was posted before the OP updated the question with other restrictions that changed the problem quite a bit.
TL;DR You can do something like this:
torch.where((A == B).all(dim=1))[0]
First, assuming you have:
import torch
A = torch.Tensor([[1,2,3],[4,5,6],[7,8,9]])
B = torch.Tensor([[1,2,3],[4,4,4],[7,8,9]])
We can check that A == B returns:
>>> A == B
tensor([[ True, True, True],
[ True, False, False],
[ True, True, True]])
So, what we want is: the rows in which they are all True. For that, we can use the .all() operation and specify the dimension of interest, in our case 1:
>>> (A == B).all(dim=1)
tensor([ True, False, True])
What you actually want to know is where the Trues are. For that, we can get the first output of the torch.where() function:
>>> torch.where((A == B).all(dim=1))[0]
tensor([0, 2])

If A and B are 2D tensors, the following code finds the indices such that A[indices] == B. If multiple indices satisfy this condition, the first index found is returned. If a row of B is not present in A, no index is returned for it.
values, indices = torch.topk(((A.t() == B.unsqueeze(-1)).all(dim=1)).int(), 1, 1)
indices = indices[values!=0]
# indices = tensor([0, 2])

For every element in a list a, how to count how many times it appear in one specific column in another dataframe

For every element in a dict a, I need to count how many times the element in 'age' column appears in one specific column of another dataframe in pandas
For example, I have a dict below:
a={'age':[22,38,26],'no':[1,2,3]}
and I have another dataframe with a few columns
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
I would like to know how many times every element in dict a appears in the column 'age' in TableB. The result I expect is c={'age':[22,38,26],'count':[2,2,1]}
I tried using the apply function, but it does not work and raises a syntax error. I'm new to pandas; could anyone please help with that? Thank you!
def myfunction(y):
    seriesObj = TableB.apply(lambda x: True if y in list(x) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numofRows
c['age']=a['age']
c['count']=a['age'].apply(myfunction)
I would like to know how many times every element in list a appears in the column 'age' in TableB. The result should be
c={'age':[22,38,26],'count':[2,2,1]}

Use value_counts method with pd.Series and to_dict with pd.DataFrame
(pd.Series(TableB['age'])
.value_counts()
.loc[a['age']]
.rename('count')
.rename_axis('age')
.reset_index()
.to_dict(orient='list'))

You can use pandas.Series.value_counts() on the age column and select the results you're interested in. The following solution also handles values in your list 'a' that are missing from the column.
import pandas as pd
a=[22,38,26,99]
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'John', 'Jane', 'Doe'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
tableB_df = pd.DataFrame(TableB)
counts_series = tableB_df['age'].value_counts()
counts_series_intersection = counts_series.loc[counts_series.index.intersection(a)]
counts_df = pd.DataFrame({'age': counts_series_intersection.index, 'count': counts_series_intersection.values})
Have a look at the following resources for more info:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
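If you would rather report a count of 0 for values that never occur instead of dropping them, one possible variation is Series.reindex with fill_value. This is a minimal sketch using the ages from the question; the names wanted, counts and result are made up for the example.

```python
import pandas as pd

wanted = [22, 38, 26, 99]  # 99 does not occur in the column
ages = pd.Series([22, 38, 26, 35, 41, 22, 38], name='age')

# Count every age, then keep only the wanted ones in the wanted order.
# Values that never occur get a count of 0 instead of raising a KeyError.
counts = ages.value_counts().reindex(wanted, fill_value=0)

result = {'age': wanted, 'count': counts.tolist()}
print(result)
```

reindex also preserves the order of the list you pass in, which keeps the output aligned with the original 'age' list.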

You can just use merging of data frames to filter out the values that don't appear in a and just count the values.
import pandas as pd
a={'age':[22,38,26],'no':[1,2,3]}
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'Jones', 'Davis', 'Smith'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
df_a = pd.DataFrame(a)
df_tb = pd.DataFrame(TableB)
(pd.merge(df_tb, df_a, on='age')['age']
.value_counts()
.rename('count')
.rename_axis('age')
.reset_index()
.to_dict(orient='list'))
{'age': [22, 38, 26], 'count': [2, 2, 1]}

Is there any short way in pandas to check if every value of your matrix lie between certain values of two other "border" matrices?

Example: Checking matrix_1 must return True. And matrix_2 - False
import pandas as pd
low_border = pd.DataFrame({'A': [1,2], 'B':[2,3]})
up_border = pd.DataFrame({'A': [5,4], 'B':[4,8]})
matrix_1 = ({'A': [2,3], 'B':[3,4]})
matrix_2 = ({'A': [6,3], 'B':[3,4]})

You can use something like this:
def test(mt):
    matrix = pd.DataFrame(mt)
    for column in matrix.columns:
        matrix['verify'] = pd.Series(((low_border[column] < matrix[column]) & (matrix[column] < up_border[column])), index=matrix.index)
        if False in matrix['verify'].tolist():
            return False
    return True
print(test(matrix_1))
print(test(matrix_2))
It will output:
True
False
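The same check can probably be written without an explicit loop by comparing whole DataFrames at once. The sketch below assumes the borders and the tested matrix share the same index and column labels (pandas raises an error when comparing differently labeled frames); the function name within_borders is made up for the example.

```python
import pandas as pd

low_border = pd.DataFrame({'A': [1, 2], 'B': [2, 3]})
up_border = pd.DataFrame({'A': [5, 4], 'B': [4, 8]})

def within_borders(mt):
    """Return True if every value lies strictly between the two borders."""
    matrix = pd.DataFrame(mt)
    # Element-wise comparison gives a boolean DataFrame of the same shape.
    inside = (low_border < matrix) & (matrix < up_border)
    # Collapse the whole frame to a single Python bool.
    return bool(inside.to_numpy().all())

result_1 = within_borders({'A': [2, 3], 'B': [3, 4]})
result_2 = within_borders({'A': [6, 3], 'B': [3, 4]})
print(result_1, result_2)
```

Use <= instead of < if values equal to a border should count as inside.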

Transform an integer into dummies vector in python

Hi there!
I'm working in Python with pandas' get_dummies function and I'm trying to transform an int into a vector, for example with a 5-category feature:
1 -> [1,0,0,0,0]
2 -> [0,1,0,0,0]
...
Does a function exist for that?
If not, I can build a function, but I wanted to ask before reinventing the wheel.
Thanks !

Just cast the relevant Series to a string and then use get_dummies as usual.
pd.get_dummies(df['col'].astype(str))

I think it's so easy you should just write a simple function to do that, instead of asking. Here is one of countless ways to do this.
import numpy as np
def get_dumm(lenn, num):
    arr = np.zeros(lenn, dtype=bool)  # replace bool with 'int8' if needed
    arr[num - 1] = True  # replace True with 1 if type of arr is 'int8'
    return arr
get_dumm(5,3)
Output:
array([False, False, True, False, False], dtype=bool)
Or if you use int8:
array([0, 0, 1, 0, 0], dtype=int8)
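To encode several integers at once, one option is to index the rows of an identity matrix. This is a minimal sketch assuming 1-based category ids as in the question; the function name one_hot and its parameters are made up for the example.

```python
import numpy as np

def one_hot(values, n_categories):
    """Map 1-based category ids to rows of an identity matrix."""
    identity = np.eye(n_categories, dtype=np.int8)
    # Row k of the identity matrix is the one-hot vector for category k + 1.
    return identity[np.asarray(values) - 1]

encoded = one_hot([1, 2, 5], 5)
print(encoded)
```

Unlike calling get_dummies on the data alone, this always produces n_categories columns, even when some categories do not appear in the input.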
