I have a Python dictionary like this -
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
'Attendance':[False, False, False, False, False]}
I also have a Python list of flags giving the indices that need to be changed to True in my_dict['Attendance'] -
flag_list = [0, 2, 3]
Based on the flag_list, my_dict needs to be changed to -
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
'Attendance':[True, False, True, True, False]}
What would be the fastest way of achieving this? Can it be done without a loop? Thank you for any guidance.
Using a loop
for index in flag_list:
    my_dict['Attendance'][index] = True
A micro optimization would be to fetch the list from the dict only once:
attendance_list = my_dict['Attendance']
for index in flag_list:
    attendance_list[index] = True
But unless flag_list is thousands of elements long I wouldn't worry about it.
Using vectorization
If you are willing to take advantage of vectorization you can use a numpy array:
import numpy as np
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
'Attendance': np.array([False, False, False, False, False])}
flag_list = [0, 2, 3]
my_dict['Attendance'][flag_list] = True
But again, unless your data is very big I wouldn't worry about optimizing this piece of code very much.
Example timings
import random
from timeit import Timer
import numpy as np
ATTENDANCE_LIST_SIZE = 100000
FLAG_LIST_SIZE = 60000
dict_with_numpy = {'Attendance': np.random.choice([False, True],
ATTENDANCE_LIST_SIZE)}
dict_without_numpy = {'Attendance': random.choices([False, True],
k=ATTENDANCE_LIST_SIZE)}
flag_list = random.choices(range(ATTENDANCE_LIST_SIZE), k=FLAG_LIST_SIZE)
def using_numpy():
    dict_with_numpy['Attendance'][flag_list] = True
def no_numpy_pre_fetching_list():
    attendance_list = dict_without_numpy['Attendance']
    for index in flag_list:
        attendance_list[index] = True
def no_numpy():
    for index in flag_list:
        dict_without_numpy['Attendance'][index] = True
print(f'no_numpy\t\t\t\t\t\t{min(Timer(no_numpy).repeat(3, 3))}')
print(f'no_numpy_pre_fetching_list\t\t{min(Timer(no_numpy_pre_fetching_list).repeat(3, 3))}')
print(f'using_numpy\t\t\t\t\t\t{min(Timer(using_numpy).repeat(3, 3))}')
For this amount of data, the output is (on my machine)
no_numpy 0.009737916999999985
no_numpy_pre_fetching_list 0.0048406370000000365
using_numpy 0.009164470000000036
So using vectorization for this data is not the most efficient.
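One likely contributor to the NumPy timing is that flag_list is a plain Python list, so NumPy has to convert it to an index array on every call. Below is a minimal sketch, assuming the indices can be converted once ahead of time (flag_array is a name introduced here for illustration). It only checks that the result matches the loop version; no timings are shown because they depend on the machine and the data size.

```python
import numpy as np

attendance = np.zeros(5, dtype=bool)
flag_list = [0, 2, 3]

# Assumption: the index list is reused, so it is converted to an array once.
flag_array = np.asarray(flag_list)
attendance[flag_array] = True

# Plain-loop version on a regular list, for comparison.
expected = [False] * 5
for index in flag_list:
    expected[index] = True
```

To compare speeds, flag_array could be substituted for flag_list inside using_numpy() in the timing script above.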
Related
I have generated a square matrix of size 4 and a boolean matrix of same size by:
import numpy as np
A = np.random.randn(4,4)
B = np.full((4,4), True, dtype = bool)
B[[0],:] = False
B[:,[0]] = False
The code above returns two 4x4 matrices: A holds the random numbers, and B holds the boolean values, where the entire first row and first column are False
B = [[False, False, False, False],
[False, True, True, True],
[False, True, True, True],
[False, True, True, True]]
What I want is to apply the boolean matrix B to A, so that I get the 3 by 3 matrix of the elements of A where B is True.
Is there any logical operator in numpy to perform this operation? Or do I have to go through each element of A and B, compare them, and then assign the result to a new matrix?
In [214]: A = np.random.randn(4,4)
...: B = np.full((4,4), True, dtype = bool)
...: B[[0],:] = False
...: B[:,[0]] = False
In [215]: A
Out[215]:
array([[-0.80676817, -0.20810386, 1.28448594, -0.52667651],
[ 0.6292733 , -0.05575997, 0.32466482, -0.23495175],
[-0.70896794, -1.60571282, -1.43718839, -0.42032337],
[ 0.01541418, -2.00072652, -1.54197002, 1.2626283 ]])
In [216]: B
Out[216]:
array([[False, False, False, False],
[False, True, True, True],
[False, True, True, True],
[False, True, True, True]])
Boolean indexing (with matching size array) always produces a 1d array. In this case it did not select any values for A[0,:]:
In [217]: A[B]
Out[217]:
array([-0.05575997, 0.32466482, -0.23495175, -1.60571282, -1.43718839,
-0.42032337, -2.00072652, -1.54197002, 1.2626283 ])
But because the other 3 rows all have 3 True, reshaping the result does produce a reasonable result:
In [218]: A[B].reshape(3,3)
Out[218]:
array([[-0.05575997, 0.32466482, -0.23495175],
[-1.60571282, -1.43718839, -0.42032337],
[-2.00072652, -1.54197002, 1.2626283 ]])
Whether the reshape makes sense depends on the total number of elements, and your own interpretation of the data.
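For this particular mask (first row and first column False, everything else True), plain slicing gives the same 3 by 3 block without any reshape. A small sketch, using np.arange as a stand-in for the random matrix so the values are reproducible; via_mask and via_slice are illustrative names:

```python
import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for np.random.randn(4, 4)
B = np.full((4, 4), True, dtype=bool)
B[0, :] = False
B[:, 0] = False

via_mask = A[B].reshape(3, 3)   # boolean indexing, then reshape
via_slice = A[1:, 1:]           # drop the first row and the first column
```

The slice only matches because the True entries of B form a contiguous rectangular block; for an arbitrary mask the boolean-indexing route is the general one.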
If you are looking to remove any rows/cols that consist entirely of False elements, you can use np.any to find the rows and columns to keep and then use np.ix_ to create a 2D index from the row/col indices (rows first, then columns):
A=A[np.ix_(*np.where(np.any(B, axis=1)), *np.where(np.any(B, axis=0)))]
This will give you the output for any 2D numpy array and a boolean mask/condition of the same shape. You can extend this to higher-dimensional arrays by passing one index array per dimension to np.ix_.
sample A:
[[-0.36027839 -1.54588632 0.1607951 1.68865218]
[ 0.20959185 0.13962857 1.97189081 -0.7686762 ]
[ 0.03868048 -0.36612182 0.77802273 0.23195807]
[-1.26148984 0.44672696 0.45970364 -1.58457129]]
Masked A with B:
[[ 0.13962857 1.97189081 -0.7686762 ]
[-0.36612182 0.77802273 0.23195807]
[ 0.44672696 0.45970364 -1.58457129]]
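With a square, symmetric mask the order of the row and column indices does not show up in the result, so here is a small sketch on a non-square array. It assumes the goal is to keep every row and column that contains at least one True; keep_rows, keep_cols and sub are illustrative names. np.ix_ accepts boolean sequences directly and treats them as masks for the corresponding dimension.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
B = np.array([[False, False, False, False],
              [False, True,  True,  False],
              [False, True,  True,  False]])

keep_rows = B.any(axis=1)  # one flag per row:    [False, True, True]
keep_cols = B.any(axis=0)  # one flag per column: [False, True, True, False]

# Rows go first, columns second.
sub = A[np.ix_(keep_rows, keep_cols)]
```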
I have two 2D tensors of different lengths. Both are different subsets of the same original 2D tensor, and I would like to find all the matching "rows",
e.g.
A = [[1,2,3],[4,5,6],[7,8,9],[3,3,3]]
B = [[1,2,3],[7,8,9],[4,4,4]]
torch.2dintersect(A,B) -> [0,2] (the indices of the rows of A that B also has)
I've only seen numpy solutions, which rely on structured dtype views and do not work for pytorch.
Here is how I do it in numpy
arr1 = edge_index_dense.numpy().view(np.int32)
arr2 = edge_index2_dense.numpy().view(np.int32)
arr1_view = arr1.view([('', arr1.dtype)] * arr1.shape[1])
arr2_view = arr2.view([('', arr2.dtype)] * arr2.shape[1])
intersected = np.intersect1d(arr1_view, arr2_view, return_indices=True)
This answer was posted before the OP updated the question with other restrictions that changed the problem quite a bit.
TL;DR You can do something like this:
torch.where((A == B).all(dim=1))[0]
First, assuming you have:
import torch
A = torch.Tensor([[1,2,3],[4,5,6],[7,8,9]])
B = torch.Tensor([[1,2,3],[4,4,4],[7,8,9]])
We can check that A == B returns:
>>> A == B
tensor([[ True, True, True],
[ True, False, False],
[ True, True, True]])
So, what we want is: the rows in which they are all True. For that, we can use the .all() operation and specify the dimension of interest, in our case 1:
>>> (A == B).all(dim=1)
tensor([ True, False, True])
What you actually want to know is where the Trues are. For that, we can get the first output of the torch.where() function:
>>> torch.where((A == B).all(dim=1))[0]
tensor([0, 2])
If A and B are 2D tensors, the following code finds the indices such that A[indices] == B. If multiple rows of A match the same row of B, only one index is returned for it. If a row of B is not present in A, it is skipped.
values, indices = torch.topk(((A.t() == B.unsqueeze(-1)).all(dim=1)).int(), 1, 1)
indices = indices[values!=0]
# indices = tensor([0, 2])
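For tensors of different lengths, as in the question, another option is a pairwise row comparison via broadcasting. This is a sketch under the assumption that both tensors fit comfortably in memory, since the intermediate comparison has shape (len(A), len(B), row_length); pairwise and rows_in_B are illustrative names.

```python
import torch

A = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [3, 3, 3]])
B = torch.tensor([[1, 2, 3], [7, 8, 9], [4, 4, 4]])

# pairwise[i, j] is True when row i of A equals row j of B.
pairwise = (A.unsqueeze(1) == B.unsqueeze(0)).all(dim=2)  # shape (4, 3)

# A row of A is kept if it matches at least one row of B.
rows_in_B = pairwise.any(dim=1)
indices = torch.nonzero(rows_in_B).squeeze(1)
```

Here indices holds the positions in A, in ascending order, of rows that also occur in B.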
For every element in the 'age' list of a dict a, I need to count how many times it appears in one specific column of another dataframe in pandas
For example , I have a dict below:
a={'age':[22,38,26],'no':[1,2,3]}
and I have another dataframe with a few columns
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
I would like to know how many times every element in dict a appears in the column 'age' in TableB. The result I expect is c={'age':[22,38,26],'count':[2,2,1]}
I have tried the apply function but it does not work; it fails with a syntax error. I'm new to Pandas, could anyone please help with that? Thank you!
def myfunction(y):
    seriesObj = TableB.apply(lambda x: True if y in list(x) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numofRows
c['age']=a['age']
c['count']=a['age'].apply(myfunction)
I would like to know how many times every element in list a appears in the column 'age' in TableB. The result should be
c={'age':[22,38,26],'count':[2,2,1]}
Use value_counts method with pd.Series and to_dict with pd.DataFrame
(pd.Series(TableB['age'])
.value_counts()
.loc[a['age']]
.rename('count')
.rename_axis('age')
.reset_index()
.to_dict(orient='list'))
You can use pandas.Series.value_counts() on the age column and select the results you're interested in. The following solution will also take into account possible missing values in your 'a' list.
import pandas as pd
a=[22,38,26,99]
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'John', 'Jane', 'Doe'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
tableB_df = pd.DataFrame(TableB)
counts_series = tableB_df['age'].value_counts()
counts_series_intersection = counts_series.loc[counts_series.index.intersection(a)]
counts_df = pd.DataFrame({'age': counts_series_intersection.index, 'count': counts_series_intersection.values})
Have a look at the following resources for more info:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
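If ages that never occur in TableB should be reported with a count of 0 instead of being dropped, Series.reindex with fill_value is one option. A minimal sketch, assuming only the 'age' column matters; wanted_ages, counts and c are illustrative names, and 99 is the deliberately missing age from the example above.

```python
import pandas as pd

wanted_ages = [22, 38, 26, 99]
table_b_ages = pd.Series([22, 38, 26, 35, 41, 22, 38])

# Count every age, then look up the wanted ones; absent ages get 0.
counts = table_b_ages.value_counts().reindex(wanted_ages, fill_value=0)

c = {'age': wanted_ages, 'count': counts.tolist()}
```

The result keeps the order of wanted_ages, so the two lists in c line up position by position.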
You can just use merging of data frames to filter out the values that don't appear in a and just count the values.
import pandas as pd
a={'age':[22,38,26],'no':[1,2,3]}
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'Jones', 'Davis', 'Smith'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
df_a = pd.DataFrame(a)
df_tb = pd.DataFrame(TableB)
(pd.merge(df_tb, df_a, on='age')['age']
.value_counts()
.rename('count')
.rename_axis('age')
.reset_index()
.to_dict(orient='list'))
{'age': [22, 38, 26], 'count': [2, 2, 1]}
Example: checking matrix_1 must return True, and checking matrix_2 must return False
import pandas as pd
low_border = pd.DataFrame({'A': [1,2], 'B':[2,3]})
up_border = pd.DataFrame({'A': [5,4], 'B':[4,8]})
matrix_1 = ({'A': [2,3], 'B':[3,4]})
matrix_2 = ({'A': [6,3], 'B':[3,4]})
You can use something like this:
def test(mt):
    matrix=pd.DataFrame(mt)
    for column in matrix.columns:
        matrix['verify']=pd.Series(((low_border[column] < matrix[column]) & (matrix[column] < up_border[column])), index=matrix.index)
        if False in matrix['verify'].tolist():
            return False
    return True
print(test(matrix_1))
print(test(matrix_2))
It will output:
True
False
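The same check can also be written without the explicit column loop by comparing whole DataFrames at once. A sketch, assuming the borders and the matrix share the same column names and index (as they do in the example); within_borders is an illustrative name.

```python
import pandas as pd

low_border = pd.DataFrame({'A': [1, 2], 'B': [2, 3]})
up_border = pd.DataFrame({'A': [5, 4], 'B': [4, 8]})

def within_borders(mt):
    """Return True if every element lies strictly between the two borders."""
    matrix = pd.DataFrame(mt)
    inside = (low_border < matrix) & (matrix < up_border)
    return bool(inside.to_numpy().all())

matrix_1 = {'A': [2, 3], 'B': [3, 4]}
matrix_2 = {'A': [6, 3], 'B': [3, 4]}
```

Calling within_borders(matrix_1) and within_borders(matrix_2) gives the same True / False pair as the loop version.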
Hi there!
I'm working in Python with pandas' get_dummies function, and I am trying to transform an int into a vector, for example with a 5-category feature:
1 -> [1,0,0,0,0]
2 -> [0,1,0,0,0]
...
Does a function exist for that?
If not, I can build a function, but I just wanted to ask before reinventing the wheel.
Thanks !
Just cast the relevant Series to a string and then use get_dummies as usual.
pd.get_dummies(df['col'].astype(str))
I think it's so easy you should just write a simple function to do that, instead of asking. Here is one of countless ways to do this.
import numpy as np
def get_dumm(lenn, num):
    arr = np.zeros(lenn, dtype=bool) #replace type with 'int8' if needed
    arr[num - 1] = True #replace True with 1 if type of arr is 'int8'
    return arr
get_dumm(5,3)
Output:
array([False, False, True, False, False], dtype=bool)
Or if you use int8:
array([0, 0, 1, 0, 0], dtype=int8)
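To encode several values at once, one common trick is to index into an identity matrix. A short sketch, assuming the categories are the integers 1 to num_categories as in the question; one_hot is an illustrative name.

```python
import numpy as np

def one_hot(values, num_categories):
    """Map 1-based category numbers to rows of an identity matrix."""
    values = np.asarray(values)
    # Row k of the identity matrix is the one-hot vector for category k + 1.
    return np.eye(num_categories, dtype=int)[values - 1]

encoded = one_hot([1, 2, 5], 5)
```

Each row of encoded is the one-hot vector for the corresponding input value.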