The most efficient way to search every element of a list in a dataframe

I have a dataset of over 1M rows, like d. I need to find the indexes in that dataset of the elements of a dataframe like seekingframe, which has over 1500 elements.
import pandas as pd
I need to find every element of seekingframe in d as fast as possible. That is, I need a final array like:
array([ 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, -1, -1, -1, -1, -1, -1, -1, -1, 6, 7])
or the difference array like
[11, 12, 13, 14, 15, 16, 17, 18]
or something else that denotes the similarities or differences. Actually, if possible, I would rather drop the elements that differ.

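It's likely faster to use numpy. On these small unique arrays, numpy was roughly 11x faster than pandas .isin() without passing assume_unique=True to np.in1d, the numpy function that tests whether each element of one array is present in another and returns a boolean array. (In NumPy 2.0 and later, np.in1d is deprecated in favor of np.isin.)
It was roughly 34x faster when passing assume_unique=True, which is only valid when both arrays contain unique values: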
#finding similar
%timeit d[d[0].isin(seekingframe[0])].index
404 µs ± 6.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#finding difference
%timeit seekingframe[~seekingframe[0].isin(d[0])].index
458 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# finding similar with numpy arrays and NOT passing `assume_unique=True`
a = d[0].to_numpy()
b = seekingframe[0].to_numpy()
%timeit np.arange(a.shape[0])[np.in1d(a, b)]
35.4 µs ± 779 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# finding similar with numpy arrays and passing `assume_unique=True`
a = d[0].to_numpy()
b = seekingframe[0].to_numpy()
%timeit np.arange(a.shape[0])[np.in1d(a, b, assume_unique=True)]
12 µs ± 337 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
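The timings above return the positions in d that match. For the array the question asks for (one position per element of seekingframe, with -1 for elements that are missing), one option is pandas' Index.get_indexer. The following is a minimal sketch, assuming the values in d are unique (get_indexer raises on a non-unique index); the small frames are hypothetical stand-ins for the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's frames; values in d must be unique.
d = pd.DataFrame({0: np.arange(100, 120)})
seekingframe = pd.DataFrame({0: [110, 111, 500, 106]})

# Position in d of every element of seekingframe; -1 marks "not found".
positions = pd.Index(d[0]).get_indexer(seekingframe[0])

# Split seekingframe into the rows that were found and the rows that were not.
found = seekingframe[positions != -1]
missing = seekingframe[positions == -1]
```

Dropping the differing elements then amounts to keeping `found`.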


If/else statement vs Heaviside function

In my code I have to consider different contributions with respect to different thresholds. In particular I have a function my_index whose output must be compared to the thresholds Z_1, Z_2, Z_3 in order to determine the increment to the variable my_value. In the following MWE, for simplicity sake, the function my_index is just a uniform random generator:
import numpy as np
my_len = 100000
Z_1 = 0.2
Z_2 = 0.4
Z_3 = 0.7
first = 1
second = 2
third = -0.0003
my_value = 0
for i in range(my_len):
    my_index = np.random.uniform()
    my_value += first*np.heaviside(my_index - Z_1,0)*np.heaviside(Z_2 - my_index,0) + second*np.heaviside(my_index - Z_3,0) + third*np.heaviside(Z_3 - my_index,0)
#if Z_1 < my_index < Z_2 add first
#if my_index > Z_3 add second
#if my_index < Z_3 add third
I have replaced the if/else statements that could have been used for the thresholds with the Heaviside function. Keep in mind that, in my original code, this section has to be iterated up to 10^5 times.
My question is: does this practice make the code faster? In other words, is a call to np.heaviside faster than an if/else check?
In [433]: x=np.arange(-10,10)
In [434]: x
array([-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9])
A proper use of heaviside - giving x as array, not a single value:
In [436]: np.heaviside(x,.5)
array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 1. , 1. ,
1. , 1. , 1. , 1. , 1. , 1. , 1. ])
A list comprehension equivalent:
In [437]: [.5 if i==0 else (0 if i<0 else 1) for i in x]
Out[437]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1]
and making an array from that list:
In [438]: np.array([.5 if i==0 else (0 if i<0 else 1) for i in x])
array([0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.5, 1. , 1. ,
1. , 1. , 1. , 1. , 1. , 1. , 1. ])
Compare the times:
In [439]: timeit np.heaviside(x,.5)
2.5 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [440]: timeit np.array([.5 if i==0 else (0 if i<0 else 1) for i in x])
15.1 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Iteration on a list is faster (than on an array):
In [441]: timeit np.array([.5 if i==0 else (0 if i<0 else 1) for i in x.tolist()])
6.66 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and if we skip the conversion back to an array:
In [442]: timeit [.5 if i==0 else (0 if i<0 else 1) for i in x.tolist()]
2.28 µs ± 3.01 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For a much larger array, heaviside performance is even better:
In [445]: x=np.arange(-1000,1000)
In [446]: timeit [.5 if i==0 else (0 if i<0 else 1) for i in x.tolist()]
211 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [447]: timeit np.heaviside(x,.5)
13 µs ± 201 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
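For the random number generation, taking the whole-array approach is also faster. (Note that np.random.uniform(1000) below passes 1000 as the low bound and returns a single value; drawing 1000 samples needs np.random.uniform(size=1000). That call takes longer than the time shown, but should still be far faster than the list comprehension.)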
In [448]: timeit [np.random.uniform() for _ in range(1000)]
4.62 ms ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [449]: timeit np.random.uniform(1000)
4.74 µs ± 171 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I could also time the scalar use of heaviside - that is worse than the if/else in [446]:
In [450]: timeit [np.heaviside(i,.5) for i in x]
8.64 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In sum:
use whole-array code where possible
when using Python level iteration, use lists and scalar methods instead
Assuming you use the standard CPython interpreter, a Numpy function call like np.heaviside on a scalar is likely more expensive than a basic conditional. However, both are very inefficient. Indeed, conditionals are generally slow and could be replaced here with a branchless implementation (adding/multiplying booleans converted to integers). The most important optimization is vectorization, because Numpy is designed to be efficient on relatively big arrays and not on scalar values (mainly due to additional internal checks and function calls). You can generate all the random values in one big array and apply the heaviside function to it multiple times. The resulting code will likely be 2 or 3 orders of magnitude faster!
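As an illustration of the vectorized approach described above, here is a minimal sketch that reuses the constants from the question. The names value_heaviside and value_masks, and the seeded generator rng, are introduced here for illustration only. It computes the total once with np.heaviside on the whole array and once with the branchless boolean-mask form.

```python
import numpy as np

my_len = 100_000
Z_1, Z_2, Z_3 = 0.2, 0.4, 0.7
first, second, third = 1, 2, -0.0003

# Draw all samples at once instead of one per loop iteration.
rng = np.random.default_rng(0)
my_index = rng.uniform(size=my_len)

# Whole-array version of the question's heaviside expression.
value_heaviside = np.sum(
    first * np.heaviside(my_index - Z_1, 0) * np.heaviside(Z_2 - my_index, 0)
    + second * np.heaviside(my_index - Z_3, 0)
    + third * np.heaviside(Z_3 - my_index, 0)
)

# Branchless alternative: count how many samples fall in each region.
value_masks = (
    first * np.count_nonzero((my_index > Z_1) & (my_index < Z_2))
    + second * np.count_nonzero(my_index > Z_3)
    + third * np.count_nonzero(my_index < Z_3)
)
```

Both forms avoid the Python-level loop entirely; which of the two is faster is worth timing on your own data.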

How can I reduce Execution time of Python code

In this code I'm calculating the difference between the square of the sum of the first n numbers and the sum of their squares.
Example : n=3, (1+2+3)^2 -(1^2+2^2+3^2) =22
def sum_square_diff(num):
    sum1 = 0  # running sum of squares
    sum2 = 0  # running sum
    for i in range(1,num+1):
        sum1 +=i**2
        sum2 +=i
    diff = sum2**2 - sum1
    return diff
if __name__=="__main__":
for i in range(n):
This code is correct but it takes too much time to complete execution.
In the first place, the formula that you want to compute has a closed-form representation. There is no need for any loops:
n*n*(n+1)*(n+1)/4 - n*(n+1)*(2*n+1)/6
But if you insist, you can get >3x speedup by using numpy instead of raw Python:
import numpy as np
def sum_square_diff1(num):
    x = np.arange(1,num+1)
    return x.sum()**2-(x**2).sum()
In [7]: %timeit sum_square_diff(100)
19.6 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit sum_square_diff1(100)
5.61 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
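The closed form quoted in the answer can be written as a small function. This is a minimal sketch; the name sum_square_diff_closed is introduced here for illustration, and integer division is used so that the result stays an exact Python int rather than a float.

```python
def sum_square_diff_closed(n: int) -> int:
    """Square of the sum of 1..n minus the sum of the squares of 1..n."""
    square_of_sum = (n * (n + 1) // 2) ** 2          # (1 + 2 + ... + n)^2
    sum_of_squares = n * (n + 1) * (2 * n + 1) // 6  # 1^2 + 2^2 + ... + n^2
    return square_of_sum - sum_of_squares


# The example from the question: n = 3 gives 36 - 14 = 22.
example = sum_square_diff_closed(3)
```

Because no loop is involved, the cost does not grow with n the way the loop and numpy versions do.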

cupy indexing is slow

I am trying to perform operations on a large cupy array of size 16000. I find mathematical operations such as addition to be quite fast, but indexing using boolean masks to be relatively slow. For example, the following code:
import cupy as cp
arr = cp.random.normal(0, 1, 16000)
%timeit arr * 5
%timeit arr > 0.4
%timeit arr[arr > 0.4] = 0
gives me the output:
28 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.5 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
104 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Any reason why the final indexing is at least twice as slow? I assumed that multiplication should be slower than setting array elements.
Update: This is not true for numpy indexing. Changing the cupy array to numpy, I get:
6.71 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.42 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.39 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In the 3rd case, cupy is composing the result via a sequence of operations: cupy_greater, cupy_copy, inclusive_scan_kernel, inclusive_scan_kernel, add_scan_blocked_sum_kernel, CUDA memcpy DtoH (perhaps to provide the number of elements that need to be set to zero), CUDA memset (perhaps to set an array to zero), and finally cupy_scatter_update_mask (to scatter the zeros to their correct locations, perhaps).
This is a considerably more complex sequence than arr*5, which seems to run a single cupy_multiply under the hood. You can probably do better with a cupy user-defined kernel:
import cupy as cp
# Elementwise kernel: overwrite y with c wherever y exceeds the threshold x
clamp_generic = cp.ElementwiseKernel(
    'T x, T c',
    'T y',
    'y = (y > x)?c:y',
    'clamp_generic')
arr = cp.random.normal(0, 1, 16000)
clamp_generic(0.4, 0, arr)

converting pandas dataframe into dictionary??

I have a pandas dataframe named news_dataset, where column ID is an article ID and column Content is the article content (large text). It looks like this:
ID Content
17283 WASHINGTON — Congressional Republicans have...
17284 After the bullet shells get counted, the blood...
17285 When Walt Disney’s “Bambi” opened in 1942, cri...
17286 Death may be the great equalizer, but it isn’t...
17287 SEOUL, South Korea — North Korea’s leader, ...
Now, all I want is to convert the pandas dataframe into a dictionary where ID is the key and Content is the value. What I did at first was something like:
dd = {}
for i in news_dataset['ID']:
    for j in news_dataset['Content']:
        dd[j] = i
This piece of code is terrible and takes a very long time (> 4 minutes) to run. So, after looking for better approaches (on Stack Overflow), what I finally did is collect both columns into lists and zip them:
for id_num in news_dataset['ID']:
    id_array.append(id_num)
for content in news_dataset['Content']:
    content_array.append(content)
This code takes nearly 15 seconds to get executed.
What I want to ask is,
i) What's wrong with the first code, and why does it take so long to run?
ii) Is using a for loop inside another for loop the wrong way to iterate over large text data?
iii) What would be the right way to create the dictionary with a for loop in a single statement?
I think loops in pandas should generally be avoided if a non-loop, vectorized alternative exists.
You can create an index from column ID and call Series.to_dict:
news_dict=news_dataset.set_index('ID')['Content'].to_dict()
Or zip:
news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
#news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
#1000rows sample
news_dataset = pd.DataFrame({'ID':np.arange(1000),
                             'Content':np.random.choice(list('abcdef'), size=1000)})
#print (news_dataset)
In [98]: %%timeit
...: dd={}
...: for i in news_dataset['ID']:
...:     for j in news_dataset['Content']:
...:         dd[j]=i
61.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %%timeit
...: id_array=[]
...: content_array=[]
...: for id_num in news_dataset['ID']:
...:     id_array.append(id_num)
...: for content in news_dataset['Content']:
...:     content_array.append(content)
...: news_dict=dict(zip(id_array,content_array))
251 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [100]: %%timeit
...: news_dict=news_dataset.set_index('ID')['Content'].to_dict()
584 µs ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
106 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [102]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
122 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
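On question (i), the timings hint at the cause: the nested loop visits every (ID, Content) pair, so its work grows with the square of the row count, and it also keys the dictionary by content rather than by ID. Below is a minimal sketch on a hypothetical three-row frame; the counter steps and the name nested are introduced here for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for news_dataset.
news_dataset = pd.DataFrame({"ID": [17283, 17284, 17285],
                             "Content": ["a", "b", "c"]})

# The question's first attempt: every ID is paired with every Content.
nested = {}
steps = 0
for i in news_dataset["ID"]:
    for j in news_dataset["Content"]:
        nested[j] = i   # keyed by content, overwritten on each outer pass
        steps += 1

# Row-wise pairing: one step per row, ID as key.
news_dict = dict(zip(news_dataset["ID"], news_dataset["Content"]))
```

With three rows the nested version already takes nine steps and ends with every content mapped to the last ID, whereas the zip version pairs each ID with the content on its own row.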

How to perform sum pooling in PyTorch

How can I perform sum pooling in PyTorch? Specifically, if we have an input of shape (N, C, W_in, H_in) and want an output of shape (N, C, W_out, H_out) using a particular kernel_size and stride, just like nn.MaxPool2d?
You could use torch.nn.AvgPool1d (or torch.nn.AvgPool2d, torch.nn.AvgPool3d), which perform mean pooling, which is proportional to sum pooling. If you really want the summed values, you could multiply the averaged output by the pooling surface (the number of elements in the kernel window).
Alternatively, use the divisor_override argument of AvgPool2d: set divisor_override=1 and you'll get a sum pool:
import torch
input = torch.tensor([[[1,2,3],[3,2,1],[3,4,5]]])
sumpool = torch.nn.AvgPool2d(2, stride=1, divisor_override=1)
sumpool(input)
you'll get
tensor([[[ 8, 8],
[12, 12]]])
To expand on benjaminplanche's answer:
I need sum pooling as well and it doesn't seem to directly exist, but it is equivalent to running a conv2d with a weights parameter made of ones. I thought it would be faster to run AvgPool2d and multiply by the kernel size product. Turns out, not exactly.
Bottom line up front:
Use torch.nn.functional.avg_pool2d and its related functions and multiply by the kernel size.
Testing in Jupyter I find:
x = torch.rand([1,1,1000,1000])
>>> 3.49 ms ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
_=F.avg_pool2d(torch.rand([1,1,1000,1000]), [10,10])*10*10
>>> 4.99 ms ± 74.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 1.50 ms ± 79.0 µs) (I found the *10*10 only adds around 20 µs to the graph)
avePool = nn.AvgPool2d([10, 10], 1, 0)
>>> 80.9 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(So 77.4 ms ± 1.58 ms)
y = torch.ones([1,1,10,10])
_=F.conv2d(torch.rand([1,1,1000,1000]), y)
>>> 14.4 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 10.9 ms ± 426 µs)
sumPool = nn.Conv2d(1, 1, 10, 1, 0, 1, 1, False)
sumPool.weight = torch.nn.Parameter(y)
>>> 7.24 ms ± 63.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 3.75 ms ± 68.3 µs)
And as a sanity check.
abs_err = torch.max(torch.abs(avePool(x)*10*10 - sumPool(x)))
magnitude = torch.max(torch.max(avePool(x)*10*10, torch.max(sumPool(x))))
relative_err = abs_err/magnitude
abs_err.item(), magnitude.item(), relative_err.item()
>>> (3.814697265625e-06, 62.89910125732422, 6.064788493631568e-08)
That's probably a reasonable rounding related error.
The functional version looks faster than the dedicated kernels, but the comparison above is not like-for-like: F.avg_pool2d defaults its stride to the kernel size (10 here), while avePool and both convolutions use a stride of 1 and therefore compute roughly 100x more output values. If you want to make a dedicated kernel, prefer the Conv2d version, and make the weights untrainable with sumPool.weight.requires_grad = False or by creating the kernel parameters under with torch.no_grad():. These results may change with kernel size, so test for your own application if you need to speed up this part. Let me know if I missed something...
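The three routes to sum pooling discussed in this thread can be checked against each other on a small input. This is a minimal sketch with the same kernel size and stride for all three; the variable names are introduced here for illustration, and a float tensor is used because average pooling may not be implemented for integer dtypes in every PyTorch version.

```python
import torch
import torch.nn.functional as F

# 4x4 single-channel input holding 0..15, shaped (N, C, H, W).
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
k = 2  # kernel size; stride is fixed at 1 for every variant below

# 1) Mean pooling scaled back up by the window area.
via_mean = F.avg_pool2d(x, k, stride=1) * (k * k)

# 2) Mean pooling with the divisor forced to 1, i.e. a plain sum.
via_override = F.avg_pool2d(x, k, stride=1, divisor_override=1)

# 3) Convolution with an all-ones kernel.
via_conv = F.conv2d(x, torch.ones(1, 1, k, k), stride=1)
```

All three produce a (1, 1, 3, 3) tensor of window sums, so timing them with identical stride settings gives a fairer comparison than the one above.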
