How can I reduce the execution time of Python code? - python-3.x

In this code I'm calculating the difference between the square of the sum of n numbers and the sum of the squares of n numbers.
Example: n=3, (1+2+3)^2 - (1^2+2^2+3^2) = 36 - 14 = 22
def sum_square_diff(num):
    sum1 = 0
    sum2 = 0
    for i in range(1, num+1):
        sum1 += i**2
        sum2 += i
    sum2 = sum2**2
    diff = sum2 - sum1
    return diff

if __name__ == "__main__":
    n = int(input())
    for i in range(n):
        num = int(input())
        result = sum_square_diff(num)
        print(result)
This code is correct, but it takes too long to execute.

In the first place, the formula that you want to compute has a closed-form representation. There is no need for any loops:
n*n*(n+1)*(n+1)/4 - n*(n+1)*(2*n+1)/6
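A minimal sketch of the closed form (the name sum_square_diff_closed is mine; using integer division // keeps the result exact, since n*(n+1) is always even and n*(n+1)*(2*n+1) is always divisible by 6):
def sum_square_diff_closed(n):
    square_of_sum = (n * (n + 1) // 2) ** 2          # (1 + 2 + ... + n)^2
    sum_of_squares = n * (n + 1) * (2 * n + 1) // 6  # 1^2 + 2^2 + ... + n^2
    return square_of_sum - sum_of_squares

assert sum_square_diff_closed(3) == 22  # matches the example above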
But if you insist, you can get a >3x speedup by using NumPy instead of raw Python:
import numpy as np

def sum_square_diff1(num):
    x = np.arange(1, num+1)
    return x.sum()**2 - (x**2).sum()
In [7]: %timeit sum_square_diff(100)
19.6 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit sum_square_diff1(100)
5.61 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Related

Python: Opposite number performance comparison

Why is
def opposite(number):
    number - number*2
returning a faster result than
def opposite(number):
    return -number
in Python?
Here you can see the difference in performance between the two methods:
def opposite(number):
    number - number*2

def opposite2(number):
    return -number
%timeit opposite(5)
84.3 ns ± 2.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit opposite2(5)
66.5 ns ± 6.88 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
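Note that the first version has no return statement, so it computes the subtraction, discards the result, and implicitly returns None. If you want to see exactly what each function executes, the standard-library dis module can show the bytecode; a minimal sketch:
import dis

def opposite(number):
    number - number*2   # result is computed and discarded; implicit return None

def opposite2(number):
    return -number

dis.dis(opposite)   # compare the bytecode of the two versions
dis.dis(opposite2)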

cupy indexing is slow

I am trying to perform operations on a large cupy array of size 16000. I find mathematical operations such as addition to be quite fast, but indexing using boolean masks to be relatively slow. For example, the following code:
import cupy as cp
arr = cp.random.normal(0, 1, 16000)
%timeit arr * 5
%timeit arr > 0.4
%timeit arr[arr > 0.4] = 0
gives me the output:
28 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.5 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
104 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Any reason why the final indexing is at least twice as slow? I assumed that multiplication should be slower than setting array elements.
Update: This is not true for numpy indexing. Changing the cupy array to numpy, I get:
6.71 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.42 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.39 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In the 3rd case, cupy is composing the result via a sequence of operations: cupy_greater, cupy_copy, inclusive_scan_kernel, inclusive_scan_kernel, add_scan_blocked_sum_kernel, CUDA memcpy DtoH (perhaps to provide the number of elements that need to be set to zero), CUDA memset (perhaps to set an array to zero), and finally cupy_scatter_update_mask (to scatter the zeros to their correct locations, perhaps).
This is a considerably more complex sequence than arr*5, which seems to run a single cupy_multiply under the hood. You can probably do better with a cupy user-defined kernel:
import cupy as cp

clamp_generic = cp.ElementwiseKernel(
    'T x, T c',
    'T y',
    'y = (y > x)?c:y',
    'clamp_generic')

arr = cp.random.normal(0, 1, 16000)
clamp_generic(0.4, 0, arr)
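Another option worth benchmarking (not timed here) is cupy.where, which builds the result in a single elementwise pass rather than through the boolean-mask scatter path:
import cupy as cp

arr = cp.random.normal(0, 1, 16000)
# produce a new array with every element greater than 0.4 replaced by 0
arr = cp.where(arr > 0.4, 0, arr)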

What's the most concise way to iterate over a list by pairs in Python?

I've got the following bruteforce option that allows me to iterate over points:
# [x1, y1, x2, y2, ..., xn, yn]
coords = [1, 1, 2, 2, 3, 3]
# The goal is to operate with (x, y) within the for loop
for (x, y) in zip(coords[::2], coords[1::2]):
    ...  # do something with (x, y) as a point
Is there a more concise / efficient way to do it?
(In the answers below, coords is renamed to items.)
Short Answer
If you want your items grouped with a specific length of 2, then
zip(items[::2], items[1::2])
is one of the best compromises in terms of speed and clarity.
If you can afford an extra line, you can get somewhat (for larger inputs, a lot) more efficient by using iterators:
it = iter(items)
zip(it, it)
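For example, a quick sketch (using the question's coords) of how the two-line iterator version reads in practice:
coords = [1, 1, 2, 2, 3, 3]
it = iter(coords)
for x, y in zip(it, it):  # zip pulls two consecutive items from the same iterator each step
    print(x, y)           # 1 1, then 2 2, then 3 3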
Long Answer
(EDIT: added a method that avoids zip())
You could achieve this in a number of ways.
For convenience, I write these as functions that can be benchmarked.
I will also leave the size of the group as a parameter n (which, in your case, is 2):
import itertools

def grouping1(items, n=2):
    return zip(*tuple(items[i::n] for i in range(n)))

def grouping2(items, n=2):
    return zip(*tuple(itertools.islice(items, i, None, n) for i in range(n)))

def grouping3(items, n=2):
    for j in range(len(items) // n):
        yield items[j * n:j * n + n]

def grouping4(items, n=2):
    return zip(*([iter(items)] * n))

def grouping5(items, n=2):
    it = iter(items)
    while True:
        result = []
        for _ in range(n):
            try:
                tmp = next(it)
            except StopIteration:
                break
            else:
                result.append(tmp)
        if len(result) == n:
            yield result
        else:
            break
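As a quick sanity check before timing them (the zip-based versions yield tuples, while the generator-based ones yield lists):
sample = [1, 1, 2, 2, 3, 3]
print(list(grouping1(sample)))  # [(1, 1), (2, 2), (3, 3)]
print(list(grouping4(sample)))  # [(1, 1), (2, 2), (3, 3)]
print(list(grouping5(sample)))  # [[1, 1], [2, 2], [3, 3]]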
Benchmarking these with a relatively short list gives:
short = list(range(10))
%timeit [x for x in grouping1(short)]
# 1.33 µs ± 9.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping2(short)]
# 1.51 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping3(short)]
# 1.14 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping4(short)]
# 639 ns ± 7.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping5(short)]
# 3.37 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For medium sized inputs:
medium = list(range(1000))
%timeit [x for x in grouping1(medium)]
# 21.9 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping2(medium)]
# 25.2 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping3(medium)]
# 65.6 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping4(medium)]
# 18.3 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [x for x in grouping5(medium)]
# 257 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For larger inputs:
large = list(range(1000000))
%timeit [x for x in grouping1(large)]
# 49.7 ms ± 840 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping2(large)]
# 37.5 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping3(large)]
# 84.4 ms ± 736 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping4(large)]
# 31.6 ms ± 85.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping5(large)]
# 274 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As far as efficiency goes, grouping4() seems to be the fastest, with the runner-up among grouping1(), grouping2(), and grouping3() depending on the size of the input.
In your case, grouping1() seems a good compromise between speed and clarity, unless you are willing to wrap it up in a function.
Note that grouping4() requires you to use the same iterator multiple times and:
zip(iter(items), iter(items))
would NOT work.
If you want more control over uneven grouping, i.e. when len(items) is not divisible by n, you can replace zip() with itertools.zip_longest() from the standard library.
Note also that grouping4() is essentially the grouper() recipe from the official itertools documentation.
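For reference, that recipe looks roughly like this (zip_longest() pads the last group with fillvalue when the length is not a multiple of n):
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)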
You can use iter(object) and next(iterator, default) with a known default to exit the loop:
coords = [1, 1, 2, 2, 3, 3]
it = iter(coords)
while it:
    x = next(it, None)
    y = next(it, None)
    if x is None or y is None:
        break
    # do something with your pairs
    print(x, y)
Output:
1 1
2 2
3 3

Converting a pandas dataframe into a dictionary

I have a pandas dataframe news_dataset where column ID is an article ID and column Content is the article content (large text), given as:
ID Content
17283 WASHINGTON — Congressional Republicans have...
17284 After the bullet shells get counted, the blood...
17285 When Walt Disney’s “Bambi” opened in 1942, cri...
17286 Death may be the great equalizer, but it isn’t...
17287 SEOUL, South Korea — North Korea’s leader, ...
Now, all I want is to convert the pandas dataframe into a dictionary where ID is the key and Content is the value. What I did at first was something like:
dd = {}
for i in news_dataset['ID']:
    for j in news_dataset['Content']:
        dd[j] = i
This piece of code is pathetic and takes too much time (> 4 minutes) to run. So, after checking for better approaches (on Stack Overflow), what I finally did is:
id_array = []
content_array = []
for id_num in news_dataset['ID']:
    id_array.append(id_num)
for content in news_dataset['Content']:
    content_array.append(content)
news_dict = dict(zip(id_array, content_array))
This code takes nearly 15 seconds to execute.
What I want to ask is:
i) What's wrong with the first code, and why does it take so much time to run?
ii) Is using a for loop inside another for loop the wrong way to iterate when it comes to large text data?
iii) What would be the right way to create the dictionary with a for loop in a single query?
Generally, loops in pandas should be avoided if non-loop, vectorized alternatives exist. (Your first snippet is slow because the nested loops visit every combination of ID and Content, which is quadratic in the number of rows; it also maps Content to ID rather than ID to Content.)
You can create an index from column ID and call Series.to_dict():
news_dict=news_dataset.set_index('ID')['Content'].to_dict()
Or zip:
news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
#alternative
#news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
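Another option (not included in the timings below) is a dict comprehension over itertuples(), which also avoids the nested loops:
news_dict = {row.ID: row.Content for row in news_dataset.itertuples(index=False)}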
Performance:
import numpy as np
import pandas as pd

np.random.seed(1425)
# sample with 1000 rows
news_dataset = pd.DataFrame({'ID': np.arange(1000),
                             'Content': np.random.choice(list('abcdef'), size=1000)})
#print (news_dataset)
In [98]: %%timeit
    ...: dd = {}
    ...: for i in news_dataset['ID']:
    ...:     for j in news_dataset['Content']:
    ...:         dd[j] = i
    ...:
61.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %%timeit
    ...: id_array = []
    ...: content_array = []
    ...: for id_num in news_dataset['ID']:
    ...:     id_array.append(id_num)
    ...: for content in news_dataset['Content']:
    ...:     content_array.append(content)
    ...: news_dict = dict(zip(id_array, content_array))
    ...:
251 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [100]: %%timeit
...: news_dict=news_dataset.set_index('ID')['Content'].to_dict()
...:
584 µs ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
...:
106 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [102]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
...:
122 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

How to perform sum pooling in PyTorch

How can I perform sum pooling in PyTorch? Specifically, if we have input (N, C, W_in, H_in) and want output (N, C, W_out, H_out) using a particular kernel_size and stride, just like nn.MaxPool2d?
You could use torch.nn.AvgPool1d (or torch.nn.AvgPool2d, torch.nn.AvgPool3d), which perform mean pooling, proportional to sum pooling. If you really want the summed values, you can multiply the averaged output by the pooling surface (the number of elements in the kernel).
In https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html#torch.nn.AvgPool2d, find divisor_override.
Set divisor_override=1 and you'll get a sum pool:
import torch
input = torch.tensor([[[1,2,3],[3,2,1],[3,4,5]]])
sumpool = torch.nn.AvgPool2d(2, stride=1, divisor_override=1)
sumpool(input)
you'll get
tensor([[[ 8, 8],
[12, 12]]])
To expand on benjaminplanche's answer:
I need sum pooling as well and it doesn't seem to directly exist, but it is equivalent to running a conv2d with a weights parameter made of ones. I thought it would be faster to run AvgPool2d and multiply by the kernel size product. Turns out, not exactly.
Bottom line up front:
Use torch.nn.functional.avg_pool2d and its related functions and multiply by the kernel size product.
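For example, a minimal wrapper along those lines (the name sum_pool2d is mine, not part of PyTorch):
import torch
import torch.nn.functional as F

def sum_pool2d(x, kernel_size):
    # mean pooling scaled back up by the kernel area gives sum pooling
    kh, kw = (kernel_size, kernel_size) if isinstance(kernel_size, int) else kernel_size
    return F.avg_pool2d(x, (kh, kw)) * (kh * kw)

x = torch.rand(1, 1, 1000, 1000)
out = sum_pool2d(x, 10)  # shape: (1, 1, 100, 100)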
Testing in Jupyter I find:
(Overhead)
%%timeit
x = torch.rand([1,1,1000,1000])
>>> 3.49 ms ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
_=F.avg_pool2d(torch.rand([1,1,1000,1000]), [10,10])*10*10
>>> 4.99 ms ± 74.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 1.50 ms ± 79.0 µs) (I found the *10*10 only adds around 20 µs to the graph)
avePool = nn.AvgPool2d([10, 10], 1, 0)
%%timeit
_=avePool(torch.rand([1,1,1000,1000]))*10*10
>>> 80.9 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(So 77.4 ms ± 1.58 ms)
y = torch.ones([1,1,10,10])
%%timeit
_=F.conv2d(torch.rand([1,1,1000,1000]), y)
>>> 14.4 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 10.9 ms ± 426 µs)
sumPool = nn.Conv2d(1, 1, 10, 1, 0, 1, 1, False)
sumPool.weight = torch.nn.Parameter(y)
%%timeit
_=sumPool(torch.rand([1,1,1000,1000]))
>>> 7.24 ms ± 63.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 3.75 ms ± 68.3 µs)
And as a sanity check.
abs_err = torch.max(torch.abs(avePool(x)*10*10 - sumPool(x)))
magnitude = torch.max(torch.max(avePool(x)*10*10, torch.max(sumPool(x))))
relative_err = abs_err/magnitude
abs_err.item(), magnitude.item(), relative_err.item()
>>> (3.814697265625e-06, 62.89910125732422, 6.064788493631568e-08)
That's probably a reasonable rounding-related error.
I do not know why the functional version is faster than making a dedicated kernel, but it looks like if you want a dedicated kernel, you should prefer the Conv2d version and make the weights untrainable with sumPool.weight.requires_grad = False or with torch.no_grad(): during creation of the kernel parameters. These results may change with kernel size, so test for your own application if you need to speed up this part. Let me know if I missed something...
