Python: Opposite number performance comparison

Why is
def opposite(number):
    number - number*2
returning a faster result than
def opposite(number):
    return -number
in Python?

time by method
Here you can see the difference in performance between the two methods:
def opposite(number):
    number - number*2

def opposite2(number):
    return -number
%timeit opposite(5)
84.3 ns ± 2.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit opposite2(5)
66.5 ns ± 6.88 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
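One way to investigate such a micro-benchmark gap is to compare the bytecode each function executes; below is a minimal sketch using the standard-library dis module (my addition, not part of the original question):
import dis

def opposite(number):
    number - number*2  # no return statement, so this returns None

def opposite2(number):
    return -number

# The function with the shorter, simpler bytecode sequence generally carries
# less interpreter overhead per call.
dis.dis(opposite)
dis.dis(opposite2)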

Related

How can I reduce Execution time of Python code

In this code I'm calculating the difference between the square of the sum of n numbers and the sum of the squares of n numbers.
Example: n=3, (1+2+3)^2 - (1^2+2^2+3^2) = 22
def sum_square_diff(num):
    sum1=0
    sum2=0
    for i in range(1,num+1):
        sum1 +=i**2
        sum2 +=i
    sum2=sum2**2
    diff=sum2-sum1
    return diff

if __name__=="__main__":
    n=int(input())
    for i in range(n):
        num=int(input())
        result=sum_square_diff(num)
        print(result)
This code is correct but it takes too much time to complete execution.
In the first place, the formula that you want to compute has a closed-form representation. There is no need for any loops:
n*n*(n+1)*(n+1)/4 - n*(n+1)*(2*n+1)/6
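For instance, wrapped in a function (the name is mine; integer division is exact here because both terms are always whole numbers):
def sum_square_diff_closed(n):
    # (1 + 2 + ... + n)^2 - (1^2 + 2^2 + ... + n^2), computed directly
    # sum_square_diff_closed(3) -> 22, matching the example above
    return n*n*(n+1)*(n+1)//4 - n*(n+1)*(2*n+1)//6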
But if you insist, you can get a >3x speedup by using numpy instead of raw Python:
import numpy as np

def sum_square_diff1(num):
    x = np.arange(1,num+1)
    return x.sum()**2-(x**2).sum()
In [7]: %timeit sum_square_diff(100)
19.6 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit sum_square_diff1(100)
5.61 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cupy indexing is slow

I am trying to perform operations on a large cupy array of size 16000. I find mathematical operations such as addition to be quite fast, but indexing using boolean masks to be relatively slow. For example, the following code:
import cupy as cp
arr = cp.random.normal(0, 1, 16000)
%timeit arr * 5
%timeit arr > 0.4
%timeit arr[arr > 0.4] = 0
gives me the output:
28 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.5 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
104 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Any reason why the final indexing is at least twice as slow? I assumed that multiplication should be slower than setting array elements.
Update: This is not true for numpy indexing. Changing the cupy array to numpy, I get:
6.71 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.42 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.39 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In the 3rd case, cupy is composing the result via a sequence of operations: cupy_greater, cupy_copy, inclusive_scan_kernel, inclusive_scan_kernel, add_scan_blocked_sum_kernel, CUDA memcpy DtoH (perhaps to provide the number of elements that need to be set to zero), CUDA memset (perhaps to set an array to zero), and finally cupy_scatter_update_mask (to scatter the zeros to their correct locations, perhaps).
This is a considerably more complex sequence than arr*5, which seems to run a single cupy_multiply under the hood. You can probably do better with a cupy user-defined kernel:
import cupy as cp
clamp_generic = cp.ElementwiseKernel(
    'T x, T c',
    'T y',
    'y = (y > x)?c:y',
    'clamp_generic')
arr = cp.random.normal(0, 1, 16000)
clamp_generic(0.4, 0, arr)
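As a simpler alternative to the boolean-mask assignment, you could also build the result with cp.where, which sidesteps the scan/scatter sequence (my suggestion, not part of the original answer, and not benchmarked here):
# Replace values above the threshold with 0 without a masked assignment.
arr = cp.where(arr > 0.4, 0, arr)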

Pandas speedup when working with transposed numpy matrix

I was trying to figure out which is faster for standardizing data, numpy or pandas, and whether to operate on the whole matrix/DataFrame or column by column, and I found the strange behavior shown in the code below.
import pandas as pd
import numpy as np
def stand(df):
    res = pd.DataFrame()
    for col in df:
        res[col] = (df[col] - df[col].min()) / df[col].max()
    return res
matrix = pd.DataFrame(np.random.randint(0,174000,size=(1000000, 100)))
matrix.shape
(1000000, 100)
%timeit res = (matrix - matrix.min(axis=0))/ matrix.max(axis=0)
2.64 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
5.32 s ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But when starting from a "flipped" numpy matrix and transposing it to create the DataFrame:
matrix = pd.DataFrame(np.random.randint(0,174000,size=(100, 1000000)).T)
matrix.shape
(1000000, 100)
%timeit res = (matrix - matrix.min(axis=0))/ matrix.max(axis=0)
2.37 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
1.2 s ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The column-by-column standardization gets ~4 times faster.
This behavior persists when using .values or numpy operations, as shown below:
%timeit res = (matrix.values - matrix.min(axis=0).values)/ matrix.max(axis=0).values
2.58 s ± 417 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
5.26 s ± 42.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res = np.divide(np.subtract(matrix.values, matrix.min(axis=0).values), matrix.max(axis=0).values)
2.17 s ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Flipped matrix transpose
matrix = pd.DataFrame(np.random.randint(0,174000,size=(100, 1000000)).T)
matrix.shape
(1000000, 100)
%timeit res = (matrix.values - matrix.min(axis=0).values)/ matrix.max(axis=0).values
2.2 s ± 8.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit stand(matrix)
1.33 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res = np.divide(np.subtract(matrix.values, matrix.min(axis=0).values), matrix.max(axis=0).values)
2.46 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Can someone explain why starting from a flipped matrix and then transposing it before creating the DataFrame changes the performance compared to starting from a non-flipped matrix?
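One thing worth checking (my suggestion, not part of the original question) is the memory layout of the two source arrays, since column-wise access tends to be cheaper when each column is contiguous in memory:
a1 = np.random.randint(0, 174000, size=(1000000, 100))
a2 = np.random.randint(0, 174000, size=(100, 1000000)).T
# a1 is C-contiguous (rows contiguous); a2, being a transpose, is
# F-contiguous (columns contiguous).
print(a1.flags['C_CONTIGUOUS'], a1.flags['F_CONTIGUOUS'])
print(a2.flags['C_CONTIGUOUS'], a2.flags['F_CONTIGUOUS'])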

What's the most concise way to iterate over a list by pairs in Python?

I've got the following bruteforce option that allows me to iterate over points:
# [x1, y1, x2, y2, ..., xn, yn]
coords = [1, 1, 2, 2, 3, 3]
# The goal is to operate with (x, y) within for loop
for (x, y) in zip(coords[::2], coords[1::2]):
    # do something with (x, y) as a point
Is there a more concise / efficient way to do it?
(coords -> items)
Short Answer
If you want your items grouped with a specific length of 2, then
zip(items[::2], items[1::2])
is one of the best compromises in terms of speed and clarity.
If you can afford an extra line, you can get a bit (or, for larger inputs, a lot) more efficient by using iterators:
it = iter(items)
zip(it, it)
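For example, applied to the coords list from the question:
coords = [1, 1, 2, 2, 3, 3]
it = iter(coords)
# zip pulls twice from the same iterator on every step, pairing consecutive items
for x, y in zip(it, it):
    print(x, y)  # prints 1 1, then 2 2, then 3 3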
Long Answer
(EDIT: added a method that avoids zip())
You could achieve this in a number of ways.
For convenience, I write those as functions that can be benchmarked.
Also I will leave the size of the group as a parameter n (which, in your case, is 2)
import itertools

def grouping1(items, n=2):
    return zip(*tuple(items[i::n] for i in range(n)))

def grouping2(items, n=2):
    return zip(*tuple(itertools.islice(items, i, None, n) for i in range(n)))

def grouping3(items, n=2):
    for j in range(len(items) // n):
        yield items[j * n:(j + 1) * n]

def grouping4(items, n=2):
    return zip(*([iter(items)] * n))

def grouping5(items, n=2):
    it = iter(items)
    while True:
        result = []
        for _ in range(n):
            try:
                tmp = next(it)
            except StopIteration:
                break
            else:
                result.append(tmp)
        if len(result) == n:
            yield result
        else:
            break
Benchmarking these with a relatively short list gives:
short = list(range(10))
%timeit [x for x in grouping1(short)]
# 1.33 µs ± 9.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping2(short)]
# 1.51 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping3(short)]
# 1.14 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping4(short)]
# 639 ns ± 7.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [x for x in grouping5(short)]
# 3.37 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For medium sized inputs:
medium = list(range(1000))
%timeit [x for x in grouping1(medium)]
# 21.9 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping2(medium)]
# 25.2 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping3(medium)]
# 65.6 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [x for x in grouping4(medium)]
# 18.3 µs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [x for x in grouping5(medium)]
# 257 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For larger inputs:
large = list(range(1000000))
%timeit [x for x in grouping1(large)]
# 49.7 ms ± 840 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping2(large)]
# 37.5 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping3(large)]
# 84.4 ms ± 736 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping4(large)]
# 31.6 ms ± 85.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [x for x in grouping5(large)]
# 274 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As far as efficiency goes, grouping4() seems to be the fastest, closely followed by grouping1() or grouping3() (depending on the size of the input).
In your case, grouping1() seems a good compromise between speed and clarity, unless you are willing to wrap it up in a function.
Note that grouping4() requires you to use the same iterator multiple times and:
zip(iter(items), iter(items))
would NOT work.
If you want more control over uneven grouping, i.e. when len(items) is not divisible by n, you can replace zip with itertools.zip_longest() from the standard library.
Note also that grouping4() is essentially the grouper() recipe from the official itertools documentation.
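For reference, that recipe looks roughly like this (it pads the last group with a fill value when the length is not divisible by n):
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks or blocks.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)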
You can use iter(object) and next(iterator, default) with a known default to leave your loop:
coords = [1, 1, 2, 2, 3, 3]
it = iter(coords)
while it:  # an iterator object is always truthy; the loop exits via break
    x = next(it, None)
    y = next(it, None)
    if x is None or y is None:
        break
    # do something with your pairs
    print(x, y)
Output:
1 1
2 2
3 3

How to perform sum pooling in PyTorch

How do I perform sum pooling in PyTorch? Specifically, if we have input (N, C, W_in, H_in) and want output (N, C, W_out, H_out) using a particular kernel_size and stride, just like nn.MaxPool2d?
You could use torch.nn.AvgPool1d (or torch.nn.AvgPool2d, torch.nn.AvgPool3d), which perform mean pooling, which is proportional to sum pooling. If you really want the summed values, you can multiply the averaged output by the pooling surface.
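For example, a minimal sketch of that idea with a 2x2 window (my example, not from the original answer):
import torch
import torch.nn as nn

x = torch.rand(1, 3, 8, 8)
avg = nn.AvgPool2d(kernel_size=2, stride=2)
# Mean pooling times the window area (2*2) equals the sum over each window.
sum_pooled = avg(x) * (2 * 2)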
In the torch.nn.AvgPool2d documentation (https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html#torch.nn.AvgPool2d), find divisor_override.
Set divisor_override=1 and you'll get a sum pool:
import torch
input = torch.tensor([[[1,2,3],[3,2,1],[3,4,5]]])
sumpool = torch.nn.AvgPool2d(2, stride=1, divisor_override=1)
sumpool(input)
you'll get
tensor([[[ 8, 8],
[12, 12]]])
To expand on benjaminplanche's answer:
I need sum pooling as well and it doesn't seem to exist directly, but it is equivalent to running conv2d with a weight parameter made of ones. I thought it would be faster to run AvgPool2d and multiply by the kernel size product. Turns out, not exactly.
Bottom line up front:
Use torch.nn.functional.avg_pool2d and its related functions and multiply by the kernel size.
Testing in Jupyter (with import torch, import torch.nn as nn, and import torch.nn.functional as F in scope), I find:
(Overhead)
%%timeit
x = torch.rand([1,1,1000,1000])
>>> 3.49 ms ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
_=F.avg_pool2d(torch.rand([1,1,1000,1000]), [10,10])*10*10
>>> 4.99 ms ± 74.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 1.50 ms ± 79.0 µs) (I found the *10*10 only adds around 20 µs to the graph)
avePool = nn.AvgPool2d([10, 10], 1, 0)
%%timeit
_=avePool(torch.rand([1,1,1000,1000]))*10*10
>>> 80.9 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(So 77.4 ms ± 1.58 ms)
y = torch.ones([1,1,10,10])
%%timeit
_=F.conv2d(torch.rand([1,1,1000,1000]), y)
>>> 14.4 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 10.9 ms ± 426 µs)
sumPool = nn.Conv2d(1, 1, 10, 1, 0, 1, 1, False)
sumPool.weight = torch.nn.Parameter(y)
%%timeit
_=sumPool(torch.rand([1,1,1000,1000]))
>>> 7.24 ms ± 63.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 3.75 ms ± 68.3 µs)
And as a sanity check.
abs_err = torch.max(torch.abs(avePool(x)*10*10 - sumPool(x)))
magnitude = torch.max(torch.max(avePool(x)*10*10, torch.max(sumPool(x))))
relative_err = abs_err/magnitude
abs_err.item(), magnitude.item(), relative_err.item()
>>> (3.814697265625e-06, 62.89910125732422, 6.064788493631568e-08)
That's probably a reasonable rounding-related error.
I do not know why the functional version is faster than making a dedicated kernel, but if you want to make a dedicated kernel, prefer the Conv2d version and make the weights untrainable, either with sumPool.weight.requires_grad = False or by creating the kernel parameters under with torch.no_grad():. These results may change with kernel size, so test for your own application if you need to speed up this part. Let me know if I missed something...
