Einsum is slow for tensor multiplication - python-3.x

I'm trying to optimize a particular piece of code that calculates the Mahalanobis distance in a vectorized manner. I have a standard implementation which uses traditional Python multiplication, and another implementation which uses einsum. However, I'm surprised that the einsum implementation is slower than the standard Python implementation. Is there anything I'm doing inefficiently in einsum, or are there other methods, such as tensordot, that I should be looking into?
# SETUP
import numpy as np
BATCH_SZ = 128
GAUSSIANS = 100
xvals = np.random.random((BATCH_SZ, 1, 4))
means = np.random.random((GAUSSIANS, 1, 4))
inv_covs = np.random.random((GAUSSIANS, 4, 4))
%%timeit
xvals_newdim = xvals[:, np.newaxis, ...]
means_newdim = means[np.newaxis, ...]
diff_newdim = xvals_newdim - means_newdim
regular = diff_newdim @ inv_covs @ (diff_newdim).transpose(0, 1, 3, 2)
>> 731 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
diff = xvals - means.squeeze(1)
einsum = np.einsum("ijk,jkl,ijl->ij", diff, inv_covs, diff)
>> 949 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

First things first: to optimize code like this, one needs to understand what is going on, then profile, then estimate the achievable time, and only then look for a better solution.
TL;DR: both versions are inefficient and run serially. Neither BLAS libraries nor Numpy are designed to optimize this use case. In fact, even basic Numpy operations are inefficient when the last axis is very small (i.e. 4). This can be optimized with Numba by writing an implementation designed specifically for your matrix sizes.
Understanding
@ is a Python operator, but like + or * it calls a Numpy function internally. It loops over all the matrices and calls a highly optimized BLAS implementation on each of them. A BLAS is a numerical linear algebra library. Many BLAS implementations exist, but the default one for Numpy is generally OpenBLAS, which is well optimized, especially for large matrices. Note also that np.einsum can call BLAS implementations for specific patterns (provided the optimize flag is set appropriately), but this is not the case here. It is also worth mentioning that np.einsum is well optimized for 2 input matrices, less well for 3, and not optimized for more. This is because the number of possibilities grows exponentially and the optimizations are written by hand. For more information about how np.einsum works, please read How is numpy.einsum implemented?.
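To make the comparison concrete, here is a minimal sketch (using the shapes from the question, with freshly generated random data) that checks the batched `@` formulation and the einsum formulation compute the same thing, and shows where the `optimize` flag goes. Whether `optimize=True` helps on a given machine is not something this sketch measures.

```python
import numpy as np

rng = np.random.default_rng(0)
xvals = rng.random((128, 1, 4))
means = rng.random((100, 1, 4))
inv_covs = rng.random((100, 4, 4))

diff = xvals - means.squeeze(1)              # shape (128, 100, 4)

# Batched matmul: one tiny (1x4) @ (4x4) @ (4x1) product per (i, j) pair
row = diff[:, :, np.newaxis, :]              # shape (128, 100, 1, 4)
via_matmul = (row @ inv_covs @ row.transpose(0, 1, 3, 2))[..., 0, 0]

# Single three-operand einsum, as in the question
via_einsum = np.einsum("ijk,jkl,ijl->ij", diff, inv_covs, diff)

# Same contraction, letting einsum pick a pairwise contraction order
via_einsum_opt = np.einsum("ijk,jkl,ijl->ij", diff, inv_covs, diff, optimize=True)

print(np.allclose(via_matmul, via_einsum), np.allclose(via_einsum, via_einsum_opt))
```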
Profiling
The thing is, you are multiplying many very small matrices, and most BLAS implementations are not optimized for that. In fact, neither is Numpy: the cost of the generic loop iteration can become large compared to the computation, not to mention the function call to the BLAS. Profiling Numpy shows that the slowest function of the np.einsum implementation is PyArray_TransferNDimToStrided. This function is not the main computing function but a helper. In fact, the main computing function takes only 20% of the overall time, which leaves a lot of room for improvement! The same is true for the BLAS implementation: cblas_dgemv only takes about 20%, as does dgemv_n_HASWELL (the main computing kernel of the BLAS cblas_dgemv function). The rest is nearly pure overhead of the BLAS library or Numpy (roughly half the time for both). Moreover, both versions run serially. Indeed, np.einsum is not optimized to run with multiple threads, and the BLAS cannot use multiple threads since the matrices are too small for multiple threads to be useful (multi-threading has a significant overhead). This means both versions are pretty inefficient.
Performance metric
To know how inefficient the versions are, one needs to know the amount of computation to do and the speed of the processor. The number of Flop (floating-point operations) is provided by np.einsum_path and is 5.120e+05 (for an optimized implementation; otherwise it is 6.144e+05). Mainstream CPUs usually reach >=100 GFlops/s with multiple threads and dozens of GFlops/s serially. For example, my i5-9600KF processor can achieve 300-400 GFlops/s in parallel and 50-60 GFlops/s serially. Since the computation lasts 0.52 ms for the BLAS version (the best one), the code runs at about 1 GFlops/s (5.12e5 Flop / 0.52e-3 s), which is a poor result compared to the optimum.
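The estimate above can be reproduced with a short sketch. It asks np.einsum_path for its report (which includes the naive and optimized Flop counts) and then redoes the throughput arithmetic with the figures quoted in this answer; the 0.52 ms timing is the author's measurement, not something this snippet measures.

```python
import numpy as np

diff = np.random.random((128, 100, 4))
inv_covs = np.random.random((100, 4, 4))

# einsum_path returns the chosen contraction path and a human-readable report
path, report = np.einsum_path(
    "ijk,jkl,ijl->ij", diff, inv_covs, diff, optimize="greedy"
)
print(report)

# Throughput arithmetic using the numbers quoted in the text
flop = 5.120e5        # optimized Flop count quoted above
seconds = 0.52e-3     # timing of the BLAS version quoted above
gflops = flop / seconds / 1e9
print(f"achieved throughput: {gflops:.2f} GFlop/s")
```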
Optimization
One solution to speed up the computation is to design a Numba (JIT compiler) or Cython (Python-to-C compiler) implementation that is optimized for your specific matrix sizes. Indeed, the last dimension is too small for generic code to be fast. Even basic compiled code would not be very fast in this case: the overhead of a C loop can be quite large compared to the actual computation. We can tell the compiler that the size of some matrix axes is small and fixed at compilation time so that it can generate much faster code (thanks to loop unrolling, tiling and SIMD instructions). This can be done with a basic assert in Numba. In addition, we can use the fastmath=True flag to speed up the computation even more, provided no special floating-point (FP) values like NaN or subnormal numbers are used. This flag can also impact the accuracy of the result since it assumes FP math is associative (which is not true). Put shortly, it breaks the IEEE-754 standard for the sake of performance. Here is the resulting code:
import numba as nb
import numpy as np
# use `fastmath=True` for better performance if there is no
# special value used and the accuracy is not critical.
@nb.njit('(float64[:,:,::1], float64[:,:,::1])', fastmath=True)
def compute_fast_einsum(diff, inv_covs):
    ni, nj, nk = diff.shape
    nl = inv_covs.shape[2]
    assert inv_covs.shape == (nj, nk, nl)
    # Fixed small sizes let the compiler unroll the two inner loops
    assert nk == 4 and nl == 4
    res = np.empty((ni, nj), dtype=np.float64)
    for i in range(ni):
        for j in range(nj):
            s = 0.0
            for k in range(nk):
                for l in range(nl):
                    s += diff[i, j, k] * inv_covs[j, k, l] * diff[i, j, l]
            res[i, j] = s
    return res
%%timeit
diff = xvals - means.squeeze(1)
compute_fast_einsum(diff, inv_covs)
Results
Here are performance results on my machine (mean ± std. dev. of 7 runs, 1000 loops each):
@ operator: 602 µs ± 3.33 µs per loop
einsum: 698 µs ± 4.62 µs per loop
Numba code: 193 µs ± 544 ns per loop
Numba + fastmath: 177 µs ± 624 ns per loop
Best Numba: < 100 µs <------ 6x-7x faster !
Note that 100 µs is spent computing diff, which is not efficient. This can also be optimized with Numba. In fact, the values of diff can be computed on the fly in the i-based loop from the other arrays. This makes the computation more cache friendly. This version is called "best Numba" in the results. Note that the Numba versions are not even using multiple threads. That being said, the overhead of multi-threading is generally about 5-500 µs, so using multiple threads may be slower on some machines (on mainstream PCs, i.e. not compute servers, the overhead is generally 5-100 µs, and it is about 10 µs on my machine).
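The "best Numba" code itself is not shown in the answer. The following is only a sketch of what such a fused version could look like, assuming 2D inputs of shape (n, 4) for the points and the means; the function name `mahalanobis_fused` and the plain-Python fallback decorator are illustrative choices, not the author's code. The fallback only exists so the sketch still runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit(fastmath=False)   # set fastmath=True only if accuracy allows
except ImportError:                 # illustrative fallback: run as plain Python
    def jit(func):
        return func

@jit
def mahalanobis_fused(xvals, means, inv_covs):
    # xvals: (ni, 4), means: (nj, 4), inv_covs: (nj, 4, 4)
    ni = xvals.shape[0]
    nj = means.shape[0]
    assert xvals.shape[1] == 4 and means.shape[1] == 4
    assert inv_covs.shape[0] == nj
    assert inv_covs.shape[1] == 4 and inv_covs.shape[2] == 4
    res = np.empty((ni, nj), dtype=np.float64)
    d = np.empty(4, dtype=np.float64)
    for i in range(ni):
        for j in range(nj):
            # diff is computed on the fly instead of being materialized
            for k in range(4):
                d[k] = xvals[i, k] - means[j, k]
            s = 0.0
            for k in range(4):
                for l in range(4):
                    s += d[k] * inv_covs[j, k, l] * d[l]
            res[i, j] = s
    return res
```

With the arrays from the question, a call would look like `mahalanobis_fused(xvals[:, 0, :], means[:, 0, :], inv_covs)`.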

Related

What is the fastest way to sum 2 matrices using Numba?

I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy.
Here is my code:
import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time
# function 1:
def sum_numpy(A,B):
    return A+B
# function 2:
sum_numba_simple= njit(cache=True,fastmath=True) (sum_numpy)
# function 3:
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
    return A+B
# function 4:
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
    n=A.shape[0]
    m=A.shape[1]
    C = np.empty((n, m), A.dtype)
    for i in prange(n):
        for j in prange(m):
            C[i,j]=A[i,j]+B[i,j]
    return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation ?
Your code is bound by page faults (see here, here and there for more information about this). Page faults happen because the array is newly allocated. A solution is to preallocate it and then write into it so as not to cause pages to be remapped in physical memory. np.add(A, B, out=C) does this, as indicated by @August in the comments. Another solution could be to adapt the standard allocator so that it does not give memory back to the OS, at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that, for example).
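A minimal sketch of the preallocation approach, using the array sizes from the question: the output buffer `C` is allocated once and reused, so only the first call pays for mapping its pages.

```python
import numpy as np

rng = np.random.default_rng(123)
A = rng.uniform(-10, 10, size=(1_000_000, 3))
B = rng.uniform(-5, 5, size=(1_000_000, 3))

# Allocate the output once, outside the hot loop
C = np.empty_like(A)

for _ in range(3):
    # Writes into the existing buffer instead of allocating a new array
    np.add(A, B, out=C)
```

When benchmarking, the same `C` should be passed to every timed call; allocating it inside the timed statement would bring the page faults back.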
There is another issue on most platforms (especially x86 ones): the cache-line write allocations of write-back caches are expensive during writes. The typical way to avoid this is to use non-temporal stores (if available on the target processor, which is the case on x86-64 but maybe not on others). That being said, neither Numpy nor Numba is able to do that yet. For Numba, I filed an issue covering a simple use case. Compilers themselves (GCC for Numpy and Clang for Numba) tend not to generate such instructions because they can hurt performance when arrays fit in cache, and compilers do not know the size of the array at compile time (they could generate specific code when they can evaluate the amount of data computed, but this is not easy and can slow down other code). AFAIK, the only possible way to fix this is to write C code and use low-level instructions or compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown of up to 33%.
Using multiple threads does not always make memory-bound code faster. In fact, it generally barely scales, because using more cores does not speed up the execution when the RAM is already saturated. Only a few cores are generally required to saturate the RAM on most platforms. Page faults can benefit from multiple cores depending on the target system (Linux handles them in parallel quite well, Windows generally does not scale well, IDK for macOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine, although it could be). One solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input arrays should also be specified so that the compiler can generate fast SIMD code. Here is the resulting Numba code:
@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
    n, m = A.shape
    assert C.shape == A.shape
    # Flatten the contiguous arrays so the whole sum is one long loop
    A_flat = A.reshape(n*m)
    B_flat = B.reshape(n*m)
    C_flat = C.reshape(n*m)
    for i in prange(n*m):
        C_flat[i]=A_flat[i]+B_flat[i]
    return C
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The Numba code and the Numpy one saturate my RAM. Using more cores does not help (in fact it is a bit slower, most likely due to contention on the memory controller). Both are sub-optimal since they do not use non-temporal store instructions, which can prevent cache-line write allocations (these cause data to be fetched from the RAM before being written back). The optimal time is the one expected when using such instructions. Note that reaching only 65-80% of the RAM bandwidth is expected because of mixed RAM reads and writes. Indeed, interleaving reads and writes causes low-level overheads that prevent the RAM from being saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).
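The GiB/s column can be reproduced from the timings with a little arithmetic. The sketch below assumes (this is an inference from the numbers, not something the answer states) that each addition moves four array-sized streams: two reads, one write, plus the extra read caused by the cache-line write allocation. The helper name `effective_bandwidth_gib` is illustrative.

```python
def effective_bandwidth_gib(seconds, n_rows=1_000_000, n_cols=3,
                            repeats=100, streams=4, itemsize=8):
    """Memory traffic in GiB/s for `repeats` additions of float64 arrays.

    streams=4 assumes read A + read B + write C + write-allocate read of C.
    """
    total_bytes = n_rows * n_cols * itemsize * streams * repeats
    return total_bytes / seconds / 2**30

# Timings taken from the results listed above
print(round(effective_bandwidth_gib(0.281), 1))  # np.add(A, B, out=C)
print(round(effective_bandwidth_gib(0.288), 1))  # Numba fast
```

With non-temporal stores the write-allocate read disappears (`streams=3`), which is where the roughly 25% bandwidth saving mentioned above comes from.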

Why is buffered_arange is faster than torch.arange in PyTorch?

I am studying the code for wav2vecv2, and they have written their own function buffered_arange which has exactly the same functionality as torch.arange.
def buffered_arange(max):
    if not hasattr(buffered_arange, "buf"):
        buffered_arange.buf = torch.LongTensor()
    if max > buffered_arange.buf.numel():
        buffered_arange.buf.resize_(max)
        torch.arange(max, out=buffered_arange.buf)
    return buffered_arange.buf[:max]
But it seems buffered_arange is much faster than torch.arange, despite performing more operations than a single call to torch.arange:
>>>%%timeit
>>>buffered_arange(10)
1.19 µs ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%%timeit
>>>torch.arange(10)
2.26 µs ± 8.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
May I know what is the logic behind it?
The main difference here is that buffered_arange is actually "buffered": it reuses the same tensor on every call. If you call buffered_arange with an upper limit that is less than or equal to all previous ones, you'll probably get a speed improvement, as only a slice of an existing tensor is returned. If, on the other hand, the upper limit is larger in every subsequent call, the buffered tensor must be enlarged every time.
On the other hand torch.arange creates a new tensor at every call.

list.count() vs Counter() performance

While trying to find the frequency of a bunch of characters in a string, why does running string.count(character) 4 times for 4 different characters yield faster execution time (using time.time()) than using a collections.Counter(string)?
Background:
Given a sequence of moves represented by a string. Valid moves are R (right), L (left), U (up), and D (down). Return True if the sequence of moves takes me back to the origin. Otherwise, return false.
# approach - 1 : iterate 4 times (3.9*10^-6 seconds)
def foo1(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')
# approach - 2 iterate once (3.9*10^-5 seconds)
def foo2(moves):
    from collections import Counter
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']
import time
start = time.time()
moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
foo1(moves)
# foo2(moves)
end = time.time()
print("--- %s seconds ---" % (end - start))
These results are the opposite of what I had expected. My reasoning is that first approach should take longer because the string is iterated over 4 times whereas in the second approach, we iterate only once. Could it be due to the library call overhead?
Counter is faster in theory, but has higher fixed overhead, especially compared to str.count, which can scan the underlying C array with direct memory comparisons, where list.count has to do rich comparisons for each element; converting moves to a list of single characters nearly triples the time for foo1 in local tests, from 448 ns to 1.3 μs (while foo2 actually gets a tiny bit faster, dropping from 5.6 μs to 5.48 μs).
Other problems:
1. Importing an already imported module uses the cached import, but there is a surprising amount of overhead involved in even a cached import (the loading machinery has a lot of stuff to check to make sure it's okay to do so); in local tests, moving from collections import Counter to the top level reduced the runtime of foo2 by 1.6 μs (5.6 μs with single global import, 7.2 μs with local per-call import). This will vary a lot by environment; on another machine (with less stuff installed in both user and system site-packages), the overhead was only 0.75 μs. Regardless, it's a significant, avoidable disadvantage for foo2.
2. Counter on modern Python uses a C accelerator to speed up counting, but the accelerator only provides a benefit when the iterable is long enough. If you use the list form of moves, but multiply it by 100 to make a longer sequence, the difference drops, relatively speaking (to 106 µs for foo1 vs. 140 µs for foo2).
3. You're just not counting very many things; when there are only four things you care about, paying O(n) four times can easily beat paying O(n) once if the former case has lower constant multipliers (which aren't included in big-O notation) than the latter. Counter remains O(n) for any number of unique things being counted; calling .count is O(n) per call, but if you need to know the count of every unique thing in the input, for inputs that are mostly unique, individual .count calls for each will be asymptotically O(n²).
4. The .count approach is short-circuiting in your specific case, so it isn't even doing O(n) work four times, just twice; the U and D counts don't match, so it never counts L and R at all. Counter doesn't get meaningfully slower if it can't short-circuit (all the cost is paid in the single counting pass), but your foo1, in the same benchmark I used from point #2 (longer input, in list form), goes from 106 µs to 185 µs if I just add a single D to the end of the (pre-multiplication) moves (making the U and D counts the same, and requiring two more count calls); foo2 only goes up to 143 µs (from 140 µs), presumably because moves actually got longer (adding the D before multiplying by 100 meant it went from 2900 elements to count to 3000).
Basically, you had some minor implementation weaknesses, but mostly, you happened to choose a use case that gave all the advantage to .count, none to Counter. If your inputs are always str, and you're only counting them a small, fixed number of times, then sure, repeated calls to count are generally going to win. But for arbitrary input types (especially iterators, where count is impossible, both because it doesn't exist, and because you can only iterate it once), especially larger ones, with more unique things to count, where consistent performance counts (so relying on short-circuiting to reduce the number of count calls isn't acceptable), Counter will win.
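A single time.time() measurement of a microsecond-scale call is dominated by noise, so the comparison is better made with timeit. The sketch below is one way to do that, with the Counter import moved to module level as suggested above; the repeat and number values are arbitrary choices, and the absolute timings will differ by machine.

```python
import timeit
from collections import Counter  # imported once, not on every call

def foo1(moves):
    # Up to four C-level scans of the string (fewer if the first test fails)
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')

def foo2(moves):
    # One pass, but with the fixed cost of building a Counter
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']

moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
number = 10_000
for func in (foo1, foo2):
    best = min(timeit.repeat(lambda: func(moves), number=number, repeat=5))
    print(f"{func.__name__}: {best / number * 1e6:.2f} µs per call")
```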

Different behaviour of time.sleep() in python 2 and python 3

I use a time.sleep(0.0000001) in loops in a multithreading application because otherwise it won't respond to a thread.interrupt_main() invoked in a backdoor server thread.
I wanted the sleep to be as short as possible in order for the task to run as fast as possible, yet to be still responsive to the interrupt.
It worked fine with python 2.7 (2.7.9 / Anaconda 2.2.0). Using it with python 3.5.1 (Anaconda 4.1.0) on the same machine, it lasted much longer, slowing everything down significantly.
Further investigations in ipython showed the following in python 2
%timeit time.sleep(0.0000001)
resulted in: 1000000 loops, best of 3: 3.72 µs per loop
%timeit time.sleep(0.000000)
resulted in: 1000000 loops, best of 3: 3.86 µs per loop
In python 3 this was different:
%timeit time.sleep(0.0000001)
resulted in: 100 loops, best of 3: 4 ms per loop
%timeit time.sleep(0.000000)
resulted in: 1000000 loops, best of 3: 7.87 µs per loop
I know about the system-dependent resolution of time.sleep(), which is definitely larger than 0.0000001. So what I'm doing is basically calling time.sleep() as an interrupt.
What explains the difference of about 1000 times between python 2 and python 3?
Can it be changed?
Is there an alternative way to make an application/thread more responsive to thread.interrupt_main()?
Edit:
My first approach of reducing the time parameter showed a time reduction in Python 2 for values less than 0.000001, which didn't work in Python 3.
A value of 0 seems to work now for both versions.

Why are python's for loops so non-linear for large inputs?

While benchmarking some Python code, I noticed something strange. I used the following function to measure how long it took to iterate through an empty for loop:
def f(n):
    t1 = time.time()
    for i in range(n):
        pass
    print(time.time() - t1)
f(10**6) prints about 0.035, f(10**7) about 0.35, f(10**8) about 3.5, and f(10**9) about 35. But f(10**10)? Well over 2000. That's certainly unexpected. Why would it take over 60 times as long to iterate through 10 times as many elements? What's with python's for loops that causes this? Is this python-specific, or does this occur in a lot of languages?
When you go from 10^9 to 10^10 you leave the 32-bit integer range (the signed limit is 2^31 - 1, about 2.1 × 10^9). Python 3 then transparently moves you onto arbitrary-precision integers, which are much slower to allocate and use.
In general, working with such big numbers is one of the areas where Python 3 is a lot slower than Python 2 (which at least had fast 64-bit integers on many systems). On the plus side, it makes Python easier to use, with fewer overflow-type errors.
Some accurate timings using timeit show that the times actually increase roughly in line with the input size, so your timings seem to be quite a way off:
In [2]: for n in [10**6,10**7,10**8,10**9,10**10]:
   ...:     %timeit f(n)
   ...:
10 loops, best of 3: 22.8 ms per loop
1 loops, best of 3: 226 ms per loop # roughly ten times previous
1 loops, best of 3: 2.26 s per loop # roughly ten times previous
1 loops, best of 3: 23.3 s per loop # roughly ten times previous
1 loops, best of 3: 4min 18s per loop # roughly ten times previous
Using xrange and Python 2 we see roughly the same ratio; Python 2 is much faster overall because Python 3's int is what used to be Python 2's long:
In [5]: for n in [10**6,10**7,10**8,10**9,10**10]:
   ...:     %timeit f(n)
   ...:
100 loops, best of 3: 11.3 ms per loop
10 loops, best of 3: 113 ms per loop
1 loops, best of 3: 1.13 s per loop
1 loops, best of 3: 11.4 s per loop
1 loops, best of 3: 1min 56s per loop
The actual difference in run time seems to be related more to the size of Windows' long than directly to Python 3. The difference is marginal on Unix, which handles longs quite differently from Windows, so this is a platform-specific issue as much as, if not more than, a Python one.
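The platform difference the answer points to can be inspected directly. This small sketch prints the width of the C long type on the current platform (32 bits on 64-bit Windows, typically 64 bits on 64-bit Linux and macOS) next to sys.maxsize. It only shows the platform property; whether it explains the timings on a particular Python version is the answer's claim and is not verified here.

```python
import ctypes
import sys

# Width of the platform's C `long`, in bits
bits_c_long = 8 * ctypes.sizeof(ctypes.c_long)

print(f"C long is {bits_c_long} bits wide on this platform")
print(f"sys.maxsize = {sys.maxsize}")
print(f"10**10 fits in a C long: {10**10 < 2 ** (bits_c_long - 1)}")
```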
