Different behaviour of time.sleep() in python 2 and python 3 - multithreading

I use a time.sleep(0.0000001) in loops in a multithreaded application because otherwise it won't respond to a thread.interrupt_main() invoked in a backdoor server thread.
I wanted the sleep to be as short as possible in order for the task to run as fast as possible, yet to be still responsive to the interrupt.
It worked fine with python 2.7 (2.7.9 / Anaconda 2.2.0). Using it with python 3.5.1 (Anaconda 4.1.0) on the same machine, it lasted much longer, slowing everything down significantly.
Further investigation in IPython showed the following. In Python 2,
%timeit time.sleep(0.0000001)
resulted in: 1000000 loops, best of 3: 3.72 µs per loop
%timeit time.sleep(0.000000)
resulted in: 1000000 loops, best of 3: 3.86 µs per loop
In python 3 this was different:
%timeit time.sleep(0.0000001)
resulted in: 100 loops, best of 3: 4 ms per loop
%timeit time.sleep(0.000000)
resulted in: 1000000 loops, best of 3: 7.87 µs per loop
I know about the system-dependent resolution of time.sleep(), which is definitely larger than 0.0000001. So what I'm doing is basically using the time.sleep() call as an interruption point.
What explains the difference of about 1000 times between python 2 and python 3?
Can it be changed?
Is there an alternative way to make an application/thread more responsive to thread.interrupt_main()?
Edit:
My first approach of reducing the time parameter showed a time reduction in Python 2 for values less than 0.000001, which didn't work in Python 3.
A value of 0 seems to work now for both versions.
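For reference, here is a minimal sketch of the pattern (simplified, not my actual application); it is written for Python 3, where the module is _thread rather than thread:
import _thread      # called "thread" in Python 2
import threading
import time

def backdoor():
    # stand-in for the backdoor server thread: after a while it
    # interrupts the main thread, raising KeyboardInterrupt there
    time.sleep(2)
    _thread.interrupt_main()

threading.Thread(target=backdoor, daemon=True).start()

try:
    while True:
        _ = sum(range(1000))   # placeholder for the loop's real work
        time.sleep(0)          # yield so the pending interrupt is delivered
except KeyboardInterrupt:
    print("interrupted")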

Related

Einsum is slow for tensor multiplication

I'm trying to optimize a particular piece of code to calculate the mahalanobis distance in a vectorized manner. I have a standard implementation which uses traditional Python matrix multiplication, and another implementation which uses einsum. However, I'm surprised that the einsum implementation is slower than the standard implementation. Is there anything I'm doing inefficiently in einsum, or are there potentially other methods such as tensordot that I should be looking into?
#SETUP
import numpy as np
BATCH_SZ = 128
GAUSSIANS = 100
xvals = np.random.random((BATCH_SZ, 1, 4))
means = np.random.random((GAUSSIANS, 1, 4))
inv_covs = np.random.random((GAUSSIANS, 4, 4))
%%timeit
xvals_newdim = xvals[:, np.newaxis, ...]
means_newdim = means[np.newaxis, ...]
diff_newdim = xvals_newdim - means_newdim
regular = diff_newdim @ inv_covs @ diff_newdim.transpose(0, 1, 3, 2)
>> 731 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
diff = xvals - means.squeeze(1)
einsum = np.einsum("ijk,jkl,ijl->ij", diff, inv_covs, diff)
>> 949 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
First things first. One needs to understand what is going on in order to optimize such code, then profile it, then estimate the time it should take, and only then look for a better solution.
TL;DR: both versions are inefficient and run serially. Neither BLAS libraries nor Numpy are designed to optimize this use case. In fact, even basic Numpy operations are not efficient when the last axis is very small (i.e. 4). This can be optimized using Numba by writing an implementation specifically designed for your matrix sizes.
Understanding
@ is a Python operator, but it calls a Numpy function internally, just like + or *. Here it performs a loop iterating over all the matrices and calls a highly optimized BLAS implementation on each one. A BLAS is a numerical linear-algebra library. There are many existing BLAS implementations, but the default one for Numpy is generally OpenBLAS, which is quite well optimized, especially for large matrices. Please note also that np.einsum can call BLAS implementations for specific patterns (if the optimize flag is set properly), but this is not the case here. It is also worth mentioning that np.einsum is well optimized for 2 input matrices, less well for 3 matrices, and not optimized for more. This is because the number of possibilities grows exponentially and the code does this optimization manually. For more information about how np.einsum works, please read How is numpy.einsum implemented?.
Profiling
The thing is you are multiplying many very small matrices, and most BLAS implementations are not optimized for that. Neither is Numpy: the cost of the generic loop iteration can become large compared to the computation itself, not to mention the function call into the BLAS. A profiling of Numpy shows that the slowest function of the np.einsum implementation is PyArray_TransferNDimToStrided. This is not the main computing function but a helper. In fact, the main computing function takes only 20% of the overall time, which leaves a lot of room for improvement! The same is true for the BLAS implementation: cblas_dgemv only takes about 20%, as does dgemv_n_HASWELL (the main computing kernel of the BLAS cblas_dgemv function). The rest is nearly pure overhead of the BLAS library or Numpy (roughly half the time for each). Moreover, both versions run serially. Indeed, np.einsum is not optimized to run with multiple threads, and the BLAS cannot use multiple threads since the matrices are too small for multi-threading to pay off (it has a significant overhead). This means both versions are pretty inefficient.
Performance metric
To know how inefficient the versions are, one needs to know the amount of computation to do and the speed of the processor. The number of Flop (floating-point operations) is provided by np.einsum_path and is 5.120e+05 (for an optimized implementation, otherwise it is 6.144e+05). Mainstream CPUs usually perform >=100 GFlops/s with multiple threads and dozens of GFlops/s serially. For example, my i5-9600KF processor can achieve 300-400 GFlops/s in parallel and 50-60 GFlops/s serially. Since the computation lasts 0.52 ms for the BLAS version (the best one), this means the code runs at about 1 GFlops/s, which is a poor result compared to the optimum.
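The Flop figures quoted above come from np.einsum_path; here is a minimal sketch to reproduce them (the arrays are re-created with the question's shapes so the snippet is self-contained):
import numpy as np

# same shapes as diff = xvals - means.squeeze(1) and inv_covs
diff = np.random.random((128, 100, 4))
inv_covs = np.random.random((100, 4, 4))

# the report includes the naive and optimized FLOP counts
path, info = np.einsum_path("ijk,jkl,ijl->ij", diff, inv_covs, diff,
                            optimize="optimal")
print(info)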
Optimization
One solution to speed up the computation is to write a Numba (JIT compiler) or Cython (Python-to-C compiler) implementation that is optimized for your specific matrix sizes. Indeed, the last dimension is too small for generic code to be fast. Even a basic compiled code would not be very fast in this case: even the overhead of a C loop can be quite big compared to the actual computation. We can tell the compiler that the size of some matrix axes is small and fixed at compile time, so the compiler can generate much faster code (thanks to loop unrolling, tiling and SIMD instructions). This can be done with a basic assert in Numba. In addition, we can use the fastmath=True flag to speed up the computation even more, provided there are no special floating-point (FP) values like NaN or subnormal numbers involved. This flag can also impact the accuracy of the result, since it assumes FP math is associative (which is not true). Put shortly, it breaks the IEEE-754 standard for the sake of performance. Here is the resulting code:
import numba as nb
import numpy as np

# use `fastmath=True` for better performance if there is no
# special value used and the accuracy is not critical.
@nb.njit('(float64[:,:,::1], float64[:,:,::1])', fastmath=True)
def compute_fast_einsum(diff, inv_covs):
    ni, nj, nk = diff.shape
    nl = inv_covs.shape[2]
    assert inv_covs.shape == (nj, nk, nl)
    assert nk == 4 and nl == 4
    res = np.empty((ni, nj), dtype=np.float64)
    for i in range(ni):
        for j in range(nj):
            s = 0.0
            for k in range(nk):
                for l in range(nl):
                    s += diff[i, j, k] * inv_covs[j, k, l] * diff[i, j, l]
            res[i, j] = s
    return res
%%timeit
diff = xvals - means.squeeze(1)
compute_fast_einsum(diff, inv_covs)
Results
Here are performance results on my machine (mean ± std. dev. of 7 runs, 1000 loops each):
# operator: 602 µs ± 3.33 µs per loop
einsum: 698 µs ± 4.62 µs per loop
Numba code: 193 µs ± 544 ns per loop
Numba + fastmath: 177 µs ± 624 ns per loop
Best Numba: < 100 µs <------ 6x-7x faster !
Note that about 100 µs is spent computing diff, which is not efficient. This can also be optimized with Numba. In fact, the value of diff can be computed on the fly in the i-based loop from the other arrays. This makes the computation more cache friendly. This version is called "best Numba" in the results. Note that the Numba versions are not even using multiple threads. That being said, the overhead of multi-threading is generally about 5-500 µs, so it may be slower on some machines to use multiple threads (on mainstream PCs, i.e. not computing servers, the overhead is generally 5-100 µs, and it is about 10 µs on my machine).
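Here is a sketch of how that "best Numba" variant might look (illustrative only, not my exact code): diff is folded into the loops instead of being materialized first.
import numba as nb
import numpy as np

# Shapes follow the question: xvals is (BATCH_SZ, 1, 4),
# means is (GAUSSIANS, 1, 4), inv_covs is (GAUSSIANS, 4, 4).
@nb.njit('(float64[:,:,::1], float64[:,:,::1], float64[:,:,::1])', fastmath=True)
def compute_fused(xvals, means, inv_covs):
    ni = xvals.shape[0]
    nj = inv_covs.shape[0]
    res = np.empty((ni, nj), dtype=np.float64)
    for i in range(ni):
        for j in range(nj):
            s = 0.0
            for k in range(4):
                # compute the diff values on the fly instead of storing them
                d_k = xvals[i, 0, k] - means[j, 0, k]
                for l in range(4):
                    d_l = xvals[i, 0, l] - means[j, 0, l]
                    s += d_k * inv_covs[j, k, l] * d_l
            res[i, j] = s
    return res

# usage: res = compute_fused(xvals, means, inv_covs)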

Why is buffered_arange faster than torch.arange in PyTorch?

I am studying the code for wav2vecv2, and they have written their own function buffered_arange which has exactly the same functionality as torch.arange.
def buffered_arange(max):
    if not hasattr(buffered_arange, "buf"):
        buffered_arange.buf = torch.LongTensor()
    if max > buffered_arange.buf.numel():
        buffered_arange.buf.resize_(max)
        torch.arange(max, out=buffered_arange.buf)
    return buffered_arange.buf[:max]
But it seems buffered_arange is much faster than torch.arange, despite doing more work than a single torch.arange call:
>>>%%timeit
>>>buffered_arange(10)
1.19 µs ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%%timeit
>>>torch.arange(10)
2.26 µs ± 8.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
May I know what is the logic behind it?
The main difference here is that buffered_arange is actually "buffered": it reuses the same tensor across calls of the function. So if you call buffered_arange with an upper limit that is less than or equal to all previous ones, you'll probably get a speed improvement, as only a slice of an existing tensor is returned; if, on the other hand, the upper limit is larger on every subsequent call, the buffered tensor must be enlarged every time.
torch.arange, on the other hand, creates a new tensor on every call.
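A small check of that explanation (a sketch; the exact numbers will differ): reusing a preallocated tensor via out= avoids the per-call allocation that a plain torch.arange pays.
import timeit
import torch

buf = torch.empty(10, dtype=torch.long)

# fresh tensor on every call vs. writing into an existing buffer
t_new = timeit.timeit(lambda: torch.arange(10), number=100_000)
t_buf = timeit.timeit(lambda: torch.arange(10, out=buf), number=100_000)
print(f"new tensor each call: {t_new:.3f} s")
print(f"reused buffer:        {t_buf:.3f} s")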

10 s delay time for time.sleep in ipython after matplotlib 3.x.x import

I realize that time.sleep() calls the OS timer and can vary a little (as this answer points out), but I am seeing a very large delay in IPython. It seems to happen only after I have imported matplotlib.pyplot; then, after waiting about 30 seconds, it starts to lag. To give a working example, try the following in IPython:
>>import matplotlib.pyplot as plt
# after 30 seconds
>>%time time.sleep(1)
CPU times: user 5.27 ms, sys: 3.58 ms, total: 8.85 ms
Wall time: 11 s
Using slightly longer times in sleep appears to have an additive effect:
>>%time time.sleep(3)
CPU times: user 4.75 ms, sys: 3.7 ms, total: 8.45 ms
Wall time: 13 s
Very occasionally the wall time is appropriate, but only about 1/10 of the time. I also tried boxing the sleep in a function as follows:
>>def test():
    start = time.time()
    for i in range(4):
        time.sleep(1)
        print(f'{time.time() - start}')
>>test()
11.000279188156128
22.000601053237915
33.000962018966675
44.001291036605835
This also occasionally shows smaller time steps, but this is the usual output. I also put the same function in a separate file and used %run script.py in iPython, with the same result. Thus, it happens anytime time.sleep is called.
The only things that seem to work are (a) not importing matplotlib.pyplot
or (b) defining a function based on a simple all-python timer:
>>def dosleep(t):
    start = time.time()
    while time.time() - start < t:
        continue
>>%time dosleep(2)
CPU times: user 1.99 s, sys: 8.4 ms, total: 2 s
Wall time: 2 s
The last example seems like a good solution, but I have a decent amount of code that relies on time.sleep() already, and I would like to still use Jupyter with an Ipython kernel. Is there any way to determine what is holding it up, or are there any tips on how to decrease the lag time? I'm just wondering what sort of thing could cause this.
I'm on Mac OS X 10.14.3, running Python 3.6.8 (Anaconda). My IPython version is 7.3.0; it behaves the same with IPython 7.4.0. The matplotlib version is 3.0.3. The problem does not occur until the interactive GUI system is initialized (which happens immediately at import with matplotlib 3.x, and at the creation of a figure (plt.figure()) with matplotlib 2.x). It occurs once an icon called "Python 3.6" appears in the dock.
I realized this time.sleep delay behavior happens only when using the matplotlib backends Qt5Agg and Qt4Agg. This happens whether I use iPython or the regular python console. It doesn't occur when running a file by entering "python filename.py" into the Terminal, though it does hold up files that are run through the iPython or python consoles. When the plotting GUI starts, the time.sleep behavior kicks in after around 30 seconds or so.
I was able to fix the problem by switching the backend to TkAgg, which works similarly and appears to work well in interactive mode. I just made a file called "matplotlibrc" in my base user folder (~) and added the line
backend : TkAgg
to make the default backend TkAgg. This doesn't entirely answer why it happens, but it fixes the problem.
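Equivalently (assuming you can edit the code rather than the config file), the backend can be selected programmatically before pyplot is imported:
import matplotlib
matplotlib.use("TkAgg")        # select the backend before importing pyplot
import matplotlib.pyplot as plt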

Measure program runtime statistically sound

Assume I have two variants of my compiled program, ./foo and ./bar, and I want to find out if bar is indeed faster.
I can compare runtimes by running time ./foo and time ./bar, but the numbers vary too much to get a meaningful result here.
What is the quickest way to get a statistically sound comparison of two command line programs' execution times? E.g. one that also tells me about the variance of the measurements?
The python module timeit also provides a simple command line interface, which is already much more convenient than issuing time commands multiple times:
$ python -m timeit -s 'import os' 'os.system("./IsSpace-before")'
10 loops, best of 3: 4.9 sec per loop
$ python -m timeit -s 'import os' 'os.system("./IsSpace-after")'
10 loops, best of 3: 4.9 sec per loop
The timeit module does not calculate averages or variances; it simply takes the minimum, on the basis that measurement errors only ever increase the measured time.
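If you do want a mean and a spread rather than just the minimum, the same measurement can be driven from Python with timeit.repeat (a sketch; "./IsSpace-before" is the binary from the example above, substitute your own):
import statistics
import subprocess
import timeit

# run the program several times and summarize the wall-clock times
runs = timeit.repeat(
    lambda: subprocess.run("./IsSpace-before", check=True),
    repeat=5, number=1)
print(f"min {min(runs):.3f} s  "
      f"mean {statistics.mean(runs):.3f} s  "
      f"stdev {statistics.stdev(runs):.3f} s")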

Why are python's for loops so non-linear for large inputs?

I was benchmarking some Python code and noticed something strange. I used the following function to measure how long it took to iterate through an empty for loop:
import time

def f(n):
    t1 = time.time()
    for i in range(n):
        pass
    print(time.time() - t1)
f(10**6) prints about 0.035, f(10**7) about 0.35, f(10**8) about 3.5, and f(10**9) about 35. But f(10**10)? Well over 2000. That's certainly unexpected. Why would it take over 60 times as long to iterate through 10 times as many elements? What's with python's for loops that causes this? Is this python-specific, or does this occur in a lot of languages?
When you get above 2^31 - 1 (so somewhere between 10^9 and 10^10) you leave the 32-bit integer range. Python 3 then transparently moves you onto arbitrary-precision integers, which are much slower to allocate and use.
In general, working with such big numbers is one of the areas where Python 3 is a lot slower than Python 2 (which at least had fast 64-bit integers on many systems). On the plus side, it makes Python easier to use, with fewer overflow-type errors.
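If you want to see where the break happens on your machine, here is a small sketch (the exact boundary depends on the platform's C long, as discussed further below):
import timeit

# The per-iteration cost jumps once the loop values no longer fit in a C
# long: on platforms with a 32-bit long (e.g. Windows) the "big" loop is
# noticeably slower than "small"; on typical 64-bit Unix builds the jump
# only shows up for "huge", whose values exceed 2**63.
small = "for i in range(10**7): pass"
big = "for i in range(10**10, 10**10 + 10**7): pass"
huge = "for i in range(2**63, 2**63 + 10**7): pass"
for label, stmt in [("small", small), ("big", big), ("huge", huge)]:
    print(label, timeit.timeit(stmt, number=1))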
Some accurate timings using timeit show that the times actually increase roughly in line with the input size, so your timings seem to be quite a way off:
In [2]: for n in [10**6,10**7,10**8,10**9,10**10]:
   ...:     %timeit f(n)
   ...:
10 loops, best of 3: 22.8 ms per loop
1 loops, best of 3: 226 ms per loop # roughly ten times previous
1 loops, best of 3: 2.26 s per loop # roughly ten times previous
1 loops, best of 3: 23.3 s per loop # roughly ten times previous
1 loops, best of 3: 4min 18s per loop # roughly ten times previous
Using xrange and Python 2 we see the ratios are roughly the same; obviously Python 2 is much faster overall, because Python 3's int is Python 2's arbitrary-precision long:
In [5]: for n in [10**6,10**7,10**8,10**9,10**10]:
   ...:     %timeit f(n)
   ...:
100 loops, best of 3: 11.3 ms per loop
10 loops, best of 3: 113 ms per loop
1 loops, best of 3: 1.13 s per loop
1 loops, best of 3: 11.4 s per loop
1 loops, best of 3: 1min 56s per loop
The actual difference in run time seems to be related more to the size of Windows' C long than to Python 3 itself. The difference is marginal on Unix, which handles longs quite differently from Windows, so this is a platform-specific issue as much as, if not more than, a Python one.
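A quick way to check which case your platform falls into (a small sketch):
import platform
import struct

# The relevant width is that of the C "long" type CPython builds against:
# 32 bits on Windows (even 64-bit Windows), 64 bits on most 64-bit Unixes.
# Python 2's plain int and the fast path of Python 3's range() use it.
print(platform.system(), struct.calcsize("l") * 8, "bit C long")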

Resources