Generating a huge amount of random numbers with Python - python-3.x

I want to generate random numbers, uniformly between -1 and 1.
I know that using NumPy to generate an array of numbers is much better than generating them one by one in a for loop.
On the other hand, I only need to operate on each number once, so there's no reason to store them in an array.
My question is: what is the best solution to this? On the one hand, a for loop is not time efficient, but I don't store unnecessary numbers; I generate them one by one and then throw them away. On the other hand, an array is not memory efficient: if I want to generate 10^10 numbers, I need to create a 10^10-element array, with horrible results.
I assume the best choice is to generate small arrays (10^3 or 10^4 elements) one by one, but I want to know if there's a better solution to this problem (maybe a NumPy function that generates the numbers but yields something like an iterable that doesn't store them all in memory?)

Using NumPy to generate blocks of numbers is best, and you want to keep operations vectorised as much as possible.
A simple benchmark shows that somewhere between 4k and 64k is a reasonable block size:
from timeit import Timer
import numpy as np

for xp in range(20):
    size = 2**xp
    timer = Timer(
        f'rng.uniform(-1., 1., size={size})',
        'rng = np.random.default_rng()',
        globals=globals()
    )
    n, t = timer.autorange()
    t = min([t] + timer.repeat(3, n)) / n / size
    print(f'{size:8} = {1e-6/t:6.2f}M/s')
gives me
1 = 0.47M/s
2 = 0.95M/s
4 = 1.89M/s
8 = 3.80M/s
16 = 7.43M/s
32 = 14.26M/s
64 = 27.10M/s
128 = 48.60M/s
256 = 78.72M/s
512 = 119.07M/s
1024 = 158.71M/s
2048 = 191.51M/s
4096 = 218.71M/s
8192 = 233.25M/s
16384 = 241.23M/s
32768 = 245.35M/s
65536 = 248.75M/s
131072 = 250.53M/s
262144 = 252.62M/s
524288 = 253.99M/s
and working with numbers in a vectorised form is orders-of-magnitude faster.
For example, given a 64k array of values, the vectorised call np.sum(x) takes 17 µs, while the equivalent element-by-element sum(x) takes 3.5 ms, i.e. 200 times slower. Once you've paid the price for getting the floats out into the non-vectorised Python world, going through another yield from doesn't make much difference, only taking 4.5 ms. For example, via the IPython %timeit magic:
def yield_from(it):
    yield from it

x = np.random.uniform(-1, 1, size=2**16)

%timeit np.sum(x)
%timeit sum(x)
%timeit sum(yield_from(x))
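In practice that means drawing and reducing the stream block by block, so only one block ever lives in memory. A minimal sketch (the 10**8 total, the block size and the running-mean reduction are placeholders for whatever you actually do with the numbers):
import numpy as np

rng = np.random.default_rng()
total_n = 10**8      # stand-in for the 10**10 in the question
block = 65536        # block size in the sweet spot from the benchmark above

acc = 0.0
remaining = total_n
while remaining > 0:
    x = rng.uniform(-1.0, 1.0, size=min(block, remaining))
    acc += x.sum()   # vectorised reduction; the block is then discarded
    remaining -= x.size

print(acc / total_n)  # running mean, should be close to 0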

You could make a generator, as suggested in the comment by @Carcigenicate, and combine it with the speedup of generating entire arrays by using a yield from expression.
It would look something like this:
def random_numbers():
    while True:
        yield from np.random.random(1000) * 2 - 1
You can adjust the number of values generated at once to whatever you need; larger is faster but uses more memory.
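One way to consume that endless stream (my sketch, with an arbitrary 10_000 draws and the chunk size made a parameter) is itertools.islice:
import itertools
import numpy as np

def random_numbers(chunk=1000):
    while True:
        yield from np.random.random(chunk) * 2 - 1

values = itertools.islice(random_numbers(), 10_000)  # take exactly 10_000 numbers, one at a time
print(sum(values) / 10_000)                          # sample mean, should be close to 0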


Numpy.linalg.norm performance apparently doesn't scale with the number of dimensions

I have the following snippet of code, which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to the closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
    distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
    membership[i] = np.argmin(distances)
The running time here should be O(NKD), where D is the dimension of the data points, so naturally I expect the running time to change roughly proportionally when D increases or decreases. To my surprise, I see very little change in running time when changing D, for example when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per @Warren's suggestion, I modified the code to use np.linalg.norm with the axis parameter directly; the performance is as follows:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
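(For reference, a reconstruction of what that modification presumably looks like, reusing the points, centroids and membership arrays defined above; this is not necessarily the exact code that was timed:)
for i in range(n):
    # one vectorised call per point: distances to all K centroids at once
    distances = np.linalg.norm(centroids - points[i], ord=2, axis=1)
    membership[i] = np.argmin(distances)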
This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20,000 times, and each call internally loops over the 250 rows, calling the target Python function (and hence np.linalg.norm) each time, i.e. it is not vectorized. In the end, np.linalg.norm is called 20,000 * 250 = 5,000,000 times. The thing is, each call to a NumPy function typically takes about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (types and values), allocations, function calls, conversions, etc.
There are two simple ways to reduce this overhead: vectorization and using a JIT compiler like Numba. The latter is often more efficient as it avoids creating expensive big temporary arrays.
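For completeness, the fully vectorised route could look like the sketch below. This is my illustration, not part of the original answer; it assumes SciPy is available and builds an n x K distance matrix, i.e. exactly the kind of big temporary array mentioned above:
from scipy.spatial.distance import cdist

dist = cdist(points, centroids)         # (n, K) Euclidean distances, computed in native code
membership = np.argmin(dist, axis=1)    # index of the closest centroid for each point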
Here is a much faster implementation using Numba:
import numpy as np
import numba as nb

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
        membership[i] = np.argmin(distances)

n = 20000
D = 30
K = 250

points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)

compute(points, centroids, membership)
In fact, while this code is much faster, it still has a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. Each allocation takes only a few hundred nanoseconds, but the number of loop iterations is huge. One solution is simply to compute the norm manually:
from math import sqrt

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            s = 0.0
            for k in range(D):
                tmp = centroids[j,k] - points[i,k]
                s += tmp * tmp
            distances[j] = sqrt(s)
        membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D, since the NumPy overheads are the main bottleneck in this case and the implementation can almost completely remove them (thanks to the JIT compilation).
It is probably O(NKD).
But the thing is, you are iterating over 3 loops here: one explicitly, one semi-explicitly, and the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which operates on the K rows of centroids - points[i] (by the way, there is another loop hidden in the broadcasting there, but we don't need to count all of them for the big-O consideration).
And the inner one is the one over the D columns that occurs inside norm.
The inner one is obviously the most important one to optimize, and that's good, because it is the only one that is vectorized here.
But that means that for small enough values of D, what we really see is mostly constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer for loops drive most of the cost, which then looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not quite as bad, but almost: it still calls Python code many times. It is not vectorization.
But, well, I bet that with D big enough, you'll see that it is O(NKD).
Edit:
Here is what I get when I increase D (with a smaller n, so that it remains computable in a realistic time). [Timing plot not reproduced here.]
You can see that the trend really is linear (affine, to be accurate, since it doesn't pass through 0, which is why it doesn't look linear to you; as explained above, most of the inner cost of the for/apply_along_axis double loop is constant overhead of those loops when D is small, and the "proportional to D" part only starts to show once that overhead becomes negligible).
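A rough way to reproduce the trend (my own sketch with arbitrary sizes, not the code behind the original plot):
import time
import numpy as np

n, K = 2000, 250   # smaller n so the scan finishes quickly
for D in (1, 10, 30, 100, 300, 1000, 3000):
    points = np.random.rand(n, D)
    centroids = np.random.rand(K, D)
    t0 = time.perf_counter()
    for i in range(n):
        np.apply_along_axis(np.linalg.norm, 1, centroids - points[i])
    print(D, round(time.perf_counter() - t0, 2))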

Python Quadruple For Performance

I have code that reads data and grabs specific data from an object's fields.
How can I eliminate the quadruple for loop here? Its performance seems quite slow.
data = readnek(filename)  # read in data
bigNum = 200000
for myNodeVal in range(0, 7):  # all 6 elements.
    cs_coords = np.ones((bigNum, 2))  # initialize data
    counter = 0
    for iel in range(bigNum):
        for ix in range(0, 7):
            for iy in range(0, 7):
                z = data.elem[iel].pos[2, myNodeVal, iy, ix]
                x = data.elem[iel].pos[0, myNodeVal, iy, ix]
                y = data.elem[iel].pos[1, myNodeVal, iy, ix]
                cs_coords[counter, 0:2] = [x, y]
                counter += 1
You can remove the two innermost loops by using a transposed view that is reshaped to build a block of 49 [x, y] values, which is then assigned to cs_coords in a vectorized way. The access to z can be removed for better performance (since the Python interpreter optimizes nearly nothing away). Here is an (untested) example:
data = readnek(filename)  # read in data
bigNum = 200000
for myNodeVal in range(0, 7):  # all 6 elements.
    cs_coords = np.ones((bigNum, 2))  # initialize data
    counter = 0
    for iel in range(bigNum):
        arr = data.elem[iel].pos
        view_x = arr[0, myNodeVal, 0:7, 0:7].T
        view_y = arr[1, myNodeVal, 0:7, 0:7].T
        cs_coords[counter:counter+49] = np.hstack([view_x.reshape(-1, 1), view_y.reshape(-1, 1)])
        counter += 49
Note that the initial code is probably flawed, since cs_coords.shape[0] is bigNum while counter will reach bigNum * 49. You certainly need to use the shape (bigNum*49, 2) instead to avoid out-of-bounds errors.
Note that the above code is still far from optimal, since it creates many small arrays, and NumPy is not optimized to deal with very small arrays (nor is CPython). It is hard to do much better without more information on data. Using Numba or Cython can certainly help a lot to speed up this code. Still, even with such tools, the code will not be very efficient, since the memory access pattern is inefficient (bad cache locality) and the overall code will be memory-bound.
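If the per-element pos arrays can be stacked into one big array up front (an assumption; whether that is practical depends on what readnek returns and on the available memory, since the stacked array holds bigNum * 3 * 7 * 7 * 7 doubles), the loop over iel disappears as well. A rough, untested sketch:
# Hypothetical: assumes every data.elem[iel].pos has the same shape and dtype.
pos = np.stack([data.elem[iel].pos for iel in range(bigNum)])  # one Python loop, done once

for myNodeVal in range(0, 7):
    xy = pos[:, 0:2, myNodeVal, 0:7, 0:7]            # (bigNum, 2, 7, 7): x and y for every element
    # reorder to (element, ix, iy, coord) to match the original counter order, then flatten
    cs_coords = xy.transpose(0, 3, 2, 1).reshape(-1, 2)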

multithreaded iteration over numpy array indices

I have a piece of code which iterates over a three-dimensional array and writes into each cell a value based on the indices and the current value itself:
import numpy as np

nx = ny = nz = 100
array = np.zeros((nx, ny, nz))

def fun(val, k):
    # Do something with the indices
    return val + (k[0] * k[1] * k[2])

with np.nditer(array, flags=['multi_index'], op_flags=['readwrite']) as it:
    for x in it:
        x[...] = fun(x, it.multi_index)
Note that fun might do something more sophisticated that takes most of the total runtime, and that the input arrays might have different lengths per axis.
However, this code could run in multiple threads, as fun can be assumed to be thread-safe (only the value and index of the current cell are required). But finding a method to iterate over all cells with the current index available seems to be hard.
A possible solution might be https://stackoverflow.com/a/58012407/446140, where the array is split by the x-axis into chunks and passed to a Pool.
However, the solution is not universally applicable and I wonder if there is a more general solution for this problem (which could also work with nD arrays)?
The first issue is to split up the 3D array into equally sized chunks. np.array_split can be used, but the offset of each of the splits has to be stored to get the correct indices again.
An interesting question, with a few possible solutions. As you indicated, it is possible to use np.array_split, but since we are only interested in the indices, we can also use np.unravel_index, which would mean that we only have to loop over all the indices (the size) of the array to get the index.
Now there are two great ideas for multiprocessing:
Create a (thread-safe) shared memory block for the array and split the indices across the different processes.
Only update the array in the main thread, but provide a copy of the required data to the processes and let them return the values that have to be updated.
Both solutions will work for any np.ndarray, but they have different advantages. Creating a shared memory block doesn't create copies, but it can have a large insertion penalty if it has to wait on other processes (the computational time is small compared to the write time).
There are probably many more solutions, but I will work out the first solution, where a Shared Memory object is created and a range of indices is provided to every process.
Required imports:
import itertools
import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
Shared Numpy arrays
The main problem with applying multiprocessing on np.ndarray's is that memory sharing between processes can be difficult. For this the following class can be used:
class SharedNumpy:
    __slots__ = ('arr', 'shm', 'name', 'shared',)

    def __init__(self, arr: np.ndarray = None):
        if arr is not None:
            self.shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
            self.arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=self.shm.buf)
            self.name = self.shm.name
            np.copyto(self.arr, arr)

    def __getattr__(self, item):
        if hasattr(self.arr, item):
            return getattr(self.arr, item)
        raise AttributeError(f"{self.__class__.__name__} doesn't have attribute {item!r}")

    def __str__(self):
        return str(self.arr)

    @classmethod
    def from_name(cls, name, shape, dtype):
        memory = cls(arr=None)
        memory.shm = shared_memory.SharedMemory(name)
        memory.arr = np.ndarray(shape, dtype=dtype, buffer=memory.shm.buf)
        memory.name = name
        return memory

    @property
    def dtype(self):
        return self.arr.dtype

    @property
    def shape(self):
        return self.arr.shape
This makes it possible to create a shared memory object in the main process and then use SharedNumpy.from_name to get it in other processes.
Simple test
A quick (non threaded) test would be:
def simple_test():
    data = np.array(np.zeros((5,) * 2))
    mem_primary = SharedNumpy(arr=data)
    mem_second = SharedNumpy.from_name(name=mem_primary.name, shape=data.shape, dtype=data.dtype)

    assert mem_primary.name == mem_second.name, "Different memory names"
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."

    mem_primary.arr[2] = 5
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
    print("Completed 3/3 tests...")
A threaded test will follow later!
Distribution
The next part is focused on providing the processes with the necessary data. In this case we will provide every process with a range of indices that it has to calculate and all the data that is required to load the shared memory.
The inputs of this function are dim, the number of NumPy axes, and size, the number of elements per axis.
def distributed(size, dim):
    memory = SharedNumpy(arr=np.zeros((size,) * dim))
    split_size = np.int64(np.ceil(memory.arr.size / mp.cpu_count()))

    settings = dict(
        memory=itertools.repeat(memory.name),
        shape=itertools.repeat(memory.arr.shape),
        dtype=itertools.repeat(memory.arr.dtype),
        start=np.arange(mp.cpu_count()),
        num=itertools.repeat(split_size)
    )

    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(fun, zip(*settings.values()))

    print(f"\n\nDone {dim}D, size: {size}, elements: {size ** dim}")
    return memory
Notes:
By using starmap instead of map, it is possible to provide multiple input arguments (a list of arguments for every process).
(also see docs starmap)
itertools.repeat is used to add constants to the starmap
(also see: zip() in python, how to use static values)
By using np.unravel_index, we only need to have a start index and the chunk size per process.
The start and num tell the chunks of indices that have to be converted per process, by applying range(start * num, (start + 1) * num).
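A quick illustration of that index conversion (toy shape, just to show what np.unravel_index returns):
import numpy as np

shape = (3, 4, 5)
print(np.unravel_index([17], shape))  # (array([0]), array([3]), array([2])), i.e. flat index 17 -> (0, 3, 2)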
Testing
For the testing I am using different input sizes and dimensions. Since the data increases with the formula size ^ dimensions, I limited the test to a size of 128 and 3 dimensions (that is 2,097,152 points, which already starts taking quite a bit of time).
Code
fun
def fun(name, shape, dtype, start, num):
    memory = SharedNumpy.from_name(name, shape=shape, dtype=dtype)
    for idx in range(start * num, min((start + 1) * num, memory.arr.size)):
        # Do something with the indices
        indices = np.unravel_index([idx], shape)
        memory.arr[indices] += np.prod(indices)
    memory.shm.close()  # Closes the shared memory for this process.
Running the example
if __name__ == '__main__':
    for size in [5, 10, 15]:
        for dim in [1, 2, 3]:
            memory = distributed(size, dim)
            print(memory)
            memory.shm.unlink()
For the OP's code, I used it with a small addition that allows the array to have different sizes and dimensions; in any case I use:
def sequential(size, dim):
    array = np.zeros((size,) * dim)
    ...
Looking at the output arrays of both versions shows that they produce the same results.
Plots
The code for the graphs has been taken from the answer at:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
with the minor alteration that labels was changed to codes in:
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
In the plots, 1d, 2d and 3d refer to the dimensions and the input is the size.
Sequential (OP code): [timing plot not reproduced here]
Distributed (this code): [timing plot not reproduced here]
Results
This method works on an arbitrarily sized NumPy array and is able to perform an operation on the indices of the array. It provides full access to the whole NumPy array, so it can also be used to perform different kinds of statistical analysis that do not change the array.
From the timings it can be seen that for small data shapes the distributed version has little to no advantage, because of the extra complexity of creating the processes. However, for larger amounts of data it starts to become more effective.
I only timed it with short computational delays (a simple fun), but with more complex calculations it should outperform the sequential version much sooner.
Extra
If you are only interested in operations that are performed over or along axes, these NumPy functions might help vectorize your solution instead of using multiprocessing (a toy example follows this list):
np.apply_over_axes
np.apply_along_axis
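For example, a toy use of np.apply_along_axis (my illustration, unrelated to the 3D example above):
import numpy as np

a = np.arange(12).reshape(3, 4)
row_ranges = np.apply_along_axis(lambda r: r.max() - r.min(), 1, a)  # max - min of each row
print(row_ranges)  # [3 3 3]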

Cython: make prange parallelization thread-safe

Cython starter here. I am trying to speed up a calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        double[:] Z = np.zeros(nbins,dtype=np.float64)
        int i,j,b

    with nogil, parallel(num_threads=num_threads):
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):

                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])

                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[b] += Xij*Yij

    return np.asarray(Z)
Mock data and bins:
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB
Yes, Z[b] += Xij*Yij is indeed a race condition.
There are a couple of options for making this atomic or critical. Implementation issues with Cython aside, you would in any case have bad performance due to false sharing on the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with NumPy. Memoryview slices cannot be assigned as private variables. A two-dimensional (num_threads, nbins) array works, but for some reason generates very complicated, inefficient array indexing code; it works but is slower and does not scale.
A flat NumPy array with manual "2D" indexing works well. You get a little bit of extra performance by padding each thread's private part of the array to 64 bytes, which is a typical cache line size. This avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
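To make the padding arithmetic concrete: a double is 8 bytes, so rounding nbins up to a multiple of 8 makes each thread's private slice span whole 64-byte cache lines. A quick Python check of the formula used below:
nbins = 10
nbins_padded = (((nbins - 1) // 8) + 1) * 8  # -> 16
print(nbins_padded, nbins_padded * 8)        # 16 doubles = 128 bytes = two full cache lines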
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        # pad local data to 64 bytes to avoid false sharing of cache lines
        int nbins_padded = (((nbins - 1) // 8) + 1) * 8
        double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
        double[:] Z = np.zeros(nbins)
        int i,j,b, bb, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):

                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])

                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z_local[tid * nbins_padded + b] += Xij*Yij

    for tid in range(num_threads):
        for bb in range(nbins):
            Z[bb] += Z_local[tid * nbins_padded + bb]

    return np.asarray(Z)
This performs quite well on my 4 core machine, with 720 ms / 191 ms, a speedup of 3.6. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.
You are right that the access to Z is under a race condition.
You might be better off defining num_threads copies of Z, as cdef double[:,:] Z = np.zeros((num_threads, nbins), dtype=np.float64), and performing a sum along axis 0 after the prange loop:
return np.sum(Z, axis=0)
Cython code can have a with gil statement in a parallel region, but it is only documented for error handling. You could have a look at the generated C code to see whether that would trigger an atomic OpenMP operation, but I doubt it.

Find euclidean distance between rows of two huge CSR matrices

I have two sparse matrices, A and B. A is 120000x5000 and B is 30000x5000. I need to find the Euclidean distances between each row in B and all rows of A, and then find the 5 rows in A with the lowest distance to the selected row in B. As the data is very big I am using CSR, otherwise I get a memory error. It is clear that for each row in A it calculates (x_b - x_a)^2 5000 times, sums them, and then takes a sqrt. This process is taking a very, very long time, like 11 days! Is there any way I can do this more efficiently? I just need the 5 rows with the lowest distance to each row in B.
I am implementing K-Nearest Neighbours; A is my training set and B is my test set.
Well, I don't know if you could 'vectorize' that code so that it would run in native code instead of Python. The trick to speeding up numpy and scipy is always getting that.
If you could run that code as native code on a 1 GHz CPU, with 1 FP instruction per clock cycle, you'd get it done in a little under 10 hours:
(5000 * 2 * 30000 * 120000) / 1024 ** 3
Raise that to 1.5 GHz x 2 physical CPU cores x 4-way SIMD instructions with multiply + accumulate (Intel AVX extensions, available in most CPUs) and you could get that number crunching down to about one hour, at 2 x 100% CPU on a modest Core i5 machine. But that would require full SIMD optimization in native code, far from a trivial task (although, if you decide to go down this path, further questions on S.O. could get help from people eager to get their hands wet with SIMD coding :-) ). Interfacing that C code with SciPy is not hard using Cython, for example (but you only need that part to reach the above 10-hour figure).
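Spelling out that back-of-the-envelope estimate:
ops = 5000 * 2 * 30000 * 120000  # subtract + multiply-accumulate per element, per row of A, per row of B
seconds = ops / 1024**3          # at roughly 1e9 FP operations per second
print(seconds / 3600)            # ~9.3, i.e. a little under 10 hours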
Now... as for algorithm optimization, and keeping things Python :-)
Fact is, you don't need to fully calculate all distances from the rows in A. You just need to keep a sorted list of the 5 closest rows so far, and any time the accumulating sum of squares gets larger than that of the 5th nearest row, you just abort the calculation for that row.
You could use Python's heapq operations for that:
import heapq
import math

def get_closer_rows(b_row, a):
    # keep the 5 smallest (squared distance, row index) pairs seen so far
    result = [(float("+inf"), None)] * 5
    for i, a_row in enumerate(a):
        distance_sq = 0
        count = 0
        for element_a, element_b in zip(a_row, b_row):
            distance_sq += (element_a - element_b) ** 2
            # every 64 elements, check whether this row can still beat the current 5th best
            if not count % 64 and distance_sq > result[4][0]:
                break
            count += 1
        else:
            heapq.heappush(result, (distance_sq, i))
            result[:] = heapq.nsmallest(5, result)
    return [(math.sqrt(d), i) for d, i in result]

closer_rows_to_b = []
for row in b:
    closer_rows_to_b.append(get_closer_rows(row, a))
Note the auxiliary "count", used to avoid the expensive retrieval and comparison of values on every single iteration.
Now, if you can run this code using PyPy instead of regular Python, I believe it could get the full benefit of JITting, and you could see a noticeable improvement over your current times if you are running the code in pure Python (i.e., non-numpy/scipy vectorized code).
