Why does Python's lru_cache perform best when maxsize is a power-of-two?


Documentation says this:
If maxsize is set to None, the LRU feature is disabled and the cache can grow without bound. The LRU feature performs best when maxsize is a power-of-two.
Would anyone happen to know where this "power-of-two" comes from? I am guessing it has something to do with the implementation.

Where the size effect arises
The lru_cache() code exercises its underlying dictionary in an atypical way. While maintaining total constant size, cache misses delete the oldest item and insert a new item.
The power-of-two suggestion is an artifact of how this delete-and-insert pattern interacts with the underlying dictionary implementation.
How dictionaries work
Table sizes are a power of two.
Deleted keys are replaced with dummy entries.
New keys can sometimes reuse the dummy slot, sometimes not.
Repeated delete-and-inserts with different keys will fill-up the table with dummy entries.
An O(N) resize operation runs when the table is two-thirds full.
Since the number of active entries remains constant, a resize operation doesn't actually change the table size.
The only effect of the resize is to clear the accumulated dummy entries.
Performance implications
A dict with 2**n entries has the most available space for dummy entries, so the O(n) resizes occur less often.
Also, sparse dictionaries have fewer hash collisions than mostly full dictionaries. Collisions degrade dictionary performance.
When it matters
The lru_cache() only updates the dictionary when there is a cache miss. Also, when there is a miss, the wrapped function is called. So, the effect of resizes would only matter if there is a high proportion of misses and if the wrapped function is very cheap.
Far more important than giving the maxsize a power-of-two is using the largest reasonable maxsize. Bigger caches have more cache hits — that's where the big wins come from.
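As a rough illustration of the "bigger caches have more hits" point, here is a minimal sketch that measures the hit rate with cache_info(). The helper name hit_rate, the cached function square, and the cyclic access pattern are assumptions made up for this example; real workloads will behave less extremely.

```python
from functools import lru_cache

def hit_rate(maxsize, keys):
    """Return the fraction of calls served from the cache (hypothetical helper)."""
    @lru_cache(maxsize=maxsize)
    def square(x):
        return x * x

    for k in keys:
        square(k)
    info = square.cache_info()  # named tuple: hits, misses, maxsize, currsize
    return info.hits / (info.hits + info.misses)

# Cycle through 100 distinct keys ten times.
keys = list(range(100)) * 10
small = hit_rate(64, keys)   # fewer slots than keys: a cyclic scan evicts each key before reuse
large = hit_rate(128, keys)  # all 100 keys fit: only the first pass misses
print(small, large)
```

With this deliberately adversarial access pattern, going from 64 to 128 slots moves the cache from no hits at all to hits on every call after the first pass, a far larger change than anything the power-of-two effect can produce.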
Simulation
Once an lru_cache() is full and the first resize has occurred, the dictionary settles into a steady state and will never get larger. Here, we simulate what happens next as new dummy entries are added and periodic resizes clear them away.
steady_state_dict_size = 2 ** 7     # always a power of two

def simulate_lru_cache(lru_maxsize, events=1_000_000):
    'Count resize operations as dummy keys are added'
    resize_point = steady_state_dict_size * 2 // 3
    assert lru_maxsize < resize_point
    dummies = 0
    resizes = 0
    for i in range(events):
        dummies += 1
        filled = lru_maxsize + dummies
        if filled >= resize_point:
            dummies = 0
            resizes += 1
    work = resizes * lru_maxsize    # resizing is O(n)
    work_per_event = work / events
    print(lru_maxsize, '-->', resizes, work_per_event)
Here is an excerpt of the output:
for maxsize in range(42, 85):
    simulate_lru_cache(maxsize)
42 --> 23255 0.97671
43 --> 23809 1.023787
44 --> 24390 1.07316
45 --> 25000 1.125
46 --> 25641 1.179486
...
80 --> 200000 16.0
81 --> 250000 20.25
82 --> 333333 27.333306
83 --> 500000 41.5
84 --> 1000000 84.0
What this shows is that the cache does significantly less work when maxsize is as far away as possible from the resize_point.
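The simulation loop can also be written in closed form, which makes the dependence on the distance to the resize point explicit: a resize happens once every (resize_point - lru_maxsize) insertions. The function name expected_resizes is made up for this sketch; the model is the same simplified one used in the simulation above, not a measurement of CPython itself.

```python
def expected_resizes(lru_maxsize, steady_state_dict_size=2 ** 7, events=1_000_000):
    """Closed-form version of the simulation: returns (resizes, work_per_event)."""
    resize_point = steady_state_dict_size * 2 // 3
    assert lru_maxsize < resize_point
    # Number of dummy entries that fit before the table hits the resize point.
    room_for_dummies = resize_point - lru_maxsize
    resizes = events // room_for_dummies
    work_per_event = resizes * lru_maxsize / events  # resizing is O(n)
    return resizes, work_per_event

for maxsize in (42, 80, 84):
    print(maxsize, '-->', *expected_resizes(maxsize))
```

For maxsize 42 there is room for 43 dummies between resizes, while for maxsize 84 there is room for only one, so every single miss triggers a resize.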
History
The effect was minimal in Python3.2, when dictionaries grew by 4 x active_entries when resizing.
The effect became catastrophic when the growth rate was lowered to 2 x active entries.
Later a compromise was reached, setting the growth rate to 3 x used. That significantly mitigated the issue by giving us a larger steady state size by default.
A power-of-two maxsize is still the optimum setting, giving the least work for a given steady state dictionary size, but it no longer matters as much as it did in Python3.2.
Hope this helps clear up your understanding. :-)

TL;DR - this is an optimization that doesn't have much effect at small lru_cache sizes, but (see Raymond's reply) has a larger effect as your lru_cache size gets bigger.
So this piqued my interest and I decided to see if this was actually true.
First I went and read the source for the LRU cache. The implementation for cpython is here: https://github.com/python/cpython/blob/master/Lib/functools.py#L723 and I didn't see anything that jumped out to me as something that would operate better based on powers of two.
So, I wrote a short python program to make LRU caches of various sizes and then exercise those caches several times. Here's the code:
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Calculate the value for a range larger than our largest cache
        for k in range(2000):
            perform_calc(k)

# Created once, so that timings accumulate across all repetitions
values = defaultdict(list)
for t in range(10):
    print(t)
    for i in range(1, 1025):
        start = time.perf_counter()
        run_test(i)
        elapsed = time.perf_counter() - start
        values[i].append(elapsed)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")
I ran this on a macbook pro under light load with python 3.7.7.
Here's the results:
https://docs.google.com/spreadsheets/d/1LqZHbpEL_l704w-PjZvjJ7nzDI1lx8k39GRdm3YGS6c/preview?usp=sharing
The random spikes are probably due to GC pauses or system interrupts.
At this point I realized that my code always generated cache misses, and never cache hits. What happens if we run the same thing, but always hit the cache?
I replaced the inner loop with:
# let's run the test 5 times (so that we exercise the caching)
for j in range(5):
    # Only ever create cache hits
    for k in range(i):
        perform_calc(k)
The data for this is in the same spreadsheet as above, second tab.
Let's see:
Hmm, but we don't really care about most of these numbers. Also, we're not doing the same amount of work for each test, so the timing doesn't seem useful.
What if we run it for just 2^n - 1, 2^n, and 2^n + 1? Since this speeds things up, we'll average it out over 100 tests instead of just 10.
We'll also generate a large random list to run on, since that way we'll expect to have some cache hits and cache misses.
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
import random

# 128 distinct keys, each repeated 8 times, in random order
rands = list(range(128)) * 8
random.shuffle(rands)

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        for k in rands:
            perform_calc(k)

# Created once, so that timings accumulate across all repetitions
values = defaultdict(list)
for t in range(100):
    print(t)
    # Interesting numbers, and how many random elements to generate
    for i in [15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, 256, 257, 511, 512, 513, 1023, 1024, 1025]:
        start = time.perf_counter()
        run_test(i)
        elapsed = time.perf_counter() - start
        values[i].append(elapsed)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")
Data for this is in the third tab of the spreadsheet above.
Here's a graph of the average time per element / lru cache size:
Time, of course, decreases as our cache size gets larger since we don't spend as much time performing calculations. The interesting thing is that there does seem to be a dip from 15 to 16, 17 and 31 to 32, 33. Let's zoom in on the higher numbers:
Not only do we lose that pattern in the higher numbers, but we actually see that performance decreases for some of the powers of two (511 to 512, 513).
Edit: The note about power-of-two was added in 2012, but the algorithm for functools.lru_cache looks the same at that commit, so unfortunately that disproves my theory that the algorithm has changed and the docs are out of date.
Edit: Removed my hypotheses. The original author replied above - the problem with my code is that I was working with "small" caches - meaning that the O(n) resize on the dicts was not very expensive. It would be cool to experiment with very large lru_caches and lots of cache misses to see if we can get the effect to appear.

Related

Numpy.linalg.norm performance apparently doesn't scale with the number of dimensions

I have the following snippets of code which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to the closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
    distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
    membership[i] = np.argmin(distances)
The running time here should be O(NKD) where D is the dimension of the data points, so naturally I expect when D increases or decreases, the running time would change proportionally as well. To my surprise, I see very little time being changed when changing D, for example when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per @Warren's suggestion, I modified the code to use np.linalg.norm with the axis parameter directly; the performance is as follows:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
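The modified loop is not shown in the question, so the following is only a guess at what it looked like: np.linalg.norm is called once per point with axis=1, which computes all K distances in a single vectorized call. The function name assign_rowwise is made up for this sketch.

```python
import numpy as np

def assign_rowwise(points, centroids):
    """Assign each point to its nearest centroid, one vectorized norm call per point."""
    membership = np.zeros(points.shape[0], dtype=int)
    for i in range(points.shape[0]):
        # (K, D) difference array reduced along axis 1 gives K distances at once.
        distances = np.linalg.norm(centroids - points[i], ord=2, axis=1)
        membership[i] = np.argmin(distances)
    return membership

rng = np.random.default_rng(0)
demo_points = rng.random((50, 3))
demo_centroids = rng.random((7, 3))
demo_membership = assign_rowwise(demo_points, demo_centroids)
```

This removes the inner Python-level loop over the K centroids, but still pays the per-call Numpy overhead n times.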

This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times, and each call internally loops over the target Python function 250 times (i.e. it is not vectorized), which in turn calls np.linalg.norm. In the end, np.linalg.norm is called 20_000 * 250 = 5_000_000 times. Each call to a Numpy function typically takes about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (types and values), allocations, function calls, conversions, etc.
There are two simple ways to reduce this overhead: vectorization and using a JIT compiler like Numba. The latter is often more efficient as it avoids creating expensive big temporary arrays.
Here is a much faster implementation:
import numpy as np
import numba as nb

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
        membership[i] = np.argmin(distances)

n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
compute(points, centroids, membership)
In fact, while this code is much faster, it still has a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. Each allocation takes only a few hundred nanoseconds, but the number of loop iterations is huge. One solution is simply to compute the norm manually:
from math import sqrt

@nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
    n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
    assert centroids.shape[1] == D and membership.shape[0] == n
    distances = np.empty(K, np.float64)
    for i in range(n):
        for j in range(K):
            s = 0.0
            for k in range(D):
                tmp = centroids[j,k] - points[i,k]
                s += tmp * tmp
            distances[j] = sqrt(s)
        membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D, since the Numpy overhead is the main bottleneck in this case and the implementation can almost completely remove it (thanks to the JIT compilation).
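For completeness, here is a sketch of the other option mentioned above, pure Numpy vectorization, which was not benchmarked in this answer. It is a different technique from the Numba code: it relies on the identity ||p - c||^2 = ||p||^2 - 2 p·c + ||c||^2 and drops the ||p||^2 term, which is constant per point and does not affect the argmin. The function name assign_vectorized and the chunk parameter are assumptions of this example; processing points in chunks keeps the temporary (chunk, K) array small. Because of floating-point rounding, near-ties may occasionally resolve differently than with a direct norm.

```python
import numpy as np

def assign_vectorized(points, centroids, chunk=4096):
    """Nearest-centroid assignment using matrix products instead of Python loops."""
    membership = np.empty(points.shape[0], dtype=int)
    # Squared norm of every centroid, shape (K,).
    c_sq = np.einsum('kd,kd->k', centroids, centroids)
    for start in range(0, points.shape[0], chunk):
        block = points[start:start + chunk]
        # Squared distance minus the per-point constant ||p||^2.
        scores = c_sq[None, :] - 2.0 * (block @ centroids.T)
        membership[start:start + chunk] = np.argmin(scores, axis=1)
    return membership

rng = np.random.default_rng(0)
demo_points = rng.random((200, 3))
demo_centroids = rng.random((7, 3))
demo_membership = assign_vectorized(demo_points, demo_centroids, chunk=64)
```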

It is probably O(NKD).
But the thing is you are iterating 3 loops here. One explicitly. One semi-explicitly. And the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which applies on the K rows of centroids-points[i] (btw, there is another one here, with some broadcasting. But we don't need to count all of them for big-O consideration)
And the inner one is the one on the D columns that occur inside norm.
The inner one is obviously the most important to optimize, and that's good, because it is the only one that is vectorized here.
But that means that for small enough value of D, what we really see is more some constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer for loops drive most of the cost, which, then, looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not as bad. But almost so. It is still calling several times some python code. It is not vectorization.
But, well, I bet that with D big enough, you'll see that it is O(NKD)
Edit:
Here is what I get when I increase D (with smaller n, so that it remains computable in realistic time)
You see that it looks really linear. To be accurate, it is affine, since it doesn't pass through 0, which is why it doesn't look very linear to you. That is explained by my previous comment: when D is small, most of the cost inside the for/along_axis double loop is the constant overhead of those loops. The "proportional to D" part begins to show when that overhead becomes negligible.

multithreaded iteration over numpy array indices

I have a piece of code which iterates over a three-dimensional array and writes into each cell a value based on the indices and the current value itself:
import numpy as np
nx = ny = nz = 100
array = np.zeros((nx, ny, nz))
def fun(val, k):
    # Do something with the indices
    return val + (k[0] * k[1] * k[2])

with np.nditer(array, flags=['multi_index'], op_flags=['readwrite']) as it:
    for x in it:
        x[...] = fun(x, it.multi_index)
Note, that fun might do something more sophisticated, which takes most of the total runtime, and that the input arrays might have different lengths per axis.
However, this code could run in multiple threads, as fun can be assumed to be threadsafe (Only the value and index of the current cell are required). But finding a method to iterate over all cells and have the current index available seems to be hard.
A possible solution might be https://stackoverflow.com/a/58012407/446140, where the array is split by the x-axis into chunks and passed to a Pool.
However, the solution is not universally applicable and I wonder if there is a more general solution for this problem (which could also work with nD arrays)?
The first issue is to split up the 3D array into equally sized chunks. np.array_split can be used, but the offset of each of the splits has to be stored to get the correct indices again.

An interesting question, with a few possible solutions. As you indicated, it is possible to use np.array_split, but since we are only interested in the indices, we can also use np.unravel_index, which would mean that we only have to loop over all the indices (the size) of the array to get the index.
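To make the np.unravel_index idea concrete, here is a small sketch that walks the flat indices 0 .. size-1 and converts each one to its multi-dimensional index. The shape (2, 3, 4) is an arbitrary choice for illustration.

```python
import numpy as np

shape = (2, 3, 4)
size = int(np.prod(shape))

# One flat loop replaces the nested per-axis loops; works for any number of dimensions.
multi = [tuple(int(v) for v in np.unravel_index(flat, shape)) for flat in range(size)]

print(multi[0], multi[5], multi[-1])
```

Because the flat index is a plain integer, any contiguous range of flat indices can be handed to a worker, independent of how many axes the array has.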
Now there are two great ideas for multiprocessing:
Create a (thread safe) shared memory of the array and split the indices across the different processes.
Only update the array in a main thread, but provide a copy of the required data to the processes and let them return the value that has to be updated.
Both solutions will work for any np.ndarray, but have different advantages. Creating a shared memory doesn't create copies, but can have a large insertion penalty if it has to wait on other processes (when the computational time is small compared to the write time).
There are probably many more solutions, but I will work out the first solution, where a Shared Memory object is created and a range of indices is provided to every process.
Required imports:
import itertools
import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
Shared Numpy arrays
The main problem with applying multiprocessing on np.ndarray's is that memory sharing between processes can be difficult. For this the following class can be used:
class SharedNumpy:
    __slots__ = ('arr', 'shm', 'name', 'shared',)

    def __init__(self, arr: np.ndarray = None):
        if arr is not None:
            self.shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
            self.arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=self.shm.buf)
            self.name = self.shm.name
            np.copyto(self.arr, arr)

    def __getattr__(self, item):
        if hasattr(self.arr, item):
            return getattr(self.arr, item)
        raise AttributeError(f"{self.__class__.__name__}, doesn't have attribute {item!r}")

    def __str__(self):
        return str(self.arr)

    @classmethod
    def from_name(cls, name, shape, dtype):
        memory = cls(arr=None)
        memory.shm = shared_memory.SharedMemory(name)
        memory.arr = np.ndarray(shape, dtype=dtype, buffer=memory.shm.buf)
        memory.name = name
        return memory

    @property
    def dtype(self):
        return self.arr.dtype

    @property
    def shape(self):
        return self.arr.shape
This makes it possible to create a shared memory object in the main process and then use SharedNumpy.from_name to get it in other processes.
Simple test
A quick (non threaded) test would be:
def simple_test():
    data = np.array(np.zeros((5,) * 2))
    mem_primary = SharedNumpy(arr=data)
    mem_second = SharedNumpy.from_name(name=mem_primary.name, shape=data.shape, dtype=data.dtype)
    assert mem_primary.name == mem_second.name, "Different memory names"
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
    mem_primary.arr[2] = 5
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
    print("Completed 3/3 tests...")
A threaded test will follow later!
Distribution
The next part is focused on providing the processes with the necessary data. In this case we will provide every process with a range of indices that it has to calculate and all the data that is required to load the shared memory.
The inputs of this function are dim, the number of numpy axes, and size, the number of elements per axis.
def distributed(size, dim):
    memory = SharedNumpy(arr=np.zeros((size,) * dim))
    split_size = np.int64(np.ceil(memory.arr.size / mp.cpu_count()))
    settings = dict(
        memory=itertools.repeat(memory.name),
        shape=itertools.repeat(memory.arr.shape),
        dtype=itertools.repeat(memory.arr.dtype),
        start=np.arange(mp.cpu_count()),
        num=itertools.repeat(split_size)
    )
    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(fun, zip(*settings.values()))
    print(f"\n\nDone {dim}D, size: {size}, elements: {size ** dim}")
    return memory
Notes:
By using starmap instead of map, it is possible to provide multiple input arguments (a list of arguments for every process).
(also see docs starmap)
itertools.repeat is used to add constants to the starmap
(also see: zip() in python, how to use static values)
By using np.unravel_index, we only need to have a start index and the chunk size per process.
The start and num tell the chunks of indices that have to be converted per process, by applying range(start * num, (start + 1) * num).
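The chunking rule from the last note can be checked in isolation. The helper chunk_ranges below is a made-up name for this sketch; it reproduces range(start * num, (start + 1) * num) with the upper bound clipped to the array size, as fun does further down, so that the last chunks may be shorter or empty.

```python
import math

def chunk_ranges(total, workers):
    """Split the flat indices 0 .. total-1 into one contiguous range per worker."""
    num = math.ceil(total / workers)  # chunk size, same rounding as split_size above
    return [range(start * num, min((start + 1) * num, total)) for start in range(workers)]

for r in chunk_ranges(10, 4):
    print(list(r))
```

Every flat index ends up in exactly one range, so no two processes write to the same cell.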
Testing
For the testing I am using different input sizes and dimensions. Since the amount of data grows as size ^ dimensions, I limited the test to a size of 128 and 3 dimensions (that is 2,097,152 points, which already starts taking quite a bit of time).
Code
fun
def fun(name, shape, dtype, start, num):
    memory = SharedNumpy.from_name(name, shape=shape, dtype=dtype)
    for idx in range(start * num, min((start + 1) * num, memory.arr.size)):
        # Do something with the indices
        indices = np.unravel_index([idx], shape)
        # np.prod replaces the alias np.product, which was removed in NumPy 2.0
        memory.arr[indices] += np.prod(indices)
    memory.shm.close()  # Closes the shared memory for this process.
Running the example
if __name__ == '__main__':
    for size in [5, 10, 15]:
        for dim in [1, 2, 3]:
            memory = distributed(size, dim)
            print(memory)
            memory.shm.unlink()
For the OP's code, I used his code with a small addition that I allow the array to have different sizes and dimensions, in any case I use:
def sequential(size, dim):
    array = np.zeros((size,) * dim)
    ...
And looking at the output array of both codes, will result in the same outcomes.
Plots
The code for the graphs has been taken from the reply in:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
With the minor alteration that labels was changed to codes in
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Where the 1d, 2d and 3d reference the dimensions and the input is the size.
Sequentially (OP code):
Distributed (this code):
Results
This method works on an arbitrarily sized numpy array and is able to perform an operation on the indices of the array. It provides full access to the whole numpy array, so it can also be used to perform different kinds of statistical analysis that do not change the array.
From the timings it can be seen that for small data shapes the distributed version has little to no advantage, because of the extra cost of creating the processes. However, for larger amounts of data it starts to become more effective.
I only timed it with a short computational time (simple fun), but with more complex calculations it should outperform the sequential version much sooner.
Extra
If you are only interested in operations that are performed over or along axis, these numpy functions might help to vectorize your solutions instead of using multiprocessing:
np.apply_over_axes
np.apply_along_axis
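When fun is as simple as in the question (the current value plus the product of the indices), the whole loop can also be vectorized without any multiprocessing. The sketch below uses np.indices, which is not one of the two functions listed above; the name add_index_product is made up for this example, and it only applies when the per-cell computation can be expressed with array operations.

```python
import numpy as np

def add_index_product(array):
    """Return array + (product of each cell's indices), for any number of dimensions."""
    # np.indices returns an array of shape (ndim, *array.shape) holding each cell's index per axis.
    idx = np.indices(array.shape)
    return array + idx.prod(axis=0)

demo = add_index_product(np.ones((3, 4, 5)))
print(demo[2, 3, 4])  # cell value 1 plus 2 * 3 * 4
```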

saving and loading large numpy matrix

The below code is how I save the numpy array and it is about 27GB after saved. There are more than 200K images data and each shape is (224,224,3)
hf = h5py.File('cropped data/features_train.h5', 'w')
for i,each in enumerate(features_train):
    hf.create_dataset(str(i), data=each)
hf.close()
This is the method I used to load the data, and it takes hours for loading.
features_train = np.zeros(shape=(1,224,224,3))
hf = h5py.File('cropped data/features_train.h5', 'r')
for key in hf.keys():
    x = hf.get(key)
    x = np.array(x)
    features_train = np.append(features_train,np.array([x]),axis=0)
hf.close()
So, does anyone have a better solution for this large size of data?

You didn't tell us how much physical RAM your server has,
but 27 GiB sounds like "a lot".
Consider breaking your run into several smaller batches.
There is an old saw in java land that asks "why does this have quadratic runtime?",
that is, "why is this so slow?"
String s = "";
for (int i = 0; i < 1e6; i++) {
    s += "x";
}
The answer is that toward the end,
on each iteration we are reading ~ a million characters
then writing them, then appending a single character.
The cost is O(1e12).
Standard solution is to use a StringBuilder so we're back
to the expected O(1e6).
Here, I worry that calling np.append() pushes us into the quadratic regime.
To verify, replace the features_train assignment with a simple evaluation
of np.array([x]), so we spend a moment computing and then immediately discarding
that value on each iteration.
If the conjecture is right, runtime will be much smaller.
To remedy it, avoid calling .append().
Rather, preallocate 27 GiB with np.zeros()
(or np.empty())
and then within the loop assign each freshly read array
into the offset of its preallocated slot.
Linear runtime will allow the task to complete much more quickly.
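A minimal sketch of the preallocation pattern, kept independent of h5py so that it can be run as-is. The function name load_preallocated and its parameters are made up for this example; in the question's setting, arrays would be the datasets read one by one from the HDF5 file, count would be the number of images, and item_shape would be (224, 224, 3).

```python
import numpy as np

def load_preallocated(arrays, count, item_shape, dtype=np.float64):
    """Copy `count` arrays of shape `item_shape` into one preallocated block."""
    out = np.empty((count, *item_shape), dtype=dtype)  # single allocation up front
    for i, arr in enumerate(arrays):
        out[i] = arr  # writes into slot i; no reallocation, no copy of earlier rows
    return out

# Tiny stand-in for the real data: 4 "images" of shape (2, 2, 3).
fake_images = (np.full((2, 2, 3), float(i)) for i in range(4))
features = load_preallocated(fake_images, count=4, item_shape=(2, 2, 3))
print(features.shape)
```

Each iteration touches only its own slot, so the total work grows linearly with the number of images instead of quadratically as with repeated np.append.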

Find euclidean distance between rows of two huge CSR matrices

I have two sparse matrices, A and B. A is 120000*5000 and B is 30000*5000. I need to find the Euclidean distances between each row in B and all rows of A, and then find the 5 rows in A with the lowest distance to the selected row in B. As the data is very big I am using CSR, otherwise I get a memory error. It is clear that for each row in A it calculates (x_b - x_a)^2 5000 times, sums them, and then takes a sqrt. This process is taking a very long time, like 11 days! Is there any way I can do this more efficiently? I just need the 5 rows with the lowest distance to each row in B.
I am implementing K-Nearest Neighbours and A is my training set and B is my test set.

Well - I don't know if you could 'vectorize' that code, so that it would run in native code instead of Python. The trick to speeding up numpy and scipy is always getting that.
If you can run that code natively on a 1GHz CPU, with 1 FP instruction per clock cycle, you'd get it done in a little under 10 hours.
(5000 * 2 * 30000 * 120000) / 1024 ** 3
Raise that to 1.5GHz x 2 physical CPU cores x 4-way SIMD instructions with multiply + accumulate (Intel AVX extensions, available in most CPUs) and you could get that number crunching down to one hour, at 2 x 100% on a modest Core i5 machine. But that would require full SIMD optimization in native code, which is far from a trivial task (although, if you decide to go down this path, further questions on S.O. could get help from people eager to get their hands wet in SIMD coding :-) ). Interfacing this code in C with Scipy is not hard using cython, for example (you only need that part to get to the above 10 hour figure).
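The back-of-the-envelope figures above can be reproduced with a few lines of arithmetic. This is just the estimate from the text written out, using the same simplifying assumptions (2 floating-point operations per element pair, 1024 ** 3 operations per second for the "1GHz" case, and a combined 1.5 x 2 x 4 speedup for the faster case); the variable names are made up for this sketch.

```python
rows_a, rows_b, cols = 120_000, 30_000, 5_000

flops = cols * 2 * rows_b * rows_a      # subtract-and-square plus accumulate per element pair
seconds = flops / 1024 ** 3             # one operation per cycle at "1GHz"
hours = seconds / 3600

speedup = 1.5 * 2 * 4                   # clock ratio x physical cores x SIMD width
hours_optimized = hours / speedup

print(round(hours, 2), round(hours_optimized, 2))
```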
Now... as for algorithm optimization, and keeping things Python :-)
Fact is, you don't need to fully calculate all distances from rows in A: you just need to keep track of the 5 nearest rows so far, and any time the accumulated sum of squares gets larger than that of the 5th nearest row (so far), you just abort the calculation for that row.
You could use Python's heapq operations for that:
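import heapq
import math

def get_closer_rows(b_row, a):
    # Max-heap of the 5 best rows so far, stored as (-squared_distance, row_index)
    result = [(-math.inf, None)] * 5
    for i, a_row in enumerate(a):
        distance_sq = 0
        count = 0
        worst_sq = -result[0][0]  # squared distance of the current 5th nearest row
        for element_a, element_b in zip(a_row, b_row):
            distance_sq += (element_a - element_b) ** 2
            # Only compare every 64 elements, to keep the check cheap
            if not count % 64 and distance_sq > worst_sq:
                break
            count += 1
        else:
            if distance_sq < worst_sq:
                heapq.heapreplace(result, (-distance_sq, i))
    # Return (distance, row index) pairs, nearest first
    return sorted((math.sqrt(-d), i) for d, i in result if i is not None)

closer_rows_to_b = []
for row in b:
    closer_rows_to_b.append(get_closer_rows(row, a))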
Note the auxiliary "count", which avoids the expensive retrieval and comparison of values on every iteration.
Now, if you can run this code using pypy instead of regular Python, I believe it could get full benefit of JITting, and you could get a noticeable improvement over your times if you are running the code in pure Python (i.e.: non numpy/scipy vectorized code).

J Primes Enumeration

J will answer the n-th prime via p:n.
If I ask for the 100 millionth prime I get an almost instant answer. I cannot imagine J is sieving for that prime that quickly, but neither looking it up in a table as that table would be around 1GB in size.
There are equations giving approximations to the number of primes to a bound, but they are only approximations.
How is J finding the answer so quickly ?

J uses a table to start, then calculates
NOTE! This is speculation, based on benchmarks (shown below).
If you want to quickly try for yourself, try the following:
p:1e8 NB. near-instant
p:1e8-1 NB. noticeable pause
The low points on the graph are where J looks up the prime in a table. After that, J is calculating the value from a particular starting point so it doesn't have to calculate the entire thing. So some lucky primes will be constant time (simple table lookup) but generally there's first a table lookup, and then a calculation. But happily, it calculates starting from the previous table lookup instead of calculating the entire value.
Benchmarks
I did some benchmarking to see how p: performs on my machine (iMac i5, 16G RAM). I'm using J803. The results are interesting. I'm guessing the sawtooth pattern in the time plots (visible on the 'up to 2e5' plot) is lookup table related, while the overall log-ish shape (visible on the 'up to 1e7' plot) is CPU related.
NB. my test script
ts=:3 : 0
a=.y
while. a do.
c=.timespacex 'p:(1e4*a)' NB. 1000 times a
a=.<:a
b=.c;b
end.
}:b
)
a =: ts 200
require'plot'
plot >0{each a NB. time
plot >1{each a NB. space
[Plots: time and space of p: for arguments up to 2e5, and for arguments up to 1e7]
During these runs one core was hovering around 100%:
Also, the voc page states:
Currently, arguments larger than 2^31 are tested to be prime according to a probabilistic algorithm (Miller-Rabin).
And in addition to a prime lookup table as @Mauris points out, v2.c contains this function:
static F1(jtdetmr){A z;B*zv;I d,h,i,n,wn,*wv;
RZ(w=vi(w));
wn=AN(w); wv=AV(w);
GA(z,B01,wn,AR(w),AS(w)); zv=BAV(z);
for(i=0;i<wn;++i){
n=*wv++;
if(1>=n||!(1&n)||0==n%3||0==n%5){*zv++=0; continue;}
h=0; d=n-1; while(!(1&d)){++h; d>>=1;}
if (n< 9080191)*zv++=spspd(31,n,d,h)&&spspd(73,n,d,h);
else if(n<94906266)*zv++=spspd(2 ,n,d,h)&&spspd( 7,n,d,h)&&spspd(61,n,d,h);
else *zv++=spspx(2 ,n,d,h)&&spspx( 7,n,d,h)&&spspx(61,n,d,h);
}
RE(0); R z;
} /* deterministic Miller-Rabin */
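For readers who do not want to decode the C macros, here is a rough Python sketch of the same idea: a deterministic Miller-Rabin test that picks its bases by the size of n. It is not a translation of J's code. The function names are made up, small primes are handled directly instead of being rejected, and the sketch relies on the known results that bases 31 and 73 suffice below 9,080,191 and that bases 2, 7 and 61 suffice below 4,759,123,141. Python integers do not overflow, so the separate spspd/spspx variants are not needed here.

```python
def is_strong_probable_prime(a, n, d, h):
    """Strong probable-prime test of odd n to base a, where n - 1 == d * 2**h and d is odd."""
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(h - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False

def is_prime_det(n):
    """Deterministic Miller-Rabin for n < 4,759,123,141 (assumed bound for bases 2, 7, 61)."""
    if n >= 4_759_123_141:
        raise ValueError("bases used here are only known to be sufficient below 4,759,123,141")
    if n < 2:
        return False
    # Handle the bases themselves and a few small primes by direct division.
    for p in (2, 3, 5, 7, 31, 61, 73):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**h with d odd.
    h, d = 0, n - 1
    while d % 2 == 0:
        h += 1
        d //= 2
    bases = (31, 73) if n < 9_080_191 else (2, 7, 61)
    return all(is_strong_probable_prime(a, n, d, h) for a in bases)

print([n for n in range(50) if is_prime_det(n)])
```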
