Is there a way to speed up this function?

Is there a way to speed up this function? - python-3.x

I'm comparing performance of this F# function:
let e28 N =
seq {for i in 2L..2L..N do for j in 1..4 -> i} |> Seq.scan (+) 1L |> Seq.sum
with Python 3.3 equivalents:
def e28a(N = 100000):
diagNumber = 1
sum = diagNumber
for width in range(2, N+1, 2):
for j in range(4):
diagNumber += width
sum += diagNumber
return sum
import itertools as it
def e28b(N = 100000):
return sum(it.accumulate(it.chain([1], (i for i in range(2, N+1, 2) for j in range(4)))))
import numpy as np
def e28c(N = 100000):
return np.sum(np.cumsum(np.fromiter(chain([1], (i for i in range(2, N+1, 2) for j in range(4))), np.int64)))
and I'm getting 64-bit CPython 3.3.1 performance on Windows 7 about 574 times slower than C++. Here are the times for N = 100000:
e28: 23ms; e28a: 48.4ms; e28b: 49.7ms; e28c: 40.2ms; C++ version: 0.07ms
Is there a low hanging fruit in optimizing Python code without altering the underlying algorithm?

The F# version can be sped up by ~10x by switching to a procedural, mutable approach (like your python e28a). When the "payload operation" (in this case, just +) is so trivial, the use of combinators ends up adding a relatively significant overhead. As a side note, Seq.sum uses checked arithmetic, which also adds a touch of overhead.
One of the nice things about F# is that you can fall back to procedural/mutable style if needed for a perf-critical hot path.
let e28_original N =
seq {
for i in 2UL..2UL..N do
for j in 1..4 do
yield i
}
|> Seq.scan (+) 1UL
|> Seq.sum
let e28_mutable N =
let mutable sum = 1UL
let mutable total = sum
for i in 2UL..2UL..N do
for j in 1..4 do
sum <- sum + i
total <- total + sum
total
let time f =
f () |> ignore // allow for warmup / JIT
let sw = System.Diagnostics.Stopwatch.StartNew()
let result = f ()
sw.Stop()
printfn "Result: %A Elapsed: %A" result sw.Elapsed
time (fun _ -> e28_original 100000UL)
time (fun _ -> e28_mutable 100000UL)
Result
Result: 666691667100001UL Elapsed: 00:00:00.0429414
Result: 666691667100001UL Elapsed: 00:00:00.0034971

Using your F# version I got:
> e28(100000L);;
Real: 00:00:00.061, CPU: 00:00:00.062, GC gen0: 2, gen1: 0, gen2: 0
val it : int64 = 666691667100001L
Using:
let e28d N =
seq {2L..2L..N}
|> Seq.collect(fun x->seq{yield x;yield x; yield x; yield x})
|> Seq.scan (+) 1L
|> Seq.sum
I got:
> e28d(100000L);;
Real: 00:00:00.040, CPU: 00:00:00.031, GC gen0: 2, gen1: 0, gen2: 0
val it : int64 = 666691667100001L
You will probably have a difficult time getting the python to perform quite as well as the F# simply since F# is compiled and Python is interpreted. That being said, the above improvement will work for the python as well:
>>> def e28a(N = 100000):
diagNumber = 1;
sum = diagNumber;
for width in range(2, N+1, 2):
for j in range(4):
diagNumber += width;
sum += diagNumber;
return sum;
>>> if __name__ == '__main__':
import timeit
print(timeit.timeit("e28a()", setup="from __main__ import e28a", number=10))
0.5249497228663813
>>> def e28a(N = 100000):
diagNumber = 1;
sum = diagNumber;
for width in range(2, N+1, 2):
diagNumber += width;
sum += diagNumber;
diagNumber += width;
sum += diagNumber;
diagNumber += width;
sum += diagNumber;
diagNumber += width;
sum += diagNumber;
return sum;
>>> if __name__ == '__main__':
import timeit
print(timeit.timeit("e28a()", setup="from __main__ import e28a", number=10))
0.2585966329330063
>>>
Part of this improvement comes from fewer function calls, i.e.:
>>> def e28a(N = 100000):
diagNumber = 1;
sum = diagNumber;
temp_range = range(4) #Change here
for width in range(2, N+1, 2):
for j in temp_range: #Change here
diagNumber += width;
sum += diagNumber;
return sum;
>>> if __name__ == '__main__':
import timeit
print(timeit.timeit("e28a()", setup="from __main__ import e28a", number=10))
0.40251470339956086
>>>
And I think the other part comes from removing the loop. Both of these can be fairly expensive in Python.

This is almost twice as fast on my machine. It uses memoization, and also basic arithmetic deduction.
You have to define a global variable.
summi=2
def e28d(N = 100000):
def memo(width):
global summi
summi+=width*4+4
return summi-width*2+2
x= sum((memo(width*4)) for width in range (2, N+1, 2))+1
return x
Results:
e28a:
0.0591201782227 seconds
e28d:
0.0349650382996 seconds
Hope it is at least constructive. Note: you would have to modulate it according to whether the number is odd or not.
Update:
Here is a function that runs about a hundred times faster in python (about 0.5 ms for N=100000), by avoiding loops totally:
import math
def e28e(X = 100000):
keyint, keybool=int(X/6), X%6
if keybool/2==0: keyvar=(16*keyint+sum(range(keyint))*12)
elif keybool/2==1: keyvar=(44*keyint+sum(range(keyint))*36+7)
else: keyvar=(28*(keyint+1)+sum(range(keyint+1))*60-2)
X-=keybool%2
diag= math.pow(X,2)+2*X+1
newvar=keyint+int(X/2)+1
summ= int(diag*newvar+keyvar)
return summ

Related

find sequence of elements in numpy array [duplicate]

In Python or NumPy, what is the best way to find out the first occurrence of a subarray?
For example, I have
a = [1, 2, 3, 4, 5, 6]
b = [2, 3, 4]
What is the fastest way (run-time-wise) to find out where b occurs in a? I understand for strings this is extremely easy, but what about for a list or numpy ndarray?
Thanks a lot!
[EDITED] I prefer the numpy solution, since from my experience numpy vectorization is much faster than Python list comprehension. Meanwhile, the big array is huge, so I don't want to convert it into a string; that will be (too) long.

I'm assuming you're looking for a numpy-specific solution, rather than a simple list comprehension or for loop. One straightforward approach is to use the rolling window technique to search for windows of the appropriate size.
This approach is simple, works correctly, and is much faster than any pure Python solution. It should be sufficient for many use cases. However, it is not the most efficient approach possible, for a number of reasons. For an approach that is more complicated, but asymptotically optimal in the expected case, see the numba-based rolling hash implementation in norok2's answer.
Here's the rolling_window function:
>>> def rolling_window(a, size):
... shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
... strides = a.strides + (a. strides[-1],)
... return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
...
Then you could do something like
>>> a = numpy.arange(10)
>>> numpy.random.shuffle(a)
>>> a
array([7, 3, 6, 8, 4, 0, 9, 2, 1, 5])
>>> rolling_window(a, 3) == [8, 4, 0]
array([[False, False, False],
[False, False, False],
[False, False, False],
[ True, True, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
To make this really useful, you'd have to reduce it along axis 1 using all:
>>> numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
array([False, False, False, True, False, False, False, False], dtype=bool)
Then you could use that however you'd use a boolean array. A simple way to get the index out:
>>> bool_indices = numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
>>> numpy.mgrid[0:len(bool_indices)][bool_indices]
array([3])
For lists you could adapt one of these rolling window iterators to use a similar approach.
For very large arrays and subarrays, you could save memory like this:
>>> windows = rolling_window(a, 3)
>>> sub = [8, 4, 0]
>>> hits = numpy.ones((len(a) - len(sub) + 1,), dtype=bool)
>>> for i, x in enumerate(sub):
... hits &= numpy.in1d(windows[:,i], [x])
...
>>> hits
array([False, False, False, True, False, False, False, False], dtype=bool)
>>> hits.nonzero()
(array([3]),)
On the other hand, this will probably be somewhat slower.

The following code should work:
[x for x in xrange(len(a)) if a[x:x+len(b)] == b]
Returns the index at which the pattern starts.

(EDITED to include a deeper discussion, better code and more benchmarks)
Summary
For raw speed and efficiency, one can use a Cython or Numba accelerated version (when the input is a Python sequence or a NumPy array, respectively) of one of the classical algorithms.
The recommended approaches are:
find_kmp_cy() for Python sequences (list, tuple, etc.)
find_kmp_nb() for NumPy arrays
Other efficient approaches, are find_rk_cy() and find_rk_nb() which, are more memory efficient but are not guaranteed to run in linear time.
If Cython / Numba are not available, again both find_kmp() and find_rk() are a good all-around solution for most use cases, although in the average case and for Python sequences, the naïve approach, in some form, notably find_pivot(), may be faster. For NumPy arrays, find_conv() (from #Jaime answer) outperforms any non-accelerated naïve approach.
(Full code is below, and here and there.)
Theory
This is a classical problem in computer science that goes by the name of string-searching or string matching problem.
The naive approach, based on two nested loops, has a computational complexity of O(n + m) on average, but worst case is O(n m).
Over the years, a number of alternative approaches have been developed which guarantee a better worst case performances.
Of the classical algorithms, the ones that can be best suited to generic sequences (since they do not rely on an alphabet) are:
the naïve algorithm (basically consisting of two nested loops)
the Knuth–Morris–Pratt (KMP) algorithm
the Rabin-Karp (RK) algorithm
This last algorithm relies on the computation of a rolling hash for its efficiency and therefore may require some additional knowledge of the input for optimal performance.
Eventually, it is best suited for homogeneous data, like for example numeric arrays.
A notable example of numeric arrays in Python is, of course, NumPy arrays.
Remarks
The naïve algorithm, by being so simple, lends itself to different implementations with various degrees of run-time speed in Python.
The other algorithms are less flexible in what can be optimized via language tricks.
Explicit looping in Python may be a speed bottleneck and several tricks can be used to perform the looping outside of the interpreter.
Cython is especially good at speeding up explicit loops for generic Python code.
Numba is especially good at speeding up explicit loops on NumPy arrays.
This is an excellent use-case for generators, so all the code will be using those instead of regular functions.
Python Sequences (list, tuple, etc.)
Based on the Naïve Algorithm
find_loop(), find_loop_cy() and find_loop_nb() which are the explicit-loop only implementation in pure Python, Cython and with Numba JITing respectively. Note the forceobj=True in the Numba version, which is required because we are using Python object inputs.
def find_loop(seq, subseq):
n = len(seq)
m = len(subseq)
for i in range(n - m + 1):
found = True
for j in range(m):
if seq[i + j] != subseq[j]:
found = False
break
if found:
yield i
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
def find_loop_cy(seq, subseq):
cdef Py_ssize_t n = len(seq)
cdef Py_ssize_t m = len(subseq)
for i in range(n - m + 1):
found = True
for j in range(m):
if seq[i + j] != subseq[j]:
found = False
break
if found:
yield i
find_loop_nb = nb.jit(find_loop, forceobj=True)
find_loop_nb.__name__ = 'find_loop_nb'
find_all() replaces the inner loop with all() on a comprehension generator
def find_all(seq, subseq):
n = len(seq)
m = len(subseq)
for i in range(n - m + 1):
if all(seq[i + j] == subseq[j] for j in range(m)):
yield i
find_slice() replaces the inner loop with direct comparison == after slicing []
def find_slice(seq, subseq):
n = len(seq)
m = len(subseq)
for i in range(n - m + 1):
if seq[i:i + m] == subseq:
yield i
find_mix() and find_mix2() replaces the inner loop with direct comparison == after slicing [] but includes one or two additional short-circuiting on the first (and last) character which may be faster because slicing with an int is much faster than slicing with a slice().
def find_mix(seq, subseq):
n = len(seq)
m = len(subseq)
for i in range(n - m + 1):
if seq[i] == subseq[0] and seq[i:i + m] == subseq:
yield i
def find_mix2(seq, subseq):
n = len(seq)
m = len(subseq)
for i in range(n - m + 1):
if seq[i] == subseq[0] and seq[i + m - 1] == subseq[m - 1] \
and seq[i:i + m] == subseq:
yield i
find_pivot() and find_pivot2() replace the outer loop with multiple .index() call using the first item of the sub-sequence, while using slicing for the inner loop, eventually with additional short-circuiting on the last item (the first matches by construction). The multiple .index() calls are wrapped in a index_all() generator (which may be useful on its own).
def index_all(seq, item, start=0, stop=-1):
try:
n = len(seq)
if n > 0:
start %= n
stop %= n
i = start
while True:
i = seq.index(item, i)
if i <= stop:
yield i
i += 1
else:
return
else:
return
except ValueError:
pass
def find_pivot(seq, subseq):
n = len(seq)
m = len(subseq)
if m > n:
return
for i in index_all(seq, subseq[0], 0, n - m):
if seq[i:i + m] == subseq:
yield i
def find_pivot2(seq, subseq):
n = len(seq)
m = len(subseq)
if m > n:
return
for i in index_all(seq, subseq[0], 0, n - m):
if seq[i + m - 1] == subseq[m - 1] and seq[i:i + m] == subseq:
yield i
Based on Knuth–Morris–Pratt (KMP) Algorithm
find_kmp() is a plain Python implementation of the algorithm. Since there is no simple looping or places where one could use slicing with a slice(), there is not much to be done for optimization, except using Cython (Numba would require again forceobj=True which would lead to slow code).
def find_kmp(seq, subseq):
n = len(seq)
m = len(subseq)
# : compute offsets
offsets = [0] * m
j = 1
k = 0
while j < m:
if subseq[j] == subseq[k]:
k += 1
offsets[j] = k
j += 1
else:
if k != 0:
k = offsets[k - 1]
else:
offsets[j] = 0
j += 1
# : find matches
i = j = 0
while i < n:
if seq[i] == subseq[j]:
i += 1
j += 1
if j == m:
yield i - j
j = offsets[j - 1]
elif i < n and seq[i] != subseq[j]:
if j != 0:
j = offsets[j - 1]
else:
i += 1
find_kmp_cy() is Cython implementation of the algorithm where the indices use C int data type, which result in much faster code.
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
def find_kmp_cy(seq, subseq):
cdef Py_ssize_t n = len(seq)
cdef Py_ssize_t m = len(subseq)
# : compute offsets
offsets = [0] * m
cdef Py_ssize_t j = 1
cdef Py_ssize_t k = 0
while j < m:
if subseq[j] == subseq[k]:
k += 1
offsets[j] = k
j += 1
else:
if k != 0:
k = offsets[k - 1]
else:
offsets[j] = 0
j += 1
# : find matches
cdef Py_ssize_t i = 0
j = 0
while i < n:
if seq[i] == subseq[j]:
i += 1
j += 1
if j == m:
yield i - j
j = offsets[j - 1]
elif i < n and seq[i] != subseq[j]:
if j != 0:
j = offsets[j - 1]
else:
i += 1
Based on Rabin-Karp (RK) Algorithm
find_rk() is a pure Python implementation, which relies on Python's hash() for the computation (and comparison) of the hash. Such hash is made rolling by mean of a simple sum(). The roll-over is then computed from the previous hash by subtracting the result of hash() on the just visited item seq[i - 1] and adding up the result of hash() on the newly considered item seq[i + m - 1].
def find_rk(seq, subseq):
n = len(seq)
m = len(subseq)
if seq[:m] == subseq:
yield 0
hash_subseq = sum(hash(x) for x in subseq) # compute hash
curr_hash = sum(hash(x) for x in seq[:m]) # compute hash
for i in range(1, n - m + 1):
curr_hash += hash(seq[i + m - 1]) - hash(seq[i - 1]) # update hash
if hash_subseq == curr_hash and seq[i:i + m] == subseq:
yield i
find_rk_cy() is Cython implementation of the algorithm where the indices use the appropriate C data type, which results in much faster code. Note that hash() truncates "the return value based on the bit width of the host machine."
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
def find_rk_cy(seq, subseq):
cdef Py_ssize_t n = len(seq)
cdef Py_ssize_t m = len(subseq)
if seq[:m] == subseq:
yield 0
cdef Py_ssize_t hash_subseq = sum(hash(x) for x in subseq) # compute hash
cdef Py_ssize_t curr_hash = sum(hash(x) for x in seq[:m]) # compute hash
cdef Py_ssize_t old_item, new_item
for i in range(1, n - m + 1):
old_item = hash(seq[i - 1])
new_item = hash(seq[i + m - 1])
curr_hash += new_item - old_item # update hash
if hash_subseq == curr_hash and seq[i:i + m] == subseq:
yield i
Benchmarks
The above functions are evaluated on two inputs:
random inputs
def gen_input(n, k=2):
return tuple(random.randint(0, k - 1) for _ in range(n))
(almost) worst inputs for the naïve algorithm
def gen_input_worst(n, k=-2):
result = [0] * n
result[k] = 1
return tuple(result)
The subseq has fixed size (32).
Since there are so many alternatives, two separate grouping have been done and some solutions with very small variations and almost identical timings have been omitted (i.e. find_mix2() and find_pivot2()).
For each group both inputs are tested.
For each benchmark the full plot and a zoom on the fastest approach is provided.
Naïve on Random
Naïve on Worst
Other on Random
Other on Worst
(Full code is available here.)
NumPy Arrays
Based on the Naïve Algorithm
find_loop(), find_loop_cy() and find_loop_nb() which are the explicit-loop only implementation in pure Python, Cython and with Numba JITing respectively. The code for the first two are the same as above and hence omitted. find_loop_nb() now enjoys fast JIT compilation. The inner loop has been written in a separate function because it can then be reused for find_rk_nb() (calling Numba functions inside Numba functions does not incur in the function call penalty typical of Python).
#nb.jit
def _is_equal_nb(seq, subseq, m, i):
for j in range(m):
if seq[i + j] != subseq[j]:
return False
return True
#nb.jit
def find_loop_nb(seq, subseq):
n = len(seq)
m = len(subseq)
for i in range(n - m + 1):
if _is_equal_nb(seq, subseq, m, i):
yield i
find_all() is the same as above, while find_slice(), find_mix() and find_mix2() are almost identical to the above, the only difference is that seq[i:i + m] == subseq is now the argument of np.all(): np.all(seq[i:i + m] == subseq).
find_pivot() and find_pivot2() share the same ideas as above, except that now uses np.where() instead of index_all() and the need for enclosing the array equality inside an np.all() call.
def find_pivot(seq, subseq):
n = len(seq)
m = len(subseq)
if m > n:
return
max_i = n - m
for i in np.where(seq == subseq[0])[0]:
if i > max_i:
return
elif np.all(seq[i:i + m] == subseq):
yield i
def find_pivot2(seq, subseq):
n = len(seq)
m = len(subseq)
if m > n:
return
max_i = n - m
for i in np.where(seq == subseq[0])[0]:
if i > max_i:
return
elif seq[i + m - 1] == subseq[m - 1] \
and np.all(seq[i:i + m] == subseq):
yield i
find_rolling() express the looping via a rolling window and the matching is checked with np.all(). This vectorizes all the looping at the expenses of creating large temporary objects, while still substantially appling the naïve algorithm. (The approach is from #senderle answer).
def rolling_window(arr, size):
shape = arr.shape[:-1] + (arr.shape[-1] - size + 1, size)
strides = arr.strides + (arr.strides[-1],)
return np.lib.stride_tricks.as_strided(arr, shape=shape, strides=strides)
def find_rolling(seq, subseq):
bool_indices = np.all(rolling_window(seq, len(subseq)) == subseq, axis=1)
yield from np.mgrid[0:len(bool_indices)][bool_indices]
find_rolling2() is a slightly more memory efficient variation of the above, where the vectorization is only partial and one explicit looping (along the expected shortest dimension -- the length of subseq) is kept. (The approach is also from #senderle answer).
def find_rolling2(seq, subseq):
windows = rolling_window(seq, len(subseq))
hits = np.ones((len(seq) - len(subseq) + 1,), dtype=bool)
for i, x in enumerate(subseq):
hits &= np.in1d(windows[:, i], [x])
yield from hits.nonzero()[0]
Based on Knuth–Morris–Pratt (KMP) Algorithm
find_kmp() is the same as above, while find_kmp_nb() is a straightforward JIT-compilation of that.
find_kmp_nb = nb.jit(find_kmp)
find_kmp_nb.__name__ = 'find_kmp_nb'
Based on Rabin-Karp (RK) Algorithm
find_rk() is the same as the above, except that again seq[i:i + m] == subseq is enclosed in an np.all() call.
find_rk_nb() is the Numba accelerated version of the above. Uses _is_equal_nb() defined earlier to definitively determine a match, while for the hashing, it uses a Numba accelerated sum_hash_nb() function whose definition is pretty straightforward.
#nb.jit
def sum_hash_nb(arr):
result = 0
for x in arr:
result += hash(x)
return result
#nb.jit
def find_rk_nb(seq, subseq):
n = len(seq)
m = len(subseq)
if _is_equal_nb(seq, subseq, m, 0):
yield 0
hash_subseq = sum_hash_nb(subseq) # compute hash
curr_hash = sum_hash_nb(seq[:m]) # compute hash
for i in range(1, n - m + 1):
curr_hash += hash(seq[i + m - 1]) - hash(seq[i - 1]) # update hash
if hash_subseq == curr_hash and _is_equal_nb(seq, subseq, m, i):
yield i
find_conv() uses a pseudo Rabin-Karp method, where initial candidates are hashed using the np.dot() product and located on the convolution between seq and subseq with np.where(). The approach is pseudo because, while it still uses hashing to identify probable candidates, it is may not be regarded as a rolling hash (it depends on the actual implementation of np.correlate()). Also, it needs to create a temporary array the size of the input. (The approach is from #Jaime answer).
def find_conv(seq, subseq):
target = np.dot(subseq, subseq)
candidates = np.where(np.correlate(seq, subseq, mode='valid') == target)[0]
check = candidates[:, np.newaxis] + np.arange(len(subseq))
mask = np.all((np.take(seq, check) == subseq), axis=-1)
yield from candidates[mask]
Benchmarks
Like before, the above functions are evaluated on two inputs:
random inputs
def gen_input(n, k=2):
return np.random.randint(0, k, n)
(almost) worst inputs for the naïve algorithm
def gen_input_worst(n, k=-2):
result = np.zeros(n, dtype=int)
result[k] = 1
return result
The subseq has fixed size (32).
This plots follow the same scheme as before, summarized below for convenience.
Since there are so many alternatives, two separate grouping have been done and some solutions with very small variations and almost identical timings have been omitted (i.e. find_mix2() and find_pivot2()).
For each group both inputs are tested.
For each benchmark the full plot and a zoom on the fastest approach is provided.
Naïve on Random
Naïve on Worst
Other on Random
Other on Worst
(Full code is available here.)

A convolution based approach, that should be more memory efficient than the stride_tricks based approach:
def find_subsequence(seq, subseq):
target = np.dot(subseq, subseq)
candidates = np.where(np.correlate(seq,
subseq, mode='valid') == target)[0]
# some of the candidates entries may be false positives, double check
check = candidates[:, np.newaxis] + np.arange(len(subseq))
mask = np.all((np.take(seq, check) == subseq), axis=-1)
return candidates[mask]
With really big arrays it may not be possible to use a stride_tricks approach, but this one still works:
haystack = np.random.randint(1000, size=(1e6))
needle = np.random.randint(1000, size=(100,))
# Hide 10 needles in the haystack
place = np.random.randint(1e6 - 100 + 1, size=10)
for idx in place:
haystack[idx:idx+100] = needle
In [3]: find_subsequence(haystack, needle)
Out[3]:
array([253824, 321497, 414169, 456777, 635055, 879149, 884282, 954848,
961100, 973481], dtype=int64)
In [4]: np.all(np.sort(place) == find_subsequence(haystack, needle))
Out[4]: True
In [5]: %timeit find_subsequence(haystack, needle)
10 loops, best of 3: 79.2 ms per loop

you can call tostring() method to convert an array to string, and then you can use fast string search. this method maybe faster when you have many subarray to check.
import numpy as np
a = np.array([1,2,3,4,5,6])
b = np.array([2,3,4])
print a.tostring().index(b.tostring())//a.itemsize

Another try, but I'm sure there is more pythonic & efficent way to do that ...
def array_match(a, b):
for i in xrange(0, len(a)-len(b)+1):
if a[i:i+len(b)] == b:
return i
return None
a = [1, 2, 3, 4, 5, 6]
b = [2, 3, 4]
print array_match(a,b)
1
(This first answer was not in scope of the question, as cdhowie mentionned)
set(a) & set(b) == set(b)

Here is a rather straight-forward option:
def first_subarray(full_array, sub_array):
n = len(full_array)
k = len(sub_array)
matches = np.argwhere([np.all(full_array[start_ix:start_ix+k] == sub_array)
for start_ix in range(0, n-k+1)])
return matches[0]
Then using the original a, b vectors we get:
a = [1, 2, 3, 4, 5, 6]
b = [2, 3, 4]
first_subarray(a, b)
Out[44]:
array([1], dtype=int64)

Quick comparison of three of the proposed solutions (average time of 100 iteration for randomly created vectors.):
import time
import collections
import numpy as np
def function_1(seq, sub):
# direct comparison
seq = list(seq)
sub = list(sub)
return [i for i in range(len(seq) - len(sub)) if seq[i:i+len(sub)] == sub]
def function_2(seq, sub):
# Jamie's solution
target = np.dot(sub, sub)
candidates = np.where(np.correlate(seq, sub, mode='valid') == target)[0]
check = candidates[:, np.newaxis] + np.arange(len(sub))
mask = np.all((np.take(seq, check) == sub), axis=-1)
return candidates[mask]
def function_3(seq, sub):
# HYRY solution
return seq.tostring().index(sub.tostring())//seq.itemsize
# --- assessment time performance
N = 100
seq = np.random.choice([0, 1, 2, 3, 4, 5, 6], 3000)
sub = np.array([1, 2, 3])
tim = collections.OrderedDict()
tim.update({function_1: 0.})
tim.update({function_2: 0.})
tim.update({function_3: 0.})
for function in tim.keys():
for _ in range(N):
seq = np.random.choice([0, 1, 2, 3, 4], 3000)
sub = np.array([1, 2, 3])
start = time.time()
function(seq, sub)
end = time.time()
tim[function] += end - start
timer_dict = collections.OrderedDict()
for key, val in tim.items():
timer_dict.update({key.__name__: val / N})
print(timer_dict)
Which would result (on my old machine) in:
OrderedDict([
('function_1', 0.0008518099784851074),
('function_2', 8.157730102539063e-05),
('function_3', 6.124973297119141e-06)
])

First, convert the list to string.
a = ''.join(str(i) for i in a)
b = ''.join(str(i) for i in b)
After converting to string, you can easily find the index of substring with the following string function.
a.index(b)
Cheers!!

Python Multiprocessing (Splitting data in smaller chunks - multiple function arguments)

Note from 22.02.21:
-Potentially my problem could also be solved by a more efficient memory usage instead of multiprocessing, since I realized that the memory load gets very high and might be a limiting factor here.
I'm trying to reduce the time that my script needs to run by making use of multiprocessing.
In the past I got some good tips about increasing the speed of the function itself (Increase performance of np.where() loop), but now I would like to make use of all cores of a 32-core workstation.
My function compares entries of two lists (X and Y) with a reference lists Q and Z. For every element in X/Y, it checks whether X[i] occurs somewhere in Q and whether Y[i] occurs in Z. If X[i] == Q[s] AND Y[i] == Z[s], it returns the index "s".
(Note: My real data consists of DNA sequencing reads and I need to map my reads to the reference.)
What I tried so far:
Splitting my long lists X and Y into even chunks (n-chunks, where n == cpu_count)
Trying the "concurrent.futures.ProcessPoolExecutor()" to run the function for each "sublist" in parallel and in the end combine the result of each process to one final dictionary (matchdict). (--> see commented out section)
My problem:
All cores are getting used when I uncomment the multiprocessing section but it ends up with an error (index out of range) which I could not yet resolve. (--> Tip: lower N to 1000 and you will immediately see the error without waiting forever)
Does anyone know how to solve this, or can suggest a better approach to use multiprocessing in my code?
Here is the code:
import numpy as np
import multiprocessing
import concurrent.futures
np.random.seed(1)
def matchdictfunc(index,x,y,q,z): # function to match entries of X and Y to Q and Z and get index of Q/Z where their values match X/Y
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
matchdict = {}
for ind, match in enumerate(matchlist):
matchdict[index[ind]] = match
return matchdict
def split(a, n): # function to split list in n even parts
k, m = divmod(len(a), n)
return list((a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)))
def splitinput(index,X,Y,Q,Z): # split large lists X and Y in n-even parts (n = cpu_count), make new list containing n-times Q and Z (to feed Q and Z for every process)
cpu_count = multiprocessing.cpu_count()
#create multiple chunks for X and Y and index:
index_split = split(index,cpu_count)
X_split = split(X,cpu_count)
Y_split = split(Y,cpu_count)
# create list with several times Q and Z since it needs to be same length as X_split etc:
Q_mult = []
Z_mult = []
for _ in range(cpu_count):
Q_mult.append(Q)
Z_mult.append(Z)
return index_split,X_split,Y_split,Q_mult,Z_mult
# N will finally scale up to 10^9
N = 10000000
M = 300
index = [str(x) for x in list(range(N))]
X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)
# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
# convert int32 arrays to str64 arrays and then to list, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X).tolist()
Y = np.char.mod('%d', Y).tolist()
Q = np.char.mod('%d', Q).tolist()
Z = np.char.mod('%d', Z).tolist()
# single-core:
matchdict = matchdictfunc(index,X,Y,Q,Z)
# split lists to number of processors (cpu_count)
index_split,X_split,Y_split,Q_mult,Z_mult = splitinput(index,X,Y,Q,Z)
## Multiprocessing attempt - FAILS! (index out of range)
# finallist = []
# if __name__ == '__main__':
# with concurrent.futures.ProcessPoolExecutor() as executor:
# results = executor.map(matchlistfunc,X_split,Y_split,Q_mult,Z_mult)
# for result in results:
# finallist.append(result)
# matchdict = {}
# for d in finallist:
# matchdict.update(d)

Your function matchdictfunc currently has arguments x, y, q, z but in fact does not use them, although in the multiprocessing version it will need to use two arguments. There is also no need for function splitinput to replicate Q and Z into returned values Q_split and Z_split. Currently, matchdictfunc is expecting Q and Z to be global variables and we can arrange for that to be the case in the multiprocessing version by using the initializer and initargs arguments when constructing the pool. You should also move code that you do not need to be executed by the sub-processes into the block controlled by if __name__ == '__main__':, such as the arary initialization code. These changes result in:
import numpy as np
import multiprocessing
import concurrent.futures
MULTIPROCESSING = True
def init_pool(q, z):
global Q, Z
Q = q
Z = z
def matchdictfunc(index, X, Y): # function to match entries of X and Y to Q and Z and get index of Q/Z where their values match X/Y
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
matchdict = {}
for ind, match in enumerate(matchlist):
matchdict[index[ind]] = match
return matchdict
def split(a, n): # function to split list in n even parts
k, m = divmod(len(a), n)
return list((a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)))
def splitinput(index, X, Y): # split large lists X and Y in n-even parts (n = cpu_count))
cpu_count = multiprocessing.cpu_count()
#create multiple chunks for X and Y and index:
index_split = split(index,cpu_count)
X_split = split(X,cpu_count)
Y_split = split(Y,cpu_count)
return index_split, X_split ,Y_split
def main():
# following required for non-multiprocessing
if not MULTIPROCESSING:
global Q, Z
np.random.seed(1)
# N will finally scale up to 10^9
N = 10000000
M = 300
index = [str(x) for x in list(range(N))]
X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)
# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
# convert int32 arrays to str64 arrays and then to list, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X).tolist()
Y = np.char.mod('%d', Y).tolist()
Q = np.char.mod('%d', Q).tolist()
Z = np.char.mod('%d', Z).tolist()
# for non-multiprocessing:
if not MULTIPROCESSING:
matchdict = matchdictfunc(index, X, Y)
else:
# for multiprocessing:
# split lists to number of processors (cpu_count)
index_split, X_split, Y_split = splitinput(index, X, Y)
with concurrent.futures.ProcessPoolExecutor(initializer=init_pool, initargs=(Q, Z)) as executor:
finallist = [result for result in executor.map(matchdictfunc, index_split, X_split, Y_split)]
matchdict = {}
for d in finallist:
matchdict.update(d)
#print(matchdict)
if __name__ == '__main__':
main()
Note: I tried this for a smaller value of N = 1000 (printing out the results of matchdict) and the multiprocessing version seemed to return the same results. My machine does not have the resources to run with the full value of N without freezing up everything else.
Another Approach
I am working under the assumption that your DNA data is external and the X and Y values can be read n values at a time or can be read in and written out so that this is possible. Then rather than having all the data resident in memory and splitting it up into 32 pieces, I propose that it be read n values at a time and thus broken up into approximately N/n pieces.
In the following code I have switched to using the imap method from class multiprocessing.pool.Pool. The advantage is that it lazily submits tasks to the process pool, that is, the iterable argument doesn't have to be a list or convertible to a list. Instead the pool will iterate over the iterable sending tasks to the pool in chunksize groups. In the code below, I have used a generator function for the argument to imap, which will generate successive X and Y values. Your actual generator function would first open the DNA file (or files) and read in successive portions of the file.
import numpy as np
import multiprocessing
def init_pool(q, z):
global Q, Z
Q = q
Z = z
def matchdictfunc(t): # function to match entries of X and Y to Q and Z and get index of Q/Z where their values match X/Y
index, X, Y = t
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
matchdict = {}
for ind, match in enumerate(matchlist):
matchdict[index[ind]] = match
return matchdict
def next_tuple(n, stop, M):
start = 0
while True:
end = min(start + n, stop)
index = [str(x) for x in list(range(start, end))]
x = np.random.randint(M, size=n)
y = np.random.randint(M, size=n)
# convert int32 arrays to str64 arrays and then to list, to represent original data (which are strings and not numbers)
x = np.char.mod('%d', x).tolist()
y = np.char.mod('%d', y).tolist()
yield (index, x, y)
start = end
if start >= stop:
break
def compute_chunksize(XY_AT_A_TIME, N):
n_tasks, remainder = divmod(N, XY_AT_A_TIME)
if remainder:
n_tasks += 1
chunksize, remainder = divmod(n_tasks, multiprocessing.cpu_count() * 4)
if remainder:
chunksize += 1
return chunksize
def main():
np.random.seed(1)
# N will finally scale up to 10^9
N = 10000000
M = 300
# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
# convert int32 arrays to str64 arrays and then to list, to represent original data (which are strings and not numbers)
Q = np.char.mod('%d', Q).tolist()
Z = np.char.mod('%d', Z).tolist()
matchdict = {}
# number of X, Y pairs at a time:
# experiment with this, especially as N increases:
XY_AT_A_TIME = 10000
chunksize = compute_chunksize(XY_AT_A_TIME, N)
#print('chunksize =', chunksize) # 32 with 8 cores
with multiprocessing.Pool(initializer=init_pool, initargs=(Q, Z)) as pool:
for d in pool.imap(matchdictfunc, next_tuple(XY_AT_A_TIME, N, M), chunksize):
matchdict.update(d)
#print(matchdict)
if __name__ == '__main__':
import time
t = time.time()
main()
print('total time =', time.time() - t)
Update
I want to eliminate using numpy from the benchmark. It is known that numpy uses multiprocessing for some of its operations and when used in multiprocessing applications can be the cause of of reduced performance. So the first thing I did was to take the OP's original program and where the code was, for example:
import numpy as np
np.random.seed(1)
X = np.random.randint(M, size=N)
X = np.char.mod('%d', X).tolist()
I replaced it with:
import random
random.seed(1)
X = [str(random.randrange(M)) for _ in range(N)]
I then timed the OP's program to get the time for generating the X, Y, Q and Z lists and the total time. On my desktop the times were approximately 20 seconds and 37 seconds respectively! So in my multiprocessing version just generating the arguments for the process pool's processes is more than half the total running time. I also discovered for the second approach, that as I increased the value of XY_AT_A_TIME that the CPU utilization went down from 100% to around 50% but that the total elapsed time improved. I haven't quite figured out why this is.
Next I tried to emulate how the programs would function if they were reading the data in. So I wrote out 2 * N random integers to a file, temp.txt and modified the OP's program to initialize X and Y from the file and then modified my second approach's next_tuple function as follows:
def next_tuple(n, stop, M):
with open('temp.txt') as f:
start = 0
while True:
end = min(start + n, stop)
index = [str(x) for x in range(start, end)] # improvement
x = [f.readline().strip() for _ in range(n)]
y = [f.readline().strip() for _ in range(n)]
yield (index, x, y)
start = end
if start >= stop:
break
Again as I increased XY_AT_A_TIME the CPU utilization went down (best performance I found was value 400000 with CPU utilization only around 40%).
I finally rewrote my first approach trying to be more memory efficient (see below). This updated version again reads the random numbers from a file but uses generator functions for X, Y and index so I don't need memory for both the full lists and the splits. Again, I do not expect duplicated results for the multiprocessing and non-multiprocessing versions because of the way I am assigning the X and Y values in the two cases (a simple solution to this would have been to write the random numbers to an X-value file and a Y-value file and read the values back from the two files). But this has no effect on the running times. But again, the CPU utilization, despite using the default pool size of 8, was only 30 - 40% (it fluctuated quite a bit) and the overall running time was nearly double the non-multiprocessing running time. But why?
import random
import multiprocessing
import concurrent.futures
import time
MULTIPROCESSING = True
POOL_SIZE = multiprocessing.cpu_count()
def init_pool(q, z):
global Q, Z
Q = q
Z = z
def matchdictfunc(index, X, Y): # function to match entries of X and Y to Q and Z and get index of Q/Z where their values match X/Y
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
matchdict = {}
for ind, match in enumerate(matchlist):
matchdict[index[ind]] = match
return matchdict
def split(a): # function to split list in POOL_SIZE even parts
k, m = divmod(N, POOL_SIZE)
divisions = [(i + 1) * k + min(i + 1, m) - (i * k + min(i, m)) for i in range(POOL_SIZE)]
parts = []
for division in divisions:
part = [next(a) for _ in range(division)]
parts.append(part)
return parts
def splitinput(index, X, Y): # split large lists X and Y in n-even parts (n = POOL_SIZE)
#create multiple chunks for X and Y and index:
index_split = split(index)
X_split = split(X)
Y_split = split(Y)
return index_split, X_split ,Y_split
def main():
global N
# following required for non-multiprocessing
if not MULTIPROCESSING:
global Q, Z
random.seed(1)
# N will finally scale up to 10^9
N = 10000000
M = 300
# Q and Z size is fixed at 120000
Q = [str(random.randrange(M)) for _ in range(120000)]
Z = [str(random.randrange(M)) for _ in range(120000)]
with open('temp.txt') as f:
# for non-multiprocessing:
if not MULTIPROCESSING:
index = [str(x) for x in range(N)]
X = [f.readline().strip() for _ in range(N)]
Y = [f.readline().strip() for _ in range(N)]
matchdict = matchdictfunc(index, X, Y)
else:
# for multiprocessing:
# split lists to number of processors (POOL_SIZE)
# generator functions:
index = (str(x) for x in range(N))
X = (f.readline().strip() for _ in range(N))
Y = (f.readline().strip() for _ in range(N))
index_split, X_split, Y_split = splitinput(index, X, Y)
with concurrent.futures.ProcessPoolExecutor(POOL_SIZE, initializer=init_pool, initargs=(Q, Z)) as executor:
finallist = [result for result in executor.map(matchdictfunc, index_split, X_split, Y_split)]
matchdict = {}
for d in finallist:
matchdict.update(d)
if __name__ == '__main__':
t = time.time()
main()
print('total time =', time.time() - t)
Resolution?
Can it be that the overhead of transferring the data from the main process to the subprocesses, which involves shared memory reading and writing, is what is slowing everything down? So, this final version was an attempt to eliminate this potential cause for the slowdown. On my desktop I have 8 processors. For the first approach dividing the N = 10000000 X and Y values among them means that each process should be processing N // 8 -> 1250000 values. So I wrote out the random numbers in 16 groups of 1250000 numbers (8 groups for X and 8 groups for Y) as a binary file noting the offset and length of each of these 16 groups using the following code:
import random
random.seed(1)
with open('temp.txt', 'wb') as f:
offsets = []
for i in range(16):
n = [str(random.randrange(300)) for _ in range(1250000)]
b = ','.join(n).encode('ascii')
l = len(b)
offsets.append((f.tell(), l))
f.write(b)
print(offsets)
And from that I constructed lists X_SPECS and Y_SPECS that the worker function matchdictfunc could use for reading in the values X and Y itself as needed. So now instead of passing 1250000 values at a time to this worker function, we are just passing indices 0, 1, ... 7 to the worker function so it knows which group it has to read in. Shared memory access has been totally eliminated in accessing X and Y (it's still required for Q and Z) and the disk access moved to the process pool. The CPU Utilization will, of course, not be 100% because the worker function is doing I/O. But I found that while the running time has now been greatly improved, it still offered no improvement over the original non-multiprocessing version:
OP's original program modified to read `X` and `Y` values in from file: 26.2 seconds
Multiprocessing elapsed time: 29.2 seconds
In fact, when I changed the code to use multithreading by replacing the ProcessPoolExecutor with ThreadPoolExecutor, the elpased time went down almost another second demonstrating the there is very little contention for the Global Interpreter Lock within the worker function, i.e. most of the time is being spent in C-language code. The main work is done by:
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
When we do this with multiprocessing, we have multiple list comprehensions and multiple zip operations (on smaller lists) being performed by separate processes and we then assemble the results in the end. This is conjecture on my part, but there just may not be any performance gains to be had by taking what are already efficient operations and scaling them down across multiple processors. Or in other words, I am stumped and that was my best guess.
The final version (with some additional optimizations -- please note):
import random
import concurrent.futures
import time
POOL_SIZE = 8
X_SPECS = [(0, 4541088), (4541088, 4541824), (9082912, 4540691), (13623603, 4541385), (18164988, 4541459), (22706447, 4542961), (27249408, 4541847), (31791255, 4542186)]
Y_SPECS = [(36333441, 4542101), (40875542, 4540120), (45415662, 4540802), (49956464, 4540971), (54497435, 4541427), (59038862, 4541523), (63580385, 4541571), (68121956, 4542335)]
def init_pool(q_z):
global Q_Z
Q_Z = q_z
def matchdictfunc(index, i): # function to match entries of X and Y to Q and Z and get index of Q/Z where their values match X/Y
x_offset, x_len = X_SPECS[i]
y_offset, y_len = Y_SPECS[i]
with open('temp.txt', 'rb') as f:
f.seek(x_offset, 0)
X = f.read(x_len).decode('ascii').split(',')
f.seek(y_offset, 0)
Y = f.read(y_len).decode('ascii').split(',')
lookup = {}
for i, (q, z) in enumerate(Q_Z):
lookup.setdefault((q, z), []).append(i)
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
matchdict = {}
for ind, match in enumerate(matchlist):
matchdict[index[ind]] = match
return matchdict
def split(a): # function to split list in POOL_SIZE even parts
k, m = divmod(N, POOL_SIZE)
divisions = [(i + 1) * k + min(i + 1, m) - (i * k + min(i, m)) for i in range(POOL_SIZE)]
parts = []
for division in divisions:
part = [next(a) for _ in range(division)]
parts.append(part)
return parts
def main():
global N
random.seed(1)
# N will finally scale up to 10^9
N = 10000000
M = 300
# Q and Z size is fixed at 120000
Q = (str(random.randrange(M)) for _ in range(120000))
Z = (str(random.randrange(M)) for _ in range(120000))
Q_Z = list(zip(Q, Z)) # pre-compute the `zip` function
# for multiprocessing:
# split lists to number of processors (POOL_SIZE)
# generator functions:
index = (str(x) for x in range(N))
index_split = split(index)
with concurrent.futures.ProcessPoolExecutor(POOL_SIZE, initializer=init_pool, initargs=(Q_Z,)) as executor:
finallist = executor.map(matchdictfunc, index_split, range(8))
matchdict = {}
for d in finallist:
matchdict.update(d)
print(len(matchdict))
if __name__ == '__main__':
t = time.time()
main()
print('total time =', time.time() - t)
The Cost of Inter-Process Memory Transfers
In the code below function create_files was called to create 100 identical files consisting of a "pickled" list of 1,000,000 numbers. I then used a multiprocessing pool of size 8 twice to read the 100 files and unpickle the files to reconstitute the original lists. The difference between the first case (worker1) and the second case (worker2) was that in the second case the list is returned back to the caller (but not saved so that memory can be garbage collected immediately). The second case took more than three times longer than the first case. This can also explain in part why you do not see a speedup when you switch to multiprocessing.
from multiprocessing import Pool
import pickle
import time
def create_files():
l = [i for i in range(1000000)]
# create 100 identical files:
for file in range(1, 101):
with open(f'pkl/test{file}.pkl', 'wb') as f:
pickle.dump(l, f)
def worker1(file):
file_name = f'pkl/test{file}.pkl'
with open(file_name, 'rb') as f:
obj = pickle.load(f)
def worker2(file):
file_name = f'pkl/test{file}.pkl'
with open(file_name, 'rb') as f:
obj = pickle.load(f)
return file_name, obj
POOLSIZE = 8
if __name__ == '__main__':
#create_files()
pool = Pool(POOLSIZE)
t = time.time()
# no data returned:
for file in range(1, 101):
pool.apply_async(worker1, args=(file,))
pool.close()
pool.join()
print(time.time() - t)
pool = Pool(POOLSIZE)
t = time.time()
for file in range(1, 101):
pool.apply_async(worker2, args=(file,))
pool.close()
pool.join()
print(time.time() - t)
t = time.time()
for file in range(1, 101):
worker2(file)
print(time.time() - t)

Smoothing values (neighbors between 1-9)

Instructions: Compute and store R=1000 random values from 0-1 as x. moving_window_average(x, n_neighbors) is pre-loaded into memory from 3a. Compute the moving window average for x for the range of n_neighbors 1-9. Store x as well as each of these averages as consecutive lists in a list called Y.
My solution:
R = 1000
n_neighbors = 9
x = [random.uniform(0,1) for i in range(R)]
Y = [moving_window_average(x, n_neighbors) for n_neighbors in range(1,n_neighbors)]
where moving_window_average(x, n_neighbors) is a function as follows:
def moving_window_average(x, n_neighbors=1):
n = len(x)
width = n_neighbors*2 + 1
x = [x[0]]*n_neighbors + x + [x[-1]]*n_neighbors
# To complete the function,
# return a list of the mean of values from i to i+width for all values i from 0 to n-1.
mean_values=[]
for i in range(1,n+1):
mean_values.append((x[i-1] + x[i] + x[i+1])/width)
return (mean_values)
This gives me an error, Check your usage of Y again. Even though I've tested for a few values, I did not get yet why there is a problem with this exercise. Did I just misunderstand something?

The instruction tells you to compute moving averages for all neighbors ranging from 1 to 9. So the below code should work:
import random
random.seed(1)
R = 1000
x = []
for i in range(R):
num = random.uniform(0,1)
x.append(num)
Y = []
Y.append(x)
for i in range(1,10):
mov_avg = moving_window_average(x, n_neighbors=i)
Y.append(mov_avg)

Actually your moving_window_average(list, n_neighbors) function is not going to work with a n_neighbors bigger than one, I mean, the interpreter won't say a thing, but you're not delivering correctness on what you have been asked.
I suggest you to use something like:
def moving_window_average(x, n_neighbors=1):
n = len(x)
width = n_neighbors*2 + 1
x = [x[0]]*n_neighbors + x + [x[-1]]*n_neighbors
mean_values = []
for i in range(n):
temp = x[i: i+width]
sum_= 0
for elm in temp:
sum_+= elm
mean_values.append(sum_ / width)
return mean_values

My solution for +100XP
import random
random.seed(1)
R=1000
Y = list()
x = [random.uniform(0, 1) for num in range(R)]
for n_neighbors in range(10):
Y.append(moving_window_average(x, n_neighbors))

Python: Compute a Huge Fibonacci Number Modulo m

# Uses python3
# Given two integers n and m, output Fn mod m (that is, the remainder of Fn when divided by m
def Huge_Fib(n,m):
if n == 0 : return 0
elif n == 1: return 1
else:
a,b = 0,1
for i in range(1,n):
a, b = b, (a+b) % m
print(b);
n,m = map(int, input().split());
Huge_Fib(n,m);
The code works very well. However, when I run a case as n = 99999999999999999, m = 2, it takes me much time. Do you have any better solutions?

Here is my solution, you don't have to go through 99999999999999999 iterations if you find the pisano period.
I also recommend that you watch this video: https://www.youtube.com/watch?v=Nu-lW-Ifyec
# Uses python3
import sys
def get_fibonacci_huge(n, m):
if n <= 1:
return n
arr = [0, 1]
previousMod = 0
currentMod = 1
for i in range(n - 1):
tempMod = previousMod
previousMod = currentMod % m
currentMod = (tempMod + currentMod) % m
arr.append(currentMod)
if currentMod == 1 and previousMod == 0:
index = (n % (i + 1))
return arr[index]
return currentMod
if __name__ == '__main__':
input = sys.stdin.read();
n, m = map(int, input.split())
print(get_fibonacci_huge(n,m))

# Uses python3
# Given two integers n and m, output Fn mod m (that is, the remainder of Fn when divided by m
def Huge_Fib(n,m):
# Initialize a matrix [[1,1],[1,0]]
v1, v2, v3 = 1, 1, 0
# Perform fast exponentiation of the matrix (quickly raise it to the nth power)
for rec in bin(n)[3:]:
calc = (v2*v2) % m
v1, v2, v3 = (v1*v1+calc) % m, ((v1+v3)*v2) % m, (calc+v3*v3) % m
if rec == '1': v1, v2, v3 = (v1+v2) % m, v1, v2
print(v2);
n,m = map(int, input().split());
Huge_Fib(n,m);
This is a superfast solution refer to https://stackoverflow.com/a/23462371/3700852

I solved it in Python 3. This the fastest algorithm to compute a huge Fibonacci number modulo m.For example for n =2816213588, m = 239, it took Max time used: 0.01/5.00, max memory used: 9424896/536870912.)
def pisanoPeriod(m):
previous, current = 0, 1
for i in range(0, m * m):
previous, current = current, (previous + current) % m
# A Pisano Period starts with 01
if (previous == 0 and current == 1):
return i + 1
def calc_fib(n,m):
p = pisanoPeriod(m)
n = n % p
if (n <= 1):
return n
else:
previous,current = 0,1
for i in range(2,n+1):
previous,current = current,(previous+current)
return current%m
n,m =map(int,input().split(" "))
print(calc_fib(n,m))

In the below code we are using two concepts of Fibonacci series:
Pisano periods follows a Fibonacci sequence and hence each repetition(pattern) begins with 0 and 1 appearing consecutively one after the other.
fib(n) divides fib(m) only when n divides m which means if fib(4)%3==0,then fib(4+4)%3==0,fib(4+4+4)%3==0 and so on.This helps us in finding the Pisano period.
To know about Pisano periods,I recommend that you watch this video: https://www.youtube.com/watch?v=Nu-lW-Ifyec
#python3
def pisano_length(m):
i=2
while(fib(i)%m!=0):
i+=1
if(fib(i+1)%m!=1):
while(fib(i+1)%m!=1):
i+=i
print("The pisano length for mod {} is: {}".format(m,i))
return(i)
def fib(n):
a,b=0,1
if(n==0 or n==1):
return n
else:
for i in range(2,n+1):
b,a=a+b,b
return(b)
#we want to calculate fib(n)%m for big numbers
n,m=map(int,input().split())
remainder=n%pisano_length(m)
print(fib(remainder)%m)

You should look up Pisano periods.
https://en.wikipedia.org/wiki/Pisano_period and
http://webspace.ship.edu/msrenault/fibonacci/fibfactory.htm should give you a good understanding of what they are.
edit: Just googling "fibonacci modulo" gives you those two as the top two results.

For any integer m>=2, the sequence fn modulo m is periodic - Pisano Period.
So no need to store and find fn. Instead, find a repeating pattern for given m.

This is how i have done by calculating the pisano period.(Java)
public static long get_pisano_period(long m) {
long a = 0, b = 1;
long c;
for (int i = 0; i < m * m; i++) {
c = (a + b) % m;
a = b;
b = c;
if (a == 0 && b == 1)
return i + 1;
}
return 0;
}
public static BigInteger get_fibonacci_huge(long n,long m) {
long remainder = n % get_pisano_period(m);
BigInteger first = BigInteger.valueOf(0);
BigInteger second=BigInteger.valueOf(1);
BigInteger m1=BigInteger.valueOf(m);
BigInteger res = BigInteger.valueOf(remainder);
for (long i = 1; i < remainder; i++) {
res = (first.add(second)).mod(m1);
first = second;
second = res;
}
return res.mod(m1);
}

How to write cos(1)

I need to find a way to write cos(1) in python using a while loop. But i cant use any math functions. Can someone help me out?
for example I also had to write the value of exp(1) and I was able to do it by writing:
count = 1
term = 1
expTotal = 0
xx = 1
while abs(term) > 1e-20:
print("%1d %22.17e" % (count, term))
expTotal = expTotal + term
term=term * xx/(count)
count+=1
I amm completely lost as for how to do this with the cos and sin values though.

Just change your expression to compute the term to:
term = term * (-1 * x * x)/( (2*count) * ((2*count)-1) )
Multiplying the count by 2 could be changed to increment the count by 2, so here is your copypasta:
import math
def cos(x):
cosTotal = 1
count = 2
term = 1
x=float(x)
while abs(term) > 1e-20:
term *= (-x * x)/( count * (count-1) )
cosTotal += term
count += 2
print("%1d %22.17e" % (count, term))
return cosTotal
print( cos(1) )
print( math.cos(1) )

You can calculate cos(1) by using the Taylor expansion of this function:
You can find more details on Wikipedia, see an implementation below:
import math
def factorial(n):
if n == 0:
return 1
else:
return n * factorial(n-1)
def cos(order):
a = 0
for i in range(0, order):
a += ((-1)**i)/(factorial(2*i)*1.0)
return a
print cos(10)
print math.cos(1)
This gives as output:
0.540302305868
0.540302305868
EDIT: Apparently the cosine is implemented in hardware using the CORDIC algorithm that uses a lookup table to calculate atan. See below a Python implementation of the CORDIS algorithm based on this Google group question:
#atans = [math.atan(2.0**(-i)) for i in range(0,40)]
atans =[0.7853981633974483, 0.4636476090008061, 0.24497866312686414, 0.12435499454676144, 0.06241880999595735, 0.031239833430268277, 0.015623728620476831, 0.007812341060101111, 0.0039062301319669718, 0.0019531225164788188, 0.0009765621895593195, 0.0004882812111948983, 0.00024414062014936177, 0.00012207031189367021, 6.103515617420877e-05, 3.0517578115526096e-05, 1.5258789061315762e-05, 7.62939453110197e-06, 3.814697265606496e-06, 1.907348632810187e-06, 9.536743164059608e-07, 4.7683715820308884e-07, 2.3841857910155797e-07, 1.1920928955078068e-07, 5.960464477539055e-08, 2.9802322387695303e-08, 1.4901161193847655e-08, 7.450580596923828e-09, 3.725290298461914e-09, 1.862645149230957e-09, 9.313225746154785e-10, 4.656612873077393e-10, 2.3283064365386963e-10, 1.1641532182693481e-10, 5.820766091346741e-11, 2.9103830456733704e-11, 1.4551915228366852e-11, 7.275957614183426e-12, 3.637978807091713e-12, 1.8189894035458565e-12]
def cosine_sine_cordic(beta,N=40):
# in hardware, put this in a table.
def K_vals(n):
K = []
acc = 1.0
for i in range(0, n):
acc = acc * (1.0/(1 + 2.0**(-2*i))**0.5)
K.append(acc)
return K
#K = K_vals(N)
K = 0.6072529350088812561694
x = 1
y = 0
for i in range(0,N):
d = 1.0
if beta < 0:
d = -1.0
(x,y) = (x - (d*(2.0**(-i))*y), (d*(2.0**(-i))*x) + y)
# in hardware put the atan values in a table
beta = beta - (d*atans[i])
return (K*x, K*y)
if __name__ == '__main__':
beta = 1
cos_val, sin_val = cosine_sine_cordic(beta)
print "Actual cos: " + str(math.cos(beta))
print "Cordic cos: " + str(cos_val)
This gives as output:
Actual cos: 0.540302305868
Cordic cos: 0.540302305869

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Is there a way to speed up this function? - python-3.x

Related

find sequence of elements in numpy array [duplicate]

Python Multiprocessing (Splitting data in smaller chunks - multiple function arguments)

Smoothing values (neighbors between 1-9)

Python: Compute a Huge Fibonacci Number Modulo m

How to write cos(1)

Categories

Resources