I have read that the function apply_async doesn't give ordered results. If I have repeated calls to a function which prints the squares of a list of numbers, I can see from the display that the list is not ordered.
However when the function returns the number instead of printing it and I use .get() to get the values, then I see that the results are ordered.
I have a few questions --
Why are the results from .get() ordered?
If I have a loop which has a variable named a whose value differs between iterations, will using apply_async cause the values of a to be overwritten, since it runs the processes in parallel and asynchronously?
Will I be able to save computational time if I run apply instead of apply_async? My code shows that apply is slower than the for loop. Why is that so?
Can we use a function declared within the __main__ block with apply_async?
Here is a small working example:
from multiprocessing import Pool
import time

def f(x):
    return x*x

if __name__ == '__main__':
    print('For loop')
    t1f = time.time()
    for ii in range(20):
        f(ii)
    t2f = time.time()
    print('Time taken for For loop = ', t2f-t1f,' seconds')
    pool = Pool(processes=4)
    print('Apply async loop')
    t1a = time.time()
    results = [pool.apply_async(f, args = (j,)) for j in range(20)]
    pool.close()
    pool.join()
    t2a = time.time()
    print('Time taken for pool = ', t2a-t1a,' seconds')
    print([results[hh].get() for hh in range(len(results))])
This results in:
For loop
Time taken for For loop = 5.9604644775390625e-06 seconds
Apply async loop
Time taken for pool = 0.10188460350036621 seconds
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Why are the results from .get() ordered?
Because the results list itself is ordered: it holds the result objects in the order you submitted the tasks, and you call .get() on them in that same order. The tasks themselves may still finish in a different order.
If I have a loop which has a variable named a and its value is different for different iterations, will using apply_async cause overwrites of the values of a as it runs the processes in parallel and asynchronously?
Generally no, but I can't tell without seeing the code.
Will I be able to save computational time if I run apply instead of apply_async? My code shows that apply is slower than the for loop. Why is that so?
No. apply blocks on each call, so there is no parallelism. It is slower than the plain for loop because of the multiprocessing overhead.
Can we use a function declared within the __main__ block with apply_async?
Yes on *nix, no on Windows, because Windows has no fork().
Your time measurement for apply_async is wrong: you should take t2a after the results have been fetched with .get(), and you should not assume the tasks finish in order:
while not all(r.ready() for r in results):
    time.sleep(0.1)
By the way, your work function finishes too quickly for a meaningful benchmark. Do more computation per call to get a true comparison.
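The ordering point can be seen in a small sketch. It uses multiprocessing.pool.ThreadPool instead of Pool purely so that it runs anywhere without a __main__ guard (that swap is an assumption of this sketch; both expose the same apply_async interface). slow_square and run_demo are illustrative names, not part of the question's code. The tasks are arranged so that later submissions tend to finish first, yet calling .get() on the handles in submission order still gives ordered results. The timer is stopped only after the results have been fetched.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_square(x):
    # Later submissions sleep less, so they tend to finish first.
    time.sleep(0.05 * (5 - x))
    return x * x


def run_demo():
    completion_order = []  # filled by the callback as each task finishes
    with ThreadPool(processes=5) as pool:
        start = time.perf_counter()
        handles = [
            pool.apply_async(slow_square, args=(j,),
                             callback=completion_order.append)
            for j in range(5)
        ]
        # .get() is called in submission order, so this list is ordered
        # no matter which task finished first.
        ordered = [h.get() for h in handles]
        elapsed = time.perf_counter() - start  # stop the clock after .get()
    return ordered, completion_order, elapsed


if __name__ == "__main__":
    ordered, completion_order, elapsed = run_demo()
    print("in submission order:", ordered)
    print("in completion order:", completion_order)
    print(f"elapsed: {elapsed:.3f} s")
```

The completion order depends on scheduling, so it is printed rather than assumed.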
Do I understand the following correctly?
When num_workers >=1, the main process pre-loads prefetch_factor * num_workers batches. When the training loop consumes one batch, the corresponding worker loads the next batch in its queue.
If this is the case, let's go through an example.
NOTE: I have chosen the numeric values for illustration purposes and have ignored various overheads in this example. The accompanying code example uses these numbers.
Say I have num_workers=4, prefetch_factor=4, batch_size=128. Further assume, it takes 0.003125 s to fetch an item from a source database and the train step takes 0.05 s.
Now, each batch would take 0.003125 * 128 = 0.4 s to load.
With a prefetch_factor=4 and num_workers=4, first, 4*4=16 batches will be loaded.
Once the 16 batches are loaded, the first train step consumes 1 batch and takes 0.05 s. Say worker[0] provided this batch and will start the process to generate a new batch to replenish the queue. Recall fetching a new batch takes 0.4 s.
Similarly, the second step consumes one more batch and the corresponding worker (worker[1] in this example) starts the data fetching process.
The first 8 train steps would take 0.05*8=0.4s. By this time, 8 batches have been consumed and worker[0] has produced 1 batch. In the next step, 1 batch is consumed and worker[1] produces a new batch: worker[1] started fetching data in the second train step, and that fetch has now completed.
Following this, we can see that each subsequent train step will consume 1 batch and one of the workers will produce 1 batch, keeping the dataloader queue at 8 batches at all times. This means that the train step is never waiting for the data loading process, as there are always 8 batches in the buffer.
I would expect this behavior regardless of the data size of the batch, given that num_workers and prefetch_factor are large enough. However, in the following code example that is not the case.
In the code below, I define a custom iterable that returns a numpy array. As the size of the numpy array increases, increasing num_workers or prefetch_factor does not improve the time taken for running through a batch.
I'm guessing this is because each worker serializes the batch to send it to the main process, where it is de-serialized. As the data size increases, this process takes more time. However, I would think that if the queue size is large enough (num_workers, prefetch_factor), at some point there should be a break-even point where each training step's consumption of a batch is accompanied by replenishment via one of the workers, as I illustrated in the above example.
In the code below, when MyIterable returns a small object (np array of size (10, 150)), increasing num_workers helps as expected. But when the returned object is larger (np array of size (1000, 150)), num_workers or prefetch_factor does not do much.
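To put rough numbers on the serialization guess, here is a back-of-the-envelope sketch of how much raw array data one batch carries in each case. It assumes float64 values (8 bytes each, which is what np.random.random returns) and ignores any pickling overhead. batch_payload_bytes is a hypothetical helper name.

```python
def batch_payload_bytes(item_shape, batch_size, bytes_per_value=8):
    """Raw array bytes in one batch (ignores serialization overhead)."""
    values_per_item = 1
    for dim in item_shape:
        values_per_item *= dim
    return values_per_item * batch_size * bytes_per_value


small = batch_payload_bytes((10, 150), batch_size=128)
large = batch_payload_bytes((1000, 150), batch_size=128)
print(f"small batch: {small / 1e6:.1f} MB")   # small batch: 1.5 MB
print(f"large batch: {large / 1e6:.1f} MB")   # large batch: 153.6 MB
```

Under these assumptions the large case moves about a hundred times more data per batch from the worker to the main process.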
# small np object
avg time per batch for num workers=0: 0.47068126868714444
avg time per batch for num workers=2: 0.20982365206225495
avg time per batch for num workers=4: 0.10560789656221914
avg time per batch for num workers=6: 0.07202646931250456
avg time per batch for num workers=8: 0.05311137337469063
# large np object
avg time per batch for num workers=0: 0.6090951558124971
avg time per batch for num workers=2: 0.4594530961876444
avg time per batch for num workers=4: 0.45023533212543043
avg time per batch for num workers=6: 0.3830978863124983
avg time per batch for num workers=8: 0.3811495694375253
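A rough steady-state model may help interpret these numbers. It assumes each worker needs 0.4 s per batch, the train step takes 0.05 s, and nothing else costs time (no serialization, no queue overhead). Under that model the prefetch buffer only hides latency at the start; in the long run the loop cannot go faster than the workers can produce batches. steady_state_time_per_batch is a hypothetical name and this is a sketch, not a description of DataLoader internals.

```python
def steady_state_time_per_batch(item_fetch_s, batch_size, train_step_s, num_workers):
    """Idealised time per batch once any prefetch buffer has drained."""
    load_s = item_fetch_s * batch_size  # time for one worker to build one batch
    if num_workers == 0:
        # Loading and training happen one after the other in the main process.
        return load_s + train_step_s
    # With workers, the slower of "consume" and "produce" sets the pace.
    return max(train_step_s, load_s / num_workers)


for workers in [0, 2, 4, 6, 8]:
    t = steady_state_time_per_batch(0.003125, 128, 0.05, workers)
    print(f"num workers={workers}: {t:.4f} s per batch")
```

For the small-object case the model gives about 0.45, 0.2, 0.1, 0.067 and 0.05 s, which is close to the measured values above. With 4 workers the train step consumes 20 batches per second while the workers produce only 10, so a buffer of any size eventually drains.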
Am I missing something here? Why doesn't the data loader queue have enough buffer such that data loading is not the bottleneck?
Even if the serialization and de-serialization process takes longer in the latter case, I'd expect to have a large enough buffer that the consumption and replenishment rates of the batches are almost equal. Otherwise, what is the point of having prefetch_factor?
If the code is behaving as expected, are there any other ways to pre-load the next n batches in a buffer such that it is large enough and never depleted?
Thanks
import time
import torch
import numpy as np
from time import sleep
from torch.utils.data import DataLoader, IterableDataset


def collate_fn(records):
    # some custom collation function
    return records


class MyIterable(object):
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            self.i += 1  # advance the counter so iteration stops after n items
            sleep(0.003125)  # simulates data fetch time
            # return np.random.random((10, 150))  # small data item
            return np.random.random((1000, 150))  # large data item
        else:
            raise StopIteration


class MyIterableDataset(IterableDataset):
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        return MyIterable(self.n)


def get_performance_metrics(num_workers):
    ds = MyIterableDataset(n=10000)

    if num_workers == 0:
        dl = torch.utils.data.DataLoader(ds, num_workers=0, batch_size=128, collate_fn=collate_fn)
    else:
        dl = torch.utils.data.DataLoader(ds, num_workers=num_workers, prefetch_factor=4, persistent_workers=True,
                                         batch_size=128, collate_fn=collate_fn,
                                         multiprocessing_context='spawn')
    warmup = 5
    times = []
    t0 = time.perf_counter()
    for i, batch in enumerate(dl):
        sleep(0.05)  # simulates train step
        e = time.perf_counter()
        if i >= warmup:
            times.append(e - t0)
        t0 = time.perf_counter()
        if i >= 20:
            break
    print(f'avg time per batch for num workers={num_workers}: {sum(times) / len(times)}')


if __name__ == '__main__':
    num_worker_options = [0, 2, 4, 6, 8]
    for n in num_worker_options:
        get_performance_metrics(n)
I have a matrix A and want to calculate the distance matrix D from it iteratively. The reason for calculating it step by step is that I later want to include some if-statements in the iteration process.
My code right now looks like this:
import numpy as np
from scipy.spatial import distance

def create_data_matrix(n,m):
    mean = np.zeros(m)
    cov = np.eye(m, dtype=float)
    data_matrix = np.random.multivariate_normal(mean,cov,n)
    return(data_matrix)

def create_full_distance(A):
    distance_matrix = np.triu(distance.squareform(distance.pdist(A,"euclidean")),0)
    return(distance_matrix)

matrix_a = create_data_matrix(1000,2)
distance_from_numpy = create_full_distance(matrix_a)
matrix_b = np.empty((1000,1000))
for idx, line in enumerate(matrix_a):
    for j, line2 in enumerate(matrix_a):
        matrix_b[idx][j] = distance.euclidean(matrix_a[idx],matrix_a[j])
Now the matrices "distance_from_numpy" and "matrix_b" are the same, though matrix_b takes far longer to calculate, although matrix_a is only a (1000x2) matrix. I know that the "distance.pdist()" method is very fast, but I am not sure if I can use it in an iteration process.
My question is: why is the double for loop so slow, and how can I increase the speed while still preserving the iteration process (since I want to include if statements there)?
Edit, for context: I want to preserve the iteration because I'd like to stop the iteration if one of the distances is smaller than a specific number.
Python is an interpreted, dynamically typed language, so every loop iteration carries a lot of overhead. This gets progressively worse as the number of nested loops increases. NumPy and SciPy, on the other hand, do the heavy lifting in compiled code.
To speed up the Python implementation, you can for example implement the loop part with Cython, which will translate your code to C, and then compile it for faster execution. Other options are Numba, or writing the loops in Fortran.
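Another option, not mentioned above, is to keep only the outer loop in Python and let NumPy compute one whole row of distances per iteration. The if-statement then runs once per row rather than once per pair. The sketch below is a minimal illustration under that assumption: upper_distances_with_early_stop and threshold are hypothetical names, and stopping at the first row that contains a too-small distance is one possible reading of the stopping rule in the question.

```python
import numpy as np


def upper_distances_with_early_stop(points, threshold):
    """Fill the upper triangle row by row; stop at the first row that
    contains a distance below ``threshold``.

    Returns (distance_matrix, stop_row), where stop_row is None if the
    loop ran to the end.
    """
    n = len(points)
    dist = np.zeros((n, n))
    for i in range(n):
        # Distances from point i to points i+1 .. n-1 in one vectorized call.
        row = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        dist[i, i + 1:] = row
        if row.size and row.min() < threshold:
            return dist, i
    return dist, None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    matrix_a = rng.normal(size=(1000, 2))
    dist, stop_row = upper_distances_with_early_stop(matrix_a, threshold=1e-4)
    print("stopped at row:", stop_row)
```

Rows after the stopping row are left at zero, so the caller should check stop_row before using the matrix.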
As Ehsan mentioned in a comment, I used numba to increase the computational speed.
from numba import jit
import numpy as np
import time
from scipy.spatial import distance

def create_data_matrix(n,m):
    mean = np.zeros(m)
    cov = np.eye(m, dtype=float)
    data_matrix = np.random.multivariate_normal(mean,cov,n)
    return(data_matrix)

def create_full_distance(A):
    distance_matrix = np.triu(distance.squareform(distance.pdist(A,"euclidean")),0)
    return(distance_matrix)

@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def slow_loop(matrix_a):
    matrix_b = np.empty((1000,1000))
    for i in range(len(matrix_a)):
        for j in range(len(matrix_a)):
            #matrix_b[i][j] = distance.euclidean(matrix_a[i],matrix_a[j])
            matrix_b[i][j] = np.linalg.norm(matrix_a[i]-matrix_a[j])
    print("matrix_b: ",matrix_b)
    return()

def slow_loop_without_numba(matrix_a):
    matrix_b = np.empty((1000,1000))
    for i in range(len(matrix_a)):
        for j in range(len(matrix_a)):
            matrix_b[i][j] = np.linalg.norm(matrix_a[i]-matrix_a[j])
    return()

matrix_a = create_data_matrix(1000,2)
start = time.time()
ergebnis = create_full_distance(matrix_a)
#print("ergebnis: ",ergebnis)
end = time.time()
print("with scipy.distance.pdist = %s" % (end - start))
start2 = time.time()
slow_loop(matrix_a)
end2 = time.time()
print("with @jit onto np.linalg.norm = %s" % (end2 - start2))
start3 = time.time()
slow_loop_without_numba(matrix_a)
end3 = time.time()
print("slow_loop without numba = %s" % (end3 - start3))
I executed the code and it yielded these results:
with scipy.distance.pdist = 0.021986722946166992
with @jit onto np.linalg.norm = 0.8565070629119873
slow_loop without numba = 6.818004846572876
So numba increased the computational speed by a lot, although scipy is still much faster. This will become more interesting the bigger the distance matrices get. I couldn't use numba on a function that calls scipy methods.
I understand the process of passing a function as a parameter to a different function but, coming from a C# background, I don't understand the need for it.
Can someone please make me aware of some scenarios in which this is preferred?
One reason why passing a function as a parameter is useful is Python's lambda functions.
>>> method2(lambda: method1('world'))
hello world
The benefit of lambda functions is easy to see when they are used with the Python functions map(), filter(), and reduce().
Lambda functions with map()
>>> map(lambda x: x*2, my_list)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38]
Lambda with reduce()
>>> reduce(lambda x, y: x+y, my_list)
190
Lambda with filter()
>>> filter(lambda x: x > 10, my_list)
[11, 12, 13, 14, 15, 16, 17, 18, 19]
Basically, your code takes fewer lines and becomes more concise, since the function is defined and used on the same line.
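The snippets above leave my_list, method1 and method2 undefined, and the outputs look like Python 2, where map() and filter() returned lists. Here is a self-contained Python 3 sketch of the same calls. The definitions of method1 and method2 are hypothetical stand-ins chosen to reproduce the "hello world" output, and my_list = list(range(20)) is inferred from the results shown.

```python
from functools import reduce  # reduce is not a builtin in Python 3

my_list = list(range(20))  # assumed: inferred from the outputs above


def method1(name):
    # Hypothetical stand-in: builds the greeting.
    return f"hello {name}"


def method2(func):
    # Hypothetical stand-in: receives a function and decides when to call it.
    print(func())


method2(lambda: method1("world"))  # prints: hello world

# In Python 3, map() and filter() return lazy iterators, hence list(...).
doubled = list(map(lambda x: x * 2, my_list))
total = reduce(lambda x, y: x + y, my_list)
above_ten = list(filter(lambda x: x > 10, my_list))

print(doubled)
print(total)      # 190
print(above_ten)
```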
Passing functions into functions allows you to parameterise behaviour. This is not unlike the way passing values into functions allows you to parameterise data.
def is_greater(what: int, base: int):
    if what > base:  # fixed behaviour, parameterised data
        print(f'{what} is greater')

def is_valid(what: int, condition: 'Callable'):
    if condition(what):  # parameterised behaviour
        print(f'{what} is valid')
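As a brief usage sketch of the same idea as is_valid, the helper below returns the matching values instead of printing them, so different conditions can be compared side by side. keep_valid and is_even are hypothetical names introduced for this illustration.

```python
from typing import Callable, Iterable, List


def keep_valid(values: Iterable[int], condition: Callable[[int], bool]) -> List[int]:
    """Return the values for which ``condition`` is true."""
    return [v for v in values if condition(v)]


def is_even(n: int) -> bool:
    return n % 2 == 0


# The same helper, three different behaviours:
print(keep_valid(range(10), is_even))                   # [0, 2, 4, 6, 8]
print(keep_valid(range(10), lambda n: n > 6))           # [7, 8, 9]
print(keep_valid(range(10), {2, 3, 5, 7}.__contains__)) # [2, 3, 5, 7]
```

A named function, a lambda and a bound method are all passed the same way, because the helper only needs something it can call.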
Some common use-cases include:
map, filter and others that apply some behaviour to iterables. The functions themselves merely implement the "apply to each element" part, but the behaviour can be swapped out:
>>> print(*map(float, ['1', '2', '3.0']))
1.0 2.0 3.0
In such situations, one often uses a lambda to define the behaviour on the fly.
>>> print(sorted(
...     ['Bobby Tables', 'Brian Wayne', 'Charles Chapeau'],
...     key=lambda name: name.split()[1],  # sort by last name
... ))
['Charles Chapeau', 'Bobby Tables', 'Brian Wayne']
Function decorators that wrap a function with additional behaviour.
def print_call(func):
    """Decorator that prints the arguments its target is called with"""
    def wrapped_func(*args, **kwargs):
        print(f'call {func} with {args} and {kwargs}')
        return func(*args, **kwargs)
    return wrapped_func

@print_call
def rolling_sum(*numbers, initial=0):
    totals = [initial]
    for number in numbers:
        totals.append(totals[-1] + number)
    return totals

rolling_sum(1, 10, 27, 42, 5, initial=100)
# call <function rolling_sum at 0x10ed6fd08> with (1, 10, 27, 42, 5) and {'initial': 100}
Every time you see a decorator applied with @, it is a higher-order function.
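To make the higher-order nature explicit: the decorator syntax is shorthand for calling the decorator with the function and rebinding the name to the result. The sketch below shows both spellings side by side. count_calls, add and multiply are hypothetical names used only for this illustration.

```python
import functools


def count_calls(func):
    """Hypothetical decorator that counts how often ``func`` is called."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def add(a, b):
    return a + b


def multiply(a, b):
    return a * b


# Same effect as writing @count_calls above the definition of multiply:
multiply = count_calls(multiply)

add(1, 2)
add(3, 4)
multiply(2, 5)
print(add.calls, multiply.calls)  # 2 1
```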
Callbacks and payloads that are executed at another time, context, condition, thread or even process.
import threading
import time

def call_after(delay: float, func: 'Callable', *args, **kwargs):
    """Call ``func(*args, **kwargs)`` after ``delay`` seconds"""
    time.sleep(delay)
    func(*args, **kwargs)

thread = threading.Thread(
    target=call_after,  # payload for the thread is a function
    args=(1, print, 'Hello World'))
thread.start()
print("Let's see what happens...")
# Let's see what happens...
#
# Hello World
Passing functions instead of values allows you to emulate lazy evaluation.
def as_needed(expensive_computation, default):
    if random_condition():
        return expensive_computation()
    return default
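A runnable sketch of the lazy-evaluation idea: the expensive work is passed as a function and only executed when its result is actually needed. get_or_compute and expensive_square are hypothetical names, and the cache-miss condition stands in for the unspecified random_condition above.

```python
def get_or_compute(cache, key, compute):
    """Return cache[key], calling compute() only if the key is missing."""
    if key not in cache:
        cache[key] = compute()  # the expensive call happens only here
    return cache[key]


calls = []  # records each time the expensive function really runs


def expensive_square():
    calls.append("ran")
    return 12 * 12


cache = {}
first = get_or_compute(cache, "square", expensive_square)
second = get_or_compute(cache, "square", expensive_square)
print(first, second, len(calls))  # 144 144 1
```

Had the value been passed instead of the function, as in get_or_compute(cache, "square", expensive_square()), the computation would run on every call, whether or not the result was used.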
I am having issues with my code running out of memory on large data sets. I attempted to chunk the data to feed it into the calculation graph but I eventually get an out of memory error. Would setting it up to use the feed_dict functionality get around this problem?
My code is set up like the following, with a nested map_fn function due to a result of the tf_itertools_product_2D_nest function.
tf_itertools_product_2D_nest function is from Cartesian Product in Tensorflow
I also tried a variation where I made a list of tensor-lists which was significantly slower than doing it purely in tensorflow so I'd prefer to avoid that method.
import tensorflow as tf
import numpy as np

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess = tf.Session()
sess.run(tf.global_variables_initializer())
tensorboard_log_dir = "../log/"

def tf_itertools_product_2D_nest(a,b): #does not work on nested tensors
    a, b = a[ None, :, None ], b[ :, None, None ]
    #print(sess.run(tf.shape(a)))
    #print(sess.run(tf.shape(b)))
    n_feat_dimension_in_common = tf.shape(a)[-1]
    c = tf.concat( [ a + tf.zeros_like( b ), tf.zeros_like( a ) + b ], axis = 2 )
    return c

def do_calc(arr_pair):
    arr_1 = arr_pair[0]
    arr_binary = arr_pair[1]
    return tf.reduce_max(tf.cumsum(arr_1*arr_binary))

def calc_row_wrapper(row):
    return tf.map_fn(do_calc,row)

for i in range(0,10):
    a = tf.constant(np.random.random((7,10))*10,tf.float64)
    b = tf.constant(np.random.randint(2, size=(3,10)),tf.float64)
    a_b_itertools_product = tf_itertools_product_2D_nest(a,b)
    '''Creates array like this:
    [ [[arr_a0,arr_b0], [arr_a1,arr_b0],...],
    [[arr_a0,arr_b1], [arr_a1,arr_b1],...],
    [[arr_a0,arr_b2], [arr_a1,arr_b2],...],
    ...]
    '''
    with tf.summary.FileWriter(tensorboard_log_dir, sess.graph) as writer:
        result_array = sess.run(tf.map_fn(calc_row_wrapper,a_b_itertools_product),
                                options=run_options,run_metadata=run_metadata)
        writer.add_run_metadata(run_metadata,"iteration {}".format(i))
    print(result_array.shape)
    print(result_array)
    print("")
    # result_array should be an array with 3 rows (1 for each binary vector in b) and 7 columns (1 for each row in a)
I can imagine that is unnecessarily consuming memory due to the extra dimension added. Is there a way to mimic the outcome of the standard itertools.product() function to output 1 long list of every possible combination of items in the 2 input iterables? Like the result of:
itertools.product([[1,2],[3,4]],[[5,6],[7,8]])
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
That would eliminate the need to call map_fn twice.
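The flat-list idea can be sketched with plain arrays. Note that this is a NumPy stand-in, not TensorFlow code: it only illustrates the repeat/tile pattern that yields one long list of pairs in itertools.product order, after which the per-pair calculation becomes a single vectorized expression. flat_product and max_cumsum_per_pair are hypothetical names. Whether the same pattern helps with memory in a TensorFlow graph is not verified here, and the flat product still materialises every pair.

```python
import itertools
import numpy as np


def flat_product(a, b):
    """Return (left, right) so that zip(left, right) enumerates the row
    pairs of a and b in itertools.product(a, b) order."""
    left = np.repeat(a, len(b), axis=0)   # a0, a0, ..., a1, a1, ...
    right = np.tile(b, (len(a), 1))       # b0, b1, ..., b0, b1, ...
    return left, right


def max_cumsum_per_pair(a, b):
    """Vectorized analogue of do_calc applied to every (row of a, row of b)."""
    left, right = flat_product(a, b)
    return np.max(np.cumsum(left * right, axis=1), axis=1)


a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
left, right = flat_product(a, b)
print([(l.tolist(), r.tolist()) for l, r in zip(left, right)])
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
print(max_cumsum_per_pair(a, b))  # one value per pair
```

The flat result has len(a) * len(b) entries. Reshaping it to (len(a), len(b)) and transposing would give the "one row per vector in b" layout described in the question.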
When map_fn is called within a loop as my code shows, will it keep spawning graphs for every iteration? There appears to be a big "map_" node for every iteration cycle in this code's TensorBoard graph.
Tensorboard Default View (not enough reputation yet)
When I select a particular iteration based on the tag in TensorBoard, only the map node corresponding to that iteration is highlighted, with all the others grayed out. Does that mean that for that cycle only the map node for that cycle is present (and the others, from previous cycles, no longer exist in memory)?
Tensorboard 1 iteration view
I'm learning Python's generators, iterators and iterables, and I can't explain why the following is not working. I want to create, as an exercise, a simple version of the function zip. Here's what I did:
def myzip(*collections):
    iterables = tuple(iter(collection) for collection in collections)
    yield tuple(next(iterable) for iterable in iterables)

test = myzip([1,2,3],(4,5,6),{7,8,9})
print(next(test))
print(next(test))
print(next(test))
What I do is:
I have collections which is a tuple of some collections
I create a new tuple iterables where, for each collection (which is iterable), I get the iterator using iter
Then, I create a new tuple where, on each iterable, I call next. This tuple is then yielded.
So I expect that at the first execution the object iterables is created (and stored). Then in each iteration (including the first one) I call next on every iterable stored before and return it.
However this is what I get:
(1, 4, 8)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-108-424963a58e58> in <module>()
8
9 print(next(test))
---> 10 print(next(test))
StopIteration:
So I see that the first iteration is fine and the result is correct. However, the second iteration raises a StopIteration exception and I don't understand why: each iterable still has some values, so none of the next calls should raise StopIteration. In fact, this works:
def myziptest(*collections):
    iterables = tuple(iter(collection) for collection in collections)
    for _ in range(3):
        print(tuple(next(iterable) for iterable in iterables))

test = myziptest([1,2,3],(4,5,6),{7,8,9})
Output:
(1, 4, 8)
(2, 5, 9)
(3, 6, 7)
So what is going on?
Thanks a lot
Here's a working solution
def myzip(*collections):
    iterables = tuple(iter(collection) for collection in collections)
    while True:
        try:
            yield tuple([next(iterable) for iterable in iterables])
        except StopIteration:
            # one of the iterables has no more left.
            break

test = myzip([1,2,3],(4,5,6),{7,8,9})
print(next(test))
print(next(test))
print(next(test))
The difference between this code and yours is that your code only yields one result. Meaning, calling next more than once will give you a StopIteration.
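A stripped-down sketch of that difference: a generator body with a single yield and no loop produces exactly one value, while wrapping the yield in a loop lets next() keep going. yields_once and yields_forever are hypothetical names for this illustration.

```python
def yields_once():
    yield "first"
    # No loop and no further yield: the generator finishes here,
    # so the next call to next() raises StopIteration.


def yields_forever():
    n = 0
    while True:  # the loop is what lets next() keep producing values
        yield n
        n += 1


gen = yields_once()
print(next(gen))               # first
print(next(gen, "exhausted"))  # exhausted (the default replaces StopIteration)

counter = yields_forever()
print([next(counter) for _ in range(3)])  # [0, 1, 2]
```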
Think of yield x as putting x into a queue, and next as popping from that queue. When you try to pop from an empty queue, you get a StopIteration. You can pop only as many items as you put in.
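One detail of the working solution is worth spelling out: it builds the tuple from a list comprehension rather than from a generator expression. Assuming Python 3.7 or later, where PEP 479 is in force, a StopIteration raised inside a generator expression is converted to a RuntimeError, so the except StopIteration clause would never see it. The sketch below contrasts the two spellings. zip_listcomp and zip_genexpr are hypothetical names.

```python
def zip_listcomp(*collections):
    iterators = [iter(c) for c in collections]
    while True:
        try:
            # List comprehension: StopIteration from next() propagates as-is.
            yield tuple([next(it) for it in iterators])
        except StopIteration:
            return


def zip_genexpr(*collections):
    iterators = [iter(c) for c in collections]
    while True:
        try:
            # Generator expression: StopIteration raised inside it is turned
            # into RuntimeError, which this except clause does not catch.
            yield tuple(next(it) for it in iterators)
        except StopIteration:
            return


print(list(zip_listcomp([1, 2], "ab")))  # [(1, 'a'), (2, 'b')]

try:
    list(zip_genexpr([1, 2], "ab"))
except RuntimeError as exc:
    print("RuntimeError:", exc)
```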