How can I run batches in parallel with worker threads (multithreading)?

How can I run this process so that the batches execute in parallel with multiprocessing.pool? Any help is appreciated.
I have a list of 20000 items, split into multiple batches of 2000 items each:
batch_size = 2000
I use multiprocessing.pool.
# This loop needs to be parallelized
for batch_index, batched_payloads in enumerate(frappe.utils.create_batch(payloads, batch_size)):
    for i, payload in enumerate(batched_payloads):
        # Do something
        continue
    try:
        doc = self.process_doc(doc)
        # Do something
        frappe.db.commit()
    except Exception:
        frappe.db.rollback()

One solution is to use pymp (https://github.com/classner/pymp) for multiprocessing and parallel processing.
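Since the question asks about multiprocessing.pool specifically, here is a minimal sketch of handing each batch to a pool worker. It makes several assumptions: `create_batches` is a hypothetical stand-in for `frappe.utils.create_batch`, `process_batch` is a placeholder for the real per-document work, and a `ThreadPool` is used so the snippet runs anywhere. The database commit and rollback calls are left out because the question does not establish whether a frappe database connection can be shared between workers.

```python
from multiprocessing.pool import ThreadPool


def create_batches(items, batch_size):
    """Hypothetical stand-in for frappe.utils.create_batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(indexed_batch):
    """Process one batch and report how many payloads succeeded or failed."""
    batch_index, batch = indexed_batch
    processed, failed = 0, 0
    for payload in batch:
        try:
            _ = payload * 2  # placeholder for the real per-document work
            processed += 1
        except Exception:
            failed += 1
    return batch_index, processed, failed


payloads = list(range(20000))
batch_size = 2000

# Each worker thread takes whole batches; map returns results in batch order.
with ThreadPool(processes=4) as pool:
    results = pool.map(process_batch, enumerate(create_batches(payloads, batch_size)))

for batch_index, processed, failed in results:
    print(f'batch {batch_index}: processed={processed}, failed={failed}')
```

Threads suit work that mostly waits on I/O such as database calls. For CPU-bound work, `multiprocessing.Pool` offers the same `map` interface, but the worker function and its arguments must then be picklable.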

Related

How does the queue in Pytorch DataLoader work with num_workers >= 2?

Do I understand the following correctly?
When num_workers >=1, the main process pre-loads prefetch_factor * num_workers batches. When the training loop consumes one batch, the corresponding worker loads the next batch in its queue.
If this is the case, let's go through an example.
NOTE: I have chosen the numeric values for illustration purposes and have ignored various overheads in this example. The accompanying code example uses these numbers.
Say I have num_workers=4, prefetch_factor=4, batch_size=128. Further assume, it takes 0.003125 s to fetch an item from a source database and the train step takes 0.05 s.
Now, each batch would take 0.003125 * 128 = 0.4 s to load.
With a prefetch_factor=4 and num_workers=4, first, 4*4=16 batches will be loaded.
Once the 16 batches are loaded, the first train step consumes 1 batch and takes 0.05 s. Say worker[0] provided this batch and will start the process to generate a new batch to replenish the queue. Recall fetching a new batch takes 0.4 s.
Similarly, the second step consumes one more batch and the corresponding worker (worker[1] in this example) starts the data fetching process.
The first 8 train steps would take 0.05*8=0.4s. By this time, 8 batches have been consumed and worker[0] has produced 1 batch. In the next step, 1 batch is consumed and worker[1] produces a new batch. worker[1] had started the data fetching process in the second train step, which would now be completed.
Following this, we can see that each subsequent train step will consume 1 batch and one of the workers will produce 1 batch, so the dataloader queue always holds 8 batches. This means that the train step never waits for the data loading process, as there are always 8 batches in the buffer.
I would expect this behavior regardless of the data size of the batch, provided num_workers and prefetch_factor are large enough. However, in the following code example that is not the case.
In the code below, I define a custom iterable that returns a numpy array. As the size of the numpy array increases, increasing num_workers or prefetch_factor does not improve the time taken for running through a batch.
I'm guessing this is because each worker serializes the batch to send to the main process, where it is de-serialized. As the data size increases, this process would take more time. However, I would think that if the queue size is large enough (num_workers, prefetch_factor), at some point there should be a break-even point where each training step's consumption of a batch is accompanied by replenishment via one of the workers, as I illustrated in the above example.
In the code below, when MyIterable returns a small object (np array of size (10, 150)), increasing num_workers helps as expected. But when the returned object is larger (np array of size (1000, 150)), num_workers or prefetch_factor does not do much.
# small np object
avg time per batch for num workers=0: 0.47068126868714444
avg time per batch for num workers=2: 0.20982365206225495
avg time per batch for num workers=4: 0.10560789656221914
avg time per batch for num workers=6: 0.07202646931250456
avg time per batch for num workers=8: 0.05311137337469063
# large np object
avg time per batch for num workers=0: 0.6090951558124971
avg time per batch for num workers=2: 0.4594530961876444
avg time per batch for num workers=4: 0.45023533212543043
avg time per batch for num workers=6: 0.3830978863124983
avg time per batch for num workers=8: 0.3811495694375253
Am I missing something here? Why doesn't the data loader queue have enough buffer such that data loading is not the bottleneck?
Even if the serialization and de-serialization process takes longer in the latter case, I'd expect to have a large enough buffer that the consumption and replenishment rates of the batches are almost equal. Otherwise, what is the point of having prefetch_factor?
If the code is behaving as expected, are there any other ways to pre-load the next n batches in a buffer such that it is large enough and never depleted?
Thanks
import time
import torch
import numpy as np
from time import sleep
from torch.utils.data import DataLoader, IterableDataset


def collate_fn(records):
    # some custom collation function
    return records


class MyIterable(object):
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            self.i += 1  # advance the counter so iteration stops after n items
            sleep(0.003125)  # simulates data fetch time
            # return np.random.random((10, 150))  # small data item
            return np.random.random((1000, 150))  # large data item
        else:
            raise StopIteration


class MyIterableDataset(IterableDataset):
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        return MyIterable(self.n)


def get_performance_metrics(num_workers):
    ds = MyIterableDataset(n=10000)
    if num_workers == 0:
        dl = torch.utils.data.DataLoader(ds, num_workers=0, batch_size=128, collate_fn=collate_fn)
    else:
        dl = torch.utils.data.DataLoader(ds, num_workers=num_workers, prefetch_factor=4, persistent_workers=True,
                                         batch_size=128, collate_fn=collate_fn,
                                         multiprocessing_context='spawn')
    warmup = 5
    times = []
    t0 = time.perf_counter()
    for i, batch in enumerate(dl):
        sleep(0.05)  # simulates train step
        e = time.perf_counter()
        if i >= warmup:
            times.append(e - t0)
        t0 = time.perf_counter()
        if i >= 20:
            break
    print(f'avg time per batch for num workers={num_workers}: {sum(times) / len(times)}')


if __name__ == '__main__':
    num_worker_options = [0, 2, 4, 6, 8]
    for n in num_worker_options:
        get_performance_metrics(n)
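The numbers in the question can be checked against a simple throughput model. This is a back-of-the-envelope sketch, not a description of DataLoader internals: it assumes a prefetch buffer only absorbs bursts, so in steady state the loop cannot run faster than the workers jointly produce batches. The function name and the model itself are assumptions made for illustration, and serialization cost is ignored.

```python
def estimated_time_per_batch(num_workers, item_fetch_s=0.003125,
                             batch_size=128, train_step_s=0.05):
    """Steady-state seconds per training iteration under a simplified model."""
    batch_load_s = item_fetch_s * batch_size  # about 0.4 s with the question's numbers
    if num_workers == 0:
        # Loading and training happen one after the other in the main process.
        return batch_load_s + train_step_s
    # Workers jointly deliver one batch every batch_load_s / num_workers seconds;
    # the loop runs at the pace of whichever side is slower.
    return max(train_step_s, batch_load_s / num_workers)


estimates = {n: estimated_time_per_batch(n) for n in [0, 2, 4, 6, 8]}
for n, seconds in estimates.items():
    print(f'estimated time per batch for num workers={n}: {seconds:.4f}')
```

Under this model, four workers that each need 0.4 s per batch deliver about one batch every 0.1 s, while a 0.05 s train step wants one every 0.05 s, so the initial buffer drains no matter how large prefetch_factor is. The estimates track the small-array measurements closely (roughly 0.1 s at num_workers=4 and 0.05 s at num_workers=8). The large-array timings are higher than the model predicts, which is consistent with per-batch overhead that it leaves out.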

PyTorch: while loading batched data using Dataloader, how to transfer the data to GPU automatically

If we use a combination of the Dataset and Dataloader classes (as shown below), I have to explicitly load the data onto the GPU using .to() or .cuda(). Is there a way to instruct the dataloader to do it automatically/implicitly?
Code to understand/reproduce the scenario:
from torch.utils.data import Dataset, DataLoader
import numpy as np

class DemoData(Dataset):
    def __init__(self, limit):
        super(DemoData, self).__init__()
        self.data = np.arange(limit)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return (self.data[idx], self.data[idx]*100)

demo = DemoData(100)
loader = DataLoader(demo, batch_size=50, shuffle=True)
for i, (i1, i2) in enumerate(loader):
    print('Batch Index: {}'.format(i))
    print('Shape of data item 1: {}; shape of data item 2: {}'.format(i1.shape, i2.shape))
    # i1, i2 = i1.to('cuda:0'), i2.to('cuda:0')
    print('Device of data item 1: {}; device of data item 2: {}\n'.format(i1.device, i2.device))
Which will output the following; note - without explicit device transfer instruction, the data is loaded onto CPU:
Batch Index: 0
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
Batch Index: 1
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
A possible solution is in this PyTorch GitHub issue (still open at the time this question was posted), but I am unable to make it work when the dataloader has to return multiple data items!
You can modify the collate_fn to handle several items at once:
import torch
from torch.utils.data.dataloader import default_collate
device = torch.device('cuda:0')  # or whatever device/cpu you like
# the new collate function is quite generic
loader = DataLoader(demo, batch_size=50, shuffle=True,
                    collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
Note that if you want to have multiple workers for the dataloader, you'll need to add
torch.multiprocessing.set_start_method('spawn')
after your if __name__ == '__main__' (see this issue).
Having said that, it seems like using pin_memory=True in your DataLoader would be much more efficient. Have you tried this option?
See memory pinning for more information.
Update (Feb 8th, 2021)
This post made me look at my "data-to-model" time spent during training.
I compared three alternatives:
1. DataLoader works on CPU and only after the batch is retrieved data is moved to GPU.
2. Same as (1) but with pin_memory=True in DataLoader.
3. The proposed method of using collate_fn to move data to GPU.
From my limited experimentation it seems like the second option performs best (but not by a big margin).
The third option required fussing about the start_method of the data loader processes, and it seems to incur an overhead at the beginning of each epoch.
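The pin_memory variant from the comparison can be sketched as follows. This is a minimal illustration under assumptions: a `TensorDataset` stands in for the `DemoData` class from the question, and the code falls back to the CPU when no GPU is available so that it runs anywhere. Pairing `pin_memory=True` with `non_blocking=True` on the transfer is the combination the PyTorch memory pinning notes describe for asynchronous host-to-GPU copies.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')

# Stand-in for DemoData: pairs of (value, value * 100).
data = torch.arange(100)
dataset = TensorDataset(data, data * 100)

# Pinned (page-locked) host memory only matters when copying to a GPU.
loader = DataLoader(dataset, batch_size=50, shuffle=True, pin_memory=use_cuda)

devices_seen = []
for i1, i2 in loader:
    # non_blocking=True lets the copy overlap with other work when the source is pinned.
    i1 = i1.to(device, non_blocking=True)
    i2 = i2.to(device, non_blocking=True)
    devices_seen.append((i1.device.type, i2.device.type))

print(devices_seen)
```

The transfer is still explicit here, but it stays in the main process, so no change to the worker start method is needed.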

Is it possible to use multithreading for hyperparameter tuning with keras?

Since hyperparameter tuning seems to consist of training different models for the same task, I suppose it is a good idea to train them in parallel in order to gain some time. However, my attempt was quite unsuccessful, as multiple errors occurred during the execution of my code. I was wondering if using keras requires me to write multithreading differently, or if the problem lies elsewhere.
Here's what I wrote (I'm trying to calculate the effect of dropout on the minimal value of a custom metric) :
from threading import Thread

class FitModel(Thread):
    def __init__(self, params):
        Thread.__init__(self)
        self.params = params

    def run(self):
        DC=DeltaCallback(verbose=0) #custom metric
        model=keras.models.Sequential([
            keras.layers.Conv1D(64,11,activation="relu",padding="SAME",input_shape=(700,1)),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(128,11,activation="relu",padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(256,11,activation="relu",padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(512,11,activation="relu",padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Conv1D(512,11,activation="relu",padding="SAME"),
            keras.layers.BatchNormalization(),
            keras.layers.AvgPool1D(pool_size=2),
            keras.layers.Flatten(),
            keras.layers.Dropout(self.params[0]),
            keras.layers.Dense(4096,activation="relu"),
            keras.layers.Dropout(self.params[1]),
            keras.layers.Dense(4096,activation="relu"),
            keras.layers.Dense(256,activation="softmax")
        ])
        model.compile(loss="sparse_categorical_crossentropy",
                      optimizer=keras.optimizers.RMSprop(learning_rate=0.00001),
                      metrics=["accuracy"])
        model.fit(X_train,y_train,batch_size=100,epochs=50,validation_data=(X_valid,y_valid),callbacks=[DC],verbose=2)
        print(self.params,"epochs : ",DC.deltas.index(min(DC.deltas)))
        print(self.params,"deltamin : ",DC.deltas[DC.deltas.index(min(DC.deltas))])
        print(self.params,"Nval : ", DC.Nvals[DC.deltas.index(min(DC.deltas))])

parameters_list=[[0,0.1],[0.1,0],[0.2,0],[0.1,0.2],[0.2,0.1],[0.3,0],[0.3,0.1],[0.3,0.2],[0,0.3],[0.1,0.3],[0.2,0.3]]
# create threads
THREADS = [FitModel(parameters) for parameters in parameters_list]
# start threads
for thread in THREADS:
    thread.start()
# wait for threads to finish
for thread in THREADS:
    thread.join()
The problem is that multiple Exceptions occur when I try to execute this code, as well as OOM errors. Any idea how to make this work?
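The code above starts all eleven trainings at once, so eleven models are built and held in memory together, which is one plausible source of the OOM errors. Below is a minimal sketch of capping how many trainings run at the same time. `fit_one` is a hypothetical placeholder for building and fitting one model, and the sketch does not establish that Keras supports concurrent `fit` calls from threads. It only shows how to bound concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_one(params):
    """Hypothetical stand-in for building and fitting one model."""
    dropout_a, dropout_b = params
    # A real implementation would build the model, call fit, and return the metric.
    return params, dropout_a + dropout_b


parameters_list = [[0, 0.1], [0.1, 0], [0.2, 0], [0.1, 0.2]]

# max_workers caps how many trainings are in flight at the same time.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(fit_one, parameters_list))

for params, score in results:
    print(params, 'score :', score)
```

Setting `max_workers=1` reduces this to sequential training, which is a useful baseline for checking whether the exceptions come from concurrency at all.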

How to have incrementing batch size in pytorch

In PyTorch, DataLoader will split a dataset into batches of a set size, with additional options such as shuffling, which one can then loop over.
But if I need the batch size to increment, such as the first 10 batches of size 50, the next 5 batches of size 100 and so on, what's the best way of doing so?
I tried splitting the tensor then concat them:
#10x50 + 5*100
originalTensor = torch.randn(1000, 80)
split1=torch.split(originalTensor, 500, dim=0)
split2=torch.split(list(split1)[0], 100, dim=0)
After that, is there a way to pass the concatenated tensor into DataLoader, or any other way to directly turn the concatenated tensor into a generator (which might lose shuffling and other functionality)?
I think you can do that by simply providing a non-default batch_sampler to your DataLoader.
For instance:
from torch.utils.data import Sampler


class VaryingSizeBatchSampler(Sampler):
    r"""Wraps another sampler to yield a varying-size mini-batch of indices.

    Args:
        sampler (Sampler): Base sampler.
        batch_size_fn (function): Size of current mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``
    """
    def __init__(self, sampler, batch_size_fn, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        self.sampler = sampler
        self.batch_size_fn = batch_size_fn
        self.drop_last = drop_last
        self.batch_counter = 0

    def __iter__(self):
        batch = []
        cur_batch_size = self.batch_size_fn(self.batch_counter)  # get current batch size
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == cur_batch_size:
                yield batch
                self.batch_counter += 1
                cur_batch_size = self.batch_size_fn(self.batch_counter)  # get current batch size
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        raise NotImplementedError('You need to implement it yourself!')
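To show what a batch size schedule for the question's case (10 batches of 50, then batches of 100) might look like, here is a compact, self-contained sketch. It relies on the fact that DataLoader's `batch_sampler` argument accepts any iterable of index lists. `batch_size_fn` and `make_batches` are hypothetical names introduced for illustration. This is a simplification of the sampler class above, not a replacement for it.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def batch_size_fn(batch_index):
    """Hypothetical schedule: the first 10 batches have 50 items, later ones 100."""
    return 50 if batch_index < 10 else 100


def make_batches(num_items, size_fn, shuffle=True):
    """Cut a (possibly shuffled) index order into batches of scheduled sizes."""
    order = torch.randperm(num_items).tolist() if shuffle else list(range(num_items))
    batches, start, batch_index = [], 0, 0
    while start < num_items:
        size = size_fn(batch_index)
        batches.append(order[start:start + size])
        start += size
        batch_index += 1
    return batches


dataset = TensorDataset(torch.randn(1000, 80))
loader = DataLoader(dataset, batch_sampler=make_batches(len(dataset), batch_size_fn))

batch_sizes = [batch[0].shape[0] for batch in loader]
print(batch_sizes)
```

Because the batches are precomputed as a list, every epoch reuses the same shuffled order. Rebuilding the list each epoch, or using a sampler class like the one above, restores per-epoch shuffling.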

Ordered results with apply_async

I have read that the function apply_async doesn't give ordered results. If I have repeated calls to a function which prints the squares of a list of numbers, I can see from the display that the list is not ordered.
However when the function returns the number instead of printing it and I use .get() to get the values, then I see that the results are ordered.
I have a few questions --
Why are the results from .get() ordered?
If I have a loop which has a variable named a and its value is different for different iterations. Will using apply_async cause overwrites of the values of a as it runs the processes in parallel and asynchronously?
Will I be able to save computational time if I run apply instead of apply_async? My code shows that apply is slower than the for loop. Why is that so?
Can we use a function declared within the __main__ function with apply_async?
Here is a small working example:
from multiprocessing import Pool
import time

def f(x):
    return x*x

if __name__ == '__main__':
    print('For loop')
    t1f = time.time()
    for ii in range(20):
        f(ii)
    t2f = time.time()
    print('Time taken for For loop = ', t2f-t1f,' seconds')
    pool = Pool(processes=4)
    print('Apply async loop')
    t1a = time.time()
    results = [pool.apply_async(f, args = (j,)) for j in range(20)]
    pool.close()
    pool.join()
    t2a = time.time()
    print('Time taken for pool = ', t2a-t1a,' seconds')
    print([results[hh].get() for hh in range(len(results))])
This results in:
For loop
Time taken for For loop = 5.9604644775390625e-06 seconds
Apply async loop
Time taken for pool = 0.10188460350036621 seconds
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Why are the results from .get() ordered?
Because the results list is ordered.
If I have a loop which has a variable named a and its value is different for different iterations. Will using apply_async cause overwrites of the values of a as it runs the processes in parallel and asynchronously?
Generally no, but I can't tell without seeing the code.
Will I be able to save computational time if I run apply instead of apply_async? My code shows that apply is slower than the for loop. Why is that so?
No. apply blocks on each call, so there is no parallelism. apply is slower because of multiprocessing overhead.
Can we use a function declared within the __main__ function with apply_async?
Yes on *nix, no on Windows, because there is no fork().
Your time measurement of .apply_async is wrong: you should take t2a after result.get, and don't assume the results finish in order:
while not all(r.ready() for r in results):
    time.sleep(0.1)
By the way, your work function finishes too quickly; do more computation for a true benchmark.
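The points about ordering and timing can be illustrated with a short sketch. It uses `multiprocessing.pool.ThreadPool`, which exposes the same `apply_async` and `get` interface as `Pool`, so the snippet runs without a `__main__` guard. That substitution, the `slow_square` function, and the sleep durations are all assumptions chosen to make tasks finish out of order.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_square(x):
    # Smaller inputs sleep longer, so tasks tend to finish in reverse order.
    time.sleep(0.01 * (5 - x))
    return x * x


completion_order = []  # filled by the callback as each task finishes

with ThreadPool(processes=5) as pool:
    start = time.perf_counter()
    async_results = [
        pool.apply_async(slow_square, args=(j,), callback=completion_order.append)
        for j in range(5)
    ]
    # get() blocks until that particular task is done, so collecting in list
    # order returns values in submission order regardless of finish order.
    ordered = [r.get() for r in async_results]
    elapsed = time.perf_counter() - start  # stop the clock only after every get()

print('submission order:', ordered)
print('completion order:', completion_order)
print(f'elapsed: {elapsed:.3f} s')
```

The list of result handles is built in submission order, and each `get()` waits for its own task, which is why the collected values appear ordered even when the tasks themselves complete in a different sequence.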
