So I have a Python 3.7 program that uses Threading library to multiprocess tasks
def myFunc(stName,ndName,ltName):
##logic here
names = open('names.txt').read().splitlines() ## more than 30k name
for i in names:
processThread = threading.Thread(target=myFunc, args=(i,name2nd,lName,))
processThread.start()
time.sleep(0.4)
I have to open multiple windows to complete the tasks with different inputs, but eventually I ran into a very laggy situation where I cant even browse my OSX , I tried to use the multiprocessing library to solve the issue but unfortunately, multiprocessing seems not to be working correctly in OSX .
Anyone can advise ?
This behavior is to be expected. If myFunc is a CPU intensive task that takes time, you are potentially starting up to 30k threads doing this task which will use all the machine resources.
Another potential issue with your code is that Threads are expensive in term of memory (each thread uses 8MB of memory). Creating 30k threads would use up to 240GB of memory which your machine probably doesn't have, and will lead to an OutOfMemoryError.
Finally, another issue with that code is that your main routine is starting up all those threads, but not waiting for any of them to finish executing. This means that the last started threads will most likely not run until the end.
I would recommend using a ThreadPoolExecutor to solve all those issues:
from concurrent.futures.thread import ThreadPoolExecutor
def myFunc(stName,ndName,ltName):
##logic here
names = open('names.txt').read().splitlines() ## more than 30k name
num_workers = 8
with ThreadPoolExecutor(max_workers=num_workers) as executor:
for i in names:
executor.map(myFunc, (i, name2nd, lName))
You can play with num_workers to find a balance between amount of resources being used by this program and speed of execution that fits you.
Related
I am new to Python. I have been trying to develop a GUI based tool to monitor a set of databases. I want to pull data with multiple threads to make the DB reads faster. I found that threads can be managed using threading class or concurrent.futures class or using queue. In my tool there will be frequent DB reads and GUI will updated accordingly. My question is - what will be best option to work with for threading ? And how to manage life cycle of threads ?
I tried few example provided in different websites with following results.
threads created using threading class are nicely updating the GUI. But I don't know how to manage 30 threads.
Threads created using concurrent.futures.ThreadPoolExecutor are managed by the class. But it is updating the GUI after all the threads complete their task.
The thing with python threading is that there isn't really a proper way to stop a thread without stopping the entire execution. I am guessing your using threading or _thread
What I would do is create a list and have each function access a certain index of the list.
Process ID = Item in list.
thread 0 would be checking item 0 in list "running".
an example using _thread
import _thread
running = []
def task(id):
global running
while running[id]:
#do something
#Create 5 tasks
for i in range(0,6):
running.append(True)
_thread.start_new_thread(task,(i,))
# Now lets stop tasks 2 and 4.
running[1] = False
running[3] = False
# After doing this the threads will end once code in while loop has finished
# To restart tasks 2 and 4
_thread.start_new_thread(task,(1,))
_thread.start_new_thread(task,(3,))
This is my rudimentary way of managing tasks.
It may or may not work for you.
I am not a professional. But it works.
I'm using the AllenNLP (version 2.6) semantic role labeling model to process a large pile of sentences. My Python version is 3.7.9. I'm on MacOS 11.6.1. My goal is to use multiprocessing.Pool to parallelize the work, but the calls via the pool are taking longer than they do in the parent process, sometimes substantially so.
In the parent process, I have explicitly placed the model in shared memory as follows:
from allennlp.predictors import Predictor
from allennlp.models.archival import load_archive
import allennlp_models.structured_prediction.predictors.srl
PREDICTOR_PATH = "...<srl model path>..."
archive = load_archive(PREDICTOR_PATH)
archive.model.share_memory()
PREDICTOR = Predictor.from_archive(archive)
I know the model is only being loaded once, in the parent process. And I place the model in shared memory whether or not I'm going to make use of the pool. I'm using torch.multiprocessing, as many recommend, and I'm using the spawn start method.
I'm calling the predictor in the pool using Pool.apply_async, and I'm timing the calls within the child processes. I know that the pool is using the available CPUs (I have six cores), and I'm nowhere near running out of physical memory, so there's no reason for the child processes to be swapped to disk.
Here's what happens, for a batch of 395 sentences:
Without multiprocessing: 638 total processing seconds (and elapsed time).
With a 4-process pool: 293 seconds elapsed time, 915 total processing seconds.
With a 12-process pool: 263 seconds elapsed time, 2024 total processing seconds.
The more processes, the worse the total AllenNLP processing time - even though the model is explicitly in shared memory, and the only thing that crosses the process boundary during the invocation is the input text and the output JSON.
I've done some profiling, and the first thing that leaps out at me is that the function torch._C._nn.linear is taking significantly longer in the multiprocessing cases. This function takes two tensors as arguments - but there are no tensors being passed across the process boundary, and I'm decoding, not training, so the model should be entirely read-only. It seems like it has to be a problem with locking or competition for the shared model resource, but I don't understand at all why that would be the case. And I'm not a torch programmer, so my understanding of what's happening is limited.
Any pointers or suggestions would be appreciated.
Turns out that I wasn't comparing exactly the right things. This thread: https://github.com/allenai/allennlp/discussions/5471 goes into all the detail. Briefly, because pytorch can use additional resources under the hood, my baseline test without multiprocessing wasn't taxing my computer enough when running two instances in parallel; I had to run 4 instances to see the penalty, and in that case, the total processing time was essentially the same for 4 parallel nonmultiprocessing invocations, or one multiprocessing case with 4 subprocesses.
Is there a way we could concurrently run functions on CPU and GPU (using Python)? I'm already using Numba to do thread level scheduling for compute intensive functions on the GPU, but I now also need to add parallelism between CPU-GPU. Once we ensure that the GPU shared memory has all the data to start processing, I need to trigger the GPU start and then in parallel run some functions on the host using the CPU.
I'm sure that the time taken by GPU to return the data is much more than the CPU to finish a task. So that once the GPU has finished processing, CPU is already waiting to fetch the data to the host. Is there a standard library/way to achieve this? Appreciate any pointers in this regard.
Thanks Robert and Ander. I was thinking on similar lines but wasn't very sure. I checked that until I put some synchronization for task completion between the cores, (for ex. cp.cuda.Device().synchronize() when using CuPy) I'm effectively running GPU-CPU in parallel. Thanks again. A general flow with Numba, to make gpu_function and cpu_function run in parallel will be something like the following:
""" GPU has buffer full to start processing Frame N-1 """
tmp_gpu = cp.asarray(tmp_cpu)
gpu_function(tmp_gpu)
""" CPU receives Frame N over TCP socket """
tmp_cpu = cpu_function()
""" For instance we know cpu_function takes [a little] longer than gpu_function """
cp.cuda.Device().synchronize()
Of course, we could even do away with the time spent in transferring tmp_cpu to tmp_gpu by employing PING-PONG buffer and initial frame delay.
In the documentation for fit_generator() (docs: https://keras.io/models/sequential/#fit_generator) it says that the parameter use_multiprocessing accepts a bool that if set to True allows process-based threading.
It also says that the parameter workers is an integer that designates how many process to spin up if using process-based threading. Apparently it defaults to 1 (a single process based thread) and if set to 0 it will execute the generator on the main thread.
What I thought this meant was that if use_multiprocessing=True and workers > 0 (let's use 6 for an example) that it would spin up 6 processes running the generator independently. However, when I test this I think I must be misunderstanding something (see below).
My confusion arises from the fact that if I set use_multiprocessing to False and workers = 1 then in my task manager I can see that all 12 of my virtual cores are being utilized somewhat evenly and I am at about 50% CPU usage while training my model (for reference, I have an i7-8750H CPU with 6 cores that support virtualization and I have virtualization enabled in BIOS). If I increase the number of workers, the CPU usage goes to 100% and training is much faster. If I decrease the number of workers to 0 so that it runs on the main thread, I can see that all of my virtual cores are still being used, but it seems somewhat uneven and CPU usage is at about 36%.
Unfortunately, if I set multiprocessing = True, then I get a brokenpipe error. I have yet to fix this, but I'd like to better understand what I am trying to fix here.
If someone could please explain the difference between training with use_multiprocessing = True and use_multiprocessing = False, as well as when workers are = 0, 1, and >1 I would be very grateful. If it matters, I am using tensorflow (gpu version) as the backend for keras with python 3.6 in Spyder with the IPython Console.
My suspicion is that use_multiprocessing is actually enabling multiprocessing when True whereas workers>1 when use_multiprocessing=False is setting the number of threads, but that's just a guess.
The only thing I know is that when use_multiprocessing=False and workers > 1, there are many parallel data loading threads (I'm not really good with these names, threads, processes, etc.). But there are five parallel fronts loading data to the queue (so, loading data is faster, but it doesn't affect the model's speed - this can be good when data loading takes too long).
Whenever I tried use_multiprocessing=True, everything got frozen.
I notice that the following code using multiple threads and keep all CPU cores busy about 100% while it is reading the file.
scala.io.Source.fromFile("huge_file.txt").toList
and I assume the following is the same
scala.io.Source.fromFile("huge_file.txt").foreach
I interrupt this code as a unit test under Eclipse debugger on my dev machine (OS X 10.9.2) and showing these threads: main, ReaderThread, 3 Daemon System Thread. htop shows all threads are busy if I run this in a scala console in a 24-cores server machine (ubuntu 12).
Questions:
How do I limit this code on using N number of threads?
For the sake of understanding the system performance, can you explain to me what, why and how this is done in io.Source? Reading the source doesn't helping.
I assume each line is read in sequence; however, since it is using multiple threads, so is the foreach run in multiple threads? My debugger seems to tell me that the code still run in the main thread.
Any insight would be appreciated.
As suggested, I put my findings here.
I use the following to test my dummy code with and without -J-XX:+UseSerialGC option
$ scala -J-XX:+UseSerialGC
scala> var c = 0
scala> scala.io.Source.fromFile("huge_file.txt").foreach(e => c += e)
Before I use the option, all 24 cores in my server machine are busy during the file read. After the option, only two threads are busy.
Here is the memory profile I captured on my dev machine, not server. I first perform the GC to get the baseline, then I run the above code several times. The Eden Space got clean up periodically. The memory swing is about 20M, while the smaller file I read is about 200M i.e. io.Source creates 10% of temporary objects per each run.
This characteristics will create trouble in a shared system. This will also limit us to handle multiple big files all at once. This stresses memory, i/o and CPU usage in a way that I can't run my code with other production jobs, but run it separately to avoid having this system impact.
If you know a better way or suggestion to handle this situation in a real shared production environment, please let me know.