Let's say I have a generator that produces a random number on each call, and I want the combinations of these numbers taken two at a time.
import itertools
import random

def generate():
    while True:  # note this is not actually infinite, just an example
        yield random.randint(1, 10)

for combo in itertools.combinations(generate(), 2):
    # DO AN OPERATION WITH THE COMBINATION
    # HOW DO I MULTI-THREAD THIS?
    pass
But my generator is going to yield a total of n numbers, where n is 24,000+. So I need to process the combinations as they are generated instead of storing them all in a list (memory).
I also need to multithread this operation by dividing the combinations among at least 4 threads.
I thought of doing this round robin, i.e. assigning 4 queues, with each thread responsible for one queue (see the sketch below).
Do you guys have any other recommendations? I need the script to finish executing as soon as possible.
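To make the round-robin idea concrete, here is a minimal sketch using multiprocessing rather than threads (the work is CPU-bound, so the GIL would limit threads). The bounded generate(), the worker body, and the queue size are placeholders for illustration, not the real program:

import itertools
import multiprocessing
import random

def generate(n=200):
    """Bounded stand-in for the generator above (the real one yields 24,000+)."""
    for _ in range(n):
        yield random.randint(1, 10)

def worker(q):
    """Consume pairs from one queue until the None sentinel arrives."""
    while True:
        combo = q.get()
        if combo is None:
            break
        _ = combo[0] * combo[1]  # stand-in for the real operation

if __name__ == '__main__':
    n_workers = 4
    queues = [multiprocessing.Queue(maxsize=1000) for _ in range(n_workers)]
    procs = [multiprocessing.Process(target=worker, args=(q,)) for q in queues]
    for p in procs:
        p.start()
    # deal the combinations out round robin as they are produced
    for i, combo in enumerate(itertools.combinations(generate(), 2)):
        queues[i % n_workers].put(combo)
    for q in queues:
        q.put(None)  # sentinel: tell each worker to stop
    for p in procs:
        p.join()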
EDIT:
Ok, I just wrote both versions of this program (list-based and generator-based), and my list-based version is actually taking less RAM. How is this possible?
EDIT 2:
It was because I tried to plot points using pyplot one point at a time, which caused the graph to be re-rendered on every call.
I come from Java. In Java 8 and above we have the concurrency API (ExecutorService, CountDownLatch, CyclicBarrier, the parallel Stream API). Is there any similarly simple API in Python 3? I have found only a lot of ugly code where everyone reinvents the wheel, like a fork-join operation hard-coded for one specific dict or list with custom code.
Let's say I have a dataset with 50,000 elements, each carrying a group_id integer. I want to count how many elements there are in each group.
Something like this, but I want to make it nice, clean & parallel:
import collections

dataset_dict = collections.defaultdict(int)
for img, group_id in dataset:
    dataset_dict[classes[group_id]] += 1
print(dataset_dict)
The best I have found is the Ray library for Python 3, but the API is very low level and not on par with other modern languages. With the boom of lambdas and PyTorch/Keras machine learning in Python, and the progress in TypeScript and the overhaul of Java since Java 8, I really need something like that in Python 3.
Can you provide a simple example for the above code? I tried something with Ray, which seems the simplest, but the problem is writing increments to a shared variable. Perhaps you know a better, more modern API for Python 3.
The expected behavior is that the 50,000 elements will be split across the number of CPUs, each worker will sum up the group counts, and the partial results will then be joined into a final result. I think it could be just a simple fork-join pool in this case. I want perfectly clean, easily readable code, so you just read it and get that "aha" moment: simple, but also smart, because the beauty is in the simplicity.
One somewhat fundamental difference between Python and Java is that Python has a Global Interpreter Lock (GIL). This makes it a little more difficult to implement low-level threading the way you would in Java.
In Python, parallelism is typically achieved through multiple processes. multiprocessing is the built-in library that wraps spawning multiple processes and creating shared-memory objects. Note there is also an asyncio library, which provides coroutines but not true parallelism (user-level cooperative multitasking).
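In fact, the standard library alone can express the fork-join you describe. A minimal sketch, assuming dataset is a list of (img, group_id) pairs and classes maps group IDs to names as in your code (both must be defined at module level so the worker processes can see them):

import collections
from multiprocessing import Pool

def count_chunk(chunk):
    """Fork step: count group occurrences in one slice of the dataset."""
    counts = collections.Counter()
    for img, group_id in chunk:
        counts[classes[group_id]] += 1
    return counts

if __name__ == '__main__':
    n = 4  # number of worker processes
    chunks = [dataset[i::n] for i in range(n)]  # split the data by striding
    with Pool(n) as pool:
        partial = pool.map(count_chunk, chunks)  # fork
    total = sum(partial, collections.Counter())  # join
    print(total)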
Ray is a full distributed system, so it can help parallelize/distribute Python code across all cores on a single machine or across a full cluster. With Ray, you could use a parallel iterator instead of a list and wrap your dataset_dict in an actor. It might look something like:
import ray
from ray.util.iter import from_items

dataset_iter = from_items(dataset)
result_iter = dataset_iter.for_each(lambda x: ray.get(dataset_dict.increment.remote(x)))
# This line starts the processing (for_each is lazy until the results are gathered)
list(result_iter.gather_async())
and dataset_dict would look something like:
import collections
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.counter = collections.Counter()

    def increment(self, key):
        self.counter[key] += 1

dataset_dict = Counter.remote()
I'm new to Python and struggling to understand some things in multiprocessing/threading. I want to speed up a function and have been trying different approaches from the multiprocessing module, but I can't get it to run any faster. It's possible it won't run any faster, but I wanted to be sure this is the case before giving up. This isn't a full description, but the most time-consuming activities are:
- repeatedly generating random data (10,000 rows and 10 columns),
- using a pre-fit model to predict an outcome for each row, and
- comparing each predicted value to an initial value.
It performs this multiple times, depending on how many of the predicted values equal the initial value, updating the parameters of the distribution each time. The output of the function is a single numeric value.
I want to loop over several of these initial values and end up with a list of the output values, roughly the shape sketched below. I was hoping to get multiple iterations to run concurrently (but I'm open to anything that could make it faster). I've been ignorantly attempting pool.apply, starmap, and Process, but haven't seen any change in run time.
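For concreteness, something like this, where run_simulation is a hypothetical stand-in for the function described above and the initial values are placeholders:

from multiprocessing import Pool

def run_simulation(initial_value):
    """Hypothetical stand-in: generate random data, predict with the
    pre-fit model, compare to initial_value, and return one number."""
    return initial_value  # placeholder body

if __name__ == '__main__':
    initial_values = [0.1, 0.5, 1.0, 2.0]  # placeholder inputs
    with Pool() as pool:  # one worker per CPU core by default
        results = pool.map(run_simulation, initial_values)
    print(results)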
My questions are:
Based on the description of what I'm doing, is my program I/O-bound or CPU-bound? (Is it possible to tell from that? Is this even the right question to be asking?)
Should I be using multithreading or multiprocessing?
How can I determine if the iterations are running concurrently or not?
Given you didn't mention anything about drives, I'm going to assume it's not very I/O-bound (although that's still possible). Are you using multiple threads/processes yet? If not, that's definitely your issue.
I'd probably look at Python's multiprocessing library rather than plain threads (your workload sounds CPU-bound, and the GIL keeps threads from running Python code in parallel), and because of the loop that creates the data, a process pool. You just need all of your workers running that random-generation function at the same time.
EDIT: I forgot to mention: if you open Task Manager/System Monitor, you should be able to see the load per CPU/thread. If only one is maxed out at any given time, you aren't running concurrently.
Example: I wrote a quick example to help with the pool. Your 10,000-item list with 10 columns was not even noticeable on my i7. I increased the columns to 10,000 and it used 4 GB of RAM and roughly 30 seconds of 100% CPU @ 3.4 GHz.
from multiprocessing import Pool, Array
import random

def thread_function(_):
    """Return a list of 10,000 random numbers (the argument is ignored)."""
    l = []
    for _ in range(10000):
        l.append(random.randint(0, 10000))
    return l

if __name__ == '__main__':
    rand_list = Array('i', range(10000))  # 10,000 dummy tasks, one per element
    with Pool() as pool:
        rand_list = pool.map(thread_function, rand_list)
    print(len(rand_list))
I have a large matrix, say 20,000 x 20,000, and this matrix changes every iteration. The matrix is produced in Fortran, and Fortran calls a C++ function that processes the matrix into block-diagonal form. I would like the C++ function to create two threads (using C++11), each handling a 10,000 x 10,000 block; I can easily break the matrix into two parts since it is a special matrix. The matrix elements change every iteration, and if I create and join (kill) the two threads each iteration, the overhead becomes prohibitively expensive and the point of using a multi-threading approach is lost. So I decided to do the iterations inside the threads; however, I am not sure whether I can keep the threads waiting for the updated matrix in order to solve it in the next iteration (we need to go back to Fortran to compute the new matrix in between).
The point I am stuck at is the following:
When I create the two threads from the function in C++, that function returns to Fortran and the function's local state is destroyed (right?). What happens to the two threads that are still waiting for the new matrix?
If I have two datasets (with equal numbers of rows and columns) and I wish to run a piece of code I have written, then there are obviously two options: sequential execution or parallel programming.
Now, the algorithm (code) I have written is large and consists of multiple for loops. I want to ask: is there any way to use it directly on both datasets, or will I have to transform the code in some way? A heads-up would be great.
To answer your question: you do not have to transform the code to run it on the two datasets in parallel; it should work fine as it is.
The need for parallel processing usually arises in two ways (for most users, I would imagine):
You have code you can run sequentially, but you would like to do it in parallel.
You have a function that is taking very long to execute on a large dataset, and you would like to run it in parallel to speed it up.
For the first case, you do not have to do anything; you can just execute it in parallel using one of the libraries designed for that, or run two instances of R on the same computer with the same code but a different dataset in each.
It doesn't matter how many for loops you have in there, and you don't even need the same number of rows and columns in the datasets.
If it runs fine sequentially, it means there will be no dependence between the parallel chains and thus no problem.
Since your question falls in the first case, you can run it in parallel.
If you have the second case, you can sometimes turn it into the first case by splitting your dataset into pieces (where you can run each of the pieces sequentially) and then you run it in parallel. This is easier said than done, and won't always be possible. It is also why not all functions just have a run.in.parallel=TRUE option: it is not always obvious how you should split the data, nor is it always possible.
So once you have written the function and split the data, you have already done most of the work.
Here is a general way of doing parallel processing with one function, on two datasets:
library(doParallel)

cl <- makeCluster(2)  # for 2 processors, i.e. 2 parallel chains
registerDoParallel(cl)

datalist <- list(mydataset1, mydataset2)

# now start the chains
nchains <- 2  # for two processors
results_list <- foreach(i = 1:nchains,
                        .packages = c('packages_you_need')) %dopar% {
    result <- find.string(datalist[[i]])
    return(result)
}

stopCluster(cl)  # release the workers when done
The result will be a list with two elements, each containing the results from one chain. You can then combine them as you wish, or use a .combine function; see the foreach help for details.
You can use this code any time you have a case like number 1 described above. Most of the time you can also use it for cases like number 2, if you spend some time thinking about how you want to divide the data, and then combine the results. Think of it as a "parallel wrapper".
It should work on Windows, GNU/Linux, and macOS, but I haven't tested it on all of them.
I keep this script handy whenever I need a quick speed-up, but I still always start out by writing code I can run sequentially. Thinking in parallel hurts my brain.
I am writing an app and need to do something functionally similar to what URL-shortening websites do. I will be generating 6-character (case-insensitive alphanumeric) random strings that identify the longer versions of the links. This gives (10+26)^6 = 2,176,782,336 possibilities. While assigning these strings, there are two approaches I can think of.
Approach 1: the system generates a random string at runtime and checks it for uniqueness; if it is not unique, it tries again until it finds a unique string. But this might create issues if the user is "unlucky".
Approach 2: I generate a pool of possible values in advance and assign them as soon as they are needed. This would ensure the user is always allocated a unique string almost instantly, but it also means I would have to do plenty of computation in cron jobs beforehand, and that work will grow over time.
While I already have the code to generate such values, some guidance on the choice of approach would be insightful, as I am aiming for a highly responsive app. I could not find any comparative study on this.
Cheers!
What I do in similar situations is keep N values queued up so that I can assign them instantly, and then when the queue's size falls below a certain threshold (say 0.2 * N) have a background task add another N items to the queue. It probably makes sense to start this background task as soon as your program starts (as opposed to generating the first N values offline and then loading them at startup), operating on the assumption that there will be some delay between startup and the first requests for values from the queue.
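For what it's worth, a minimal single-process sketch of that pattern (generate_code, CodePool, and the numbers are illustrative; a real app would also check each code for uniqueness against its datastore):

import queue
import random
import string
import threading

ALPHABET = string.ascii_lowercase + string.digits  # case-insensitive alphanumeric

def generate_code(length=6):
    """Illustrative generator for one random short code."""
    return ''.join(random.choices(ALPHABET, k=length))

class CodePool:
    """Keep N codes queued; refill in the background when the queue runs low."""
    def __init__(self, n=1000, threshold=0.2):
        self.n = n
        self.low_mark = int(threshold * n)
        self.codes = queue.Queue()
        self._refilling = threading.Lock()
        self._refill()  # start filling at startup rather than offline

    def _refill(self):
        if not self._refilling.acquire(blocking=False):
            return  # a refill is already in progress
        def work():
            try:
                for _ in range(self.n):
                    self.codes.put(generate_code())
            finally:
                self._refilling.release()
        threading.Thread(target=work, daemon=True).start()

    def get(self):
        if self.codes.qsize() < self.low_mark:
            self._refill()
        return self.codes.get()  # blocks briefly only if the queue is empty

pool = CodePool()
print(pool.get())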