Multithreading dropping many tasks - python-3.x

I want to do a simple job: I have a list of n elements, and I want to split it into two smaller lists, use threading to perform a simple calculation on each, and append the results to a new list. I've written some test code and it seems to work fine with a small number of elements (say 3,000). But when the list is larger (30,000), somewhere between 12,000 and 20,000 of the tasks are dropped and the appends just don't go through.
I've read a lot about what constitutes thread safety, and about queueing. I believe the problem has something to do with that, but even after experimenting with Lock() I still can't make the operation thread-safe.
Can someone point me in the right direction? Cheers.
# Separate thread workload
a_genes = genes[0:count_seperator]
b_genes = genes[count_seperator:genes_count]

class GeneThread(Thread):
    def __init__(self, genelist):
        Thread.__init__(self)
        self.genelist = genelist

    def run(self):
        for gene in self.genelist:
            total_reputation = 0
            for local_snp in gene:
                user_rsid = rsids[0]
                if user_rsid is None:
                    continue
                rep = "B"
                # If multiplier is 0, don't waste time calculating
                if not rep or rep == "G" or rep == "U":
                    continue
                importance = 1
                weighted_reputation = importance * mul[rep]
                zygosity = "homozygous_minor"
                if rep == "B":
                    weighted_reputation *= z_mul[zygosity]
                # Apply the spread amplifier: raise the score to the power of the spread number
                rep_square = pow(spread, weighted_reputation)
                total_reputation += rep_square
            try:
                with lock:
                    UserGeneReputation.append(total_reputation)
            except:
                pass

start_time = time.time()
# Create new threads
gene_thread1 = GeneThread(genelist=a_genes)
gene_thread2 = GeneThread(genelist=b_genes)
gene_thread1.daemon, gene_thread2.daemon = True, True
# Start new threads
gene_thread1.start()
gene_thread2.start()
print(len(UserGeneReputation))
print("--- %s seconds ---" % (time.time() - start_time))

You have, broadly speaking, two choices with threads. You can have them be autonomous, do their work, and then terminate themselves quietly. Or you can have them be managed by some other thread that monitors their lifetime and knows when they're done. You have a design that absolutely requires the second option (how else will you know when you have all the results you need?), yet you've chosen the first (set them for self-termination and not monitored).
Don't make the threads daemon threads. Instead, wait for both threads to finish after you start them. That's not the most sophisticated or elegant solution, but it's the one everyone learns first.
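As a minimal sketch of that fix, reusing the question's own thread class (drop the daemon flag and join() both threads before reading the results):

gene_thread1 = GeneThread(genelist=a_genes)
gene_thread2 = GeneThread(genelist=b_genes)
gene_thread1.start()
gene_thread2.start()
# Block until both workers have finished; only after join() returns
# is UserGeneReputation guaranteed to contain every result.
gene_thread1.join()
gene_thread2.join()
print(len(UserGeneReputation))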
The problem with this approach is that it ties your code to how work is assigned to threads. That can hurt performance, because you end up creating and destroying a thread every time you need to know when some work is done, and the only way to know the work is done is to wait for its thread to finish. Ideally, you would treat threads as an abstraction that gets work done somehow, and code that has to wait for work to be finished would wait on the work itself (through some synchronization associated with the work) rather than on the thread. That way you stay flexible about which thread does which work, and you don't have to keep creating and destroying threads every time you need to assign work.
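As a sketch of that idea using the standard library (concurrent.futures is not part of the question, and compute_reputation is a hypothetical helper standing in for the per-gene scoring loop above):

from concurrent.futures import ThreadPoolExecutor

def process_genes(genelist):
    # compute_reputation is a hypothetical stand-in for the scoring loop above
    return [compute_reputation(gene) for gene in genelist]

# The pool owns the threads; the caller waits on the work itself
# (the futures), not on any particular thread.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(process_genes, chunk) for chunk in (a_genes, b_genes)]
    UserGeneReputation = [score for f in futures for score in f.result()]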
But everyone learns the create/join method. And sometimes it really is the best choice. Even when you use other methods, you likely still have an outer create/join to create the threads in the first place and, typically, ensure they cleanly finish to shut down your program in an orderly way.

Related

Asynchronous Communication between few 'loops'

I have 3 classes that represent nearly isolated processes that can be run concurrently (meant to be persistent, like 3 main() loops).
class DataProcess:
    ...
    def runOnce(self):
        ...

class ComputeProcess:
    ...
    def runOnce(self):
        ...

class OtherProcess:
    ...
    def runOnce(self):
        ...
Here's the pattern I'm trying to achieve:
start various streams
start each process
allow each process to publish to any stream
allow each process to listen to any stream (at various points in its loop) and behave accordingly (allowing for interruption of its current task or not, etc.)
For example, one 'process' listens for external data. Another process does computation on some of that data. The computation process might be busy for a while, so by the time it comes back to check the stream, many values may have piled up. I don't want to just use a queue, because I don't want to be forced to process each item in order; I'd rather be able to implement logic like, "if there are one or more things waiting, just run your process one more time; otherwise, go do this interruptible task while you wait for something to show up."
That's a lot, right? So I was thinking of using an actor model until I discovered RxPy. I saw that a stream is like a subject:
from reactivex.subject import BehaviorSubject

newData = BehaviorSubject(None)   # BehaviorSubject requires an initial value
newModel = BehaviorSubject(None)
then I thought I'd start 3 threads, one for each of my high-level processes:

threads = {
    'data': threading.Thread(target=data),
    'compute': threading.Thread(target=compute),
    'other': threading.Thread(target=other),
}

for thread in threads.values():
    thread.start()
and I thought the functions of those threads should listen to the streams:

def data():
    while True:
        DataProcess().runOnce()  # publishes to stream inside process

def compute():
    def run(_):
        ComputeProcess().runOnce()
    # subscribe takes the callback itself, not the result of calling it
    newData.subscribe(run)
    newModel.subscribe(run)

def other():
    ''' not done '''
    ComputeProcess().runOnce()
Ok, so that's what I have so far. Is this pattern going to give me what I'm looking for?
Should I use threading in conjunction with RxPy, or just use RxPy's scheduler facilities to achieve concurrency? If so, how?
I hope this question isn't too vague. I suppose I'm looking for the simplest framework in which I can have a small number of computational-memory units (objects, essentially, since they have internal state) that communicate with each other and work in parallel (or concurrently). At the highest level I want to be able to treat these computational-memory units (which I've called processes above) as individuals who mostly work on their own but occasionally broadcast or send a message to a specific other individual, requesting or providing information.
Am I perhaps actually looking for an actor-model framework? Or is this RxPy setup versatile enough to achieve that without extreme complexity?
Thanks so much!

Thread synchronized time read

I have multiple threads running an infinite while True loop, without them knowing of each other's existence.
Inside their respective loops I need them to check the time and do something based on it before the next iteration, something like this:
Thread:

while True:
    now = datetime.now()  # requires: from datetime import datetime
    # do something
    time.sleep(0.2)
These threads are started in my main program like this:

Main:

t1.start()
t2.start()
t3.start()
...
...
while True:
    # main program does something
On to the problem: I need all the running threads to receive the same time when they check for it.
I was thinking of creating a class with a lock on it and a variable to store the time; the first thread that acquires the lock saves the time in it, so the following threads can read it. But this seems like a kinda hacky way of doing things (plus I wouldn't know how to check when all the threads have read the time so it can be updated).
What would be the best way, if possible, to implement this?
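One minimal way to sketch the lock-guarded idea described above (SharedClock is a hypothetical name; the refresh is driven by the main loop, so no thread needs to track which of the others has already read):

import threading
from datetime import datetime

class SharedClock:
    def __init__(self):
        self._lock = threading.Lock()
        self._now = datetime.now()

    def tick(self):
        # Called only from the main loop, once per iteration.
        with self._lock:
            self._now = datetime.now()

    def read(self):
        # Called from any worker thread; every thread that reads
        # between two ticks sees the identical timestamp.
        with self._lock:
            return self._now

Each worker would then call clock.read() in place of datetime.now(), while the main while True loop calls clock.tick() at whatever cadence the shared time should advance.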

Threading in Python 3

I'm writing Python 3 code with two functions. The first, insertBlock(), inserts data into MongoDB collection 1; the second, insertTransactionData(), takes data from collection 1 and inserts it into collection 2. The amount of data is very large, so I use threading to improve performance. But with threading it takes more time to insert the data than without. I'm confused about how exactly threading works in my code and how to increase performance. Here is the main function:
if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock())
    t1.start()
    t2 = threading.Thread(target=insertTransactionData())
    t2.start()
From the python documentation for threading:
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
So the correct usage is
threading.Thread(target=insertBlock)
(without the () after insertBlock), because otherwise insertBlock is called immediately, executes normally (blocking the main thread), and target is set to its return value, None. This causes t1.start() to do nothing, so you get no performance improvement.
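Applied to the question's main block (the join() calls are an addition, so the program also waits for both inserts to finish):

if __name__ == '__main__':
    # Pass the functions themselves; don't call them here.
    t1 = threading.Thread(target=insertBlock)
    t2 = threading.Thread(target=insertTransactionData)
    t1.start()
    t2.start()
    t1.join()
    t2.join()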
Warning:
Be aware that multithreading gives you no guarantee about the order of execution across threads. You cannot rely, inside insertTransactionData, on the data that insertBlock has inserted into the database, because at the time insertTransactionData uses that data you cannot be sure it has already been inserted. So maybe multithreading does not work at all for this code, or you need to restructure it and parallelize only those parts that do not depend on each other.
I solved this problem by merging the two functionalities into one new function, insertBlockAndTransaction(startrange, endrange). Since the two depend on each other, I insert the transaction information immediately after the block information is inserted (the block number was common to, and needed by, both). Then I multithreaded by creating 10 threads for the single function:
for i in range(10):
    print('thread:', i)
    t1 = threading.Thread(target=insertBlockAndTransaction,
                          args=(5000000 + i * 10000, 5000000 + (i + 1) * 10000))
    t1.start()
This helped me deal with the growing execution time for more than 100,000 (1 lakh) records.
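One caveat with the loop above: it overwrites t1 on every iteration and never joins the threads, so the program can exit before the inserts finish. A variant that keeps and joins all ten threads (a sketch, same ranges as above):

threads = []
for i in range(10):
    t = threading.Thread(target=insertBlockAndTransaction,
                         args=(5000000 + i * 10000, 5000000 + (i + 1) * 10000))
    t.start()
    threads.append(t)

# Wait for every worker before moving on.
for t in threads:
    t.join()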

manage early return of event loop with python

I have a service running the following loop
while True:
    feedback = f1()
    if check1(feedback):
        break
    feedback = f2()
    if check2(feedback):
        break
    feedback = f3()
    if check3(feedback):
        break
    time.sleep(10)
do_cleanup(feedback)
Now I would like to run these feedback checks with different time intervals. One naive way is to move the time.sleep() into the f functions. But that causes blocking. What would be the easiest way to achieve periodic checks with different intervals? Here all the f functions are cheap to run.
The event loop in asyncio sounds like the way to go. But due to my inexperience, I don't know where the check and break logic should go for the event loop.
Or are there other packages/code patterns for this kind of monitoring logic?
In asyncio you might split the service into three separate tasks, each with its own loop and timing - you can think of them as three threads, except they are all scheduled in the same thread, and multi-task cooperatively by suspending at await.
For this purpose let's start with a utility function that calls a function and checks its result at a regular interval:
async def at_interval(f, check, seconds):
    while True:
        feedback = f()
        if check(feedback):
            return feedback
        await asyncio.sleep(seconds)
The return is the equivalent of the break in your original code.
With that in place, the service spawns three such loops and waits for any of them to finish. Whichever completes first carries the "feedback" we're waiting for, and we can dispose of the others.
async def service():
    loop = asyncio.get_event_loop()
    t1 = loop.create_task(at_interval(f1, check1, 3))
    t2 = loop.create_task(at_interval(f2, check2, 5))
    t3 = loop.create_task(at_interval(f3, check3, 7))
    done, pending = await asyncio.wait(
        [t1, t2, t3], return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    feedback = await list(done)[0]
    do_cleanup(feedback)

asyncio.get_event_loop().run_until_complete(service())
A small difference between this and your code is that here it is possible (though very unlikely) for more than one check to fire before the service picks up on it. For example, if through a stroke of bad luck two of the above tasks end up sharing the absolute wakeup time to the microsecond, they will be scheduled in the same event loop iteration. Both will return from their corresponding at_interval coroutines, and done will contain more than one feedback. The code handles this by picking one feedback and calling do_cleanup on it, but it could also loop over all of them.
If this is not acceptable, you can easily pass each at_interval a callable that cancels all tasks except itself. This is currently done in service for brevity, but it can be done in at_interval as well. One task cancelling the others would ensure that only one feedback can exist.
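A sketch of that variant (asyncio.current_task() requires Python 3.7+; on older versions use asyncio.Task.current_task()):

async def at_interval(f, check, seconds, cancel_others):
    while True:
        feedback = f()
        if check(feedback):
            cancel_others()  # no other task can complete after this
            return feedback
        await asyncio.sleep(seconds)

async def service():
    tasks = []

    def cancel_others():
        me = asyncio.current_task()
        for t in tasks:
            if t is not me:
                t.cancel()

    tasks += [
        asyncio.ensure_future(at_interval(f1, check1, 3, cancel_others)),
        asyncio.ensure_future(at_interval(f2, check2, 5, cancel_others)),
        asyncio.ensure_future(at_interval(f3, check3, 7, cancel_others)),
    ]
    done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in done:
        if not t.cancelled():
            do_cleanup(t.result())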

sync threads to read different resources at exactly the same time

I have two cameras, and it is important to read the frames with OpenCV at exactly the same time. I thought of something like a Lock, but I can't figure out how to implement it.
I need some trigger that pushes the threads to read frames, and then waits for the next trigger hit, something like below:
def get_frame(queue, cap):
    while running:
        if read_frame:
            queue.put(cap.read())
        else:
            # without this sleep this function just consumes unnecessary CPU time
            time.sleep(some_time)

q = Queue.Queue()
# for every camera
for u in xrange(2):
    t = threading.Thread(target=get_frame, args=(q, caps[u]))
    t.daemon = True
    t.start()
The problems with the above implementation are :
I have to hard-code the sleep time, since I don't know the delay between frame reads (it might be long or short, depending on the calculation).
It doesn't let me read exactly once per trigger hit.
So this approach won't work. Any suggestions?
Consider getting the FPS from VideoCapture. Also note the difference between VideoCapture.grab and VideoCapture.retrieve; this split exists precisely for camera synchronization.
First call VideoCapture.grab for both cameras, then retrieve the frames. See the docs.
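A short sketch of that grab-then-retrieve pattern (the device indices 0 and 1 are assumptions):

import cv2

caps = [cv2.VideoCapture(0), cv2.VideoCapture(1)]

# grab() only latches the current frame inside the driver, which is fast,
# so calling it back to back keeps the two cameras closely in sync.
for cap in caps:
    cap.grab()

# The more expensive decode happens afterwards in retrieve().
frames = [cap.retrieve()[1] for cap in caps]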
