Why can't I join the processes after the work is done? - python-3.x

I have a Queue, a Manager and a Lock inside a class. The class has a run function that starts 6 Processes, and they all wait for an exit_flag to become true before ending their jobs. However, I cannot "end" the jobs because when I call the join method on each job, it blocks. The code looks as follows:
from multiprocessing import Process, Lock, Queue, Manager
from time import sleep

class MyClass():
    def __init__(self):
        self.q = Queue(maxsize=50)
        self.lock = Lock()
        manager = Manager()
        self.manager = manager.dict()

    def fill_queue(self, idx):
        while not self.exit():
            # do something
            result, result_type = self.perform_extraction()
            if result_type not in self.manager:
                self.manager[result_type] = []
            while self.q.full() and not self.exit():
                sleep(10)
            if self.exit():
                print('Exit filler')
                break
            self.lock.acquire()
            self.q.put((result, result_type))
            self.lock.release()
        else:
            print(f'queue filler {idx} ended')

    def empty_queue(self, idx):
        while not self.exit():
            if self.q.empty():
                continue
            self.lock.acquire()
            result, result_type = self.q.get()
            self.lock.release()
            result, id = self.perform_test(result)
            if result >= 0 and result not in self.manager[result_type]:
                self.manager[id] += [(result, result_type)]
                self.insert_to_database(result, result_type)  # inserts the value into a sqlite3 DB
        else:
            print(f'worker {idx} ended')

    def run(self, n_workers):
        jobs = []
        for idx in range(2):
            p = Process(target=self.fill_queue, args=(idx,))
            jobs.append(p)
        for idx in range(n_workers):
            p = Process(target=self.empty_queue, args=(idx,))
            jobs.append(p)
        for job in jobs:
            job.start()
        for idx, job in enumerate(jobs):
            print(f'joining job {idx}')
            job.join()
            if not job.is_alive():
                print(f'closing job {idx}')
                job.close()
            else:
                print(f'job {idx} still alive')

if __name__ == '__main__':
    mc = MyClass()
    mc.run(n_workers=4)
    print('RUN ENDED!')
The manager is used to communicate between processes; when the criterion is met, i.e. I have X number of elements in the manager, the self.exit() function returns True.
When I run this code, it gets stuck printing joining job 0 and stays there forever; I don't know why. If I add a timeout and call job.join(5) (5 is arbitrary, no real reason), it prints:
joining job 0
job 0 still alive
joining job 1
job 1 still alive
joining job 2
closing job 2
joining job 3
closing job 3
joining job 4
closing job 4
joining job 5
closing job 5
RUN ENDED!
And the code does not finish. I also tried calling job.terminate() if the job is still alive, and this threw an error telling me that some leaked folders could not be found. Does this mean that I have some zombie processes?
Why is this happening? What am I doing wrong?
EDIT: Added some logic on the interaction with the manager. I'm using the manager to add a couple of types of results and append all results of the same type to a list, so the dict structure is something like {result_type: [result_values]}, and the reason for using a manager is to avoid storing duplicate results and to check when the algorithm meets the exit criteria.
Now, the exit criteria function looks like this:
def exit(self):
    for v in self.manager.values():
        if len(v) < 10:
            return False
    return True
So, all processes end when I have 10 items in the list for each type. There are 3 possible types, so once each type has been added as a manager key it only needs to be filled, and when all 3 have at least 10 values (there could be more), all the jobs should end.
EDIT 2: Added some print statements to the fill_queue and empty_queue functions. What is printed is:
worker 0 ended
worker 1 ended
queue filler 0 ended
worker 2 ended
queue filler 1 ended
joining job 0 --> *
worker 3 ended
so this usually (pretty much always) prints before all workers print their "ended" statement, but it never joins the first job. It is actually stuck at the first call to join in the for idx, job in enumerate(jobs): loop.
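Note: one documented pitfall that matches this symptom (a process whose function has returned but which never joins) is that a process which has put items on a multiprocessing.Queue will not terminate until its feeder thread has flushed all buffered items to the underlying pipe, so join() can block if the queue is never drained. Below is a minimal, self-contained sketch of that pattern, with illustrative names only (it is not the code above): the parent drains the queue before joining the producer.

# Minimal sketch of the "joining processes that use queues" pattern from
# the multiprocessing docs. Names here are illustrative, not the asker's code.
from multiprocessing import Process, Queue

def producer(q):
    for i in range(1000):
        q.put(i)  # items sit in the feeder thread's buffer until consumed

if __name__ == '__main__':
    q = Queue()
    p = Process(target=producer, args=(q,))
    p.start()
    # Draining the queue lets the child's feeder thread flush and the child
    # exit; joining first, with items still buffered, can block forever.
    results = [q.get() for _ in range(1000)]
    p.join()
    print(len(results), 'items received, producer joined cleanly')

In the question's code this would mean making sure self.q has been drained (or, as the docs mention, calling cancel_join_thread() on the queue in the filler processes, at the risk of losing buffered data) before joining the fillers.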

Related

Best way to keep creating threads on variable list argument

I have an event that I am listening to every minute that returns a list; it could be empty, have 1 element, or more. With the elements in that list, I'd like to run a function that monitors an event on each element every minute for 10 minutes.
For that I wrote this script:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio
import Client

client = Client()

def handle_event(event):
    for i in range(10):
        client.get_info(event)
        sleep(60)

async def main():
    while True:
        entires = client.get_new_entry()
        if len(entires) > 0:
            with ThreadPoolExecutor(max_workers=len(entires)) as executor:
                executor.map(handle_event, entires)
        await asyncio.sleep(60)

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
However, instead of continuing to monitor for new entries, it blocks while the previous entries are still being monitored.
Any idea how I could do that please?
First let me explain why your program doesn't work the way you want it to: It's because you use the ThreadPoolExecutor as a context manager, which will not close until all the threads started by the call to map are finished. So main() waits there, and the next iteration of the loop can't happen until all the work is finished.
There are ways around this. Since you are using asyncio already, one approach is to move the creation of the Executor to a separate task. Each iteration of the main loop starts one copy of this task, which runs as long as it takes to finish. It's an async def function, so many copies of this task can run concurrently.
I changed a few things in your code. Instead of Client I just used some simple print statements. I pass a list of integers, of random length, to handle_event. I increment a counter each time through the while True: loop, and add 10 times the counter to every integer in the list. This makes it easy to see how old calls continue for a time, mixing with new calls. I also shortened your time delays. All of these changes were for convenience and are not important.
The important change is to move ThreadPoolExecutor creation into a task. To make it cooperate with other tasks, it must contain an await expression, and for that reason I use executor.submit rather than executor.map. submit returns a concurrent.futures.Future, which provides a convenient way to await the completion of all the calls. executor.map, on the other hand, returns an iterator; I couldn't think of any good way to convert it to an awaitable object.
To convert a concurrent.futures.Future to an asyncio.Future, an awaitable, there is a function asyncio.wrap_future. When all the futures are complete, I exit from the ThreadPoolExecutor context manager. That will be very fast since all of the Executor's work is finished, so it does not block other tasks.
import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio

def handle_event(event):
    for i in range(10):
        print("Still here", event)
        sleep(2)

async def process_entires(counter, entires):
    print("Counter", counter, "Entires", entires)
    x = [counter * 10 + a for a in entires]
    with ThreadPoolExecutor(max_workers=len(entires)) as executor:
        futs = []
        for z in x:
            futs.append(executor.submit(handle_event, z))
        await asyncio.gather(*(asyncio.wrap_future(f) for f in futs))

async def main():
    counter = 0
    while True:
        entires = [0, 1, 2, 3, 4][:random.randrange(5)]
        if len(entires) > 0:
            counter += 1
            asyncio.create_task(process_entires(counter, entires))
        await asyncio.sleep(3)

if __name__ == "__main__":
    asyncio.run(main())

How to apply multi-threading for getting working URLs from a list of 1000 URLs - Python

Normally, checking the status codes of 1000 URLs takes 9 hr 30 min.
How can I apply multi-threading to these URLs? My output should be the working URLs, i.e. those with a status code of 200.
For example, out of 100 URLs we have 70 with a 200 code and the remaining with 404 or anything else.
Input = ['https://xxxxxx1','https://xxxxxx2',..........,'https://xxxxxx100']
Output: ['https://xxxxxx1','https://xxxxxx2','https://xxxxxx3',..........,'https://xxxxxx70'] (these will have a 200 status code)
Just a suggestion of how threads work in Python: you can split your URL list into two halves and then make two functions, which run on two separate threads.
import threading

Output = []
List1 = [half of your urls]
List2 = [other half of your urls]

def check_status(lst):
    """
    Do your task
    """

def check_status_2(lst):
    """
    Do your task
    """

if __name__ == "__main__":
    # creating threads
    t1 = threading.Thread(target=check_status, args=(List1,))
    t2 = threading.Thread(target=check_status_2, args=(List2,))
    # starting thread 1
    t1.start()
    # starting thread 2
    t2.start()
    # wait until thread 1 is completely executed
    t1.join()
    # wait until thread 2 is completely executed
    t2.join()
    # both threads completely executed
    print("Completed")
Once the threads start, your program also keeps on executing. In order to pause execution of your program until a thread has completed, use the join method. Inside each function, append the URLs that return a 200 status code to Output.
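A rough sketch of that idea, using a thread pool from concurrent.futures instead of two hand-rolled threads. The URL list is a placeholder and the third-party requests library is assumed to be installed:

# Sketch: check many URLs concurrently and keep the ones returning 200.
from concurrent.futures import ThreadPoolExecutor
import requests

urls = ['https://example.com', 'https://example.org']  # placeholder for your 1000 URLs

def check_status(url):
    try:
        return url, requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return url, None  # unreachable URLs are treated as non-200

with ThreadPoolExecutor(max_workers=50) as executor:
    results = list(executor.map(check_status, urls))

output = [url for url, status in results if status == 200]
print(output)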

How can I get the return value from thread.run?

I'm trying to "save" a return value from a function (that returns an integer) but I'm getting a None object.
import threading

class SitesThread(threading.Thread):
    def __init__(self, func, searchLine):
        threading.Thread.__init__(self)
        self.func = func
        self.searchLine = searchLine

    def run(self):
        self.func(self.searchLine)

def print1(searchLine):
    print(searchLine, "this is print 1")
    return 1

def print2(searchLine):
    print(searchLine, "this is print 2")
    return 2

def main():
    threads = []
    line = input("Please insert a search line")
    t1 = SitesThread(print1, line)
    t2 = SitesThread(print2, line)
    res1 = t1.start()
    res2 = t2.start()
    threads.append(t1)
    threads.append(t2)
    for t in threads:
        t.join()
    print("thread 1 is alive?", t1.isAlive())
    print(res1)
    print("thread 2 is alive?", t2.isAlive())
    print(res2)

if __name__ == "__main__":
    main()
I'm expecting to get:
'searchLine' this is print 1
'searchLine' this is print 2
thread 1 is alive? False
1
thread 2 is alive? False
2
but I get:
'searchLine' this is print 1
'searchLine' this is print 2
thread 1 is alive? False
None
thread 2 is alive? False
None
I'm unsure how you will get it to return and be placed in the res1 or res2 variable. However, you can still print out the searchLine portion of your thread. Take a look at the code below:
print("thread 1 is alive?", t1.isAlive())
print(t1.searchLine)
print("thread 2 is alive?", t2.isAlive())
print(t2.searchLine)
This will print what you are searching for... so if you searched for 12 it would print 12.
Hope this helps. I'll keep poking around with it and see if I can get something that matches your expected output exactly.
It seems, so far, that I can't get a return value from a thread that is running a function that returns a value.
So if someone has the same problem I had, one optional solution is to change the function: instead of returning a value, it can put the value in a global list (make sure you know exactly where in the list you are saving the value so your other threads won't overwrite it).
Another solution is to use processes instead of threads, but since I'm trying to save time and my functions are all about API requests, I've learned that threads are faster than processes in this case.
Hope it will be useful for someone.
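A minimal sketch of a closely related pattern: storing the return value as an attribute on the Thread subclass inside run() and reading it after join(). The result attribute name is arbitrary, not part of the threading API:

import threading

class SitesThread(threading.Thread):
    def __init__(self, func, searchLine):
        super().__init__()
        self.func = func
        self.searchLine = searchLine
        self.result = None  # filled in by run()

    def run(self):
        # keep the return value instead of discarding it
        self.result = self.func(self.searchLine)

def print1(searchLine):
    print(searchLine, "this is print 1")
    return 1

t = SitesThread(print1, "query")
t.start()
t.join()
print(t.result)  # -> 1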

Why is this queue.join call blocking indefinitely?

I'm playing about with a personal project in python3.6 and I've run into the following issue which results in the my_queue.join() call blocking indefinitely. Note this isn't my actual code but a minimal example demonstrating the issue.
import threading
import queue

def foo(stop_event, my_queue):
    while not stop_event.is_set():
        try:
            item = my_queue.get(timeout=0.1)
            print(item)  # actual logic goes here
        except queue.Empty:
            pass
    print('DONE')

stop_event = threading.Event()
my_queue = queue.Queue()
thread = threading.Thread(target=foo, args=(stop_event, my_queue))
thread.start()

my_queue.put(1)
my_queue.put(2)
my_queue.put(3)
print('ALL PUT')
my_queue.join()
print('ALL PROCESSED')
stop_event.set()
print('ALL COMPLETE')
I get the following output (it's actually been consistent, but I understand that the output order may differ due to threading):
ALL PUT
1
2
3
No matter how long I wait I never see ALL PROCESSED output to the console, so why is my_queue.join() blocking indefinitely when all the items have been processed?
From the docs:
The count of unfinished tasks goes up whenever an item is added to the queue. The count goes down whenever a consumer thread calls task_done() to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, join() unblocks.
You're never calling q.task_done() inside your foo function. The foo function should be something like the example:
def worker():
    while True:
        item = q.get()
        if item is None:
            break
        do_work(item)
        q.task_done()
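A minimal, runnable sketch of the script from the question with the missing task_done() call added right after each item is handled (the final thread.join() is optional tidying):

import threading
import queue

def foo(stop_event, my_queue):
    while not stop_event.is_set():
        try:
            item = my_queue.get(timeout=0.1)
            print(item)            # actual logic goes here
            my_queue.task_done()   # lets join() count this item as finished
        except queue.Empty:
            pass
    print('DONE')

stop_event = threading.Event()
my_queue = queue.Queue()
thread = threading.Thread(target=foo, args=(stop_event, my_queue))
thread.start()

my_queue.put(1)
my_queue.put(2)
my_queue.put(3)
print('ALL PUT')
my_queue.join()       # unblocks once the three task_done() calls have happened
print('ALL PROCESSED')
stop_event.set()
thread.join()
print('ALL COMPLETE')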

Thread objects not freed from memory

I wrote a continuous script that collects some data from the internet every few seconds, keeps it in memory for a while, periodically stores it all to a DB, and then deletes it. To keep everything running smoothly I use threads to collect the data from several sources at the same time. To minimize DB operations and avoid conflicts with other DB processes, I only write every now and then.
The memory from the deleted variables is never returned, and usage eventually grows so large that the script crashes (shown by tracemalloc and pympler). I guess I'm handling the data coming out of the threads wrong, but I don't know how I could do it differently. Minimal example below.
Addition: I don't think I can use a queue because in reality multiple functions are threaded from this point, modifying different local variables.
import threading
import time
import tracemalloc
import pympler.muppy, pympler.summary
import gc

tracemalloc.start()

def a():
    # collect data
    collection.update({int(time.time()): list(range(1, 1000))})
    return

collection = {}
threads = []
start = time.time()
cycle = 0

while time.time() < start + 60:
    cycle += 1
    t = threading.Thread(target=a)
    threads.append(t)
    t.start()
    time.sleep(1)

    for t in threads:
        if t.is_alive() == False:
            t.join()

    # periodically delete data
    delete = []
    for key, val in collection.items():
        if key < time.time() - 10:
            delete.append(key)
    for delet in delete:
        print('DELETING:', delet)
        del collection[delet]

    gc.collect()
    print('CYCLE:', cycle, 'THREADS:', threading.active_count(), 'COLLECTION:', len(collection))
    print(tracemalloc.get_traced_memory())
    all_objects = pympler.muppy.get_objects()
    sum1 = pympler.summary.summarize(all_objects)
    pympler.summary.print_(sum1)
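Note: one thing visible in the minimal example itself is that the threads list only ever grows; finished threads are joined but never removed from the list, so every Thread object created over the script's lifetime stays reachable. A hedged sketch of pruning them inside the while loop (this only addresses the references held by that list and may not account for all of the reported growth):

# Replace the join loop inside the while loop with something like this,
# so finished Thread objects become unreachable and can be collected.
still_running = []
for t in threads:
    if t.is_alive():
        still_running.append(t)
    else:
        t.join()
threads = still_running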
