I am new to Python threading and have done some research on it. I would like to implement the following functionality, written here as pseudocode:
while True:
    while (file size < 1 GB):
        sleep for 1 minute
    process(file)
    file = next file
I want the process() function to run in a daemon thread. The next time line 4 is reached (i.e. the next file has grown to 1 GB), a new thread should be created if the previous thread is still running. There should be a maximum of 3 such threads, used one after the other in rotation, so at any time at least 2 of them are free. Basically, whichever thread is free should be given the job once the code reaches line 4.
Are daemon threads and a thread queue the things I need to look at, or is there some other way to solve this?
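A thread queue is one way to get that behaviour. Here is a rough sketch, not a drop-in solution: process() and next_file() are placeholders standing in for whatever the pseudocode above does, and the 1 GB check is done with os.path.getsize.
import os
import queue
import threading
import time

GIGABYTE = 1024 ** 3
work_queue = queue.Queue()

def worker():
    # each worker blocks on the queue until a file is ready, then processes it
    while True:
        path = work_queue.get()
        process(path)              # placeholder from the pseudocode above
        work_queue.task_done()

# three daemon workers; they are killed automatically when the main thread exits
for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

file = next_file()                 # placeholder: however the next file is chosen
while True:
    while os.path.getsize(file) < GIGABYTE:
        time.sleep(60)
    work_queue.put(file)           # whichever worker is free picks this up
    file = next_file()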
Related
When using the multi-threaded approach to solve I/O-bound problems in Python, this works by releasing the GIL. Suppose we have Thread1, which takes 10 seconds to read a file; during those 10 seconds it does not require the GIL and can let Thread2 execute code. Thread1 and Thread2 are effectively running in parallel, because Thread1 is doing system-call operations and can proceed independently of Thread2, even though Thread1 is still "executing" code.
Now suppose we have a setup using asyncio or any other asynchronous programming approach, and we do something such as:
file_content = await ten_second_long_file_read()
While the await is in progress, system calls are made to read the file's content, and when they are done an event is sent back so that code execution can continue later. While we are awaiting, other code can be run.
My confusion comes from the fact that asynchronous programming is primarily single-threaded. With the multi-threaded approach, when T1 is reading from a file it is still executing code; it simply released the GIL so another thread could work in parallel. However, with asynchronous programming, when we are awaiting, how can it perform other tasks while also waiting and reading data in a single thread? I understand the multi-threaded idea, but not the asynchronous one, because it still performs the system calls in a single thread. With asynchronous programming there is nowhere to release the GIL to, considering there is only one thread. Is asyncio secretly using threads?
The number of file handles is independent of the GIL and of threads. The POSIX select documentation gives a bit of an idea of the distinct mechanism around file handles.
To illustrate I created three files, 1.txt etc. These are just:
1
one
Obviously opening for reading is OK, but not for writing. To simulate a ten-second read I just held the file handle open for ten seconds: reading the first line, waiting 10 seconds, then reading the second line.
asyncio version
import asyncio
from threading import active_count

do = ['1.txt', '2.txt', '3.txt']

async def ten_second_long_file_read():
    while do:
        doing = do.pop()
        with open(doing, 'r') as f:
            print(f.readline().strip())
            await asyncio.sleep(10)
            print(f"threads {active_count()}")
            print(f.readline().strip())

async def main():
    await asyncio.gather(asyncio.create_task(ten_second_long_file_read()),
                         asyncio.create_task(ten_second_long_file_read()))

asyncio.run(main())
This produces a very predictable output and as expected, one thread only.
3
2
threads 1
three
1
threads 1
two
threads 1
one
threading - changes
Remove async of course. Swap asyncio.sleep(10) for time.sleep(10). The main change is the calling function.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as e:
    e.submit(ten_second_long_file_read)
    e.submit(ten_second_long_file_read)
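For reference, the reader function after those changes would look something like this (a sketch based on the description above, with the imports the threaded version needs):
import concurrent.futures
import time
from threading import active_count

do = ['1.txt', '2.txt', '3.txt']

def ten_second_long_file_read():
    while do:
        doing = do.pop()
        with open(doing, 'r') as f:
            print(f.readline().strip())
            time.sleep(10)                    # blocking sleep instead of await
            print(f"threads {active_count()}")
            print(f.readline().strip())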
Also a fairly predictable output, however you cannot rely on this.
3
2
threads 3
three
threads 3
two
1
threads 2
one
Running the same threaded version under a debugger, the output is a bit random; on one run on my computer it was:
23
threads 3threads 3
twothree
1
threads 2
one
This highlights a difference with threads: the running thread is pre-emptively switched, creating a whole bundle of complexity under the heading of thread safety. This issue does not exist in asyncio, as there is a single thread.
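For example, with two threads sharing the do list above, both could pass the while do: check when only one file is left and then race to pop it. The usual remedy is a lock around the shared state; a sketch, not part of the original code:
import threading

do = ['1.txt', '2.txt', '3.txt']
do_lock = threading.Lock()

def next_item():
    # check-and-pop under one lock so two threads cannot race for the last item
    with do_lock:
        return do.pop() if do else None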
multi-processing
Similar to the threaded code; however, the __name__ == '__main__' guard is required, and the process pool executor gives each process a snapshot of the context.
def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as e:
        e.submit(ten_second_long_file_read)
        e.submit(ten_second_long_file_read)

if __name__ == '__main__':  # required for the executor
    main()
Two big differences. There is no shared understanding of the do list, so everything is done twice; processes don't know what the other process has done. More CPU power is available, but more work is required to manage the load. Three processes are required for this, so the overhead is large; however, each process only has one thread.
3
3
threads 1
threads 1
three
three
2
2
threads 1
threads 1
two
two
1
1
threads 1
threads 1
one
one
import queue
import threading
import time

q = queue.Queue()
for i in [3, 2, 1]:
    def f():
        time.sleep(i)
        print(i)
        q.put(i)
    threading.Thread(target=f).start()
print(q.get())
For this piece of code, it returns 1. The reason for this is that the queue is FIFO and "1" is put in first, as that thread slept for the least time.
Extended question:
If I go on to run q.get() twice more, it still outputs the same value "1" rather than "2" and "3". Can anyone tell me why that is? Does it have anything to do with threading?
Another extended question:
When the code finishes running completely, but there are still threads that haven't finished, will they get shut down immediately as the whole program finishes?
q.get()
#this gives me 1, but I suppose it should give me 2
q.get()
#this gives me 1, but I suppose it should give me 3
Update:
This is Python 3 code.
Assuming that the language is Python 3.
The second and third calls to q.get() return 1 because each of the three threads puts a 1 into the queue. There is never a 2 or a 3 in the queue.
I don't fully understand what to expect in this case (I'm not a Python expert), but the function f does not appear to capture the value of the loop variable i. The i in the function f appears to be the same variable as the i in the loop, and the loop leaves i == 1 before any of the three threads wakes up from sleeping. So, in all three threads, i == 1 by the time q.put(i) is called.
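If the intent was for each thread to put its own value, the usual fix is to bind the current value of i when the function is defined, for example with a default argument; a sketch:
import queue
import threading
import time

q = queue.Queue()
for i in [3, 2, 1]:
    def f(i=i):  # the default argument captures the current value of i
        time.sleep(i)
        print(i)
        q.put(i)
    threading.Thread(target=f).start()
print(q.get())  # 1: the thread that slept only 1 second finishes and puts first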
When the code finishes running completely, but there are still threads that haven't finished, will they get shut down immediately?
No. The process won't exit until all of its threads (including the main thread) have terminated. If you want to create a thread that will be automatically, forcibly, abruptly terminated when all of the "normal" threads are finished, then you can make that thread a daemon thread.
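A minimal sketch of the difference (the only change from a normal thread is daemon=True):
import threading
import time

def background_work():
    while True:
        time.sleep(1)  # a thread that would never finish on its own

# with daemon=True the interpreter kills this thread when the main thread ends;
# without it, the process would hang at exit waiting for the thread to finish
threading.Thread(target=background_work, daemon=True).start()
print("main thread done")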
See https://docs.python.org/3/library/threading.html, and search for "daemon".
import threading

urls = ['', '', '', '', ...]
for url in urls:
    threading.Thread(target=downloadSaveData, args=(url,)).start()
How do I limit the maximum number of threads? Say maxThread = 4. After starting the first 4 threads, I don't want to wait until all 4 have completed; rather, I want to keep adding one thread whenever the total number of existing threads is less than 4, i.e. when one thread completes, the next one starts.
What it sounds like you want is a Semaphore which guards access to a resource. Your code will probably look something like the following.
import threading

urls = ['', '', '', '', ...]
semaphore = threading.Semaphore(4)
for url in urls:
    threading.Thread(target=downloadSaveData, args=(url, semaphore)).start()
You just need to modify your downloadSaveData to take in a Semaphore object and do something along these lines.
def downloadSaveData(url, semaphore):
    with semaphore:
        ...  # perform the download-and-save logic here
Overall I would suggest changing your code to use a ThreadPoolExecutor which handles a lot of this stuff under the hood for you.
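For example, something along these lines (a sketch; urls and downloadSaveData are the names from the question):
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # at most 4 downloads run at once; a new one starts as soon as a worker is free
    executor.map(downloadSaveData, urls)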
I got the source code from http://www.saltycrane.com/blog/2008/09/simplistic-python-thread-example/, but when I tried to modify the code to my needs the results were not what I wanted.
import time
from threading import Thread

def myfunc():
    time.sleep(2)
    print("thread working on something")

while 1:
    thread = Thread(target=myfunc())
    thread.start()
    print("looping")
and got the results of
thread working on something
looping
// wait 2 seconds
thread working on something
looping
// wait 2 seconds
thread working on something
looping
// wait 2 seconds and so on
thread working on something
looping
// wait 2 seconds
But then I have to wait 2 seconds before I can do anything.
I want to be able to do other things while the thread does something else, like checking items in an array and comparing them.
In the main loop you are initialising and starting a new thread on every pass, an endless number of times; had myfunc actually been running in those threads, you would soon have an impractical number of them. What actually happens, and the reason the main loop pauses for 2 seconds on every pass, is that Thread(target=myfunc()) calls myfunc immediately in the main thread and passes its return value (None) as the target, so each thread does nothing and ends at once. Even with target=myfunc (no parentheses), the thread function executes and ends in one pass, i.e. you do not have a loop in the thread function to keep the thread alive and working.
Suggestion.
Add a loop to your threading function (myfunc) that will continue to run indefinitely in the background.
Initialise and start the thread outside of the loop in your main section. In this way you will create only 1 thread that will run its own loop in the background. You could of course run a number of these same threads in the background if you started it more than once.
Now create a loop in your main body, and continue with your array checking or any other task that you want to run whilst the threading function continues to run in the background.
Something like this may help
import time
from threading import Thread

def myfunc():
    counter = 0
    while True:
        print("The thread counter is at", counter)
        counter += 1
        time.sleep(2)

thread = Thread(target=myfunc)
thread.start()
# The thread has now initialised and is running in the background

mCounter = 0
while True:
    print("Main loop counter =", mCounter)
    mCounter += 1
    time.sleep(5)
In this example, the thread will print a line every 2 seconds, and the main loop will print a line every 5 seconds.
Be careful to close your thread down. In some cases, a keyboard interrupt will stop the main loop, but the thread will keep on running.
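One way to shut the thread down cleanly is a stop flag; a sketch using threading.Event rather than a daemon thread:
import threading
import time

stop_flag = threading.Event()

def myfunc():
    counter = 0
    while not stop_flag.is_set():
        print("The thread counter is at", counter)
        counter += 1
        time.sleep(2)

thread = threading.Thread(target=myfunc)
thread.start()

mCounter = 0
try:
    while True:
        print("Main loop counter =", mCounter)
        mCounter += 1
        time.sleep(5)
except KeyboardInterrupt:
    stop_flag.set()   # ask the thread to finish its current pass and exit
    thread.join()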
I hope this helps.
I'm looking into the Reusable Barrier algorithm from the book "The Little Book Of Semaphores" (archived here).
The puzzle is on page 31 (Basic Synchronization Patterns/Reusable Barrier), and I have come up with a 'solution' (or not) which differs from the solution from the book (a two-phase barrier).
This is my 'code' for each thread:
# n = 4 threads running
# semaphore: maximum n, initialized to 0
# mutex: unowned

start:
    mutex.wait()
    counter = counter + 1
    if counter = n:
        semaphore.signal(4)   # add 4 at once
        counter = 0
    mutex.release()

    semaphore.wait()
    # critical section
    semaphore.release()
    goto start
This does seem to work: I've even inserted different sleep timers into different sections of the threads, and they still wait for all the threads to arrive before continuing each and every loop. Am I missing something? Is there a condition under which this will fail?
I've implemented this using the Windows library Semaphore and Mutex functions.
Update:
Thank you to starblue for the answer. It turns out that if, for whatever reason, a thread is slow between mutex.release() and semaphore.wait(), any thread that arrives at semaphore.wait() again after a full loop will be able to go through once more, since one of the N signals will still be unused.
And having put a Sleep call in thread number 3, I got a result in which one can see that thread 3 missed a turn the first time (with thread 1 having done 2 turns) and then caught up on the second turn (which was in fact its 1st turn).
Thanks again to everyone for the input.
One thread could run several times through the barrier while some other thread doesn't run at all.
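For comparison, here is a sketch of the book's two-phase barrier in Python, using threading primitives rather than the Windows API; the second turnstile is what stops a fast thread from lapping the others and reusing a leftover signal.
import threading

n = 4
count = 0
mutex = threading.Lock()
turnstile = threading.Semaphore(0)
turnstile2 = threading.Semaphore(0)

def barrier():
    global count
    # phase 1: block until all n threads have arrived
    with mutex:
        count += 1
        if count == n:
            for _ in range(n):
                turnstile.release()
    turnstile.acquire()

    # phase 2: block until all n threads have left phase 1,
    # so no thread can race ahead and consume a signal meant for this round
    with mutex:
        count -= 1
        if count == 0:
            for _ in range(n):
                turnstile2.release()
    turnstile2.acquire()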