APScheduler Threading design pattern - multithreading

I'm currently writing an application that schedules some metrics with APScheduler (3.6.0) and Python 3.6.
After 5-6 days of running, the scheduler stops without any error, so I thought it could be a resource / concurrency problem.
I have different schedules, like every 15 minutes, every 30 minutes, every hour, etc.
My scheduler is initialized as follows, and I am only using threads right now:
executors = {
    'default': ThreadPoolExecutor(60),
    'processpool': ProcessPoolExecutor(5)
}
scheduler = BackgroundScheduler(timezone="Europe/Paris", executors=executors)
My question is the following: when multiple jobs start and pass through the same piece of code, can this be a potential design mistake? Do I need to add a lock in that case?
For example, they use the same piece of code (same module) to connect to a database and retrieve a result:
def executedb(data):
    cursor = None
    con = None
    try:
        dsn = cx.makedsn(data[FORMAT_CFG_DB_HOST], data[FORMAT_CFG_DB_PORT],
                         service_name=data[FORMAT_CFG_DB_SERVICE_NAME])
        con = cx.connect(user=data[FORMAT_CFG_DB_USER], password=data[FORMAT_CFG_DB_PASSWORD], dsn=dsn)
        cursor = con.cursor()
        sql = data[FORMAT_METRIC_SQL]
        cursor = cursor.execute(sql)
        rows = rows_to_dict_list(cursor)
        if len(rows) > 1:
            raise Exception("")
    except Exception as exc:
        raise Exception('')
    finally:
        try:
            cursor.close()
            con.close()
        except Exception as exc:
            raise Exception('')
    return rows
Can this lead to threading concurrency issues? And what is the best design pattern for doing this?

If you have multiple threads in your process, then when you connect to the database you should ensure that threaded mode is enabled, like this:
con = cx.connect(user=data[FORMAT_CFG_DB_USER], password=data[FORMAT_CFG_DB_PASSWORD], dsn=dsn, threaded=True)
Without that you run the very real risk of corrupting memory or causing other issues. Perhaps this is the source of your problem?
Another option is to use the new python-oracledb module, which always has threaded mode enabled.
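For illustration, here is a minimal sketch of an equivalent helper using python-oracledb; it assumes the same config keys and the rows_to_dict_list helper from your code, and is a sketch rather than a drop-in replacement:
import oracledb

def executedb(data):
    # python-oracledb always runs in threaded mode, so no extra flag is needed
    dsn = "{}:{}/{}".format(data[FORMAT_CFG_DB_HOST], data[FORMAT_CFG_DB_PORT],
                            data[FORMAT_CFG_DB_SERVICE_NAME])
    with oracledb.connect(user=data[FORMAT_CFG_DB_USER],
                          password=data[FORMAT_CFG_DB_PASSWORD],
                          dsn=dsn) as con:
        with con.cursor() as cursor:
            cursor.execute(data[FORMAT_METRIC_SQL])
            rows = rows_to_dict_list(cursor)
            if len(rows) > 1:
                raise Exception("query returned more than one row")
            return rows
The with blocks close the cursor and connection automatically, so the explicit finally cleanup from the original is no longer needed.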

Related

Performance difference between multithread using queue and futures.ThreadPoolExecutor using list in python3?

I was trying various approaches with Python multi-threading to see which one fits my requirements. To give an overview, I have a bunch of items that I need to send to an API. Based on the response, some of the items will go to a database and all of the items will be logged; e.g., if the API returns success for an item, that item will only be logged, but when it returns failure, that item will be sent to the database for a future retry, along with logging.
Based on the API response I can separate the successful items from the failures and make a batch query with all failed items, which will improve my database performance. To do that, I am accumulating all requests in one place and trying to perform multithreaded API calls (since this is an IO-bound task, I'm not even considering multiprocessing), but at the same time I need to keep track of which response belongs to which request.
Coming to the actual question, I tried two different approaches which I thought would give nearly identical performance, but there turned out to be a huge difference.
To simulate the API call, I created an API on my localhost with a 500 ms sleep (for average processing time). Please note that I want to start logging and inserting into the database only after all API calls are complete.
Approach 1 (using threading.Thread and queue.Queue)
import requests
import datetime
import threading
import queue

def target(data_q):
    while not data_q.empty():
        data_q.get()
        response = requests.get("https://postman-echo.com/get?foo1=bar1&foo2=bar2")
        print(response.status_code)
        data_q.task_done()

if __name__ == "__main__":
    data_q = queue.Queue()
    for i in range(0, 20):
        data_q.put(i)
    start = datetime.datetime.now()
    num_thread = 5
    for _ in range(num_thread):
        worker = threading.Thread(target=target(data_q))
        worker.start()
    data_q.join()
    print('Time taken multi-threading: ' + str(datetime.datetime.now() - start))
I tried with 5, 10, 20 and 30 items, and the corresponding results are below:
Time taken multi-threading: 0:00:06.625710
Time taken multi-threading: 0:00:13.326969
Time taken multi-threading: 0:00:26.435534
Time taken multi-threading: 0:00:40.737406
What shocked me here is that I tried the same thing without multi-threading and got almost the same performance.
Then, after some googling around, I was introduced to the futures module.
Approach 2 (using concurrent.futures)
import datetime
import traceback
import requests
from concurrent import futures

def fetch_url(im_url):
    try:
        response = requests.get(im_url)
        return response.status_code
    except Exception as e:
        traceback.print_exc()

if __name__ == "__main__":
    data = []
    for i in range(0, 20):
        data.append(i)
    start = datetime.datetime.now()
    urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2" + str(item) for item in data]
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        responses = executor.map(fetch_url, urls)
        for ret in responses:
            print(ret)
    print('Time taken future concurrent: ' + str(datetime.datetime.now() - start))
Again with 5, 10, 20 and 30 items, the corresponding results are below:
Time taken future concurrent: 0:00:01.276891
Time taken future concurrent: 0:00:02.635949
Time taken future concurrent: 0:00:05.073299
Time taken future concurrent: 0:00:07.296873
Now I've heard about asyncio, but I've not used it yet. I've also read that it gives even better performance than futures.ThreadPoolExecutor().
Final question: if both approaches are using threads (or so I think), then why is there such a huge performance gap? Am I doing something terribly wrong? I looked around but was not able to find a satisfying answer. Any thoughts on this would be highly appreciated. Thanks for going through the question.
[Edit 1] The whole thing is running on Python 3.8.
[Edit 2] Updated code examples and execution times. Now they should run on anyone's system.
The documentation of ThreadPoolExecutor explains in detail how many threads are started when the max_workers parameter is not given, as in your original example. The behaviour differs depending on the exact Python version, but the number of threads started is most probably more than 3, the number of threads in the first version using a queue. You should use futures.ThreadPoolExecutor(max_workers=3) to compare the two approaches on equal terms.
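If you want to check the default for your interpreter, a quick sketch (on Python 3.8 the default is min(32, os.cpu_count() + 4); _max_workers is a private attribute and is read here only for inspection):
import os
from concurrent import futures

print(os.cpu_count())
with futures.ThreadPoolExecutor() as executor:
    print(executor._max_workers)  # private attribute, shown only for illustration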
For the updated Approach 1, I suggest modifying the for loop a bit:
for _ in range(num_thread):
    target_to_run = target(data_q)
    print('target to run: {}'.format(target_to_run))
    worker = threading.Thread(target=target_to_run)
    worker.start()
The output will be like this:
200
...
200
200
target to run: None
target to run: None
target to run: None
target to run: None
target to run: None
Time taken multi-threading: 0:00:10.846368
The problem is that the Thread constructor expects a callable object (or None) as its target. You are not giving it a callable: the queue is processed during the first call to target(data_q), which runs in the main thread, and the 5 threads that are started afterwards do nothing because their target is None.
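For reference, a minimal sketch of the corrected loop, passing the function object and its argument instead of calling it:
for _ in range(num_thread):
    # pass the callable plus its argument; each thread then runs target(data_q) itself
    worker = threading.Thread(target=target, args=(data_q,))
    worker.start()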

Threads will not close off after program completion

I have a script that retrieves temperature data using requests. Since I had to make many requests (around 13000), I decided to explore multi-threading, which I am new to.
The program works by grabbing longitude/latitude data from a CSV file and then making a request to retrieve the temperature data.
The problem I am facing is that the script does not fully finish once the last temperature value is retrieved.
Here is the code. I have shortened it so it is easy to see what I am doing:
import logging
import threading
from queue import Queue

num_threads = 16
q = Queue(maxsize=0)

def get_temp(q):
    while not q.empty():
        work = q.get()
        if work is None:
            break
        ## rest of my code here
        q.task_done()
At main:
def main():
    for o in range(num_threads):
        logging.debug('Starting Thread %s', o)
        worker = threading.Thread(target=get_temp, args=(q,))
        worker.setDaemon(True)
        worker.start()
    logging.info("Main Thread Waiting")
    q.join()
    logging.info("Job complete!")
I do not see any errors in the console, and the temperature is being successfully written to another file. I have tried running a test CSV file with only a few longitude/latitude entries and the script seems to finish executing fine.
So is there a way of shedding light on what might be happening in the background? I am using Python 3.7.3 with PyCharm 2019.1 on Linux Mint 19.1.
q.join() blocks until task_done() has been called for every item that was put on the queue, so main() only continues past that line once all the work has been marked done. If a worker ever skips task_done() (for example, because an exception is raised before reaching it), the join never returns and the script never finishes.
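A minimal sketch of a worker that always marks items done, under the assumption that the rest of the request code stays as in the question:
import queue  # for queue.Empty

def get_temp(q):
    while True:
        try:
            work = q.get_nowait()   # an empty queue ends this worker
        except queue.Empty:
            break
        try:
            pass                    # ## rest of my code here
        finally:
            q.task_done()           # always mark the item done so q.join() can return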

Python Threadpool not running in parallel

I am new to Python and I am trying to use a thread pool to run this script in parallel. However, it does not run in parallel, just in sequence. The script basically iterates through an Excel file to pick the IP addresses of devices and then sends an XML request based on an input file. I have spent multiple hours on this; what am I not getting?
def do_upload(xml_file):
    for ip in codecIPs:
        try:
            request = open(xml_file, "r").read()
            h = httplib2.Http(".cache")
            h.add_credentials(username, password)
            url = "http://{}/putxml".format(ip)
            print('-' * 40)
            print('Uploading Wall Paper to {}'.format(ip))
            resp, content = h.request(url, "POST", body=request,
                                      headers={'content-type': 'text/xml; charset=UTF-8'})
            print(content)
        except (socket.timeout, socket.error, httpexception) as e:
            print('failed to connect to {}'.format(codecIPs), e)

pool = ThreadPool(3)
results = pool.map(do_upload('brandinglogo.xml'), codecIPs)
pool.close()
pool.join()
Python has no parallelism in its threading model due to the so-called Global Interpreter Lock: basically, only one thread executes Python bytecode at a time. It does enable concurrent execution, though. So for IO-bound tasks, like downloading files from the web, database accesses, etc., you will get some speedup by using threads to kick off those calls. But for CPU-bound tasks you need processes, so use the multiprocessing library.
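For illustration, a minimal sketch of mapping an IO-bound worker over the device list with a thread pool; the per-IP helper do_upload_one and the functools.partial wiring are assumptions for the sketch, not part of the original script (note that pool.map needs the callable itself, not the value returned by calling it):
from functools import partial
from multiprocessing.pool import ThreadPool

def do_upload_one(xml_file, ip):
    # hypothetical per-device worker: open xml_file and POST it to
    # http://{ip}/putxml, as in the loop body of the question
    ...

pool = ThreadPool(3)
results = pool.map(partial(do_upload_one, 'brandinglogo.xml'), codecIPs)
pool.close()
pool.join()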

watchdog with Pool in python 3

I have a simple watchdog in Python 3 that reboots my server if something goes wrong:
import time, os
from multiprocessing import Pool

def watchdog(x):
    time.sleep(x)
    os.system('reboot')
    return

def main():
    while True:
        p = Pool(processes=1)
        p.apply_async(watchdog, (60, ))  # start watchdog with 60s interval
        # here some code that has a little chance to block permanently...
        # reboot is ok because of many other files running independently
        # that will get problems too if this one blocks too long and
        # this will reset all together and autostart everything back
        # block is happening 1-2 times a month, mostly within a http-request
        p.terminate()
        p.join()
    return

if __name__ == '__main__':
    main()
p = Pool(processes=1) is declared every time the while loop starts.
Now here is the question: is there a smarter way?
If I call p.terminate() to prevent the process from rebooting, the Pool becomes closed for any other work. Or is there actually nothing wrong with declaring a new Pool every time, because of garbage collection?
Use a process. Processes support all of the features you are using, so you don't need to make a pool with size one. While processes do have a warning about using the terminate() method (since it can corrupt pipes, sockets, and locking primitives), you are not using any of those items and don't need to care. (In any event, Pool.terminate() probably has the same issues with pipes etc. even though it lacks a similar warning.)
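A minimal sketch of that suggestion, using multiprocessing.Process directly with the same watchdog function as in the question:
import time, os
from multiprocessing import Process

def watchdog(x):
    time.sleep(x)
    os.system('reboot')

def main():
    while True:
        p = Process(target=watchdog, args=(60,))  # arm a fresh 60 s watchdog
        p.start()
        # ... the code that might block goes here ...
        p.terminate()                             # disarm the watchdog once the work finished
        p.join()

if __name__ == '__main__':
    main()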

Multithreaded HTTP GET requests slow down badly after ~900 downloads

I'm attempting to download around 3,000 files (each maybe 3 MB in size) from Amazon S3 using requests_futures, but the download slows down badly after about 900 files, and actually starts to run slower than a basic for-loop.
It doesn't appear that I'm running out of memory or CPU bandwidth. It does, however, seem like the Wi-Fi connection on my machine slows to almost nothing: I drop from a few thousand packets/sec to just 3-4. The weirdest part is that I can't load any websites until the Python process exits and I restart my Wi-Fi adapter.
What in the world could be causing this, and how can I go about debugging it?
If it helps, here's my Python code:
import requests
from requests_futures.sessions import FuturesSession
from concurrent.futures import ThreadPoolExecutor, as_completed

# get a nice progress bar
from tqdm import tqdm

def download_threaded(urls, thread_pool, session):
    futures_session = FuturesSession(executor=thread_pool, session=session)
    futures_mapping = {}
    for i, url in enumerate(urls):
        future = futures_session.get(url)
        futures_mapping[future] = i

    results = [None] * len(futures_mapping)

    with tqdm(total=len(futures_mapping), desc="Downloading") as progress:
        for future in as_completed(futures_mapping):
            try:
                response = future.result()
                result = response.text
            except Exception as e:
                result = e
            i = futures_mapping[future]
            results[i] = result
            progress.update()

    return results

s3_paths = []  # some big list of file paths on Amazon S3

def make_s3_url(path):
    return "https://{}.s3.amazonaws.com/{}".format(BUCKET_NAME, path)

urls = map(make_s3_url, s3_paths)

with ThreadPoolExecutor() as thread_pool:
    with requests.session() as session:
        results = download_threaded(urls, thread_pool, session)
Edit with various things I've tried:
time.sleep(0.25) after every future.result() (performance degrades sharply around 900)
4 threads instead of the default 20 (performance degrades more gradually, but still degrades to basically nothing)
1 thread (performance degrades sharply around 900, but recovers intermittently)
ProcessPoolExecutor instead of ThreadPoolExecutor (performance degrades sharply around 900)
calling raise_for_status() to throw an exception whenever the status is greater than 200, then catching this exception by printing it as a warning (no warnings appear)
use ethernet instead of wifi, on a totally different network (no change)
creating futures in a normal requests session instead of using a FuturesSession (this is what I did originally, and found requests_futures while trying to fix the issue)
running the download on only a narrow range of files around the failure point (e.g. file 850 through file 950) -- performance is just fine here, print(response.status_code) shows 200 all the way, and no exceptions are caught.
For what it's worth, I have previously been able to download ~1500 files from S3 in about 4 seconds using a similar method, albeit with files an order of magnitude smaller.
Things I will try when I have time today:
Using a for-loop
Using Curl in the shell
Using Curl + Parallel in the shell
Using urllib2
Edit: it looks like the number of threads is stable, but when the performance starts to go bad the number of "Idle Wake Ups" appears to spike from a few hundred to a few thousand. What does that number mean, and can I use it to solve this problem?
Edit 2 from the future: I never ended up figuring out this problem. Instead of doing it all in one application, I just chunked the list of files and ran each chunk with a separate Python invocation in a separate terminal window. Ugly but effective! The cause of the problem will forever be a mystery, but I assume it was some kind of problem deep in the networking stack of my work machine at the time.
This isn't a surprise.
You don't get any additional parallelism once you have more threads than cores.
You can prove this to yourself by simplifying the problem to a single core with multiple threads.
What happens? Only one thread can run at a time, so the operating system context-switches between the threads to give everyone a turn. One thread works while the others sleep until they are woken up in turn to do their bit. In that case you can't do better than a single thread.
You may do worse, because context switching and the memory allocated for each thread (about 1 MB each) have a price, too.
Read up on Amdahl's Law.
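For reference, Amdahl's Law bounds the speedup from parallelizing a fraction p of the work across N workers at S(N) = 1 / ((1 - p) + p / N), so the serial fraction (1 - p) puts a ceiling on the gains no matter how many threads you add.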
