How do I get improved pymongo performance using threading? - multithreading

I'm trying to see performance improvements on pymongo, but I'm not observing any.
My sample db has 400,000 records. Essentially I'm seeing threaded and single threaded performance be equal - and the only performance gain coming from multiple process execution.
Does pymongo not release the GIL during queries?
Single Perf: real 0m0.618s
Multiproc:real 0m0.144s
Multithread:real 0m0.656s
Regular code:
choices = ['foo','bar','baz']
def regular_read(db, sample_choice):
rows = db.test_samples.find({'choice':sample_choice})
return 42 # done to remove calculations from the picture
def main():
client = MongoClient('localhost', 27017)
db = client['test-async']
for sample_choice in choices:
regular_read(db, sample_choice)
if __name__ == '__main__':
main()
$ time python3 mongotest_read.py
real 0m0.618s
user 0m0.085s
sys 0m0.018s
Now when I use multiprocessing I can see some improvement.
from random import randint, choice
import functools
from pymongo import MongoClient
from concurrent import futures
choices = ['foo','bar','baz']
MAX_WORKERS = 4
def regular_read(sample_choice):
client = MongoClient('localhost', 27017,connect=False)
db = client['test-async']
rows = db.test_samples.find({'choice':sample_choice})
#return sum(r['data'] for r in rows)
return 42
def main():
f = functools.partial(regular_read)
with futures.ProcessPoolExecutor(MAX_WORKERS) as executor:
res = executor.map(f, choices)
print(list(res))
return len(list(res))
if __name__ == '__main__':
main()
$ time python3 mongotest_proc_read.py
[42, 42, 42]
real 0m0.144s
user 0m0.106s
sys 0m0.041s
But when you switch from ProcessPoolExecutor to ThreadPoolExecutor the speed drops back to single threaded mode.
...
def main():
client = MongoClient('localhost', 27017,connect=False)
f = functools.partial(regular_read, client)
with futures.ThreadPoolExecutor(MAX_WORKERS) as executor:
res = executor.map(f, choices)
print(list(res))
return len(list(res))
$ time python3 mongotest_thread_read.py
[42, 42, 42]
real 0m0.656s
user 0m0.111s
sys 0m0.024s
...

PyMongo uses the standard Python socket module, which does drop the GIL while sending and receiving data over the network. However, it's not MongoDB or the network that's your bottleneck: it's Python.
CPU-intensive Python processes do not scale by adding threads; indeed they slow down slightly due to context-switching and other inefficiencies. To use more than one CPU in Python, start subprocesses.
I know it doesn't seem intuitive that a "find" should be CPU intensive, but the Python interpreter is slow enough to contradict our intuition. If the query is fast and there's no latency to MongoDB on localhost, MongoDB can easily outperform the Python client. The experiment you just ran, substituting subprocesses for threads, confirms that Python performance is the bottleneck.
To ensure maximum throughput, make sure you have C extensions enabled: pymongo.has_c() == True. With that in place, PyMongo runs as fast as a Python client library can achieve, to get more throughput go to multiprocessing.
If your expected real-world scenario involves more time-consuming queries, or a remote MongoDB with some network latency, multithreading may give you some performance increase.

You can use mongodb indexes to optimize your queries.
https://docs.mongodb.com/manual/tutorial/optimize-query-performance-with-indexes-and-projections/

Related

python speedup a simple function

I try to find a simple way to "speed up" simple functions for a big script so I googled for it and found 3 ways to do that.
but it seems the time they need is always the same.
so what I am doing wrong testing them?
file1:
from concurrent.futures import ThreadPoolExecutor as PoolExecutor
from threading import Thread
import time
import os
import math
#https://dev.to/rhymes/how-to-make-python-code-concurrent-with-3-lines-of-code-2fpe
def benchmark():
start = time.time()
for i in range (0, 40000000):
x = math.sqrt(i)
print(x)
end = time.time()
print('time', end - start)
with PoolExecutor(max_workers=3) as executor:
for _ in executor.map((benchmark())):
pass
file2:
#the basic way
from threading import Thread
import time
import os
import math
def calc():
start = time.time()
for i in range (0, 40000000):
x = math.sqrt(i)
print(x)
end = time.time()
print('time', end - start)
calc()
file3:
import asyncio
import uvloop
import time
import math
#https://github.com/magicstack/uvloop
async def main():
start = time.time()
for i in range (0, 40000000):
x = math.sqrt(i)
print(x)
end = time.time()
print('time', end - start)
uvloop.install()
asyncio.run(main())
every file needs about 180-200 sec
so i 'can't see' a difference.
I googled for it and found 3 ways to [speed up a function], but it seems the time they need is always the same. so what I am doing wrong testing them?
You seemed to have found strategies to speed up some code by parallelizing it, but you failed to implement them correctly. First, the speedup is supposed to come from running multiple instances of the function in parallel, and the code snippets make no attempt to do that. Then, there are other problems.
In the first example, you pass the result benchmark() to executor.map, which means all of benchmark() is immediately executed to completion, thus effectively disabling parallelization. (Also, executor.map is supposed to receive an iterable, not None, and this code must have printed a traceback not shown in the question.) The correct way would be something like:
# run the benchmark 5 times in parallel - if that takes less
# than 5x of a single benchmark, you've got a speedup
with ThreadPoolExecutor(max_workers=5) as executor:
for _ in range(5):
executor.submit(benchmark)
For this to actually produce a speedup, you should try to use ProcessPoolExecutor, which runs its tasks in separate processes and is therefore unaffected by the GIL.
The second code snippet never actually creates or runs a thread, it just executes the function in the main thread, so it's unclear how that's supposed to speed things up.
The last snippet doesn't await anything, so the async def works just like an ordinary function. Note that asyncio is an async framework based on switching between tasks blocked on IO, and as such can never speed CPU-bound calculations.

Threading/Async in Requests-html

I have a large number of links I need to scrape from a website. I have ~70 base links and from them over 700 links that need to be scraped from those starting 70. So in order to speed up this process, takes about 2-3 hours without threading/async, I decided to try and use a thread/async.
My problem is that I need to render some javascript in order to get the links in the first place. I have been using requests-html to do this as its html.render() method is very reliable. However, when I try and run this using threading or async I run into a host of problems. I tried AsyncHTMLSession due to this Github PR but have been unable to get it to work. I was wondering if anyone had any ideas or links they could point me too that might help.
Here is some example code:
from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession
links = (tuple of links)
n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]
def link_processor(batch_link):
session = AsyncHTMLSession()
results = []
for l in batch_link:
print(l)
r = session.get(l)
r.html.arender()
tmp_next = r.html.xpath('//a[contains(#href, "/matches/")]')
return tmp_next
pool = ThreadPool(processes=2)
output = pool.map(link_processor, batch)
pool.close()
pool.join()
print(output)
Output:
RuntimeError: There is no current event loop in thread 'Thread-1'.
Was able to fix this with some help from the learnpython subreddit. Turns out requests-html probably uses threads in some way and so threading the threads has an issue so simply using multiprocessing pool works.
FIXED CODE:
from multiprocessing import Pool
from requests_html import HTMLSession
.....
pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)

Thread objects not freed from memory

I wrote a continuous script that collects some data from the internet every few seconds, keeps it in memory for a while, periodically stores it all to db and then deletes it. To keep everything running smoothly I use threads to collect the data from several sources at the same time. To minimize db operations and to avoid conflict with other db processes, I only write every now and then.
The memory from the deleted variables is never returned and eventually becomes so large the script crashes (shown by tracemalloc and pympler). I guess I'm handling the data coming out of the threads wrong but I don't know how I could do it differently. Minimal example below.
Addition: I don't think I can use a queue because in reality multiple functions are threaded from this point, modifying different local variables.
import threading
import time
import tracemalloc
import pympler.muppy, pympler.summary
import gc
tracemalloc.start()
def a():
# collect data
collection.update({int(time.time()): list(range(1,1000))})
return
collection = {}
threads = []
start = time.time()
cycle = 0
while time.time() < start + 60:
cycle += 1
t = threading.Thread(target = a)
threads.append(t)
t.start()
time.sleep(1)
for t in threads:
if t.is_alive() == False:
t.join()
# periodically delete data
delete = []
for key, val in collection.items():
if key < time.time() - 10:
delete.append(key)
for delet in delete:
print('DELETING:', delet)
del collection[delet]
gc.collect()
print('CYCLE:', cycle, 'THREADS:', threading.active_count(), 'COLLECTION:', len(collection))
print(tracemalloc.get_traced_memory())
all_objects = pympler.muppy.get_objects()
sum1 = pympler.summary.summarize(all_objects)
pympler.summary.print_(sum1)

What can be slowing down my program when i use multithreading?

I'm writing a program that downloads data from a website (eve-central.com). It returns xml when I send a GET request with some parameters. The problem is that I need to make about 7080 of such requests because i can't specify the typeid parameter more than once.
def get_data_eve_central(typeids, system, hours, minq=1, thread_count=1):
import xmltodict, urllib3
pool = urllib3.HTTPConnectionPool('api.eve-central.com')
for typeid in typeids:
r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
answer = xmltodict.parse(r.data)
It was really slow when I just connected to the website and made all the requests so I decided to make it use multiple threads at a time (I read that if the process involves a lot of waiting (I/O, HTTP requests), it can be speeded up a lot with multithreading). I rewrote it using multiple threads, but it somehow isn't any faster (a bit slower in fact). Here's the code rewritten using multithreading:
def get_data_eve_central(all_typeids, system, hours, minq=1, thread_count=1):
if thread_count > len(all_typeids): raise NameError('TooManyThreads')
def requester(typeids):
pool = urllib3.HTTPConnectionPool('api.eve-central.com')
for typeid in typeids:
r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
answer = xmltodict.parse(r.data)['evec_api']['quicklook']
answers.append(answer)
def chunkify(items, quantity):
chunk_len = len(items) // quantity
rest_count = len(items) % quantity
chunks = []
for i in range(quantity):
chunk = items[:chunk_len]
items = items[chunk_len:]
if rest_count and items:
chunk.append(items.pop(0))
rest_count -= 1
chunks.append(chunk)
return chunks
t = time.clock()
threads = []
answers = []
for typeids in chunkify(all_typeids, thread_count):
threads.append(threading.Thread(target=requester, args=[typeids]))
threads[-1].start()
threads[-1].join()
print(time.clock()-t)
return answers
What I do is I divide all typeids into as many chunks as the quantity of threads i want to use and create a thread for each chunk to process it. The question is: what can slow it down? (I apologise for my bad english)
Python has Global Interpreter Lock. It can be your problem. Actually Python cannot do it in a genuine parallel way. You may think about switching to other languages or staying with Python but use process-based parallelism to solve your task. Here is a nice presentation Inside the Python GIL

Cassandra Pycassa connection pool, how to use properly?

In order to get a Cassandra insert going faster I'm using multithreading, its working ok, but if I add more threads it doesnt make any difference, I think I'm not generating more connections, I think maybe I should be using pool.execute(f, *args, **kwargs) but I dont know how to use it, the documentation is quite scanty. Heres my code so far..
import connect_to_ks_bp
from connect_to_ks_bp import ks_refs
import time
import pycassa
from datetime import datetime
import json
import threadpool
pool = threadpool.ThreadPool(20)
count = 1
bench = open("benchCassp20_100000.txt", "w")
def process_tasks(lines):
#let threadpool format your requests into a list
requests = threadpool.makeRequests(insert_into_cfs, lines)
#insert the requests into the threadpool
for req in requests:
pool.putRequest(req)
pool.wait()
def read(file):
"""read data from json and insert into keyspace"""
json_data=open(file)
lines = []
for line in json_data:
lines.append(line)
print len(lines)
process_tasks(lines)
def insert_into_cfs(line):
global count
count +=1
if count > 5000:
bench.write(str(datetime.now())+"\n")
count = 1
#print count
#print kspool.checkedout()
"""
user_tweet_cf = pycassa.ColumnFamily(kspool, 'UserTweet')
user_name_cf = pycassa.ColumnFamily(kspool, 'UserName')
tweet_cf = pycassa.ColumnFamily(kspool, 'Tweet')
user_follower_cf = pycassa.ColumnFamily(kspool, 'UserFollower')
"""
tweet_data = json.loads(line)
"""Format the tweet time as an epoch seconds int value"""
tweet_time = time.strptime(tweet_data['created_at'],"%a, %d %b %Y %H:%M:%S +0000")
tweet_time = int(time.mktime(tweet_time))
new_user_tweet(tweet_data['from_user_id'],tweet_time,tweet_data['id'])
new_user_name(tweet_data['from_user_id'],tweet_data['from_user_name'])
new_tweet(tweet_data['id'],tweet_data['text'],tweet_data['to_user_id'])
if tweet_data['to_user_id'] != 0:
new_user_follower(tweet_data['from_user_id'],tweet_data['to_user_id'])
""""4 functions below carry out the inserts into specific column families"""
def new_user_tweet(from_user_id,tweet_time,id):
ks_refs.user_tweet_cf.insert(from_user_id,{(tweet_time): id})
def new_user_name(from_user_id,user_name):
ks_refs.user_name_cf.insert(from_user_id,{'username': user_name})
def new_tweet(id,text,to_user_id):
ks_refs.tweet_cf.insert(id,{
'text': text
,'to_user_id': to_user_id
})
def new_user_follower(from_user_id,to_user_id):
ks_refs.user_follower_cf.insert(from_user_id,{to_user_id: 0})
read('tweets.json')
if __name__ == '__main__':
This is just another file..
import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
"""This is a static class I set up to hold the global database connection stuff,
I only want to connect once and then the various insert functions will use these fields a lot"""
class ks_refs():
pool = ConnectionPool('TweetsKS',use_threadlocal = True,max_overflow = -1)
#classmethod
def cf_connect(cls, column_family):
cf = pycassa.ColumnFamily(cls.pool, column_family)
return cf
ks_refs.user_name_cfo = ks_refs.cf_connect('UserName')
ks_refs.user_tweet_cfo = ks_refs.cf_connect('UserTweet')
ks_refs.tweet_cfo = ks_refs.cf_connect('Tweet')
ks_refs.user_follower_cfo = ks_refs.cf_connect('UserFollower')
#trying out a batch mutator whihc is supposed to increase performance
ks_refs.user_name_cf = ks_refs.user_name_cfo.batch(queue_size=10000)
ks_refs.user_tweet_cf = ks_refs.user_tweet_cfo.batch(queue_size=10000)
ks_refs.tweet_cf = ks_refs.tweet_cfo.batch(queue_size=10000)
ks_refs.user_follower_cf = ks_refs.user_follower_cfo.batch(queue_size=10000)
A few thoughts:
Batch sizes of 10,000 are way too large. Try 100.
Make your ConnectionPool size at least as large as the number of threads using the pool_size parameter. The default is 5. Pool overflow should only be used when the number of active threads may vary over time, not when you have a fixed number of threads. The reason is that it will result in a lot of unnecessary opening and closing of new connections, which is a fairly expensive process.
After you've resolved those issues, look into these:
I'm not familiar with the threadpool library that you're using. Make sure that if you take the insertions to Cassandra out of the picture that you see an increase in the performance when you increase the number of threads
Python itself has a limit to how many threads may be useful due to the GIL. It shouldn't normally max out at 20, but it might if you're doing something CPU intensive or something that requires a lot of Python interpretation. The test that I described in my previous point will cover this as well. It may be the case that you should consider using the multiprocessing module, but you would need some code changes to handle that (namely, not sharing ConnectionPools, CFs, or hardly anything else between processes).

Resources