I'm trying to understand how to delete keys from a memoized database call using the Python DiskCache package. Below is a simple example that shows how I am memoizing a function call, and it works fine, with subsequent calls running much faster.
The documentation says that I can delete specific keys, but I can't see what the key is when it was generated by the memoize decorator.
I'd have guessed that it was something like
cache.pop(("__main__slowfunc", 5)), and while this doesn't throw an error, it doesn't remove the key from the cache.
from diskcache import FanoutCache
from pathlib import Path
import os
import time

local = Path(os.environ["AllUsersProfile"]) / "CacheTests"
cacheLocation = local / "cache"
cache = FanoutCache(cacheLocation, timeout=1)

@cache.memoize()
def slowfunc(iterations):
    for i in range(0, iterations):
        time.sleep(1)

iterations = 6
start = time.time()
slowfunc(iterations)
end = time.time()
print(f"Initial Call = {round(end-start,0)}s")
Any assistance appreciated.
Thanks
Great question, and it's genuinely confusing that your example doesn't work, given that it matches what the diskcache source code appears to do.
I've extended the example a bit and it seems to work for me like this. See if this works for you:
from diskcache import FanoutCache
from pathlib import Path
import os
import time

local = Path(os.environ["AllUsersProfile"]) / "CacheTests"
cacheLocation = local / "cache"
cache = FanoutCache(cacheLocation, timeout=1)

@cache.memoize()
def slowfunc(iterations):
    print("Recalculating")
    for i in range(0, iterations):
        time.sleep(1)

iterations = 3
cache.delete(("__main__slowfunc", iterations))

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"Initial Call = {round(end-start,0)}s")

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"Subsequent Call = {round(end-start,0)}s")

cache.delete(("__main__slowfunc", iterations))

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"After deletion = {round(end-start,0)}s")
Results:
Recalculating
Initial Call = 3.0s
Subsequent Call = 0.0s
Recalculating
After deletion = 3.0s
I tried it with pop instead of delete and that worked too.
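If you'd rather not hand-build the key tuple, recent versions of diskcache attach a __cache_key__ helper to functions wrapped with memoize, which builds the exact key for a given set of arguments. A minimal sketch, assuming your installed diskcache version provides this attribute:

# Sketch only: __cache_key__ is provided by diskcache's memoize decorator
# in recent releases; verify it exists in your installed version.
key = slowfunc.__cache_key__(iterations)  # same key memoize uses internally
print(key)
cache.delete(key)  # or cache.pop(key)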
The example I will describe here is purely conceptual, so I'm not interested in solving this exact problem.
What I need to accomplish is to asynchronously run a function based on the continuous output of a subprocess command, in this case the Windows ping yahoo.com -t command; based on the time value from the replies, I want to trigger the startme function. Inside this function there will be some more processing, including some database and/or network-related calls, so basically I/O-bound work.
My best bet would be that I should use threading, but for some reason I can't get this to work as intended. Here is what I have tried so far.
First of all, I tried the classic way of using threads, like this:
import subprocess
import re
import asyncio
import time
import threading

def startme(mytime: int):
    print(f"Mytime {mytime} was started!")
    time.sleep(mytime)  # including more long operations here such as database calls and even some time.sleep() - if possible
    print(f"Mytime {mytime} finished!")

myproc = subprocess.Popen(['ping', 'yahoo.com', '-t'], shell=True, stdout=subprocess.PIPE)

def main():
    while True:
        output = myproc.stdout.readline()
        if myproc.poll() is not None:
            break
        myoutput = output.strip().decode(encoding="UTF-8")
        print(myoutput)
        mytime = re.findall("(?<=time\=)(.*)(?=ms\s)", myoutput)
        try:
            mytime = int(mytime[0])
            if mytime < 197:
                # startme(int(mytime[0]))
                p1 = threading.Thread(target=startme(mytime), daemon=True)
                # p1 = threading.Thread(target=startme(mytime))  # tried with and without the daemon
                p1.start()
                # p1.join()
        except:
            pass

main()
But right after startme() fires for the first time, the pings stop showing and wait for the time.sleep() inside startme() to finish.
I did manage to get this working using concurrent.futures' ThreadPoolExecutor, but when I tried to replace the time.sleep() with the actual database query I found that my startme() function never completes, so no "Mytime xxx finished!" message is ever shown and no database entry is made.
import sqlite3
import subprocess
import re
import asyncio
import time
# import threading
# import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute(
    '''CREATE TABLE IF NOT EXISTS mytable (id INTEGER PRIMARY KEY, u1, u2, u3, u4)''')

def startme(mytime: int):
    print(f"Mytime {mytime} was started!")
    # time.sleep(mytime)  # including more long operations here such as database calls and even some time.sleep() - if possible
    c.execute("INSERT INTO mytable VALUES (null, ?, ?, ?, ?)", (1, 2, 3, mytime))
    conn.commit()
    print(f"Mytime {mytime} finished!")

myproc = subprocess.Popen(['ping', 'yahoo.com', '-t'], shell=True, stdout=subprocess.PIPE)

def main():
    while True:
        output = myproc.stdout.readline()
        myoutput = output.strip().decode(encoding="UTF-8")
        print(myoutput)
        mytime = re.findall("(?<=time\=)(.*)(?=ms\s)", myoutput)
        try:
            mytime = int(mytime[0])
            if mytime < 197:
                print(f"The time {mytime} is low enough to call startme()")
                executor = ThreadPoolExecutor()
                # executor = ProcessPoolExecutor()  # I did try using processes even though it's not a CPU-related issue
                executor.submit(startme, mytime)
        except:
            pass

main()
I did try using asyncio, but I soon realized it is not the right fit here, though I'm wondering if I should try aiosqlite.
I also thought about using asyncio.create_subprocess_shell and running both as parallel subprocesses, but I can't think of a way to wait for a certain string from the ping command that would trigger the second script.
Please note that I don't really need a return value from the startme() function, and the ping command example is conceptually derived from mitmproxy's mitmdump output command.
The first code wasn't working because I made a silly mistake when creating the thread: p1 = threading.Thread(target=startme(mytime)) calls the function immediately instead of handing it to the thread. The target and its arguments have to be passed separately, like this: p1 = threading.Thread(target=startme, args=(mytime,)).
The reason why I could not get the SQL insert statement to work in my second code was this error:
SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 10688 and this is thread id 17964
which I didn't see until I wrapped my SQL statement in a try/except and captured the error. So I needed to make the SQLite database connection inside my startme() function.
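To illustrate those two fixes, here is a minimal sketch of the reworked startme() (a simplified reconstruction, not the final script), opening its own SQLite connection inside the thread and receiving its argument via args=:

import sqlite3
import threading

def startme(mytime: int):
    print(f"Mytime {mytime} was started!")
    # Each thread opens its own connection, which avoids SQLite's
    # "objects created in a thread can only be used in that same thread" error.
    conn = sqlite3.connect('test.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS mytable
                 (id INTEGER PRIMARY KEY, u1, u2, u3, u4)''')
    c.execute("INSERT INTO mytable VALUES (null, ?, ?, ?, ?)", (1, 2, 3, mytime))
    conn.commit()
    conn.close()
    print(f"Mytime {mytime} finished!")

mytime = 3  # example value; in the real script this comes from the ping output
# Pass the callable and its arguments separately instead of calling it:
p1 = threading.Thread(target=startme, args=(mytime,), daemon=True)
p1.start()
p1.join()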
The other asyncio stuff was just nonsense and cannot be applied to the current issue here.
I'm trying to find a simple way to "speed up" simple functions in a big script, so I googled for it and found three ways to do that.
But it seems the time they need is always the same.
So what am I doing wrong in testing them?
file1:
from concurrent.futures import ThreadPoolExecutor as PoolExecutor
from threading import Thread
import time
import os
import math

# https://dev.to/rhymes/how-to-make-python-code-concurrent-with-3-lines-of-code-2fpe
def benchmark():
    start = time.time()
    for i in range(0, 40000000):
        x = math.sqrt(i)
        print(x)
    end = time.time()
    print('time', end - start)

with PoolExecutor(max_workers=3) as executor:
    for _ in executor.map((benchmark())):
        pass
file2:
# the basic way
from threading import Thread
import time
import os
import math

def calc():
    start = time.time()
    for i in range(0, 40000000):
        x = math.sqrt(i)
        print(x)
    end = time.time()
    print('time', end - start)

calc()
file3:
import asyncio
import uvloop
import time
import math

# https://github.com/magicstack/uvloop
async def main():
    start = time.time()
    for i in range(0, 40000000):
        x = math.sqrt(i)
        print(x)
    end = time.time()
    print('time', end - start)

uvloop.install()
asyncio.run(main())
Every file needs about 180-200 seconds, so I 'can't see' a difference.
"I googled for it and found 3 ways to [speed up a function], but it seems the time they need is always the same. so what I am doing wrong testing them?"
You seem to have found strategies to speed up some code by parallelizing it, but you failed to implement them correctly. First, the speedup is supposed to come from running multiple instances of the function in parallel, and the code snippets make no attempt to do that. Then, there are other problems.
In the first example, you pass the result of benchmark() to executor.map, which means all of benchmark() is immediately executed to completion, effectively disabling parallelization. (Also, executor.map is supposed to receive an iterable, not None, so this code must have printed a traceback not shown in the question.) The correct way would be something like:
# run the benchmark 5 times in parallel - if that takes less
# than 5x of a single benchmark, you've got a speedup
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(5):
        executor.submit(benchmark)
For this to actually produce a speedup, you should try to use ProcessPoolExecutor, which runs its tasks in separate processes and is therefore unaffected by the GIL.
The second code snippet never actually creates or runs a thread; it just executes the function in the main thread, so it's unclear how that is supposed to speed things up.
The last snippet doesn't await anything, so the async def works just like an ordinary function. Note that asyncio is an async framework based on switching between tasks blocked on I/O, and as such it can never speed up CPU-bound calculations.
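As a concrete illustration of the ProcessPoolExecutor suggestion, here is a minimal sketch (my own, with the per-iteration print removed so the timing reflects the math rather than console I/O):

import math
import time
from concurrent.futures import ProcessPoolExecutor

def benchmark(n=40_000_000):
    # CPU-bound work: no I/O, so only separate processes can run it in parallel
    for i in range(n):
        math.sqrt(i)

if __name__ == "__main__":  # required for process pools on Windows/macOS
    start = time.time()
    with ProcessPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(benchmark) for _ in range(5)]
        for f in futures:
            f.result()  # propagate any exceptions from the workers
    print('5 parallel runs took', time.time() - start, 'seconds')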
I have a large number of links I need to scrape from a website: roughly 70 base links, and from those over 700 more links that need to be scraped. In order to speed up this process, which takes about 2-3 hours without threading/async, I decided to try threads/async.
My problem is that I need to render some JavaScript in order to get the links in the first place. I have been using requests-html to do this, as its html.render() method is very reliable. However, when I try to run this using threading or async I run into a host of problems. I tried AsyncHTMLSession due to this GitHub PR but have been unable to get it to work. I was wondering if anyone had any ideas or links they could point me to that might help.
Here is some example code:
from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession

links = (tuple of links)

n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    session = AsyncHTMLSession()
    results = []
    for l in batch_link:
        print(l)
        r = session.get(l)
        r.html.arender()
        tmp_next = r.html.xpath('//a[contains(@href, "/matches/")]')
    return tmp_next

pool = ThreadPool(processes=2)
output = pool.map(link_processor, batch)
pool.close()
pool.join()
print(output)
print(output)
Output:
RuntimeError: There is no current event loop in thread 'Thread-1'.
I was able to fix this with some help from the learnpython subreddit. It turns out requests-html probably uses threads in some way, so threading on top of its threads causes an issue, and simply using a multiprocessing pool works.
FIXED CODE:
from multiprocessing import Pool
from requests_html import HTMLSession
.....
pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)
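For completeness, here is a minimal sketch of what the elided link_processor might look like with the synchronous HTMLSession (my own reconstruction, not the asker's exact code; the URLs are placeholders and the '/matches/' xpath is carried over from the question):

from multiprocessing import Pool
from requests_html import HTMLSession

links = ("https://example.com/page1", "https://example.com/page2")  # placeholder URLs

n = 5
batch = [links[i:i + n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    # One session per worker process; render() runs Chromium to execute the JavaScript
    session = HTMLSession()
    results = []
    for l in batch_link:
        r = session.get(l)
        r.html.render()
        # Return plain href strings so the results pickle cleanly back to the parent process
        results.extend(r.html.xpath('//a[contains(@href, "/matches/")]/@href'))
    return results

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        output = pool.map(link_processor, batch)
    print(output)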
I wrote a continuous script that collects some data from the internet every few seconds, keeps it in memory for a while, periodically stores it all to db and then deletes it. To keep everything running smoothly I use threads to collect the data from several sources at the same time. To minimize db operations and to avoid conflict with other db processes, I only write every now and then.
The memory from the deleted variables is never released, and it eventually grows so large that the script crashes (as shown by tracemalloc and pympler). I guess I'm handling the data coming out of the threads wrong, but I don't know how I could do it differently. Minimal example below.
Addition: I don't think I can use a queue because in reality multiple functions are threaded from this point, modifying different local variables.
import threading
import time
import tracemalloc
import pympler.muppy, pympler.summary
import gc

tracemalloc.start()

def a():
    # collect data
    collection.update({int(time.time()): list(range(1, 1000))})
    return

collection = {}
threads = []
start = time.time()
cycle = 0

while time.time() < start + 60:
    cycle += 1
    t = threading.Thread(target=a)
    threads.append(t)
    t.start()

    time.sleep(1)

    for t in threads:
        if t.is_alive() == False:
            t.join()

    # periodically delete data
    delete = []
    for key, val in collection.items():
        if key < time.time() - 10:
            delete.append(key)
    for delet in delete:
        print('DELETING:', delet)
        del collection[delet]

    gc.collect()

    print('CYCLE:', cycle, 'THREADS:', threading.active_count(), 'COLLECTION:', len(collection))
    print(tracemalloc.get_traced_memory())
    all_objects = pympler.muppy.get_objects()
    sum1 = pympler.summary.summarize(all_objects)
    pympler.summary.print_(sum1)
In order to get Cassandra inserts going faster I'm using multithreading. It's working OK, but if I add more threads it doesn't make any difference; I think I'm not generating more connections. Maybe I should be using pool.execute(f, *args, **kwargs), but I don't know how to use it and the documentation is quite scanty. Here's my code so far:
import connect_to_ks_bp
from connect_to_ks_bp import ks_refs
import time
import pycassa
from datetime import datetime
import json
import threadpool

pool = threadpool.ThreadPool(20)

count = 1
bench = open("benchCassp20_100000.txt", "w")

def process_tasks(lines):
    # let threadpool format your requests into a list
    requests = threadpool.makeRequests(insert_into_cfs, lines)
    # insert the requests into the threadpool
    for req in requests:
        pool.putRequest(req)
    pool.wait()

def read(file):
    """read data from json and insert into keyspace"""
    json_data = open(file)
    lines = []
    for line in json_data:
        lines.append(line)
    print(len(lines))
    process_tasks(lines)

def insert_into_cfs(line):
    global count
    count += 1
    if count > 5000:
        bench.write(str(datetime.now()) + "\n")
        count = 1
    # print count
    # print kspool.checkedout()
    """
    user_tweet_cf = pycassa.ColumnFamily(kspool, 'UserTweet')
    user_name_cf = pycassa.ColumnFamily(kspool, 'UserName')
    tweet_cf = pycassa.ColumnFamily(kspool, 'Tweet')
    user_follower_cf = pycassa.ColumnFamily(kspool, 'UserFollower')
    """
    tweet_data = json.loads(line)
    """Format the tweet time as an epoch seconds int value"""
    tweet_time = time.strptime(tweet_data['created_at'], "%a, %d %b %Y %H:%M:%S +0000")
    tweet_time = int(time.mktime(tweet_time))
    new_user_tweet(tweet_data['from_user_id'], tweet_time, tweet_data['id'])
    new_user_name(tweet_data['from_user_id'], tweet_data['from_user_name'])
    new_tweet(tweet_data['id'], tweet_data['text'], tweet_data['to_user_id'])
    if tweet_data['to_user_id'] != 0:
        new_user_follower(tweet_data['from_user_id'], tweet_data['to_user_id'])

"""The 4 functions below carry out the inserts into specific column families"""
def new_user_tweet(from_user_id, tweet_time, id):
    ks_refs.user_tweet_cf.insert(from_user_id, {tweet_time: id})

def new_user_name(from_user_id, user_name):
    ks_refs.user_name_cf.insert(from_user_id, {'username': user_name})

def new_tweet(id, text, to_user_id):
    ks_refs.tweet_cf.insert(id, {
        'text': text,
        'to_user_id': to_user_id
    })

def new_user_follower(from_user_id, to_user_id):
    ks_refs.user_follower_cf.insert(from_user_id, {to_user_id: 0})
if __name__ == '__main__':
    read('tweets.json')
This is just another file:
import pycassa
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

"""This is a static class I set up to hold the global database connection stuff.
I only want to connect once, and then the various insert functions will use these fields a lot."""
class ks_refs():
    pool = ConnectionPool('TweetsKS', use_threadlocal=True, max_overflow=-1)

    @classmethod
    def cf_connect(cls, column_family):
        cf = pycassa.ColumnFamily(cls.pool, column_family)
        return cf

ks_refs.user_name_cfo = ks_refs.cf_connect('UserName')
ks_refs.user_tweet_cfo = ks_refs.cf_connect('UserTweet')
ks_refs.tweet_cfo = ks_refs.cf_connect('Tweet')
ks_refs.user_follower_cfo = ks_refs.cf_connect('UserFollower')

# trying out a batch mutator which is supposed to increase performance
ks_refs.user_name_cf = ks_refs.user_name_cfo.batch(queue_size=10000)
ks_refs.user_tweet_cf = ks_refs.user_tweet_cfo.batch(queue_size=10000)
ks_refs.tweet_cf = ks_refs.tweet_cfo.batch(queue_size=10000)
ks_refs.user_follower_cf = ks_refs.user_follower_cfo.batch(queue_size=10000)
A few thoughts:
Batch sizes of 10,000 are way too large. Try 100.
Make your ConnectionPool size at least as large as the number of threads using the pool_size parameter. The default is 5. Pool overflow should only be used when the number of active threads may vary over time, not when you have a fixed number of threads. The reason is that it will result in a lot of unnecessary opening and closing of new connections, which is a fairly expensive process.
After you've resolved those issues, look into these:
I'm not familiar with the threadpool library that you're using, but make sure that if you take the Cassandra insertions out of the picture, you still see an increase in performance when you increase the number of threads.
Python itself has a limit to how many threads may be useful due to the GIL. It shouldn't normally max out at 20, but it might if you're doing something CPU intensive or something that requires a lot of Python interpretation. The test that I described in my previous point will cover this as well. It may be the case that you should consider using the multiprocessing module, but you would need some code changes to handle that (namely, not sharing ConnectionPools, CFs, or hardly anything else between processes).
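A minimal sketch of the first two suggestions (a smaller batch queue_size and a pool_size that matches the thread count); the keyspace and column family names are taken from the question, and the numbers are only illustrative:

import pycassa
from pycassa.pool import ConnectionPool

NUM_THREADS = 20  # match whatever threadpool.ThreadPool(...) is created with

# pool_size should be at least the number of threads sharing the pool;
# with a fixed thread count, overflow connections are just churn.
pool = ConnectionPool('TweetsKS',
                      pool_size=NUM_THREADS,
                      use_threadlocal=True)

user_tweet_cfo = pycassa.ColumnFamily(pool, 'UserTweet')

# Much smaller batch: flushes roughly every 100 queued mutations.
user_tweet_cf = user_tweet_cfo.batch(queue_size=100)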