I am trying to use klepto to do LRU caching. I would like to store the cache to disk, and am currently using klepto's dir_archive option for this. I have written the following code, largely based on the code in the klepto test scripts:
import hashlib

from klepto.safe import lru_cache
from klepto.archives import dir_archive

def mymap(data):
    return hashlib.sha256(data).hexdigest()

class MyLRUCache:
    @lru_cache(cache=dir_archive(cached=False), keymap=mymap, ignore='self', maxsize=5)
    def __call__(self, data):
        return data
    call = __call__

    def store(self, data):
        self.call(data)

    # I would also appreciate a better way to do this, if possible.
    def lookup(self, key):
        return self.call.__cache__()[key]
This code appears to work fine until the cache reaches maxsize. At that point, instead of using LRU to remove a single item, lru_cache purges the entire cache! Below is the piece of klepto source code that does this (https://github.com/uqfoundation/klepto/blob/master/klepto/safe.py):
# purge cache
if _len(cache) > maxsize:
    if cache.archived():
        cache.dump()
        cache.clear()
        queue.clear()
        refcount.clear()
    else:  # purge least recently used cache entry
        key = queue_popleft()
        refcount[key] -= 1
        while refcount[key]:
            key = queue_popleft()
            refcount[key] -= 1
        del cache[key], refcount[key]
So my question is, why does klepto purge "archived" caches? Is it possible to use lru_cache and dir_archive together?
Also, if my code looks completely nuts, I would really appreciate some sample code of how I should be writing this, since there is not much documentation for klepto.
ADDITIONAL NOTES:
I also tried defining dir_archive with cached=True. The in-memory cache still gets purged when maxsize is reached, but the contents of the cache are dumped to the archived cache at that point. I have several problems with this:
1. The in-memory cache is only accurate until maxsize is reached, at which point it is wiped.
2. The archived cache is not affected by maxsize: every time the in-memory cache reaches maxsize, all of its items are dumped to the archived cache, regardless of how many are already there.
3. LRU caching seems impossible, based on points 1 and 2.
The answer is that you couldn't before your question, but now you can.
If you get the most recent klepto from GitHub and provide the new flag purge=False, then you get the behavior you are looking for. I just added this in response to your question.
In your case:
@lru_cache(cache=dir_archive(cached=False), keymap=mymap, ignore='self', maxsize=5, purge=False)
Or, for example:
@lru_cache(maxsize=3, cache=dict_archive('test'), purge=True)
def identity(x):
    return x
identity(1)
identity(2)
identity(3)
ic = identity.__cache__()
assert len(ic.keys()) == 3
assert len(ic.archive.keys()) == 0
identity(4)
assert len(ic.keys()) == 0
assert len(ic.archive.keys()) == 4
identity(5)
assert len(ic.keys()) == 1
assert len(ic.archive.keys()) == 4
@lru_cache(maxsize=3, cache=dict_archive('test'), purge=False)
def inverse(x):
    return -x
inverse(1)
inverse(2)
inverse(3)
ic = inverse.__cache__()
assert len(ic.keys()) == 3
assert len(ic.archive.keys()) == 0
inverse(4)
assert len(ic.keys()) == 3
assert len(ic.archive.keys()) == 1
inverse(5)
assert len(ic.keys()) == 3
assert len(ic.archive.keys()) == 2
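Putting it together, your class with the new flag would look roughly like this (a sketch, assuming the current GitHub version of klepto):

import hashlib

from klepto.safe import lru_cache
from klepto.archives import dir_archive

def mymap(data):
    return hashlib.sha256(data).hexdigest()

class MyLRUCache:
    # purge=False: evict only the least recently used entry when maxsize
    # is hit, instead of purging the whole cache.
    @lru_cache(cache=dir_archive(cached=False), keymap=mymap,
               ignore='self', maxsize=5, purge=False)
    def __call__(self, data):
        return data
    call = __call__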
Please add a ticket if this doesn't do what you were expecting. Thanks for the suggestion.
Related
I have a program written in Python that uses BerkeleyDB to store data (event logs), which I migrated to LMDB. Before an event gets written, the program looks up whether the event already exists. I noticed that the BerkeleyDB version is much faster at these single-value lookups across 13k+ records (the LMDB version is about 1 second slower for every lookup), even with transactions enabled in BerkeleyDB. Any idea how to speed up the LMDB version? Note that I already had 70 GB+ (about 30 million records) of data stored in my BerkeleyDB, and doing additional processing on those events takes me more than an hour, so I thought switching to LMDB would decrease the processing time.
My LMDB environment was opened this way (I even set readahead to False, but the database size is only about 35 MB, so I don't think it matters):
env = lmdb.open(db_folder, map_size=100000000000, max_dbs=4, readahead=False)
database = env.open_db('events'.encode())
My berkeleydb was opened this way:
env = db.DBEnv()
env.open(db_folder, db.DB_INIT_MPOOL | db.DB_CREATE | db.DB_INIT_LOG | db.DB_INIT_TXN | db.DB_RECOVER, 0)
database = db.DB(env)
BerkeleyDB version of check:
if event['eId'].encode('utf-8') in database:
    duplicate_count += 1
else:
    try:
        txn = env.txn_begin(None)
        database[event['eId'].encode('utf-8')] = json.dumps(event).encode('utf-8')
    except:
        if txn is not None:
            txn.abort()
            txn = None
        raise
    else:
        txn.commit()
        txn = None
        event_count += 1
lmdb version:
with env.begin(write=True, buffers=True, db=database) as txn:  # write=True is needed for txn.put
    if txn.get(event['eId'].encode()) is not None:
        dup_event_count += 1
    else:
        txn.put(event['eId'].encode(), json.dumps(event).encode('utf-8'))
        event_count += 1
Solution:
Place with env.begin outside the loop:
@case('rand lookup')
def test():
    with env.begin() as txn:
        for word in words:
            txn.get(word)
    return len(words)

@case('per txn rand lookup')
def test():
    for word in words:
        with env.begin() as txn:
            txn.get(word)
    return len(words)
Figured this out myself: what I was doing was a per-transaction random lookup. I just had to place with env.begin() outside of the for loop (the loop is not visible in my example), as suggested in this example: https://raw.githubusercontent.com/jnwatson/py-lmdb/master/examples/dirtybench.py
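Applied to the original duplicate check, the restructuring would look roughly like this (a sketch; the iterable name events is a hypothetical placeholder, not from the original code):

# Sketch: one write transaction wraps the whole loop, instead of
# opening a fresh transaction for every event.
with env.begin(write=True, buffers=True, db=database) as txn:
    for event in events:  # 'events' is a hypothetical iterable of event dicts
        key = event['eId'].encode()
        if txn.get(key) is not None:
            dup_event_count += 1
        else:
            txn.put(key, json.dumps(event).encode('utf-8'))
            event_count += 1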
I have code like the one below:
def expensive(self, c, v):
    ...

def inner_loop(self, c, collector):
    self.db.query('SELECT ...', (c,))
    for v in self.db.cursor.fetchall():
        collector.append(self.expensive(c, v))

def method(self):
    # create a Pool
    # join the Pool ??
    self.db.query('SELECT ...')
    for c in self.db.cursor.fetchall():
        collector = []
        # RUN the whole cycle in parallel in separate processes
        self.inner_loop(c, collector)
        # do stuff with the collector
    # ! close the pool ?
Both the outer and the inner loop run for thousands of steps.
I think I understand how to run a Pool of a couple of processes; all the examples I found show more or less that.
But in my case I need to launch a persistent Pool and then feed it the data (c-values). Once an inner-loop process has finished, I have to supply the next available c-value, keep the processes running, and collect the results.
How do I do that?
A clunky idea I have is:

def method(self):
    ws = 4
    with Pool(processes=ws) as pool:
        cs = []
        for i, c in enumerate(...):
            cs.append(c)
            if i % ws == 0:
                res = [pool.apply(self.inner_loop, (c,)) for i in range(ws)]
                cs = []
                collector.append(res)
Will this keep the same pool running, i.e. not launch a new process every time?
Do I need the 'if i % ws == 0' part, or can I use imap() or map_async() so that the Pool object blocks the loop when the available workers are exhausted and continues when some are freed?
Yes, the way that multiprocessing.Pool works is:
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue.
So simply submitting all your work to the pool via imap should be sufficient:
with Pool(processes=4) as pool:
    initial_results = db.fetchall("SELECT c FROM outer")
    results = [pool.imap(self.inner_loop, (c,)) for c in initial_results]
That said, if you really are doing this to fetch things from the DB, it may make more sense to move more processing down into that layer (bring the computation to the data rather than bringing the data to the computation).
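A self-contained sketch of that pattern (the function and data names here are placeholders, not from the question's code):

from multiprocessing import Pool

def inner_loop(c):
    # Placeholder for the per-c work; it must be a top-level function
    # so the pool can pickle it and ship it to the workers.
    return [c * v for v in range(3)]

if __name__ == '__main__':
    cs = range(1000)  # stand-in for the outer SELECT results
    with Pool(processes=4) as pool:
        # The same four workers stay alive for the whole loop; imap_unordered
        # feeds each one the next c as soon as it finishes a task, and
        # yields results in completion order.
        collector = list(pool.imap_unordered(inner_loop, cs))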
I am using Flask and have a route that sends emails to people, using threading to send them faster. When I run it on my local machine it takes about 12 seconds to send 300 emails, but when I run it on Lambda through API Gateway it times out.
Here's my code:
import threading

def async_mail(app, msg):
    with app.app_context():
        mail.send(msg)

def mass_mail_sender(order, user, header):
    html = render_template('emails/pickup_mail.html', bruger_order=order.ordre, produkt=order.produkt)
    msg = Message(recipients=[user],
                  sender=('Sender', 'infor@example.com'),
                  html=html,
                  subject=header)
    thread = threading.Thread(target=async_mail, args=[create_app(), msg])
    thread.start()
    return thread

@admin.route('/lager/<url_id>/opdater', methods=['POST'])
def update_stock(url_id):
    start = time.time()
    if current_user.navn != 'Admin':
        abort(403)
    if request.method == 'POST':
        produkt = Produkt.query.filter_by(url_id=url_id)
        nyt_antal = int(request.form['bestilt_hjem'])
        produkt.balance = nyt_antal
        produkt.bestilt_hjem = nyt_antal
        db.session.commit()
        orders = OrdreBog.query.filter(OrdreBog.produkt.has(func.lower(Produkt.url_id == url_id))) \
            .filter(OrdreBog.produkt_status == 'Ikke klar').all()
        threads = []
        for order in orders:
            if order.antal <= nyt_antal:
                nyt_antal -= order.antal
                new_thread = mass_mail_sender(order, order.ordre.bruger.email,
                                              f'Din bog {order.produkt.titel} er klar til afhentning')
                threads.append(new_thread)
                order.produkt_status = 'Klar til afhentning'
        db.session.commit()
        for thread in threads:
            try:
                thread.join()
            except Exception:
                pass
        end = time.time()
        print(end - start)
        return 'Emails sendt'
    return ''
AWS Lambda is designed to run functions within these constraints:

Memory – The amount of memory available to the function during execution. Choose an amount between 128 MB and 3,008 MB in 64-MB increments. Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).

Timeout – The amount of time that Lambda allows a function to run before stopping it. The default is 3 seconds. The maximum allowed value is 900 seconds.

To put this in the context of your multi-threaded mail-sending code: the function terminates either when your execution completes successfully or when it reaches the configured timeout.

I understand you want a single Python function to send n emails "concurrently". To achieve this with Lambda, try the "Concurrency" setting, and trigger your Lambda function through a local script, S3-hosted HTML/JS triggered by CloudWatch, or an EC2 instance.
Concurrency – Reserve concurrency for a function to set the maximum number of simultaneous executions for a function. Provision concurrency to ensure that a function can scale without fluctuations in latency.
https://docs.aws.amazon.com/lambda/latest/dg/configuration-console.html
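For illustration, the timeout and memory limits can also be raised programmatically (a sketch using boto3; the function name "mail-sender" is a hypothetical placeholder, not from the question):

import boto3

client = boto3.client('lambda')
client.update_function_configuration(
    FunctionName='mail-sender',  # hypothetical function name
    Timeout=900,                 # seconds; the maximum Lambda allows
    MemorySize=1024,             # MB; CPU share scales linearly with memory
)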
Important: all of the above settings can significantly affect your Lambda function's execution cost, so plan and compare before applying them.
If you need any more help, let me know. Thank you.
As an experiment, I set a cache value with uwsgi.cache_set('test', data) inside a mule process. The cache is set as expected.
Now I spawn a thread from which I try to access that cache.
Threading is enabled in uwsgi.ini:
[uwsgi]
threads = 4
in mule.py:
import uwsgi
from threading import Thread

# Threaded function
def a_function():
    uwsgi.cache_set('test', b'NOT OK')       # <- nothing happens here
    cache_return = uwsgi.cache_get('test')   # <- returns b'OK': the previous value was not overwritten

if __name__ == '__main__':
    cache = uwsgi.cache_set('test', b'OK')   # <- works here
    cache_return = uwsgi.cache_get('test')   # <- returns b'OK', as expected
    t = Thread(target=a_function)
    t.start()
The question is why this happens, and how do I set cache values from inside a thread?
OK, it seems I used the wrong function: cache_set instead of cache_update.
uwsgi.cache_set(key, value[, expire, cache_name])
Set a value in the cache. If the key is already set but not expired,
it doesn’t set anything.
uwsgi.cache_update(key, value[, expire, cache_name])
Update a value in the cache. This always sets the key, whether it was
already set before or not and whether it has expired or not.
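So the threaded function just needs cache_update instead (a minimal sketch of the corrected call):

def a_function():
    uwsgi.cache_update('test', b'NOT OK')    # always sets the key, even if it already exists
    cache_return = uwsgi.cache_get('test')   # now returns b'NOT OK'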
I have this function that, when hitting a rate limit, will call itself again. It should eventually succeed and return the working data. It works normally, and the rate-limit handling works as expected, but when the responses go back to normal I get:
TypeError: 'NoneType' object is not subscriptable
def grabPks(pageNum):
    # cloudflare blocks bots... use the scraper library to get around this, or build your own
    # logic to store and use a manually generated cloudflare session cookie... I don't care 😎
    req = scraper.get("sumurl.com/" + str(pageNum)).content
    if req == b'Rate Limit Exceeded':
        print("adjust the rate limiting because they're blocking us :(")
        manPenalty = napLength * 3
        print("manually sleeping for {} seconds".format(manPenalty))
        time.sleep(manPenalty)
        print("okay let's try again... NOW SERVING {}".format(pageNum))
        grabPks(pageNum)
    else:
        tree = html.fromstring(req)
        pk = tree.xpath("/path/small/text()")
        resCmpress = tree.xpath("path/a//text()")
        resXtend = tree.xpath("[path/td[2]/small/a//text()")
        balance = tree.xpath("path/font//text()")
        return pk, resCmpress, resXtend, balance
I've tried to move the return outside of the else scope, but then it throws:
UnboundLocalError: local variable 'pk' referenced before assignment
Your top-level grabPks doesn't return anything if it is rate limited. Think about this:

1. You call grabPks().
2. You're rate limited, so you go into the if statement and call grabPks() again.
3. This time it succeeds, so the inner grabPks() returns the value to the call above it.
4. The first call now falls out of the if statement, reaches the end of the function, and returns nothing, i.e. None.

Try return grabPks(pageNum) instead inside your if block.
Well okay... I needed to return grabPks to make it play nice:
def grabPks(pageNum):
    # cloudflare blocks bots... use the scraper library to get around this, or build your own
    # logic to store and use a manually generated cloudflare session cookie... I don't care 😎
    req = scraper.get("sumurl.com/" + str(pageNum)).content
    if req == b'Rate Limit Exceeded':
        print("adjust the rate limiting because they're blocking us :(")
        manPenalty = napLength * 3
        print("manually sleeping for {} seconds".format(manPenalty))
        time.sleep(manPenalty)
        print("okay let's try again... NOW SERVING {}".format(pageNum))
        return grabPks(pageNum)
    else:
        tree = html.fromstring(req)
        pk = tree.xpath("/path/small/text()")
        resCmpress = tree.xpath("path/a//text()")
        resXtend = tree.xpath("[path/td[2]/small/a//text()")
        balance = tree.xpath("path/font//text()")
        return pk, resCmpress, resXtend, balance
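As a design note, every retry here adds a stack frame, so sustained rate limiting could grow the recursion without bound. An iterative retry keeps the stack flat (a sketch reusing the question's placeholder names and xpaths, not the poster's actual code):

def grabPks(pageNum):
    # Loop instead of recursing: same retry/backoff behavior, flat stack.
    while True:
        req = scraper.get("sumurl.com/" + str(pageNum)).content
        if req != b'Rate Limit Exceeded':
            break
        manPenalty = napLength * 3
        print("manually sleeping for {} seconds".format(manPenalty))
        time.sleep(manPenalty)
    tree = html.fromstring(req)
    pk = tree.xpath("/path/small/text()")       # placeholder xpaths
    resCmpress = tree.xpath("path/a//text()")   # copied from the question
    resXtend = tree.xpath("path/td[2]/small/a//text()")
    balance = tree.xpath("path/font//text()")
    return pk, resCmpress, resXtend, balance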