How to implement the "relationship" caching system in a similar query?

I noticed that when having a Model such as:

class User(Model):
    id = ...
    books = relationship('Book')

When calling user.books for the first time, SQLAlchemy queries the database (when lazy='select' for instance, which is the default), but subsequent calls to user.books don't hit the database. The results seem to have been cached.
I'd like to get the same behavior from SQLAlchemy when using a method that queries, for instance:

class User:
    def get_books(self):
        return Book.query.filter(Book.user_id == self.id).all()

But when doing that, if I call get_books() three times, SQLAlchemy queries the database three times (visible when setting the echo property to True).
How can I change get_books() to use SQLAlchemy's caching system?
I insist on saying "from SQLAlchemy" because I believe it handles the refresh/expunge/flush system, and changes are then re-queried from the DB if one of these happened. This is opposed to simply creating a caching property on the model with something like:

def get_books(self):
    if self._books is None:
        self._books = Book.query.filter(Book.user_id == self.id).all()
    return self._books

This does not play well with SQLAlchemy's flush/refresh/expunge.
So, how can I change get_books() to use SQLAlchemy's caching system?

Edit 1:
I realized that the solution provided below is not perfect: it caches per object. If you have two instances of the same user and call get_books on both, two queries are made, because the caching applies only to the instance, not globally, contrary to SQLAlchemy.
The reason is simple - I believe - but it is still unclear how to apply it in my case: the attribute is defined at the class level, not on the instance (books = relationship()), and relationships build their own query internally, so they can cache based on the query.
In the solution I gave, memoize_getter is unaware of the query being made, and as such cannot cache it for the same value across multiple instances, so any identical call made on another instance will query the database.
Original answer:
I've been trying to wrap my head around SQLAlchemy's code (wow, that's dense!), and I think I figured it out!
A relationship, at least with lazy='select' (the default), is an InstrumentedAttribute, whose __get__ method does the following:
def __get__(self, instance, owner):
    if instance is None:
        return self

    dict_ = instance_dict(instance)
    if self._supports_population and self.key in dict_:
        return dict_[self.key]
    else:
        try:
            state = instance_state(instance)
        except AttributeError as err:
            util.raise_(
                orm_exc.UnmappedInstanceError(instance),
                replace_context=err,
            )
        return self.impl.get(state, dict_)
So, a basic caching system respecting SQLAlchemy would be something like:

from sqlalchemy.orm.base import instance_dict

def get_books(self):
    dict_ = instance_dict(self)
    if 'books' not in dict_:
        dict_['books'] = Book.query.filter(Book.user_id == self.id).all()
    return dict_['books']
Now, we can push things a bit further and write... a decorator (oh sweet):
import functools

def memoize_getter(f):
    @functools.wraps(f)
    def decorator(instance, *args, **kwargs):
        property_name = f.__name__.replace('get_', '')
        dict_ = instance_dict(instance)
        if property_name not in dict_:
            dict_[property_name] = f(instance, *args, **kwargs)
        return dict_[property_name]
    return decorator
Thus transforming the original method into:

class User:
    @memoize_getter
    def get_books(self):
        return Book.query.filter(Book.user_id == self.id).all()
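
For illustration, a hedged usage sketch (it assumes a Flask-SQLAlchemy style setup with an engine created with echo=True, and at least one user row in the database):

user = User.query.first()
books_a = user.get_books()  # first call: one SELECT appears in the echo output
books_b = user.get_books()  # second call: served from instance_dict, no new SELECT
assert books_a is books_b   # the same cached list object is returned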
If someone has a better solution, I'm eagerly interested!

Related

Python multiprocess: run several instances of a class, keep all child processes in memory

First, I'd like to thank the StackOverflow community for the tremendous help it provided me over the years, without me having to ask a single question.
I could not find anything that I can relate to my problem, though it is probably due to my lack of understanding of the subject rather than the absence of an answer on the website. My apologies in advance if this is a duplicate.
I am relatively new to multiprocessing; some time ago I succeeded in using multiprocessing pools in a very simple way, where I didn't need any feedback between the child processes.
Now I am facing a much more complicated problem, and I am just lost in the documentation about multiprocessing. I hence ask for your help, your kindness and your patience.
I am trying to build a parallel tempering Monte Carlo algorithm from a class.
The basic class very roughly goes as follows:
import numpy as np

class monte_carlo:
    def __init__(self):
        self.x = np.ones((1000, 3))
        self.E = np.mean(self.x)
        self.Elist = []

    def simulation(self, temperature):
        self.T = temperature
        for i in range(3000):
            self.MC_step()
            if i % 10 == 0:
                self.Elist.append(self.E)
        return

    def MC_step(self):
        x = self.x.copy()
        k = np.random.randint(1000)
        x[k] = (x[k] + np.random.uniform(-1, 1, 3))
        temp_E = np.mean(x)  # energy of the proposed configuration x
        if np.random.random() < np.exp((self.E - temp_E) / self.T):
            self.E = temp_E
            self.x = x
        return
Obviously, I simplified a great deal (the actual class is 500 lines long!) and built fake functions for simplicity: __init__ takes a bunch of parameters as arguments, there are many more measurement lists besides self.Elist, and also many arrays derived from self.x that I use to compute them. The key point is that each instance of the class contains a lot of information that I want to keep in memory, and that I don't want to copy over and over again, to avoid dramatic slowdowns. Otherwise I would just use multiprocessing.Pool.
Now, the parallelization I want to do, in pseudo-code:
def proba(dE, pT):
    return np.exp(-dE/pT)

Tlist = [1.1, 1.2, 1.3]
N = len(Tlist)
G = []
for _ in range(N):
    G.append(monte_carlo())

for _ in range(5):
    for i in range(N):  # this loop should be run in parallel
        G[i].simulation(Tlist[i])
    for i in range(N//2):
        dE = G[i].E - G[i+1].E
        pT = G[i].T + G[i+1].T
        p = proba(dE, pT)  # proba is a function giving a probability depending on dE
        if np.random.random() < p:
            T_temp = G[i].T
            G[i].T = G[i+1].T
            G[i+1].T = T_temp
Synthesis: I want to run several instances of my monte_carlo class in parallel child processes, with different values of a parameter T, then periodically pause everything to exchange the different T's, and resume the child processes/class instances from where they paused.
Doing this, I want each class instance/child process to stay independent from the others, save its current state with all internal variables while it is paused, and make as few copies as possible. This last point is critical, as the arrays inside the class are quite big (some are 1000x1000), and copying would therefore very quickly become quite time-costly.
Thanks in advance, and sorry if I am not clear...
Edit:
I am using a distant machine with many (64) CPUs, running Debian GNU/Linux 10 (buster).
Edit2:
I made a mistake in my original post: in the end, the temperatures must be exchanged between the class instances, and not inside the global Tlist.
Edit3: Charchit's answer works perfectly for the test code, on both my personal machine and the distant machine I usually use for running my codes. I hence mark it as the accepted answer.
However, I want to report here that, when inserting the actual, more complicated code instead of the oversimplified monte_carlo class, the distant machine gives me some strange errors:
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gtk-WARNING **: ##:##:##:###: Locale not supported by C library.
Using the fallback 'C' locale.
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
The "##:##:##:###" are (or seems like) IP adresses.
Without the call to set_start_method('spawn') this error shows only once, in the very beginning, while when I use this method, it seems to show at every occurrence of result.get()...
The strangest thing is that the code seems otherwise to work fine, does not crash, produces the datafiles I then ask it to, etc...
I think this would deserve to publish a new question, but I put it here nonetheless in case someone has a quick answer.
If not, I will resort to add one by one the variables, methods, etc... that are present in my actual code but not in the test example, to try and find the origin of the bug. My best guess for now is that the memory space required by each child-process with the actual code, is too large for the distant machine to accept it, due to some restrictions implemented by the admin.
What you are looking for is sharing state between processes. As per the documentation, you can either create shared memory, which is restrictive about the data it can store and is not thread-safe but offers better speed and performance, or you can use server processes through managers. The latter is what we are going to use, since you want to share whole objects of user-defined types. Keep in mind that using managers will impact the speed of your code, depending on the complexity of the arguments that you pass to and receive from the managed objects.
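
To make the distinction concrete, here is a minimal sketch of a manager-backed container (illustrative only; the actual solution below uses a custom proxy so that whole user-defined objects can be shared):

from multiprocessing import Manager

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.list([0, 0, 0])  # the list lives in the manager's server process
        shared[0] = 42                    # mutations are forwarded through a proxy
        print(list(shared))               # [42, 0, 0]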
Managers, proxies and pickling
As mentioned, managers create server processes to store objects and allow access to them through proxies. I have answered a question with better details on how they work, and how to create a suitable proxy, here. We are going to use the same proxy defined in the linked answer, with some variations: namely, I have replaced the factory functions inside __getattr__ with something that can be pickled using pickle. This means that you can run instance methods of managed objects created with this proxy without resorting to using multiprocess. The result is this modified proxy:
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np

class A:
    def __init__(self, name, method):
        self.name = name
        self.method = method

    def get(self, *args, **kwargs):
        return self.method(self.name, args, kwargs)

class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user-defined data type. The proxy instance will have the
    namespace and functions of the data type (except private/protected callables/attributes).
    Furthermore, the proxy will be picklable and its state can be shared among different processes."""

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            return A(name, self._callmethod).get
        return result
Solution
Now we only need to make sure that when we create objects of monte_carlo, we do so using managers and the above proxy. For that, we add a class constructor called create; all monte_carlo objects should be created with this function. With that, the final code looks like this:
from multiprocessing import Pool
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np

class A:
    def __init__(self, name, method):
        self.name = name
        self.method = method

    def get(self, *args, **kwargs):
        return self.method(self.name, args, kwargs)

class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user-defined data type. The proxy instance will have the
    namespace and functions of the data type (except private/protected callables/attributes).
    Furthermore, the proxy will be picklable and its state can be shared among different processes."""

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            return A(name, self._callmethod).get
        return result

class monte_carlo:
    def __init__(self):
        self.x = np.ones((1000, 3))
        self.E = np.mean(self.x)
        self.Elist = []
        self.T = None

    def simulation(self, temperature):
        self.T = temperature
        for i in range(3000):
            self.MC_step()
            if i % 10 == 0:
                self.Elist.append(self.E)
        return

    def MC_step(self):
        x = self.x.copy()
        k = np.random.randint(1000)
        x[k] = (x[k] + np.random.uniform(-1, 1, 3))
        temp_E = np.mean(x)  # energy of the proposed configuration x
        if np.random.random() < np.exp((self.E - temp_E) / self.T):
            self.E = temp_E
            self.x = x
        return

    @classmethod
    def create(cls, *args, **kwargs):
        # Register class
        class_str = cls.__name__
        BaseManager.register(class_str, cls, ObjProxy, exposed=tuple(dir(cls)))
        # Start a manager process
        manager = BaseManager()
        manager.start()
        # Create and return this proxy instance. Using this proxy allows sharing of state between processes.
        inst = eval("manager.{}(*args, **kwargs)".format(class_str))
        return inst

def proba(dE, pT):
    return np.exp(-dE/pT)

if __name__ == "__main__":
    Tlist = [1.1, 1.2, 1.3]
    N = len(Tlist)
    G = []

    # Create our managed instances
    for _ in range(N):
        G.append(monte_carlo.create())

    for _ in range(5):
        # Run simulations in the manager server
        results = []
        with Pool(8) as pool:
            for i in range(N):  # this loop should be run in parallel
                results.append(pool.apply_async(G[i].simulation, (Tlist[i], )))

            # Wait for the simulations to complete
            for result in results:
                result.get()

        for i in range(N // 2):
            dE = G[i].E - G[i + 1].E
            pT = G[i].T + G[i + 1].T
            p = proba(dE, pT)  # proba is a function giving a probability depending on dE
            if np.random.random() < p:
                T_temp = Tlist[i]
                Tlist[i] = Tlist[i + 1]
                Tlist[i + 1] = T_temp

    print(Tlist)
This meets the criteria you wanted. It does not create any copies at all; rather, all arguments to the simulation method call are serialized inside the pool and sent to the manager server, where the object is actually stored. The method gets executed there, and the results (if any) are serialized and returned to the main process. All of this using only the standard library!
Output
[1.2, 1.1, 1.3]
Edit
Since you are using Linux, I encourage you to use multiprocessing.set_start_method inside the if __name__ ... clause to set the start method to "spawn". Doing this will ensure that the child processes do not have access to variables defined inside the clause.
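A minimal sketch of what that looks like (the driver code itself stays as above):

from multiprocessing import set_start_method

if __name__ == "__main__":
    set_start_method("spawn")  # must be called before any Pool or manager is created
    # ... create the managed monte_carlo instances and run the loops as above ...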

How to define the same field for load_only and dump_only params at the Marshmallow scheme?

I am trying to build a marshmallow schema to both load and dump data, and I get everything OK except one field.
Problem description
(If you understand the problem, you don't have to read this.)
For loading, the field's type is Decimal, and I used it like this before. Now I want to use this schema for dumping too, and for that my flask API responds with: TypeError: Object of type Decimal is not JSON serializable. OK, I understand. I changed the type to Float. Then my legacy code started to get an exception while trying to save that field to the database (it takes Decimal only). I don't want to change the legacy code, so I looked for a solution in the marshmallow docs and found the load_only and dump_only params. It seems like those are what I wanted, but here is my problem - I want to set them on the same field. So I wondered if I could define both fields and tried this:
class PaymentSchema(Schema):
    money = fields.Decimal(load_only=True)
    money = fields.Float(dump_only=True)
I was expecting a miracle, of course. Actually I was thinking that it would skip the first definition (or rather, re-define it). What I got is the absence of the field altogether.
Workaround solution
So I tried another solution. I created another schema for dumping and inherited it from the former schema:
class PaymentSchema(Schema):
    money = fields.Decimal(load_only=True)

class PaymentDumpSchema(PaymentSchema):
    money = fields.Float(dump_only=True)
It works. But I wonder if there is another, native, "marshmallow-way" solution for this. I have been looking through the docs but I can't find anything.
You can use the marshmallow decorator @pre_load. In this decorator you can do whatever you want and return the value with your desired type:

from marshmallow import pre_load

Import it like this; inside the decorated method you get your payload and can change the type as per your requirement.
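A hedged sketch of what that suggestion might look like (my illustration; the money field name comes from the question, and as_string=True is one possible way to keep dumps JSON-serializable):

from marshmallow import Schema, fields, pre_load

class PaymentSchema(Schema):
    # as_string=True dumps the Decimal as a string, which is JSON serializable
    money = fields.Decimal(as_string=True)

    @pre_load
    def coerce_money(self, data, **kwargs):
        # Normalize the incoming value so Decimal validation succeeds
        if 'money' in data:
            data['money'] = str(data['money'])
        return data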
UPD: I found a good solution finally.
NEW SOLUTION
The trick is to define your field in load_fields and dump_fields inside the __init__ method.
from marshmallow.fields import Integer, String, Raw
from marshmallow import Schema

class ItemDumpLoadSchema(Schema):
    item = Raw()

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not (self.only and 'item' not in self.only) and \
                not (self.exclude and 'item' in self.exclude):
            self.load_fields['item'] = Integer(missing=0)
            self.dump_fields['item'] = String()
Usage:
>>> ItemDumpLoadSchema().load({})
{'item': 0}
>>> ItemDumpLoadSchema().dump({'item': 0})
{'item': '0'}
Don't forget to declare the field on the schema itself with some placeholder field (Raw in my example) - otherwise it may raise an exception in some cases (e.g. when using the only and exclude keywords).
OLD SOLUTION
A little perverted one. It is based on @prashant-suthar's answer. I named the load field with the suffix _load and implemented @pre_load, @post_load and error handling.
class ArticleSchema(Schema):
    id = fields.String()
    title = fields.String()
    text = fields.String()

class FlowSchema(Schema):
    article = fields.Nested(ArticleSchema, dump_only=True)
    article_load = fields.Int(load_only=True)

    @pre_load
    def pre_load(self, data, *args, **kwargs):
        if data.get('article'):
            data['article_load'] = data.pop('article')
        return data

    @post_load
    def post_load(self, data, *args, **kwargs):
        if data.get('article_load'):
            data['article'] = data.pop('article_load')
        return data

    def handle_error(self, exc, data, **kwargs):
        if 'article_load' in exc.messages:
            exc.messages['article'] = exc.messages.pop('article_load')
        raise exc
Why is the old solution not a good one?
It doesn't allow inheriting schemas with different handle_error methods defined, and you have to give the pre_load and post_load methods different names.
Pass the data_key argument to the field definition
The documentation mentions that the data_key parameter can be used along with dump_only or load_only to have the same field with different functionality.
So you can write your schema as...
class PaymentSchema(Schema):
    decimal_money = fields.Decimal(data_key="money", load_only=True)
    money = fields.Float(dump_only=True)
This should solve your problem. I am using data_key for a similar problem in marshmallow with SQLAlchemyAutoSchema, and this fixed my issue.
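For illustration, a quick usage sketch of how this schema behaves (values are assumed):

>>> schema = PaymentSchema()
>>> schema.load({'money': '10.50'})  # the incoming "money" key feeds the load-only Decimal field
{'decimal_money': Decimal('10.50')}
>>> schema.dump({'money': 10.5})     # dumping uses the dump-only Float field
{'money': 10.5}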
Edit
Note: the key in ValidationError.messages (error messages) will be decimal_money by default. You may tweak the handle_error method of the Schema class to replace decimal_money with money, but it is not recommended, as you may then not be able to differentiate between the error messages of the two fields.
Thanks.

Get a user's keyboard input that was requested by another function

I am using a python package for database management. The provided class has a method delete() that deletes a record from the database. Before deleting, it asks the user to verify the operation from a console, e.g. Proceed? [yes, No]:
My function needs to perform other actions depending on whether the user chose to delete the record. Can I get the user's input that was requested by the function from the package?
Toy example:
def ModuleFunc():
    while True:
        a = input('Proceed? [yes, No]:')
        if a in ['yes', 'No']:
            # Perform some actions under the hood
            return
This function will wait for one of the two responses and return None once it gets either. After calling this function, can I determine the user's response (without modifying the function)? I think modifying the package's source code is not a good idea in general.
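For concreteness, one minimal sketch of the general idea (my own illustration, assuming the ModuleFunc above): temporarily wrap builtins.input so the response is recorded without modifying the function that asks for it.

import builtins

_real_input = builtins.input
captured = []

def recording_input(prompt=''):
    answer = _real_input(prompt)
    captured.append(answer)  # remember what the user typed
    return answer

builtins.input = recording_input
try:
    ModuleFunc()             # prompts the user as usual
finally:
    builtins.input = _real_input  # always restore the real input

print(captured[-1])          # 'yes' or 'No'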
Why not just patch the class at runtime? Say you had a file ./lib/db.py defining a class DB like this:
class DB:
    def __init__(self):
        pass

    def confirm(self, msg):
        a = input(msg + ' [Y, N]:')
        if a == 'Y':
            return True
        return False

    def delete(self):
        if self.confirm('Delete?'):
            print('Deleted!')
        return
Then in main.py you could do:
from lib.db import DB

def newDelete(self):
    if self.confirm('Delete?'):
        print('Do some more stuff!')
        print('Deleted!')
    return

DB.delete = newDelete

test = DB()
test.delete()
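
A variation on the same idea (a sketch, reusing the DB class above): record the user's answer instead of replacing delete entirely:

from lib.db import DB

_original_confirm = DB.confirm
last_answer = {}

def recording_confirm(self, msg):
    result = _original_confirm(self, msg)
    last_answer['value'] = result  # stash the user's decision for the caller
    return result

DB.confirm = recording_confirm

test = DB()
test.delete()
print(last_answer['value'])  # True if the user typed 'Y'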
I would save key events somewhere (a file or memory) with something like a keylogger. Then you would be able to reuse the last one.
However, if you can modify the package 📦 and redistribute it, that would be easier.
Change return to return a, so that ModuleFunc returns the user's answer.

Two python(3) classes with the same input (extending a python class)

I have a design change I have been trying to implement with little success, as I can't seem to find my question answered anywhere.
Currently I have a python class that creates a database connection, stores the index name (table), and other attributes (specifically, it's an Elasticsearch database connection, but that shouldn't matter for this question).
from elasticsearch import Elasticsearch

class Create:
    # Functions to manipulate Index Objects
    def __init__(self, index, type, host, shards=3, replicas=1):
        # Create Index Object (OcrBook or OcrPage)
        self.index = index
        self.type = type
        self.shards = shards
        self.replicas = replicas
        self.es_connection = Elasticsearch([{'host': host, 'port': 9200}])
Associated with this class are functions to manipulate the index objects, for example creating that index (table) on the database (cluster) or modifying that table in some way.
def create_index(self):
    # Creates/Executes Index
    try:
        self.es_connection.indices.create(
            index=self.index,
            body={
                'settings': {
                    'number_of_shards': self.shards,
                    'number_of_replicas': self.replicas,
                }
            })
    except Exception:
        CreateLog.write_log(Exception, 'Create Index Exception')
These being in the same class makes sense to me, as the connection to the table/database and creating or modifying that table/database are connected to each other.
I also have a group of other functions that search that particular table. These, I believe, should be in a separate class: rather than creating or modifying the table/database, they simply search it, and could ideally take any table/database initialized by the Create class. Currently I tried breaking them up by doing the following:
class Search(Create):
    def find_book(self, bookkey):
        """ Finds a Book """
        try:
            results = self.es_connection.search(self.index, self.type, body={
                "query": {
                    "match": {
                        "BookKey": bookkey
                    }
                }
            })
            return results['hits']['hits']
        except Exception:
            CreateLog.write_log(Exception, 'Could Not Find Book')
This works on Windows, but is not portable to Linux, as the class "has not been initialized" when I try to use the Search functionality. I know there is a design problem here, and I could combine both classes into one to fix it, but I would like to keep them separate. Is there a better way to "inherit" (I don't believe that's the right word here) the object created in the Create class from the Search class? Does anyone have a better way to separate these logically, or a better way to extend the Create class with the search functionality? All input is helpful! Thank you.
You seem to be on an OOP path, but why exactly does Search have to be a class? You have a perfect task for a stand-alone function find_book(index_object, bookkey). It does not store anything internally; I do not see why this has to be a class rather than a function.
Class naming can also hint at your design decisions (or problems). A class name is usually a noun; a function name tends to be a verb. Create is not a perfect class name to me.
In your setting I'd go with a class IndexObjects (that is, Create renamed) and a function find_book(index_object, bookkey), sketched just below. You can switch to more OOP once this design is up and running.
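For concreteness, a minimal sketch of that stand-alone function (it reuses the attributes from the question's Create class; nothing here is from the original answer):

def find_book(index_object, bookkey):
    # Uses an existing connection object; stores no state of its own
    query = {"query": {"match": {"BookKey": bookkey}}}
    results = index_object.es_connection.search(index_object.index, index_object.type, body=query)
    return results['hits']['hits']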
Another split of responsibilities that comes to mind is below. Here you inject rather than inherit, which lets you keep the parts more independent.
class IndexObject:
    # ...
    def query(self, query_dict):
        return self.es_connection.search(self.index, self.type, body=query_dict)

class BookSearcher:
    def __init__(self, index_object):
        self.index_object = index_object

    def find(self, book_key):
        """ Finds a Book """
        query_dict = {
            "query": {
                "match": {
                    "BookKey": book_key
                }
            }
        }
        try:
            results = self.index_object.query(query_dict)
            return results['hits']['hits']
        # FIXME: looks like a bare Exception catch, not great
        except Exception:
            CreateLog.write_log(Exception, 'Could Not Find Book')
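
A hedged usage sketch (the constructor arguments are hypothetical, mirroring the question's Create signature):

index_object = IndexObject('ocr_book', 'book', 'localhost')  # hypothetical index/type/host values
searcher = BookSearcher(index_object)
hits = searcher.find('some-book-key')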

Load inconsistent data in pymongo

I am working with pymongo and want to ensure that saved data can be loaded even if additional data elements have been added to the schema.
I have used this for classes that don't need to have the information processed before assigning it to class attributes:
class MyClass(object):
    def __init__(self, instance_id):
        # Set default values
        self.database_id = instance_id
        self.myvar = 0
        # Load values from database
        self.__load()

    def __load(self):
        data_dict = Collection.find_one({"_id": self.database_id})
        for key, attribute in data_dict.items():
            self.__setattr__(key, attribute)
However, in classes where I have to process the data from the database, this doesn't work:
class Example(object):
    def __init__(self, name):
        self.name = name
        self.database_id = None
        self.member_dict = {}
        self.load()

    def load(self):
        data_dict = Collection.find_one({"name": self.name})
        self.database_id = data_dict["_id"]
        for element in data_dict["element_list"]:
            self.process_element(element)
        for member_name, member_info in data_dict["member_class_dict"].items():
            self.member_dict[member_name] = MemberClass(member_info)

    def process_element(self, element):
        print("Do Stuff")
Two example use cases I have are:
1) A list of strings that are used to set flags; this is done by calling a function with the string as the argument (process_element above).
2) A dictionary of dictionaries that is used to create a list of instances of a class (MemberClass(member_info) above).
I tried creating properties to handle this but found that __setattr__ doesn't look for properties.
I know I could redefine __setattr__ to look for specific names, but it is my understanding that this would slow down all attribute assignments on the class, and I would prefer to avoid that.
I also know I could use a bunch of try/excepts to catch the errors, but this would make the code very bulky.
I don't mind the load function being slowed down a bit, but I very much want to avoid anything that slows down the class outside of loading.
So the solution I came up with is to use the idea behind overriding the __setattr__ method, but to handle the special cases in the load function instead of in __setattr__.
def load(self):
    data_dict = Collection.find_one({"name": self.name})
    for key, attribute in data_dict.items():
        if key == "_id":
            self.database_id = attribute
        elif key == "element_list":
            for element in attribute:
                self.process_element(element)
        elif key == "member_class_dict":
            for member_name, member_info in attribute.items():
                self.member_dict[member_name] = MemberClass(member_info)
        else:
            self.__setattr__(key, attribute)
This provides all of the functionality of overriding the __setattr__ method without slowing down any future calls to __setattr__ outside of loading the class.
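If the list of special-case keys keeps growing, one possible variation (a sketch of an alternative, not part of the original solution) is a dispatch table mapping keys to handlers, which keeps load flat:

def load(self):
    data_dict = Collection.find_one({"name": self.name})
    # Keys with special handling; everything else falls through to __setattr__
    handlers = {
        "_id": lambda v: setattr(self, "database_id", v),
        "element_list": lambda v: [self.process_element(e) for e in v],
        "member_class_dict": lambda v: self.member_dict.update(
            {name: MemberClass(info) for name, info in v.items()}),
    }
    for key, attribute in data_dict.items():
        handlers.get(key, lambda v, key=key: setattr(self, key, v))(attribute)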
