How can I get deterministic hash values for class objects? - python-3.x

I have an application running in Python 3.9.4 where I store class objects in sets (along with many other kinds of objects). I'm getting non-deterministic behavior even when PYTHONHASHSEED=0 because class objects get non-deterministic hash codes. I assume that's because class objects' hash codes come from their addresses in memory.
For example, here are two runs of a little test program, where Before and Equation are classes:
print(hash(Before), hash(Equation), hash(int))
304555224 304593057 271715397
print(hash(Before), hash(Equation), hash(int))
326601328 293027788 273337413
How can I get Python to generate deterministic hash values for class objects?
Is there a metaclass or something that I could monkey-patch so that all class objects, even int, get a hash function that I specify?

Hash for classes is deterministic within the same process . Yes, in cPython it is memory based - but then you can't simply "move" a class object to another memory address using Python code.
If you happen to use some serialization/de-serialization transforms with the classes, the de-serialized objects will ordinarily be new objects, distinct from the original ones, and therefore will hash differently.
For the note: I could not reproduce the behavior you stated in the question: on the same process, the hashes for the class objects will be the same.
If you are calculating the hashes in different processes, though, the will differ. So, although you don't mention multiprocessing there, I assume that is your working case.
Then, indeed, implementing __hash__ and __eq__ proper methods on the metaclass can allow you a stable, across process, hashing - but you can't do that with built-in classes such as int: those are built in native code and can't be changed on the Python side. On the other hand, despite the hash number shown being different for these built-in classes, whatever you are using to serialize/deserialize your classes (that is what Python does for communicating data across processes, even if you don't do any explicit de/serializing) .
Then we come to, while it is straightforward to add __eq__ and __hash__ methods to a metaclass to your classes, it'd be better to ensure that on deserializing, it would always yield the same object (with the same ID). hash stability, as you put it, could possibly ensure you have always the same class, but it would depend on how you'd write your code: it is a bit tricky to retrieve the object instance that is already inside a set, if you check positively for containship of another instance that matches it - the most straightfoward way would be building a identity-dictionary out of a set, and then use the value:
my_registry_dict = {element: element for element in my_registry_set}
my_class = my_registry_dict[incoming_class]
With this in mind, we can have a custom metaclass that not only add __eq__ and __hash__- and you have to pick what elements of the classes you will want to compare for equality - class.__qualname__ can be a simple and functional attribute to use - but also customize the __new__ method so that upon de-serializing the same class a second time will always re-use the first class object defined in the current process (i.e.: ensuring the "singleton" behavior Python classes enjoy in non-corner cases like yours seems to be)
class Meta(type):
registry = {}
def __new__(mcls, name, bases, namespace):
cls = super().__new__(mcls, name, bases, namespace)
if cls not in mcls.registry:
mcls.registry[cls] = cls
else:
# reuse the previously created class
cls = mcls.registry[cls]
return cls
def __hash__(cls):
# when working with metaclasses, using the name `cls` instead of `self``
# helps reminding us that we are dealing with instances that are
# actually classes.
return hash(cls.__qualname__)
def __eq__(cls, other):
return cls.__qualname__ == other.__qualname__

Related

Python class instance changed during local function variable

Let's for example define a class and a function
class class1(object):
"""description of class"""
pass
def fun2(x,y):
x.test=1;
return x.value+y
then define a class instance and run it as a local variable in the function
z=class1()
z.value=1
fun2(z,y=2)
However, if you try to run the code
z.test
a result 1 would be returned.
That was, though the attribute to x was done inside the fun2() locally, it extended to class instance x globally as well. This seemed to violate the first thing one learn about the python function, the argument stays local unless being defined nonlocal or global.
How could this happen? Why the attribute to class inside a function extend outside the function.
I have even stranger example:
def fun3(a):
b=a
b.append(3)
mya = [1]
fun3(mya)
print(mya)
[1, 3]
>
I "copy" the array to a local variable and when I change it, the global one changes as well.
The problem is that the parameters are not passed by a value (basically as a copy of the values). In python they are passed by reference. In C terminology the function gets a pointer to the memory location. It's much faster that way.
Some languages will not let you to play with private attributes of an instance, but in Python it's your responsibility to make sure that does not happen. One other rule of OOP is that you should change the internal state of an instance just by calling its methods.
But you change the value directly.
Python is very flexible and allows you to do even the bad things. But it does not push you.
I always argue to have always at least vague understanding of the underlaying structure of any higher level language (memory model, how the variables are passed etc.). There is another argument for having some C/C++ knowledge. Most of the higher level languages are written in them or at least are inspired by them. A C++ programmer would see clearly what is going on.

In Python, is what are the differences between a method outside a class definition, or a method in it using staticmethod?

I have been working a a very dense set of calculations. It all is to support a specific problem I have.
But the nature of the problem is no different than this. Suppose I develop a class called 'Matrix' that has the machinery to implement matrices. Instantiation would presumably take a list of lists, which would be the matrix entries.
Now I want to provide a multiply method. I have two choices. First, I could define a method like so:
class Matrix():
def __init__(self, entries)
# do the obvious here
return
def determinant(self):
# again, do the obvious here
return result_of_calcs
def multiply(self, b):
# again do the obvious here
return
If I do this, the call signature for two matrix objects, a and b, is
a.multiply(b)...
The other choice is a #staticmethod. Then, the definition looks like:
#staticethod
def multiply(a,b):
# do the obvious thing.
Now the call signature is:
z = multiply(a,b)
I am unclear when one is better than the other. The free-standing function is not truly part of the class definition, but who cares? it gets the job done, and because Python allows "reaching into an object" references from outside, it seems able to do everything. In practice they'll (the class and the method) end up in the same module, so they're at least linked there.
On the other hand, my understanding of the #staticmethod approach is that the function is now part of the class definition (it defines one of the methods), but the method gets no "self" passed in. In a way this is nice because the call signature is the much better looking:
z = multiply(a,b)
and the function can access all the instances' methods and attributes.
Is this the right way to view it? Are there strong reasons to do one or the other? In what ways are they not equivalent?
I have done quite a bit of Python programming since answering this question.
Suppose we have a file named matrix.py, and it has a bunch of code for manipulating matrices. We want to provide a matrix multiply method.
The two approaches are:
define a free:standing function with the signature multiply(x,y)
make it a method of all matrices: x.multiply(y)
Matrix multiply is what I will call a dyadic function. In other words, it always takes two arguments.
The temptation is to use #2, so that a matrix object "carries with it everywhere" the ability to be multiplied. However, the only thing it makes sense to multiply it with is another matrix object. In such cases there are two equally good ways to do that, viz:
z=x.multiply(y)
or
z=y.multiply(x)
However, a better way to do it is to define a function inside the file that is:
multiply(x,y)
multiply(), as such, is a function any code using the 'library' expects to have available. It need not be associated with each matrix. And, since the user will be doing an 'import', they will get the multiply method. This is better code.
What I was wrongly confounding was two things that led me to the method attached to every object instance:
Functions which need to be generally available inside the file that should be
exposed outside it; and
Functions which are needed only inside the file.
multiply() is an example of type 1. Any matrix 'library' ought to likely define matrix multiplication.
What I was worried about was needing to expose all the 'internal' functions. For example, suppose we want to make externally available matrix add(), multiple() and invert(). Suppose, however, we did not want to make externally available - but needed inside - determinant().
One way to 'protect' users is to make determinant a function (a def statement) inside the class declaration for matrices. Then it is protected from exposure. However, nothing stops a user of the code from reaching in if they know the internals, by using the method matrix.determinant().
In the end it comes down to convention, largely. It makes more sense to expose a matrix multiply function which takes two matrices, and is called like multiply(x,y). As for the determinant function, instead of 'wrapping it' in the class, it makes more sense to define it as __determinant(x) at the same level as the class definition for matrices.
You can never truly protect internal methods by their declaration, it seems. The best you can do is warn users. the "dunder" approach gives warning 'this is not expected to be called outside the code in this file'.

When you override __eq__ in a class, do you also need to override __hash__?

I'm making a node class in Python 3 which I will be storing in a min-ordered multi-tree structure. I overrode the __eq__ method which tests for equality by comparing two unique integer instance variables.
Will this approach work for finding nodes in the structure by comparing equality?
Do I also need to override __hash__?
Yes, you have to override __hash__.
The general rule is: if two instances are equal (wrt to __eq__, i.e. a==b is True), then they must have the same hash. Otherwise, all sorts of things can misbehave.
Also, it sounds to me __eq__ is not enough for minimum. At the very least, you'd need to define __lt__.
I'm making a node class in Python 3 which I will be storing in a min-ordered multi-tree structure
If it's the only use, and they are only used internally so user code isn't ever supposed to see a Node, and your code doesn't store them in dicts or sets, or call any functions which do, maybe you can get away without overriding __hash__.
But those are very strong restrictions, which can't really be enforced. And there's no benefit to not overriding __hash__ to be consistent with __eq__. So you still should do it.

What's the difference between T and def in groovy?

I was working with some SQL earlier that got me wondering what the difference was between these two typings.
In my example, I have 2 GroovyRowResults - pastData and currentData. Now, I need to compare 2 points from these result sets. These values should both be of indefinite type. So, when defining them, what's the difference between
def pastResult = pastData[commonKey]
def currentResult = currentData[commonKey]
if(pastResult == currentResult){
doSomething()
}
and
T pastResult = pastData[commonKey]
T currentResult = currentData[commonKey]
if(pastResult == currentResult){
doSomething()
}
I'm assuming T has been declared in your method/class earlier. In that case, it's a generic and the T refers to the same type of object consistently, whereas def is basically just an alias for Object.
T doesn't guarantee the two objects are the exact same class (they may just implement the same interface, or one may be a subclass), it does create more of a contract in the objects that you are dealing with. If you pass the same types of objects into the method, then there will be no difference, but if you pass different or unexpected types, it is more useful.
In other words, in Groovy it's done for readability and consistency, and using generics is much better than using dynamic typing.
I don't think the second example will work unless there is some kind of object called T. Check this link
http://groovy-lang.org/semantics.html#_variable_definition

how to avoid class having a __dict__

I noticed many built-in classes do not have a __dict__ and even classes in modules such as numpy do not have __dict__ defined as they had been defined in C.
I want to define __getattr__, but I'm worried about a recursive loop creeping into the (long) code of one of my class (e.g. see the second reply Understanding the difference between __getattr__ and __getattribute__).
Is there a way I can disable creation of __dict__? Do I need to use __slots__ ?
Not having a __dict__ will not prevent you from writing an infinite recursive loop.
And yes, defining __slots__ will prevent the creation of a __dict__.

Resources