Binary Tree/Search Tree - python-3.x

I'm starting on data structures and algorithms in Python and I am on binary trees right now. However, I'm really confused about how to implement them. I see some implementations with two classes, one for Node and one for the tree itself, like so:
class Node:
    def __init__(self, data):
        self.right = None
        self.left = None
        self.data = data

class Tree:
    def __init__(self):
        self.root = None

    def insert(self): ...
    def preorder(self): ...
    ...
However, I also see implementations where there is no Node class and instead all the code goes inside the Tree class, without the following:
def __init__(self):
    self.root = None
Can someone please help me understand why there are two different ways, the pros and cons of each method, the differences between them, and which method I should follow to implement a binary tree?
Thank you!

Yes, there are these two ways.
First of all, in the 1-class approach, that class is really the Node class (possibly named differently) of the 2-class approach. True, all the methods are then on the Node class, but that could also have been the case in the 2-class approach, where the Tree class would defer much of the work to methods on the Node class. So what the 1-class approach is really missing is the Tree class. The name of the class can obscure this observation, but it is better to see it that way, even when the methods for working with the tree are on the Node class.
The difference becomes apparent when you need to represent an empty tree. With the 1-class approach you would have to set your main variable to None. But that None is a different concept than a tree without nodes: on an object that represents an empty tree one can still call methods to add nodes again, while you cannot call methods on None. It also means that the caller needs to be aware of when to change its own variable to None or to a Node instance, depending on whether the tree happens to be (or become) empty.
With the 2-class approach this management is taken out of the hands of the caller and handled inside the Tree instance. The caller always uses its single reference to that instance, independently of whether the tree is empty or not.
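For concreteness, here is a minimal sketch of the 2-class approach (the ordered, BST-style insert is just one possible policy):
class Node:
    def __init__(self, data):
        self.right = None
        self.left = None
        self.data = data

class Tree:
    def __init__(self):
        self.root = None  # an empty tree is still a Tree instance

    def insert(self, data):
        if self.root is None:
            self.root = Node(data)  # the caller never sees this change
            return
        node = self.root
        while True:
            side = "left" if data < node.data else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(data))
                return
            node = child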
Examples
1. Creating an empty tree
1-class approach:
root = None
2-class approach:
tree = Tree()
2. Adding a first value to an empty tree
1-class approach:
# Cannot use an insert method, as there is no instance yet
root = Node(1) # Must get the value for the root
2-class approach:
tree.insert(1) # No need to be aware of a changing root
As the 1-class approach needs to get the value of the root in this case, you often see that with this approach, the root is always captured like that, even when the tree was not empty. In that case the value of the caller's root variable will not really change.
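In code, that convention typically looks like this (assuming an insert method that returns the possibly new root):
root = root.insert(2) # reassign on every call; root only actually changes when the tree was empty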
3. Deleting the last value from a tree
1-class approach:
root = root.delete(1) # Must get the new value for the root; could be None!
2-class approach:
tree.delete(1) # No need to be aware of a changing root
Even though the deletion might in general not lead to an empty tree, in the 1-class approach the caller must take that possibility into account and always assign to its own root reference, as it might become None.
Variants for 1-class systems
1. A None value
Sometimes you see approaches where an empty tree is represented by a Node instance that has a None value, to indicate this is not really data. Then when the first value is added to the tree, that node's value is updated to that value. In this way the caller no longer has to manage the root: it will always be a reference to the same Node instance, which sometimes has a None value to indicate the tree is actually empty.
The code for the caller to work with this 1-class variant would become the same as with the 2-class approach.
This is bad practice: in some cases you may want a tree to be able to have None as a data value, and then you cannot make a tree with such a value, as it would be mistaken for the dummy node that represents an empty tree.
2. Sentinel node
This is a variant of the above. Here the dummy node is always there. The empty tree is represented by that single node instance. When the first value is inserted into the tree, this dummy node's value is never updated; instead, the insertion happens at its left member. So the dummy node never plays a role in holding data, but is just the container that maintains the real root of the tree in its left member. This really means that the dummy node is the parallel of what the Tree instance is in the 2-class approach: instead of having a proper root member, it has a left member for that role, and it has no use for its right member.
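A sketch of this sentinel variant, assuming the usual BST ordering:
class Node:
    def __init__(self, data=None):
        self.left = None
        self.right = None
        self.data = data

def insert(sentinel, data):
    # The sentinel's left member plays the role of Tree.root.
    if sentinel.left is None:
        sentinel.left = Node(data)  # first real node
        return
    node = sentinel.left
    while True:
        side = "left" if data < node.data else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(data))
            return
        node = child

tree = Node()    # the sentinel itself represents the empty tree
insert(tree, 1)  # the caller's reference to `tree` never changes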
How native data structures are implemented
The 2-class approach is more like how native data types work. For instance, take the standard list in Python. An empty list is represented by [], not by None. There is a clear distinction between [] and None. Once you have the list reference (like lst = []), that reference never changes. You can append and delete, but you never have to assign to lst itself.
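A quick illustration of that point: the list object's identity never changes, however its contents change:
lst = []                   # an empty list object, not None
before = id(lst)
lst.append(1)              # mutate in place
lst.pop()                  # empty again, still the same object
print(id(lst) == before)   # True: no reassignment needed, unlike root = None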

Related

Python class instance changed during local function variable

Let's for example define a class and a function:
class class1(object):
    """description of class"""
    pass

def fun2(x, y):
    x.test = 1
    return x.value + y
then define a class instance and pass it into the function as an argument:
z = class1()
z.value = 1
fun2(z, y=2)
However, if you now evaluate
z.test
the result 1 is returned.
That is, although the attribute on x was set inside fun2() locally, the change extended to the class instance z globally as well. This seems to violate the first thing one learns about Python functions: that arguments stay local unless declared nonlocal or global.
How could this happen? Why does setting an attribute on an instance inside a function extend outside the function?
I have an even stranger example:
def fun3(a):
    b = a
    b.append(3)

mya = [1]
fun3(mya)
print(mya)
[1, 3]
I "copy" the array to a local variable and when I change it, the global one changes as well.
The problem is that the parameters are not passed by value (basically as copies of the values). In Python they are passed as references to the objects themselves (call by sharing); in C terminology the function gets a pointer to the memory location. It's much faster that way. The consequence: rebinding the parameter name inside the function stays local, but mutating the object it points to is visible to everyone holding a reference to that object.
Some languages will not let you play with the private attributes of an instance, but in Python it is your responsibility to make sure that does not happen. One other rule of OOP is that you should change the internal state of an instance only by calling its methods. But here you change the value directly.
Python is very flexible and allows you to do even the bad things. But it does not push you to do them.
I always argue for having at least a vague understanding of the underlying structure of any higher-level language (memory model, how variables are passed, etc.). This is another argument for having some C/C++ knowledge: most of the higher-level languages are written in them, or at least inspired by them. A C++ programmer would see clearly what is going on.
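The distinction is easiest to see side by side in a minimal sketch: rebinding the parameter name stays local, while mutating the object it refers to is visible to every holder of a reference.
def rebind(x):
    x = [99]         # binds the local name x to a new object; caller unaffected

def mutate(x):
    x.append(99)     # changes the object the caller also references

a = [1]
rebind(a)
print(a)             # [1]
mutate(a)
print(a)             # [1, 99]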

Are collections of inner aggregates valid?

Let's say I have an AggregateRoot called Folder with a collection of sub-folders like this:
class Folder : AggregateRoot
{
    string Name { get; }
    ICollection<Folder> Folders { get; }
}
The inner collection here is really just a list of aggregate ids of other Folder aggregates, which is resolved lazily when enumerated.
Is this construct valid in domain modeling, where an aggregate not only references other aggregates but also defines its Folders property as a collection of other Folder aggregates?
Why? The example above may not be particularly good, but my goal is mainly to have a natural way of working with aggregate collections and to hide under the surface the fact that aggregate references are resolved through a repository. I want to work with aggregates as easily as with entity collections.
My thinking here is also that the sub-folders are in some way owned by the parent aggregate, and that the collection really is a way of representing a place where the aggregates are stored, even if that is not really true, as they are stored more generally through a repository.
The example with the recursiveness was not really important. The focus is on the fact that an aggregate "seems" to own other aggregates. And when making a change in two folders it would only be possible to save them one by one, of course, but that should be OK. I would also have to include some rule that folders can only be created in place and not added manually, so that they cannot turn up in more than one aggregate collection.
Child structures are valid use cases in domain modeling, often encountered in recursive concepts like Groups, Tags, etc., just like in your example of Folders. And I like dealing with them as pure collection objects in the domain layer, with no hint of persistence logic whatsoever. When writing such domain logic, I like to imagine that I am dealing with objects as if they will be perpetually preserved in RAM.
I will consider your recursive example for my explanation, but the same concept applies to any "collection" of child objects, not just recursive relationships.
Below is a sample implementation in pseudocode, annotated with comments. I apologize in advance that the code is closer to Python in structure. I wanted to convey the idea accurately, and not worry about how to represent it in C#, in which I am not well versed. Please ask questions about the code if something is not clear.
Notes on the pseudocode:
In the domain layer, you deal with collections simply as if they were just another list/collection, without having to worry about the underlying persistence complexities.
FolderService is an ApplicationService that is typically invoked by the API. This service is responsible for assembling the infrastructure services, interacting with the domain layer, and eventual persistence.
FolderTable is an imaginary database representation of the Folder object. FolderRepository alone knows about this class and its implementation details.
The complexities of saving and retrieving the Folder object from DB would be present only in the FolderRepository class.
The load_by_name repository method eagerly loads and populates all subfolders into the parent folder. We could convert this to lazy loading, only on access, and never load subfolders unless we are traversing them (this could even be paginated, based on requirements, especially if there is no specific limit on the number of subfolders).
class Folder(AggregateRoot):
    name: str
    folders: List[Folder]

    @classmethod
    def construct_from_args(cls, params):
        # This is a factory method
        return Folder(**params)

    def add_sub_folder(self, new_folder: Folder) -> None:
        self.folders.append(new_folder)

    def remove_sub_folder(self, existing_folder: Folder) -> None:
        # Dummy implementation. Actual implementation will be more involved
        self.folders = [f for f in self.folders if f.name != existing_folder.name]
class FolderService(ApplicationService):
    def add_sub_folder(self, parent_folder_name: str, new_folder_params: dict) -> None:
        folder_repo = _system.get_folder_repository()
        parent_folder = folder_repo.load_by_name(parent_folder_name)
        new_sub_folder = Folder.construct_from_args(new_folder_params)
        parent_folder.add_sub_folder(new_sub_folder)
        folder_repo.save(parent_folder)
class FolderTable(DBObject):
    # A DB representation of the Folder domain object
    # `parent_id` will be empty for the root folder
    name: str
    parent_id: int
class FolderRepository(Repository):
    # Constructor for Repository
    # that has `connection`s to the database, for later use
    # FolderRepository works exclusively with the `FolderTable` class

    def load_by_name(self, folder_name: str) -> Folder:
        # Load a folder, including its subfolders, from the database
        persisted_folder = self.connection.find(name=folder_name)
        parent_identifier = persisted_folder.id
        sub_folders = self.connection.find(parent_id=parent_identifier)
        for sub_folder in sub_folders:
            persisted_folder.folders.append(sub_folder)
        return persisted_folder

    def save(self, folder: Folder) -> None:
        persisted_folder = self.connection.find(name=folder.name)
        parent_identifier = persisted_folder.id
        # Gather the persisted list of folders from the database
        persisted_sub_folders = self.connection.find(parent_id=parent_identifier)
        # The heart of the persistence logic, with three conditions:
        # - If the subfolder is already persisted, do nothing
        # - If there is a persisted subfolder that is no longer a part
        #   of folder.folders, remove it
        # - If the subfolder is not among those already persisted, add it
        # (`remove` and `add` are imaginary connection operations, like `find`)
        current_names = {sub.name for sub in folder.folders}
        persisted_names = {sub.name for sub in persisted_sub_folders}
        for sub in persisted_sub_folders:
            if sub.name not in current_names:
                self.connection.remove(sub)
        for sub in folder.folders:
            if sub.name not in persisted_names:
                self.connection.add(sub, parent_id=parent_identifier)
If you see holes in this implementation or my thought-process, please do point them out.
When an aggregate contains another, one should avoid direct references, as there is usually some unwanted pain that follows.
Any lazy loading would indicate that you could redesign things a bit, as you should avoid that as well.
The most common pattern is to have either only a list of ids or a list of value objects. The latter appears to be more appropriate in your case. You could then always have a fully loaded AR with all the relevant folders. In order to navigate you would need to retrieve the relevant folder.
This particular example has some peculiarities since it represents a hierarchy, but you would have to deal with those on a case-by-case basis.
In short: having any aggregate reference another, whether it be through a collection or otherwise, is ill-advised.
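To make the list-of-ids option concrete, here is a minimal Python sketch (names are illustrative, continuing the Python-flavoured pseudocode above) of an aggregate that references sub-folders only by id:
class Folder:
    # Aggregate root that holds identifiers, not Folder objects
    def __init__(self, folder_id, name):
        self.id = folder_id
        self.name = name
        self.sub_folder_ids = []

    def add_sub_folder(self, folder_id):
        self.sub_folder_ids.append(folder_id)

# Navigation goes through the repository, outside the aggregate:
# child = folder_repo.load(parent.sub_folder_ids[0])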

How to define a strategy in hypothesis to generate a pair of similar recursive objects

I am new to hypothesis and I am looking for a way to generate a pair of similar recursive objects.
My strategy for a single object is similar to this example in the hypothesis documentation.
I want to test a function which takes a pair of recursive objects A and B and the side effect of this function should be that A==B.
My first approach would be to write a test which gets two independent objects, like:
@given(my_objects(), my_objects())
def test_is_equal(a, b):
    my_function(a, b)
    assert a == b
But the downside is that hypothesis does not know that there is a dependency between these two objects, and so they can be completely different. That is a valid test and I want to test that too.
But I also want to test complex recursive objects which are only slightly different.
And maybe Hypothesis is able to shrink a pair of very different objects where the test fails down to a pair of only slightly different objects where the test fails in the same way.
This one is tricky - to be honest I'd start by writing the same test you already have above, and just turn up the max_examples setting a long way. Then I'd probably write some traditional unit tests, because getting specific data distributions out of Hypothesis is explicitly unsupported (i.e. we try to break everything that assumes a particular distribution, using some combination of heuristics and a bit of feedback).
How would I actually generate similar recursive structures though? I'd use a @composite strategy to build them at the same time, and for each element or subtree I'd draw a boolean and, if True, draw a different element or subtree to use in the second object. Note that this will give you a strategy for a tuple of two objects, and you'll need to unpack it inside the test; that's unavoidable if you want them to be related.
Seriously, do try just cranking up max_examples on the naive approach first though; running Hypothesis for ~an hour is amazingly effective and I would even expect it to shrink the output fairly well.
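For illustration, here is a sketch of that @composite approach, using nested lists of integers as a stand-in for your recursive objects (my_function and the exact object shape are assumptions taken from your example):
from hypothesis import given, strategies as st

@st.composite
def similar_pairs(draw, depth=3):
    # Draw a pair of recursive structures that are mostly identical,
    # diverging only where a drawn boolean says so.
    if depth == 0 or draw(st.booleans()):
        a = draw(st.integers())
        # Usually reuse the same leaf; occasionally draw a fresh one.
        b = a if draw(st.booleans()) else draw(st.integers())
        return a, b
    pairs = draw(st.lists(similar_pairs(depth=depth - 1), max_size=3))
    return [p[0] for p in pairs], [p[1] for p in pairs]

@given(similar_pairs())
def test_is_equal_similar(pair):
    a, b = pair   # unpack the tuple the strategy produces
    my_function(a, b)
    assert a == b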

How to generate stable id for AST nodes in functional programming?

I want to replace a specific AST node with another, where the node to substitute is specified by interactive user input.
In non-functional programming you can use mutable data structures, and each AST node has an object reference, so when I need to refer to a specific node, I can use that reference.
But in functional programming, using IORef is not recommended, so I need to generate an id for each AST node, and I want this id to be stable, which means:
when a node is not changed, its generated id will also not change.
when a child node is changed, its parent's id will not change.
And, to make it clear that this is an id rather than a hash value:
two different subnodes that compare equal but correspond to different parts of an expression should have different ids.
So, what should I do to approach this?
Perhaps you could use the path from the root to the node as the id of that node. For example, for the datatype
data AST = Lit Int
         | Add AST AST
         | Neg AST
You could have something like
data ASTPathPiece = AddGoLeft
                  | AddGoRight
                  | NegGoDown

type ASTPath = [ASTPathPiece]
This satisfies conditions 2 and 3 but, alas, it doesn't satisfy 1 in general. The index into a list will change if you insert a node in a previous position, for example.
If you are rendering the AST into another format, perhaps you could add hidden attributes in the result nodes that identified which ASTPathPiece led to them. Traversing the result nodes upwards to the root would let you reconstruct the ASTPath.
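To make the path-as-id idea concrete in a language-neutral way, here is a small Python sketch (the tuple-based AST shape is an assumption) that replaces the node at a given path without mutating the original tree:
# Hypothetical AST shape: ("lit", n), ("neg", child), ("add", left, right).
def substitute(ast, path, replacement):
    # `path` is the node's id: the list of child positions from the root.
    if not path:
        return replacement
    head, *rest = path
    tag, *children = ast
    children[head] = substitute(children[head], rest, replacement)
    return (tag, *children)

tree = ("add", ("lit", 1), ("neg", ("lit", 2)))
print(substitute(tree, [0], ("lit", 99)))
# ('add', ('lit', 99), ('neg', ('lit', 2)))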

How is insert O(log(n)) in Data.Set?

When looking through the docs of Data.Set, I saw that insertion of an element into the tree is mentioned to be O(log(n)). However, I would intuitively expect it to be O(n*log(n)) (or maybe O(n)?), as referential transparency requires creating a full copy of the previous tree in O(n).
I understand that for example (:) can be made O(1) instead of O(n), as here the full list doesn't have to be copied; the new list can be optimized by the compiler to be the first element plus a pointer to the old list (note that this is a compiler - not a language level - optimization). However, inserting a value into a Data.Set involves rebalancing that looks quite complex to me, to the point where I doubt that there's something similar to the list optimization. I tried reading the paper that is referenced by the Set docs, but couldn't answer my question with it.
So: how can inserting an element into a binary tree be O(log(n)) in a (purely) functional language?
There is no need to make a full copy of a Set in order to insert an element into it. Internally, elements are stored in a tree, which means that you only need to create new nodes along the path of the insertion. Untouched nodes can be shared between the pre-insertion and post-insertion versions of the Set. And as Deitrich Epp pointed out, in a balanced tree O(log(n)) is the length of the path of the insertion. (Sorry for omitting that important fact.)
Say your Tree type looks like this:
data Tree a = Node a (Tree a) (Tree a)
            | Leaf
... and say you have a Tree that looks like this
let t = Node 10 tl (Node 15 Leaf tr')
... where tl and tr' are some named subtrees. Now say you want to insert 12 into this tree. Well, that's going to look something like this:
let t' = Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr')
The subtrees tl and tr' are shared between t and t', and you only had to construct 3 new Nodes to do it, even though the size of t could be much larger than 3.
EDIT: Rebalancing
With respect to rebalancing, think about it like this, and note that I claim no rigor here. Say you have an empty tree. Already balanced! Now say you insert an element. Already balanced! Now say you insert another element. Well, there's an odd number so you can't do much there.
Here's the tricky part. Say you insert another element. This could go two ways: left or right; balanced or unbalanced. In the case that it's unbalanced, you can clearly perform a rotation of the tree to balance it. In the case that it's balanced, already balanced!
What's important to note here is that you're constantly rebalancing. It's not like you have a mess of a tree, decided to insert an element, but before you do that, you rebalance, and then leave a mess after you've completed the insertion.
Now say you keep inserting elements. The tree's gonna get unbalanced, but not by much. And when that does happen, first off you're correcting it immediately, and secondly, the correction occurs along the path of the insertion, which is O(log(n)) in a balanced tree. The rotations in the paper you linked touch at most three nodes in the tree, so you're doing O(3 * log(n)) work when rebalancing. That's still O(log(n)).
To add extra emphasis to what dave4420 said in a comment, there are no compiler optimizations involved in making (:) run in constant time. You could implement your own list data type, and run it in a simple non-optimizing Haskell interpreter, and it would still be O(1).
A list is defined to be an initial element plus a list (or it's empty in the base case). Here's a definition that's equivalent to native lists:
data List a = Nil | Cons a (List a)
So if you've got an element and a list, and you want to build a new list out of them with Cons, that's just creating a new data structure directly from the arguments the constructor requires. There is no more need to even examine the tail list (let alone copy it), than there is to examine or copy the string when you do something like Person "Fred".
You are simply mistaken when you claim that this is a compiler optimization and not a language level one. This behaviour follows directly from the language level definition of the list data type.
Similarly, for a tree defined to be an item plus two trees (or an empty tree), when you insert an item into a non-empty tree it must either go in the left or right subtree. You'll need to construct a new version of that tree containing the element, which means you'll need to construct a new parent node containing the new subtree. But the other subtree doesn't need to be traversed at all; it can be put in the new parent tree as is. In a balanced tree, that's a full half of the tree that can be shared.
Applying this reasoning recursively should show you that there's actually no copying of data elements necessary at all; there's just the new parent nodes needed on the path down to the inserted element's final position. Each new node stores 3 things: an item (shared directly with the item reference in the original tree), an unchanged subtree (shared directly with the original tree), and a newly created subtree (which shares almost all of its structure with the original tree). There will be O(log(n)) of those in a balanced tree.
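The same sharing argument can be demonstrated outside Haskell as well. Here is a small Python sketch of a persistent insert into a BST (unbalanced, for brevity): only the nodes on the insertion path are rebuilt, and the untouched subtree is reused as-is.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(tree: Optional[Node], value: int) -> Node:
    # Rebuild only the path to the insertion point; share the rest.
    if tree is None:
        return Node(value)
    if value < tree.value:
        return Node(tree.value, insert(tree.left, value), tree.right)
    return Node(tree.value, tree.left, insert(tree.right, value))

t = insert(insert(insert(None, 10), 5), 15)
t2 = insert(t, 12)
print(t2.left is t.left)  # True: the left subtree is shared, not copied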
