How to generate stable id for AST nodes in functional programming? - haskell

I want to replace a specific AST node with another one, and the node to replace is chosen by interactive user input.
In non-functional programming you can use mutable data structures, where each AST node has an object reference, so when I need to refer to a specific node I can use that reference.
But in functional programming, using IORef is not recommended, so I need to generate an id for each AST node, and I want this id to be stable, which means:
1. When a node is not changed, its generated id does not change either.
2. When a child node is changed, its parent's id does not change.
And, to make it clear that it is an id rather than a hash value:
3. Two different sub-nodes that compare equal but correspond to different parts of an expression should have different ids.
So, how should I approach this?

Perhaps you could use the path from the root to the node as the id of that node. For example, for the datatype
    data AST = Lit Int
             | Add AST AST
             | Neg AST
You could have something like
    data ASTPathPiece = AddGoLeft
                      | AddGoRight
                      | NegGoDown

    type ASTPath = [ASTPathPiece]
This satisfies conditions 2 and 3 but, alas, it doesn't satisfy 1 in general. The index into a list will change if you insert a node in a previous position, for example.
If you are rendering the AST into another format, perhaps you could add hidden attributes in the result nodes that identified which ASTPathPiece led to them. Traversing the result nodes upwards to the root would let you reconstruct the ASTPath.

Related

Binary Tree/Search Tree

I'm starting to learn data structures and algorithms in Python and I am on binary trees right now. However, I'm really confused about how to implement them. I see some implementations where there are two classes, one for Node and one for the tree itself, like so:
    class Node:
        def __init__(self, data):
            self.right = None
            self.left = None
            self.data = data

    class Tree:
        def __init__(self):
            self.root = None

        def insert(self, data): ...
        def preorder(self): ...
        # ...
However I also see implementations where there is no Node class and instead all the code goes inside the Tree class without the following
    def __init__(self):
        self.root = None
Can someone please help me understand why there are two different ways, the pros and cons of each method, the differences between them and which method I should follow to implement a binary tree.
Thank you!
Yes, there are these two ways.
First of all, in the 1-class approach, that class is really the Node class (possibly named differently) of the 2-class approach. True, all the methods are then on the Node class, but that could also have been the case in the 2-class approach, where the Tree class would defer much of the work by calling methods on the Node class. So what the 1-class approach is really missing, is the Tree class. The name of the class can obscure this observation, but it is better to see it that way, even when the methods to work with the tree are on the Node class.
The difference becomes apparent when you need to represent an empty tree. With the 1-class approach you would have to set your main variable to None. But that None is a different concept from a tree without nodes: on an object that represents an empty tree you can still call methods to add nodes again, whereas you cannot call methods on None. It also means that the caller needs to know when to set its own variable to None or to a Node instance, depending on whether the tree happens to be (or become) empty.
With the 2-class system this management is taken out of the hands of the caller and handled inside the Tree instance. The caller always uses the single reference to that instance, regardless of whether the tree is empty or not.
Examples
1. Creating an empty tree
1-class approach:
root = None
2-class approach:
tree = Tree()
2. Adding a first value to an empty tree
1-class approach
# Cannot use an insert method, as there is no instance yet
root = Node(1) # Must get the value for the root
2-class approach
tree.insert(1) # No need to be aware of a changing root
As the 1-class approach needs to get the value of the root in this case, you often see that with this approach, the root is always captured like that, even when the tree was not empty. In that case the value of the caller's root variable will not really change.
3. Deleting the last value from a tree
1-class approach
root = root.delete(1) # Must get the new value for the root; could be None!
2-class approach
tree.delete(1) # No need to be aware of a changing root
Even though the deletion might in general not lead to an empty tree, in the 1-class approach the caller must take that possibility into account and always assign to its own root reference, as it might become None.
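The contrast in the examples above can be sketched as follows. This is a minimal illustration, assuming a plain unbalanced binary search tree; the helper names insert_node and inorder are made up for this sketch, and deletion is left out to keep it short.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


def insert_node(root, data):
    """1-class style: returns the (possibly new) root, which the caller must store."""
    if root is None:
        return Node(data)
    if data < root.data:
        root.left = insert_node(root.left, data)
    else:
        root.right = insert_node(root.right, data)
    return root


class Tree:
    """2-class style: the wrapper owns the root reference."""

    def __init__(self):
        self.root = None  # an empty tree is still a Tree instance

    def insert(self, data):
        # The root bookkeeping happens here, not in the caller.
        self.root = insert_node(self.root, data)

    def inorder(self):
        out = []

        def walk(node):
            if node is not None:
                walk(node.left)
                out.append(node.data)
                walk(node.right)

        walk(self.root)
        return out
```

With the wrapper, the caller writes tree = Tree() followed by tree.insert(1) and never reassigns tree. With the bare function, the caller has to write root = insert_node(root, 1) every time.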
Variants for 1-class systems
1. A None value
Sometimes you see approaches where an empty tree is represented by a Node instance that has a None value, to indicate that it does not really hold data. Then, when the first value is added to the tree, that node's value is updated to that value. In this way, the caller no longer has to manage the root: it will always be a reference to the same Node instance, which can sometimes have a None value to indicate that the tree is actually empty.
The code for the caller to work with this 1-class variant would become the same as with the 2-class approach.
This is bad practice. In some cases you may want a tree to be able to hold None as data, and then you cannot create a tree with such a value, as it will be mistaken for a dummy node that represents an empty tree.
2. Sentinel node
This is a variant to the above. Here the dummy node is always there. The empty tree is represented by a single node instance. When the first value is inserted into the tree, this dummy node's value is never updated, but the insertion happens to its left member. So the dummy node never plays a role for holding data, but is just the container that maintains the real root of the tree in its left member. This really means that the dummy node is the parallel of what the Tree instance is in the 2-class approach. Instead of having a proper root member, it has a left member for that role, and it has no use for its right member.
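A minimal sketch of the sentinel variant, assuming an unbalanced binary search tree. The function names insert and is_empty are illustrative only. The sentinel's own data and right members are never used, and the real root lives in its left member.

```python
class Node:
    def __init__(self, data=None):
        self.data = data
        self.left = None
        self.right = None


def is_empty(sentinel):
    # The tree is empty when the sentinel has no real root attached.
    return sentinel.left is None


def insert(sentinel, data):
    # Start at the sentinel; the real root hangs off its left member.
    parent, go_left = sentinel, True
    node = sentinel.left
    while node is not None:
        parent = node
        go_left = data < node.data
        node = node.left if go_left else node.right
    if go_left:
        parent.left = Node(data)
    else:
        parent.right = Node(data)
```

The caller creates the sentinel once with tree = Node() and then only ever calls insert(tree, value), so the reference it holds never changes, just as with a separate Tree class.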
How native data structures are implemented
The 2-class approach is more like how native data types work. For instance, take the standard list in Python. An empty list is represented by [], not by None. There is a clear distinction between [] and None. Once you have the list reference (like lst = []), that reference never changes. You can append and delete, but you never have to assign to lst itself.

Product name string matching against a trie (supporting omissions)

I have a list of CPU models. Right now, I think the most suitable approach would be forming a trie from the list, like this:
Intel -- Core -- i -- 3
      |       |    |- 5
      |       |    |- 7
      |       |    -- 9
      |       |
      |       -- 2 Duo
      |
      |- Xeon -- ...
      |
      |...
Now, I want to match an input string against this trie. This is easy for exact matching, but what if I need a fuzzy match, where the input sequence can have omissions? I want "Intel i3", "Core i3" and "i3" to all match "Intel -> Core -> i -> 3" in the trie.
Is a trie the right structure for this problem? I thought about using trie search with wildcards, but the wildcard here can be at any position in the sequence.
What data structure can I use to represent the list in the way most applicable to this problem? What algorithm do I use for the search?
While I'm not sure it's the optimal data structure for the task, you could use an augmented trie where every node has direct links to every descendant. Obviously you're going to want better than linear search (the trie root would have a link to every other node), and you also have to deal with duplicates, but the memory costs should be fine as long as your depth is reasonable (which should be true for CPU models). This would look something like:
    class TrieAugmented:
        def __init__(self, val: str):
            self.val = val
            self.children = []
            self.child_paths = {}
When adding CPU models, the new nodes are appended to the list of children as usual, but child_paths has to be updated on every ancestor node for each new node (so additions are O(d^2) rather than O(d), where d is the depth). I would have child_paths map each descendant value to the list of nodes in self.children that either have that value themselves or store it in their own child_paths. If you plan on building a static trie and then querying it, you can build the trie updating only direct children as usual, and then add all the shorter paths in a single depth-first pass through the trie. Each node is recorded on every one of its ancestors, so it accounts for O(d) space instead of constant; overall that is O(n*d), which is something like O(n^2) in the worst case instead of linear, but that should be doable for a relatively small set of items.
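A minimal sketch of this augmented trie, under some assumptions that are not in the answer: the input is already split into tokens, child_paths maps a value straight to the matching descendant nodes (the "direct links" reading, a simplification of routing through children), and the full attribute, the terminal flag, add_model and match are hypothetical names introduced for this illustration.

```python
class TrieAugmented:
    def __init__(self, val, full=()):
        self.val = val
        self.full = full          # tokens from the root down to this node
        self.children = []        # direct children only
        self.child_paths = {}     # value -> list of descendant nodes with that value
        self.terminal = False     # True if a complete model ends here


def add_model(root, tokens):
    path = [root]
    for tok in tokens:
        parent = path[-1]
        node = next((c for c in parent.children if c.val == tok), None)
        if node is None:
            node = TrieAugmented(tok, parent.full + (tok,))
            parent.children.append(node)
            # Register the new node on every ancestor: O(d) work per new node.
            for ancestor in path:
                ancestor.child_paths.setdefault(tok, []).append(node)
        path.append(node)
    path[-1].terminal = True


def match(root, tokens):
    """Return the full token paths of models that contain tokens in order, with omissions."""
    frontier = [root]
    for tok in tokens:
        frontier = [d for node in frontier for d in node.child_paths.get(tok, [])]
    # A set removes nodes that were reached along more than one route.
    return sorted({n.full for n in frontier if n.terminal})
```

Each query token costs one dictionary lookup per node in the current frontier, which is where the speed-up over walking the whole trie comes from.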
If storage and implementation complexity are bigger concerns than runtime, you can use an unaugmented trie. This makes the runtime linear in number of trie nodes best case rather than roughly linear in input size, but it's very similar to matching file system paths with arbitrary nested structure. In rust glob syntax, you would translate "Core i3" to "/**/Core/**/i/**/3" and treat your trie as a file system (you do in fact insert wildcards at every position in the sequence, and they can match arbitrarily many levels of the trie). Here the trie doesn't make the lookup too fast but does make it possible to match models with omissions to their fully specified versions.
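The glob-style search over an unaugmented trie could look roughly like the sketch below. It assumes the trie is stored as nested dictionaries and that the query is already tokenized; the function name search is made up for this example. Each recursive call either consumes the next query token at the current level or skips the level, which plays the role of "**" in "/**/Core/**/i/**/3".

```python
def search(trie, tokens, path=()):
    """Return the set of trie paths that contain tokens in order and end on the last token."""
    found = set()
    for val, subtrie in trie.items():
        new_path = path + (val,)
        if tokens and val == tokens[0]:
            rest = tokens[1:]
            if not rest:
                found.add(new_path)            # last token consumed at this node
            else:
                found |= search(subtrie, rest, new_path)
        # "**" case: descend without consuming a token.
        found |= search(subtrie, tokens, new_path)
    return found
```

This sketch does not distinguish complete models from inner nodes, so a query such as ["Core"] returns the path to the Core node itself; a real implementation would probably filter on some end-of-model marker.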

Creating an AST and manage, at the same time, a symbol Table using Haskell's Happy parsers

I am creating a simple imperative language from scratch, I already have a working syntactic tree, without much complication, it just uses the bottom-up style of parsing to create it using a simple tree data structure. The idea now is to implement a complete LeBlanc-Cook style symbol table. Its structure is not complicated, the problem is that I don't know how to make happy fill it while creating the tree at the same time.
The idea behind doing it all in a single pass is that this way the AST can be filled with only the minimum necessary, ignoring things like variable or type declarations, whose only effects are in the symbol table. Traversing the AST to fill the table is my last option.
I have the basic idea of it being some kind of global state, modified just in certain select times, like when a new block is open or when a variable is declared, but I have no idea how to use the environment that happy gives me together with whatever monadic structure I would create.
I know this question could be reduced to something like "how does happy work?", but anyways. Any comment is appreciated.
Here's an example to illustrate a little better my question.
%monad { MyState }
...
START: INSTRUCTIONS { (AST_Root $1, Final_Symtable_State) } -- Ideally
INSTRUCTIONS: INSTRUCTIONS INSTRUCTION { $2:$1 } -- A list of all instructions
| INSTRUCTION { [$1] }
INSTRUCTION : VARDEF {%???}
| TYPEDEF
| VARMOD
| ...
...
VARDEF: let identifier : Int {%???} -- This should modify the symtable state and not the tree
| let identifier : Int = number {%???} -- This should modify the symtable and provide a new branch for the tree
It seems like a single state might not work out, and there is a combination of monadic and non-monadic actions. What would be the best structure to work this out?

How is insert O(log(n)) in Data.Set?

When looking through the docs of Data.Set, I saw that insertion of an element into the tree is mentioned to be O(log(n)). However, I would intuitively expect it to be O(n*log(n)) (or maybe O(n)?), as referential transparency requires creating a full copy of the previous tree in O(n).
I understand that for example (:) can be made O(1) instead of O(n), as here the full list doesn't have to be copied; the new list can be optimized by the compiler to be the first element plus a pointer to the old list (note that this is a compiler - not a language level - optimization). However, inserting a value into a Data.Set involves rebalancing that looks quite complex to me, to the point where I doubt that there's something similar to the list optimization. I tried reading the paper that is referenced by the Set docs, but couldn't answer my question with it.
So: how can inserting an element into a binary tree be O(log(n)) in a (purely) functional language?
There is no need to make a full copy of a Set in order to insert an element into it. Internally, elements are stored in a tree, which means that you only need to create new nodes along the path of the insertion. Untouched nodes can be shared between the pre-insertion and post-insertion versions of the Set. And as Deitrich Epp pointed out, in a balanced tree O(log(n)) is the length of the path of the insertion. (Sorry for omitting that important fact.)
Say your Tree type looks like this:
    data Tree a = Node a (Tree a) (Tree a)
                | Leaf
... and say you have a Tree that looks like this
let t = Node 10 tl (Node 15 Leaf tr')
... where tl and tr' are some named subtrees. Now say you want to insert 12 into this tree. Well, that's going to look something like this:
let t' = Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr')
The subtrees tl and tr' are shared between t and t', and you only had to construct 3 new Nodes to do it, even though the size of t could be much larger than 3.
EDIT: Rebalancing
With respect to rebalancing, think about it like this, and note that I claim no rigor here. Say you have an empty tree. Already balanced! Now say you insert an element. Already balanced! Now say you insert another element. Well, with two elements one side is necessarily deeper than the other, so you can't do much there.
Here's the tricky part. Say you insert another element. This could go two ways: left or right; balanced or unbalanced. In the case that it's unbalanced, you can clearly perform a rotation of the tree to balance it. In the case that it's balanced, already balanced!
What's important to note here is that you're constantly rebalancing. It's not like you have a mess of a tree, decided to insert an element, but before you do that, you rebalance, and then leave a mess after you've completed the insertion.
Now say you keep inserting elements. The tree's gonna get unbalanced, but not by much. And when that does happen, first off you're correcting that immediately, and secondly, the correction occurs along the path of the insertion, which is O(log(n)) in a balanced tree. The rotations in the paper you linked to touch at most three nodes in the tree each, so you're doing O(3 * log(n)) work when rebalancing. That's still O(log(n)).
To add extra emphasis to what dave4420 said in a comment, there are no compiler optimizations involved in making (:) run in constant time. You could implement your own list data type, and run it in a simple non-optimizing Haskell interpreter, and it would still be O(1).
A list is defined to be an initial element plus a list (or it's empty in the base case). Here's a definition that's equivalent to native lists:
data List a = Nil | Cons a (List a)
So if you've got an element and a list, and you want to build a new list out of them with Cons, that's just creating a new data structure directly from the arguments the constructor requires. There is no more need to even examine the tail list (let alone copy it), than there is to examine or copy the string when you do something like Person "Fred".
You are simply mistaken when you claim that this is a compiler optimization and not a language level one. This behaviour follows directly from the language level definition of the list data type.
Similarly, for a tree defined to be an item plus two trees (or an empty tree), when you insert an item into a non-empty tree it must either go in the left or right subtree. You'll need to construct a new version of that tree containing the element, which means you'll need to construct a new parent node containing the new subtree. But the other subtree doesn't need to be traversed at all; it can be put in the new parent tree as is. In a balanced tree, that's a full half of the tree that can be shared.
Applying this reasoning recursively should show you that there's actually no copying of data elements necessary at all; there's just the new parent nodes needed on the path down to the inserted element's final position. Each new node stores 3 things: an item (shared directly with the item reference in the original tree), an unchanged subtree (shared directly with the original tree), and a newly created subtree (which shares almost all of its structure with the original tree). There will be O(log(n)) of those in a balanced tree.

Syntax tree with parent in Haskell

I want to implement an AST in Haskell. I need a parent reference so it seems impossible to use a functional data structure. I've seen the following in an article. We define a node as:
type Tree = Node -> Node
Node allows us to get attribute by key of type Key a.
Is there anything to read about such a pattern? Could you give me some further links?
If you want a pure data structure with cyclic self-references, then as delnan says in the comments the usual term for that is "tying the knot". Searching for that term should give you more information.
Do note that data structures built by tying the knot are difficult (or impossible) to "update" in the usual manner--with a non-cyclic structure you can keep pieces of the original when building a new structure based on it, but changing any piece of a cycle requires you to rebuild the entire cycle as well. Depending on what you're doing, this may or may not be a problem, of course.
