How do I find a matching subtree?

How do I find a matching subtree? - search

I have a large binary tree, T. T "matches". Some number of subtrees of T will also match. In fact, the matching subtrees need not even be full subtrees: they can be truncated, too. By truncated subtree, I mean that nodes in the subtree may not contain children all the way down - some nodes that have children may have their children removed.
An example: see this link. The tree represented by poem1, stanza1, stanza2, line3 is an example of a truncated subtree.
Determining if a tree matches requires performing a calculation on that entire tree. It's not progressive.
How the heck do I find all matches?

http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
sounds roughly like what you're trying to find (except that you're trying this on all subgraphs of an original graph as well, making it even harder). I don't really know how you are defining "matches" (equality, pattern, color coordinated, sticks with chemicals on the end that ignite when struck?), so it might be quite a different problem.

Related

Tree searching algorithm

I'm looking for suggestions on strategies for searching a tree-like data structure.
The structure is a tree where each element is a string, each branch is a period, and a path is the concatenation of several strings and periods starting at the root. The root and edges from the root are a special case where there is no string behind them.
So given the tree,
{root}
/ \
A X
/ \ /
B C Y
Valid paths are the strings "A", "A.B", "A.C", "X", and "X.Y".
What we have is a set of strings that we need to search for in this tree and find the element that terminates each string. Not all strings in the set appear in the tree. We stop searching when we find all strings. We need to run this search several times but the trees may differ each time. The set of strings to search is the same each run though.
Currently we're using depth-first search, but this isn't very efficient if all strings fall under say the last branch under the root. I feel like there should be a better way of doing this.
What would be a good algorithm for doing this repeated search? Would it be possible to leverage multithreading here as well?

It's an interesting problem; usually one would imagine a single tree being searched for a variable set of strings. Here the situation is reversed: the set of strings is fixed and the tree is highly variable.
I think that the best you can do is build a trie representing the set of strings. That way, you only have to search a tree once for any given prefix. (So, for the example strings you mentioned, you would only need to find the "A" prefix once and the "X" prefix once.) There are lots of trie data structures and algorithms for building them from a set of strings, but since that's a one-time operation for this problem, I wouldn't worry too much about the cost of this preprocessing.

Is it possible to efficiently search a bit-trie for values less than the key?

I am currently storing a large number of unsigned 32-bit integers in a bit trie (effectively forming a binary tree with a node for each bit in the 32-bit value.) This is very efficient for fast lookup of exact values.
I now want to be able to search for keys that may or may not be in the trie and find the value for the first key less than or equal to the search key. Is this efficiently possible with a bit trie, or should I use a different data structure?
I am using a trie due to its speed and cache locality, and ideally want to sacrifice neither.
For example, suppose the trie has two keys added:
0x00AABBCC
0x00AABB00
and I an now searching for a key that is not present, 0x00AABB11. I would like to find the first key present in the tree with a value <= the search key, which in this case would be the node for 0x00AABB00.
While I've thought of a possible algorithm for this, I am seeking concrete information on if it is efficiently possible and/or if there are known algorithms for this, which will no doubt be better than my own.

We can think bit trie as a binary search tree. In fact, it is a binary search tree. Take the 32-bit trie for example, suppose left child as 0, right child as 1. For the root, the left subtree is for the numbers less than 0x80000000 and the right subtree is for the numbers no less than 0x80000000, so on and so forth. So you can just use the similar the method to find the largest item not larger than the search key in the binary search tree. Just don't worry about the backtracks, it won't backtrack too much and won't change the search complexity.
When you match fails in the bit trie, just backtrack to find the right-most child of the nearest ancestor of the failed node.

If the data is static--you're not adding or removing items--then I'd take a good look at using a simple array with binary search. You sacrifice cache locality, but that might not be catastrophic. I don't see cache locality as an end in itself, but rather a means of making the data structure fast.
You might get better cache locality by creating a balanced binary tree in an array. Position 0 is the root node, position 1 is left node, position 2 is right node, etc. It's the same structure you'd use for a binary heap. If you're willing to allocate another 4 bytes per node, you could make it a left-threaded binary tree so that if you search for X and end up at the next larger value, following that left thread would give you the next smaller value. All told, though, I don't see where this can outperform the plain array in the general case.
A lot depends on how sparse your data is and what the range is. If you're looking at a few thousand possible values in the range 0 to 4 billion, then the binary search looks pretty attractive. If you're talking about 500 million distinct values, then I'd look at allocating a bit array (500 megabytes) and doing a direct lookup with linear backward scan. That would give you very good cache locality.

A bit trie walks 32 nodes in the best case when the item is found.
A million entries in a red-black tree like std::map or java.util.TreeMap would only require log2(1,000,000) or roughly 20 nodes per query, worst case. And you do not always need to go to the bottom of the tree making average case appealing.
When backtracking to find <= the difference is even more pronounced.
The fewer entries you have, the better the case for a red-black tree
At a minimum, I would compare any solution to a red-black tree.

Searching for a series of nodes in an isometric graph, Depth First is not finding some possible paths

I have a 4x4 undirected graph, with links/paths between each node vertically,horizontally and diagonally. In my example I've simplified the contents of these nodes to integers. Given a series of numbers of any length, I want to determine if a path exists on the board that consists of these numbers. No node can be used twice. For example searching 789, 548 and 734 on the graph below would return true, but 111, 7343 and 98989 would return false.
I currently have what is essentially a depth first search, but I realized it is missing some paths. In the above example, 12234 could be missed. If the search starts at 1, moves diagonally to 2, and left to 2, there is nowhere else to go. The search then backtracks, marking the rightmost 2 as visited and blocking the only correct path.
The improvement I've been able to come up with is to add additional state to each node to record the depth at which they were visited. That would eliminate this case, and certainly make it more correct. But this is still a problem for 27979 on the graph above. If the search starts at the left-most 2, goes down and right to 7, up-right to 9, up-left to 7, it will again block the correct path.
It seems like I'm using the wrong kind of search here, but what's the right one?

It seems I've come up with a solution, which I'll share in case someone else comes across this is the future.
On each search I build a tree with bidirectional links, so that at each place the path can go multiple ways it branches, like a breadth-first search. This differs in that each branch of the tree can use the nodes of any other unconnected branch. As each node is added to the tree I follow the links back to root and check the node against each link to eliminate the possibility of cyclical paths and allow reuse of nodes from other paths. Once the a branch has reached the depth desired, I backtrack to root to record the path.

How is insert O(log(n)) in Data.Set?

When looking through the docs of Data.Set, I saw that insertion of an element into the tree is mentioned to be O(log(n)). However, I would intuitively expect it to be O(n*log(n)) (or maybe O(n)?), as referential transparency requires creating a full copy of the previous tree in O(n).
I understand that for example (:) can be made O(1) instead of O(n), as here the full list doesn't have to be copied; the new list can be optimized by the compiler to be the first element plus a pointer to the old list (note that this is a compiler - not a language level - optimization). However, inserting a value into a Data.Set involves rebalancing that looks quite complex to me, to the point where I doubt that there's something similar to the list optimization. I tried reading the paper that is referenced by the Set docs, but couldn't answer my question with it.
So: how can inserting an element into a binary tree be O(log(n)) in a (purely) functional language?

There is no need to make a full copy of a Set in order to insert an element into it. Internally, element are stored in a tree, which means that you only need to create new nodes along the path of the insertion. Untouched nodes can be shared between the pre-insertion and post-insertion version of the Set. And as Deitrich Epp pointed out, in a balanced tree O(log(n)) is the length of the path of the insertion. (Sorry for omitting that important fact.)
Say your Tree type looks like this:
data Tree a = Node a (Tree a) (Tree a)
| Leaf
... and say you have a Tree that looks like this
let t = Node 10 tl (Node 15 Leaf tr')
... where tl and tr' are some named subtrees. Now say you want to insert 12 into this tree. Well, that's going to look something like this:
let t' = Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr')
The subtrees tl and tr' are shared between t and t', and you only had to construct 3 new Nodes to do it, even though the size of t could be much larger than 3.
EDIT: Rebalancing
With respect to rebalancing, think about it like this, and note that I claim no rigor here. Say you have an empty tree. Already balanced! Now say you insert an element. Already balanced! Now say you insert another element. Well, there's an odd number so you can't do much there.
Here's the tricky part. Say you insert another element. This could go two ways: left or right; balanced or unbalanced. In the case that it's unbalanced, you can clearly perform a rotation of the tree to balance it. In the case that it's balanced, already balanced!
What's important to note here is that you're constantly rebalancing. It's not like you have a mess of a tree, decided to insert an element, but before you do that, you rebalance, and then leave a mess after you've completed the insertion.
Now say you keep inserting elements. The tree's gonna get unbalanced, but not by much. And when that does happen, first off you're correcting that immediately, and secondly, the correction occurs along the path of the insertion, which is O(log(n)) in a balanced tree. The rotations in the paper you linked to are touching at most three nodes in the tree to perform a rotation. so you're doing O(3 * log(n)) work when rebalancing. That's still O(log(n)).

To add extra emphasis to what dave4420 said in a comment, there are no compiler optimizations involved in making (:) run in constant time. You could implement your own list data type, and run it in a simple non-optimizing Haskell interpreter, and it would still be O(1).
A list is defined to be an initial element plus a list (or it's empty in the base case). Here's a definition that's equivalent to native lists:
data List a = Nil | Cons a (List a)
So if you've got an element and a list, and you want to build a new list out of them with Cons, that's just creating a new data structure directly from the arguments the constructor requires. There is no more need to even examine the tail list (let alone copy it), than there is to examine or copy the string when you do something like Person "Fred".
You are simply mistaken when you claim that this is a compiler optimization and not a language level one. This behaviour follows directly from the language level definition of the list data type.
Similarly, for a tree defined to be an item plus two trees (or an empty tree), when you insert an item into a non-empty tree it must either go in the left or right subtree. You'll need to construct a new version of that tree containing the element, which means you'll need to construct a new parent node containing the new subtree. But the other subtree doesn't need to be traversed at all; it can be put in the new parent tree as is. In a balanced tree, that's a full half of the tree that can be shared.
Applying this reasoning recursively should show you that there's actually no copying of data elements necessary at all; there's just the new parent nodes needed on the path down to the inserted element's final position. Each new node stores 3 things: an item (shared directly with the item reference in the original tree), an unchanged subtree (shared directly with the original tree), and a newly created subtree (which shares almost all of its structure with the original tree). There will be O(log(n)) of those in a balanced tree.

Quadtree object movement

So I need some help brainstorming, from a theoretical standpoint. Right now I have some code that just draws some objects. The objects lie in the leaves of a quadtree. Now as the objects move I want to keep them placed in the correct leaf of the quadtree.
Right now I am just reconstructing the quadtree on the objects after I change their position. I was trying to figure out a way to correct the tree without rebuilding it completely. All I can think of is having a bunch of pointers to adjacent leaf nodes.
Does anyone have an idea of how to figure out the node into which an object moves without just having a ton of pointers everywhere or a link to articles on this? All I could find was different ways to build the quadtree, nothing about updating it.

If I understand your question. You want some way of mapping between spatial coordinates and leaves on the quadtree.
Here's one possible solution I've been looking at:
For simplicity, let's do the 1D case first. And lets assume we have 32 gridpoints in x. Every grid point then corresponds to some leaf on a quadtree of depth five. (depth 0 = the whole grid, depth 1 = 2 points, depth 2 = 4 points... depth 5 = 32 points).
Each leaf could be represented by the branch indices leading to the leaf. At each level there are two branches we can label A and B. So, a particular leaf might be labeled BBAAB, which would mean, go down the B branch, then the B branch, then the A branch, then the B branch and then the B branch.
So, how do you map e.g. BBABB to an x grid point between 0..31? Just convert it to binary, so that BBABB->11011 = 27. Thus, the mapping from gridpoint to leaf-node is simply a matter of translating the letters A and B into 0s and 1s and then interpreting the result as a binary number.
For the 2D case, it's only slightly more complicated. Now we have four branches from each node, so we can label each branch path using a four-letter alphabet, e.g. starting from the root and taking the 3rd branch and then the fourth branch and then the first branch and then the second branch and then the second branch again we would generate the string CDABB.
Now to convert the string (e.g. 'CDABB') into a pair of gridvalues (x,y).
Let's assume A is lower-left, B is lower right, C is upper left and D is upper right. Then, symbolically, we could write, A.x=0, A.y=0 / B.x=1, B.y=0 / C.x=0, C.y=1 / D.x=1, D.y=1.
Taking the example CDABB, we first look at its x values (CDABB).x = (01011), which gives us the x grid point. And similarly for y.
Finally, if you want to find out e.g. the node immediately to the right of CDABB, then simply convert it to a pair of binary numbers in x and y, add +1 to the x value and convert the new pair of binary numbers back into a string.
I'm sure this has all been discovered, but I haven't yet found this information on the web.

If you have the spatial data necessary to insert an element into the quad-tree in the first place (ex: its point or rectangle), then you have the same data needed to remove it.
An easy way is before you move an element, remove it from the quad-tree using the same data you used to originally insert it, then move it, then re-insert.
Removal from the quad-tree can first remove the element from the leaf node(s), then if the leaf nodes become empty, remove them from their parents. If the parents become empty, remove them from their parents, and so forth.
This simple method is efficient enough for a complex world of objects moving every frame as long as you implement the quad-tree efficiently (ex: use a free list for the nodes). There shouldn't have to be a heap allocation on a per-node basis to insert it, nor a heap deallocation involved in removing every single node. Most node allocations/deallocations should be a simple constant-time operation just involving, say, the manipulation of a couple of integers or pointers.
You can also make this a little more complex if you like. You can start off storing the previous position of an object and then move it. If the new position occupies nodes other than the previous position, then remove the object from the nodes it no longer occupies and insert it to the new ones. Otherwise just keep it in the same node(s).
Update
I usually try to avoid linking my previous answers, but in this case I ended up doing a pretty comprehensive write up on the topic which would be hard to replicate anywhere else. Here it is: https://stackoverflow.com/a/48330314/4842163

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string