Product name string matching against a trie (supporting omissions)

I have a list of CPU models. Right now, I think the most suitable approach would be forming a trie from the list, like this:
Intel -- Core -- i -- 3
  |       |      |- 5
  |       |      |- 7
  |       |      -- 9
  |       |
  |       -- 2 Duo
  |
  |- Xeon -- ...
  |
  |...
Now, I want to match an input string against this trie. This is easy for exact matching, but what if I need a fuzzy match, where the input sequence can have omissions? For example, "Intel i3", "Core i3" and "i3" should all match "Intel -> Core -> i -> 3" in the trie.
Is a trie the right tool for this problem? I thought about using trie search with wildcards, but the wildcard here can be in any position in the sequence.
What data structure can I use to represent the list in a way most applicable to this problem? What algorithm do I use for search?

While I'm not sure it's the optimal data structure for the task, you could use an augmented trie where every node has direct links to every descendant. Obviously you're going to want better than linear search (the trie root would have a link to every other node), and you also have to deal with duplicates, but the memory costs should be fine as long as your depth is reasonable (which should be true for CPU models). This would look something like:
class TrieAugmented:
    def __init__(self, val: str):
        self.val = val
        self.children = []
        self.child_paths = {}
When adding CPU models, new nodes are appended to the list of children as usual, but child_paths has to be updated on every ancestor node for each new node (so an addition costs O(d^2) rather than O(d), where d is the depth). I would have child_paths map each descendant value to the list of nodes in self.children that either have that value or hold it in their own child_paths. If you plan on building a static trie and then querying it, you can build the trie updating only direct children as usual, and then add all the shorter paths in a single depth-first pass through the trie. Each node is recorded on up to d ancestors instead of one, so overall this is O(n*d) space, at worst something like O(n^2), instead of linear, but that should be doable for a relatively small set of items.
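A minimal sketch of this augmented trie, under some assumptions that are not in the answer above: models and queries arrive already split into tokens, children are kept in a dict rather than a list, child_paths links straight to the descendant nodes, and the parent attribute and the add, find and path methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class TrieAugmented:
    def __init__(self, val: str, parent: Optional["TrieAugmented"] = None):
        self.val = val
        self.parent = parent  # assumption: parent link, used to rebuild full names
        self.children: Dict[str, "TrieAugmented"] = {}
        # value -> every descendant node (at any depth) carrying that value
        self.child_paths: Dict[str, List["TrieAugmented"]] = defaultdict(list)

    def add(self, tokens: List[str]) -> "TrieAugmented":
        node = self
        for tok in tokens:
            if tok not in node.children:
                child = TrieAugmented(tok, node)
                node.children[tok] = child
                # Register the new node on every ancestor: O(d) per node.
                anc: Optional["TrieAugmented"] = node
                while anc is not None:
                    anc.child_paths[tok].append(child)
                    anc = anc.parent
            node = node.children[tok]
        return node

    def find(self, tokens: List[str]) -> List["TrieAugmented"]:
        frontier = [self]
        for tok in tokens:
            seen = {}  # de-duplicate nodes reachable from several frontier nodes
            for node in frontier:
                for desc in node.child_paths.get(tok, []):
                    seen[id(desc)] = desc
            frontier = list(seen.values())
        return frontier

    def path(self) -> List[str]:
        parts = []
        node = self
        while node.parent is not None:  # stop at the root, whose value is unused
            parts.append(node.val)
            node = node.parent
        return parts[::-1]
```

With the models from the question inserted under an empty root, find(["i", "3"]) and find(["Intel", "3"]) both return the single node whose path() is Intel, Core, i, 3.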
If storage and implementation complexity are bigger concerns than runtime, you can use an unaugmented trie. This makes the runtime linear in the number of trie nodes at best, rather than roughly linear in the input size, but it's very similar to matching file system paths with arbitrary nested structure. In Rust glob syntax, you would translate "Core i3" to "/**/Core/**/i/**/3" and treat your trie as a file system (you do in fact insert wildcards at every position in the sequence, and they can match arbitrarily many levels of the trie). Here the trie doesn't make the lookup especially fast, but it does make it possible to match models with omissions to their fully specified versions.
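The unaugmented variant can be sketched as a depth-first walk that treats the query as a subsequence of each root-to-model path, which has the same effect as putting a wildcard before every query token. The names PlainTrieNode, insert_model and match_with_omissions are hypothetical, and the sketch assumes models and queries are already split into tokens.

```python
from typing import Dict, List


class PlainTrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "PlainTrieNode"] = {}
        self.terminal = False  # True if a full model name ends at this node


def insert_model(root: PlainTrieNode, tokens: List[str]) -> None:
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, PlainTrieNode())
    node.terminal = True


def match_with_omissions(root: PlainTrieNode, query: List[str]) -> List[List[str]]:
    """Return every stored model whose token path contains query as a subsequence."""
    results: List[List[str]] = []

    def walk(node: PlainTrieNode, path: List[str], qi: int) -> None:
        # qi counts how many query tokens have been matched so far on this path.
        if node.terminal and qi == len(query):
            results.append(list(path))
        for tok, child in node.children.items():
            # Greedily consume the next query token when this edge matches it.
            matched = qi < len(query) and tok == query[qi]
            path.append(tok)
            walk(child, path, qi + 1 if matched else qi)
            path.pop()

    walk(root, [], 0)
    return results
```

Every trie node is visited once per query, which is the linear-in-trie-size cost mentioned above.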

Related

How to generate stable id for AST nodes in functional programming?

I want to replace a specific AST node with another one, where the node to replace is chosen through interactive user input.
In non-functional programming you can use a mutable data structure in which each AST node has an object reference, so when I need to refer to a specific node I can use that reference.
In functional programming, however, using IORef is not recommended, so I need to generate an id for each AST node, and I want this id to be stable, which means:
1. When a node is not changed, its generated id does not change.
2. When a child node is changed, its parent's id does not change.
And, to make it clear that this is an id rather than a hash value:
3. Two different sub-nodes that compare equal but correspond to different parts of an expression have different ids.
So, how should I approach this?

Perhaps you could use the path from the root to the node as the id of that node. For example, for the datatype
data AST = Lit Int
         | Add AST AST
         | Neg AST
You could have something like
data ASTPathPiece = AddGoLeft
                  | AddGoRight
                  | NegGoDown
type ASTPath = [ASTPathPiece]
This satisfies conditions 2 and 3 but, alas, it doesn't satisfy 1 in general. The index into a list will change if you insert a node in a previous position, for example.
If you are rendering the AST into another format, perhaps you could add hidden attributes in the result nodes that identified which ASTPathPiece led to them. Traversing the result nodes upwards to the root would let you reconstruct the ASTPath.
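As a rough illustration of using a path as an id, here is a small Python sketch in which frozen dataclasses stand in for the Haskell data type. The function replace_at is a hypothetical helper, and paths are written as tuples of the ASTPathPiece names as plain strings.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Add:
    left: "AST"
    right: "AST"


@dataclass(frozen=True)
class Neg:
    operand: "AST"


AST = Union[Lit, Add, Neg]
ASTPath = Tuple[str, ...]  # e.g. ("AddGoRight", "NegGoDown")


def replace_at(ast: AST, path: ASTPath, new: AST) -> AST:
    """Return a copy of ast with the node addressed by path replaced by new."""
    if not path:
        return new
    step, rest = path[0], path[1:]
    if isinstance(ast, Add) and step == "AddGoLeft":
        return Add(replace_at(ast.left, rest, new), ast.right)
    if isinstance(ast, Add) and step == "AddGoRight":
        return Add(ast.left, replace_at(ast.right, rest, new))
    if isinstance(ast, Neg) and step == "NegGoDown":
        return Neg(replace_at(ast.operand, rest, new))
    raise ValueError(f"path step {step!r} does not apply to {ast!r}")
```

In Add(Lit(1), Neg(Lit(1))) the two Lit(1) nodes compare equal but are addressed by different paths, ("AddGoLeft",) and ("AddGoRight", "NegGoDown"), and replacing one of them leaves the path of the enclosing Add unchanged.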

What kind of tree is this that has letters as nodes?

I just started learning C++ and I need to create a tree of lists (Image link below) for a project but I'm not sure if it is a custom tree or a preexisting tree.
A bit on the tree; big blue blocks represent lists, smaller blocks inside represent nodes.
I am not looking for code or anything, just an explanation of the tree or links to where I could find information on it.

The data structure in the image is a trie.
Trie is an efficient information retrieval data structure. Using trie, search complexities can be brought to optimal limit (key length). - (Source : GeeksForGeeks)
Trie shown in the image is made for the following strings -
Act, Actual, Actually, And, Book, Boss, Bore, Board and Boat.
Some useful links to know more -
Trie from Hackerearth
Trie From GeeksForGeeks
Trie From Topcoder
Youtube Video on Tries

From the picture I'd use something like
#include <memory>
#include <vector>

struct List;
struct Node {
    // ... node data ...
    std::shared_ptr<List> list;
};
struct List {
    // ... list data ...
    std::vector<std::shared_ptr<Node>> nodes;
};
unless the number of nodes in a list can be huge and you need to dynamically insert/remove nodes from the middle of a list.

This looks like a custom tree to me. Projects usually build their own custom tree that combines several data structures; for example, this one is a combination of lists and linked lists.

What you're describing is an implementation of a Trie, or a prefix tree. https://en.wikipedia.org/wiki/Trie
The levels can be implemented in different ways: linked lists, bitmaps, arrays, etc. But the idea behind them is the same.

Tree searching algorithm

I'm looking for suggestions on strategies for searching a tree-like data structure.
The structure is a tree where each element is a string, each branch is a period, and a path is the concatenation of several strings and periods starting at the root. The root and edges from the root are a special case where there is no string behind them.
So given the tree,
    {root}
    /    \
   A      X
  / \    /
 B   C  Y
Valid paths are the strings "A", "A.B", "A.C", "X", and "X.Y".
What we have is a set of strings that we need to search for in this tree and find the element that terminates each string. Not all strings in the set appear in the tree. We stop searching when we find all strings. We need to run this search several times but the trees may differ each time. The set of strings to search is the same each run though.
Currently we're using depth-first search, but this isn't very efficient if all strings fall under say the last branch under the root. I feel like there should be a better way of doing this.
What would be a good algorithm for doing this repeated search? Would it be possible to leverage multithreading here as well?

It's an interesting problem; usually one would imagine a single tree being searched for a variable set of strings. Here the situation is reversed: the set of strings is fixed and the tree is highly variable.
I think that the best you can do is build a trie representing the set of strings. That way, you only have to search a tree once for any given prefix. (So, for the example strings you mentioned, you would only need to find the "A" prefix once and the "X" prefix once.) There are lots of trie data structures and algorithms for building them from a set of strings, but since that's a one-time operation for this problem, I wouldn't worry too much about the cost of this preprocessing.
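One way this could look, as a sketch only: the fixed search strings are compiled once into a trie of nested dicts, and each tree is then walked together with that trie so that branches matching no search prefix are skipped. The tree representation (a list of (name, children) pairs) and the names build_trie and find_all are assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

_END = object()  # sentinel key marking "a search string ends here"

TreeNode = Tuple[str, List[Any]]  # (name, list of child TreeNodes)


def build_trie(paths: List[str]) -> Dict[Any, Any]:
    """Compile dotted paths such as 'A.C' into a trie of nested dicts."""
    trie: Dict[Any, Any] = {}
    for p in paths:
        node = trie
        for segment in p.split("."):
            node = node.setdefault(segment, {})
        node[_END] = p
    return trie


def find_all(children: List[TreeNode], trie: Dict[Any, Any],
             found: Dict[str, TreeNode]) -> Dict[str, TreeNode]:
    """Walk tree and trie in lockstep, recording the node that ends each string."""
    for name, sub in children:
        t = trie.get(name)
        if t is None:
            continue  # no search string goes through this branch: prune it
        if _END in t:
            found[t[_END]] = (name, sub)
        find_all(sub, t, found)
    return found
```

The trie is built once and reused for every tree; strings that do not occur in a given tree are simply absent from the result.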

Is it possible to efficiently search a bit-trie for values less than the key?

I am currently storing a large number of unsigned 32-bit integers in a bit trie (effectively forming a binary tree with a node for each bit in the 32-bit value.) This is very efficient for fast lookup of exact values.
I now want to be able to search for keys that may or may not be in the trie and find the value for the first key less than or equal to the search key. Is this efficiently possible with a bit trie, or should I use a different data structure?
I am using a trie due to its speed and cache locality, and ideally want to sacrifice neither.
For example, suppose the trie has two keys added:
0x00AABBCC
0x00AABB00
and I am now searching for a key that is not present, 0x00AABB11. I would like to find the first key present in the tree with a value <= the search key, which in this case would be the node for 0x00AABB00.
While I've thought of a possible algorithm for this, I am seeking concrete information on if it is efficiently possible and/or if there are known algorithms for this, which will no doubt be better than my own.

We can think of a bit trie as a binary search tree; in fact, it is one. Take the 32-bit trie as an example, with the left child as 0 and the right child as 1. At the root, the left subtree holds the numbers less than 0x80000000 and the right subtree holds the numbers no less than 0x80000000, and so on down the levels. So you can use the same method you would use to find the largest item not larger than the search key in a binary search tree. Don't worry about the backtracking: the walk back up and down again is bounded by the key length, so it doesn't change the search complexity.
When a match fails in the bit trie, backtrack to the nearest ancestor at which the search went right (bit 1) and a left child exists, then follow the right-most path down that left subtree.
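A minimal sketch of this backtracking search, assuming a simple node layout (each node is a two-element list [child0, child1]) rather than the asker's actual implementation; BitTrie, insert and floor are hypothetical names. Instead of climbing back up, the walk remembers the deepest point where it went right while a left sibling existed.

```python
from typing import List, Optional

BITS = 32


class BitTrie:
    def __init__(self) -> None:
        self.root: List[Optional[list]] = [None, None]

    def insert(self, key: int) -> None:
        node = self.root
        for i in range(BITS - 1, -1, -1):
            b = (key >> i) & 1
            if node[b] is None:
                node[b] = [None, None]
            node = node[b]

    def floor(self, key: int) -> Optional[int]:
        """Largest stored key <= key, or None if there is none."""
        node = self.root
        prefix = 0       # bits of key consumed so far
        fallback = None  # (left sibling subtree, its prefix, bits left below it)
        for i in range(BITS - 1, -1, -1):
            b = (key >> i) & 1
            if b == 1 and node[0] is not None:
                # Everything under the 0-child here is smaller than key.
                fallback = (node[0], prefix << 1, i)
            if node[b] is None:
                break
            node = node[b]
            prefix = (prefix << 1) | b
        else:
            return key  # walked all 32 bits: exact match
        if fallback is None:
            return None
        node, prefix, remaining = fallback
        for _ in range(remaining):
            # Take the right-most path to get the maximum of this subtree.
            b = 1 if node[1] is not None else 0
            node = node[b]
            prefix = (prefix << 1) | b
        return prefix
```

With the two keys from the question inserted, floor(0x00AABB11) gives 0x00AABB00, and the whole query touches at most 2 * 32 nodes.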

If the data is static--you're not adding or removing items--then I'd take a good look at using a simple array with binary search. You sacrifice cache locality, but that might not be catastrophic. I don't see cache locality as an end in itself, but rather a means of making the data structure fast.
You might get better cache locality by creating a balanced binary tree in an array. Position 0 is the root node, position 1 is left node, position 2 is right node, etc. It's the same structure you'd use for a binary heap. If you're willing to allocate another 4 bytes per node, you could make it a left-threaded binary tree so that if you search for X and end up at the next larger value, following that left thread would give you the next smaller value. All told, though, I don't see where this can outperform the plain array in the general case.
A lot depends on how sparse your data is and what the range is. If you're looking at a few thousand possible values in the range 0 to 4 billion, then the binary search looks pretty attractive. If you're talking about 500 million distinct values, then I'd look at allocating a bit array (500 megabytes) and doing a direct lookup with linear backward scan. That would give you very good cache locality.
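For the static case, the sorted array with binary search mentioned above can be sketched with Python's bisect module; floor_key is a hypothetical helper name.

```python
import bisect
from typing import List, Optional


def floor_key(sorted_keys: List[int], key: int) -> Optional[int]:
    """Largest element of sorted_keys that is <= key, or None if there is none."""
    # bisect_right returns the index just past the last element <= key.
    i = bisect.bisect_right(sorted_keys, key)
    return sorted_keys[i - 1] if i else None
```

The list has to be sorted once up front; each query is then O(log n) comparisons.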

A bit trie walks 32 nodes in the best case when the item is found.
A million entries in a red-black tree like std::map or java.util.TreeMap would require about log2(1,000,000), or roughly 20, nodes per query when the tree is well balanced (a red-black tree guarantees a height of at most 2*log2(n+1), so about 40 in the worst case). You also do not always need to go to the bottom of the tree, which makes the average case appealing.
When backtracking to find <=, the difference is even more pronounced.
The fewer entries you have, the better the case for a red-black tree.
At a minimum, I would compare any solution to a red-black tree.

How is insert O(log(n)) in Data.Set?

When looking through the docs of Data.Set, I saw that insertion of an element into the tree is mentioned to be O(log(n)). However, I would intuitively expect it to be O(n*log(n)) (or maybe O(n)?), as referential transparency requires creating a full copy of the previous tree in O(n).
I understand that for example (:) can be made O(1) instead of O(n), as here the full list doesn't have to be copied; the new list can be optimized by the compiler to be the first element plus a pointer to the old list (note that this is a compiler - not a language level - optimization). However, inserting a value into a Data.Set involves rebalancing that looks quite complex to me, to the point where I doubt that there's something similar to the list optimization. I tried reading the paper that is referenced by the Set docs, but couldn't answer my question with it.
So: how can inserting an element into a binary tree be O(log(n)) in a (purely) functional language?

There is no need to make a full copy of a Set in order to insert an element into it. Internally, elements are stored in a tree, which means that you only need to create new nodes along the path of the insertion. Untouched nodes can be shared between the pre-insertion and post-insertion versions of the Set. And as Deitrich Epp pointed out, in a balanced tree O(log(n)) is the length of the path of the insertion. (Sorry for omitting that important fact.)
Say your Tree type looks like this:
data Tree a = Node a (Tree a) (Tree a)
            | Leaf
... and say you have a Tree that looks like this
let t = Node 10 tl (Node 15 Leaf tr')
... where tl and tr' are some named subtrees. Now say you want to insert 12 into this tree. Well, that's going to look something like this:
let t' = Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr')
The subtrees tl and tr' are shared between t and t', and you only had to construct 3 new Nodes to do it, even though the size of t could be much larger than 3.
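The same sharing can be observed in a small Python sketch of path copying, with an immutable NamedTuple standing in for the Tree type and None standing in for Leaf. This is an unbalanced insert for illustration only, not what Data.Set does internally.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    value: int
    left: Optional["Node"]
    right: Optional["Node"]


def insert(tree: Optional[Node], x: int) -> Node:
    """Return a new tree containing x; only nodes on the search path are rebuilt."""
    if tree is None:
        return Node(x, None, None)
    if x < tree.value:
        # The right subtree is reused as is, not copied.
        return Node(tree.value, insert(tree.left, x), tree.right)
    if x > tree.value:
        # The left subtree is reused as is, not copied.
        return Node(tree.value, tree.left, insert(tree.right, x))
    return tree  # already present: the old tree can be returned unchanged


tl = Node(5, None, None)
tr = Node(20, None, None)
t = Node(10, tl, Node(15, None, tr))
t2 = insert(t, 12)
print(t2.left is tl, t2.right.right is tr)  # both subtrees are shared objects
```

The old tree t is still intact after the call, and t2 points at the very same tl and tr objects.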
EDIT: Rebalancing
With respect to rebalancing, think about it like this, and note that I claim no rigor here. Say you have an empty tree. Already balanced! Now say you insert an element. Already balanced! Now say you insert another element. Well, there's an odd number so you can't do much there.
Here's the tricky part. Say you insert another element. This could go two ways: left or right; balanced or unbalanced. In the case that it's unbalanced, you can clearly perform a rotation of the tree to balance it. In the case that it's balanced, already balanced!
What's important to note here is that you're constantly rebalancing. It's not like you have a mess of a tree, decided to insert an element, but before you do that, you rebalance, and then leave a mess after you've completed the insertion.
Now say you keep inserting elements. The tree's gonna get unbalanced, but not by much. And when that does happen, first off you're correcting that immediately, and secondly, the correction occurs along the path of the insertion, which is O(log(n)) in a balanced tree. The rotations in the paper you linked to touch at most three nodes in the tree per rotation, so you're doing O(3 * log(n)) work when rebalancing. That's still O(log(n)).

To add extra emphasis to what dave4420 said in a comment, there are no compiler optimizations involved in making (:) run in constant time. You could implement your own list data type, and run it in a simple non-optimizing Haskell interpreter, and it would still be O(1).
A list is defined to be an initial element plus a list (or it's empty in the base case). Here's a definition that's equivalent to native lists:
data List a = Nil | Cons a (List a)
So if you've got an element and a list, and you want to build a new list out of them with Cons, that's just creating a new data structure directly from the arguments the constructor requires. There is no more need to even examine the tail list (let alone copy it), than there is to examine or copy the string when you do something like Person "Fred".
You are simply mistaken when you claim that this is a compiler optimization and not a language level one. This behaviour follows directly from the language level definition of the list data type.
Similarly, for a tree defined to be an item plus two trees (or an empty tree), when you insert an item into a non-empty tree it must either go in the left or right subtree. You'll need to construct a new version of that tree containing the element, which means you'll need to construct a new parent node containing the new subtree. But the other subtree doesn't need to be traversed at all; it can be put in the new parent tree as is. In a balanced tree, that's a full half of the tree that can be shared.
Applying this reasoning recursively should show you that there's actually no copying of data elements necessary at all; there's just the new parent nodes needed on the path down to the inserted element's final position. Each new node stores 3 things: an item (shared directly with the item reference in the original tree), an unchanged subtree (shared directly with the original tree), and a newly created subtree (which shares almost all of its structure with the original tree). There will be O(log(n)) of those in a balanced tree.
