Best suited data structure for prefix matching search

I have to create a customer list system (up to 10 million customers). Each customer has a unique ID of 10 characters: the first 3 are upper-case letters and the last 7 are digits (e.g. LQK0333208, HCK1646129). The system must perform two search operations as fast as possible (exact matching search and partial matching search):
For the exact matching search, users enter a complete Customer ID, and the system displays the details of the matching customer, or an error message if there is no matching customer.
For the partial matching search, users enter the first 5 to 8 characters of a Customer ID, and the system displays the details of the matching customers, or an error message if there is no matching customer. If more than 10 customers match, only 10 of them are displayed.
So what is a suitable data structure for this system? Currently, I am using an AVL tree:
For the exact matching search, I perform a binary search down the tree (left or right subtree): O(log(n)).
For the partial matching search, I perform an inorder traversal of the AVL tree and check whether each customer has the requested prefix. This is a linear search: O(n).
But I want the partial matching search to have a better time complexity.
Any suggestions for a data structure that suits these requirements?
EDIT 1: I have tested the program with a trie and a ternary search tree, but with a larger dataset (10 million customers) I could not fit those structures in memory. Any suggestions?
EDIT 2: I have tested a sorted array and it works well with the data set of 10 million customers. Actually, this was my first approach, before I knew anything about tries or ternary search trees. As far as I understand, we first store all the customers in an array, sort it with an algorithm such as quicksort, and then use binary search to find a key, which is O(log(n)) per search. Quite good! But in the long term, when we need to add data to the existing array (for instance just one more customer), adding the new element takes O(n) in the worst case, as we need to find the insertion point and shift the elements after it.
With a trie or ternary search tree, on the other hand, adding a new element might take just O(1), since the IDs have a fixed length and we only need to walk down the tree along the string. If we don't mind the space complexity, I think a trie or ternary search tree suits this project best.

A suitable data structure for this is a trie. This is a tree of all prefixes, where each node (except the root) represents a character, and each possible path from root to a leaf will be a character sequence that corresponds to a valid ID.
A partial match means that there is a path from the root that ends in an internal node.
If implemented with an efficient child lookup, a match can in this particular use case be found in 10 steps. So if we consider 10 to be a constant, the match can be done in constant time, irrespective of how large (i.e. how wide) the tree is. This assumes that looking up a child by its character can be done in constant time (on average).
As in this particular use case the alphabet is limited (upper case only or digit only), a node can have at most 26 child entries, which could be stored in an array of that size, where the indexes map to the corresponding character. This will ensure constant time for stepping from a parent node to the relevant child node. Alternatively a hashing system can also be used (instead of an array with 26 slots).
Here is a demo implementation in JavaScript (using a plain object for the children, i.e. a "dictionary"):
class TrieNode {
    constructor(data=null) {
        this.children = {}; // Dictionary, <character, TrieNode>
        this.data = data; // Non-null when this node represents the end of a valid word
    }
    addWord(word, data) {
        let node = this; // the root of the tree
        for (let ch of word) {
            if (!(ch in node.children)) {
                node.children[ch] = new TrieNode();
            }
            node = node.children[ch]; // Walk down the tree
        }
        node.data = data;
    }
    *getAllData() { // This method returns an iterator over all data in this subtree
        if (this.data != null) yield this.data;
        // Recursively yield all data in the children's subtrees
        for (const child in this.children) yield* this.children[child].getAllData();
    }
    *find(prefix) { // This method returns an iterator over matches
        let node = this;
        // Find the node where this prefix ends:
        for (let ch of prefix) {
            if (!(ch in node.children)) return; // No matches
            node = node.children[ch];
        }
        // Yield all data in this subtree
        yield* node.getAllData();
    }
}
class Customer {
constructor(id, name) {
this.id = id;
this.name = name;
}
toString() {
return this.name + " (" + this.id + ")";
}
}
// Demo
// Create some Customer data:
const database = [
new Customer('LQK0333208', 'Hanna'),
new Customer('LQK0333311', 'Bert'),
new Customer('LQK0339999', 'Joline'),
new Customer('HCK1646129', 'Sarah'),
new Customer('HCK1646130', 'Pete'),
new Customer('HCK1700012', 'Cristine')
];
// Build a trie for the database of customers
const trie = new TrieNode(); // The root node of the trie.
for (const customer of database) {
trie.addWord(customer.id, customer);
}
// Make a few queries
console.log("query: LQK0333");
for (const customer of trie.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of trie.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of trie.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of trie.find("LQK09")) console.log("found: " + customer);
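The fixed-size child array mentioned above (as an alternative to the dictionary) could look roughly like the following. This is only a minimal sketch, assuming IDs contain nothing but 'A'-'Z' and '0'-'9'; the names `SLOT_COUNT`, `slotOf` and `ArrayTrieNode` are hypothetical and not part of the demo above. It uses 36 slots so that one node type can serve both the letter and the digit positions.

```javascript
// Sketch: trie node whose children live in a fixed-size array instead of a dictionary.
// Assumption: keys contain only 'A'-'Z' and '0'-'9'.
const SLOT_COUNT = 36; // 26 letters + 10 digits

function slotOf(ch) {
    const code = ch.charCodeAt(0);
    if (code >= 65 && code <= 90) return code - 65;        // 'A'..'Z' -> 0..25
    if (code >= 48 && code <= 57) return 26 + (code - 48); // '0'..'9' -> 26..35
    throw new RangeError("Unsupported character: " + ch);
}

class ArrayTrieNode {
    constructor() {
        this.children = new Array(SLOT_COUNT).fill(null);
        this.data = null; // Non-null when a key ends at this node
    }
    addWord(word, data) {
        let node = this;
        for (const ch of word) {
            const i = slotOf(ch);
            if (node.children[i] === null) node.children[i] = new ArrayTrieNode();
            node = node.children[i]; // Walk down the tree
        }
        node.data = data;
    }
    // Returns the node where `prefix` ends, or null when no key has that prefix
    findNode(prefix) {
        let node = this;
        for (const ch of prefix) {
            node = node.children[slotOf(ch)];
            if (node === null) return null;
        }
        return node;
    }
}

const arrayTrie = new ArrayTrieNode();
arrayTrie.addWord("LQK0333208", "Hanna");
arrayTrie.addWord("HCK1646129", "Sarah");
console.log(arrayTrie.findNode("LQK0333208").data); // Hanna
console.log(arrayTrie.findNode("LQK09"));           // null
```

The array trades memory for lookup speed: every node reserves all 36 slots, even when only one child exists, which matters given the memory concern raised in EDIT 1 of the question.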
Sorted Array
Another approach is to store the Customer records in a sorted array. JavaScript has no such data structure, but splice is surprisingly fast in JavaScript, so you could just maintain a sorted order by inserting new entries in their sorted position. Binary search can be used to locate the index where to find or insert an entry:
class SortedArray {
    constructor(keyField) {
        this.arr = [];
        this.keyField = keyField;
    }
    addObject(obj) {
        const i = this.indexOf(obj[this.keyField]);
        if (this.arr[i]?.[this.keyField] === obj[this.keyField]) throw "Duplicate not added";
        this.arr.splice(i, 0, obj);
    }
    *find(prefix) { // This method returns an iterator over matches
        for (let i = this.indexOf(prefix); i < this.arr.length; i++) {
            const obj = this.arr[i];
            if (!obj[this.keyField].startsWith(prefix)) return;
            yield obj;
        }
    }
    indexOf(key) { // Binary search: index of the key, or of its insertion point
        let low = 0, high = this.arr.length;
        while (low < high) {
            const mid = (low + high) >> 1;
            if (key === this.arr[mid][this.keyField]) return mid;
            if (key > this.arr[mid][this.keyField]) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}
class Customer {
constructor(id, name) {
this.id = id;
this.name = name;
}
toString() {
return this.name + " (" + this.id + ")";
}
}
const database = [
new Customer('LQK0333208', 'Hanna'),
new Customer('LQK0333311', 'Bert'),
new Customer('LQK0339999', 'Joline'),
new Customer('HCK1646129', 'Sarah'),
new Customer('HCK1646130', 'Pete'),
new Customer('HCK1700012', 'Cristine')
];
const arr = new SortedArray("id");
for (const customer of database) {
arr.addObject(customer);
}
console.log("query: LQK0333");
for (const customer of arr.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of arr.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of arr.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of arr.find("LQK09")) console.log("found: " + customer);
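The question asks for at most 10 results per partial search. Since `find` is a generator in both demos, the results could be cut off lazily, without collecting all matches first. A minimal sketch follows; the helper `take` and the stand-in generator `fakeFind` are hypothetical names used only for illustration.

```javascript
// Sketch: lazily keep only the first `limit` items of any iterable.
function* take(iterable, limit) {
    if (limit <= 0) return;
    let count = 0;
    for (const item of iterable) {
        yield item;
        if (++count >= limit) return; // Stop early; the source is not consumed further
    }
}

// Stand-in for trie.find(prefix) or arr.find(prefix): yields 25 fake matches
function* fakeFind() {
    for (let i = 0; i < 25; i++) yield "match" + i;
}

const firstTen = [...take(fakeFind(), 10)];
console.log(firstTen.length); // 10
```

With the demos above this would presumably be used as `take(trie.find(prefix), 10)` or `take(arr.find(prefix), 10)`.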

Distinct values in Azure Search Suggestions?

I am offloading my search feature on a relational database to Azure Search. My Products tables contains columns like serialNumber, PartNumber etc.. (there can be multiple serialNumbers with the same partNumber).
I want to create a suggestor that can autocomplete partNumbers. But in my scenario I am getting a lot of duplicates in the suggestions because the partNumber match was found in multiple entries.
How can I solve this problem ?

The Suggest API suggests documents, not queries. If you repeat the partNumber information for each serialNumber in your index and then suggest based on partNumber, you will get a result for each matching document. You can see this more clearly by including the key field in the $select parameter. Azure Search will eliminate duplicates within the same document, but not across documents. You will have to do that on the client side, or build a secondary index of partNumbers just for suggestions.
See this forum thread for a more in-depth discussion.
Also, feel free to vote on this UserVoice item to help us prioritize improvements to Suggestions.
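The client-side de-duplication mentioned above might look like this in JavaScript. It is only a sketch: it assumes the suggestions have already been fetched and that each one is an object with a `partNumber` property, and the function name `distinctByPartNumber` and the sample data are hypothetical.

```javascript
// Sketch: keep the first suggestion per part number, up to `limit` results.
// Assumption: each suggestion is an object with a `partNumber` property.
function distinctByPartNumber(suggestions, limit) {
    const seen = new Set();
    const result = [];
    for (const s of suggestions) {
        if (result.length >= limit) break;      // Enough distinct entries collected
        if (seen.has(s.partNumber)) continue;   // Duplicate across documents: skip
        seen.add(s.partNumber);
        result.push(s);
    }
    return result;
}

const sampleSuggestions = [
    { partNumber: "PN-100", serialNumber: "S1" },
    { partNumber: "PN-100", serialNumber: "S2" },
    { partNumber: "PN-200", serialNumber: "S3" },
];
console.log(distinctByPartNumber(sampleSuggestions, 10).map(s => s.partNumber));
// [ 'PN-100', 'PN-200' ]
```

Because duplicates are dropped after the fact, more suggestions than needed have to be requested from the service to end up with enough distinct ones.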

I'm facing this problem myself. My solution does not involve a new index (that would only get messy and cost us money).
My take on this is a while-loop that adds each 'UserIdentity' already found (in your case, 'partNumber') to a filter and searches again, until my take/top limit is met or no more suggestions exist:
public async Task<List<MachineSuggestionDTO>> SuggestMachineUser(string searchText, int take, string[] searchFields)
{
    var indexClientMachine = _searchServiceClient.Indexes.GetClient(INDEX_MACHINE);
    var suggestions = new List<MachineSuggestionDTO>();
    var sp = new SuggestParameters
    {
        UseFuzzyMatching = true,
        Top = 100 // Get the maximum number of results to reduce the number of search calls.
    };
    // Add search fields if set
    if (searchFields != null && searchFields.Count() != 0)
    {
        sp.SearchFields = searchFields;
    }
    // Loop until we have the desired amount of suggestions, or until no more can be found.
    while (suggestions.Count < take)
    {
        if (!await DistinctSuggestMachineUser(searchText, take, searchFields, suggestions, indexClientMachine, sp))
        {
            // If no more suggestions are found, we break the while-loop
            break;
        }
    }
    // Since the list might be bigger than the take, we return a narrowed list
    return suggestions.Take(take).ToList();
}
private async Task<bool> DistinctSuggestMachineUser(string searchText, int take, string[] searchFields, List<MachineSuggestionDTO> suggestions, ISearchIndexClient indexClientMachine, SuggestParameters sp)
{
    var response = await indexClientMachine.Documents.SuggestAsync<MachineSearchDocument>(searchText, SUGGESTION_MACHINE, sp);
    if (response.Results.Count > 0)
    {
        // Extend the existing filter if the search is triggered once more
        if (!string.IsNullOrEmpty(sp.Filter))
        {
            sp.Filter += " and ";
        }
        foreach (var result in response.Results.DistinctBy(r => new { r.Document.UserIdentity, r.Document.UserName, r.Document.UserCode}).Take(take))
        {
            var d = result.Document;
            suggestions.Add(new MachineSuggestionDTO { Id = d.UserIdentity, Namn = d.UserNamn, Hkod = d.UserHkod, Intnr = d.UserIntnr });
            // Add the found UserIdentity to the filter so it is excluded from the next search
            sp.Filter += $"UserIdentity ne '{d.UserIdentity}' and ";
        }
        // Remove the trailing " and " so the filter stays valid for the next run
        if (sp.Filter.EndsWith(" and "))
        {
            sp.Filter = sp.Filter.Substring(0, sp.Filter.LastIndexOf(" and ", StringComparison.Ordinal));
        }
    }
    // Returns false if no more suggestions are found
    return response.Results.Count > 0;
}

public async Task<List<string>> SuggestionsAsync(bool highlights, bool fuzzy, string term)
{
SuggestParameters sp = new SuggestParameters()
{
UseFuzzyMatching = fuzzy,
Top = 100
};
if (highlights)
{
sp.HighlightPreTag = "<em>";
sp.HighlightPostTag = "</em>";
}
var suggestResult = await searchConfig.IndexClient.Documents.SuggestAsync(term, "mysuggestion", sp);
// Convert the suggest query results to a list that can be displayed in the client.
return suggestResult.Results.Select(x => x.Text).Distinct().Take(10).ToList();
}
After getting top 100 and using distinct it works for me.

You can use the Autocomplete API for that, which does the grouping by default. However, it does not support returning more fields together with the result, such as the partNo plus a description. The partNo will be distinct though.

Which algorithm to find the only one duplicate word in a string?

This is a very common interview question:
There's an all-English sentence that contains exactly one duplicated word, for example:
input string: today is a good day is true
output: is
I have an idea:
1. Read every character from the string, feeding it to some hash function, until reaching a space (' '); then put the resulting hash value in a hash table.
2. Repeat Step 1 until the end of the string. If a hash value is already in the table, return that word; otherwise return null.
Is that practical?

Your approach is reasonable (actually the best I can think of). Still, take into account that a collision may occur: even if two hashes are the same, compare the words themselves.
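A minimal sketch of that idea in JavaScript, assuming words are separated by single spaces. The function name `firstDuplicateWord` is made up for illustration. A `Set` compares the actual strings, so the "compare the words on a collision" step is handled by the engine rather than by hand.

```javascript
// Sketch: return the first word that occurs a second time, or null if none does.
function firstDuplicateWord(sentence) {
    const seen = new Set(); // Hash-based set of the words seen so far
    for (const word of sentence.split(" ")) {
        if (word === "") continue;       // Skip empty pieces from repeated spaces
        if (seen.has(word)) return word; // Same word seen before
        seen.add(word);
    }
    return null;
}

console.log(firstDuplicateWord("today is a good day is true")); // is
```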

It would work, but you can make your life a lot easier.
Are you bound to a specific programming language?
If you code in C#, for example, I would suggest you use the String.Split function (splitting by " ") to transform your sentence into a list of words. Then you can easily find duplicates by using LINQ (see How to get duplicate items from a list using LINQ?) or by iterating through your list.

You can use a Map, which also lets you return how many times the duplicate word occurs in the string.
var a = 'sometimes I feel clever and sometimes not';
var findDuplicateWord = a => {
var map = new Map();
a = a.split(' ');
a.forEach(e => {
if (map.has(e)) {
let count = map.get(e);
map.set(e, count + 1);
} else {
map.set(e, 1);
}
});
let dupe = [];
let hasDupe = false;
map.forEach((value, key) => {
if (value > 1) {
hasDupe = true;
dupe.push(key, value);
}
});
console.log(dupe);
return hasDupe;
};
findDuplicateWord(a);
//output
/* Native Browser JavaScript
[ 'sometimes', 2 ]
=> true */

How can I transform a notes view to a html nested list?

I would like to re-use a Notes view in a web browser. Therefore I need the Notes view (with its response document hierarchy) represented in HTML as an unordered list (ul) with list items (li).
What SSJS code should I use to compute this list?

None.
If you can edit the view, set it to passthru HTML and add one column at the beginning and one at the end with the list tags. Set them hidden from the client.
Or bind it to a repeat control and put the li tags inside, with computed text bound to the view columns. No SSJS in either case.

NotesViewEntry.getPosition(Char separator) gives a hierarchical output. For example with the separator defined as "." it will give 3 for the third top-level entry, 3.5 for the fifth child of the third top-level entry, 3.5.7 for the seventh child of the fifth child of the third top-level entry.
To elaborate on Stephan's second option, a Repeat Control doesn't care about the structure of the data it's retrieving. It's a handle to a collection, where each "row" is one element in that collection. So if you point it to a collection which is myView.getAllEntries(), each entry is a NotesViewEntry.
Combine the two and you have the level of the hierarchy, if you want to just use indentation. Alternatively, from a NotesViewEntry you can tell if there are children, so whether you need to make it another li or start another ul.
Alternatively, if you want to get more elaborate, look at how I traverse views to create a Dojo Tree Grid navigation in XPages Help Application http://www.openntf.org/internal/home.nsf/project.xsp?action=openDocument&name=XPages%20Help%20Application
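As a rough illustration of using the position strings to drive the nesting, here is a plain JavaScript sketch. It assumes the positions were already collected with getPosition(".") in view order and that the texts are HTML-safe. The `entries` shape (`position`, `text`) and the function name `toNestedList` are hypothetical, and no Notes API is called here.

```javascript
// Sketch: turn entries in view order into a nested <ul>, using position depth as level.
// Assumption: entries look like { position: "3.5", text: "..." }.
function toNestedList(entries) {
    let html = "";
    let depth = 0; // Number of <ul> elements currently open
    for (const entry of entries) {
        const level = entry.position.split(".").length; // "3.5.7" -> 3
        if (level > depth) {
            // Going deeper: open lists (normally one level at a time)
            while (depth < level) { html += "<ul>"; depth++; }
        } else {
            // Same level or shallower: close the previous item and any deeper lists
            html += "</li>";
            while (depth > level) { html += "</ul></li>"; depth--; }
        }
        html += "<li>" + entry.text; // Left open so children can nest inside it
    }
    // Close whatever is still open
    while (depth > 0) { html += "</li></ul>"; depth--; }
    return html;
}

const sampleEntries = [
    { position: "1", text: "A" },
    { position: "1.1", text: "A1" },
    { position: "2", text: "B" },
];
console.log(toNestedList(sampleEntries));
// <ul><li>A<ul><li>A1</li></ul></li><li>B</li></ul>
```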

Not the most beautiful code, but I hope it works:
function getList() {
    var nav:NotesViewNavigator = database.getView("notesview").createViewNav();
    var entry:NotesViewEntry = nav.getFirst();
    if (entry != null) {
        var countLevel:Integer = 0; // number of <ul> elements currently open
        var curLevel:Integer;
        var list = "";
        while (entry != null) {
            var edoc:NotesDocument = entry.getDocument();
            var entryValue = entry.getColumnValues().elementAt(1).toString();
            var col:NotesDocumentCollection = edoc.getResponses();
            var gotResponse = col.getCount() > 0;
            curLevel = entry.getColumnIndentLevel();
            if (curLevel < countLevel) {
                // moved back up the hierarchy: close the lists that are still open;
                // the entry itself is written by the block below, so it is not added here
                var difLevel = countLevel - curLevel;
                var closure = "";
                for (var i = 0; i < difLevel; i++) {
                    closure = closure + "</ul></li>";
                }
                list = list + closure;
                countLevel = curLevel;
            }
            if (curLevel == countLevel) {
                if (gotResponse) {
                    // got responses; open a nested list for them
                    list = list + "<li>";
                    list = list + entryValue;
                    list = list + "<ul>";
                    countLevel = curLevel + 1;
                } else {
                    // no responses: a plain list item
                    list = list + "<li>" + entryValue + "</li>";
                }
            }
            var tmpentry:NotesViewEntry = nav.getNext(entry);
            entry.recycle();
            entry = tmpentry;
        }
        // final closure, last entry could be a response doc
        var closure = "";
        for (var i = 0; i < countLevel; i++) {
            closure = closure + "</ul></li>";
        }
        list = list + closure;
        return list;
    } else {
        return "No documents found";
    }
}

How can I get the latest record using the FirstOrDefault() method?

Suppose I have 2 records in the database:
1) 2007-12-10 10:35:31.000
2) 2008-12-10 10:35:31.000
The FirstOrDefault() method gives me the first matching record in the sequence, i.e. 2007-12-10 10:35:31.000, but I need the latest one, which is 2008-12-10 10:35:31.000.
if ((from value in _names where value != null select value.ExpiryDate < now).Any())
{
return _names.FirstOrDefault();
}

You can use:
return _names.LastOrDefault();
However, your if just sends another unnecessary query (and it is a wrong query too). If you don't have any record, LastOrDefault and FirstOrDefault will return null. You can use something like this to improve the code:
var name = _names.LastOrDefault();
if(name != null)
{
return name;
}
// other code here
If you really want to use FirstOrDefault, you should order descending, like:
var name = _names.Where(n => n.ExpiryDate < now).OrderByDescending(n => n.ExpiryDate).FirstOrDefault();

Map/Reduce differences between Couchbase & Cloudant

I've been playing around with Couchbase Server and now just tried replicating my local db to Cloudant, but am getting conflicting results for my map/reduce function pair to build a set of unique tags with their associated projects...
// map.js
function(doc) {
if (doc.tags) {
for(var t in doc.tags) {
emit(doc.tags[t], doc._id);
}
}
}
// reduce.js
function(key,values,rereduce) {
if (!rereduce) {
var res=[];
for(var v in values) {
res.push(values[v]);
}
return res;
} else {
return values.length;
}
}
In Couchbase Server this returns JSON like:
{"rows":[
{"key":"3d","value":["project1","project3","project8","project10"]},
{"key":"agents","value":["project2"]},
{"key":"fabrication","value":["project3","project5"]}
]}
That's exactly what I wanted & expected. However, the same query on the Cloudant replica, returns this:
{"rows":[
{"key":"3d","value":4},
{"key":"agents","value":1},
{"key":"fabrication","value":2}
]}
So it somehow only returns the length of the value array... Highly confusing & am grateful for any insights by some M&R ninjas... ;)

It looks like this is exactly the behavior you would expect given your reduce function. The key part is this:
else {
return values.length;
}
In Cloudant, rereduce is always called (since the reduce needs to span over multiple shards.) In this case, rereduce calls values.length, which will only return the length of the array.
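To illustrate the point, a reduce function that returns the same shape in both phases could look like the sketch below. This is not the asker's code. The name `reduceTags` and the simulated shard calls are made up for illustration. Note that the result grows with the number of documents, which reduce functions are generally not meant to do.

```javascript
// Sketch: a reduce that returns a list of ids whether or not it is a rereduce.
function reduceTags(keys, values, rereduce) {
    if (!rereduce) {
        return values.slice(); // values are the doc ids emitted by map
    }
    // values are arrays returned by earlier reduce calls: flatten them
    return values.reduce(function (all, part) {
        return all.concat(part);
    }, []);
}

// Simulate two shards reducing separately, then a rereduce over their results
const shardA = reduceTags(null, ["project1", "project3"], false);
const shardB = reduceTags(null, ["project8", "project10"], false);
console.log(reduceTags(null, [shardA, shardB], true));
// [ 'project1', 'project3', 'project8', 'project10' ]
```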

I prefer to reduce/re-reduce implicitly rather than depending on the rereduce parameter.
function(doc) { // map
if (doc.tags) {
for(var t in doc.tags) {
emit(doc.tags[t], {id:doc._id, tag:doc.tags[t]});
}
}
}
Then reduce checks whether it is accumulating document ids from the identical tag, or whether it is just counting different tags.
function(keys, vals, rereduce) {
    var initial_tag = vals[0].tag;
    return vals.reduce(function(state, val) {
        if (initial_tag && val.tag === initial_tag) {
            // Accumulate ids which produced this tag.
            var ids = state.ids;
            if (!ids)
                ids = [ state.id ]; // Build initial list from the state's id.
            return { tag: val.tag
                   , ids: ids.concat([val.id])
                   };
        } else {
            var state_count = state.ids ? state.ids.length : state;
            var val_count = val.ids ? val.ids.length : val;
            return state_count + val_count;
        }
    })
}
(I didn't test this code, but you get the idea.) As long as the tag value is the same, it doesn't matter whether it's a reduce or rereduce. Once different tags start reducing together, it detects that because the tag value will change. So at that point just start accumulating.
I have used this trick before, although IMO it's rarely worth it.
Also in your specific case, this is a dangerous reduce function. You are building a wide list to see all the docs that have a tag. CouchDB likes tall lists, not fat lists. If you want to see all the docs that have a tag, you could map them.
for(var a = 0; a < doc.tags.length; a++) {
emit(doc.tags[a], doc._id);
}
Now you can query /db/_design/app/_view/docs_by_tag?key="3d" and you should get
{"total_rows":287,"offset":30,"rows":[
{"id":"project1","key":"3d","value":"project1"},
{"id":"project3","key":"3d","value":"project3"},
{"id":"project8","key":"3d","value":"project8"},
{"id":"project10","key":"3d","value":"project10"}
]}
