First, I am very new to Node.js. I want a function in Node.js that divides 2^128 into 'n' equal ranges and returns a list of length n. I should be able to use this list to determine which range a given integer belongs to. I expect the code to look like the following:
function divideEqually(n){
/**
code here
**/
return aListOfRanges;
}
function findIndex(hashDigest, aListOfRanges){
/*
inspect ranges and find index
*/
return someIndexInList;
}
var list = divideEqually(15); //Returns a list of 15 equally spaced ranges
var index = findIndex('d9729feb74992cc3482b350163a1a010', list) //Find index of a hex digest
How do I do this efficiently? The ranges should be computed lazily, as 2^128 is a huge number, whereas 'n' is expected to be less than 20.
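One possible sketch, assuming Node.js 10.4 or later so that BigInt is available. Instead of storing the boundaries, the index is computed directly as floor(value * n / 2^128), and a boundary is only worked out when it is asked for. The `lowerBound` helper and the shape of the returned object are illustrative choices, not an established API.

```javascript
const SPACE = 1n << 128n; // 2^128 as a BigInt

// Returns a lightweight description of the n ranges; nothing is precomputed.
function divideEqually(n) {
  const count = BigInt(n);
  return {
    length: n,
    // Inclusive lower bound of range i, i.e. ceil(i * 2^128 / n), computed on demand.
    lowerBound: (i) => (BigInt(i) * SPACE + count - 1n) / count,
  };
}

// hashDigest is assumed to be a hex string of at most 32 hex digits (128 bits).
function findIndex(hashDigest, ranges) {
  const value = BigInt("0x" + hashDigest);
  // floor(value * n / 2^128), done with a shift instead of a division.
  return Number((value * BigInt(ranges.length)) >> 128n);
}

const list = divideEqually(15);
console.log(findIndex("d9729feb74992cc3482b350163a1a010", list));
```

Because 2^128 is not divisible by every n, the ranges can differ in size by one; the shift-based index and `lowerBound` are consistent with each other about where each boundary falls.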
Imagine an object with strings as keys and their frequency count of occurrence as the value.
{"bravo charlie": 10, "alpha bravo charlie": 10, "delta echo foxtrot": 15, "delta echo": 7}
I am trying to optimize an algorithm such that A) any key that is a substring of another key AND has the same frequency value should be eliminated. The longer containing key should remain. B) Allow keys that are only a single word to remain even if contained by another key
The following pairwise comparison approach works but becomes very very slow on large objects. For example, an object with 560k keys is taking ~30 mins to complete the pairwise comparison:
// for every multi word key
// if it is part of a longer key in candidates AND has the same frequency, delete it
let keys = Object.keys(candidates)
.sort((a, b) => b.length - a.length)
.filter(key => {
if (key.split(" ").length === 1) {
return false
}
return true
});
// ^ order keys by length to speed up comparisons and filter out single words
checkKeyI: for (const keyi of keys) {
checkKeyJ: for (const keyj of keys) {
// because we pre-sorted if length is less then we are done with possible matches
if (keyj.length <= keyi.length) {
continue checkKeyI
}
// keys must not match exactly
if (keyj === keyi) {
continue checkKeyJ
}
// keyi must be a substring of keyj
if (!keyj.includes(keyi)) {
continue checkKeyJ
}
// they must have the same freq occurr values
if (candidates[keyj] === candidates[keyi]) {
delete candidates[keyi]
continue checkKeyI
}
}
}
The desired result would be:
{"alpha bravo charlie": 10, "delta echo foxtrot": 15, "delta echo": 7}
because "bravo charlie" was eliminated. Are there any obvious or clever ways to speed this up?
Sort the keys based on their length (use alphabetical ordering as a tie-breaker)
Starting with the longest key (and going in a descending order of length), add the keys to a trie. The leaf node for the key will store the value.
For inserting a key, start traversing the trie. If the key has a single word, insert it. If it has multiple words, and a match is found with the same frequency, don't insert it.
This will reduce your complexity from O(n^2) to O(n*m), where n = number of strings and m = length of longest string
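A rough sketch of the steps above, under one loud assumption: keys are matched on whole words, so the trie is keyed by words and every word-suffix of a kept key is inserted. The original loop uses includes(), which also matches partial words, so results can differ in that case. The name `pruneContainedKeys` is made up for illustration.

```javascript
// Assumption: containment is tested on whole words, not raw characters.
function pruneContainedKeys(candidates) {
  const newNode = () => ({ children: new Map(), freqs: new Set() });
  const root = newNode();
  // Longest keys (by word count) first, so containers are inserted before their parts.
  const keys = Object.keys(candidates)
    .map((key) => ({ key, words: key.split(" ") }))
    .sort((a, b) => b.words.length - a.words.length);

  for (const { key, words } of keys) {
    const freq = candidates[key];
    // Walk the trie along this key's words.
    let node = root;
    for (const word of words) {
      node = node.children.get(word);
      if (!node) break;
    }
    const contained = node !== undefined && node.freqs.has(freq);
    if (contained && words.length > 1) {
      delete candidates[key]; // part of a longer key with the same frequency
      continue;
    }
    // Insert every word-suffix so shorter keys can match at any position.
    for (let start = 0; start < words.length; start++) {
      let cur = root;
      for (let i = start; i < words.length; i++) {
        let next = cur.children.get(words[i]);
        if (!next) {
          next = newNode();
          cur.children.set(words[i], next);
        }
        next.freqs.add(freq); // remember which frequencies pass through here
        cur = next;
      }
    }
  }
  return candidates;
}
```

Each key is walked once and inserted once per word-suffix, so the work depends on the number of keys and the number of words per key rather than on the number of key pairs.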
I have written code which takes a list of items and outputs a JSON object with the unique items as keys and their frequencies as values.
The code below works fine when I test it:
const tokenFrequency = tokens =>{
const setTokens=[...new Set(tokens)]
return setTokens.reduce((obj, tok) => {
const frequency = tokens.reduce((count, word) =>word===tok?count+1:count, 0);
const containsDigit = /\d+/;
if (!containsDigit.test(tok)) {
obj[tok.toLocaleLowerCase()] = frequency;
}
return obj;
}, new Object());
}
like
const x=["hello","hi","hi","whatsup","hey"]
console.log(tokenFrequency(x))
produces the output
{ hello: 1, hi: 2, whatsup: 1, hey: 1 }
but when I try it with a list of words from a huge corpus, it seems to produce the wrong result.
For example, if I feed it a list of 14000+ words, it produces wrong results.
Example:
https://github.com/Nahdus/word2vecDataParsing/blob/master/corpous/listOfWords.txt When the list on this page is passed to the function, the frequency of the word "is" comes out as 4, but the actual frequency is 907.
Why does it behave like this for large data?
How can this be fixed?
You would need to normalize your tokens first by applying toLowerCase() to them, or find a way to differentiate between words that are the same apart from capitalization.
Reason:
Your small dataset has no Is words (with uppercase 'I'). The large dataset does have occurrences of Is (with uppercase 'I'), which apparently has a frequency of 4. Because the count is case-sensitive but the result is stored under the lowercased key, that 4 overwrites the frequency of your lowercase is.
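A minimal sketch of that fix, assuming the intent is to count case-insensitively and still skip tokens containing digits. It lowercases each token before counting and does the whole count in a single pass, which also avoids re-scanning the list once per unique token.

```javascript
const tokenFrequency = (tokens) => {
  const containsDigit = /\d/;
  const counts = new Map();
  for (const token of tokens) {
    if (containsDigit.test(token)) continue; // same filter as the original
    const key = token.toLowerCase(); // normalize BEFORE counting
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Object.fromEntries(counts);
};

console.log(tokenFrequency(["Is", "is", "is", "hi"]));
// prints { is: 3, hi: 1 }
```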
I have a series of strings that represent airline itineraries:
FLROTP
MADFCOFCOFLR
BLQMADMADUIOUIOMADMADBLQ
MXPJFKJFKMCOJFKMXP
WAWPSAPSAWAW
FLRFRAFRASGNSGNBKKBKKVIEVIEFLR
FLRMUCMUCDELDXBDXBZRHZRHFLR
FLRFRAFRASINSINMELMELSINSINFRAFRAFLR
FLRCDGCDGCANCANJJNZHACANCANCDGCDGFLRWNZCANCANZHAHKGAMSFLR
JFKMTYMTYMEXMEXPTYMDEMDEBOGBOGLIM
PSAISTISTICNICNNRTNRTISTISTPSANRTISTISTPSA
MXPDXBDXBPERPERADLADLMELMELASPASPAYQAYQASPASPSYDSYDDXBDXBMXP
FLRFRAFRAORDORDLASLASBNACLTCLTMUCMUCPSA
FLRCDGCDGBOGBOGBAQBAQBOGBOGCUCCUCBOGBOGMDEMDEBOGBOGUIOGYELIMLIMHAVHAVCDGCDGFLR
FLRFRAFRALAXLAXSEASEAORDORDICTICTORDORDCMHCMHBOSBOSMIAMIAFRAFRAFLR
PSAMUCMUCIADIADGSOGSOCLTCLTMIAMIADFWDFWICTICTDFWDFWCMHCMHPHLPHLALBALBIADIADFRAFRAFLR
FLRFRAFRAEZEEZESCLSCLGRUCGHSDUSDUPOAPOAGRUGRULIMLIMUIOUIOBOGBOGPTYPTYPOSPOSMIAMIAFRAFRAFLR
PSACDGCDGHAVHAVPTYPTYUIOUIOMDEMDEBOGBOGBAQBAQBOGBOGCUCCUCBOGBOGCDGCDGFLR
FLRCDGCDGMEXMEXSJOSJOMEXBJXBJXMEXMEXCDGCDGPSA
I'd like to always be able to find the "middle" of the string (which in 90% of cases is the passenger's destination), but I'm short of ideas. Any help? :)
What you want is not the index at the exact middle of the string, but the closest index to the middle that is a multiple of 3, to index the start of a valid 3-letter code.
You didn't specify a language so I'll just use C++ to illustrate.
std::string code = "MXPJFKJFKMCOJFKMXP";
Find the length of the string:
int length = code.size();
Count how many codes you have:
int codecount = length / 3;
Find the middle code, using integer arithmetic (rounding down), with the codes numbered from zero:
int middlecode = codecount / 2;
Find the start index of your middle code:
int index = middlecode * 3;
Get the middle code:
std::string destination = code.substr(index, 3);
For strings with an even number of codes, this will give the first code in the second half of the string, e.g. MCO in:
MXPJFKJFKMCOJFKMXP
For strings with an odd number of codes, this will give the middle code, e.g. the second LAS in:
FLRFRAFRAORDORDLASLASBNACLTCLTMUCMUCPSA
(which in the above case looks wrong, but you did say only 90%!)
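Since the question does not name a language, here is the same arithmetic as a small JavaScript sketch. The function name `middleCode` is made up for illustration, and it assumes the string length is a multiple of 3.

```javascript
// Returns the 3-letter code closest to the middle of the itinerary.
function middleCode(itinerary) {
  const codeCount = Math.floor(itinerary.length / 3); // number of 3-letter codes
  const middle = Math.floor(codeCount / 2); // zero-based index of the middle code
  const index = middle * 3; // start position of that code
  return itinerary.slice(index, index + 3);
}

console.log(middleCode("MXPJFKJFKMCOJFKMXP")); // prints MCO
```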
We have a string S and we want to calculate the number of distinct strings that can be formed by rotating the string.
For example:
S = "aaaa" , here it would be 1 string {"aaaa"}
S = "abab" , here it would be 2 strings {"abab" , "baba"}
So, is there an algorithm to solve this in O(|S|) time, where |S| is the length of the string?
Suffix trees, baby!
If string is S. Construct the Suffix Tree for SS (S concatenated to S).
Find the number of unique substrings of length |S|. You get the uniqueness automatically. To restrict to length |S| you might have to change the suffix tree algorithm a little (to maintain depth information), but it is doable.
(Note that the other answer by johnsoe is actually quadratic, or worse, depending on the implementation of Set).
You can solve this with rolling hash functions used in the Rabin-Karp algorithm.
You can use the rolling hash to update the hash table for all substrings of size |S| (obtained by sliding a |S| window across SS) in constant time (so, O(|S|) in total).
Assuming your string comes from an alphabet of constant size, you can inspect the hash table in constant time to obtain the required metric.
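A rough JavaScript sketch of the rolling-hash idea. Two assumptions are baked in: the base and the two prime moduli are arbitrary illustrative choices, and distinct rotations are assumed not to collide on both hashes at once, so the count is probabilistic rather than guaranteed.

```javascript
// Counts distinct rotations of s by hashing every window of length |s| in s + s.
function countDistinctRotations(s) {
  const n = s.length;
  const MODS = [1000000007, 998244353]; // two primes to make collisions unlikely
  const BASE = 65537;
  const doubled = s + s;

  // pows[k] = BASE^(n-1) mod MODS[k], used to remove the outgoing character.
  const pows = MODS.map((m) => {
    let p = 1;
    for (let i = 0; i < n - 1; i++) p = (p * BASE) % m;
    return p;
  });
  // Hash of the first window, doubled[0..n).
  const hashes = MODS.map((m) => {
    let h = 0;
    for (let i = 0; i < n; i++) h = (h * BASE + doubled.charCodeAt(i)) % m;
    return h;
  });

  const seen = new Set();
  if (n > 0) seen.add(hashes.join(","));
  for (let start = 1; start < n; start++) {
    const outgoing = doubled.charCodeAt(start - 1);
    const incoming = doubled.charCodeAt(start + n - 1);
    for (let k = 0; k < MODS.length; k++) {
      const m = MODS[k];
      let h = (hashes[k] - ((outgoing * pows[k]) % m) + m) % m; // drop left char
      hashes[k] = (h * BASE + incoming) % m; // append right char
    }
    seen.add(hashes.join(","));
  }
  return seen.size;
}

console.log(countDistinctRotations("aaaa")); // prints 1
console.log(countDistinctRotations("abab")); // prints 2
```

All intermediate products stay below 2^53, so plain Number arithmetic is exact here.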
Something like this should do the trick.
public static int uniqueRotations(String phrase){
Set<String> rotations = new HashSet<String>();
rotations.add(phrase);
for(int i = 0; i < phrase.length() - 1; i++){
phrase = phrase.charAt(phrase.length() - 1) + phrase.substring(0, phrase.length() - 1);
rotations.add(phrase);
}
return rotations.size();
}
This is a coding exercise. Suppose I have to decide if one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length() != s2.length())
    return false;
for (int i = 0; i < s1.length(); i++)
    if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
        return true;
return false;
Now what if I have an array of strings and want to find all strings that are cyclic shift of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
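As a start, you can tell whether a string s1 is a rotation of a string s2 by comparing their lengths and then making a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
    // Without the length check, any substring of s2+s2 would count as a rotation.
    if (s1.length() != s2.length()) return false;
    String s2twice = s2+s2;
    return s2twice.contains(s1);
}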
Namely, if s1 is "rotation" and s2 is "otationr", the concat gives you "otationrotationr", which contains s1 indeed.
Now, even if we assume this is linear, or close to it (which is not impossible using Rabin-Karp, for instance), you are still left with O(n^2) pair comparisons, which may be too much.
What you could do is build a hashtable where the sorted word is the key, and the posting list contains all the words from your list that, if sorted, give the key (i.e. key("bca") and key("cab") both should return "abc"):
private Map<String, List<String>> index = new HashMap<String, List<String>>();
/* ... */
public void buildIndex(String[] words){
for(String word : words){
String sortedWord = sortWord(word);
if(!index.containsKey(sortedWord)){
index.put(sortedWord, new ArrayList<String>());
}
index.get(sortedWord).add(word);
}
}
CAVEAT: The hashtable will contain, for each key, all the words that have exactly the same letters occurring the same amount of times (not just the rotations, ie. "abba" and "baba" will have the same key but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations for "bca" you just need to sort("bca"), look it up in the hashtable, and check (using the isRotation method above, if you want) if the words in the posting list are the result of a rotation or not.
If strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
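A small JavaScript sketch of that idea: rotate every string to its lexicographically smallest rotation, sort by that form, and collect runs of equal forms. The normal form is computed naively here (O(s^2) per string), which is an assumption that strings are short; the function names are made up for illustration.

```javascript
// Naive normal form: the smallest of all rotations of s.
function canonicalRotation(s) {
  let best = s;
  for (let i = 1; i < s.length; i++) {
    const rotated = s.slice(i) + s.slice(0, i);
    if (rotated < best) best = rotated;
  }
  return best;
}

// Sort by normal form, then group adjacent entries that share it.
function groupBySortedRuns(words) {
  const tagged = words
    .map((word) => ({ word, canon: canonicalRotation(word) }))
    .sort((a, b) => (a.canon < b.canon ? -1 : a.canon > b.canon ? 1 : 0));
  const groups = [];
  let previous = null;
  for (const { word, canon } of tagged) {
    if (groups.length > 0 && canon === previous) {
      groups[groups.length - 1].push(word); // same run as the previous entry
    } else {
      groups.push([word]); // a new run starts here
    }
    previous = canon;
  }
  return groups;
}

console.log(groupBySortedRuns(["abc", "xyz", "yzx", "cab", "xxx"]));
```

The groups come out ordered by their normal form, so for this input the "abc" group is first, then "xxx", then the "xyz" group.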
As for finding the pairs in the table, there may be many better ways, but my first thought is to sort the table and apply the check to each adjacent pair.
This is much better and simpler than checking every string against every other string in the table.
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, and transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgamated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = Arrays.asList("abc", "xyz", "yzx", "cab", "xxx");
Map<String, List<String>> uniques = new HashMap<String, List<String>>();
for (String value : inputs) {
    String normalized = normalize(value);
    if (!uniques.containsKey(normalized)) {
        uniques.put(normalized, new ArrayList<String>());
    }
    uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87, might be best done by picking the rotation of the string that comes first in lexicographic order.
It's worth noting, however, that the "best" algorithm probably relies heavily on the inputs... the number of strings, the length of those string, how many duplicates there are, etc.
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data, i.e., for each string:
calculate the normalized form
check if the HashMap contains the normalized form as a key - if not insert the empty Set at this key
add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).
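The whole pipeline can be sketched in JavaScript as follows. `leastRotationIndex` is a transcription of the commonly published form of Booth's algorithm, and a Map stands in for the HashMap; both function names are made up for illustration, and arrays are used instead of Sets so duplicates in the input are kept.

```javascript
// Booth's algorithm: start index of the lexicographically smallest rotation of s.
function leastRotationIndex(s) {
  const n = s.length;
  const f = new Array(2 * n).fill(-1); // failure function over the doubled string
  let k = 0; // start of the best rotation found so far
  for (let j = 1; j < 2 * n; j++) {
    const sj = s[j % n];
    let i = f[j - k - 1];
    while (i !== -1 && sj !== s[(k + i + 1) % n]) {
      if (sj < s[(k + i + 1) % n]) k = j - i - 1;
      i = f[i];
    }
    if (i === -1 && sj !== s[(k + i + 1) % n]) {
      if (sj < s[(k + i + 1) % n]) k = j;
      f[j - k] = -1;
    } else {
      f[j - k] = i + 1;
    }
  }
  return n === 0 ? 0 : k % n;
}

// Groups words that are rotations of one another, in a single pass.
function groupRotations(words) {
  const groups = new Map();
  for (const word of words) {
    const k = leastRotationIndex(word);
    const normalized = word.slice(k) + word.slice(0, k);
    if (!groups.has(normalized)) groups.set(normalized, []);
    groups.get(normalized).push(word);
  }
  return [...groups.values()];
}

console.log(groupRotations(["abc", "xyz", "yzx", "cab", "xxx"]));
// prints [ [ 'abc', 'cab' ], [ 'xyz', 'yzx' ], [ 'xxx' ] ]
```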