I have to solve the following optimization problem:
Given a set of elements (E1, E2, E3, E4, E5, E6), create an arbitrary set of sequences, e.g.
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
and given a function f that gives a value for every pair of elements e.g.
f(E1,E4) = 5
f(E4,E3) = 2
f(E6,E5) = 3
...
In addition, it also gives a value for the pair of an element combined with some special element T, e.g.
f(T,E2) = 10
f(E2,T) = 3
f(E5,T) = 1
f(T,E6) = 2
f(T,E1) = 4
f(E3,T) = 2
...
The utility function that must be optimized is the following:
The utility of a sequence set is the sum of the utility of all sequences.
The utility of a sequence A1,A2,A3,...,AN is equal to
f(T,A1)+f(A1,A2)+f(A2,A3)+...+f(AN,T)
for our example set of sequences above this leads to
seq1: f(T,E1)+f(E1,E4)+f(E4,E3)+f(E3,T) = 4+5+2+2=13
seq2: f(T,E2)+f(E2,T) =10+3=13
seq3: f(T,E6)+f(E6,E5)+f(E5,T) =2+3+1=6
Utility(set) = 13+13+6=32
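In code, the utility computation I mean looks roughly like this (a rough sketch with my own naming; f is assumed to be a lookup keyed by ordered pairs, with "T" as the special element):

def sequence_utility(seq, f):
    # Walk T -> A1 -> A2 -> ... -> AN -> T and sum the pair values.
    path = ["T"] + list(seq) + ["T"]
    return sum(f[(a, b)] for a, b in zip(path, path[1:]))

def set_utility(sequences, f):
    return sum(sequence_utility(seq, f) for seq in sequences)

# With the example values above:
# set_utility([["E1", "E4", "E3"], ["E2"], ["E6", "E5"]], f) == 32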
I am trying to solve a larger version of this problem (more than 6 elements, around 1000) using A* and some heuristic, starting from zero sequences and stepwise adding elements either to existing sequences or as a new sequence, until we obtain a set of sequences containing all elements.
The problem I run into is that, while generating possible solutions, I end up with duplicates; for example, in the example above, all of the following combinations are generated:
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
+
seq1:E1,E4,E3
seq2:E6,E5
seq3:E2
+
seq1:E2
seq2:E1,E4,E3
seq3:E6,E5
+
seq1:E2
seq2:E6,E5
seq3:E1,E4,E3
+
seq1:E6,E5
seq2:E2
seq3:E1,E4,E3
+
seq1:E6,E5
seq2:E1,E4,E3
seq3:E2
which all have equal utility, since the order of the sequences does not matter.
These are all permutations of the 3 sequences. Since the number of sequences is arbitrary, there can be as many sequences as elements, and a factorial(!) number of duplicates is generated...
One way to solve such a problem is to keep track of already visited states and not revisit them. However, since storing all visited states requires a huge amount of memory, and comparing two states can be quite an expensive operation, I was wondering whether there isn't a way I could avoid generating these duplicates in the first place.
THE QUESTION:
Is there a way to construct all these sequences stepwise, constraining the adding of elements in such a way that only combinations of sequences are generated rather than all permutations of sequences? (Or at least limit the number of duplicates.)
As an example, I only found a way to limit the number of 'duplicates' generated by stating that an element Ei should always be in a seqj with j <= i; therefore, if you had two elements E1, E2, only
seq1:E1
seq2:E2
would be generated, and not
seq1:E2
seq2:E1
I was wondering whether there was any such constraint that would prevent duplicates from being generated at all, without failing to generate all combinations of sets.
Well, it is simple. Allow generation of only those sequence sets that are sorted according to their first member; that is, from the above example, only
seq1:E1,E4,E3
seq2:E2
seq3:E6,E5
would be correct. And this you can guard very easily: never allow an additional sequence whose first member is less than the first member of its predecessor.
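A rough Python sketch of a successor function enforcing this rule (the names are mine, and I assume elements are compared by their index and are only ever appended at the end of a sequence):

def successors(state, remaining):
    # state: list of sequences (each a list of elements) built so far.
    # remaining: set of elements not yet placed.
    for e in remaining:
        rest = remaining - {e}
        # Appending to the end of an existing sequence never changes its
        # first member, so the ordering invariant is preserved.
        for i in range(len(state)):
            new_state = [seq[:] for seq in state]
            new_state[i].append(e)
            yield new_state, rest
        # Starting a new sequence is only allowed if its first member is
        # greater than the first member of the last sequence so far.
        if not state or e > state[-1][0]:
            yield [seq[:] for seq in state] + [[e]], rest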
Given integers a,b,c such that
-N<=a<=N,
0<=b<=N,
0<=c<=10
Can I write a hash function, say hashit(a, b, c), taking no more than O(N) address space?
My naive thought was to write it as,
a+2N*b+10*2N*N*c
that's like O(20*N*N) space, so it won't suffice for my need.
Let me elaborate my use case: I want the tuple (a, b, c) as the key of a hashmap. Basically a, b, c are arguments to a function whose results I want to memoize. In Python, @lru_cache does it perfectly without any issue for N = 1e6, but when I try to write the hash function myself I get a memory overflow. So how does Python do it?
I am working with N on the order of 10^6.
This code works:
from functools import lru_cache

@lru_cache(maxsize=None)
def myfn(a, b, c):
    # some logic
    return 100
But if I write the hash function myself like this, it doesn't. So how does Python do it?
N = 10**6                      # N is on the order of 10^6, as stated above
myhashtable = {}

def hashit(a, b, c):
    return a + 2*N*b + 2*N*N*c

def myfn(a, b, c):
    if hashit(a, b, c) in myhashtable:
        return myhashtable[hashit(a, b, c)]
    # some logic
    myhashtable[hashit(a, b, c)] = 100
    return myhashtable[hashit(a, b, c)]
To directly answer your question of whether it is possible to find an injective hash function from a set of size Θ(N^2) to a set of size O(N): it isn't. The very existence of an injective function from a finite set A to a set B implies that |B| >= |A|. This is similar to trying to give a unique number out of {1, 2, 3} to each member of a group of 20 people.
However, do note that hash functions do oftentimes have collisions; the hash tables that employ them simply have a method for resolving those collisions.
As one simple example for clarification, you could for instance hold an array such that every possible output of your hash function is mapped to an index of this array, and at each index you have a list of elements (so an array of lists where the array is of size O(N)); in the case of a collision, simply go over all elements in the matching list and compare them (not their hashes) until you find what you're looking for. This is known as chain hashing or chaining.
Some rare manipulations (re-hashing) on the hash table, based on how populated it is (measured through its load factor), can ensure an amortized time complexity of O(1) for element access. This could over time increase your space complexity if you actually try to hold ω(N) values, though do note that this is unavoidable: you can't use less space than Θ(X) to hold Θ(X) values without any extra information. (For instance: if you hold 1000000 unordered elements where each is a natural number between 1 and 10, then you could simply hold ten counters; but in your case you describe a whole possible set of elements of size 11*(N+1)*(2N+1), so Θ(N^2).)
This method would, however, ensure a space complexity of O(N+K) (equivalent to O(max{N,K})) where K is the amount of elements you're holding; so long as you aren't trying to simultaneously hold Θ(N^2) (or however many you deem to be too many) elements, it would probably suffice for your needs.
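To make the chaining idea concrete, here is a rough Python sketch (the bucket count and the use of your hashit formula as the compression function are just illustrative choices, not recommendations):

N = 10**6
TABLE_SIZE = 2 * N + 1                     # O(N) buckets

table = [[] for _ in range(TABLE_SIZE)]    # each bucket holds a chain (list) of entries

def bucket_of(a, b, c):
    # Your hashit, simply folded into the O(N)-sized table; collisions are
    # expected and are resolved by the chains below.
    return (a + 2*N*b + 2*N*N*c) % TABLE_SIZE

def put(a, b, c, value):
    chain = table[bucket_of(a, b, c)]
    for entry in chain:
        if entry[0] == (a, b, c):          # compare the keys, not their hashes
            entry[1] = value
            return
    chain.append([(a, b, c), value])

def get(a, b, c):
    for key, value in table[bucket_of(a, b, c)]:
        if key == (a, b, c):
            return value
    return None                            # not memoized yet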
I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.
Unless I am mistaken, the reason for this is so that, when we construct the LCP array (by comparing how many characters adjacent suffixes have in common), we don't count the sentinel value in the case where two sentinels happen to be at the same index in both of the suffixes we are comparing.
This means we can write code like this:
for each character c in the shortest suffix
    if suffix_1[c] == suffix_2[c]
        increment count of common characters
However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.
However, would a simpler (to implement) solution not be to simply count the number of characters in common, stopping when we reach the (single, unique) sentinel character, like this:
set sentinel = '#'
for each character c in the shortest suffix
    if suffix_1[c] == suffix_2[c]
        if suffix_1[c] != sentinel
            increment count of common characters
        else
            return
Or, am I missing something fundamental here?
Actually I just devised an algorithm that doesn't use sentinels at all: https://github.com/BurntSushi/suffix/issues/14
When concatenating the strings, also record the boundary indexes (e.g. for 3 strings of lengths 4, 2, 5, the boundaries 4, 6, and 11 will be recorded, so we know that concatenated_string[5] belongs to the second original string because 4 <= 5 < 6).
Then, to identify which original string every suffix belongs to, just do a binary search.
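A small Python sketch of that lookup using bisect (the string contents are placeholders; only the lengths 4, 2 and 5 from the example matter):

import bisect

# Concatenate without sentinels and record each string's cumulative end index.
strings = ["abcd", "ef", "ghijk"]        # lengths 4, 2, 5
boundaries = []                          # will hold [4, 6, 11]
total = 0
for s in strings:
    total += len(s)
    boundaries.append(total)

def owner(position):
    # Index of the original string that the given position belongs to.
    return bisect.bisect_right(boundaries, position)

print(owner(5))   # 1: position 5 lies in the second string (4 <= 5 < 6)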
The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."
The longer answer:
At a high level, the basic algorithm described in the video goes like this:
Construct a generalized suffix array for the strings T1 and T2.
Construct an LCP array for that resulting suffix array.
Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
Find the largest LCP between any two such strings; call it k.
Extract the first k characters from either of the two suffixes.
So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.
(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)
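As an illustrative sketch (mine, not from the video) of how strings are often glued together on an integer alphabet so that each endmarker is distinct and smaller than every real character:

def join_with_sentinels(strings):
    # Shift real characters above the endmarker values 0..k-1, give each string
    # its own endmarker, and make the very last symbol of the result the unique
    # minimum (which is what algorithms such as SA-IS expect).
    k = len(strings)
    text = []
    for i, s in enumerate(strings):
        text.extend(ord(c) + k for c in s)   # real characters, shifted above the endmarkers
        text.append(k - 1 - i)               # distinct endmarker for string i
    return text

# join_with_sentinels(["ab", "ba"]) -> [99, 100, 1, 100, 99, 0]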
Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and they can't be a part of the overlap between adjacent suffixes.
So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.
I was recently asked a question in an interview. How will you find the top 10 longest strings in a list of a billion strings?
My answer was that we need to write a Comparator that compares the lengths of two strings and then use the TreeSet(Comparator) constructor.
Once you start adding the strings to the TreeSet, it will sort them according to the ordering of the comparator defined.
Then just pop the top 10 elements of the TreeSet.
The interviewer wasn't happy with that. The argument was that, to hold a billion strings, I would have to use a supercomputer.
Is there any other data structure that can deal with this kind of data?
Given what you stated about the interviewer saying you would need a supercomputer, I am going to assume that the strings come in as a stream, one string at a time.
Given the immense size, and having no knowledge of how large the individual strings are (they could be whole books), I would read them in one at a time from the stream. I would then compare the current string to an ordered list of the top ten longest strings found before it and place it accordingly in the ordered list. I would then remove the smallest one from the list and proceed to read the next string. That way only 11 strings are being stored at one time: the current top 10 and the one currently being processed.
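A rough Python sketch of that streaming approach (the stream argument is a stand-in for however the strings actually arrive):

def top_ten_longest(stream):
    top = []                               # at most the 10 longest seen so far, shortest first
    for s in stream:                       # only one incoming string in memory at a time
        if len(top) < 10:
            top.append(s)
            top.sort(key=len)
        elif len(s) > len(top[0]):         # longer than the shortest of the current top ten
            top[0] = s
            top.sort(key=len)
    return top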
Most languages have a built in sort that is pretty speedy.
stringList.sort(key=len, reverse=True)
in Python would work. Then just grab the first 10 elements.
Also, your interviewer does sound behind the times. One billion strings is pretty small nowadays.
I remember studying a similar data structure for such scenarios, called a trie.
The height of the tree will always give the longest string.
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
The point is you do not need to STORE all strings.
Let's think about a simplified version: find the longest 2 strings (assuming no ties).
You can always use an online algorithm with 2 variables s1 & s2, where s1 is the longest string you have encountered so far and s2 is the second longest.
Then you read the strings one by one in O(N), replacing s1 or s2 whenever possible. This uses O(2N) = O(N).
For the top 10 strings, it is just as simple as the top 2 case. You can still do it in O(10N) = O(N) and store only 10 strings.
There is a faster way, described below, but for a small constant like 2 or 10 you may not need it.
For the top K strings in general, you can use a structure like std::set in C++ (ordered so that longer strings have higher priority) to store the top K strings; when a new string comes, you simply insert it and remove the smallest one, both in O(lg K). So in total you can do it in O(N lg K) time with O(K) space.
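A rough Python sketch of the same idea using heapq (a min-heap) instead of the C++ set; the complexity is the same O(N lg K) time and O(K) space:

import heapq

def top_k_longest(stream, k=10):
    heap = []                                      # min-heap of (length, string); shortest on top
    for s in stream:
        if len(heap) < k:
            heapq.heappush(heap, (len(s), s))
        elif len(s) > heap[0][0]:
            heapq.heapreplace(heap, (len(s), s))   # pop the shortest, push the newcomer: O(lg K)
    return [s for _, s in sorted(heap, reverse=True)]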
I have some strings, and characters will not be repeated within a single string.
For example: "AABC" is not possible.
I want to cluster them into sets by their common sub-strings.
for example: "ABC, CDF, GHP" will be cluster into two sets
{ABC,CDF},{GHP}.
several strings with one or more common sub-strings will be in one set.
a string which has no common sub-string with any other strings will be a set itself.
so keep the number of sets smallest.
For example:
1. "ABC, AHD, AKJ, LAN, WER" will be two sets: {ABC, AHD, AKJ, LAN}, {WER}.
2. "ABC, BDF, HLK, YHT, PX" will be 3 sets: {ABC, BDF}, {HLK, YHT}, {PX}.
Finding a string which has nothing in common with the others is easy, I think:
for (i = 0; i < strings.num; i++)
{
    str1 = strings[i];
    bool m_com = false;
    for (j = 0; j < strings.num; j++)
    {
        if (j == i) continue;          // don't compare a string with itself
        str2 = strings[j];
        if (hascommon(str1, str2))
            m_com = true;
    }
    if (!m_com)
    {
        // str1 has no common substring with any other string
    }
}
Now I am thinking about the others: how to classify them? Is there any algorithm suitable for this?
Input:
strings (characters are not repeated within a string)
Output:
sets (keep the number of sets as small as possible)
I know this involves the common-substring problem and clustering,
but I am not familiar with clustering techniques, so I am hoping someone
could recommend such an algorithm.
While I am looking for good ways to do this myself, I would also appreciate suggestions from others.
Tip: actually these strings are simple paths between two points in a graph. I want to find the edges whose removal cuts all these paths; the number of such edges should be minimal. So, for AB, BC, CD, it means a single path ABCD exists.
I have also written an algorithm to find common substrings in my case (my case is much simpler). I think I might use this algorithm during the clustering to measure similarities.
I might have two paths, {ABC, ADC}; both removing A or removing B could split the paths.
Or I could have {ABC, ADC, HG}, so removing {A, H}, or {C, H}, or {C, G}, or {A, G} all work.
I thought I could solve this by finding common sub-strings, and then deciding where to remove edges.
One thing should be pointed out first:
For any two strings, "having a common substring" is really equivalent to "having a common letter". Thus we can replace the condition by "having a common letter".
Consider the graph G whose vertices are the strings, where two strings are connected by an edge if and only if they have a common letter. Then you are really asking to separate the graph G into connected components. This can be done easily using standard graph algorithms, cf. the wiki page here.
What remains is the task of establishing the graph. This is also easy: first, create 26 boxes, labelled A to Z, and read each string once. If the string contains letter A, then put it (or its index) into box A, etc. Finally, those strings inside one box have edges connecting to each other.
There can be further optimizations, but I guess it will depend on the nature of your input data.
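A rough Python sketch of the boxes plus a union-find structure to pull out the connected components (all names are mine):

from collections import defaultdict

def cluster(strings):
    parent = list(range(len(strings)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # The 26 "boxes": letter -> indices of the strings containing it.
    boxes = defaultdict(list)
    for i, s in enumerate(strings):
        for c in set(s):
            boxes[c].append(i)

    # Strings sharing a box belong to the same component.
    for members in boxes.values():
        for i in members[1:]:
            union(members[0], i)

    components = defaultdict(list)
    for i, s in enumerate(strings):
        components[find(i)].append(s)
    return list(components.values())

# cluster(["ABC", "AHD", "AKJ", "LAN", "WER"]) -> [['ABC', 'AHD', 'AKJ', 'LAN'], ['WER']]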
You have to use Heap's algorithm to create the permutations for your job: https://en.wikipedia.org/wiki/Heap's_algorithm
As opposed to WhatsUp, I assume you want any two strings in a subset to have a common substring. This means that for AB, BC, CD, {AB, BC, CD} is not a valid solution, because AB and CD do not have a common substring.
As WhatsUp already pointed out, you can represent your strings as a graph, where the vertices are the strings and an edge goes from one to the other if they have a common character.
If we are not accepting chains (as described at the beginning), the problem becomes finding a minimum clique cover, which is unfortunately NP-complete.
I don't want a direct solution to the problem that's the source of this question but it's this one link:
So I take in the strings and add them to a suffix array, which is implemented as a sorted set internally; what I then obtain is a lexicographically sorted list of the suffixes of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient, I preprocess this sorted set to add information about the longest common prefix between a suffix and its predecessor, as well as keeping a cumulative substring count. So I know that, for a given k greater than the cumulative substring count of the last item, it's an invalid query.
This works really well for small inputs as well as random large inputs within the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass 4 out of the 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck and it hit me. Given large inputs like these:
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for the k-th smallest substring are still fast, as expected, but the way I preprocess the sorted set is not... The way I calculate the longest common prefix between the elements of the set is inefficient: it is linear, O(m), per pair. I did the most naïve thing, expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and its predecessor might not be related (i.e. they might come from different input strings), so I thought I couldn't help but use brute force.
Here is the question, then, and what I ended up reducing the problem to. Given a lexicographically sorted list of suffixes as described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?
The subquestion would then be: am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Footnote: I do not want to be shown an implemented algorithm, and I don't mind being told to go read such-and-such a book or resource on the subject, as that is what I do anyway while attempting these challenges.
The accepted answer will be something that guides me onto the right path or, failing that, something that teaches me how to solve these types of problems in a broader sense, a book or something.
READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(n log^2 n) algorithm with O(n log n) space to compute the suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(log n).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the suffixes based on their prefixes of length 2^k.
From the tutorial:
Let's denote by A(i,k) the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.