O(n^2) (or O(n^2 lg n)?) algorithm to calculate the longest common subsequence (LCS) of two 'ring' strings

This is a problem that appeared in today's Pacific NW Region Programming Contest; no one solved it during the contest. It is problem B and the complete problem set is here: http://www.acmicpc-pacnw.org/icpc-statements-2011.zip. There is a well-known O(n^2) algorithm for the LCS of two strings using dynamic programming. But when these strings are extended to rings I have no idea...
P.S. Note that it is subsequence rather than substring, so the elements do not need to be adjacent to each other.
P.S. It might not be O(n^2) but O(n^2 lg n), or anything that can give the result within 5 seconds on a common computer.
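For reference, a minimal sketch of the well-known O(n^2) dynamic program for the ordinary (non-cyclic) LCS mentioned above; the function name and test strings are only illustrative, and the cyclic version is the open part of the question.
    def lcs_length(a: str, b: str) -> int:
        # prev[j] = LCS length of the processed prefix of a and b[:j]
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    curr[j] = prev[j - 1] + 1
                else:
                    curr[j] = max(prev[j], curr[j - 1])
            prev = curr
        return prev[len(b)]

    print(lcs_length("ABCBDAB", "BDCABA"))   # 4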

Searching the web, this appears to be covered by section 4.3 of the paper "Incremental String Comparison", by Landau, Myers, and Schmidt at cost O(ne) < O(n^2), where I think e is the edit distance. This paper also references a previous paper by Maes giving cost O(mn log m) with more general edit costs - "On a cyclic string to string correcting problem". Expecting a contestant to reproduce either of these papers seems pretty demanding to me - but as far as I can see the question does ask for the longest common subsequence on cyclic strings.

You can double the first and second string and then use the ordinary method, and later wrap the positions around.

It is a good idea to "double" the strings and apply the standard dynamic programming algorithm. The problem is that to get the optimal cyclic LCS one then has to start the algorithm from multiple initial conditions. Just one initial condition (e.g. setting all Lij variables to 0 at the boundaries) will not do in general. In practice it turns out that the number of initial states needed is O(N) (they span a diagonal), so one gets back to an O(N^3) algorithm.
However, the approach does have some virtue, as it can be used to design efficient O(N^2) heuristics (not exact but near exact) for CLCS.
I do not know if a true O(N^2) algorithm exists, and would be very interested if someone knows one.
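For concreteness, the cubic baseline can also be written without doubling at all: try every rotation of one ring against the other and keep the best ordinary LCS. A sketch reusing lcs_length from the earlier snippet (rotating only one of the two rings should suffice, since an optimal cyclic matching can be re-cut at the point where the other string wraps):
    def clcs_bruteforce(a: str, b: str) -> int:
        # One ordinary O(N^2) LCS per rotation of a, so O(N^3) overall.
        return max(lcs_length(a[i:] + a[:i], b) for i in range(len(a)))

    print(clcs_bruteforce("aabb", "bbaa"))   # 4, while the plain LCS is only 2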
The CLCS problem has quite interesting "periodicity" properties: the length of the CLCS of
p-times repeated strings is p times the CLCS of the original strings. This can be proved by adopting a geometric view of the problem.
Also, the problem has some additional interesting properties: it can be shown that if Lc(N) denotes the average CLCS length of two random strings of length N, then
|Lc(N) - CN| is O(sqrt(N)), where C is the Chvatal-Sankoff constant. For the average length L(N) of the standard LCS, the only rate result I know of says that |L(N) - CN| is O(sqrt(N log N)). There could be a nice way to compare Lc(N) with L(N), but I don't know it.
Another question: it is clear that the CLCS length is not superadditive, contrary to the LCS length. By this I mean it is not true that CLCS(X1X2, Y1Y2) is always at least CLCS(X1, Y1) + CLCS(X2, Y2) (it is very easy to find counterexamples with a computer).
But it seems possible that the average length Lc(N) is superadditive (Lc(N1+N2) >= Lc(N1) + Lc(N2)) - though if there is a proof I don't know it.
One modest interest in this question is that the values Lc(N)/N for the first few values of N would then provide good bounds on the Chvatal-Sankoff constant (much better than L(N)/N).

As a followup to mcdowella's answer, I'd like to point out that the O(n^2 lg n) solution presented in Maes' paper is the intended solution to the contest problem (check http://www.acmicpc-pacnw.org/ProblemSet/2011/solutions.zip). The O(ne) solution in Landau et al's paper does NOT apply to this problem, as that paper is targeted at edit distance, not LCS. In particular, the solution to cyclic edit distance only applies if the edit operations (add, delete, replace) all have unit (1, 1, 1) cost. LCS, on the other hand, is equivalent to edit distances with (add, delete, replace) costs (1, 1, 2). These are not equivalent to each other; for example, consider the input strings "ABC" and "CXY" (for the acyclic case; you can construct cyclic counterexamples similarly). The LCS of the two strings is "C", but the minimum unit-cost edit is to replace each character in turn.
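That distinction is easy to check numerically. A small sketch (the helper name is made up; the last line uses the standard identity ed_{1,1,2}(A,B) = |A| + |B| - 2*LCS(A,B)):
    def edit_distance(a: str, b: str, sub: int = 1) -> int:
        # Insertions and deletions cost 1, substitutions cost `sub`.
        prev = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            curr = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                curr[j] = min(prev[j] + 1,          # delete a[i-1]
                              curr[j - 1] + 1,      # insert b[j-1]
                              prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub))
            prev = curr
        return prev[len(b)]

    a, b = "ABC", "CXY"
    print(edit_distance(a, b, sub=1))                           # 3: replace every character
    print((len(a) + len(b) - edit_distance(a, b, sub=2)) // 2)  # 1: the LCS is just "C"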
At 110 lines but no complex data structures, Maes' solution falls towards the upper end of what is reasonable to implement in a contest setting. Even if Landau et al's solution could be adapted to handle cyclic LCS, the complexity of the data structure makes it infeasible in a contest setting.
Last but not least, I'd like to point out that an O(n^2) solution DOES exist for CLCS, described here: http://arxiv.org/abs/1208.0396 At 60 lines, with no complex data structures and only 2 arrays, this solution is quite reasonable to implement in a contest setting. Arriving at it might be a different matter, though.

Related

Algorithm: Fit smaller strings together optimally to make longer strings of specified length?

General problem
I have a set of short strings, each of a different length, with minimum length X > 0 and maximum length Y. What is an algorithm which will optimally fit these short strings together to make long strings of length M, where M >> Y? Optimal here means producing the greatest number of long strings whose lengths are as close to M as possible.
Details
I am writing a tweet creator to practice javascript. I have a list of greetings and a list of account names. I want my program to create tweets such that each tweet has one greeting and the rest of the characters are used for account names. Each tweet has a limit of 140 characters.
Hello! #person1 #acc2 #mygoodfriend3 ...
Of course, each account has a different number of characters. I want each tweet to use up as many of the 140 characters as possible by optimally selecting combinations of account names.
I am pretty certain there is a class of problems / algorithm that is known to solve this problem but I can't remember it.
This kind of problem is called a knapsack problem, and an exact (or exactly optimal) solution is famous for being intractable.
However, there are reasonable approximate solvers, as well as a "pseudo-polynomial time" dynamic-programming algorithm.
I think it is related to the knapsack problem; it is a particular case of the "multiple linear bin packing problem". Both are NP-hard problems. Here you can find a greedy algorithm for the linear bin packing problem, but the multiple case is much harder. A constraint programming language/library would help to solve these kinds of problems.
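To make the knapsack connection concrete for a single tweet, here is a hedged sketch: a subset-sum style DP that picks account names to get as close as possible to the remaining character budget. The names, the one-character separator cost, and the function name are all invented for illustration, and the harder multi-tweet packing (using each name at most once across tweets) is left out.
    def best_fill(names, budget):
        # Subset-sum DP: best[c] = a subset of name indices costing exactly c,
        # where each name costs its length plus one separating space.
        costs = [len(n) + 1 for n in names]
        best = {0: []}
        for i, c in enumerate(costs):
            for used in list(best):     # snapshot, so each name is used at most once
                if used + c <= budget and used + c not in best:
                    best[used + c] = best[used] + [i]
        return [names[i] for i in best[max(best)]]

    greeting = "Hello!"
    accounts = ["#person1", "#acc2", "#mygoodfriend3", "#x", "#longeraccountname"]
    print(best_fill(accounts, 140 - len(greeting)))   # all five fit within one tweet
    print(best_fill(accounts, 30))                    # ['#person1', '#acc2', '#mygoodfriend3']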

Do you need to sort inputs for dynamic programming knapsack

In every single example I've found for a 0/1 knapsack problem using dynamic programming, where the items have weights (costs) and profits, it never explicitly says to sort the item list, yet in all the examples the items happen to be sorted by increasing weight and profit (higher weights have higher profits in the examples). So my question is: when adding items to the matrix from the item array/list, can I add them in any order, or do I have to add the one with the smallest weight or profit first? From the examples I found, I can't tell whether it's just a coincidence or whether you do in fact need to put the smallest weight/profit into the matrix each time.
The dynamic programming solution is nothing but exploring all the possibilities (as brute force would) in an efficient way, simply by saving the values for future reference.
Note: we consider all the subsets. Whether the list is sorted or not, the total number of subsets is the same, so in the end all of them get considered.
No, you don't need to sort the weights, because every row gives the maximum possible value under the weight limit of that row. The maximum will appear in the last column of that row.
Maybe you are looking at bottom-up dynamic programming solutions; there is one characteristic of the bottom-up method worth noting here.
The second approach is the bottom-up method. This approach typically depends
on some natural notion of the “size” of a subproblem, such that solving any particular
subproblem depends only on solving “smaller” subproblems. We sort the
subproblems by size and solve them in size order, smallest first. When solving a
particular subproblem, we have already solved all of the smaller subproblems its
solution depends upon, and we have saved their solutions. We solve each subproblem
only once, and when we first see it, we have already solved all of its
prerequisite subproblems.
From: Introduction to Algorithm, CORMEN (3rd edition)
The "smaller problem" in this case is just smaller in terms of the number of available items to choose, not about the profits or weights of these items. If given a 3 item list, the sub-problems will be 2 items, and 1 item to choose from.
The smallest problem is first hardcoded (base case), then at each stage of moving from a smaller to bigger problem, the best profit is enumerated, and the max is chosen. At the end, all 2^n combinations would have been considered and the various stages of repeated max will bubble up the largest solution.
Changing the input order, or putting dominated items into the input (like higher weight and lower profit) may just change which argument of max() won at each stage, but the final max result will come from the same selection of items, albeit being selected at different stages in the algorithm for different sort orders or input characteristics.
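A small sketch makes the point concrete: running the usual bottom-up 0/1 knapsack on the same (made-up) items in several different orders yields the same optimum.
    def knapsack(items, capacity):
        # items: list of (weight, profit); dp[c] = best profit within capacity c.
        dp = [0] * (capacity + 1)
        for weight, profit in items:
            for c in range(capacity, weight - 1, -1):   # backwards keeps the 0/1 semantics
                dp[c] = max(dp[c], dp[c - weight] + profit)
        return dp[capacity]

    items = [(3, 4), (5, 2), (2, 7), (4, 5)]
    print(knapsack(items, 7))                    # 12
    print(knapsack(list(reversed(items)), 7))    # 12 -- same answer in any order
    print(knapsack(sorted(items), 7))            # 12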
The answer could be found by some random shuffle experiments.
What I found is: ascending order is better. Correct me if I'm wrong.
gist: https://gist.github.com/whille/39cf7bf8cf5dcf6ac933063735ae54de
The problem is described in "Algorithm Design" (ISBN 9780321295354), chapter 6.4.
Two methods could be used:
as the chapter does, pre-calculate a memoization table M bottom-up, so that no recursive sub-calls are needed;
a recursive function, which is simple to understand and test. functools.cache (Python 3.9+, or functools.lru_cache in older versions) can be used to check how many sub-calculations are needed, as my gist shows: with ascending order, test_random() has the smallest currsize, so it is the most efficient, and the approach also extends to float weights.
Results for 10 random weights (1~100) and a knapsack capacity of 200:
[(13.527716157276256, 18.371888775465692), (16.18632175987168, 206.88043031085252), (20.14117982372607, 81.52793937986635), (33.28606671929836, 298.8676699147799), (49.12968642850187, 22.037638580809592), (55.279973594800225, 377.3715225559507), (56.56103181962746, 460.9161412820592), (60.38456825749498, 10.721915577913244), (67.98836121062645, 63.47478755362385), (86.49436333909377, 208.06767811169286)]: reverse: False
CacheInfo(hits=0, misses=832, maxsize=None, currsize=832)
[(86.49436333909377, 208.06767811169286), (67.98836121062645, 63.47478755362385), (60.38456825749498, 10.721915577913244), (56.56103181962746, 460.9161412820592), (55.279973594800225, 377.3715225559507), (49.12968642850187, 22.037638580809592), (33.28606671929836, 298.8676699147799), (20.14117982372607, 81.52793937986635), (16.18632175987168, 206.88043031085252), (13.527716157276256, 18.371888775465692)]: reverse: True
CacheInfo(hits=0, misses=1120, maxsize=None, currsize=1120)
Notes for method 2 (a minimal version is sketched below):
if the capacity is much larger, all random orders end up with an equal currsize;
call-stack overflow should be avoided for large N, so the recursive method should be transformed. Generally a two-step method can be used: first map the subproblem dependencies, then calculate. I'll try that later in the gist.
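A self-contained sketch of that kind of experiment (not the gist's actual code; the items, capacity and function name here are invented): memoize the top-down knapsack and report how many distinct subproblems were actually computed for different input orders.
    from functools import lru_cache

    def knapsack_cache_size(items, capacity):
        # Top-down 0/1 knapsack; returns (optimal value, number of distinct
        # (index, remaining capacity) subproblems that were memoized).
        weights, profits = zip(*items)

        @lru_cache(maxsize=None)
        def best(i, cap):
            if i == len(items):
                return 0
            skip = best(i + 1, cap)
            if weights[i] <= cap:
                return max(skip, profits[i] + best(i + 1, cap - weights[i]))
            return skip

        value = best(0, capacity)
        return value, best.cache_info().currsize

    items = [(13, 18), (16, 206), (20, 81), (33, 298), (49, 22)]
    print(knapsack_cache_size(sorted(items), 60))                  # ascending weights
    print(knapsack_cache_size(sorted(items, reverse=True), 60))    # descending weights
    # The optimal value is identical either way; only the cache size may differ.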
I guess sorting might be required in certain types of knapsack problems. For example, consider the problem "Maximum Earnings From Taxi". Here the input has to be sorted by the riders' starting points, otherwise we won't get the optimal result.
For example, consider the following input for the above problem:
9
[[2,3,1],[2,9,2], [3,6,7],[2,3,6]]
If you apply a typical recursive knapsack-style implementation without sorting the input, you won't get the optimal solution.
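For illustration, a hedged sketch of that sorted-then-recurse approach (weighted interval scheduling on the rides; the earning of a ride [start, end, tip] is assumed to be end - start + tip, and the function name is made up):
    from bisect import bisect_left
    from functools import lru_cache

    def max_earnings(n, rides):   # n (the road length) is not needed in this formulation
        rides = sorted(rides)                      # sort by starting point first
        starts = [s for s, _, _ in rides]

        @lru_cache(maxsize=None)
        def best(i):
            if i == len(rides):
                return 0
            start, end, tip = rides[i]
            nxt = bisect_left(starts, end)         # first ride we can still take afterwards
            return max(best(i + 1),                # skip ride i
                       end - start + tip + best(nxt))  # take ride i
        return best(0)

    print(max_earnings(9, [[2, 3, 1], [2, 9, 2], [3, 6, 7], [2, 3, 6]]))   # 17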

Generate a crossword from given words maximizing intersections

Is there some feasible (i.e. polynomial-time) algorithm which builds, starting from a small (~20) set of words, a crossword that maximizes (or at least makes "big") the number of intersections? Or, if the intersection criterion is impractical, is it possible to maximize the density (in some sense) of the crossword?
I have already written an exhaustive search in Python, but it takes too long for more than six words...
See also:
Algorithm to generate a crossword (but the answers there, although good, do not really tackle my issue).
Is there some polynomial-time algorithm?
Answer: No.
Consider a simpler version: if one word's last letter is the same as another word's first letter, we can concatenate them. For example:
cat+tree+element -> Valid
aaa+aaa -> Valid
cab+aboard -> Invalid ('a' != 'b')
The question is then: concatenate as many words as possible.
But this is equivalent to the Hamiltonian path problem, so we don't have any polynomial-time algorithm for it.
See this for details: Hamiltonian path problem
PS:
For a small (~20) set, you can try heuristic search or a dynamic programming method to get a feasible solution, as in the sketch below.
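A hedged sketch of that dynamic programming idea for the simplified chaining version (a Held-Karp style bitmask DP over subsets of words; the function name and word lists are illustrative, and for n around 20 a pure-Python version like this would need bit tricks or a faster language):
    def longest_chain(words):
        # reachable[mask] = bitmask of words a valid chain over subset `mask` can end with
        n = len(words)
        ok = [[words[i][-1] == words[j][0] for j in range(n)] for i in range(n)]
        reachable = [0] * (1 << n)
        for i in range(n):
            reachable[1 << i] = 1 << i
        best = 0
        for mask in range(1, 1 << n):
            ends = reachable[mask]
            if not ends:
                continue
            best = max(best, bin(mask).count("1"))
            for i in range(n):
                if ends >> i & 1:
                    for j in range(n):
                        if not (mask >> j & 1) and ok[i][j]:
                            reachable[mask | (1 << j)] |= 1 << j
        return best

    print(longest_chain(["cat", "tree", "element"]))   # 3: cat -> tree -> element
    print(longest_chain(["cab", "aboard"]))            # 1: 'b' != 'a', so no join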

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespeare's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem, because in this example's "complete works" there is very little repetition.
Hopefully, this example segmentation helps clarify the 2^(N-1) substring searches of the complete works. Each of the (N-1) gaps between the N characters in the essay may either be a break between segments or not, resulting in ~2^(N-1) segmentation hypotheses to test against the complete works.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solved 'substring matching' in this way - but not scoring. The state space for scoring, on the surface at least, seems wildly too large (for any substantial complete works of Shakespeare). I'd welcome any explanation that undermines my assumption that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the students work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible. So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
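A compact sketch of the greedy scorer argued for above. For clarity it uses a naive substring test instead of a suffix array or tree (so each step costs O(N*M) rather than the near-linear time a suffix structure would give), and it assumes every essay character occurs somewhere in the works:
    def score(essay: str, works: str) -> int:
        # Greedily take the longest prefix of the remaining essay that occurs
        # somewhere in the works; the number of pieces taken is the score.
        pieces, i = 0, 0
        while i < len(essay):
            if essay[i] not in works:
                raise ValueError("essay character never occurs in the works")
            j = i + 1
            while j <= len(essay) and essay[i:j] in works:
                j += 1
            pieces += 1
            i = j - 1          # advance past the matched piece
        return pieces

    works = "The quick brown fox jumps over the lazy dog."
    print(score("own fox jump", works))    # 1
    print(score("The quick brow", works))  # 1
    print(score("jogging", works))         # 6: [j][og][g][i][n][g]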
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).
The depth of the trie is limited by the available space. With a gigabyte of RAM you could reasonably expect to exhaustively cover Shakespearean English strings of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
This feels like a problem of partial matching a very large regular expression.
If so it can be solved by a very large non deterministic finite state automata or maybe more broadly put as a graph representing for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons the NDFA is guaranteed to be convertible to a DFA. But then this construction can give rise to 2^n states, maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states; one state for each character and C states where C = 26*2 + #punctuation to connect to each of the M states to allow the algorithm to (re)start when there are 0 matched characters. The question is would the corresponding DFA have O(2^M) states and if so is it necessary to make that DFA, theoretically it's not necessary. However, consider that in the construction, each state will have one and only one transition to exactly one other state (the next state corresponding to the next character in that work). We would expect that each one of the start states will be connected to on average M/C states, but in the worst case M meaning the NDFA will have to track at most M simultaneous states. That's a large number but not an impossibly large number for computers these days.
The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached.
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of the string search algorithms, it looks like a small modification on failure to match (increment counter) and success (keep going) can serve as a general strategy.

Getting fuzzy string matches from database very fast

I have a database of ~150,000 words and a pattern (any single word), and I want to get all words from the database whose Damerau-Levenshtein distance to the pattern is less than a given number. I need to do it extremely fast. What algorithm could you suggest? If there's no good algorithm for Damerau-Levenshtein distance, plain Levenshtein distance will be welcome as well.
Thank you for your help.
P.S. I'm not going to use SOUNDEX.
I would start with a SQL function to calculate the Levenshtein distance (in T-SQL or .Net) (yes, I'm an MS person...) with a maximum-distance parameter that would cause an early exit.
This function could then be used to compare your input with each string, checking the distance and moving on to the next string if it breaks the threshold.
I was also thinking you could, for example, set the maximum distance to 2, then filter out all words whose length differs by more than 1 while the first letter also differs. With an index this may be slightly quicker.
You could also shortcut by bringing back all strings that are perfect matches first (indexing will speed this up), as these would otherwise take longer only to yield a Levenshtein distance of 0.
Just some thoughts....
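The early-exit idea translates directly into code. A sketch in Python rather than T-SQL/.Net (the function name and threshold handling are illustrative; the bail-out is valid because row minima in the Levenshtein table never decrease from one row to the next):
    def within_distance(a: str, b: str, max_dist: int) -> bool:
        # Levenshtein distance with an early exit once the bound is hopeless.
        if abs(len(a) - len(b)) > max_dist:           # cheap length filter first
            return False
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            if min(curr) > max_dist:                  # no cell can get back under the bound
                return False
            prev = curr
        return prev[-1] <= max_dist

    print(within_distance("receive", "recieve", 2))   # True (distance 2)
    print(within_distance("receive", "banana", 2))    # False, and it exits early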
I do not think you can calculate this kind of function without actually enumerating all rows.
So the solutions are:
Make it a very fast enumeration (but this doesn't really scale)
Filter initial variants somehow (index by a letter, at least x common letters)
Use an alternative (indexable) algorithm, such as N-grams (however, I do not have details on the result quality of N-grams versus D-L distance).
A solution off the top of my head might be to store the database in a sorted set (e.g., std::set in C++), as it seems to me that lexicographically sorted strings would compare well. To approximate the position of the given string in the set, use std::upper_bound on the string, then iterate over the set outward from the found position in both directions, computing the distance as you go, collecting the words whose distance falls below the threshold, and stopping once the neighbours become too dissimilar. I have a feeling that this approach would probably only match strings with the same starting character, but if you're using the algorithm for spell-checking, then that restriction is common, or at least unsurprising.
Edit: If you're looking for an optimisation of the algorithm itself, however, this answer is irrelevant.
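A rough Python transliteration of that idea, reusing within_distance from the sketch above (bisect stands in for std::set/upper_bound, and the fixed scan window is an arbitrary assumption, since lexicographic closeness only roughly tracks edit distance):
    from bisect import bisect_left

    def neighbourhood_matches(words_sorted, pattern, max_dist, window=200):
        # Scan a window of lexicographic neighbours around the pattern's
        # insertion point and keep those within the distance bound (heuristic).
        pos = bisect_left(words_sorted, pattern)
        lo, hi = max(0, pos - window), min(len(words_sorted), pos + window)
        return [w for w in words_sorted[lo:hi]
                if within_distance(w, pattern, max_dist)]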
I have used KNIME for fuzzy string matching and got very fast results. It is also very easy to build visual workflows in it. Just install the free edition of KNIME from https://www.knime.org/ and then use the "String Distance" and "Similarity Search" nodes to get your results. I have attached a small fuzzy-matching sample workflow here (the input data comes in from the top and the patterns to search for come in from the bottom in this case):
I would recommend looking into Ankiro.
I'm not certain that it meets your requirements for precision, but it is fast.
