This is what I have in mind, but it's O(n^2):
For example, if the input is "Thisisawesome", we need to check whether adding the current character extends an earlier match into a longer, meaningful word. But to see how far we need to back up, we may have to traverse all the way to the beginning: "awe" and "some" are proper words on their own, but "awesome" is the longer word. Please suggest how we can improve the complexity. Here is the code:
// dict is assumed to be the dictionary set (e.g. a set<string>) visible in this scope.
void update(string in)
{
    int len = in.length();
    vector<int> DS(len, 0);   // DS[k]: length of the longest dictionary word ending at index k
    string word;

    for (int i = 0; i < len; i++)
        for (int j = i + 1; j <= len; j++)
        {
            word = in.substr(i, j - i);              // candidate word in[i..j-1]
            if (dict.find(word) != dict.end())
                DS[j - 1] = max(DS[j - 1], (int)word.length());
        }
}
There is a dynamic programming solution which at first looks like it is going to be O(n^2) but which turns out to be only O(n) for sufficiently large n and fixed size dictionary.
Work through the string from left to right. At the ith stage you need to work out whether there is a solution for the first i characters. To solve this, consider every possible way to break those i characters into two chunks. If the second chunk is a word and the first chunk can be broken up into words then there is a solution. The first requirement you can check with your dictionary. The second requirement you can check by looking to see if you found an answer for the first j characters, where j is the length of the first chunk.
This would be O(n^2) because for each of the 1, 2, 3, ..., n prefix lengths you consider every possible split. However, if you know the length of the longest word in your dictionary, you know there is no point considering splits which make the second chunk longer than that. So for each of the 1, 2, 3, ..., n prefix lengths you consider at most w possible splits, where w is the length of the longest dictionary word, and the total cost is O(nw), i.e. O(n) for a fixed dictionary.
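Here is a minimal Java sketch of that bounded dynamic programming approach (the method and variable names are mine, not from the question):
import java.util.*;

class WordBreakDP {
    // canBreak[i] is true when the first i characters can be split into dictionary words.
    // Only splits whose second chunk is no longer than the longest dictionary word are tried,
    // so the inner loop is bounded by a constant for a fixed dictionary.
    static boolean canSegment(String s, Set<String> dict) {
        int maxWord = 0;
        for (String w : dict) maxWord = Math.max(maxWord, w.length());

        boolean[] canBreak = new boolean[s.length() + 1];
        canBreak[0] = true;                                // the empty prefix is trivially breakable
        for (int i = 1; i <= s.length(); i++) {
            // j is where the second chunk starts; it never needs to be more than maxWord back
            for (int j = Math.max(0, i - maxWord); j < i; j++) {
                if (canBreak[j] && dict.contains(s.substring(j, i))) {
                    canBreak[i] = true;
                    break;
                }
            }
        }
        return canBreak[s.length()];
    }
}
For instance, canSegment("thisisawesome", dict) would return true for a dictionary containing "this", "is" and "awesome".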
I have coded my solution today, and will put it on a web site tomorrow. Anyway, the method is as follows:
Arrange the dictionary in a trie.
The trie can help to do multiple matches quickly, because all dictionary words starting with the same letters can be matched at the same time.
(e.g. "chairman" matches "chair" and "chairman" in a trie.)
Use Dijkstra's algorithm to find the best match.
(e.g. for "chairman", if we count "c" as position 0, then we have the relationships 0->5, 0->8, 1->5, 2->5, 5->8. These relationships form a network that is perfect for Dijkstra's algorithm.)
(Note: Where are the weights of the edges? See the next point.)
Assign weights to the dictionary words.
Without weighting, bad matches tend to win over good matches. (e.g. "iamahero" becomes "i ama hero" instead of "i am a hero".)
The SCOWL dictionary at http://app.aspell.net/create serves the purpose well, because it comes in different sizes. These sizes (10, 20, etc.) are a good basis for the weighting.
After some tries I found it necessary to reduce the weighting of words ending with "s", so that "eyesandme" becomes "eyes and me" instead of "eye sand me".
I have been able to split a paragraph in milliseconds. The algorithm has linear complexity in the length of the string to be split, so it scales well as long as there is enough memory.
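For illustration only, here is a rough Java sketch of that position-graph idea; it uses a plain map of word weights instead of a trie and the SCOWL tiers, and every name in it is my assumption, not the author's code:
import java.util.*;

class Segmenter {
    // weights maps each dictionary word to a cost (lower = preferred), standing in for the
    // tier-based weighting described above. A trie would enumerate the words starting at a
    // given position faster than this substring loop; a HashMap keeps the sketch short.
    static List<String> segment(String text, Map<String, Double> weights, int maxWordLen) {
        int n = text.length();
        double[] dist = new double[n + 1];
        int[] prev = new int[n + 1];                       // backpointer: where the last word started
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        Arrays.fill(prev, -1);
        dist[0] = 0.0;

        // Lazy Dijkstra over positions 0..n; an edge i -> j exists when text[i..j) is a word.
        PriorityQueue<double[]> pq = new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        pq.add(new double[]{0.0, 0});
        while (!pq.isEmpty()) {
            double[] top = pq.poll();
            int i = (int) top[1];
            if (top[0] > dist[i]) continue;                // stale queue entry
            for (int j = i + 1; j <= Math.min(n, i + maxWordLen); j++) {
                Double w = weights.get(text.substring(i, j));
                if (w == null) continue;
                if (dist[i] + w < dist[j]) {
                    dist[j] = dist[i] + w;
                    prev[j] = i;
                    pq.add(new double[]{dist[j], j});
                }
            }
        }
        if (prev[n] == -1) return Collections.emptyList(); // no full segmentation found
        LinkedList<String> words = new LinkedList<>();
        for (int j = n; j > 0; j = prev[j]) words.addFirst(text.substring(prev[j], j));
        return words;
    }
}
Since all edges point forward, a simple left-to-right DP would also work here; Dijkstra is used only to match the description above.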
Here's the dump (sorry for bragging). (The passage selected is "Novel" in Wikipedia.)
D:\GoogleDrive\programs\WordBreaker>"word breaker"<novelnospace.txt>output.txt
D:\GoogleDrive\programs\WordBreaker>type output.txt
Number of words after reading words-10.txt : 4101
Number of words after reading words-20.txt : 11329
Number of words after reading words-35.txt : 43292
Number of words after reading words-40.txt : 49406
Number of words after reading words-50.txt : 87966
Time elapsed in reading dictionary: 0.956782s
Enter the string to be broken into words:
Result:
a novel is along narrative normally in prose which describes fictional character
s and events usually in the form of a sequential story while i an watt in the ri
se of the novel 1957 suggests that the novel came into being in the early 18 th
century the genre has also been described as possessing a continuous and compreh
ensive history of about two thousand years with historical roots in classical gr
eece and rome medieval early modern romance and in the tradition of the novel la
the latter an italian word used to describe short stories supplied the present g
eneric english term in the 18 th century miguel de cervantes author of don quixo
te is frequently cited as the first significant europe an novelist of the modern
era the first part of don quixote was published in 1605 while a more precise de
finition of the genre is difficult the main elements that critics discuss are ho
w the narrative and especially the plot is constructed the themes settings and c
haracterization how language is used and the way that plot character and setting
relate to reality the romance is a related long prose narrative w alter scott d
efined it as a fictitious narrative in prose or verse the interest of which turn
s upon marvellous and uncommon incidents whereas in the novel the events are acc
ommodated to the ordinary train of human events and the modern state of society
however many romances including the historical romances of scott emily brontes w
u the ring heights and her man melvilles mo by dick are also frequently called n
ovels and scott describes romance as a kind red term romance as defined here sho
uld not be confused with the genre fiction love romance or romance novel other e
urope an languages do not distinguish between romance and novel a novel isle rom
and err o ma nil roman z o
Time elapsed in splitting: 0.00495095s
D:\GoogleDrive\programs\WordBreaker>type novelnospace.txt
Anovelisalongnarrativenormallyinprosewhichdescribesfictionalcharactersandeventsu
suallyintheformofasequentialstoryWhileIanWattinTheRiseoftheNovel1957suggeststhat
thenovelcameintobeingintheearly18thcenturythegenrehasalsobeendescribedaspossessi
ngacontinuousandcomprehensivehistoryofabouttwothousandyearswithhistoricalrootsin
ClassicalGreeceandRomemedievalearlymodernromanceandinthetraditionofthenovellaThe
latteranItalianwordusedtodescribeshortstoriessuppliedthepresentgenericEnglishter
minthe18thcenturyMigueldeCervantesauthorofDonQuixoteisfrequentlycitedasthefirsts
ignificantEuropeannovelistofthemodernerathefirstpartofDonQuixotewaspublishedin16
05Whileamoreprecisedefinitionofthegenreisdifficultthemainelementsthatcriticsdisc
ussarehowthenarrativeandespeciallytheplotisconstructedthethemessettingsandcharac
terizationhowlanguageisusedandthewaythatplotcharacterandsettingrelatetorealityTh
eromanceisarelatedlongprosenarrativeWalterScottdefineditasafictitiousnarrativein
proseorversetheinterestofwhichturnsuponmarvellousanduncommonincidentswhereasinth
enoveltheeventsareaccommodatedtotheordinarytrainofhumaneventsandthemodernstateof
societyHowevermanyromancesincludingthehistoricalromancesofScottEmilyBrontesWuthe
ringHeightsandHermanMelvillesMobyDickarealsofrequentlycallednovelsandScottdescri
besromanceasakindredtermRomanceasdefinedhereshouldnotbeconfusedwiththegenreficti
onloveromanceorromancenovelOtherEuropeanlanguagesdonotdistinguishbetweenromancea
ndnovelanovelisleromanderRomanilromanzo
D:\GoogleDrive\programs\WordBreaker>
Finding the lexicographically minimal string rotation is a well-known problem, for which a linear-time algorithm was proposed by Jean Pierre Duval in 1983. This blog post is probably the only publicly available resource that talks about the algorithm in detail. However, Duval's algorithm is based on the idea of pairwise comparisons ("duels"), and the blog conveniently uses an even-length string as an example.
How does the algorithm work for odd-length strings, where the last character wouldn't have a competing one to duel with?
One character can get a "bye", where it wins without participating in a "duel". The correctness of the algorithm does not rely on the specific duels that you perform; given any two distinct indices i and j, you can always conclusively rule out one of them as the start-index of the lexicographically-minimal rotation (unless both are start-indices of identical lexicographically-minimal rotations, in which case it doesn't matter which one you reject). The reason to perform the duels in a specific order is performance: you get asymptotically linear time by ensuring that half the duels only need to compare one character, half of the rest only need to compare two characters, and so on, until the last duel only needs to compare half the length of the string. A single odd character here and there doesn't change the asymptotic complexity; it just makes the math (and implementation) a little more complicated. A string of length 2n+1 still requires fewer "duels" than one of length 2n+2.
OP here: I accepted ruakh's answer as it pertains to my question, but I wanted to provide my own explanation for others that might stumble across this post trying to understand Duval's algorithm.
Problem:
Lexicographically least circular substring is the problem of finding
the rotation of a string possessing the lowest lexicographical order
of all such rotations. For example, the lexicographically minimal
rotation of "bbaaccaadd" would be "aaccaaddbb".
Solution:
An O(n) time algorithm was proposed by Jean Pierre Duval (1983).
Given two indices i and j, Duval's algorithm compares string segments of length j - i starting at i and j (called a "duel"). If a segment would run past the end of the string, it is formed by wrapping around.
For example, consider s = "baabbaba", i = 5 and j = 7. Since j - i = 2, the first segment starting at i = 5 is "ab". The second segment starting at j = 7 is constructed by wrapping around, and is also "ab".
If the strings are lexicographically equal, like in the above example, we choose the one starting at i as the winner, which is i = 5.
The above process is repeated until we have a single winner. If the input string is of odd length, the last character wins without a comparison in the first iteration.
Time complexity:
The first iteration compares n strings each of length 1 (n/2 comparisons), the second iteration may compare n/2 strings of length 2 (n/2 comparisons), and so on, until the last iteration compares 2 strings of length n/2 (n/2 comparisons). Since the number of winners is halved each time, the height of the recursion tree is log(n), giving an O(n log(n)) algorithm as described here (Duval's original formulation achieves O(n)).
Space complexity is O(n) too, since in the first iteration we have to store n/2 winners, in the second iteration n/4 winners, and so on. (Wikipedia claims this algorithm uses constant space; I don't understand how.)
Here's a Scala implementation; feel free to convert to your favorite programming language.
import scala.annotation.tailrec

def lexicographicallyMinRotation(s: String): String = {
  @tailrec
  def duel(winners: Seq[Int]): String = {
    if (winners.size == 1) s"${s.slice(winners.head, s.length)}${s.take(winners.head)}"
    else {
      val newWinners: Seq[Int] = winners
        .sliding(2, 2)
        .map {
          case Seq(x, y) =>
            val range = y - x
            Seq(x, y)
              .map { i =>
                // Wrap around when the segment would run past the end of the string.
                val segment =
                  if (s.isDefinedAt(i + range - 1)) s.slice(i, i + range)
                  else s"${s.slice(i, s.length)}${s.take(i + range - s.length)}"
                (i, segment)
              }
              .reduce((a, b) => if (a._2 <= b._2) a else b) // ties go to the earlier index
              ._1
          case xs => xs.head // the odd one out gets a bye
        }
        .toSeq
      duel(newWinners)
    }
  }
  duel(s.indices)
}
I have a list of texts that need to be sorted by similarity, based on a value obtained exclusively from each text itself. Hence, no pairwise comparison is allowed, as it could take too long to compare each text with thousands of others.
The idea is to generate values (numeric or not), from arbitrary text phrases, like in the next example:
42334220 = "A white horse is running accross the field"
42334229 = "The white horse is running across that field"
42334403 = "A white animal is running across the green field"
Notice that the first and second phrases end up together because they are more similar to each other than to the last one, even though the first and third phrases start with the same letter.
I have used the Soundex function in other scenarios, but it is oriented towards pronunciation, works on single words, and depends on the first letter.
So, how can I generate such a value (i.e., what algorithms exist to compute it) so that it represents a phrase for sorting purposes?
I suggest looking at the Fourier transform. It has been applied successfully to find similarities in images and audio samples, and I think it could be well suited to your problem.
You could view a String as an int array, i.e. a signal function position => char value
Do a Fourier transform on this function
Return a sublist of the k indexes of the highest elements in the result (greater k means greater precision)
Perhaps the signal function has to be adjusted to get a better match to the common understanding of similarity:
use a function position => word-index (look up the word in a dictionary and get its index)
use a function position => ngram-index (= weighted sum of the n chars)
if the sequence of words is not relevant, use a function char => frequency of that character (ordered alphabetically)
Maybe other transforms (e.g. wavelet transform) would do better.
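As a rough illustration of the char-value signal variant, here is a naive DFT fingerprint in Java; the magnitude-based top-k fingerprint and all names are my own assumptions, not an established recipe:
import java.util.stream.IntStream;

class FourierFingerprint {
    // Treat the string as a signal f(position) = char value, take a naive O(n^2) DFT,
    // and keep the indexes of the k largest magnitudes as a crude similarity fingerprint.
    static int[] fingerprint(String s, int k) {
        int n = s.length();
        double[] mag = new double[n];
        for (int freq = 0; freq < n; freq++) {
            double re = 0, im = 0;
            for (int pos = 0; pos < n; pos++) {
                double angle = -2 * Math.PI * freq * pos / n;
                re += s.charAt(pos) * Math.cos(angle);
                im += s.charAt(pos) * Math.sin(angle);
            }
            mag[freq] = Math.hypot(re, im);
        }
        // Pick the k frequency indexes with the largest magnitude (index 0 is just the mean, skip it).
        return IntStream.range(1, n)
                .boxed()
                .sorted((a, b) -> Double.compare(mag[b], mag[a]))
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
Two strings could then be treated as close when their fingerprints share many indexes; in practice an FFT library would replace the O(n^2) loop.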
I was looking through a programming question, when the following question suddenly seemed related.
How do you convert one string into another using as few swaps as possible, where a swap is defined below? The strings are guaranteed to be interconvertible (they have the same multiset of characters; this is given), but characters can be repeated. I found web results for the same question, but only for the case where characters are not repeated.
Any two characters in the string can be swapped.
For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.
Thanks!
This is an expanded and corrected version of Subhasis's answer.
Formally, the problem is: given an n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that m-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.
The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j_1, ..., j_k are pairwise distinct in {1, ..., m}, the cycle (j_1 ... j_k) is a permutation that maps j_i to j_{i+1} for i in {1, ..., k - 1}, maps j_k to j_1, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j_1 ... j_k) can be written as the product (j_1 j_k) (j_1 j_{k-1}) ... (j_1 j_2) of k - 1 swaps. In general, every permutation can be written as a composition of m - c swaps, where c is the number of cycles in its cycle decomposition (counting fixed points as 1-cycles). A straightforward induction proof shows that this is optimal.
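A small worked example: for m = 5 and p = (1 3 2)(4)(5), the decomposition has c = 3 cycles (the two fixed points count as 1-cycles), so p is a composition of 5 - 3 = 2 swaps, namely (1 3 2) = (1 2)(1 3).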
Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., m}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine how many parts a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j_1 ... j_k) of q, there is a part whose directed cycle is comprised of the arcs labeled j_1, ..., j_k.
The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.
As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (it is fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(m^n), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to solve this problem exactly on instances with large alphabets, then I likely would try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.
@article{DBLP:journals/corr/GutinJSW14,
  author    = {Gregory Gutin and
               Mark Jones and
               Bin Sheng and
               Magnus Wahlstr{\"o}m},
  title     = {Parameterized Directed $k$-Chinese Postman Problem and $k$
               Arc-Disjoint Cycles Problem on Euler Digraphs},
  journal   = {CoRR},
  volume    = {abs/1402.2137},
  year      = {2014},
  ee        = {http://arxiv.org/abs/1402.2137},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
You can construct the "difference" strings S and S', i.e. strings that contain the characters at the differing positions of the two input strings; e.g. for acbacb and abcabc they will be cbcb and bcbc. Let us say they contain n characters.
You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case where all these characters are unique, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found in O(n) time.
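As a small Java sketch of that unique-character case (all names are mine), building the permutation from the difference strings and counting its cycles:
import java.util.*;

class MinSwaps {
    // Minimum swaps to turn s into t when the characters at the differing positions are
    // all distinct: keep only the differing positions, build the mapping i -> position of
    // S[i] in S', and return (number of differing positions) - (number of cycles).
    static int minSwapsDistinct(String s, String t) {
        List<Integer> diff = new ArrayList<>();
        Map<Character, Integer> whereInT = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != t.charAt(i)) {
                whereInT.put(t.charAt(i), diff.size());  // position within the "difference" string
                diff.add(i);
            }
        }
        int n = diff.size(), cycles = 0;
        boolean[] seen = new boolean[n];
        for (int start = 0; start < n; start++) {
            if (seen[start]) continue;
            cycles++;
            int cur = start;
            while (!seen[cur]) {
                seen[cur] = true;
                cur = whereInT.get(s.charAt(diff.get(cur)));  // follow the edge S[i] == S'[j]
            }
        }
        return n - cycles;
    }
}
For example, minSwapsDistinct("abcd", "badc") returns 2 (cycles (0 1) and (2 3)). With duplicate characters the map above picks an arbitrary matching, which is exactly where the hard cycle-packing question below comes in.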
However, in the case where there are any number of duplicate characters, this reduces to the problem of finding the largest number of arc-disjoint cycles in a directed graph, which, I think, is NP-hard (e.g. check out: http://www.math.ucsd.edu/~jverstra/dcig.pdf ).
In that paper a few greedy algorithms are pointed out, one of which is particularly simple:
At each step, find the minimum length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights )
Delete it
Repeat until all vertices have been covered.
However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!
Edit:
Please refer to David's answer for a fuller and correct explanation of the problem.
Do an A* search (see http://en.wikipedia.org/wiki/A-star_search_algorithm for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.
Let's consider a query set Q and a larger set S. Each element of Q occurs in S. The goal is to express Q using the smallest number of (connected) "components" of S.
Here is a concrete example:
Q={I love France and wine}
S={(I live here), (I love you and her), (France is beautiful), (cheese and wine)}
A solution for Q might be:
- "I" from "I live here"
- "love" from "I love you and her"
- "France" from "France is beautiful"
- "and" from "I love you and her"
- "wine" from "cheese and wine"
This results in 5 "components", i.e. "I", "love", "France", "and", "wine"
A better solution is:
- "I love" from "I love you and her"
- "France" from "France is beautiful"
- "and wine" from "cheese and wine"
This results in 3 "components", i.e. "I love", "France", "and wine"
which might be the optimal solution for this example. We want to minimize this number of "components".
Does anyone know what such an algorithm is called?
I searched in text parsing, text mining, and so on, but I did not find anything appropriate.
What you're describing sounds like the set cover problem, in which you have a master set (in your case, the query) and a family of sets (your components) to pick from with the goal of covering every element of the master set. This problem is well studied, but unfortunately it's NP-hard and there is no known polynomial-time algorithm for it. Moreover, the best polynomial-time approximation algorithms for set cover only get within a factor of O(log n) of the true solution in the worst case.
If you're dealing with small queries or small numbers of components, you can just brute-force the answer by listing all subsets and checking which ones work. For large queries or large numbers of components, though, you should not expect to get exact answers efficiently.
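For what it's worth, the classic greedy heuristic (repeatedly take the candidate set covering the most still-uncovered elements) attains the O(log n) approximation factor mentioned above. A generic Java sketch, with all names mine, where the candidates would be the pieces you are willing to take from S:
import java.util.*;

class GreedySetCover {
    // Generic greedy set cover: repeatedly pick the candidate covering the most uncovered
    // elements of the query. Gives a logarithmic-factor approximation, not necessarily the optimum.
    static <T> List<Set<T>> cover(Set<T> query, List<Set<T>> candidates) {
        Set<T> uncovered = new HashSet<>(query);
        List<Set<T>> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            Set<T> best = null;
            int bestGain = 0;
            for (Set<T> c : candidates) {
                Set<T> gain = new HashSet<>(c);
                gain.retainAll(uncovered);
                if (gain.size() > bestGain) { bestGain = gain.size(); best = c; }
            }
            if (best == null) return null;   // the query cannot be covered at all
            chosen.add(best);
            uncovered.removeAll(best);
        }
        return chosen;
    }
}
Note that plain set cover ignores word order and adjacency, so for this particular problem the interval-based answer below exploits more structure.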
Hope this helps!
I would describe this problem as "minimum interval cover"; I'm not sure that's the canonical name, but I'm not the first to use that phrase.
There's an efficient algorithm that has two phases. In the first phase, identify the maximal substrings of the query that appear in the source. For each such substring, output an interval for the second phase. In the second phase, find a minimum-cardinality cover by choosing repeatedly the interval with the highest endpoint that covers the lowest uncovered position.
In your example
Q=(I love France and wine)
S={(I live here), (I love you and her), (France is beautiful), (cheese and wine)}
the intervals are, indexing from one, (1, 2) "I love", (3, 3) "France", (4, 5) "and wine". Oops, now the second phase is trivial. Suppose instead
Q=(a b c d)
S={(a b), (b c), (c d)}
then the intervals are (1, 2) "a b", (2, 3) "b c", (3, 4) "c d". The lowest uncovered is 1; we take (1, 2). The lowest uncovered is 3; we take (3, 4) over (2, 3) because 4 > 3.
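Here is a small Java sketch of the second phase (intervals are assumed to come from phase one as inclusive, 1-based [start, end] pairs; the tie-breaking detail and the names are mine):
import java.util.*;

class IntervalCover {
    // Phase two: among the intervals that cover the lowest still-uncovered position,
    // repeatedly take the one reaching furthest right.
    static List<int[]> minimumCover(int queryLength, List<int[]> intervals) {
        intervals.sort(Comparator.comparingInt((int[] iv) -> iv[0]));
        List<int[]> chosen = new ArrayList<>();
        int uncovered = 1, i = 0, n = intervals.size();     // positions are 1-based as above
        while (uncovered <= queryLength) {
            int[] best = null;
            while (i < n && intervals.get(i)[0] <= uncovered) {
                if (best == null || intervals.get(i)[1] > best[1]) best = intervals.get(i);
                i++;
            }
            if (best == null || best[1] < uncovered) return null;   // gap: no cover exists
            chosen.add(best);
            uncovered = best[1] + 1;
        }
        return chosen;
    }
}
On the second example this picks (1, 2) and then (3, 4), matching the walkthrough above.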
Edited to add:
The bottleneck is likely to be the first phase. If it's a problem, there's an algorithm for that: construct a suffix tree containing the source sentences. Then, traverse the tree according to the query string. Unless the query appears verbatim in the source, you'll eventually try to follow a nonexistent link; in that case, the current maximal interval ends, and you need to follow the suffix links until you can make progress again. (Computational biologists, which algorithm am I describing?)
I have a corpus of 900,000 strings. They vary in length, but have an average character count of about 4,500. I need to find the most efficient way of computing the Dice coefficient of every string as it relates to every other string. Unfortunately, this results in the Dice coefficient algorithm being used some 810,000,000,000 times.
What is the best way to structure this program for increased efficiency? Obviously, I can prevent computing the Dice of sections A and B, and then B and A--but this only halves the work required. Should I consider taking some shortcuts or creating some sort of binary tree?
I'm using the following implementation of the Dice coefficient algorithm in Java:
public static double diceCoefficient(String s1, String s2) {
    // Collect the distinct character bigrams of each string.
    Set<String> nx = new HashSet<String>();
    Set<String> ny = new HashSet<String>();
    for (int i = 0; i < s1.length() - 1; i++) {
        char x1 = s1.charAt(i);
        char x2 = s1.charAt(i + 1);
        String tmp = "" + x1 + x2;
        nx.add(tmp);
    }
    for (int j = 0; j < s2.length() - 1; j++) {
        char y1 = s2.charAt(j);
        char y2 = s2.charAt(j + 1);
        String tmp = "" + y1 + y2;
        ny.add(tmp);
    }
    // Dice = 2 * |intersection| / (|nx| + |ny|)
    Set<String> intersection = new HashSet<String>(nx);
    intersection.retainAll(ny);
    double totcombigrams = intersection.size();
    return (2 * totcombigrams) / (nx.size() + ny.size());
}
My ultimate goal is to output an ID for every section that has a Dice coefficient of greater than 0.9 with another section.
Thanks for any advice that you can provide!
Make a single pass over all the Strings, and build up a HashMap which maps each bigram to a set of the indexes of the Strings which contain that bigram. (Currently you are building the bigram set 900,000 times, redundantly, for each String.)
Then make a pass over all the sets, and build a HashMap of [index,index] pairs to common-bigram counts. (The latter Map should not contain redundant pairs of keys, like [1,2] and [2,1] -- just store one or the other.)
Both of these steps can easily be parallelized. If you need some sample code, please let me know.
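To illustrate (this is not the answerer's code, and every name here is my own), a rough sketch of those two passes:
import java.util.*;

class BigramIndex {
    // Pass 1: bigram -> indexes of the strings that contain it.
    static Map<String, Set<Integer>> indexBigrams(List<String> texts) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < texts.size(); id++) {
            String s = texts.get(id);
            for (int i = 0; i + 1 < s.length(); i++) {
                index.computeIfAbsent(s.substring(i, i + 2), k -> new HashSet<>()).add(id);
            }
        }
        return index;
    }

    // Pass 2: [smaller id, larger id] -> number of bigrams the two strings share.
    static Map<List<Integer>, Integer> commonBigramCounts(Map<String, Set<Integer>> index) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (Set<Integer> ids : index.values()) {
            List<Integer> sorted = new ArrayList<>(ids);
            Collections.sort(sorted);
            for (int a = 0; a < sorted.size(); a++)
                for (int b = a + 1; b < sorted.size(); b++)
                    counts.merge(Arrays.asList(sorted.get(a), sorted.get(b)), 1, Integer::sum);
        }
        return counts;
    }
}
Dice for a pair is then 2 * commonCount / (|bigrams of A| + |bigrams of B|), using the per-string bigram-set sizes from the first pass. A bigram that occurs in most of the strings still makes the pair loop expensive, which ties in with the NOTE below.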
NOTE one thing, though: from the 26 letters of the English alphabet, a total of 26x26 = 676 bigrams can be formed. Many of these will never or almost never be found, because they don't conform to the rules of English spelling. Since you are building up sets of bigrams for each String, and the Strings are so long, you will probably find almost the same bigrams in each String. If you were to build up lists of bigrams for each String (in other words, if the frequency of each bigram counted), it's more likely that you would actually be able to measure the degree of similarity between Strings, but then the calculation of Dice's coefficient as given in the Wikipedia article wouldn't work; you'd have to find a new formula.
I suggest you continue researching algorithms for determining similarity between Strings, try implementing a few of them, and run them on a smaller set of Strings to see how well they work.
You should come up with some kind of inequality like: if D(X1,X2) > 1-p and D(X1,X3) < 1-q, then D(X2,X3) < 1-q+p, or something like that. Now, if 1-q+p < 0.9, then you probably don't have to evaluate D(X2,X3).
PS: I am not sure about this exact inequality, but I have a gut feeling that it might be right (I do not have enough time to actually do the derivations now). Look for some of the inequalities that hold for other similarity measures and see if any of them are valid for the Dice coefficient.
=== Also ===
If there are a elements in set A, and if your threshold is r (= 0.9), then the number of elements b in set B should satisfy: r*a/(2-r) <= b <= (2-r)*a/r. This should eliminate the need for lots of comparisons, IMHO. You can probably sort the strings according to length and use the window described above to limit comparisons.
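To see where that window comes from: Dice(A, B) = 2|A ∩ B| / (a + b) and |A ∩ B| <= min(a, b), so the best score two sets of sizes a and b can possibly reach is 2*min(a, b) / (a + b). Requiring 2b / (a + b) >= r in the case b <= a gives b >= r*a/(2-r), and the symmetric case b >= a gives b <= (2-r)*a/r.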
Disclaimer first: This will not reduce the number of comparisons you'll have to make. But this should make a Dice comparison faster.
1) Don't build your HashSets every time you do a diceCoefficient() call! It should speed things up considerably if you just do it once for each string and keep the result around.
2) Since you only care about whether a particular bigram is present in the string, you could get away with a BitSet with a bit for each possible bigram, rather than a full HashSet. The coefficient calculation would then be simplified to ANDing two bit sets and counting the number of set bits in the result.
3) Or, if you have a huge number of possible bigrams (Unicode, perhaps?) - or monotonous strings with only a handful of bigrams each - a sorted array of bigrams might provide faster, more space-efficient comparisons.
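A rough sketch of the BitSet variant, assuming (my restriction, purely to keep the sketch short) that only the letters a-z matter:
import java.util.BitSet;

class BitSetDice {
    // One bit per letter bigram (26 * 26 possibilities); anything outside a-z is skipped here.
    static BitSet bigramBits(String s) {
        BitSet bits = new BitSet(26 * 26);
        String t = s.toLowerCase();
        for (int i = 0; i + 1 < t.length(); i++) {
            char a = t.charAt(i), b = t.charAt(i + 1);
            if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z')
                bits.set((a - 'a') * 26 + (b - 'a'));
        }
        return bits;
    }

    // Dice coefficient from two precomputed bigram bit sets: AND, then count bits.
    static double dice(BitSet x, BitSet y) {
        BitSet both = (BitSet) x.clone();
        both.and(y);
        int denom = x.cardinality() + y.cardinality();
        return denom == 0 ? 0.0 : 2.0 * both.cardinality() / denom;
    }
}
Precompute bigramBits once per string (point 1), and every pairwise comparison is then just the cheap dice call.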
Is the character set limited somehow? If it is, you can compute character counts by character code in each string and compare these numbers. After such a pre-computation (it will occupy 2 * 900K * S bytes of memory, if we assume no character occurs more than 65K times in the same string, where S is the number of distinct characters), computing the coefficient would take O(S) time. Sure, this would be helpful if S < 4500.