Find the smallest set of connected substrings - string

Let's consider a query set Q and a larger superset S. Each element of Q exists in S. The goal is to express Q using the smallest number of (connected) "components" of S.
Here is a concrete example:
Q={I love France and wine}
S={(I live here), (I love you and her), (France is beautiful), (cheese and wine)}
A solution for Q might be:
- "I" from "I live here"
- "love" from "I love you and her"
- "France" from "France is beautiful"
- "and" from "I love you and her"
- "wine" from "cheese and wine"
This results in 5 "components", i.e. "I", "love", "France", "and", "wine"
A better solution is:
- "I love" from "I love you and her"
- "France" from "France is beautiful"
- "and wine" from "cheese and wine"
This results in 3 "components", i.e. "I love", "France", "and wine"
which might be the optimal solution for this example. We want to minimize this number of "components".
Does anyone know what such an algorithm is called?
I searched in text parsing, text mining and so on but I did not find anything appropriate.

What you're describing sounds like the set cover problem, in which you have a master set (in your case, the query) and a family of sets (your components) to pick from with the goal of covering every element of the master set. This problem is well studied, but unfortunately it's NP-hard and there is no known polynomial-time algorithm for it. Moreover, the best polynomial-time approximation algorithms for set cover only get within a factor of O(log n) of the true solution in the worst case.
If you're dealing with small queries or small numbers of components, you can just brute-force the answer by listing all subsets and checking which ones work. For large queries or large numbers of components, though, you should not expect to get exact answers efficiently.
Hope this helps!

I would describe this problem as "minimum interval cover"; I'm not sure that's the canonical name, but I'm not the first to use that phrase.
There's an efficient algorithm that has two phases. In the first phase, identify the maximal substrings of the query that appear in the source. For each such substring, output an interval for the second phase. In the second phase, find a minimum-cardinality cover by choosing repeatedly the interval with the highest endpoint that covers the lowest uncovered position.
In your example
Q=(I love France and wine)
S={(I live here), (I love you and her), (France is beautiful), (cheese and wine)}
the intervals are, indexing from one, (1, 2) "I love", (3, 3) "France", (4, 5) "and wine". Oops, now the second phase is trivial. Suppose instead
Q=(a b c d)
S={(a b), (b c), (c d)}
then the intervals are (1, 2) "a b", (2, 3) "b c", (3, 4) "c d". The lowest uncovered is 1; we take (1, 2). The lowest uncovered is 3; we take (3, 4) over (2, 3) because 4 > 3.
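The two phases can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: it works on lists of words, indexes from zero rather than one, and finds the phase-one intervals by brute force instead of with a suffix tree. The function names `maximal_intervals` and `min_interval_cover` are made up for this example.

```python
def maximal_intervals(query, sources):
    """Phase 1 (brute force): for each start position in the query, record the
    longest run of query words that appears contiguously in some source."""
    n = len(query)
    intervals = []
    for start in range(n):
        best_end = -1
        for sentence in sources:
            for offset in range(len(sentence)):
                length = 0
                while (start + length < n
                       and offset + length < len(sentence)
                       and query[start + length] == sentence[offset + length]):
                    length += 1
                if length:
                    best_end = max(best_end, start + length - 1)
        if best_end >= start:
            intervals.append((start, best_end))
    return intervals


def min_interval_cover(n, intervals):
    """Phase 2 (greedy): among the intervals covering the lowest uncovered
    position, always take the one with the highest endpoint."""
    chosen = []
    pos = 0
    while pos < n:
        candidates = [iv for iv in intervals if iv[0] <= pos <= iv[1]]
        if not candidates:
            raise ValueError(f"position {pos} cannot be covered")
        best = max(candidates, key=lambda iv: iv[1])
        chosen.append(best)
        pos = best[1] + 1
    return chosen


query = "I love France and wine".split()
sources = [s.split() for s in ("I live here", "I love you and her",
                               "France is beautiful", "cheese and wine")]
cover = min_interval_cover(len(query), maximal_intervals(query, sources))
print([" ".join(query[a:b + 1]) for a, b in cover])
```

The brute-force first phase is quadratic or worse, which is fine for short queries; for long inputs it is the part you would replace.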
Edited to add:
The bottleneck is likely to be the first phase. If it's a problem, there's an algorithm for that: construct a suffix tree containing the source sentences. Then, traverse the tree according to the query string. Unless the query appears verbatim in the source, you'll eventually try to follow a nonexistent link; in that case, the current maximal interval ends, and you need to follow the suffix links until you can make progress again. (Computational biologists, which algorithm am I describing?)


How do I produce the correct Cohen's d estimate?

I am trying to replicate the result of an academic paper below to practice how to apply statistical methods in R.
This is what the paper states:
Pre- and postoutbreak differences in voter intentions.
Across the 32 elections included in primary analyses, the mean voter-intention difference score was greater than zero (M=1.02%), d=0.84, t(31)=2.34, p=.026.
This result is consistent with the pre- and postelection difference in nationwide polling results for the House of Representatives elections, which indicates a general postoutbreak shift toward favoring Republican rather than Democratic candidates. (If the two outliers were included in the analysis, the mean voter-intention difference score was not meaningfully different from zero, p=.937.) (Source screenshot)
The t-test result matched the values in the paper, but I cannot figure out how to produce the right Cohen's d estimate. I read the documentation for the cohen.d function over and over, googled how to do this, and even watched some boring YouTube videos, but to no avail. The code itself runs, but it gives me the wrong value. Am I missing something? Can someone help me with how I should format the arguments?
# dplyr provides %>% and filter(); cohen.d() is assumed to come from the effsize package
library(dplyr)
library(effsize)
# excluding outliers from the dataset
no_outliers <- study2 %>%
  filter(StateSenateRace != "Rhode Island", StateSenateRace != "Hawaii")
# paired t-test
t.test(no_outliers$OctMeanVoterIntentionIndex, no_outliers$SeptMeanVoterIntentionIndex, paired = TRUE, var.equal = TRUE, na.rm = TRUE)
cohen.d(no_outliers$SeptMeanVoterIntentionIndex, no_outliers$OctMeanVoterIntentionIndex, na.rm = TRUE)
Here's the result I got.
Cohen's d
d estimate: -0.02130406 (negligible)
95 percent confidence interval:
lower upper
-0.5171041 0.4744960
Thank you in advance--I wish I could contribute to Stack Overflow as an R expert someday!
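One way to narrow this down is to back out d from the reported t statistic and see which convention reproduces the paper's number. The sketch below is a guess at what the paper did, not something the paper states: it compares two common conversions, d = 2t / sqrt(df) and d_z = t / sqrt(n), using only the values quoted above (t = 2.34, df = 31, n = 32). The function names are made up for this example, and it is written in Python rather than R purely for illustration.

```python
import math


def d_from_t_two_group(t, df):
    """Conversion often used for independent groups: d = 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)


def d_from_t_paired(t, n):
    """Standardized mean difference score: d_z = t / sqrt(n)."""
    return t / math.sqrt(n)


t_value, n_elections = 2.34, 32
print(round(d_from_t_two_group(t_value, n_elections - 1), 2))
print(round(d_from_t_paired(t_value, n_elections), 2))
```

The first conversion rounds to the paper's 0.84, which suggests (but does not prove) that the authors derived d from t this way. A value near zero from cohen.d is what you would likely see if the two columns are treated as independent groups and the small mean difference is divided by the pooled standard deviation of the raw scores.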

Word2Vec Subsampling -- Implementation

I am implementing the Skipgram model, both in PyTorch and TensorFlow 2. I am having doubts about the implementation of subsampling of frequent words. Verbatim from the paper, the probability of subsampling word w_i is computed as
$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$
where t is a custom threshold (usually a small value such as 0.0001) and f(w_i) is the frequency of the word in the document. Although the authors implemented it in a different, but almost equivalent way, let's stick with this definition.
When computing P(w_i), we can end up with negative values. For example, assume we have 100 words, and one of them appears far more often than the others (as is the case for my dataset).
import numpy as np
import seaborn as sns
# generate counts in [1, 20]
counts = np.random.randint(low=1, high=20, size=99)
# add an extremely bigger count
counts = np.insert(counts, 0, 100000)
# compute frequencies
f = counts/counts.sum()
# define threshold as in paper
t = 0.0001
# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
Q: What is the correct way to implement subsampling using this "probability"?
As an additional info, I have seen that in keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:
def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.
    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).
    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
    where `gamma` is the Euler-Mascheroni constant.
    # Arguments
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.
    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq
    return np.minimum(1., f / np.sqrt(f))
I tend to trust deployed code more than paper write-ups, especially in a case like word2vec, where the word2vec.c code released by the paper's authors has been widely used & served as the template for other implementations. If we look at its subsampling mechanism...
if (sample > 0) {
  real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
  next_random = next_random * (unsigned long long)25214903917 + 11;
  if (ran < (next_random & 0xFFFF) / (real)65536) continue;
}
...we see that those words with tiny counts (.cn) that could give negative values in the original formula instead here give values greater than 1.0, and thus can never be less than the random value, which is masked and scaled so that it is always below 1.0 ((next_random & 0xFFFF) / (real)65536). So, it seems the authors' intent was for all negative values of the original formula to mean "never discard".
As per the keras make_sampling_table() comment & implementation, they're not consulting the actual word-frequencies at all. Instead, they're assuming a Zipf-like distribution based on word-rank order to synthesize a simulated word-frequency.
If their assumptions were to hold – the related words are from a natural-language corpus with a Zipf-like frequency-distribution – then I'd expect their sampling probabilities to be close to down-sampling probabilities that would have been calculated from true frequency information. And that's probably "close enough" for most purposes.
I'm not sure why they chose this approximation. Perhaps other aspects of their usual processes have not maintained true frequencies through to this step, and they're expecting to always be working with natural-language texts, where the assumed frequencies will be generally true.
(As luck would have it, and because people often want to impute frequencies to public sets of word-vectors which have dropped the true counts but are still sorted from most- to least-frequent, just a few days ago I wrote an answer about simulating a fake-but-plausible distribution using Zipf's law – similar to what this keras code is doing.)
But, if you're working with data that doesn't match their assumptions (as with your synthetic or described datasets), their sampling-probabilities will be quite different than what you would calculate yourself, with any form of the original formula that uses true word frequencies.
In particular, imagine a distribution with one token a million times, then a hundred tokens all appearing just 10 times each. Those hundred tokens' order in the "rank" list is arbitrary – truly, they're all tied in frequency. But the simulation-based approach, by fitting a Zipfian distribution on that ordering, will in fact be sampling each of them very differently. The one 10-occurrence word lucky enough to be in the 2nd rank position will be far more downsampled, as if it were far more frequent. And the 1st-rank "tall head" value, by having its true frequency *under-*approximated, will be less down-sampled than otherwise. Neither of those effects seem beneficial, or in the spirit of the frequent-word-downsampling option - which should only "thin out" very-frequent words, and in all cases leave words of the same frequency as each other in the original corpus roughly equivalently present to each other in the down-sampled corpus.
So for your case, I would go with the original formula (probability-of-discarding-that-requires-special-handling-of-negative-values), or the word2vec.c practical/inverted implementation (probability-of-keeping-that-saturates-at-1.0), rather than the keras-style approximation.
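Both options can be sketched with NumPy as below. This is an illustration under stated assumptions rather than a reference implementation: `discard_probs` is the paper's formula with negative values clipped to zero ("never discard"), `keep_probs_word2vec_c` mirrors the `ran` expression from word2vec.c capped at 1.0, and `subsample` is a hypothetical helper that applies a keep-probability table to an array of token ids.

```python
import numpy as np


def discard_probs(counts, t=1e-4):
    """Paper formula, P(discard) = 1 - sqrt(t / f), clipped into [0, 1]."""
    f = counts / counts.sum()
    return np.clip(1.0 - np.sqrt(t / f), 0.0, 1.0)


def keep_probs_word2vec_c(counts, sample=1e-4):
    """word2vec.c style: P(keep) = min(1, (sqrt(r) + 1) / r),
    where r = count / (sample * total_count)."""
    ratio = counts / (sample * counts.sum())
    return np.minimum(1.0, (np.sqrt(ratio) + 1.0) / ratio)


def subsample(token_ids, keep, rng):
    """Keep each token independently with probability keep[token_id]."""
    mask = rng.random(len(token_ids)) < keep[token_ids]
    return token_ids[mask]


counts = np.array([100000, 10, 5])
keep = keep_probs_word2vec_c(counts)
tokens = np.repeat(np.arange(len(counts)), counts)
thinned = subsample(tokens, keep, np.random.default_rng(0))
print(discard_probs(counts), keep, len(thinned))
```

The two tables are not exact complements of each other (the C code keeps slightly more of the frequent words), but both leave rare words untouched and only thin out the very frequent ones.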
(As a totally-separate note that nonetheless may be relevant for your dataset/purposes, if you're using negative-sampling: there's another parameter controlling the relative sampling of negative examples, often fixed at 0.75 in early implementations, that one paper has suggested can usefully vary for non-natural-language token distributions & recommendation-related end-uses. This parameter is named ns_exponent in the Python gensim implementation, but simply a fixed power value internal to a sampling-table pre-calculation in the original word2vec.c code.)

Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

I was looking through a programming question, when the following question suddenly seemed related.
How do you convert a string to another string using as few swaps as possible, where a swap is defined as follows? The strings are guaranteed to be interconvertible (they have the same characters with the same counts; this is given), but the characters can be repeated. I saw web results on the same question, but without repeated characters.
Any two characters in the string can be swapped.
For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.
This is an expanded and corrected version of Subhasis's answer.
Formally, the problem is: given an n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that m-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.
The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j_1, ..., j_k are pairwise distinct in {1, ..., m}, the cycle (j_1 ... j_k) is a permutation that maps j_i to j_{i+1} for i in {1, ..., k - 1}, maps j_k to j_1, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j_1 ... j_k) can be written as the product (j_1 j_k) (j_1 j_{k-1}) ... (j_1 j_2) of k - 1 swaps. In general, every permutation can be written as a composition of m - c swaps, where c is the number of cycles in its cycle decomposition (counting fixed points as one-element cycles). A straightforward induction proof shows that this is optimal.
Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., m}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine the maximum number of parts that a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j_1 ... j_k) of q, there is a part whose directed cycle consists of the arcs labeled j_1, ..., j_k.
The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.
As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(mn), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to solve this problem exactly on instances with large alphabets, then I would likely try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.
Gregory Gutin, Mark Jones, Bin Sheng, and Magnus Wahlström. Parameterized Directed k-Chinese Postman Problem and k Arc-Disjoint Cycles Problem on Euler Digraphs. CoRR abs/1402.2137, 2014.
You can construct the "difference" strings S and S', i.e. a string which contains the characters at the differing positions of the two strings, e.g. for acbacb and abcabc it will be cbcb and bcbc. Let us say this contains n characters.
You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found out in O(n) time.
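For the all-unique case, the "n minus number of cycles" count can be sketched as below. This is a minimal illustration; the function name `min_swaps_distinct` is made up for this example. It runs over all positions rather than only the differing ones, which gives the same answer because every already-correct position is a cycle of length one.

```python
def min_swaps_distinct(src, dst):
    """Minimum number of swaps turning src into dst when no character repeats."""
    if len(set(src)) != len(src) or sorted(src) != sorted(dst):
        raise ValueError("strings must be permutations of distinct characters")
    target_index = {ch: i for i, ch in enumerate(dst)}
    # perm[i] is the position where the character currently at i must end up.
    perm = [target_index[ch] for ch in src]
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
    return len(perm) - cycles


print(min_swaps_distinct("abcd", "badc"))
print(min_swaps_distinct("abc", "bca"))
```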
However, in the case where there are any number of duplicate characters, this reduces to the problem of finding out the largest number of cycles in a directed graph, which, I think, is NP-hard, (e.g. check out: ).
In that paper a few greedy algorithms are pointed out, one of which is particularly simple:
At each step, find the minimum-length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights)
Delete it
Repeat until all vertices have been covered.
However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!
Please refer to David's answer for a fuller and correct explanation of the problem.
Do an A* search (see for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.
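A small A* search along these lines might look as follows. Note one deliberate substitution: instead of Levenshtein distance / 2, this sketch uses the number of mismatched positions divided by two (rounded up) as the heuristic, because it is trivial to compute and never overestimates (a single swap can fix at most two positions). The function names are made up for this example, and the search is exponential in the worst case, so it is only practical for short strings.

```python
import heapq
from itertools import combinations


def heuristic(current, target):
    """Lower bound on remaining swaps: each swap fixes at most two positions."""
    mismatches = sum(a != b for a, b in zip(current, target))
    return (mismatches + 1) // 2


def min_swaps_astar(src, dst):
    """Exact minimum number of swaps from src to dst; repeated characters allowed."""
    if sorted(src) != sorted(dst):
        raise ValueError("strings must contain the same characters")
    best = {src: 0}
    heap = [(heuristic(src, dst), 0, src)]
    while heap:
        _, cost, current = heapq.heappop(heap)
        if current == dst:
            return cost
        if cost > best[current]:
            continue  # stale queue entry
        for i, j in combinations(range(len(current)), 2):
            if current[i] == current[j]:
                continue  # swapping equal characters changes nothing
            chars = list(current)
            chars[i], chars[j] = chars[j], chars[i]
            neighbor = "".join(chars)
            if cost + 1 < best.get(neighbor, float("inf")):
                best[neighbor] = cost + 1
                priority = cost + 1 + heuristic(neighbor, dst)
                heapq.heappush(heap, (priority, cost + 1, neighbor))
    return None


print(min_swaps_astar("aabbccdd", "ddbbccaa"))
print(min_swaps_astar("abcc", "accb"))
```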

Given a phrase without spaces add spaces to make proper sentence

This is what I've in mind, but it's O(n^2):
For ex: Input is "Thisisawesome", we need to check if adding the current character makes the older found set any longer and meaningful. But in order to see till where we need to back up we'll have to traverse all the way to the beginning. For ex: "awe" and "some" make proper words but "awesome" makes the bigger word. Please suggest how can we improve the complexity. Here is the code:
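#include <string>
#include <unordered_set>
#include <vector>
using namespace std;
// DS[k] = length of the longest dictionary word ending at index k; `dict` is an assumed word set (the lookup was missing from the snippet).
vector<int> update(const string& in, const unordered_set<string>& dict) {
    int len = in.length();
    vector<int> DS(len, 0);
    for (int i = 0; i < len; i++)
        for (int j = i + 1; j <= len; j++) {
            string word = in.substr(i, j - i);
            if (dict.count(word) && (int)word.length() > DS[j - 1])
                DS[j - 1] = word.length();
        }
    return DS;
}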
There is a dynamic programming solution which at first looks like it is going to be O(n^2) but which turns out to be only O(n) for sufficiently large n and fixed size dictionary.
Work through the string from left to right. At the ith stage you need to work out whether there is a solution for the first i characters. To solve this, consider every possible way to break those i characters into two chunks. If the second chunk is a word and the first chunk can be broken up into words then there is a solution. The first requirement you can check with your dictionary. The second requirement you can check by looking to see if you found an answer for the first j characters, where j is the length of the first chunk.
This would be O(n^2) because for each of 1,2,3,...n lengths you consider every possible split. However, if you know what the longest word in your dictionary is you know that there is no point considering splits which make the second chunk longer than this. So for each of 1,2,3...n lengths you consider at most w possible splits, where w is the longest word in your dictionary, and the cost is O(n).
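The dynamic program described above can be sketched in Python as follows. It is a minimal illustration that assumes the dictionary is a plain set of strings; the name `word_break` is made up for this example. For each prefix length it tries only the splits whose last chunk is no longer than the longest dictionary word, and it remembers where that last chunk started so one segmentation can be rebuilt.

```python
def word_break(text, dictionary):
    """Return one segmentation of text into dictionary words, or None."""
    max_len = max(map(len, dictionary), default=0)
    n = len(text)
    # back[i] is the start of the last word in a valid split of text[:i],
    # or None if the first i characters cannot be split into words.
    back = [None] * (n + 1)
    back[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if back[j] is not None and text[j:i] in dictionary:
                back[i] = j
                break
    if back[n] is None:
        return None
    words = []
    i = n
    while i > 0:
        j = back[i]
        words.append(text[j:i])
        i = j
    return words[::-1]


dictionary = {"this", "is", "a", "awe", "some", "awesome"}
print(word_break("thisisawesome", dictionary))
```

Because the inner loop starts from the smallest j, the longest possible last chunk is preferred, which is why "awesome" wins over "awe" plus "some" here.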
I have coded my solution today, and will put it on a web site tomorrow. Anyway, the method is as follows:
Arrange the dictionary in a trie.
The trie can help to do multiple matches quickly, because all dictionary words starting with the same letters can be matched at the same time.
(e.g. "chairman" matches "chair" and "chairman" in a trie.)
Use Dijkstra's algorithm to find the best match.
(e.g. for "chairman", if we count "c" as position 0, then we have the relationships 0->5, 0->8, 1->5, 2->5, 5->8. These relationships form a network perfect for Dijkstra's algorithm.)
(Note: Where are the weights of the edges? See the next point.)
Assign weights to dictionary words.
Without weighting, bad matches win over good matches. (e.g. "iamahero" becomes "i ama hero" instead of "i am a hero".)
The SCOWL dictionary serves the purpose well, because it has dictionaries of different sizes. These sizes (10, 20, etc.) are a good choice for weights.
After some tries I found a need to reduce the weighting of words ending with "s", so "eyesandme" becomes "eyes and me" instead of "eye sand me".
I have been able to split a paragraph in milliseconds. The algorithm has linear complexity in the length of the string to be split, so it scales well as long as there is enough memory.
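The trie plus shortest-path idea can be sketched as below. This is not the author's program, only an illustration under assumptions: the trie is a nested dict, each word carries a made-up cost (lower is better) standing in for the SCOWL size levels, and all names (`build_trie`, `matches_from`, `split_words`) are hypothetical.

```python
import heapq

END = None  # key marking "a word ends here"; never collides with a character


def build_trie(word_costs):
    root = {}
    for word, cost in word_costs.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = cost
    return root


def matches_from(text, start, trie):
    """Yield (end, cost) for every dictionary word that begins at `start`."""
    node = trie
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:
            return
        if END in node:
            yield end + 1, node[END]


def split_words(text, word_costs):
    """Cheapest split of text into dictionary words, found with Dijkstra."""
    trie = build_trie(word_costs)
    dist, prev = {0: 0}, {}
    heap = [(0, 0)]
    while heap:
        d, pos = heapq.heappop(heap)
        if pos == len(text):
            break
        if d > dist[pos]:
            continue  # stale queue entry
        for end, cost in matches_from(text, pos, trie):
            if d + cost < dist.get(end, float("inf")):
                dist[end] = d + cost
                prev[end] = pos
                heapq.heappush(heap, (d + cost, end))
    if len(text) not in dist:
        return None
    words, pos = [], len(text)
    while pos > 0:
        words.append(text[prev[pos]:pos])
        pos = prev[pos]
    return words[::-1]


costs = {"i": 1, "am": 1, "a": 1, "ama": 3, "hero": 1}
print(split_words("iamahero", costs))
```

With equal costs for every word the search would simply prefer fewer words, which is how "i ama hero" can beat "i am a hero"; giving the rare word "ama" a higher cost tips the balance back.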
Here's the dump (sorry for bragging). (The passage selected is "Novel" in Wikipedia.)
D:\GoogleDrive\programs\WordBreaker>"word breaker"<novelnospace.txt>output.txt
D:\GoogleDrive\programs\WordBreaker>type output.txt
Number of words after reading words-10.txt : 4101
Number of words after reading words-20.txt : 11329
Number of words after reading words-35.txt : 43292
Number of words after reading words-40.txt : 49406
Number of words after reading words-50.txt : 87966
Time elapsed in reading dictionary: 0.956782s
Enter the string to be broken into words:
a novel is along narrative normally in prose which describes fictional character
s and events usually in the form of a sequential story while i an watt in the ri
se of the novel 1957 suggests that the novel came into being in the early 18 th
century the genre has also been described as possessing a continuous and compreh
ensive history of about two thousand years with historical roots in classical gr
eece and rome medieval early modern romance and in the tradition of the novel la
the latter an italian word used to describe short stories supplied the present g
eneric english term in the 18 th century miguel de cervantes author of don quixo
te is frequently cited as the first significant europe an novelist of the modern
era the first part of don quixote was published in 1605 while a more precise de
finition of the genre is difficult the main elements that critics discuss are ho
w the narrative and especially the plot is constructed the themes settings and c
haracterization how language is used and the way that plot character and setting
relate to reality the romance is a related long prose narrative w alter scott d
efined it as a fictitious narrative in prose or verse the interest of which turn
s upon marvellous and uncommon incidents whereas in the novel the events are acc
ommodated to the ordinary train of human events and the modern state of society
however many romances including the historical romances of scott emily brontes w
u the ring heights and her man melvilles mo by dick are also frequently called n
ovels and scott describes romance as a kind red term romance as defined here sho
uld not be confused with the genre fiction love romance or romance novel other e
urope an languages do not distinguish between romance and novel a novel isle rom
and err o ma nil roman z o
Time elapsed in splitting: 0.00495095s
D:\GoogleDrive\programs\WordBreaker>type novelnospace.txt

Ways to calculate similarity

I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:
age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.
Can anyone tell me how to go about this problem or point me to some resources?
Another way is to compute (in R) all the pairwise dissimilarities (distances) between observations in the data set, using daisy() from the cluster package. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.
For example
library(cluster)  # provides daisy()
x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)
daisy(x, metric = "euclidean")
you'll get :
Dissimilarities :
         1        2        3        4
2 2.000000
3 3.316625 2.236068
4 2.236068 1.732051 1.414214
5 4.242641 3.741657 1.732051 2.645751
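For readers who want to see what Gower's coefficient does with mixed data, here is a minimal Python sketch (Python rather than R purely for illustration). It covers only two of the variable types Gower handles: numeric variables contribute the absolute difference divided by the variable's range, and nominal variables contribute 0 for a match and 1 for a mismatch; the result is the plain average. The function name, the dict-based records, and the equal weighting are all assumptions of this example, not daisy()'s interface.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-variable dissimilarity between two records (dicts).

    numeric_ranges maps each numeric variable to (max - min) over the data set;
    every other variable is treated as nominal.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


user1 = {"age": 10, "skin": "oily", "hair": "medium", "lifestyle": "active outdoor lover"}
user2 = {"age": 12, "skin": "dry", "hair": "short", "lifestyle": "TV junky"}
ranges = {"age": 29 - 10}  # age range taken from the example data above
print(gower_distance(user1, user2, ranges))
```

Unlike the Euclidean call above, this keeps age numeric and never imposes an order on the categories.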
If you are interested in a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters), check this
Give each attribute an appropriate weight, and add the differences between values.
enum SkinType
    Dry, Medium, Oily
enum HairLength
    Bald, Short, Medium, Long
UserDifference(user1, user2)
    total := 0
    total += abs(user1.Age - user2.Age) * 0.1
    total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
    total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
    # etc...
    return total
If you really need similarity instead of difference, use 1 / (1 + UserDifference(a, b)), which stays finite when the two users are identical.
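A runnable version of the pseudocode above might look like this in Python. The weights are the ones from the pseudocode; the `User` dataclass and function names are made up for this sketch, and the similarity uses 1 / (1 + difference) so that identical users do not cause a division by zero.

```python
from dataclasses import dataclass
from enum import IntEnum


class SkinType(IntEnum):
    DRY = 0
    MEDIUM = 1
    OILY = 2


class HairLength(IntEnum):
    BALD = 0
    SHORT = 1
    MEDIUM = 2
    LONG = 3


@dataclass
class User:
    age: int
    skin: SkinType
    hair: HairLength


def user_difference(a, b):
    """Weighted sum of per-attribute differences; weights are arbitrary."""
    total = 0.0
    total += abs(a.age - b.age) * 0.1
    total += abs(a.skin - b.skin) * 0.5
    total += abs(a.hair - b.hair) * 0.8
    return total


def user_similarity(a, b):
    """Map a difference in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + user_difference(a, b))


alice = User(30, SkinType.OILY, HairLength.LONG)
bob = User(20, SkinType.DRY, HairLength.SHORT)
print(user_difference(alice, bob), user_similarity(alice, bob))
```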
You should probably take a look at:
Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)
These topics will let your program recognize similarities and clusters in your user collection and try to adapt to them...
You can then discover different hidden groups of related users... (e.g. users with green hair usually do not like watching TV..)
As a piece of advice, try to use ready-made tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects
Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:
Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: An age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale, so scale the difference in age by 0.1, that in hair by 1 and that in skin by 0.5
Use an appropriate distance metric to combine the differences between two people on the various scales into one overall difference. The smaller this number, the more similar they are. I'd suggest a simple Euclidean (root-sum-of-squares) distance as a first attempt at your distance function.
Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):
double Difference(Person p1, Person p2) {
    double agescale=0.1;
    double skinscale=0.5;
    double hairscale=1;
    double lifestylescale=1;
    double agediff = (p1.age-p2.age)*agescale;
    double skindiff = (p1.skin-p2.skin)*skinscale;
    double hairdiff = (p1.hair-p2.hair)*hairscale;
    double lifestylediff = (p1.lifestyle-p2.lifestyle)*lifestylescale;
    // ^ is XOR in C-like languages, so square by multiplying
    double diff = sqrt(agediff*agediff + skindiff*skindiff + hairdiff*hairdiff + lifestylediff*lifestylediff);
    return diff;
}
Note that diff in this example is not on a nice scale like (0..1). Its value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific; it is just designed to quickly give you a working difference metric.
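The three steps can also be sketched end to end in Python. The encodings and scale factors are the ones suggested in the steps above; the dict-based person records and the function names are assumptions of this sketch.

```python
import math

# Step 1: numeric codes for the categorical attributes.
SKIN = {"oily": -1, "dry": 1}
HAIR = {"short": 0, "medium": 1, "long": 2}
LIFESTYLE = {"active outdoor lover": 1, "TV junky": -1}

# Step 2: scale factors for age, skin, hair, lifestyle.
SCALES = (0.1, 0.5, 1.0, 1.0)


def encode(person):
    """Turn a person record into a scaled numeric vector."""
    raw = (person["age"], SKIN[person["skin"]],
           HAIR[person["hair"]], LIFESTYLE[person["lifestyle"]])
    return [value * scale for value, scale in zip(raw, SCALES)]


def difference(p1, p2):
    """Step 3: Euclidean distance between the scaled vectors."""
    return math.dist(encode(p1), encode(p2))


p1 = {"age": 30, "skin": "oily", "hair": "long", "lifestyle": "active outdoor lover"}
p2 = {"age": 20, "skin": "dry", "hair": "medium", "lifestyle": "active outdoor lover"}
print(difference(p1, p2))
```

Here the age, skin, and hair gaps each contribute 1 after scaling, so the result is the square root of 3.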
Look at algorithms for computing string difference. It's very similar to what you need. Store your attributes as a bit string and compute the distance between the strings.
You should read these two topics.
The most popular clustering algorithm, k-means
And similarity matrices, which are essential in clustering
