remove fragments in a sentence [puzzle] - string

Question:
Write a program to remove fragment that occur in "all" strings,where a fragment
is 3 or more consecutive word.
Example:
Input::
s1 = "It is raining and I want to drive home.";
s2 = "It is raining and I want to go skiing.";
s3 = "It is hot and I want to go swimming.";
Output::
s1 = "It is raining drive home.";
s2 = "It is raining go skiing.";
s3 = "It is hot go swimming.";
Removed fragment = "and i want to"
The program will be tested again large files.
Efficiency will be taken into consideration.
Assumptions: Ignore capitalization ,punctuation. but preserve in output.
Note: Take care of cases like
a a a a a b c b c b c b c where removing would create more fragments.
My Solution: (which i think is not the most efficient)
Hash three word phrases into an int and store them in an array, for all strings.
reduces to array of numbers like
1 2 3 4 5
3 5 7 9 8
9 3 1 7 9
Problem reduces to intersection of arrays.
sort the arrays. (k * nlogn)
keep k pointers. if all equal match found. else increment the pointer pointing to least value.
To solve for the Note above. I was thinking of doing a lazy delete, i.e mark phrases for deletion and delete at the end.
Are there cases where my solution might not work? Can we optimize my solution/ find the best solution ?

First observation: replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.
Now you have the problem reduced to remove the longest letter sequence that appears in all words of a given list.
So you have to compute the longest common substring for a set of "words". You find it using a generalized suffix tree as this is the most efficient algorithm. This should do the trick and I believe has the best possible complexity.

The first step is as already suggested by izomorphius:
Replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.
For the second you don't need to know the longest common substring - you just want to erase it from all the strings.
Note that this is equivalent to erasing all common substrings of length exactly 3, because if you have a longer commmon substring, then its substrings with length 3 are also common.
To do that you can use a hash table (storing key value pairs).
Just iterate over the first string and put all it's 3-substrings into the hash table as keys with values equal to 1.
Then iterate over the second string and for each 3-substring x if x is in the hash table and its value is 1, then set the value to 2.
Then iterate over the third string and for each 3-substring x, if x is in the hash table and its value is 2, then set the value to 3.
...and so on.
At the end the keys that have the value of k are the common 3-substrings.
Now just iterate once more over all the strings and remove those 3-substrings that are common.

import java.io.*;
import java.util.*;
public class remove_unique{
public static void main(String args[]){
String s1 = "Everyday I do exercise if";
String s2 = "Sometimes I do exercise if i feel stressed";
String s3 = "Mostly I do exercise on morning";
String[] words1=s1.split("\\s");
String[] words2=s2.split("\\s");
String[] words3=s3.split("\\s");
StringBuilder sb = new StringBuilder();
for(int i=0;i<words1.length;i++){
for(int j=0;j<words2.length;j++){
for(int k=0;k<words3.length;k++){
if(words1[i].equals(words2[j]) && words2[j].equals(words3[k])
&&words3[k].equals(words1[i])){
//Concatenating the returned Strings
sb.append(words1[i]+" ");
}
}
}
}
System.out.println(s1.replaceAll(sb.toString(), ""));
System.out.println(s2.replaceAll(sb.toString(), ""));
System.out.println(s3.replaceAll(sb.toString(), ""));
}
}
//LAKSHMI ARJUNA

My solution would be something like,
F = all fragments with length > 3 shared by the first 2 lines, avoid overlaps
for each line from the 3rd line and up
remove fragments in F which do not exist in line, or cause overlaps
return sentences with fragments in F removed
I assume finding/matching fragments in sentences can be done with some known algo. but in terms of the time complexity for n lines this is O(n)

Related

find number of repeating substrings in a string

I am looking for an algorithm that will find the number of repeating substrings in a single string.
For this, I was looking for some dynamic programming algorithms but didn't find any that would help me. I just want some tutorial on how to do this.
Let's say I have a string ABCDABCDABCD. The expected output for this would be 3, because there is ABCD 3 times.
For input AAAA, output would be 4, since A is repeated 4 times.
For input ASDF, output would be 1, since every individual character is repeated 1 time only.
I hope that someone can point me in the right direction. Thank you.
I am taking the following assumptions:
The repeating substrings must be consecutive. That is, in case of ABCDABC, ABC would not count as a repeating substring, but it would in case of ABCABC.
The repeating substrings must be non-overalpping. That is, in case of ABCABC, ABC would not count as a repeating substring.
In case of multiple possible answers, we want the one with the maximum value. That is, in the case of AAAA, the answer should be 4 (a is the substring) rather than 2 (aa is the substring).
Under these assumptions, the algorithm is as follows:
Let the input string be denoted as inputString.
Calculate the KMP failure function array for the input string. Let this array be denoted as failure[]. This operation if of linear time complexity with respect to the length of the string. So, by definition, failure[i] denotes the length of the longest proper-prefix of the substring inputString[0....i] that is also a proper-suffix of the same substring.
Let len = inputString.length - failure.lastIndexValue. At this point, we know that if there is any repeating string at all, then it has to be of this length len. But we'll need to check for that; First, just check if len perfectly divides inputString.length (that is, inputString.length % len == 0). If yes, then check if every consecutive (non-overlapping) substring of len characters is the same or not; this operation is again of linear time complexity with respect to the length of the input string.
If it turns out that every consecutive non-overlapping substring is the same, then the answer would be = inputString.length/ len. Otherwise, the answer is simply inputString.length, as there is no such repeating substring present.
The overall time complexity would be O(n), where n is the number of characters in the input string.
A sample code for calculating the KMP failure array is given here.
For example,
Let the input string be abcaabcaabca.
Its KMP failure array would be - [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8].
So, our len = (12 - 8) = 4.
And every consecutive non-overlapping substring of length 4 is the same (abca).
Therefore the answer is 12/4 = 3. That is, abca is repeated 3 times repeatedly.
The solution for this with C# is:
class Program
{
public static string CountOfRepeatedSubstring(string str)
{
if (str.Length < 2)
{
return "-1";
}
StringBuilder substr = new StringBuilder();
// Length of the substring cannot be greater than half of the actual string
for (int i = 0; i < str.Length / 2; i++)
{
// We will iterate through half of the actual string and
// create a new string by appending the current character to the previous character
substr.Append(str[i]);
String clearedOfNewSubstrings = str.Replace(substr.ToString(), "");
// We will remove the newly created substring from the actual string and
// check if the length of the actual string, cleared of the newly created substring, is 0.
// If 0 it tells us that it is only made of its substring
if (clearedOfNewSubstrings.Length == 0)
{
// Next we will return the count of the newly created substring in the actual string.
var countOccurences = Regex.Matches(str, substr.ToString()).Count;
return countOccurences.ToString();
}
}
return "-1";
}
static void Main(string[] args)
{
// Input: {"abcdaabcdaabcda"}
// Output: 3
// Input: { "abcdaabcdaabcda" }
// Output: -1
// Input: {"barrybarrybarry"}
// Output: 3
var s = "asdf"; // Output will be -1
Console.WriteLine(CountOfRepeatedSubstring(s));
}
}
How do you want to specify the "repeating string"? Is it simply the first group of characters up until either a) the first character is found again, b) the pattern begins to repeat, or c) some other criteria?
So, if your string is "ABBAABBA", is that a 2 because "ABBA" repeats twice or is it 1 because you have "ABB" followed by "AAB"? What about "ABCDABCE" -- does "ABC" count (despite the "D" in between repetitions?) In "ABCDABCABCDABC", is the repeating string "ABCD" (1) or "ABCDABC" (2)?
What about "AAABBAAABB" -- is that 3 ("AAA") or 2 ("AAABB")?
If the end of the repeating string is another instance of the first letter, it's pretty simple:
Work your way through the string character by character, putting each character into another variable as you go, until the next character matches the first one. Then, given the length of the substring in your second variable, check the next bit of your string to see if it matches. Continue until it doesn't match or you hit the end of the string.
If you just want to find any length pattern that repeats regardless of whether the first character is repeated within the pattern, it gets more complicated (but, fortunately, it's the sort of thing computers are good at).
You'll need to go character by character building a pattern in another variable as above, but you'll also have to watch for the first character to reappear and start building a second substring as you go, to see if it matches the first. This should probably go in an array as you might encounter a third (or more) instance of the first character which would trigger the need to track yet another possible match.
It's not difficult but there is a lot to keep track of and it's a rather annoying problem. Is there a particular reason you're doing this?

Efficient algorithm for phrase anagrams

What is an efficient way to produce phrase anagrams given a string?
The problem I am trying to solve
Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.
My solution
I have a trie that contains all words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine if that specific permuted string can be broken in to words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine if a string is a valid word.
What sucks
My problem is that because I calculate all permutations, my solution runs very slow for phrases that are longer than 10 characters, which is a big let down. I want to know if there is a way to do this in a different way.
Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.
Your problem can be decomposed to 2 sub-problems:
Find combination of words that use up all characters of the input string
Find all permutations of the words found in the first sub-problem
Subproblem #2 is a basic algorithm and you can find existing standard implementation in most programming language. Let's focus on subproblem #1
First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = number of occurrence of character c.
Then we use backtracking algorithm to find words that fit in the charpool as in this pseudo-code:
result = empty;
function findAnagram(pool)
if (pool empty) then print result;
for (word in dictionary) {
if (word fit in charpool) {
result = result + word;
update pool to exclude characters in word;
findAnagram(pool);
// as with any backtracking algorithm, we have to restore global states
restore pool;
restore result;
}
}
}
Note: If we pass the charpool by value then we don't have to restore it. But as it is quite big, I prefer passing it by reference.
Now we remove redundant results and apply some optimizations:
Assuming A comes before B in the dictionary. If we choose the first word is B, then we don't have to consider word A in following steps, because those results (if we take A) would already be in the case where A is chosen as the first word
If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter words that cannot fit in the pool. A bitmask mask which character is in a word, no matter how many time it occurs.
Update the pseudo-code to reflect those optimizations:
function findAnagram(charpool, minDictionaryIndex)
pool_bitmask <- bitmask(charpool);
if (pool empty) then print result;
for (word in dictionary AND word's index >= minDictionaryIndex) {
// bitmask of every words in the dictionary should be pre-calculated
word_bitmask <- bitmask(word)
if (word_bitmask contains bit(s) that is not in pool_bitmask)
then skip this for iteration
if (word fit in charpool) {
result = result + word;
update charpool to exclude characters in word;
findAnagram(charpool, word's index);
// as with any backtracking algorithm, we have to restore global states
restore pool;
restore result;
}
}
}
My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .
Instead of a two stage solution where you generate permutations and then try and break them into words, you could speed it up by checking for valid words as you recursively generate the permutations. If at any point your current partially-complete permutation does not correspond to any valid words, stop there and do not recurse any further. This means you don't waste time generating useless permutations. For example, if you generate "tt", there is no need to permute "peanubuter" and append all the permutations to "tt" because there are no English words beginning with tt.
Suppose you are doing basic recursive permutation generation, keep track of the current partial word you have generated. If at any point it is a valid word, you can output a space and start a new word, and recursively permute the remaining character. You can also try adding each of the remaining characters to the current partial word, and only recurse if doing so results in a valid partial word (i.e. a word exists starting with those characters).
Something like this (pseudo-code):
void generateAnagrams(String partialAnagram, String currentWord, String remainingChars)
{
// at each point, you can either output a space, or each of the remaining chars:
// if the current word is a complete valid word, you can output a space
if(isValidWord(currentWord))
{
// if there are no more remaining chars, output the anagram:
if(remainingChars.length == 0)
{
outputAnagram(partialAnagram);
}
else
{
// output a space and start a new word
generateAnagrams(partialAnagram + " ", "", remainingChars);
}
}
// for each of the chars in remainingChars, check if it can be
// added to currentWord, to produce a valid partial word (i.e.
// there is at least 1 word starting with these characters)
for(i = 0 to remainingChars.length - 1)
{
char c = remainingChars[i];
if(isValidPartialWord(currentWord + c)
{
generateAnagrams(partialAnagram + c, currentWord + c,
remainingChars.remove(i));
}
}
}
You could call it like this
generateAnagrams("", "", "peanutbutter");
You could optimize this algorithm further by passing the node in the trie corresponding to the current partially completed word, as well as passing currentWord as a string. This would make your isValidPartialWord check even faster.
You can enforce uniqueness by changing your isValidWord check to only return true if the word is in ascending (greater or equal) alphabetic order compared to the previous word output. You might also need another check for dupes at the end, to catch cases where two of the same word can be output.

Fast way to find strings in set of strings containing substring

Task
I have a set S of n = 10,000,000 strings s and need to find the set Sp containing the strings s of S that contain the substring p.
Simple solution
As I'm using C# this is quite a simple task using LINQ:
string[] S = new string[] { "Hello", "world" };
string p = "ll";
IEnumerable<string> S_p = S.Where(s => s.Contains(p));
Problem
If S contains many strings (like the mentioned 10,000,000 strings) this gets horribly slow.
Idea
Build some kind of index to retrieve Sp faster.
Question
What is the best way to index S for this task and do you have any implementation in C#?
Here is one way to do it:
1. Create a string T = S[0] + sep_0 + S[1] + sep_1 + ... + S[n - 1] + sep_n-1(where sep_i is a unique character that never appears in S[j] for any j(it can actually be an integer number if the set of characters is not big enough)).
2. Build a suffix tree for T(it can be done in linear time).
3. For each query string Q traverse the suffix tree(it takes O(length(Q)) time). Then all possible answers will be located in the leaves of some subtree. So you can just traverse all these leaves. If Q is rather long, then the number of leaves in this subtree is likely to be much smaller than n.
4. If Q is really short, then the number of leaves in a subtree can be pretty large. That's why you can use another strategy for short query strings: precompute all short substrings of S[0] ... S[n - 1] and for each of them store a set of indices where it has occurred. Then you can just print these indices for a given Q. It is difficult to say what 'short' exactly means here, but it can be found out experimentally.

How to find all cyclic shifted strings in a given input?

This is a coding exercise. Suppose I have to decide if one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length != s2.length)
return false
for(int i = 0; i < s1.length(); i++)
if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
return true
return false
Now what if I have an array of strings and want to find all strings that are cyclic shift of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
As a start, you can know if a string s1 is a rotation of a string s2 with a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
String s2twice = s2+s2;
return s2twice.contains(s1);
}
Namely, if s1 is "rotation" and s2 is "otationr", the concat gives you "otationrotationr", which contains s1 indeed.
Now, even if we assume this is linear, or close to it (which is not impossible using Rabin-Karp, for instance), you are still left with O(n^2) pair comparisons, which may be too much.
What you could do is build an hashtable where the sorted word is the key, and the posting list contains all the words from your list that, if sorted, give the key (ie. key("bca") and key("cab") both should return "abc"):
private Map<String, List<String>> index;
/* ... */
public void buildIndex(String[] words){
for(String word : words){
String sortedWord = sortWord(word);
if(!index.containsKey(sortedWord)){
index.put(sortedWord, new ArrayList<String>());
}
index.get(sortedWord).add(word);
}
}
CAVEAT: The hashtable will contain, for each key, all the words that have exactly the same letters occurring the same amount of times (not just the rotations, ie. "abba" and "baba" will have the same key but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations for "bca" you just need to sort("bca"), look it up in the hashtable, and check (using the isRotation method above, if you want) if the words in the posting list are the result of a rotation or not.
If strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
Concerning the way to find the pairs in the table, there could be many better way, but what I came up as a first thought is to sort the table and apply the check per adjacent pair.
This is much better and simpler that checking every string with every other string in the table
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, and transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = ["abc", "xyz", "yzx", "cab", "xxx"];
Map<String,List<String>> uniques = new Map<String,List<String>>();
for(String value : inputs) {
String normalized = normalize(value);
if(!uniques.contains(normalized)) {
unqiues.put(normalized, new List<String>());
}
uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87 might be best done by picking the rotation of the string that results in the lowest lexographic ordering.
It's worth noting, however, that the "best" algorithm probably relies heavily on the inputs... the number of strings, the length of those string, how many duplicates there are, etc.
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data. i.e., for each string
calculate the normalized form
check if the HashMap contains the normalized form as a key - if not insert the empty Set at this key
add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).

What is an easy way to tell if a list of words are anagrams of each other?

How would you list words that are anagrams of each other?
I was asked this question when I applied for my current job.
orchestra can be rearranged into carthorse with all original letters used exactly once therefore the words are anagrams of each other.
Put all the letters in alphabetical order in the string (sorting algorithm) and then compare the resulting string.
Good thing we all live in the C# reality of in-place sorting of short words on quad core machines with oozles of memory. :-)
However, if you happen to be memory constrained and can't touch the original data and you know that those words contain characters from the lower half of the ASCII table, you could go for a different algorithm that counts the occurrence of each letter in each word instead of sorting.
You could also opt for that algorithm if you want to do it in O(N) and don't care about the memory usage (a counter for each Unicode char can be quite expensive).
Sort each element (removing whitespace) and compare against the previous. If they are all the same, they're all anagrams.
Interestingly enough, Eric Lippert's Fabulous Adventures In Coding Blog dealt with a variation on this very problem on February 4, 2009 in this post.
The following algorithm should work:
Sort the letters in each word.
Sort the sorted lists of letters in each list.
Compare each element in each list for equality.
Well Sort the words in the list.
if abc, bca, cab, cba are the inputs, then the sorted list will be abc, abc, abc, abc.
Now all of their Hash codes are equal. Compare the HashCodes.
Sort the letters and compare (letter by letter, string compare, ...) is the first things that comes to mind.
compare length (if not equal, not a chance)
make a bit vector of the length of the strings
for each char in the first string find occurrences of it in the second
set the bit for the first unset occurrence
if you can find one stop with fail
public static void main(String[] args) {
String s= "abc";
String s1="cba";
char[] aArr = s.toLowerCase().toCharArray();
char[] bArr = s1.toLowerCase().toCharArray();
// An array to hold the number of occurrences of each character
int[] counts = new int[26];
for (int i = 0; i < aArr.length; i++){
counts[aArr[i]-97]++; // Increment the count of the character at respective position
counts[bArr[i]-97]--; // Decrement the count of the character at respective position
}
// If the strings are anagrams, then counts array will be full of zeros not otherwise
for (int i = 0; i<26; i++){
if (counts[i] != 0)
return false;
}
Tried hashcode logic for anagram gives me false output
public static Boolean anagramLogic(String s,String s2){
char[] ch1 = s.toLowerCase().toCharArray();
Arrays.sort(ch1);
char[] ch2= s2.toLowerCase().toCharArray();
Arrays.sort(ch2);
return ch1.toString().hashCode()==ch2.toString().hashCode(); //wrong
}
to rectify this code, below is the only option I see,appreciate any recommendations
char[] ch1 = s.toLowerCase().toCharArray();
Arrays.sort(ch1);
char[] ch2= s2.toLowerCase().toCharArray();
Arrays.sort(ch2);
return Arrays.equals(ch1,ch2);
}

Resources