String permutations in lexicographic order without inversions/reversions - string

The problem:
I would like to generate a list of permutations of a string in lexicographic order, but excluding string reversals. For instance, if I have the string abc, I would like to generate the following list:
abc
acb
bac
instead of the typical
abc
acb
bac
bca
cab
cba
An alternative example would look something like this:
100
010
instead of
100
010
001
Currently, I can generate the permutations using Perl, but I am not sure how best to remove the reverse duplicates.
I had thought of applying something like the following:
create a map with the following:
1) 100
2) 010
3) 001
then perform the reversion/inversion on each element in the map and create a new map with:
1') 001
2') 010
3') 100
then compare the two: if a primed value matches its original value, leave it in place; if it differs, keep it when its index is greater than the median index, and remove it otherwise.
Trouble is, I am not sure if this is an efficient approach or not.
Any advice would be great.

The two examples represent two distinct cases: permutations where all elements are different (abcd), and variations of two symbols where one appears exactly once (1000). More general cases are addressed as well.
Non-repeating elements (permutations)
Here we can make use of Algorithm::Permute, and of a key observation:
Every permutation whose first element is greater than its last needs to be excluded. This comes from this post, brought up in the answer by ysth.
The rule holds because reversal pairs up the permutations: the reversal of any permutation is itself a permutation of the same elements, and since all elements are distinct, exactly one permutation in each reversal pair has its first element smaller than its last. Keeping only those permutations therefore retains exactly one representative of each pair.
use warnings;
use strict;
use feature 'say';
use Algorithm::Permute;
my $size = shift || 4;
my @arr = ('a'..'z')[0..$size-1]; # 1..$size for numbers
my @res;
Algorithm::Permute::permute {
    push @res, (join '', @arr) unless $arr[0] gt $arr[-1];
} @arr;
say for @res;
Problems with repeated elements (abbcd) can be treated the same way as above, except that we also need to prune duplicates, since permuting the two b's yields abbcd twice (the same string):
use List::MoreUtils 'uniq';
# build @res the same way as above ...
@res = uniq @res;
Doing this during construction would not reduce complexity nor speed things up.
permute is stated in the module's documentation to be the fastest method, by far. It is about an order of magnitude faster than the other modules I tested (below), taking about 1 second for 10 elements on my system. But note that this problem's complexity is factorial in the size; it blows up really fast.
Two symbols, where one appears exactly once (variations)
This is different and the above module is not meant for it, nor would the exclusion criterion work. There are other modules, see at the end. However, the problem here is very simple.
Start from (1,0,0,...) and 'walk' the 1 along the list, up to the "midpoint" – which is the half for even-sized lists (4 strings for an 8-long list), or one past the half for odd sizes (5 for a 9-long list). All strings obtained this way, by moving the 1 one position at a time up to the midpoint, form the set; the second "half" are their inversions.
use warnings;
use strict;
my $size = shift || 4;
my @n = (1, map { 0 } 1..$size-1);
my @res = (join '', @n); # first element of the result
my $end_idx = ( @n % 2 == 0 ) ? @n/2 - 1 : int(@n/2);
foreach my $i (0..$end_idx-1) # stop one short, as we write one past $i
{
    @n[$i, $i+1] = (0, 1); # move the 1 up by one position
    push @res, join '', @n;
}
print "$_\n" for @res;
We need to stop before the last index since it has been filled in the previous iteration.
This can be modified for the case where both symbols (0,1) may appear repeatedly, but it is far simpler to use a module and then exclude inverses. The Algorithm::Combinatorics module has routines for all needs here. For all variations of 0 and 1 of length $size, where both may repeat:
use Algorithm::Combinatorics qw(variations_with_repetition);
my @rep_vars = variations_with_repetition([0, 1], $size);
Inverse elements can then be excluded by a brute-force search, with O(N²) complexity at worst.
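As an illustration, the exclusion can also be done in a single pass by hashing a canonical form of each string. A minimal Python sketch (the function name is mine), assuming the variations are available as strings:

from itertools import product

def exclude_mirrors(variations):
    # keep one representative per mirror pair: the lexicographically
    # smaller of a string and its reversal serves as the canonical form
    seen = set()
    kept = []
    for v in variations:
        canon = min(v, v[::-1])
        if canon not in seen:
            seen.add(canon)
            kept.append(v)
    return kept

# all 0/1 variations of length 3, with mirror duplicates removed
print(exclude_mirrors([''.join(p) for p in product('01', repeat=3)]))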
Also note Math::Combinatorics.

The answer in the suggested duplicate Generating permutations which are not mirrors of each other doesn't deal with repeated elements (because that wasn't part of that question) so naively following it would include e.g. both 0100 and 0010. So this isn't an exact duplicate. But the idea applies.
Generate all the permutations but filter only for those with $_ le reverse $_. I think this is essentially what you suggest in the question, but there's no need to compute a map when a simple expression applied to each permutation will tell you whether to include it or not.
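A minimal sketch of this filter in Python (the function name is mine), with deduplication added so that repeated elements are handled as well:

from itertools import permutations

def non_mirrored(s):
    # keep a permutation only if it is <= its own reversal; this retains
    # exactly one member of each mirror pair (and each palindrome once)
    seen = set()
    out = []
    for p in permutations(s):
        w = ''.join(p)
        if w <= w[::-1] and w not in seen:
            seen.add(w)
            out.append(w)
    return out

print(non_mirrored('abc'))   # ['abc', 'acb', 'bac']
print(non_mirrored('0100'))  # one string from each mirror pair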

Related

Minimum characters that need to be deleted

Original Problem:
A word is K-good if for every two letters in the word, if the first appears x times and the second appears y times, then |x - y| ≤ K.
Given some word w, how many letters does he have to remove to make it K-good?
Problem Link.
I have solved the above problem and I am not asking for a solution to it.
I just misread the statement the first time and wondered how this misread version could be solved in linear time, which gives rise to a new problem:
Modification Problem
A word is K-good if for every two consecutive letters in the word, if the first appears x times and the second appears y times, then |x - y| ≤ K.
Given some word w, how many letters does he have to remove to make it K-good?
Is this problem solvable in linear time? I thought about it but could not find a valid solution.
Solution
My Approach: I could not approach my crush, but here is my approach to this problem: try everything (from the movie Zootopia),
i.e.
for i in range(0, 1 << n):   // n = length of the string
    for j in range(0, n):
        if (i & (1 << j)) != 0: delete the character at index j
    now check if the resulting string is K-good
For N in the range of 10^5, the time complexity is roughly O(2^N · N): time does not exist in that dimension.
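For what it's worth, here is that brute force in runnable form, a Python sketch (the function names are mine) that is feasible only for very short strings:

from itertools import combinations
from collections import Counter

def is_k_good(s, k):
    # modified definition: for every two CONSECUTIVE letters, the
    # difference of their total frequencies is at most k
    f = Counter(s)
    return all(abs(f[a] - f[b]) <= k for a, b in zip(s, s[1:]))

def min_deletions_brute(s, k):
    # try smaller deletion counts first; exponential in len(s)
    n = len(s)
    for removed in range(n + 1):
        for idx in combinations(range(n), removed):
            drop = set(idx)
            kept = ''.join(c for i, c in enumerate(s) if i not in drop)
            if is_k_good(kept, k):
                return removed
    return n

print(min_deletions_brute("AABCBBCB", 1))  # 1, e.g. delete the 'B' at index 5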
Is there any linear solution to this problem, simple and sweet like the people of Stack Overflow?
For Ex:
String S = AABCBBCB and K=1
If we delete the 'B' at index 5, then S = AABCBCB, which is a good string:
F[A]-F[A]=0
F[B]-F[A]=1
F[C]-F[B]=1
and so on
I guess this is a simple example; there can be more complex examples, since deleting the i-th element makes elements (i-1) and (i+1) consecutive.
Is there any linear solution to this problem?
Consider the word DDDAAABBDC. This word is 3-good, because D and C are consecutive and card(D) - card(C) = 3, and removing the last D makes it 1-good by making D and C non-consecutive.
Conversely, if I consider DABABABBDC, which is 2-good, removing the last D makes C and B consecutive and increases the K-value of the word to 3.
This means that in the modified problem, the K-value of a word is determined by both the cardinals of each letter and the cardinals of each couple of consecutive letters.
By removing a letter, I reduce the cardinal of that letter as well as the cardinals of the pairs to which it belongs, but I also increase the cardinals of other pairs (potentially creating new ones).
It is also important to notice that in the original problem all occurrences of a letter are equivalent (I can remove any of them indifferently), while this is no longer the case in the modified problem.
As a conclusion, I think we can safely assume that the "consecutive letters" constraint makes the problem not solvable in linear time for any alphabet/word.
Instead of finding the linear-time solution, which I think does not exist (among other reasons because there seems to be a multitude of alternative solutions to each K request), I'd like to present a totally geeky solution.
Namely, take the parallel array processing language Dyalog APL and create these two tiny dynamic functions:
good←{1≥⍴⍵:¯1 ⋄ b←(⌈/a←(∪⍵)⍳⍵)⍴0 ⋄ b[a]+←1 ⋄ ⌈/|2-/b[a]}
make←{⍵,(good ⍵),a,⍺,(l-⍴a←⊃b),⍴b←(⍺=good¨b/¨⊂⍵)⌿(b←↓⍉~(l⍴2)⊤0,⍳2⊥(l←⍴⍵)⍴1)/¨⊂⍵}
good tells us the K-goodness of a string. A few examples below:
// fn" means the fn executes on each of the right args
good" 'AABCBBCB' 'DDDAAABBDC' 'DDDAAABBC' 'DABABABBDC' 'DABABABBC' 'STACKOVERFLOW'
2 3 1 2 3 1
make takes as arguments
[desired K] make [any string]
and returns
- original string
- K for original string
- reduced string for desired K
- how many characters were removed to achieve the desired K
- how many possible solutions there are to achieve desired K
For example:
3 make 'DABABABBDC'
┌──────────┬─┬─────────┬─┬─┬──┐
│DABABABBDC│2│DABABABBC│3│1│46│
└──────────┴─┴─────────┴─┴─┴──┘
A little longer string:
1 make 'ABCACDAAFABBC'
┌─────────────┬─┬────────┬─┬─┬────┐
│ABCACDAAFABBC│4│ABCACDFB│1│5│3031│
└─────────────┴─┴────────┴─┴─┴────┘
It is possible to both increase and decrease the K-goodness.
Unfortunately, this is brute force. We generate the base-2 representations of all integers between 2^[length of string] and 1, for example:
0 1 0 1 1
Then we test the goodness of the substring, for example of:
0 1 0 1 1 / 'STACK' // Substring is now 'TCK'
We pick only those results (substrings) that match the desired K-goodness. Finally, out of the multitude of possible results, we pick the first one, which is the one with the most characters left.
At least this was fun to code :-).

Finding length of substring

I am given n strings. I have to find a string S such that all of the given n strings are subsequences of S.
For example, I am given the following 5 strings:
AATT
CGTT
CAGT
ACGT
ATGC
Then the string is "ACAGTGCT", because ACAGTGCT contains all of the given strings as subsequences.
To solve this problem I need to know the algorithm, but I have no idea how to solve it. Can you help me by suggesting a technique to solve this problem?
This is an NP-complete problem called multiple sequence alignment.
The wiki page describes solution methods such as dynamic programming which will work for small n, but becomes prohibitively expensive for larger n.
The basic idea is to construct an array f[a,b,c,...] representing the length of the shortest string S that generates "a" characters of the first string, "b" characters of the second, and "c" characters of the third.
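For just two strings this DP takes a familiar form. A minimal Python sketch (the function name is mine):

def scs_length(a, b):
    # f[i][j] = length of the shortest string that contains a[:i] and
    # b[:j] as subsequences
    m, n = len(a), len(b)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                f[i][j] = i + j            # only one string left to cover
            elif a[i-1] == b[j-1]:
                f[i][j] = f[i-1][j-1] + 1  # a shared character is emitted once
            else:
                f[i][j] = 1 + min(f[i-1][j], f[i][j-1])
    return f[m][n]

print(scs_length("AATT", "CGTT"))  # 6, e.g. "CGAATT"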
My Approach: using a Trie
Build a Trie from the given words, then:

create empty string (S)
create empty string (prev)
for each layer in the trie
    create empty string (curr)
    for each character used in the current layer
        if the character is not used in the previous layer (not in prev)
            add the character to S
        add the character to curr
    prev = curr

Hope this helps :)
1 Definitions
A sequence of length n is a concatenation of n symbols taken from an alphabet Σ.
If S is a sequence of length n and T is a sequence of length m, with n ≤ m, then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The deleted symbols need not be contiguous.
A sequence T of length m is a supersequence of S of length n if T can be obtained from S by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T.
A sequence T is a common supersequence of the sequences S1 and S2 if T is a supersequence of both S1 and S2.
2 The problem
The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.
2.1 Example
Σ = {a, b, c}
S1 = bcb
S2 = baab
S3 = babc
One shortest common supersequence is babcab; others include babacb, baabcb, bcaabc, bacabc, and baacbc.
3 Techniques
Dynamic programming Requires too much memory unless the number of input-sequences are very small.
Branch and bound Requires too much time unless the alphabet is very small.
Majority merge The best known heuristic when the number of sequences is large compared to the alphabet size. [1]
Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left) Worse than majority merge. [1]
Genetic algorithms Indications that it might be better than majority merge. [1]
4 Implemented heuristics
4.1 The trivial solution
The trivial solution is at most |Σ| times the optimal solution length and is obtained by repeating the concatenation of all characters in Σ as many times as the length of the longest sequence. That is, if Σ = {a, b, c} and the longest input sequence has length 4, we get abcabcabcabc.
4.2 Majority merge heuristic
The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:
WHILE there are non-empty input sequences
    s <- the most frequent symbol at the start of the non-empty input sequences
    Add s to the end of S.
    Remove s from the beginning of each input sequence that starts with s.
END WHILE
Majority merge performs very well when the number of sequences is large compared to the alphabet size.
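A runnable Python sketch of the heuristic (the function name is mine; ties between equally frequent head symbols are broken arbitrarily):

from collections import Counter

def majority_merge(seqs):
    # repeatedly append the most frequent head symbol and consume it
    # from every sequence that starts with it
    seqs = [list(s) for s in seqs if s]
    out = []
    while seqs:
        s, _ = Counter(seq[0] for seq in seqs).most_common(1)[0]
        out.append(s)
        seqs = [seq[1:] if seq[0] == s else seq for seq in seqs]
        seqs = [seq for seq in seqs if seq]
    return ''.join(out)

# a common supersequence of the example in 2.1 (not necessarily the shortest)
print(majority_merge(["bcb", "baab", "babc"]))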
5 My approach - Local search
My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.
Since the length of a valid supersequence may vary, and any change to the supersequence may give an invalid string, a direct representation of a supersequence as a feasible solution is not an option.
I chose to view a feasible solution S as a sequence of mappings x1 ... xSl, where Sl is the sum of the lengths of all sequences and each xi is a mapping to a sequence number and an index.
That means, if L = {{s1,1 ... s1,m1}, {s2,1 ... s2,m2}, ..., {sn,1 ... sn,mn}} is the set of input sequences and L(i) is the i-th sequence, the mappings are represented like this:
xi → {k, l}, where k ∈ L and l ∈ L(k)
To be sure that any solution is valid we need to introduce the following constraints:
Every symbol in every sequence may only have one xi mapped to it.
If xi → ss,k and xj → ss,l and k < l, then i < j.
If xi → ss,k and xj → ss,l and k > l, then i > j.
The second constraint enforces that the order of each sequence is preserved but not its position in S. If we have two mappings xi and xj then we may only exchange mappings between them if they map to different sequences.
5.1 The initial solution
There are many ways to choose an initial solution; as long as the order of each sequence is preserved, it is valid. I chose not to randomize a solution in some way, but rather to try two very different solution types and compare them.
The first one is to create an initial solution by simply concatenating all the sequences.
The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.
5.2 Local change and the neighbourhood
A local change is done by exchanging two mappings in the solution.
One way of doing the iteration is to go from x1 to xSl and do the best exchange for each mapping.
Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.
There are two variants I have tried.
In the first one, if a single mapping exchange does not yield a better value I return; otherwise I go on.
In the second one, for each sequence separately, I do as many exchanges as there are sequences, so a symbol in each sequence has the possibility of moving. I keep the exchange that gives the best value, and if that value is worse than the value of the last step in the algorithm I return; otherwise I go on.
A symbol may move any number of positions to the left or to the right, as long as the exchange does not change the order of the original sequences.
The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.
5.3 Evaluation
Since the length of the solution is always constant, it has to be compressed before the real length of the solution can be obtained.
The solution S, which consists of mappings, is converted to a string by using the symbols each mapping points to. A new, initially empty, solution T is created. Then this algorithm is performed:
T = {}
FOR i = 1 TO Sl
    found = FALSE
    FOR j = 1 TO |L|
        IF the first symbol in L(j) = the symbol xi maps to THEN
            Remove the first symbol from L(j)
            found = TRUE
        END IF
    END FOR
    IF found = TRUE THEN
        Add the symbol xi maps to to the end of T
    END IF
END FOR
Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.
The value of the solution S is obtained as |T|.
With many, many thanks to Andreas Westling.

algorithms for fast string approximate matching

Given a source string s and n equal-length strings, I need to find a quick algorithm that returns those strings that differ from the source string s in at most k corresponding positions.
What is a fast algorithm to do so?
PS: I should point out that this is an academic question; I want to find the most efficient algorithm if possible.
Also, I missed one very important piece of information: the n equal-length strings form a dictionary against which many source strings s will be queried, so there seems to be room for a preprocessing step to make the queries more efficient.
My gut instinct is just to iterate over each of the n strings, maintaining a counter of how many characters differ from s, but I'm not claiming it is the most efficient solution. However, it would be O(n), so unless this is a known performance problem or an academic question, I'd go with that.
Sedgewick in his book "Algorithms" writes that Ternary Search Tree allows "to locate all words within a given Hamming distance of a query word". Article in Dr. Dobb's
Given that the strings are fixed length, you can compute the Hamming distance between two strings to determine the similarity; this is O(n) on the length of the string. So, worst case is that your algorithm is O(nm) for comparing your string against m words.
As an alternative, a fast solution that's also a memory hog is to preprocess your dictionary into a map: keys are tuples (p, c), where p is a position in the string and c is the character at that position, and values are the strings that have character c at position p (so "the" will be in the map under (0, 't'), (1, 'h'), and (2, 'e')). To query the map, iterate through the query string's characters and build a result map from the retrieved strings: keys are strings, values are the number of times each string was retrieved from the primary map (so with the query string "the", the key "thx" will have a value of 2, and the key "tab" will have a value of 1). Finally, iterate through the result map and discard strings whose values are less than K.
You can save memory by discarding keys that can't possibly reach K once the result map has been completed. For example, if K is 5 and N is 8, then from the 5th character of the query string onward you can discard any retrieved strings that aren't already in the result map, since they can't possibly reach 5 matching characters. Or, when you've finished with the 6th character of the query string, you can iterate through the result map and remove all keys whose values are less than 3.
If need be you can offload the primary precomputed map to a NoSql key-value database or something along those lines in order to save on main memory (and also so that you don't have to precompute the dictionary every time the program restarts).
Rather than storing a tuple (p, c) as the key in the primary map, you can instead concatenate the position and character into a string (so (5, 't') becomes "5t", and (12, 'x') becomes "12x").
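A compact Python sketch of this index (the function names are mine; this is the in-memory counting variant rather than the database-backed one):

from collections import defaultdict

def build_index(words):
    # (position, character) -> all words with that character at that position
    index = defaultdict(list)
    for w in words:
        for p, c in enumerate(w):
            index[(p, c)].append(w)
    return index

def query(index, s, k):
    # count per-position matches; at most k mismatches means at least
    # len(s) - k matches
    counts = defaultdict(int)
    for p, c in enumerate(s):
        for w in index[(p, c)]:
            counts[w] += 1
    return [w for w, n in counts.items() if n >= len(s) - k]

index = build_index(["the", "thx", "tab"])
print(query(index, "the", 1))  # ['the', 'thx']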
Without knowing where in each input string the matching characters will be, for a particular string you might need to check every character no matter what order you check them in. Therefore it makes sense to just iterate over each string character by character and keep a running count of the mismatches: if i is the number of mismatches so far, return false as soon as i exceeds k, and return true as soon as the remaining unchecked characters are too few to ever push i past k.
Note that depending on how long the strings are and how many mismatches you'll allow, it might be faster to iterate over the whole string rather than performing these checks, or perhaps to perform them only after every couple characters. Play around with it to see how you get the fastest performance.
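A minimal sketch of this early-exit check in Python (names mine; the early-true refinement is omitted for brevity):

def within_k(s, t, k):
    # Hamming-distance test that stops as soon as the mismatch budget
    # is exceeded
    mismatches = 0
    for a, b in zip(s, t):
        if a != b:
            mismatches += 1
            if mismatches > k:
                return False
    return True

def query(dictionary, s, k):
    return [t for t in dictionary if within_k(s, t, k)]

print(query(["the", "thx", "tab"], "the", 1))  # ['the', 'thx']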
My method, if we're thinking out loud :P I can't see a way to do this without going through each of the n strings, but I'm happy to be corrected. It would begin with a preprocessing step that saves a second copy of the n strings with the characters sorted in ascending order.
The first part of the comparison would then be to check each string one character at a time: say n' for the current character of a dictionary string and s' for the current character of s.
If s' is less than n', they are not equal, so move to the next s'. If n' is less than s', move to the next n'. Otherwise, record a matching character. Repeat until k mismatches are found or enough matches are found, and mark the string accordingly.
For further consideration, additional preprocessing could be done on each adjacent pair of strings in n to record the total number of characters in which they differ. This could then be used when comparing strings to s: if a sufficient difference exists between a string and its already-compared neighbour, there may be no need to compare the neighbour at all?

How can I remove repeated characters in a string with R?

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in Spanish, so it is not that common (or at least correct) to find words that have more than three successive vowels; don't worry about the possible sentiment behind them. Nonetheless, there are words that can have two successive consonants (especially ll and rr), but we can skip this in our function.
So, to sum up, this function should replace the letters that appear at least three times in a row with just that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?
I did not think very carefully on this, but this is my quick solution using references in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() captures a letter first, \\1 refers to that letter, and + matches it once or more; putting all these pieces together, we match a letter that appears two or more times in a row.
To include other characters besides alphanumerics, replace [[:alpha:]] with a regex matching whatever you wish to include.
I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire:
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see the 4 "a" have been reduced to only 1 a, the 3 r have been reduced to 1 r but the 2 n and the 2 e have not been changed.
As suggested above, you can replace [[:alpha:]] with a narrower character class such as [a-zA-Z], or restrict it to specific letters. Note that inside square brackets the listed characters are already alternatives, so [ae] matches either a or e (a | inside brackets is treated as a literal character, not as an "or" operator), and the code then only affects repetitions of a and e:
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# the triple r is not affected, and there are no triple e runs

remove fragments in a sentence [puzzle]

Question:
Write a program to remove any fragment that occurs in "all" strings, where a fragment
is 3 or more consecutive words.
Example:
Input::
s1 = "It is raining and I want to drive home.";
s2 = "It is raining and I want to go skiing.";
s3 = "It is hot and I want to go swimming.";
Output::
s1 = "It is raining drive home.";
s2 = "It is raining go skiing.";
s3 = "It is hot go swimming.";
Removed fragment = "and i want to"
The program will be tested against large files.
Efficiency will be taken into consideration.
Assumptions: ignore capitalization and punctuation, but preserve them in the output.
Note: Take care of cases like
a a a a a b c b c b c b c where removing would create more fragments.
My Solution: (which I think is not the most efficient)
Hash three-word phrases into ints and store them in an array, for all strings.
Each string then reduces to an array of numbers, like
1 2 3 4 5
3 5 7 9 8
9 3 1 7 9
Problem reduces to intersection of arrays.
Sort the arrays (k · n log n).
Keep k pointers; if all point to equal values, a match is found, otherwise increment the pointer pointing to the least value (a runnable sketch of this intersection follows below).
To handle the Note above, I was thinking of doing a lazy delete, i.e. marking phrases for deletion and deleting them at the end.
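In runnable form, the k-pointer intersection of the sorted hash arrays could look like this Python sketch (the function name is mine):

def intersect_sorted(arrays):
    # advance the pointer at the smallest value; record values on which
    # all k pointers agree
    ptrs = [0] * len(arrays)
    out = []
    while all(p < len(a) for p, a in zip(ptrs, arrays)):
        vals = [a[p] for p, a in zip(ptrs, arrays)]
        if len(set(vals)) == 1:
            out.append(vals[0])
            ptrs = [p + 1 for p in ptrs]
        else:
            i = vals.index(min(vals))
            ptrs[i] += 1
    return out

print(intersect_sorted([[1, 2, 3, 4, 5], [3, 5, 7, 8, 9], [1, 3, 7, 9, 9]]))  # [3]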
Are there cases where my solution might not work? Can we optimize my solution/ find the best solution ?
First observation: replace each word with a single "letter" in a big alphabet (i.e. hash the words in some way), and remove whitespace and punctuation.
Now you have reduced the problem to removing the longest "letter" sequence that appears in all the strings of a given list.
So you have to compute the longest common substring for a set of "words". You can find it using a generalized suffix tree, as this is the most efficient algorithm. This should do the trick, and I believe it has the best possible complexity.
The first step is as already suggested by izomorphius:
Replace each word with a single "letter" in a big alphabet (i.e. hash the words in some way), and remove whitespace and punctuation.
For the second you don't need to know the longest common substring - you just want to erase it from all the strings.
Note that this is equivalent to erasing all common substrings of length exactly 3, because if you have a longer common substring, then its length-3 substrings are also common.
To do that you can use a hash table (storing key value pairs).
Just iterate over the first string and put all its 3-substrings into the hash table as keys, with values equal to 1.
Then iterate over the second string and for each 3-substring x if x is in the hash table and its value is 1, then set the value to 2.
Then iterate over the third string and for each 3-substring x, if x is in the hash table and its value is 2, then set the value to 3.
...and so on.
At the end the keys that have the value of k are the common 3-substrings.
Now just iterate once more over all the strings and remove those 3-substrings that are common.
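A Python sketch of this idea at the word level, using a set intersection in place of the explicit 1..k counter (the names and the punctuation handling are mine, and it does not handle the cascading case from the Note):

def remove_common_fragments(sentences):
    tokenized = [s.split() for s in sentences]
    # normalize for comparison, but keep the original words for output
    normalized = [[w.lower().strip('.,!?') for w in t] for t in tokenized]

    def trigrams(words):
        return {tuple(words[i:i+3]) for i in range(len(words) - 2)}

    # word triples present in every sentence
    common = set.intersection(*(trigrams(n) for n in normalized))

    result = []
    for orig, norm in zip(tokenized, normalized):
        keep = [True] * len(orig)
        for i in range(len(norm) - 2):
            if tuple(norm[i:i+3]) in common:
                keep[i] = keep[i+1] = keep[i+2] = False
        result.append(' '.join(w for w, k in zip(orig, keep) if k))
    return result

for s in remove_common_fragments([
        "It is raining and I want to drive home.",
        "It is raining and I want to go skiing.",
        "It is hot and I want to go swimming."]):
    print(s)  # reproduces the expected output from the question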
import java.io.*;
import java.util.*;

public class remove_unique {
    public static void main(String args[]) {
        String s1 = "Everyday I do exercise if";
        String s2 = "Sometimes I do exercise if i feel stressed";
        String s3 = "Mostly I do exercise on morning";
        String[] words1 = s1.split("\\s");
        String[] words2 = s2.split("\\s");
        String[] words3 = s3.split("\\s");
        StringBuilder sb = new StringBuilder();
        // collect the words that appear in all three strings
        for (int i = 0; i < words1.length; i++) {
            for (int j = 0; j < words2.length; j++) {
                for (int k = 0; k < words3.length; k++) {
                    if (words1[i].equals(words2[j]) && words2[j].equals(words3[k])
                            && words3[k].equals(words1[i])) {
                        // concatenating the returned Strings
                        sb.append(words1[i] + " ");
                    }
                }
            }
        }
        System.out.println(s1.replaceAll(sb.toString(), ""));
        System.out.println(s2.replaceAll(sb.toString(), ""));
        System.out.println(s3.replaceAll(sb.toString(), ""));
    }
}
//LAKSHMI ARJUNA
My solution would be something like:
F = all fragments of length >= 3 words shared by the first 2 lines, avoiding overlaps
for each line from the 3rd line onward
    remove from F the fragments that do not exist in that line, or that would cause overlaps
return the sentences with the fragments in F removed
I assume finding/matching fragments in sentences can be done with some known algorithm, but in terms of time complexity, for n lines this is O(n).
