How can I remove repeated characters in a string with R?

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in Spanish, where it is not common (or at least not correct) to find words with more than three successive vowels, whatever sentiment might be behind them. There are, however, words that can have two successive consonants (especially ll and rr), but we can leave that case out of our function.
So, to sum up, this function should replace any letter that appears at least three times in a row with a single occurrence of that letter. In the first example above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?

I did not think very carefully about this, but here is my quick solution using backreferences in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() captures a letter first, \\1 refers back to that captured letter, and + matches it once or more; putting these pieces together, we match a letter repeated two or more times in a row and replace the run with a single copy.
To include characters other than letters (digits, for instance), replace [[:alpha:]] with a regex matching whatever you wish to include.

I think you should pay attention to the ambiguities in your problem description. This is a first stab, but note that it collapses every run of repeats, so it does not treat legitimate double letters the way you desire: removeRS("Good Luck") would give "God Luck".
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"

Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see, the four "a" have been reduced to a single "a" and the three "r" to a single "r", but the double "n" and the double "e" have not been changed.
As suggested above, you can replace the [[:alpha:]] with any character class such as [a-zA-KM-Z]. Note that inside square brackets each listed character is already an alternative, so to make the code affect only repetitions of a and e the class is simply [ae] (a | between the brackets would itself be matched literally):
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# the triple r is not affected and there is no triple e

Related

String permutations in lexicographic order without inversions/reversions

The problem:
I would like to generate a list of permutations of strings in lexicographic order, but excluding string inversions (reversals). For instance, if I have the string abc, I would like to generate the following list
abc
acb
bac
instead of the typical
abc
acb
bac
bca
cab
cba
An alternative example would look something like this:
100
010
instead of
100
010
001
Currently, I can generate the permutations using Perl, but I am not sure how best to remove the reverse duplicates.
I had thought of applying something like the following:
create a map with the following:
1) 100
2) 010
3) 001
then perform the reversion/inversion on each element in the map and create a new map with:
1') 001
2') 010
3') 100
then compare, and if the primed list value matches the original value, leave it in place; if it is different and its index is greater than the median index, keep it, otherwise remove it.
Trouble is, I am not sure if this is an efficient approach or not.
Any advice would be great.
The two possibilities represented by the examples are permutations where all elements are different (abcd), and variations of two symbols where one appears exactly once (1000). More general cases are addressed as well.
Non-repeating elements (permutations)
Here we can make use of Algorithm::Permute, and of the particular observation:
Each permutation whose first element is greater than its last needs to be excluded. This criterion comes from this post, brought up in the answer by ysth.
The rule holds because the reversal of any permutation is itself a permutation of the same elements, and when all elements are distinct, exactly one member of each permutation/reversal pair has its first element greater than its last. Excluding those permutations therefore keeps exactly one representative of each mirror pair.
use warnings;
use strict;
use feature 'say';
use Algorithm::Permute;

my $size = shift || 4;
my @arr = ('a'..'z')[0..$size-1];   # use 1..$size instead for numbers
my @res;

Algorithm::Permute::permute {
    push @res, (join '', @arr) unless $arr[0] gt $arr[-1];
} @arr;

say for @res;
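Assuming the script above is saved as permute.pl (the file name is my own), running perl permute.pl 3 prints abc, acb and bac, in whatever order the module emits them, while bca, cab and cba are excluded as reversals. This matches the first example in the question.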
Problems with repeated elements (abbcd) can be treated the exact same way as above, except that we also need to prune duplicates, since permuting the two b's generates abbcd and abbcd (the same string):
use List::MoreUtils 'uniq';
# build @res the same way as above ...
@res = uniq @res;
Doing this during construction would not reduce complexity nor speed things up.
permute is quoted as by far the fastest method in the module. It is about an order of magnitude faster than the other modules I tested (below), taking about 1 second for 10 elements on my system. But note that this problem's complexity is factorial in the size; it blows up really fast.
Two symbols, where one appears exactly once (variations)
This is different and the above module is not meant for it, nor would the exclusion criterion work. There are other modules, see at the end. However, the problem here is very simple.
Start from (1,0,0,...) and 'walk' 1 along the list, up to the "midpoint" – which is the half for even sized list (4 for 8-long), or next past half for odd sizes (5 for 9-long). All strings obtained this way, by moving 1 by one position up to midpoint, form the set. The second "half" are their inversions.
use warnings;
use strict;

my $size = shift || 4;
my @n = (1, map { 0 } 1..$size-1);
my @res = (join '', @n);   # first element of the result

my $end_idx = ( @n % 2 == 0 ) ? @n/2 - 1 : int(@n/2);

foreach my $i (0..$end_idx-1)   # stop one short, as we write one past $i
{
    @n[$i, $i+1] = (0, 1);      # move the 1 up by one position
    push @res, join '', @n;
}
print "$_\n" for @res;
We need to stop before the last index since it has been filled in the previous iteration.
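As a quick check against the question's second example (walk.pl is a hypothetical name for the script above):

perl walk.pl 3
100
010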
This can be modified if both symbols (0,1) may appear repeatedly, but it is far simpler to use a module and then exclude inverses. Algorithm::Combinatorics has routines for all needs here. For all variations of 0 and 1 of length $size, where both may repeat:
use Algorithm::Combinatorics qw(variations_with_repetition);
my @rep_vars = variations_with_repetition([0, 1], $size);
Inverse elements can then be excluded by a brute-force search, with O(N²) complexity at worst.
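Here is one way to sketch that exclusion, using a hash of already-seen strings instead of the quadratic scan (a variation of mine, not the module's API; @rep_vars comes from the snippet above and holds array references):

my (%seen, @no_inverses);
for my $v (@rep_vars) {
    my $s = join '', @$v;
    push @no_inverses, $s unless $seen{ scalar reverse $s };
    $seen{$s} = 1;
}
# @no_inverses now keeps exactly one of each string/inverse pair
# (palindromic strings are their own inverse and are kept once)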
Also note Math::Combinatorics.
The answer in the suggested duplicate Generating permutations which are not mirrors of each other doesn't deal with repeated elements (because that wasn't part of that question) so naively following it would include e.g. both 0100 and 0010. So this isn't an exact duplicate. But the idea applies.
Generate all the permutations but filter only for those with $_ le reverse $_. I think this is essentially what you suggest in the question, but there's no need to compute a map when a simple expression applied to each permutation will tell you whether to include it or not.
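That filter is a one-liner in Perl. A small self-contained demonstration on the 0/1 example from the question, with uniq pruning repeated-element duplicates as discussed above:

use List::MoreUtils 'uniq';
my @all  = qw(1000 0100 0010 0001);   # all placements of the single 1
my @keep = uniq grep { $_ le reverse $_ } @all;
print "@keep\n";   # prints "0010 0001": one representative per mirror pair,
                   # here the lexicographically smaller one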

Minimum number of characters that need to be deleted

Original Problem:
A word is K-good if for every two letters in the word, if the first appears x times and the second appears y times, then |x - y| ≤ K.
Given some word w, how many letters does he have to remove to make it K-good?
Problem Link.
I have solved the above problem, and I am not asking for its solution.
I just misread the statement the first time, and thought about how to solve that misread version in linear time, which gives rise to a new problem:
Modification Problem
A word is K-good if for every two consecutive letters in the word, if the first appears x times and the second appears y times, then |x - y| ≤ K.
Given some word w, how many letters does he have to remove to make it K-good?
Is this problem solvable in linear time? I thought about it but could not find any valid solution.
Solution
My Approach: I could not approach my crush, but here is my approach to this problem: try everything (from the movie Zootopia),
i.e.

for i in range(0, 1 << n):        # n = length of the string
    for j in range(0, n):
        if i & (1 << j) != 0: delete character j
    check whether the remaining string is K-good
For N up to 10^5, the time complexity is O(2^N · N): time does not exist in that dimension.
Is there any linear solution to this problem, simple and sweet like the people of Stack Overflow?
For example:
String S = AABCBBCB and K = 1
If we delete the 'B' at index 5, S becomes AABCBCB, which is a good string:
F[A]-F[A]=0
F[B]-F[A]=1
F[C]-F[B]=1
and so on
I guess this is a simple example; there can be more complex ones, since deleting the i-th element makes elements (i-1) and (i+1) consecutive.
Is there any linear solution to this problem?
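Before the answers, a concrete checker helps pin down the modified definition. The following Perl sketch is mine, not from the question: is_k_good implements the consecutive-letters rule, and min_deletions is the brute force from the pseudocode above (exponential, so only usable for short words):

use strict;
use warnings;

# Modified definition: only letters that are adjacent somewhere
# in the word have their frequencies compared.
sub is_k_good {
    my ($word, $k) = @_;
    my %freq;
    $freq{$_}++ for split //, $word;
    my @c = split //, $word;
    for my $i (0 .. $#c - 1) {
        return 0 if abs($freq{$c[$i]} - $freq{$c[$i+1]}) > $k;
    }
    return 1;
}

# Brute force: try every subset of positions to delete.
sub min_deletions {
    my ($word, $k) = @_;
    my $n = length $word;
    my $best;
    for my $mask (0 .. (1 << $n) - 1) {
        my $kept = join '',
            map { $mask & (1 << $_) ? () : substr($word, $_, 1) } 0 .. $n - 1;
        next unless is_k_good($kept, $k);
        my $deleted = $n - length $kept;
        $best = $deleted if !defined $best || $deleted < $best;
    }
    return $best;
}

print min_deletions('AABCBBCB', 1), "\n";   # prints 1 (e.g. drop the B at index 5)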
Consider the word DDDAAABBDC. This word is 3-good, because D and C are consecutive and card(D) - card(C) = 3, and removing the last D makes it 1-good by making D and C non-consecutive.
Conversely, if I consider DABABABBDC, which is 2-good, removing the last D makes C and B consecutive and increases the K-value of the word to 3.
This means that in the modified problem, the K-value of a word is determined both by the cardinal of each letter and by which pairs of letters occur consecutively.
By removing a letter, I reduce its cardinal as well as the counts of the pairs to which it belongs, but I also create new adjacencies (potentially new pairs).
It is also important to notice that in the original problem all occurrences of a letter are equivalent (I can remove any of them indifferently), while this is no longer the case in the modified problem.
As a conclusion, I think we can safely assume that the "consecutive letters" constraint makes the problem not solvable in linear time for any alphabet/word.
Instead of finding the linear-time solution, which I think doesn't exist (among other reasons because there seems to be a multitude of alternative solutions for each requested K), I'd like to present the totally geeky solution.
Namely, take the parallel array processing language Dyalog APL and create these two tiny dynamic functions:
good←{1≥⍴⍵:¯1 ⋄ b←(⌈/a←(∪⍵)⍳⍵)⍴0 ⋄ b[a]+←1 ⋄ ⌈/|2-/b[a]}
make←{⍵,(good ⍵),a,⍺,(l-⍴a←⊃b),⍴b←(⍺=good¨b/¨⊂⍵)⌿(b←↓⍉~(l⍴2)⊤0,⍳2⊥(l←⍴⍵)⍴1)/¨⊂⍵}
good tells us the K-goodness of a string. A few examples below:
// fn¨ means the fn executes on each of the right args
good¨ 'AABCBBCB' 'DDDAAABBDC' 'DDDAAABBC' 'DABABABBDC' 'DABABABBC' 'STACKOVERFLOW'
2 3 1 2 3 1
make takes as arguments
[desired K] make [any string]
and returns
- original string
- K for original string
- reduced string for desired K
- how many characters were removed to achieve the desired K
- how many possible solutions there are to achieve desired K
For example:
3 make 'DABABABBDC'
┌──────────┬─┬─────────┬─┬─┬──┐
│DABABABBDC│2│DABABABBC│3│1│46│
└──────────┴─┴─────────┴─┴─┴──┘
A little longer string:
1 make 'ABCACDAAFABBC'
┌─────────────┬─┬────────┬─┬─┬────┐
│ABCACDAAFABBC│4│ABCACDFB│1│5│3031│
└─────────────┴─┴────────┴─┴─┴────┘
It is possible to both increase and decrease the K-goodness.
Unfortunately, this is brute force. We generate the base-2 representations of all integers between 2^(length of string) and 1, for example:
0 1 0 1 1
Then we test the goodness of the substring, for example of:
0 1 0 1 1 / 'STACK' // Substring is now 'TCK'
We pick only those results (substrings) that have the desired K-goodness. Finally, out of the multitude of possible results, we pick the first one, which is the one with the most characters left.
At least this was fun to code :-).

Find the minimal lexicographical string formed by merging two strings

Suppose we are given two strings s1 and s2 (both lowercase). We have to find the minimal lexicographic string that can be formed by merging the two strings.
At the beginning it looks pretty simple, like the merge step of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform a merge on these two, we must decide which z to pick, as they are equal. Clearly, if we pick the z of s2 first, the string formed will be:
zyzyy
If we pick the z of s1 first, the string formed will be:
zyyzy, which is correct.
As we can see, the merge step of mergesort can lead to the wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer will be zybzyy, which is obtained only if we pick the z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is: is there any standard algorithm to perform a merge that produces the minimal string?
You could use dynamic programming. In f[x][y], store the minimal lexicographical string formed by taking x characters from the first string s1 and y characters from the second string s2. You can calculate f in a bottom-up manner using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y])   // '+' denotes concatenation of a string and a character
You start with f[0][0] = "" (empty string).
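For concreteness, here is a minimal Perl sketch of that DP (my own illustration of the recurrence above, storing full strings rather than the references suggested just below, so each of the O(N·M) states does O(N+M) string work):

use strict;
use warnings;

sub min_merge {
    my ($s1, $s2) = @_;
    my ($n, $m) = (length $s1, length $s2);
    my @f;
    $f[0][0] = '';
    for my $x (0 .. $n) {
        for my $y (0 .. $m) {
            next if $x == 0 && $y == 0;
            my @cand;
            push @cand, $f[$x-1][$y] . substr($s1, $x-1, 1) if $x > 0;
            push @cand, $f[$x][$y-1] . substr($s2, $y-1, 1) if $y > 0;
            $f[$x][$y] = (sort @cand)[0];   # lexicographic minimum of the two
        }
    }
    return $f[$n][$m];
}

print min_merge('zyy', 'zy'),  "\n";   # zyyzy
print min_merge('zyy', 'zyb'), "\n";   # zybzyy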
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract the string at a certain f[x][y] you just follow the references. To update, you point back to either f[x-1][y] or f[x][y-1], depending on what your update step says.
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except with special handling of equality. As long as the first characters of both strings are equal, you look ahead at the second character, the third, and so on. If the end of one string is reached, treat the first character of the other string as the next character of the exhausted string, and so on for the following characters. If the ends of both strings are reached, it doesn't matter from which string you take the first character. Note that this algorithm is O(N), because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. the string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.

Perl Morgan and a String?

I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string made of those two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;

chomp(my $n = <>);
while ($n > 0) {
    chomp(my $string1 = <>);
    chomp(my $string2 = <>);
    lexi($string1, $string2);
    $n--;
}

sub lexi {
    my ($str1, $str2) = @_;
    my @str1 = split(//, $str1);
    my @str2 = split(//, $str2);
    my $final_string = "";
    while (@str2 && @str1) {
        my $st2 = $str2[0];
        my $st1 = $str1[0];
        if ($st1 le $st2) {
            $final_string .= $st1;
            shift @str1;
        }
        else {
            $final_string .= $st2;
            shift @str2;
        }
    }
    if (@str1) {
        $final_string = $final_string . join('', @str1);
    }
    else {
        $final_string = $final_string . join('', @str2);
    }
    print $final_string, "\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
For the sample test cases it gives the right results, but it gives wrong results for other test cases. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm and why it fails on other test cases.
Update:
As per @squeamishossifrage's comment, if I add
($str1, $str2) = sort { $a cmp $b } ($str1, $str2);
the results become the same irrespective of the order of the input strings, but the test cases still fail.
The problem is in your handling of equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely choose the one from the first string, but that's not always correct. You need to look ahead to break ties, possibly many characters ahead. In this case, the character after C in the second string is lower than the character after C in the first string, so you should take the C from the second string first.
By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume. One wrinkle: when one remaining string is a prefix of the other (e.g. "C" vs "CAB"), the plain comparison ranks the prefix first, which is not always optimal, as the question's AABAC/AACAB case shows. Appending a sentinel character that sorts after every letter removes that case.
sub lexi {
    my ($str1, $str2) = @_;
    utf8::downgrade($str1);   # Makes sure length() will be fast
    utf8::downgrade($str2);   # since we only have ASCII letters.

    $str1 .= '~';   # sentinel: '~' sorts after every letter, so a string
    $str2 .= '~';   # that is a mere prefix no longer compares as smaller

    my $final_string = "";
    while (length($str1) > 1 && length($str2) > 1) {
        $final_string .= substr($str1 le $str2 ? $str1 : $str2, 0, 1, '');
    }
    chop $str1;   # drop the sentinels
    chop $str2;
    $final_string .= $str1;
    $final_string .= $str2;
    print $final_string, "\n";
}
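With the sentinel in place, the failing case from the question comes out right (a quick check of mine, not part of the original answer):

lexi('AABAC', 'AACAB');   # prints AAAABACABC
lexi('JACK', 'DANIEL');   # prints DAJACKNIEL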
Too little rep to comment, thus the answer:
What you need to do is look ahead when the two characters match. You currently do a simple le comparison, and in the case of
ZABB
ZAAA
You'll get ZABBZAAA, since the first characters match (Z le Z is true) and you keep taking from the first string. So what you need to do (a naive solution which most likely won't be very efficient) is to keep looking ahead as long as the strings/chars match, so:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point you will know that the second string is the one you want to pop from for the first character.
Edit
You updated with sorting the strings, but as I wrote, you still need to look ahead. The sorting will solve the two strings above but will fail with these two:
ZABAZA
ZAAAZB
ZAAAZBZABAZA
Because here the correct answer is ZAAAZABAZAZB, and you can't find that by simply comparing character by character.

Count word occurrences in R

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.
Let's for the moment assume you wanted the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
You can also count only the elements that are exactly equal to "corn" (note that this returns 1 for the example dataset, not 3, since only the first element is exactly "corn"):
length(dataset[which(dataset == "corn")])
I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')
You can use the str_count function from the stringr package to get the number of keywords that match a given character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" is used to specify a word boundary. When using str_count function, the default definition of word boundary includes apostrophe. So if your dataset contains the string "corn's", it would be matched and included in the result.
This is because apostrophe is considered as a word boundary by default. To prevent words containing apostrophe from being counted, use the regex function with parameter uword = T. This will cause the regular expression engine to use the unicode TR 29 definition of word boundaries. See http://unicode.org/reports/tr29/tr29-4.html. This definition does not consider apostrophe as a word boundary.
The following code will give the number of time the word "corn" occurs. Words such as "corn's" will not be included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = T)))
