I picked up J a few weeks ago, about the same time the CodeGolf.SE beta opened to the public.
A recurrent issue (of mine) when using J over there is reformatting input and output to fit the problem specifications. So I tend to use code like this:
( ] ` ('_'"0) ) @. (= & '-')
This one is untested for various reasons (edit me if wrong); the intended meaning is "convert - to _". Other conversions that come up frequently: newlines to spaces (and the converse), merging numbers with j, changing brackets.
This takes up quite a few characters and is not that convenient to integrate into the rest of the program.
Is there any other way to proceed with this? Preferably shorter, but I'm happy to learn anything else if it has other advantages. A solution with an implied functional obverse would also be a great relief.
It sometimes goes against the nature of code golf to use library methods, but in the string library, the charsub method is pretty useful:
'_-' charsub '_123'
-123
('_-', LF, ' ') charsub '_123', LF, '_stuff'
-123 -stuff
rplc is generally short for simple replacements:
'Test123' rplc 'e';'3'
T3st123
Amend m} is very short for special cases:
'*' 0} 'aaaa'
*aaa
'*' 0 2} 'aaaa'
*a*a
'*&' 0 2} 'aaaa'
*a&a
but becomes messy when the list has to be a verb:
b =: 'abcbdebf'
'L' (]g) } b
aLcLdeLf
where g has to be something like g =: ('b' E. ]) # ('b' E. ]) * [: i. #.
There are a lot of other "tricks" that work on a case by case basis. Example from the manual:
To replace lowercase 'a' through 'f' with uppercase 'A'
through 'F' in a string that contains only 'a' through 'f':
('abcdef' i. y) { 'ABCDEF'
Extending the previous example: to replace lowercase 'a' through
'f' with uppercase 'A' through 'F' leaving other characters unchanged:
(('abcdef' , a.) i. y) { 'ABCDEF' , a.
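For readers who do not read J, the index-then-select trick from the manual can be sketched in Python. This is only an illustration: the function name translate and the ALPHABET constant (standing in for J's a., here limited to the first 256 code points) are assumptions, not part of any J library.

```python
# Stand-in for J's a. (the full character set); limited to 256 code points here.
ALPHABET = "".join(chr(i) for i in range(256))

def translate(y: str, old: str, new: str) -> str:
    """Mimic ((old , a.) i. y) { new , a. from the J manual example."""
    src = old + ALPHABET   # characters to look up, replacements listed first
    dst = new + ALPHABET   # what each position in src maps to
    # str.index returns the FIRST match, so characters in `old` hit the
    # replacement part and everything else falls through to the identity part.
    # Characters outside ALPHABET raise ValueError in this sketch.
    return "".join(dst[src.index(ch)] for ch in y)

print(translate("a bad cafe, ok", "abcdef", "ABCDEF"))  # A BAD CAFE, ok
print(translate("-123", "-", "_"))                      # _123
```

Because the mapping is just a pair of strings, swapping old and new gives the inverse translation.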
I've only dealt with the newlines and CSV, rather than the general case of replacement, but here's how I've handled those. I assume Unix line endings (or line endings fixed with toJ) with a final line feed.
Single lines of input: ".{:('1 2 3',LF) (Haven't gotten to use this yet)
Rectangular input: (".;._2) ('1 2 3',LF,'4 5 6',LF)
Ragged input: probably (,;._2) or (<;._2) (Haven't used this yet either.)
One line, comma separated: ".;._1}:',',('1,2,3',LF)
This doesn't replace tr at all, but does help with line endings and other garbage.
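As a rough cross-language comparison, the rectangular and comma-separated cases above look like this in Python. The helper names parse_rows and parse_csv_line are made up for this sketch, and it assumes plain integers written with an ordinary minus sign (not J's _).

```python
def parse_rows(text: str) -> list[list[int]]:
    """Rectangular input: one row of whitespace-separated integers per line."""
    return [[int(tok) for tok in line.split()] for line in text.splitlines()]

def parse_csv_line(text: str) -> list[int]:
    """One line of comma-separated integers, with a trailing line feed."""
    return [int(tok) for tok in text.strip().split(",")]

print(parse_rows("1 2 3\n4 5 6\n"))  # [[1, 2, 3], [4, 5, 6]]
print(parse_csv_line("1,2,3\n"))     # [1, 2, 3]
```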
You might want to consider using the 8!:2 foreign:
8!:2]_1
-1
For example, I have this text:
10 3 4 2 10 , 4 ,10 ....
Now I want to replace each 10 with a different word.
I know %s/10/replace-words/gc, but it only lets me confirm each replacement interactively (yes/no). I want to change each occurrence of 10 to a different word, like replace1, 3, 4 , 2 , replace2, 4, replace3 ....
Replaces each occurrence of 10 with replace{index_of_match}:
:let @a=1 | %s/10/\='replace'.(@a+setreg('a',@a+1))/g
Replaces each occurrence of 10 with a word from a predefined array:
:let b = ['foo', 'bar', 'vim'] | %s/10/\=(remove(b, 0))/g
Replaces each occurrence of 10 with a word from a predefined array, and the index of the match:
:let @a=1 | let b = ['foo', 'bar', 'vim'] | %s/10/\=(b[@a-1]).(@a+setreg('a',@a+1))/g
But since you have to type in every word anyway, the benefit of the second and third commands is minimal. See the answer from SpoonMeiser for the "manual" solution.
Update: As wished, the explanation for the regex part in the second example:
% = on every line in the document
s/<search>/<replace>/g = s means do a search & replace, g means replace every occurrence.
\= interprets the following as code.
remove(b, 0) removes the element at index 0 of the list b and returns it.
So for the first occurrence the line will effectively be %s/10/foo/g; the second time, the list is now only ['bar', 'vim'], so the line will be %s/10/bar/g, and so on.
Note: This is a quick draft, and unlikely the best & cleanest way to achieve it, if somebody wants to improve it, feel free to add a comment
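The "evaluate an expression for every match and consume a list as you go" idea is not specific to Vim. Here is a minimal Python sketch of the same mechanism; the function name replace_each is hypothetical, and like the Vim commands it matches 10 anywhere, including inside longer numbers such as 100.

```python
import re

def replace_each(text: str, target: str, words: list[str]) -> str:
    """Replace successive occurrences of `target` with successive words."""
    remaining = iter(words)
    # re.sub calls the function once per match, left to right, much like
    # \=remove(b, 0) is evaluated once per match in Vim.
    # If there are more matches than words, next() raises StopIteration.
    return re.sub(re.escape(target), lambda match: next(remaining), text)

print(replace_each("10 3 4 2 10 , 4 ,10", "10", ["foo", "bar", "vim"]))
# foo 3 4 2 bar , 4 ,vim
```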
Is there a pattern to the words you want or would you want to type each word at each occurrence of the word you're replacing?
If I were replacing each instance of "10" with a different word, I'd probably do it somewhat manually:
/10
cw
<type word>ESC
ncw
<type word>ESC
ncw
<type word>ESC
Which doesn't seem too onerous, if each word is different and has to be typed separately anyway.
I'm essentially trying to solve this problem: http://rosalind.info/problems/revc/
I want to replace all occurrences of A, C, G, T with their complements T, G, C, A. In other words, all A's will be replaced with T's, all C's with G's, etc.
I had previously used the replace() function to replace all occurrences of 'T' with 'U' and was hoping that the replace function would take a list of characters to replace with another list of characters but I haven't been able to make it work, so it might not have that functionality.
I know I could solve this easily using the BioJulia package and have done so using the following:
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
using Bio.Seq
s = dna"AAAACCCGGT"
t = reverse(complement(s))
println("$t")
But I'd like to not have to rely on the package.
Here's the code I have so far, if someone could steer me in the right direction that'd be great.
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
s = open("nt.txt") # open file containing sequence
t = reverse(s) # reverse the sequence
final = replace(t, r'[ACGT]', '[TGCA]') # this is probably incorrect
# replace characters ACGT with TGCA
println("$final")
It seems that replace doesn't yet do translations quite like, say, tr in Bash. So here are a couple of approaches using a dictionary mapping instead (the BioJulia package also appears to make similar use of dictionaries):
compliments = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
Then if str = "AAAACCCGGT", you could use join like this:
julia> join([compliments[c] for c in str])
"TTTTGGGCCA"
Another approach could be to use a function and map:
function translate(c)
compliments[c]
end
Then:
julia> map(translate, str)
"TTTTGGGCCA"
Strings are iterable objects in Julia; each of these approaches reads one character in turn, c, and passes it to the dictionary to get back the complementary character. A new string is built up from these complementary characters.
Julia's strings are also immutable: you can't swap characters around in place, rather you need to build a new string.
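For comparison only, the same per-character table lookup exists in Python's standard library as str.maketrans and str.translate. The function name reverse_complement is an assumption of this sketch, and it adds the reversal step the Rosalind problem asks for.

```python
# Translation table: each nucleotide maps to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Complement every nucleotide, then reverse the resulting string."""
    # translate() builds a new string; Python strings are immutable too.
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```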
I am trying to solve this problem on hackerrank:
So the problem is:
Jack and Daniel are friends. Both of them like letters, especially upper-case ones.
They are cutting upper-case letters from newspapers, and each one of them has their collection of letters stored in separate stacks.
One beautiful day, Morgan visited Jack and Daniel. He saw their collections. Morgan wondered what is the lexicographically minimal string, made of that two collections. He can take a letter from a collection when it is on the top of the stack.
Also, Morgan wants to use all the letters in the boys' collections.
This is my attempt in Perl:
#!/usr/bin/perl
use strict;
use warnings;
chomp(my $n=<>);
while($n>0){
    chomp(my $string1=<>);
    chomp(my $string2=<>);
    lexi($string1,$string2);
    $n--;
}
sub lexi{
    my($str1,$str2)=@_;
    my @str1=split(//,$str1);
    my @str2=split(//,$str2);
    my $final_string="";
    while(@str2 && @str1){
        my $st2=$str2[0];
        my $st1=$str1[0];
        if($st1 le $st2){
            $final_string.=$st1;
            shift @str1;
        }
        else{
            $final_string.=$st2;
            shift @str2;
        }
    }
    if(@str1){
        $final_string=$final_string.join('',@str1);
    }
    else{
        $final_string=$final_string.join('',@str2);
    }
    print $final_string,"\n";
}
Sample Input:
2
JACK
DANIEL
ABACABA
ABACABA
The first line contains the number of test cases, T.
Every next two lines have such format: the first line contains string A, and the second line contains string B.
Sample Output:
DAJACKNIEL
AABABACABACABA
The sample test case gives the right results, but other test cases give wrong results. One case for which it gives an incorrect result is
1
AABAC
AACAB
It outputs AAAABACCAB instead of AAAABACABC.
I don't know what is wrong with the algorithm or why it fails on other test cases.
Update:
Following @squeamishossifrage's comments, if I add
($str1,$str2)=sort{$a cmp $b}($str1,$str2);
the results are the same regardless of the order of the inputs, but the test case still fails.
The problem is in your handling of the equal characters. Take the following example:
ACBA
BCAB
When faced with two identical characters (C in my example), you naïvely chose the one from the first string, but that's not always correct. You need to look ahead to break ties. You may even need to look many characters ahead. In this case, next character after C of the second string is lower than the next character of the first string, so you should take the C from the second string first.
By leaving the strings as strings, a simple string comparison will compare as many characters as needed to determine which character to consume.
sub lexi {
    my ($str1, $str2) = @_;
    utf8::downgrade($str1);  # Makes sure length() will be fast
    utf8::downgrade($str2);  # since we only have ASCII letters.
    my $final_string = "";
    while (length($str2) && length($str1)) {
        $final_string .= substr($str1 le $str2 ? $str1 : $str2, 0, 1, '');
    }
    $final_string .= $str1;
    $final_string .= $str2;
    print $final_string, "\n";
}
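The same look-ahead comparison can be sketched in Python. This is a sketch under stated assumptions, not the answer's code: the function name morgan_merge is made up, and the lowercase "z" sentinel appended to both strings is an addition that relies on the input containing only upper-case letters. The sentinel matters when one remaining string is a prefix of the other (B and BA, for example, should give BAB, not BBA).

```python
def morgan_merge(a: str, b: str) -> str:
    """Lexicographically minimal merge of two stacks of upper-case letters."""
    # Assumption: "z" sorts after every upper-case letter, so a string that
    # runs out compares as LARGER than one that still has letters left.
    a += "z"
    b += "z"
    i = j = 0
    out = []
    # Stop as soon as one string has only its sentinel left.
    while i < len(a) - 1 and j < len(b) - 1:
        # Comparing whole remaining suffixes looks ahead as far as needed.
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # Append whatever is left, without the sentinels.
    out.append(a[i:-1])
    out.append(b[j:-1])
    return "".join(out)

print(morgan_merge("JACK", "DANIEL"))  # DAJACKNIEL
print(morgan_merge("AABAC", "AACAB"))  # AAAABACABC
```

Slicing the suffixes on every step is quadratic in the worst case, which is fine for illustrating the idea but may be too slow for large inputs.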
Too little rep to comment, hence the answer:
What you need to do is to look ahead if the two characters match. You currently do a simple le match and in the case of
ZABB
ZAAA
You'll get ZABBZAAA since the first match Z will be le Z. So what you need to do (a naive solution which most likely won't be very efficient) is to keep looking as long as the strings/chars match, so:
Z eq Z
ZA eq ZA
ZAB gt ZAA
and at that point you will know that the second string is the one you want to pop from for the first character.
Edit
You updated with sorting the strings, but like I wrote you still need to look ahead. The sorting will solve the two above strings but will fail with these two:
ZABAZA
ZAAAZB
ZAAAZBZABAZA
The third line is what the character-by-character comparison produces, but the correct answer is ZAAAZABAZAZB, and you can't find that by simply comparing character by character.
I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in Spanish, so it is not that common (or at least not correct) to find words that have more than three successive vowels. Never mind the possible sentiment behind them. Nonetheless, there are words that can have two successive consonants (especially ll and rr), but we could skip this in our function.
So, to sum up, this function should replace the letters that appear at least three times in a row with just that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?
I did not think very carefully on this, but this is my quick solution using references in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() captures a letter first, \\1 refers to that letter, + means to match it once or more; put all these pieces together, we can match a letter two or more times.
To include other characters besides letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire:
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see the 4 "a" have been reduced to only 1 a, the 3 r have been reduced to 1 r but the 2 n and the 2 e have not been changed.
As suggested above, you can replace [[:alpha:]] with any character class such as [a-zA-KM-Z], or list only the characters you care about, e.g. [yQ], if you want your code to affect only repetitions of y and Q. Note that | is not an "or" operator inside square brackets; there it is just a literal | character.
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# triple r are not affected and there are no triple e.
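The backreference technique carries over to other regex engines. Below is a minimal Python sketch for comparison; the function name remove_rs and the min_run parameter are assumptions, and [^\W\d_] stands in for [[:alpha:]] because Python's re module has no POSIX character classes.

```python
import re

# Assumption: [^\W\d_] (a word character that is neither a digit nor "_")
# is used as the Python equivalent of R's [[:alpha:]].
LETTER = r"[^\W\d_]"

def remove_rs(text: str, min_run: int = 3) -> str:
    """Collapse any letter repeated at least `min_run` times into one."""
    # (letter) followed by the same letter at least min_run - 1 more times.
    pattern = re.compile(rf"({LETTER})\1{{{min_run - 1},}}")
    return pattern.sub(r"\1", text)

print(remove_rs("Buenaaaaaaaaa Suerrrrte"))  # Buena Suerte
print(remove_rs("Buennaaaa Suerrrtee"))      # Buenna Suertee
```

With min_run=2 the function behaves like the \\1+ version above and would also shorten legitimate double letters.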
On reading this question, I thought the following problem would be simple using StringSplit
Given the following string, I want to 'cut' it to the left of every "D" such that:
I get a List of fragments (with sequence unchanged)
StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, the sequence within each fragment is important, and I do not want to lose any characters.
(The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain the theoretical list of ALL fragments obtained by treating with an enzyme known to split before "D")
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
The best I can come up with is to insert a space before each "D" using StringReplace and then use StringSplit. This seems quite awkward, to say the least.
frags1 = StringSplit@StringReplace[str, "D" -> " D"]
giving as output:
{"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
or, alternatively, using StringReplacePart:
frags1alt =
StringSplit@StringReplacePart[str, " D", StringPosition[str, "D"]]
Finally (and more realistically), if I want to split before "D" provided that the residue immediately preceding it is not "P" [ie P-D,(Pro-Asp) bonds are not cleaved], I do it as follows:
StringSplit@StringReplace[str, (x_ /; x != "P") ~~ "D" -> x ~~ " D"]
Is there a more elegant way?
Speed is not necessarily an issue. I am unlikely to be dealing with strings of greater than, say, 500 characters. I am using Mma 7.
Update
I have added the bioinformatics tag, and I thought it might be of interest to add an example from that field.
The following imports a protein sequence (Bovine serum albumin, accession number 3336842) from the NCBI database using eutils and then generates a (theoretical) trypsin digest. I have assumed that the enzyme trypsin cleaves between residues A1-A2 when A1 is either "R" or "K", provided that A2 is not "R", "K" or "P". If anyone has any suggestions for improvements, please feel free to suggest modifications.
Using a modification of sakra's method ( a carriage return after '?db=' possibly needs to be removed):
StringJoin /@
Split[Characters[#],
And @@ Function[x, #1 != x] /@ {"R", "K"} ||
Or @@ Function[xx, #2 == xx] /@ {"R", "K", "P"} &] & @
StringJoin@
Rest@Import[
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=\
protein&id=3336842&rettype=fasta&retmode=text", "Data"]
My possibly ham-fisted attempt at using the regex method (Sasha/WReach) to do the same thing:
StringSplit[#, RegularExpression["(?![PKR])(?<=[KR])"]] &@
StringJoin@Rest@Import[...]
Output
{MK,WVTFISLLLLFSSAYSR,GVFRR,<<69>>,CCAADDK,EACFAVEGPK,LVVSTQTALA}
I cannot build anything much simpler than your code. Here is a regex version, which you might happen to like:
In[281]:= StringSplit@
StringReplace[str, RegularExpression["(?<!P)D"] -> " D"]
Out[281]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}
It uses a negative lookbehind pattern, borrowed from this site.
EDIT Adding WReach's cool solution:
In[2]:= StringSplit[str, RegularExpression["(?<!P)(?=D)"]]
Out[2]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}
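The zero-width split is not specific to Mathematica's regex engine. A minimal Python sketch of the same idea follows; the function name cleave_before and its parameters are hypothetical, and splitting on an empty match requires Python 3.7 or later.

```python
import re

def cleave_before(sequence: str, residue: str = "D", blocked_by: str = "P") -> list[str]:
    """Split before every `residue` that is not preceded by `blocked_by`."""
    # (?<!P) is a negative lookbehind and (?=D) a lookahead: the match is
    # empty, so no characters are consumed or lost by the split.
    pattern = rf"(?<!{re.escape(blocked_by)})(?={re.escape(residue)})"
    # Drop the empty fragment produced when the sequence starts with `residue`.
    return [frag for frag in re.split(pattern, sequence) if frag]

seq = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
fragments = cleave_before(seq)
print(fragments)
# ['MTPDKPSQY', 'DKIEAELQ', 'DICN', 'DVLELL', 'DSKG', 'DYFRYLSEVASG', 'DN']
print("".join(fragments) == seq)  # True
```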
Here are some alternate solutions:
Splitting by any occurrence of "D":
In[18]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" &]
Out[18]:= {"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
Splitting by any occurrence of "D" provided it is not preceded by "P":
In[19]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" || #1=="P" &]
Out[19]:= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
Your first solution isn't that bad, is it? Everything that I can think of is longer or uglier than that. Is the problem that there might be spaces in the original string?
StringCases[str, "D" | StartOfString ~~ Longest[Except["D"] ..]]
or
Prepend["D" <> # & /@ Rest[StringSplit[str, "D"]], First[StringSplit[str, "D"]]]