How to replace several characters in a string using Julia

I'm essentially trying to solve this problem: http://rosalind.info/problems/revc/
I want to replace all occurrences of A, C, G, T with their complements T, G, C, A... in other words, all A's will be replaced with T's, all C's with G's, etc.
I had previously used the replace() function to replace all occurrences of 'T' with 'U', and I was hoping that replace would take a list of characters to replace with another list of characters, but I haven't been able to make it work, so it might not have that functionality.
I know I could solve this easily using the BioJulia package and have done so using the following:
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
using Bio.Seq
s = dna"AAAACCCGGT"
t = reverse(complement(s))
println("$t")
But I'd like to not have to rely on the package.
Here's the code I have so far, if someone could steer me in the right direction that'd be great.
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
s = open("nt.txt") # open file containing sequence
t = reverse(s) # reverse the sequence
final = replace(t, r'[ACGT]', '[TGCA]') # this is probably incorrect
# replace characters ACGT with TGCA
println("$final")

It seems that replace doesn't yet do translations quite like, say, tr in Bash. So here are a couple of approaches using a dictionary mapping instead (the BioJulia package also appears to make similar use of dictionaries):
compliments = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
Then if str = "AAAACCCGGT", you could use join like this:
julia> join([compliments[c] for c in str])
"TTTTGGGCCA"
Another approach could be to use a function and map:
function translate(c)
    compliments[c]
end
Then:
julia> map(translate, str)
"TTTTGGGCCA"
Strings are iterable objects in Julia; each of these approaches reads one character in turn, c, and passes it to the dictionary to get back the complementary character. A new string is built up from these complementary characters.
Julia's strings are also immutable: you can't swap characters around in place; rather, you need to build a new string.
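Putting the pieces together for the original question, a minimal end-to-end sketch (assuming nt.txt holds a single sequence on one line, and using current Julia syntax for reading a file; the dictionary is the one defined above):
# reverse complement without BioJulia
compliments = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
s = strip(read("nt.txt", String))        # file contents as a String, surrounding whitespace trimmed
t = reverse(s)                           # reverse the sequence
final = join(compliments[c] for c in t)  # complement each character
println(final)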

Related

Python - how to recursively search a variable substring in texts that are elements of a list

Let me explain in more detail what I mean in the title.
Examples of strings to search in (strings of variable lengths; each one is an element of a list, which is very large in reality):
STRINGS = ['sftrkpilotndkpilotllptptpyrh', 'ffftapilotdfmmmbtyrtdll', 'gftttepncvjspwqbbqbthpilotou', 'htfrpilotrtubbbfelnxcdcz']
The substring to find, which I know for sure:
- is contained in each element of STRINGS,
- is also contained in a SOURCE string,
- is of a certain fixed LENGTH (5 characters in this example).
SOURCE = ['gfrtewwxadasvpbepilotzxxndffc']
I am trying to write a Python 3 program that finds this hidden 5-character word in SOURCE and reports at what position(s) it occurs in each element of STRINGS.
I am also trying to store the results in an array or a dictionary (I do not know which is more convenient at the moment).
Moreover, I need to perform other searches of the same type but with different LENGTH values, so this value should be provided by a variable in order to be of more general use.
I know that the first point has already been solved in previous posts, but never (as far as I know) together with the second point, which is the part I have not been able to deal with successfully (I do not post my code because I know it is just too far from being fixable).
Any help from this great community is highly appreciated.
-- Maurizio
You can iterate over the source string and, for each sub-string, find its positions within each of the other strings (with the re module or with plain built-ins). Then, if at least one occurrence was found in every string, yield the result:
import re

def find(source, strings, length):
    for i in range(len(source) - length + 1):  # + 1 so the last window is also tried
        sub = source[i:i+length]
        positions = {}
        for s in strings:
            # re alternative (non-overlapping matches only):
            # positions[s] = [m.start() for m in re.finditer(re.escape(sub), s)]
            positions[s] = [j for j in range(len(s)) if s.startswith(sub, j)]  # using built-ins
            if not positions[s]:
                break
        else:  # sub occurred at least once in every string
            yield sub, positions
And the generator can be used as illustrated in the following example:
import pprint

pprint.pprint(dict(find(
    source='gfrtewwxadasvpbepilotzxxndffc',
    strings=['sftrkpilotndkpilotllptptpyrh',
             'ffftapilotdfmmmbtyrtdll',
             'gftttepncvjspwqbbqbthpilotou',
             'htfrpilotrtubbbfelnxcdcz'],
    length=5
)))
which produces the following output:
{'pilot': {'ffftapilotdfmmmbtyrtdll': [5],
           'gftttepncvjspwqbbqbthpilotou': [21],
           'htfrpilotrtubbbfelnxcdcz': [4],
           'sftrkpilotndkpilotllptptpyrh': [5, 13]}}
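Since length is an ordinary parameter, the other searches mentioned in the question only need a different call, reusing the SOURCE and STRINGS variables defined above (output omitted here):
# hypothetical follow-up search for a shorter hidden word
pprint.pprint(dict(find(source=SOURCE[0], strings=STRINGS, length=4)))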

Python3 and combining Diacritics

I've been having a problem with Unicode in python3 and I can't seem to understand why that's happening.
symbol = "ῇ̣"
print(len(symbol))
# 2
This letter comes from the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, which contains combining diacritical marks. I want to do the statistical analysis in Python 3 and store the results in a database; the thing is that I also store each character's position (index) in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python 3 strings are Unicode by default, I'm a bit dumbfounded by this.
I have tried the base(), strip(), and strip_length() methods from greek-accentuation (https://pypi.org/project/greek-accentuation/), but that's not helping either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector

ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything,
        ### replacing the predefined set of strings by equal-length '-', ...),
        ### then I use the ad module to detect the language by looping over
        ### the characters; this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)
If I use the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example with a for loop, my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
... print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?
The string really is of length 2, so this is correct: it is two code points:
>>> import unicodedata
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']
So you should not use len to count the characters.
You could count the characters that are non-combining, so:
>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to Python 3).
This is still not the optimal solution, though; it depends on what you mean by counting characters. I think in your case it is enough, but fonts can merge characters into ligatures, and in some languages those are visually new (and very different) characters, not just ligatures as in Western languages.
As a last comment: I think you should normalize your strings. With the above code it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. the micro sign for units, or the Eszett, instead of the true Greek characters).
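A minimal sketch that combines both suggestions, normalizing first and then counting only non-combining code points (visible_length is just an illustrative helper name, not a standard function):
import unicodedata

def visible_length(s):
    # Normalize (NFC composes characters where a precomposed form exists),
    # then count only the code points that are not combining marks.
    s = unicodedata.normalize('NFC', s)
    return sum(1 for ch in s if not unicodedata.combining(ch))

word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
print(len(word))             # code-point count, includes the combining dots
print(visible_length(word))  # one per visible character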

Prolog DCG Building/Recognizing Word Strings from Alphanumeric Characters

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }, that is to say, in the first case a letter followed by zero or more alphanumeric characters, or, in the second case, one or more alphanumeric characters.
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason the string is split into a list of words rather than a list of letters is to recognize keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string, or a string consisting of a letter followed by alphanumeric characters?
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can use predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
    [L],
    { letter(L) },
    identifier_rest.

identifier_rest --> [].
identifier_rest -->
    [I],
    { letter_or_digit(I) },
    identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
Further, you can of course introduce an argument to relate such lists to atoms.
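For instance, one possible way to fill in those two predicates with SWI-Prolog's char_type/2 (the clause bodies below are just one option among several):
% auxiliary predicates for the DCG above
letter(C)          :- char_type(C, alpha).
letter_or_digit(C) :- char_type(C, alnum).

% ?- atom_chars(someIdentifier1, Cs), phrase(identifier, Cs).
% succeeds: 's' is a letter and the remaining characters are alphanumeric.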

Efficient way to insert characters between other characters in a string

What is an efficient way in MATLAB to replace/insert one symbol (in series of symbols) with several others that correspond to the one that is being replaced?
For example, consider having a string Eq: Eq = 'A*exp(-((x-xc)/w)^2)'. Is there a way to replace * with .*, / with ./, \ with .\, and ^ with .^ without writing four separate strrep() lines?
Regular expressions will do the job nicely. Regular expressions simply find patterns in text. You specify what kind of pattern you are looking for by a regular expression, and the output gives you the locations of where the pattern occurred.
For our particular case, not only do we want to find where patterns occur, we also want to replace those patterns with something else. Specifically, use the function regexprep from MATLAB to replace matches in a string with something else. What you want to do is replace all *, /, \ and ^ symbols by adding a . in front of each.
How regexprep works: the first input is the string you're looking at, and the second input is the pattern you're trying to find. In our case, we want to find any of *, /, \ and ^. To specify this pattern, you put the desired symbols inside [] brackets. Regular expressions reserve \ as an escape character for symbols that would otherwise be parsed as regex syntax, so you need to write \\ for the \ character and \^ for the ^ character.
The third input is what you want to replace each match with. In our case, we simply want to reuse each matched character, but with a . added in front. This is done with \.$0 in the replacement text: $0 refers to the entire match, which here is just the matched symbol, and the . is written as \. following the same escaping convention.
Without further ado:
>> Eq = 'A*exp(-((x-xc)/w)^2)';
>> out = regexprep(Eq, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2)
The pattern we are looking for is [*/\\\^], which means that we want to find any of *, /, \ (written as \\ in the regex) and ^ (written as \^ in the regex). We then replace each matched symbol with the same symbol preceded by a . character: \.$0.
As a more complicated example, let's make sure that we include all of the symbols you're looking for in a sample equation:
>> A = 'A*exp(-((x-xc)/w)^2) \ b^2';
>> out = regexprep(A, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2) .\ b.^2
I'd go with regexp as in rayryeng's answer. But here's another approach, just to provide an alternative.
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
[~, jj] = sort([1:numel(Eq) ii-.5]); %// will be used to properly order the result
result = [Eq repmat('.',1,numel(ii))]; %// insert dots at the end
result = result(jj); %// properly order the result
And a variant:
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
jj = sort([1:numel(Eq) ii-.5]); %// dot locations are marked with fractional part
result = Eq(ceil(jj)); %// repeat characters where the dots will be placed
result(mod(jj,1)>0) = '.'; %// place dots at indices with fractional part
The vectorize function already does almost all of what you want except that it does not convert mldivide (\) to ldivide (.\).
By "efficient," do you mean fewer lines of code or faster? Regular expressions are almost always slower than other approaches and less readable. I don't think they're necessary or a good choice in this case. If you only need to convert your string once, then speed is less of a concern than readability (strrep will still be faster). If you need to do it many times, this simple code that you alluded to is 4–5 times faster than regexrep for short strings like your example (and much faster for longer strings):
out = strrep(Eq,'*','.*');
out = strrep(out,'/','./');
out = strrep(out,'\','.\');
out = strrep(out,'^','.^');
If you want one line, use:
out = strrep(strrep(strrep(strrep(Eq,'*','.*'),'/','./'),'\','.\'),'^','.^');
which will also be slightly faster still. Or create your own version of vectorize and call that.
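Such a helper might look like this (the name my_vectorize is just an example; it simply packages the strrep chain above, including the backslash case that the built-in vectorize skips):
function out = my_vectorize(str)
% Dot the element-wise operators, including mldivide (\).
out = strrep(str, '*', '.*');
out = strrep(out, '/', './');
out = strrep(out, '\', '.\');
out = strrep(out, '^', '.^');
end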
Where regular expressions shine is in more complex cases, e.g., if your string is already partially vectorized: Eq = 'A.*exp(-((x-xc)/w)^2)'. Even so, the vectorize function just uses strrep and then calls strfind to "remove any possible '..*', '../', etc." and replace them with the proper element-wise operators, because it's faster (symbolic math strings can get very large, for example).

How can I remove repeated characters in a string with R?

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in Spanish, so it is not that common (or at least not correct) to find words that have more than three successive vowels. Don't worry about the possible sentiment behind them. Nonetheless, there are words that can have two successive consonants (especially ll and rr), but we can leave that out of our function.
So, to sum up, this function should replace the letters that appear at least three times in a row with just that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?
I did not think very carefully about this, but here is my quick solution using back-references in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() captures a letter first, \\1 refers back to that captured letter, and + matches it once or more; putting all these pieces together, we match a letter that appears two or more times in a row and replace the whole run with a single copy.
To include other characters besides alphanumerics, replace [[:alpha:]] with a regex matching whatever you wish to include.
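For instance, a dot matches any character, so the following (purely as an illustration of that remark) collapses every repeated run, punctuation included:
gsub('(.)\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte!!!')
# [1] "Buena Suerte!"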
I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire:
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see, the 4 a's have been reduced to a single a and the 3 r's to a single r, but the 2 n's and the 2 e's have not been changed.
As suggested above, you can replace the [[:alpha:]] with a narrower class such as [a-zA-KM-Z] or similar. Inside square brackets each character is already an alternative, so you can simply list the characters you care about, e.g. [ae], if you want your code to affect only repetitions of a and e:
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# the triple r is not affected, and there is no triple e
