R extract a part of a string in R - string

I have 5 million sequences (probes to be specific) as below. I need to extract the name from each string.
The names here are 1007_s_at:123:381, 10073_s_at:128:385 and so on..
I am using lapply function but it is taking too much time. I have several other similar files. Would you suggest a faster way to do this.
nm = c(
"probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
"probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
"probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
"probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;" ,
"probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
"probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)
Output
1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391

Using regular expressions:
sub("probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)
A bit of explanation:
. means "any character".
.* means "any number of characters".
.*? means "any number of characters, but do not be greedy.
patterns within parenthesis are captured and assigned to \\1, \\2, etc.
$ means end of the line (or string).
So here, the pattern matches the whole line, and captures two things via the two (.*?): the HG-Focus (or other) thing you don't want as \\1 and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.
I now realize it was not necessary to capture the first thing, so this would work just as well:
sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)

A roundabout technique:
sapply(strsplit(sapply(strsplit(nm, "e:"), "[[", 2), ";"), "[[", 1)

Related

Replace matched susbtring using re sub

Is there a way to replace the matched pattern substring using a single re.sub() line?.
What I would like to avoid is using a string replace method to the current re.sub() output.
Input = "/J&L/LK/Tac1_1/shareloc.pdf"
Current output using re.sub("[^0-9_]", "", input): "1_1"
Desired output in a single re.sub use: "1.1"
According to the documentation, re.sub is defined as
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping occurrence of pattern.
This said, if you pass a lambda function, you can remain the code in one line. Furthermore, remember that the matched characters can be accessed easier to an individual group by: x[0].
I removed _ from the regex to reach the desired output.
txt = "/J&L/LK/Tac1_1/shareloc.pdf"
x = re.sub("[^0-9]", lambda x: '.' if x[0] is '_' else '', txt)
print(x)
There is no way to use a string replacement pattern in Python re.sub to replace with two possible strings, as there is no conditional replacement construct support in Python re.sub. So, using a callable as the replacement argument or use other work-arounds.
It looks like you only expect one match of <DIGITS>_<DIGITS> in the input string. In this case, you can use
import re
text = "/J&L/LK/Tac1_1/shareloc.pdf"
print( re.sub(r'^.*?(\d+)_(\d+).*', r'\1.\2', text, flags=re.S) )
# => 1.1
See the Python demo. See the regex demo. Details:
^ - start of string
.*? - zero or more chars as few as possible
(\d+) - Group 1: one or more digits
_ - a _ char
(\d+) - Group 2: one or more digits
.* - zero or more chars as many as possible.

How to match a part of string before a character into one variable and all after it into another

I have a problem with splitting string into two parts on special character.
For example:
12345#data
or
1234567#data
I have 5-7 characters in first part separated with "#" from second part, where are another data (characters,numbers, doesn't matter what)
I need to store two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.
Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)
See this documentation:
First of all, although Lua does not have a split function is its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
print(i)
end
See IDEONE demo
The [^#]+ pattern matches one or more characters other than # (so, it "splits" a string with 1 character).

Efficient way to insert characters between other characters in a string

What is an efficient way in MATLAB to replace/insert one symbol (in series of symbols) with several others that correspond to the one that is being replaced?
For example, consider having a string Eq: Eq = 'A*exp(-((x-xc)/w)^2)'. Is there a way to replace * with .*, / with ./,\ with .\, and ^ with .^ without writing four separate strrep() lines?
Regular expressions will do the job nicely. Regular expressions simply find patterns in text. You specify what kind of pattern you are looking for by a regular expression, and the output gives you the locations of where the pattern occurred.
For our particular case, not only do we want to find where patterns occur, we also want to replace those patterns with something else. Specifically, use the function regexprep from MATLAB to replace matches in a string with something else. What you want to do is replace all *, /, \ and ^ symbols by adding a . in front of each.
How regexprep works is that the first input is the string you're looking at, the second input is a pattern that you're trying to find. In our case, we want to find any of *, /, \ and ^. To specify this pattern, you put those desired symbols in [] brackets. Regular expressions reserve \ as a special symbol to delineate characters that can be parsed as a regular expression but actually aren't. As such, you need to use \\ for the \ character and \^ for the ^ character. The third input is what you want to replace each match with. In our case, we simply want to reuse each matched character, but we add a . at the beginning of the match. This is done by doing \.$0 in the regular expression syntax. $0 means to grab the first token produced by a match... which is essentially the matched symbol from the pattern. . is also a reserved keyword using regular expressions, so we must prepend this symbol with a \ character.
Without further ado:
>> Eq = 'A*exp(-((x-xc)/w)^2)';
>> out = regexprep(Eq, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2)
The pattern we are looking for is [*/\\\^], which means that we want to find any of *, /, \ - denoted as \\ in regex, and \^ - denoted as ^ in regex. We want to find any of these symbols and replace them with the same symbol by adding a . character in front - \.$0.
As a more complicated example, let's make sure that we include all of the symbols you're looking for in a sample equation:
>> A = 'A*exp(-((x-xc)/w)^2) \ b^2';
>> out = regexprep(A, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2) .\ b.^2
I'd go with regexp as in rayryeng's answer. But here's another approach, just to provide an alternative.
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
[~, jj] = sort([1:numel(Eq) ii-.5]); %// will be used to properly order the result
result = [Eq repmat('.',1,numel(ii))]; %// insert dots at the end
result = result(jj); %// properly order the result
And a variant:
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
jj = sort([1:numel(Eq) ii-.5]); %// dot locations are marked with fractional part
result = Eq(ceil(jj)); %// repeat characters where the dots will be placed
result(mod(jj,1)>0) = '.'; %// place dots at indices with fractional part
The vectorize function already does almost all of what you want except that it does not convert mldivide (\) to ldivide (.\).
By "efficient," do you mean fewer lines of code or faster? Regular expressions are almost always slower than other approaches and less readable. I don't think they're necessary or a good choice in this case. If you only need to convert your string once, then speed is less of a concern than readability (strrep will still be faster). If you need to do it many times, this simple code that you alluded to is 4–5 times faster than regexrep for short strings like your example (and much faster for longer strings):
out = strrep(Eq,'*','.*');
out = strrep(out,'/','./');
out = strrep(out,'\','.\');
out = strrep(out,'^','.^');
If you want one line, use:
out = strrep(strrep(strrep(strrep(Eq,'*','.*'),'/','./'),'\','.\'),'^','.^');
which will also be slightly faster still. Or create your own version of vectorize and call that.
Where regular expressions shine is in more complex cases, e.g., if your string is already partially vectorized: Eq = 'A.*exp(-((x-xc)/w)^2)'. Even still, the vectorize function just uses strrep and then calls strfind to "remove any possible '..*', '../', etc." and replace them with the proper element-wise operators because it's faster (symbolic math strings can get very large, for example).

split string by char

scala has a standard way of splitting a string in StringOps.split
it's behaviour somewhat surprised me though.
To demonstrate, using the quick convenience function
def sp(str: String) = str.split('.').toList
the following expressions all evaluate to true
(sp("") == List("")) //expected
(sp(".") == List()) //I would have expected List("", "")
(sp("a.b") == List("a", "b")) //expected
(sp(".b") == List("", "b")) //expected
(sp("a.") == List("a")) //I would have expected List("a", "")
(sp("..") == List()) // I would have expected List("", "", "")
(sp(".a.") == List("", "a")) // I would have expected List("", "a", "")
so I expected that split would return an array with (the number a separator occurrences) + 1 elements, but that's clearly not the case.
It is almost the above, but remove all trailing empty strings, but that's not true for splitting the empty string.
I'm failing to identify the pattern here. What rules does StringOps.split follow?
For bonus points, is there a good way (without too much copying/string appending) to get the split I'm expecting?
For curious you can find the code here.https://github.com/scala/scala/blob/v2.12.0-M1/src/library/scala/collection/immutable/StringLike.scala
See the split function with the character as an argument(line 206).
I think, the general pattern going on over here is, all the trailing empty splits results are getting ignored.
Except for the first one, for which "if no separator char is found then just send the whole string" logic is getting applied.
I am trying to find if there is any design documentation around these.
Also, if you use string instead of char for separator it will fall back to java regex split. As mentioned by #LRLucena, if you provide the limit parameter with a value more than size, you will get your trailing empty results. see http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String,%20int)
You can use split with a regular expression. I´m not sure, but I guess that the second parameter is the largest size of the resulting array.
def sp(str: String) = str.split("\\.", str.length+1).toList
Seems to be consistent with these three rules:
1) Trailing empty substrings are dropped.
2) An empty substring is considered trailing before it is considered leading, if applicable.
3) First case, with no separators is an exception.
split follows the behaviour of http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
That is split "around" the separator character, with the following exceptions:
Regardless of anything else, splitting the empty string will always give Array("")
Any trailing empty substrings are removed
Surrogate characters only match if the matched character is not part of a surrogate pair.

Extract numeric part of strings of mixed numbers and characters in R

I have a lot of strings, and each of which tends to have the following format: Ab_Cd-001234.txt
I want to replace it with 001234. How can I achieve it in R?
The stringr package has lots of handy shortcuts for this kind of work:
# input data following #agstudy
data <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
# load library
library(stringr)
# prepare regular expression
regexp <- "[[:digit:]]+"
# process string
str_extract(data, regexp)
Which gives the desired result:
[1] "001234" "001234"
To explain the regexp a little:
[[:digit:]] is any number 0 to 9
+ means the preceding item (in this case, a digit) will be matched one or more times
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
Using gsub or sub you can do this :
gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
"001234"
you can use regexpr with regmatches
m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
"001234"
EDIT the 2 methods are vectorized and works for a vector of strings.
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"
m <- gregexpr('[0-9]+',x)
> regmatches(x,m)
[[1]]
[1] "001234"
[[2]]
[1] "001234"
You could use genXtract from the qdap package. This takes a left character string and a right character string and extracts the elements between.
library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")
Though I much prefer agstudy's answer.
EDIT Extending answer to match agstudy's:
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")
# $`- : .txt1`
# [1] "001234"
#
# $`- : .txt2`
# [1] "001234"
gsub Remove prefix and suffix:
gsub(".*-|\\.txt$", "", x)
tools package Use file_path_sans_ext from tools to remove extension and then use sub to remove prefix:
library(tools)
sub(".*-", "", file_path_sans_ext(x))
strapplyc Extract the digits after - and before dot. See gsubfn home page for more info:
library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)
Note that if it were desired to return a numeric we could use strapply rather than strapplyc like this:
strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
I'm adding this answer because it works regardless of what non-numeric characters you have in the strings you want to clean up, and because OP said that the string tends to follow the format "Ab_Cd-001234.txt", which I take to mean allows for variation.
Note that this answer takes all numeric characters from the string and keeps them together, so if the string were "4_Ab_Cd_001234.txt", your result would be "4001234".
If you're wanting to point your solution at a column in a dataframe you've got,
df$clean_column<-gsub("[^0-9]", "", df$dirty_column)
This is very similar to the answer here:
https://stackoverflow.com/a/52729957/9731173.
Essentially what you are doing with my solution is replacing any non-numeric character with "", while the answer I've linked to replaces any character that is not numeric, - or .

Resources