Extract numeric part of strings of mixed numbers and characters in R - string

I have a lot of strings, each of which tends to have the following format: Ab_Cd-001234.txt
I want to replace it with 001234. How can I achieve it in R?

The stringr package has lots of handy shortcuts for this kind of work:
# input data following @agstudy
data <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
# load library
library(stringr)
# prepare regular expression
regexp <- "[[:digit:]]+"
# process string
str_extract(data, regexp)
Which gives the desired result:
[1] "001234" "001234"
To explain the regexp a little:
[[:digit:]] is any digit, 0 to 9
+ means the preceding item (in this case, a digit) will be matched one or more times
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
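For readers without stringr, here is a minimal base R sketch of the same idea. It assumes every string contains at least one run of digits, and the second input string is made up for illustration. regexpr reports only the first match in each string, so a digit earlier in the name wins over the number after the dash:

```r
# Hypothetical inputs: the second one has a stray digit before the real number
x <- c("Ab_Cd-001234.txt", "4_Ab_Cd-001234.txt")

# regexpr() finds the first run of digits in each element;
# regmatches() pulls out the matched text
first_run <- regmatches(x, regexpr("[[:digit:]]+", x))

print(first_run)
# [1] "001234" "4"
```

If the number you want is always the one after the dash, a pattern anchored on the dash is the safer choice.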

Using gsub or sub, you can do this:
gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
[1] "001234"
Alternatively, you can use gregexpr with regmatches, which returns a list:
m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
[[1]]
[1] "001234"
EDIT: Both methods are vectorized and work on a vector of strings.
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"
m <- gregexpr('[0-9]+',x)
regmatches(x,m)
[[1]]
[1] "001234"
[[2]]
[1] "001234"
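One difference between the two approaches shows up when a string does not match. The sketch below uses a made-up second input with no number in it: sub returns unmatched input unchanged, while regmatches silently drops it, so the result has to be put back in place by hand if you want an NA there.

```r
# Hypothetical inputs: the second one contains no "-<digits>" part
x <- c("Ab_Cd-001234.txt", "no_number.txt")

# sub() leaves strings that do not match the pattern untouched
res_sub <- sub(".*-([0-9]+).*", "\\1", x)

# regexpr() returns -1 where nothing matched; use that to fill in NA
m <- regexpr("[0-9]+", x)
hit <- as.vector(m) != -1L
res_rm <- rep(NA_character_, length(x))
res_rm[hit] <- regmatches(x, m)

print(res_sub)  # [1] "001234"        "no_number.txt"
print(res_rm)   # [1] "001234" NA
```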

You could use genXtract from the qdap package. This takes a left character string and a right character string and extracts the elements between.
library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")
Though I much prefer agstudy's answer.
EDIT Extending answer to match agstudy's:
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")
# $`- : .txt1`
# [1] "001234"
#
# $`- : .txt2`
# [1] "001234"

gsub: Remove the prefix and suffix:
gsub(".*-|\\.txt$", "", x)
tools package: Use file_path_sans_ext from tools to remove the extension and then use sub to remove the prefix:
library(tools)
sub(".*-", "", file_path_sans_ext(x))
strapplyc: Extract the digits after - and before the dot. See the gsubfn home page for more info:
library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)
Note that if a numeric result is desired, we could use strapply rather than strapplyc, like this:
strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
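Keep in mind that converting to numeric discards the leading zeros. A small base R sketch of the round trip, assuming the numbers are always six digits wide (that width and the second input are assumptions for illustration):

```r
# Hypothetical inputs in the same format as the question
x <- c("Ab_Cd-001234.txt", "Ab_Cd-000042.txt")

# Strip the extension, then everything up to the last "-"
digits <- sub(".*-", "", tools::file_path_sans_ext(x))

# Numeric conversion drops the zero padding
as_num <- as.numeric(digits)

# sprintf() can restore a fixed width if the padded form is needed again
repadded <- sprintf("%06d", as.integer(as_num))

print(as_num)    # [1] 1234   42
print(repadded)  # [1] "001234" "000042"
```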

I'm adding this answer because it works regardless of what non-numeric characters you have in the strings you want to clean up, and because OP said that the string tends to follow the format "Ab_Cd-001234.txt", which I take to mean that some variation is allowed.
Note that this answer takes all numeric characters from the string and keeps them together, so if the string were "4_Ab_Cd_001234.txt", your result would be "4001234".
If you want to apply this to a column in an existing data frame:
df$clean_column<-gsub("[^0-9]", "", df$dirty_column)
This is very similar to the answer here:
https://stackoverflow.com/a/52729957/9731173.
Essentially what you are doing with my solution is replacing any non-numeric character with "", while the answer I've linked to replaces any character that is not numeric, - or .
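A small self-contained sketch of that behaviour, using a made-up data frame with the column names from the line above:

```r
# Hypothetical data frame; the second value has a digit before the main number
df <- data.frame(
  dirty_column = c("Ab_Cd-001234.txt", "4_Ab_Cd_001234.txt"),
  stringsAsFactors = FALSE
)

# Delete every character that is not a digit
df$clean_column <- gsub("[^0-9]", "", df$dirty_column)

print(df$clean_column)
# [1] "001234"  "4001234"
```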

Related

Replace matched substring using re.sub

Is there a way to replace the matched pattern substring using a single re.sub() line?
What I would like to avoid is applying a string replace method to the current re.sub() output.
Input = "/J&L/LK/Tac1_1/shareloc.pdf"
Current output using re.sub("[^0-9_]", "", input): "1_1"
Desired output in a single re.sub use: "1.1"
According to the documentation, re.sub is defined as
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping occurrence of pattern.
That said, if you pass a lambda function, you can keep the code on one line. Also remember that the whole matched text is available as group 0 of the match object, for example x[0].
I removed _ from the regex to reach the desired output.
import re
txt = "/J&L/LK/Tac1_1/shareloc.pdf"
# turn "_" into "." and drop every other non-digit character
x = re.sub("[^0-9]", lambda m: '.' if m[0] == '_' else '', txt)
print(x)
There is no way to use a string replacement pattern in Python re.sub to replace with two possible strings, as there is no conditional replacement construct support in Python re.sub. So, either use a callable as the replacement argument or use another workaround.
It looks like you only expect one match of <DIGITS>_<DIGITS> in the input string. In this case, you can use
import re
text = "/J&L/LK/Tac1_1/shareloc.pdf"
print( re.sub(r'^.*?(\d+)_(\d+).*', r'\1.\2', text, flags=re.S) )
# => 1.1
Details:
^ - start of string
.*? - zero or more chars as few as possible
(\d+) - Group 1: one or more digits
_ - a _ char
(\d+) - Group 2: one or more digits
.* - zero or more chars as many as possible.

define a character string containing "

I wish to define a character variable as: a"", as in: my.string <- 'a""' Nothing I have tried works. I always get: "a\"\"", or some variation thereof.
I have been reading the documentation for: grep, strsplit, regex, substr, gregexpr and other functions for clues on how to tell R that " is a character I want to keep unchanged, and I have tried maybe a hundred variations of a"" by adding \\, \, /, //, [], _, $, [, #.
The only potential example I can find on the internet of a string including " is: ‘{}>=40*" years"’ from here: http://cran.r-project.org/doc/manuals/R-lang.html However, that example is for performing a mathematical operation.
Sorry for such a very basic question. Thank you for any advice.
The backslashes are an artifact of the print method. By default, print surrounds your string with quotes and therefore escapes the quotes inside it. You can disable this by setting the argument quote to FALSE.
For example, you can use:
print(my.string,quote=FALSE)
[1] a""
But I would use cat or write like this :
cat(my.string)
a""
write(my.string,"")
a""
Using substr, one sees that the backslashes seem just to be an artefact of printing:
substr(my.string,2,2)
gives
[1] "\""
Also, the string length is as you want it:
nchar(my.string)
[1] 3
If you want to print your string without the backslashes, use noquote:
noquote(my.string)
[1] a""
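To convince yourself that the backslashes are not stored in the string, here is a short sketch (the variable names are only for illustration). Both ways of writing the literal produce the same three-character string, and writeLines shows it without escapes:

```r
s1 <- 'a""'      # single-quoted literal, no escaping needed
s2 <- "a\"\""    # double-quoted literal, inner quotes escaped

same <- identical(s1, s2)   # TRUE: the escapes are only source-code syntax
n <- nchar(s1)              # 3 characters: a, ", "

# Capture what writeLines() would send to the console
shown <- capture.output(writeLines(s1))

print(same)
print(n)
writeLines(s1)   # a""
```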

R extract a part of a string in R

I have 5 million sequences (probes, to be specific) as below. I need to extract the name from each string.
The names here are 1007_s_at:123:381, 1007_s_at:128:385 and so on.
I am using the lapply function, but it is taking too much time. I have several other similar files. Could you suggest a faster way to do this?
nm = c(
"probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
"probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
"probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
"probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;" ,
"probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
"probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)
Output
1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391
Using regular expressions:
sub("probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)
A bit of explanation:
. means "any character".
.* means "any number of characters".
.*? means "any number of characters, but do not be greedy".
Patterns within parentheses are captured and assigned to \\1, \\2, etc.
$ means end of the line (or string).
So here, the pattern matches the whole line, and captures two things via the two (.*?): the HG-Focus (or other) thing you don't want as \\1 and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.
I now realize it was not necessary to capture the first thing, so this would work just as well:
sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
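As a variation that is not from the answer above, the lazy quantifiers can be swapped for negated character classes, which work with R's default regex engine (no perl = TRUE). This sketch assumes the array name never contains ":" and the id never contains ";". Because sub is vectorized, no lapply is needed:

```r
# Two of the strings from the question
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# [^:]* skips the array name up to the next ":",
# ([^;]*) captures everything up to the first ";"
ids <- sub("^probe:[^:]*:([^;]*);.*$", "\\1", nm)

print(ids)
# [1] "1007_s_at:123:381" "1007_s_at:244:391"
```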
A roundabout technique:
sapply(strsplit(sapply(strsplit(nm, "e:"), "[[", 2), ";"), "[[", 1)

Count word occurrences in R

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.
Let's for the moment assume you wanted the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
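A short sketch that puts the two patterns side by side, using grepl so that the counts can be taken with sum:

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Elements containing "corn" anywhere
n_any <- sum(grepl("corn", dataset, fixed = TRUE))

# Elements containing "corn" as a whole word
n_word <- sum(grepl("\\<corn\\>", dataset))

print(n_any)   # [1] 3
print(n_word)  # [1] 2
```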
Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
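If you prefer not to depend on stringr, a rough base R equivalent of str_count can be sketched with gregexpr and regmatches. The extra fifth element is made up to show an element with more than one occurrence:

```r
# The original dataset plus one hypothetical element with two occurrences
dataset2 <- c("corn", "cornmeal", "corn on the cob", "meal", "popcorn and corn")

# gregexpr() finds all matches per element; elements without a match
# yield an empty character vector, so lengths() reports 0 for them
matches <- regmatches(dataset2, gregexpr("corn", dataset2, fixed = TRUE))
per_element <- lengths(matches)

print(per_element)       # [1] 1 1 1 0 2
print(sum(per_element))  # [1] 5
```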
You can also do something like the following, which counts only the elements that are exactly equal to "corn" (1 for this dataset):
length(dataset[which(dataset=="corn")])
I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')
You can use the str_count function from the stringr package to get the number of keywords that match a given character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
"\b" specifies a word boundary. With str_count, the default definition of a word boundary treats the apostrophe as a boundary, so if your dataset contains the string "corn's", it would be matched and included in the result.
To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This causes the regular expression engine to use the Unicode TR 29 definition of word boundaries (see http://unicode.org/reports/tr29/tr29-4.html), which does not consider the apostrophe a word boundary.
The following code gives the number of times the word "corn" occurs. Words such as "corn's" are not included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))

Wrapping strings, but not substrings in quotes, using R

This question is related to my question about Roxygen.
I want to write a new function that does word wrapping of strings, similar to strwrap or stringr::str_wrap, but with the following twist: Any elements (substrings) in the string that are enclosed in quotes must not be allowed to wrap.
So, for example, using the following sample data
test <- "function(x=123456789, y=\"This is a long string argument\")"
cat(test)
function(x=123456789, y="This is a long string argument")
strwrap(test, width=40)
[1] "function(x=123456789, y=\"This is a long"
[2] "string argument\")"
I want the desired output of a newWrapFunction(x, width=40, ...) to be:
desired <- c("function(x=123456789, ", "y=\"This is a long string argument\")")
desired
[1] "function(x=123456789, "
[2] "y=\"This is a long string argument\")"
identical(desired, newWrapFunction(test, width=40))
[1] TRUE
Can you think of a way to do this?
PS. If you can help me solve this, I will propose this code as a patch to roxygen2. I have identified where this patch should be applied and will acknowledge your contribution.
Here's what I did to get strwrap so it would not break single quoted sections on spaces:
A) Pre-process the "even" sections after splitting by the single-quotes by substituting "~|~" for the spaces:
Define new function strwrapqt
....
zz <- strsplit(x, "\'") # only the even-numbered sections are inside quotes
for (i in seq_along(zz)) {
  n <- length(zz[[i]])
  if (n >= 2) { # seq(2, n, by=2) would fail when there is no quoted section
    for (evens in seq(2, n, by=2)) {
      zz[[i]][evens] <- gsub("[ ]", "~|~", zz[[i]][evens])
    }
  }
}
zz <- unlist(zz)
.... insert just before
z <- lapply(strsplit) ...........
Then at the end replace all the "~|~" with spaces. It might be necessary to do a lot more thinking about the other sorts of whitespace "events" to get a fully regular treatment.
....
y <- gsub("~\\|~", " ", y)
....
Edit: Tested @joran's suggestion. Matching single and double quotes would be a difficult task with the methods I am using, but if one were willing to consider any quote as equally valid as a separator target, one could just use zz <- strsplit(x, "\'|\"") as the splitting criterion in the code above.
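The placeholder idea can also be sketched as a standalone wrapper around strwrap instead of a patch to its internals. This is not the answer's strwrapqt: the function name wrap_keep_quotes and its arguments are hypothetical, it only protects double-quoted substrings, and it assumes the placeholder "~|~" never occurs in the input.

```r
# Hypothetical helper: wrap text without breaking inside double quotes
wrap_keep_quotes <- function(x, width = 40, placeholder = "~|~") {
  protected <- x
  # Locate every "..." substring
  m <- gregexpr("\"[^\"]*\"", protected)
  # Replace the spaces inside each quoted substring with the placeholder
  regmatches(protected, m) <- lapply(
    regmatches(protected, m),
    function(q) gsub(" ", placeholder, q, fixed = TRUE)
  )
  # strwrap() now sees each quoted substring as a single unbreakable word
  wrapped <- strwrap(protected, width = width)
  # Put the spaces back
  gsub(placeholder, " ", wrapped, fixed = TRUE)
}

test <- "function(x=123456789, y=\"This is a long string argument\")"
res <- wrap_keep_quotes(test, width = 40)
print(res)
# [1] "function(x=123456789,"
# [2] "y=\"This is a long string argument\")"
```

Note that strwrap drops the whitespace at the break, so the first line has no trailing space, unlike the desired vector in the question. A quoted substring longer than width still ends up on one overlong line.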
