R: combinatorial string replacement - string

I am looking for a gsub-based function that does combinatorial string replacement, so that given an arbitrary number of string replacement rules
replrules=list("<x>"=c(3,5),"<ALK>"=c("hept","oct","non"),"<END>"=c("ane","ene"))
and a target string
string="<x>-methyl<ALK><END>"
it would give me a dataframe with the final string name and the substitutions that were made as in
name             x  ALK   END
3-methylheptane  3  hept  ane
5-methylheptane  5  hept  ane
3-methyloctane   3  oct   ane
5-methyloctane   5  ...   ...
3-methylnonane   3
5-methylnonane   5
3-methylheptene  3
5-methylheptene  5
3-methyloctene   3
5-methyloctene   5
3-methylnonene   3
5-methylnonene   5
The target string would be of arbitrary structure, e.g. it could also be string="1-<ALK>anol" or each pattern could occur several times, as in string="<ALK>anedioic acid, di<ALK>yl ester"
What would be the most elegant way to do this kind of thing in R?

How about
d <- do.call(expand.grid, replrules)
d$name <- paste0(d$'<x>', "-", "methyl", d$'<ALK>', d$'<END>')
EDIT
This seems to work (substituting each of these into the strsplit call)
string = "<x>-methyl<ALK><END>"
string2 = "<x>-ethyl<ALK>acosane"
string3 = "1-<ALK>anol"
Using Richard's regex
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
s <- strsplit(string3, "(<|>)", perl = TRUE)[[1]]
out <- list()
for(i in s) {
out[[i]] <- ifelse (i %in% names(d), d[i], i)
}
d$name <- do.call(paste0, unlist(out, recursive=F))
EDIT
This should work for repeat items
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
string4 = "<x>-methyl<ALK><END>oate<ALK>"
s <- strsplit(string4, "(<|>)", perl = TRUE)[[1]]
out <- list()
for(i in seq_along(s)) {
out[[i]] <- ifelse (s[i] %in% names(d), d[s[i]], s[i])
}
d$name <- do.call(paste0, unlist(out, recursive=F))
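The loops above rebuild the string from its split pieces. As a rough alternative sketch (expand_template is a hypothetical helper name, not something from the answers here), the same result can be had by expanding the rules with expand.grid and then substituting each placeholder with a fixed-string gsub, which also copes with a placeholder that occurs more than once:

```r
# Hypothetical helper, not part of the answers above: a sketch only.
expand_template <- function(template, rules) {
  # One row per combination of replacement values (first rule varies fastest)
  grid <- expand.grid(rules, stringsAsFactors = FALSE)
  name <- rep(template, nrow(grid))
  for (j in seq_along(rules)) {
    # Replace every occurrence of placeholder j, row by row;
    # fixed = TRUE treats "<x>" as a literal string, not a regex
    name <- mapply(gsub, replacement = as.character(grid[[j]]), x = name,
                   MoreArgs = list(pattern = names(rules)[j], fixed = TRUE),
                   USE.NAMES = FALSE)
  }
  # Strip the angle brackets so the columns are called x, ALK, END
  names(grid) <- gsub("[<>]", "", names(rules))
  data.frame(name = name, grid, stringsAsFactors = FALSE)
}

replrules <- list("<x>" = c(3, 5), "<ALK>" = c("hept", "oct", "non"),
                  "<END>" = c("ane", "ene"))
res <- expand_template("<x>-methyl<ALK><END>", replrules)
head(res, 3)
```

Note that a rule whose placeholder does not appear in the template still multiplies the rows, so in that case you may want to pass only the rules you need.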

Well, I'm not sure we can even produce a "correct" answer to your question, but hopefully this gives you some ideas.
Okay, so in s, I just split the string where it might be of most importance. Then g gets the first value in each element of replrules. Then I constructed a one-row data frame, dat, as an example of how the result would look.
> (s <- strsplit(string, "(?<=l|\\>)", perl = TRUE)[[1]])
# [1] "<x>" "-methyl" "<ALK>" "<END>"
> g <- sapply(replrules, "[", 1)
> dat <- data.frame(name = paste(append(g, s[2], after = 1), collapse = ""))
> dat[2:4] <- g
> names(dat)[2:4] <- sapply(strsplit(names(g), "<|>"), "[", -1)
> dat
# name x ALK END
# 1 3-methylheptane 3 hept ane

Related

Vectorisation of a similar string finding loop

I have a large vector of strings like this:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
I want to fetch similar strings for each string from the same vector d.
I am doing this by
1. calculating for each string the edit distance with all other strings, based on certain rules such as forcing exact matching if any digits are present or if the number of alphabetic characters is less than 5.
2. putting it in a dataframe dist along with string.
3. subsetting dist based on distances < 3.
4. collapsing and adding the similar strings to original dataframe as a new column.
I am using the stringr and stringdist packages
d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
for (i in 1:M){
# if string has digits or is of short size (<5) do exact matching
if (grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE || str_count(d[i, "d"], "[[:alpha:]]") < 5){
Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=0.000001) # maxDist as fraction to force exact matching
# otherwise do approximate matching
} else {
Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=3)
}
# subset similar strings (with edit distance <3)
subDist <- subset(Dist, dist < 3 )
# add to original data.frame d
d[i, "sim"] <- paste(as.character(unlist(subDist$string)), collapse=", ")
}
Is it possible to vectorise the procedure instead of using a loop? I have a very large vector of strings, so calculating a distance matrix using stringdistmatrix on the entire vector can't be done due to memory restrictions. The loop works fine for large data, but is very slow.
stringdist has a version for computing all the distances in a matrix, so I think something like this will be an improvement. It's about four times as quick on my computer when run with the 100 reps line included:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
#d <- rep(d, each=100) #make it a bit longer for timing
d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
ind_short <- grepl("[[:digit:]]", d$d, ignore.case=TRUE) == TRUE | str_count(d$d, "[[:alpha:]]") < 5
short <- stringdistmatrix(d$d[ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[!ind_short], d$d, method="lv", maxDist=3)
d$sim[ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
d$sim[!ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
The basic strategy is to split into short and long components, use the matrix form of stringdist, then collapse these using paste and assign to the right places in your d$sim.
Edited to add: in light of your comment about not being able to work on the whole matrix at once, try choosing chunk_length so that stringdistmatrix() works on a chunk_length*M matrix. Of course, if you set it to 1, you're back to your original unvectorised form.
chunk_length <- 100
ind_short <- grepl("[[:digit:]]", d$d, ignore.case=TRUE) == TRUE | str_count(d$d, "[[:alpha:]]") < 5
d$iter <- rep(1:M,each=chunk_length,length.out=M)
for (i in unique(d$iter))
{
in_iter <- (d$iter == i)
short <- stringdistmatrix(d$d[in_iter & ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[in_iter & !ind_short], d$d, method="lv", maxDist=3)
if(sum(in_iter & ind_short)==1) short <- t(short)
if(sum(in_iter & !ind_short)==1) long <- t(long)
if(sum(in_iter & ind_short)>0) d$sim[in_iter & ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
if(sum(in_iter & !ind_short)>0) d$sim[in_iter & !ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
}
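If installing stringdist is not an option, the chunking idea can be sketched with base R alone. This is only an illustration under assumptions: it swaps in utils::adist (generalized Levenshtein distance) for stringdistmatrix, and similar_in_chunks, chunk_length and max_dist are names made up for this example.

```r
# Hypothetical base-R sketch of the chunked approach; adist() stands in
# for stringdistmatrix(), so speed and memory use will differ.
similar_in_chunks <- function(x, chunk_length = 100, max_dist = 2) {
  # Strings with digits or fewer than 5 letters must match exactly
  exact <- grepl("[[:digit:]]", x) | nchar(gsub("[^[:alpha:]]", "", x)) < 5
  sim <- character(length(x))
  # Split the indices into blocks of at most chunk_length strings
  chunks <- split(seq_along(x), ceiling(seq_along(x) / chunk_length))
  for (idx in chunks) {
    dm <- adist(x[idx], x)                    # length(idx) x length(x) matrix
    limit <- ifelse(exact[idx], 0, max_dist)  # per-row distance cutoff
    for (k in seq_along(idx)) {
      sim[idx[k]] <- paste(x[dm[k, ] <= limit[k]], collapse = ", ")
    }
  }
  sim
}

similar_in_chunks(c("market", "markket", "fun", "sim23"), chunk_length = 2)
```

Only one chunk_length by length(x) matrix is held in memory at a time, which is the point of the chunking.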
It's not really an answer, but I thought it might be good to mention that agrep may be useful for you in this project. It does partial pattern matching.
> d <- c("herb", "market", "merchandise", "fun", "casket93",
"old", "herbb", "basket", "bottle", "plastic", "baskket",
"markket", "pasword", "plastik", "oldg", "mahagony",
"mahaagoni", "sim23", "asket", "trump" )
> agr <- sapply(d, function(x) agrep(x, d, value = TRUE))
> head(agr)
$herb
[1] "herb" "herbb"
$market
[1] "market" "markket"
$merchandise
[1] "merchandise"
$fun
[1] "fun"
$casket93
[1] "casket93"
$old
[1] "old" "pasword" "oldg"

algorithm/code in R to find pattern from any position in a string

I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least.
For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f".
In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.
If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.
Thanks!
Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:
vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")
The function that does the matching:
find_rep_path <- function(vec, reps) {
regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
And some tests:
sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "aaaab" "0f0f"
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
#
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"
Note that with threshold as 3, the actual longest pattern for the second string is 0f0f, not 0f (reverts to 0f at threshold 5). In order to do this, I use back references (\\1), and repeat these as many times as necessary to reach threshold. I then need to substr the result because, annoyingly, base R doesn't have an easy way to get just the captured sub expressions when using perl compatible regular expressions. There is probably a not too hard way to do this, but the substr approach works well in this example.
Also, as per the discussion in @G. Grothendieck's answer, here is the version with the cap on length of pattern, which is just adding the limit argument and the slight modification of the regexp.
find_rep_path <- function(vec, reps, limit) {
regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "a" "0f"
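For what it's worth, the captured group can also be read directly from the match object instead of dividing the match length by reps. The sketch below relies on the capture.start and capture.length attributes that regexpr returns when perl = TRUE; find_rep_capture is a hypothetical name and the function handles a single string only.

```r
# Hypothetical variant of find_rep_path for one string: read group 1
# from the attributes of the regexpr() result rather than via nchar()/reps.
find_rep_capture <- function(x, reps) {
  pattern <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse = "")
  m <- regexpr(pattern, x, perl = TRUE)
  if (m == -1L) return(NA_character_)        # no repeated pattern found
  start <- attr(m, "capture.start")[1, 1]    # where group 1 begins
  len <- attr(m, "capture.length")[1, 1]     # how long group 1 is
  substr(x, start, start + len - 1L)
}

find_rep_capture("a0cc0vaaaabaaaabaaaabaa00bvw", 3L)
find_rep_capture("ff00f0f0f0f0f0f0f0f0000", 3L)
```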
find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
pat <- paste0("(.{", k, "})", reps("\\1", th-1))
r <- regexpr(pat, string, perl = TRUE)
if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:
> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
>
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
> joycec <- paste(joyce, collapse = " ")
> system.time(result <- find.string(joycec, len = 25))
user system elapsed
1.36 0.00 1.39
> result
[1] " Hoopsa boyaboy hoopsa!"
ADDED
Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).
find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
r <- regexpr(pat, string, perl = TRUE)
ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}
> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
0 0 0
> result
[1] "w"
REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
This function is not optimized (though it is fast), but I think it is a more R-like way to do this.
1. Get all patterns of a certain length > threshold: vectorized using mapply and substr.
2. Get the occurrences of these patterns and extract the one with maximum occurrence: vectorized using str_locate_all.
3. Repeat steps 1-2 for all lengths and take the one with maximum occurrence.
Here is my code. I am creating two functions, one for steps 1-2 and one for step 3:
library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000"
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <-
function(length=1,ss){
patt = mapply(function(x,y) substr(ss,x,y),
1:(nchar(ss)-length),
(length+1):nchar(ss))
res = str_locate_all(ss,unique(patt))
ll = unlist(lapply(res,length))
list(patt = patt[which.max(ll)],
rep = max(ll))
}
get_pattern_threshold <-
function(ss,threshold =3 ){
res <-
sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
res[,which.max(res['rep',])]
}
some tests:
get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"
$rep
[1] 6
> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"
$rep
[1] 18
Since you want at least three repetitions, there is a nice O(n^2) approach.
For each possible pattern length d, cut the string into parts of length d. In the case of d=5 it would be:
a0cc0
vaaaa
baaaa
baaaa
baa00
bvw
Now look at each pair of subsequent strings A[k] and A[k+1]. If they are equal then there is a pattern of at least two repetitions. Then go further (k+2, k+3) and so on. Finally you also check whether the suffix of A[k-1] and the prefix of A[k+n] fit (where k+n is the first string that doesn't match).
Repeat it for each d starting from some upper bound (at most n/3).
You have n/3 possible lengths, and for each d there are n/d strings of length d to check, which costs O((n/d) d) = O(n) per length. Over all lengths that gives complexity O(n^2).
Maybe not optimal but I found this cutting idea quite neat ;)
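A closely related way to reach the same O(n^2) bound, sketched here as a variant and not as the block-cutting procedure itself, is to compare every character with the one d positions later. A run of at least (th-1)*d matches means some substring of length th*d has period d. The name find_periodic and the argument th are assumptions made for this example.

```r
# Hypothetical sketch: detect a pattern of length d repeated th times by
# comparing the string with itself shifted by d (O(n) work per d).
find_periodic <- function(s, th = 3) {
  stopifnot(th >= 2)
  ch <- strsplit(s, "")[[1]]
  n <- length(ch)
  if (n < th) return("")
  for (d in seq(floor(n / th), 1)) {           # longest candidates first
    same <- ch[seq_len(n - d)] == ch[(d + 1):n]
    runs <- rle(same)
    ends <- cumsum(runs$lengths)
    # a TRUE run of length L means a stretch of length L + d has period d
    hit <- which(runs$values & runs$lengths >= (th - 1) * d)
    if (length(hit) > 0) {
      start <- ends[hit[1]] - runs$lengths[hit[1]] + 1
      return(paste(ch[start:(start + d - 1)], collapse = ""))
    }
  }
  ""
}

find_periodic("a0cc0vaaaabaaaabaaaabaa00bvw")
find_periodic("ff00f0f0f0f0f0f0f0f0000")
```

Like the regex answers, this returns the longest pattern that repeats th times, so the second string gives 0f0f at th = 3.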
For a bounded pattern (i.e not huge) it's best I think to just create all possible substrings first and then count them. This is if the sub-patterns can overlap. If not change the step fun in the loop.
pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)
# all poss strings up to half length of pattern
library(zoo) # provides rollapply
pat=strsplit(pat, "")[[1]] # split into single characters
str.vec=vector()
for(win in 2:reps)
{
str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
}
# the max length string repeated at least 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]
aaaabaa
3
NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.
Here is my solution. It is not optimized (for example, it grows the vector with patterns <- c() ; patterns <- c(patterns, x)) and can be improved, but it is simpler than yours, I think.
I don't understand exactly which pattern should be returned (I just return the most frequent one), but you can adjust the code to what you want.
str <- "a0cc0vaaaabaaaabaaaabaa00bvw"
findPatternMax <- function(str){
nb <- nchar(str):1
length.patt <- rev(nb)
patterns <- c()
for (i in 1:length(nb)){
for (j in 1:nb[i]){
patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
}
}
patt.max <- names(which(table(patterns) == max(table(patterns))))
return(patt.max)
}
findPatternMax(str)
> findPatternMax(str)
[1] "a"
EDIT :
Maybe you want the returned pattern have a min length ?
then you can add a nchar.patt parameter for example :
nchar.patt <- 2 #For a pattern of 2 char min
nb <- nb[length.patt >= nchar.patt]
length.patt <- length.patt[length.patt >= nchar.patt]

Extract part of string

I've been trying to extract something from a string (actually a $call) in R, and it's driving me nuts. If you have:
library(vars)
data <- as.data.frame(matrix(c(runif(40)), ncol=2))
z <- matrix(c(runif(40)), ncol=2)
var.modell <- VAR(data, p = 2, exogen=z, type = "trend")
How do you extract the z? I've tried googling and searching stack overflow. I found this: R extract a part of a string in R
which made me try:
sub(".*?exogen=(.*?)", "\\1", var.modell$call, perl = TRUE)
But it returns:
[1] "VAR" "data" "2" "trend" "z"
What am I doing wrong?
Look at the call object itself:
m <- lm(speed~dist,data=cars)
m$call$data
## cars
You'll want var.modell$call$exogen.
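The reason sub() misbehaves is that a call is a language object, not a string, so it gets coerced to a character vector with one element per component. A small sketch with lm (used here only because it needs no extra package) shows what the extracted component is and how to turn it into a name or a value:

```r
m <- lm(speed ~ dist, data = cars)

arg <- m$call$data         # a symbol (class "name"), not a string
arg_name <- deparse(arg)   # the argument as text: "cars"
arg_value <- eval(arg)     # the object the symbol refers to

class(arg)
arg_name
head(arg_value, 3)
```

The same pattern should carry over to var.modell$call$exogen, assuming z is still visible in the environment where eval() runs.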

How to paste two vectors together and pad at the end?

I would like to paste two character strings together and pad at the end with another character to make the combination a certain length. I was wondering if there was an option to paste that one can pass or another trick that I am missing? I can do this in multiple lines by figuring out the length of each and then calling paste with rep(my_pad_character,N) but I would like to do this in one line.
Ex: pad together "hi", and "hello" and pad with an "a" to make the sequence length 10. the result would be "hihelloaaa"
Here is one option:
s1 <- "hi"
s2 <- "hello"
f <- function(x, y, pad = "a", length = 10) {
out <- paste0(x, y)
nc <- nchar(out)
paste0(out, paste(rep(pad, max(0, length - nc)), collapse = "")) # no padding if already long enough
}
> f(s1, s2)
[1] "hihelloaaa"
You can use the stringr function str_pad
library(stringr)
str_pad(paste0('hi','hello'), side = 'right', width = 10 , pad = 'a')
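If you would rather stay in base R and also want the function to work on whole vectors, a one-line sketch with strrep is possible. pad_right is a hypothetical name, and strings already longer than width are returned unchanged.

```r
# Hypothetical vectorised base-R version of the padding step
pad_right <- function(x, y, pad = "a", width = 10) {
  out <- paste0(x, y)
  # strrep() is vectorised over the repeat count; pmax() avoids negatives
  paste0(out, strrep(pad, pmax(0, width - nchar(out))))
}

pad_right("hi", "hello")
pad_right(c("hi", "a"), c("hello", "b"))
```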

R: How can I replace let's say the 5th element within a string?

I would like to convert a string like be33szfuhm100060 into BESZFUHM0060.
In order to replace the small letters with capital letters I've so far used the gsub function.
test1=gsub("be","BE",test)
Is there a way to tell this function to replace the 3rd and 4th string element? If not, I would really appreciate if you could tell me another way to solve this problem. Maybe there is also a more general solution to change a string element at a certain position into a capital letter whatever the element is?
A couple of observations:
Converting a string to uppercase can be done with toupper, e.g.:
> toupper('be33szfuhm100060')
[1] "BE33SZFUHM100060"
You could use substr to extract a substring by character positions and paste to concatenate strings:
> x <- 'be33szfuhm100060'
> paste(substr(x, 1, 2), substr(x, 5, nchar(x)), sep='')
[1] "beszfuhm100060"
As an alternative, if you are going to be doing this a lot:
String <- function(x="") {
x <- as.character(paste(x, collapse=""))
class(x) <- c("String","character")
return(x)
}
"[.String" <- function(x,i,j,...,drop=TRUE) {
unlist(strsplit(x,""))[i]
}
"[<-.String" <- function(x,i,j,...,value) {
tmp <- x[]
tmp[i] <- String(value)
x <- String(tmp)
x
}
print.String <- function(x, ...) cat(x, "\n")
## try it out
> x <- String("be33szfuhm100060")
> x[3:4] <- character(0)
> x
beszfuhm100060
You can use substring to remove the third and fourth elements.
x <- "be33szfuhm100060"
paste(substring(x, 1, 2), substring(x, 5), sep = "")
If you know what portions of the string you want based on their position(s), use substr or substring. As I mentioned in my comment, you can use toupper to coerce characters to uppercase.
paste( toupper(substr(test,1, 2)),
toupper(substr(test,5,10)),
substr(test,12,nchar(test)),sep="")
# [1] "BESZFUHM00060"
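For the more general part of the question (changing or removing characters at given positions, whatever they are), here is a small sketch built on substr<- and strsplit. The helper names upper_at and drop_at are made up for this example.

```r
# Hypothetical helpers for position-based edits on a single string
upper_at <- function(x, pos) {
  # substr<- replaces in place, one character position at a time
  for (p in pos) substr(x, p, p) <- toupper(substr(x, p, p))
  x
}

drop_at <- function(x, pos) {
  chars <- strsplit(x, "")[[1]]
  keep <- setdiff(seq_along(chars), pos)   # positions that survive
  paste(chars[keep], collapse = "")
}

x <- "be33szfuhm100060"
upper_at(x, 1:2)
drop_at(x, 3:4)
toupper(drop_at(x, c(3, 4, 11)))
```

The last call drops positions 3, 4 and 11 and uppercases the rest, which reproduces the BESZFUHM00060 result shown above.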

Resources