Vectorisation of a similar string finding loop - string

I have a large vector of strings like this:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
I wan't to fetch similar strings for each string from the same vector d.
I am doing this by
1. calculating for each string the edit distance with all other strings strings based on certain rules such as forcing exact matching if any digits are present or if number of alphabet characters are less than 5.
2. putting it in a dataframe dist along with string.
3. subsetting dist based on distances < 3.
4. collapsing and adding the similar strings to original dataframe as a new column.
I am using the stringr and stringdist packages
d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
for (i in 1:M){
# if string has digits or is of short size (<5) do exact matching
if (grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE || str_count(d[i, "d"], "[[:alpha:]]") < 5){
Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=0.000001) # maxDist as fraction to force exact matching
# otherwise do approximate matching
} else {
Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=3)
}
# subset similar strings (with edit distance <3)
subDist <- subset(Dist, dist < 3 )
# add to original data.frame d
d[i, "sim"] <- paste(as.character(unlist(subDist$string)), collapse=", ")
}
Is it possible to vectorise the procedure instead of using a loop? I have a very large vector of strings, so a calculating a distance matrix using stringdistmatrix on the entire vector can't be done due to memory restrictions. The loop works fine for large data, but is very slow.

stringdist has a version for computing all the distances in a matrix, so I think that something like this will be an improvement, it's about four times as quick on my computer when run with the 100 reps line included:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
#d <- rep(d, each=100) #make it a bit longer for timing
d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
ind_short <- grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE | str_count(d$d, "[[:alpha:]]") < 5
short <- stringdistmatrix(d$d[ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[!ind_short], d$d, method="lv", maxDist=3)
d$sim[ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
d$sim[!ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
The basic strategy is to split into short and long components, and use the matrix form of stringdist, then collapse these using paste, and assign to the right places in your d$sim
Edited to add: in the light of your comment about not being able to work on the whole matrix at once, try choosing chunk_length so that stringdistmatrix() works on a chunk_length*M matrix. Of course, if you set it to 1, you're back to your original unvectorised form
chunk_length <- 100
ind_short <- grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE | str_count(d$d, "[[:alpha:]]") < 5
d$iter <- rep(1:M,each=chunk_length,length.out=M)
for (i in unique(d$iter))
{
in_iter <- (d$iter == i)
short <- stringdistmatrix(d$d[in_iter & ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[in_iter & !ind_short], d$d, method="lv", maxDist=3)
if(sum(in_iter & ind_short)==1) short <- t(short)
if(sum(in_iter & !ind_short)==1) long <- t(long)
if(sum(in_iter & ind_short)>0) d$sim[in_iter & ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
if(sum(in_iter & !ind_short)>0) d$sim[in_iter & !ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
}

It's not really an answer, but I thought it might be good to mention that agrep may be useful for you in this project. It does partial pattern matching.
> d <- c("herb", "market", "merchandise", "fun", "casket93",
"old", "herbb", "basket", "bottle", "plastic", "baskket",
"markket", "pasword", "plastik", "oldg", "mahagony",
"mahaagoni", "sim23", "asket", "trump" )
> agr <- sapply(d, function(x) agrep(x, d, value = TRUE))
> head(agr)
$herb
[1] "herb" "herbb"
$market
[1] "market" "markket"
$merchandise
[1] "merchandise"
$fun
[1] "fun"
$casket93
[1] "casket93"
$old
[1] "old" "pasword" "oldg"

Related

R: combinatorial string replacement

I am on the lookout for a gsub based function which would enable me to do combinatorial string replacement, so that if I would have an arbitrary number of string replacement rules
replrules=list("<x>"=c(3,5),"<ALK>"=c("hept","oct","non"),"<END>"=c("ane","ene"))
and a target string
string="<x>-methyl<ALK><END>"
it would give me a dataframe with the final string name and the substitutions that were made as in
name x ALK END
3-methylheptane 3 hept ane
5-methylheptane 5 hept ane
3-methyloctane 3 oct ane
5-methyloctane 5 ... ...
3-methylnonane 3
5-methylnonane 5
3-methylheptene 3
5-methylheptene 5
3-methyloctene 3
5-methyloctene 5
3-methylnonene 3
5-methylnonene 5
The target string would be of arbitrary structure, e.g. it could also be string="1-<ALK>anol" or each pattern could occur several times, as in string="<ALK>anedioic acid, di<ALK>yl ester"
What would be the most elegant way to do this kind of thing in R?
How about
d <- do.call(expand.grid, replrules)
d$name <- paste0(d$'<x>', "-", "methyl", d$'<ALK>', d$'<END>')
EDIT
This seems to work (substituting each of these into the strplit)
string = "<x>-methyl<ALK><END>"
string2 = "<x>-ethyl<ALK>acosane"
string3 = "1-<ALK>anol"
Using Richards regex
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
s <- strsplit(string3, "(<|>)", perl = TRUE)[[1]]
out <- list()
for(i in s) {
out[[i]] <- ifelse (i %in% names(d), d[i], i)
}
d$name <- do.call(paste0, unlist(out, recursive=F))
EDIT
This should work for repeat items
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
string4 = "<x>-methyl<ALK><END>oate<ALK>"
s <- strsplit(string4, "(<|>)", perl = TRUE)[[1]]
out <- list()
for(i in seq_along(s)) {
out[[i]] <- ifelse (s[i] %in% names(d), d[s[i]], s[i])
}
d$name <- do.call(paste0, unlist(out, recursive=F))
Well, I'm not exactly sure we can even produce a "correct" answer to your question, but hopefully this helps give you some ideas.
Okay, so in s, I just split the string where it might be of most importance. Then g gets the first value in each element of r. Then I constructed a data frame as an example. So then dat is a one row example of how it would look.
> (s <- strsplit(string, "(?<=l|\\>)", perl = TRUE)[[1]])
# [1] "<x>" "-methyl" "<ALK>" "<END>"
> g <- sapply(replrules, "[", 1)
> dat <- data.frame(name = paste(append(g, s[2], after = 1), collapse = ""))
> dat[2:4] <- g
> names(dat)[2:4] <- sapply(strsplit(names(g), "<|>"), "[", -1)
> dat
# name x ALK END
# 1 3-methylheptane 3 hept ane

Password generator function in R

I am looking for a smart way to code a password generator function in R:
generate.password (length, capitals, numbers)
length: the length of the password
capitals: a vector of defining where capitals shall occur, vector reflects the corresponsing password string position, default should be no capitals
numbers: a vector defining where capitals shall occur, vector reflects the corresponsing password string position, default should be no numbers
Examples:
generate.password(8)
[1] "hqbfpozr"
generate.password(length=8, capitals=c(2,4))
[1] "hYbFpozr"
generate.password(length=8, capitals=c(2,4), numbers=c(7:8))
[1] "hYbFpo49"
There is function which generates random strings in the stringi (version >= 0.2-3) package:
require(stringi)
stri_rand_strings(n=2, length=8, pattern="[A-Za-z0-9]")
## [1] "90i6RdzU" "UAkSVCEa"
So using different patterns you can generate parts for your desired password and then paste it like this:
x <- stri_rand_strings(n=4, length=c(2,1,2,3), pattern=c("[a-z]","[A-Z]","[0-9]","[a-z]"))
x
## [1] "ex" "N" "81" "tsy"
stri_flatten(x)
## [1] "exN81tsy"
Here's one approach
generate.password <- function(length,
capitals = integer(0),
numbers = integer(0)) {
stopifnot(is.numeric(length), length > 0L,
is.numeric(capitals), capitals > 0L, capitals <= length,
is.numeric(numbers), numbers > 0L, numbers <= length,
length(intersect(capitals, numbers)) == 0L)
lc <- sample(letters, length, replace = TRUE)
uc <- sample(LETTERS, length(capitals), replace = TRUE)
num <- sample(0:9, length(numbers), replace = TRUE)
pass <- lc
pass[capitals] <- uc
pass[numbers] <- num
paste0(pass, collapse = "")
}
## Examples
set.seed(1)
generate.password(8)
# [1] "gjoxfxyr"
set.seed(1)
generate.password(length=8, capitals=c(2,4))
# [1] "gQoBfxyr"
set.seed(1)
generate.password(length=8, capitals=c(2,4), numbers=c(7:8))
# [1] "gQoBfx21"
You can also add other special characters in the same fashion. If you want repeated values for letters and numbers, then add replace =TRUE in sample function.
I liked the solution given by #Hadd E. Nuff... and What I did, is the inclusion of digits between 0 and 9, at random... here is the modified solution...
generate.password <- function(LENGTH){
punct <- c("!", "#", "$", "%", "&", "(", ")", "*", "+", "-", "/", ":",
";", "<", "=", ">", "?", "#", "[", "^", "_", "{", "|", "}", "~")
nums <- c(0:9)
chars <- c(letters, LETTERS, punct, nums)
p <- c(rep(0.0105, 52), rep(0.0102, 25), rep(0.02, 10))
pword <- paste0(sample(chars, LENGTH, TRUE, prob = p), collapse = "")
return(pword)
}
generate.password(8)
This will generate very strong passwords like:
"C2~mD20U" # 8 alpha-numeric-specialchar
"+J5Gi3" # 6 alpha-numeric-specialchar
"77{h6RsGQJ66if5" # 15 alpha-numeric-specialchar

Reduce string length by removing contiguous duplicates

I have an R dataframe whith 2 fields:
ID WORD
1 AAAAABBBBB
2 ABCAAABBBDDD
3 ...
I'd like to simplify the words with repeating letters by keeping only the letter and not the duplicates in a repetition:
e.g.: AAAAABBBBB should give me AB
and ABCAAABBBDDD should give me ABCABD
Anyone has an idea on how to do this?
Here's a solution with regex:
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
EDIT: By request, some benchmarking. I added Matthew Lundberg's pattern in the comment, matching any character. It appears that gsub is faster by an order of magnitude, and matching any character is faster than matching letters.
library(microbenchmark)
set.seed(1)
##create sample dataset
x <- apply(
replicate(100,sample(c(LETTERS[1:3],""),10,replace=TRUE))
,2,paste0,collapse="")
##benchmark
xm <- microbenchmark(
SAPPLY = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
,GSUB.LETTER = gsub("([A-Za-z])\\1+","\\1",x)
,GSUB.ANY = gsub("(.)\\1+","\\1",x)
)
##print results
print(xm)
# Unit: milliseconds
# expr min lq median uq max
# 1 GSUB.ANY 1.433873 1.509215 1.562193 1.664664 3.324195
# 2 GSUB.LETTER 1.940916 2.059521 2.108831 2.227435 3.118152
# 3 SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952
##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
## [1] "AB" "ABCABD"

R: How can I replace let's say the 5th element within a string?

I would like to convert the a string like be33szfuhm100060 into BESZFUHM0060.
In order to replace the small letters with capital letters I've so far used the gsub function.
test1=gsub("be","BE",test)
Is there a way to tell this function to replace the 3rd and 4th string element? If not, I would really appreciate if you could tell me another way to solve this problem. Maybe there is also a more general solution to change a string element at a certain position into a capital letter whatever the element is?
A couple of observations:
Cnverting a string to uppercase can be done with toupper, e.g.:
> toupper('be33szfuhm100060')
> [1] "BE33SZFUHM100060"
You could use substr to extract a substring by character positions and paste to concatenate strings:
> x <- 'be33szfuhm100060'
> paste(substr(x, 1, 2), substr(x, 5, nchar(x)), sep='')
[1] "beszfuhm100060"
As an alternative, if you are going to be doing this alot:
String <- function(x="") {
x <- as.character(paste(x, collapse=""))
class(x) <- c("String","character")
return(x)
}
"[.String" <- function(x,i,j,...,drop=TRUE) {
unlist(strsplit(x,""))[i]
}
"[<-.String" <- function(x,i,j,...,value) {
tmp <- x[]
tmp[i] <- String(value)
x <- String(tmp)
x
}
print.String <- function(x, ...) cat(x, "\n")
## try it out
> x <- String("be33szfuhm100060")
> x[3:4] <- character(0)
> x
beszfuhm100060
You can use substring to remove the third and fourth elements.
x <- "be33szfuhm100060"
paste(substring(x, 1, 2), substring(x, 5), sep = "")
If you know what portions of the string you want based on their position(s), use substr or substring. As I mentioned in my comment, you can use toupper to coerce characters to uppercase.
paste( toupper(substr(test,1, 2)),
toupper(substr(test,5,10)),
substr(test,12,nchar(test)),sep="")
# [1] "BESZFUHM00060"

Implement ROT-13 in R

I'd like a function, that when passed a string containing only letters, rotates each letter in the string through the alphabet by X characters, where X is a parameter of the function. The famous instance of this is when X=13, which is called ROT-13
function <- ROTx(str,x) {
??
}
It's the kind of thing that I'd expect an R wizard could do in just a few lines, whereas I'd end up with 10 or more.
See ?chartr (Examples section):
rot <- function(ch, k = 13) {
p0 <- function(...) paste(c(...), collapse="")
A <- c(letters, LETTERS, " '")
I <- seq_len(k)
chartr(p0(A), p0(c(A[-I], A[I])), ch)
}
or here http://rosettacode.org/wiki/Rot-13#R:
rot13 <- function(x)
{
old <- paste(letters, LETTERS, collapse="", sep="")
new <- paste(substr(old, 27, 52), substr(old, 1, 26), sep="")
chartr(old, new, x)
}
rotX <- function(ch,x)
{ #rotate each letter of a string ch by x letters thru the alphabet, as long as x<=13
old <- paste(letters, LETTERS, collapse="", sep="")
new <- paste(substr(old, 2*x+1, 26*2), substr(old, 1, 26), sep="")
chartr(old, new, ch)
}
This fixes both of the problems I noted in my comment.

Resources