How to extract the differing parts of 2 strings

I have
x<-c('abczzzdef','abcxxdef')
I want a function
fn(x)
that returns a length 2 vector
[1] 'zzz' 'xx'
How?
(I have tried searching for an answer but search terms like 'partial matching' give me something quite different)
Update
'length 2 vector' means length(fn(x)) is 2 and fn(x)[1] gives "zzz" while fn(x)[2] gives "xx".
After trying out the answers provided, I realize I haven't been specific enough.
There will only be 2 strings (in a vector) that I am comparing.
The differing parts (zzz and xx) can be anywhere in the string, i.e. it could be x<-c('zzzabcdef','xxabcdef') or they could be at the end. But the differing parts are always at the same respective place in both strings (i.e. both at the beginning, both in the middle, or both at the end).
zzz and xx are obviously generic names. They could be different things (numbers, alphabet, symbols) and of different length (not necessarily 3 and 2).
Same comment applies to abc and def.
I have got some test cases
x1<-c('abcxxxttt','abczzttt')
x2<-c('abcxxxdef','abczz126gsdef')
x3<-c('xx_x123../t','z_z126gs123../t')
fn(x1) should give "xxx" "zz"
fn(x2) should give "xxx" "zz126gs"
fn(x3) should give "xx_x" "z_z126gs"

x<-c('abczzzdef','abcxxdef')
fn <- function(x) unlist(regmatches(x, gregexpr("(.)\\1+", x)))
fn(x)
# [1] "zzz" "xx"

First of all, it would have been better to include all that detail in the first version of the question. No need to waste people's time coming up with solutions that won't work for you just because you didn't clearly explain what you needed. If you need to change a question that much after it's already been answered, it would probably be best to ask a new question rather than completely changing your first one.
What you are trying to do, find the largest non-shared portion of a string, can be a pretty messy process for a computer. A somewhat standard measure of string dissimilarity is the generalized Levenshtein distance, which R has implemented in the adist function. It can produce a string which tells you how to transform one string into another via matches, insertions, deletions, and substitutions. If I find the longest run of matches, I'll have a pretty good idea of where to extract the unique information.
So this method basically focuses on extracting the regions outside of the best matches. Here's the function that does the matching
fn <- function(x) {
  # transformation string (M/I/D/S); the high cost discourages substitutions
  ld <- attr(adist(x[1], x[2], counts=TRUE,
                   costs=c(substitutions=500)), "trafos")[1,1]
  # locate every run of matches
  starts <- gregexpr("M+", ld)[[1]]
  lens <- attr(starts, "match.length")
  starts <- as.vector(starts)
  ends <- starts + lens - 1
  bm <- which.max(lens)
  if (starts[bm]==1 || ends[bm]==nchar(ld)) {
    #beg/end
    for (i in which(starts==1 | ends==nchar(ld))) {
      substr(ld, starts[i], ends[i]) <-
        paste(rep("X", lens[i]), collapse="")
    }
  } else {
    #middle
    substr(ld, starts[bm], ends[bm]) <-
      paste(rep("X", lens[bm]), collapse="")
  }
  tr <- strsplit(ld, "")[[1]]
  # positions in each original string that fall outside the masked (X) runs
  x1 <- cumsum(tr %in% c("D","M","X"))[!tr %in% c("X","I")]
  x2 <- cumsum(tr %in% c("I","M","X"))[!tr %in% c("X","D")]
  c(substr(x[1], min(x1), max(x1)), substr(x[2], min(x2), max(x2)))
}
Now we can apply it to your test data
x1 <- c('abcxxxttt','abczzttt')
x2 <- c('abcxxxdef','abczz126gsdef')
x3 <- c('xx_x123../t','z_z126gs123../t')
fn(x1)
# [1] "xxx" "zz"
fn(x2)
# [1] "xxx" "zz126gs"
fn(x3)
# [1] "xx_x" "z_z126gs"
So we get the results you expect. Here I do little error checking. I assume there will always be some overlap and some non-overlapping regions. If that's not true, the function will likely produce an error or unexpected results.
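If, as the update says, the differing parts always sit in the same place and everything around them is shared, a simpler alternative to the alignment approach is to strip the longest common prefix and the longest common suffix. The sketch below is in Python rather than R, and differing_parts is a hypothetical name. It is not the adist-based method above, and it assumes the shared text is exactly a common prefix and/or suffix.

```python
def differing_parts(a, b):
    """Return the parts of a and b left after removing the common prefix and suffix."""
    limit = min(len(a), len(b))

    # length of the longest common prefix
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1

    # length of the longest common suffix that does not overlap the prefix
    s = 0
    while s < limit - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1

    return a[p:len(a) - s], b[p:len(b) - s]


print(differing_parts('abcxxxttt', 'abczzttt'))           # ('xxx', 'zz')
print(differing_parts('abcxxxdef', 'abczz126gsdef'))      # ('xxx', 'zz126gs')
print(differing_parts('xx_x123../t', 'z_z126gs123../t'))  # ('xx_x', 'z_z126gs')
```

One caveat: if the two differing parts happen to start or end with the same character, that character is counted as shared text and dropped from both results.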

gsub("([^xz]*)([xz]*)([^xz]*)", "\\2", x)
[1] "zzz" "xx"
> getxz <- function(x, str) gsub(paste0("([^",str, ']*)([', str, ']*)([^', str, ']*)'),
"\\2", x)
> getxz(x=x,"xz")
[1] "zzz" "xx"
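In response to the new examples I offer these tests. Only the first reproduces the expected output; the second and third stop at the first character outside the class, so the 126gs part is lost: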
> getxz(x=x1,"xz_")
[1] "xxx" "zz"
> getxz(x=x2,"xz_")
[1] "xxx" "zz"
> getxz(x=x3,"xz_")
[1] "xx_x" "z_z"


List comprehension in haskell with let and show, what is it for?

I'm studying project euler solutions and this is the solution of problem 4, which asks to
Find the largest palindrome made from the product of two 3-digit
numbers
problem_4 =
maximum [x | y<-[100..999], z<-[y..999], let x=y*z, let s=show x, s==reverse s]
I understand that this code creates a list such that x is a product of all possible z and y.
However, I'm having trouble understanding what s does here. It looks like everything after | is executed every time a new element of this list is needed, right?
I don't think I understand what's happening here. Shouldn't everything to the right of | be constraints?
A list comprehension is a rather thin wrapper around a do expression:
problem_4 = maximum $ do
y <- [100..999]
z <- [y..999]
let x = y*z
let s = show x
guard $ s == reverse s
return x
Most pieces translate directly; pieces that aren't iterators (<-) or let expressions are treated as arguments to the guard function found in Control.Monad. The effect of guard is to short-circuit the evaluation; for the list monad, this means not executing return x for the particular value of x that led to the false argument.
I don't think I understand what's happening here. Shouldn't everything to the right of | be constraints?
No. The right-hand part is a comma-separated (,) list of "parts", and every part is one of the following three:
a "generator" of the form somevar <- somelist;
a let statement, which can be used, for instance, to introduce a variable that stores a subresult; and
expressions of type Bool that act like a filter.
So it is not some sort of "constraint programming" where one simply can list some constraints and hope that Haskell figures it out (in fact personally that is the difference between a "programming language" and a "specification language": in a programming language you have "control" how the data flows, in a specification language, that is handled by a system that reads your specifications)
Basically an iterator can be compared to a "foreach" loop in many imperative programming languages. A "let" statement can be seen as introducing a temporary variable (but note that in Haskell you do not assign variables, you declare them, so you cannot reassign values). The filter can be seen as an if statement.
So the list comprehension would be equivalent to something in Python like:
def palindromes():
    for y in range(100, 1000):
        for z in range(y, 1000):
            x = y * z
            s = str(x)
            if s == s[::-1]:  # compare the string s, not the number x
                yield x
We thus first iterate over two ranges in a nested way, then declare x to be the product of y and z. With let s = show x, we convert the number (for example 15129) to its string counterpart (for example "15129"). Finally, we use s == reverse s to reverse the string and check whether it equals the original.
Note that there are more efficient ways to test Palindromes, especially for multiplications of two numbers.
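As one possible illustration of that remark, here is a Python sketch that checks for a palindrome arithmetically (no string conversion) and walks the factors downwards so it can stop as soon as no larger product is possible. The function names are hypothetical, and this is only one of several ways to speed things up, not the approach used in the original solution.

```python
def is_palindrome(n):
    """Check whether a non-negative integer reads the same backwards."""
    original, reversed_n = n, 0
    while n > 0:
        reversed_n = reversed_n * 10 + n % 10  # append the last digit of n
        n //= 10
    return original == reversed_n


def largest_palindrome_product(lo=100, hi=999):
    """Largest palindrome y*z with lo <= y <= z <= hi (0 if there is none)."""
    best = 0
    for y in range(hi, lo - 1, -1):
        if y * hi <= best:
            break  # neither this y nor any smaller one can beat best
        for z in range(hi, y - 1, -1):
            x = y * z
            if x <= best:
                break  # products only shrink as z decreases
            if is_palindrome(x):
                best = x
    return best


print(largest_palindrome_product())  # 906609
```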

Haskell: can I use laziness to "abort early" and gain performance?

I'm writing a Haskell program that reads a wordlist of the English language and a rectangular grid of letters such as:
I T O L
I H W S
N H I S
K T S I
and then finds a Hamiltonian path through the grid from the top-left corner that spells out a sequence of English words, such as:
--> $ runghc unpacking.hs < 4x4grid.txt
I THINK THIS IS SLOW
(If there are multiple solutions, it can just print any one it finds and stop looking.)
The naïve, strict approach is to generate a full path and then try to split it up into words. However, assuming that I'm doing this (and currently I am forcing myself to -- see below) I'm spending a lot of time finding paths like:
IINHHTOL...
IINHHTOW...
IINHHWOL...
These are obviously never going to turn out to be words, looking at the first few letters ("IINH" can't be split into words, and no English word contains "NHH".) So, say, in the above grid, I don't want to look at the many[1] paths that begin with IINHH.
Now, my functions look like this:
paths :: Coord -> Coord -> [[Coord]]
paths (w, h) (1, 1) = [[(1, 1), (1, 2), ... (x, y)], ...]
lexes :: Set String -> String -> [[String]]
lexes englishWordset "ITHINKTHISWILLWORK" = [["I", "THINK", "THIS", ...], ...]
paths just finds all the paths worth considering on a (w, h) grid. lexes finds all the ways to chop a phrase up, and is defined as:
lexes language [] = [[]]
lexes language phrase = let
    splits = tail $ zip (inits phrase) (tails phrase)
  in concat [map (w:) (lexes language p') | (w, p') <- splits,
                                            w `S.member` language]
Given "SAMPLESTRING", it looks at "S", then "SA", then "SAM"... as soon as it finds a valid word, it recurses and tries to "lex" the rest of the string. (First it will recurse on "PLESTRING" and try to make phrases with "SAM", but find no way to chop "plestring" up into words, and fail; then it will find ["SAMPLE", "STRING"].)
Of course, for an invalid string above, any hope of being "lazy" is lost by following this approach: in the example from earlier we need to still search beyond a ridiculous phrase like "ITOLSHINHISIST", because maybe "ITOLSHINHISISTK" (one letter longer) might form a valid single word.
I feel like somehow I could use laziness here to improve performance throughout the entire program: if the first few characters of phrase aren't a prefix of any word, we can bail out entirely, stop evaluating the rest of phrase, and thus the rest of the path.[2] Does this make sense at all? Is there some tree-like data structure that will help me check not for set membership, but set "prefix-ness", thereby making checking validity lazier?
[1] Obviously, for a 4x4 grid there are very few of these, but this argument is about the general case: for bigger grids I could skip hundreds of thousands of paths the moment I see they start with "JX".
[2] phrase is just map (grid M.!) path for some Map Coord Char grid read from the input file.
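One tree-like structure that checks "prefix-ness" rather than membership is a prefix tree (trie). The sketch below is in Python rather than Haskell and only illustrates the idea: build_trie, is_prefix and lexes are hypothetical names, the trie is a plain nested dict, and the empty string is used as an end-of-word marker on the assumption that every word consists of non-empty characters. The generator gives up on a phrase as soon as its leading characters are not a prefix of any word.

```python
END = ""  # marker key: a word ends at this node


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def is_prefix(trie, s):
    """True if some word in the trie starts with s."""
    node = trie
    for ch in s:
        node = node.get(ch)
        if node is None:
            return False
    return True


def lexes(trie, phrase):
    """Lazily yield every way to split phrase into words from the trie."""
    if not phrase:
        yield []
        return
    node = trie
    for i, ch in enumerate(phrase):
        node = node.get(ch)
        if node is None:
            return  # phrase[:i+1] is not a prefix of any word: bail out
        if END in node:
            for rest in lexes(trie, phrase[i + 1:]):
                yield [phrase[:i + 1]] + rest


trie = build_trie(["I", "THINK", "THIS", "IS", "SLOW", "SAM", "SAMPLE", "STRING"])
print(is_prefix(trie, "IINH"))                     # False
print(next(lexes(trie, "ITHINKTHISISSLOW")))       # ['I', 'THINK', 'THIS', 'IS', 'SLOW']
```

The same prefix test could be applied while a path is still being extended, so that a path whose letters cannot begin any phrase is abandoned before it is completed.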

Automatically add variable names to elements of a list [duplicate]

This question already has answers here:
Can lists be created that name themselves based on input object names?
(4 answers)
Closed 6 years ago.
I have a list of models, and to make the code easier to maintain (and so robust to adding and removing models) I'd like to have a single place where I store them and their names. To do this I have to solve the following naming problem.
Upstream, I have generated models in a way that's less efficient than the following (if it were this compressed, I would assign them to their own env).
lmNms <- c( "mod1", "mod2", "mod3", "mod4", "mod5", "mod6")
lapply(lmNms, function(N) assign(N, lm(runif(10) ~ rnorm(10)), envir = .GlobalEnv))
Downstream, I have collected the mess into a list:
modelList <- list(mod1, mod2, mod3, mod4, mod5, mod6)
I have an unnamed list of variable output, and attach the names as follows:
output <- list(1, 2, 3, 4, 5, 6)
names(output) <- lmNms
I'd like to be able to use the model names from modelList:
modelList <- list(mod1, mod2, mod3, mod4, mod5, mod6)
names(output) <- someFun(modelList)
I'm sure there exists someFun -- but I cannot figure it out ... can this be done?
To be clear, the aim is to do this without using lmNms. I want to get the names either from modelList, or have them attached at the point where I build modelList (the point is to avoid list(a = a, b=b ...) boilerplate).
The key to this is to re-make the list function to stick on the names when you don't supply the names as well.
listN <- function(...){
  anonList <- list(...)
  # the unevaluated call list(...) holds the argument expressions; drop "list"
  names(anonList) <- as.character(substitute(list(...)))[-1]
  anonList
}
With this, you make modelList as follows:
modelList <- listN(mod1, mod2, mod3, mod4, mod5, mod6)
With the names attached:
R> names(modelList)
[1] "mod1" "mod2" "mod3" "mod4" "mod5" "mod6"
A fuller solution is given here, which is robust to the use of a mixture of anonymous and named arguments to list.
listN2 <- function(...){
  dots <- list(...)
  inferred <- sapply(substitute(list(...)), function(x) deparse(x)[1])[-1]
  if(is.null(names(inferred))){
    # no argument was named: use the inferred names for all of them
    names(dots) <- inferred
  } else {
    # fill in inferred names only where none was supplied
    names(dots)[names(inferred) == ""] <- inferred[names(inferred) == ""]
  }
  dots
}
You can do this with environments:
e <- new.env()
output <- list(1,2,3,4,5,6)
nms <- c( "mod1", "mod2", "mod3", "mod4", "mod5", "mod6")
for(i in seq_along(output)) {
  nm <- nms[i]
  e[[nm]] <- output[[i]]
}
You can reference items in the environment like any list, or coerce it to a list
> ls(e)
[1] "mod1" "mod2" "mod3" "mod4" "mod5" "mod6"
> e[['mod1']]
[1] 1
> e$mod1
[1] 1
> new_output <- as.list(e)
Since environments act a lot like lists, there is probably an easy way to do it with your original list as well.
Use sapply with simplify=FALSE. This will assign names to the result.
sapply(lmNms, get, simplify=FALSE)

Does R have an equivalent of Python's "repr" (or Lisp's "prin1-to-string")?

I occasionally find that it would be useful to get the printed representation of an R object as a character string, like Python's repr function or Lisp's prin1-to-string. Does such a function exist in R? I don't need it to work on complicated or weird objects, just simple vectors and lists.
Edit: I want the string that I would have to type into the console to generate an identical object, not the output of print(object).
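For readers who don't know the Python function mentioned, this short sketch shows the behaviour being asked for: repr returns source text for the object, and for simple literals that text can be evaluated back into an equal object. The sample object is made up for illustration, and ast.literal_eval is used here as the safe way to read the text back.

```python
import ast

obj = [1, 2.5, "three", (4, 5)]

text = repr(obj)                   # a string holding Python source for obj
print(text)                        # [1, 2.5, 'three', (4, 5)]

rebuilt = ast.literal_eval(text)   # parse the literal back into an object
print(rebuilt == obj)              # True
```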
I'm not familiar with the Python/Lisp functions you listed, but I think you want either dput or dump.
x <- data.frame(1:10)
dput(x)
dump("x", file="clipboard")
See ?evaluate in the evaluate package.
EDIT: Poster later clarified in comments that he wanted commands that would reconstruct the object rather than a string that held the print(object) output. In that case evaluate is not what is wanted but dput (as already mentioned by Joshua Ullrich in comments and since I posted has been transferred to an answer) and dump will work. recordPlot and replayPlot will store and replot classic graphics on at least Windows. trellis.last.object will retrieve the last lattice graphics object. Also note that .Last.value holds the very last value at the interactive console.
You can use capture.output:
repr <- function(x) {
  paste(sprintf('%s\n', capture.output(show(x))), collapse='')
}
For a version without the line numbers something along these lines should work:
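repr <- function(x) {
  # cat() has no collapse argument; sep='' joins the newline-terminated lines
  # note: this prints the text and returns NULL rather than returning a string
  cat(sprintf('%s\n', capture.output(show(x))), sep='')
}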
I had exactly the same question. I was wondering if something was built-in for this or if I would need to write it myself. I didn't find anything built-in so I wrote the following functions:
dputToString <- function (obj) {
  con <- textConnection(NULL, open="w")
  tryCatch({dput(obj, con);
            textConnectionValue(con)},
           finally=close(con))
}
dgetFromString <- function (str) {
  con <- textConnection(str, open="r")
  tryCatch(dget(con), finally=close(con))
}
I think this does what you want. Here is a test:
> rep <- dputToString(matrix(1:10,2,5))
> rep
[1] "structure(1:10, .Dim = c(2L, 5L))"
> mat <- dgetFromString(rep)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10

Repeat each element in a string a certain number of times

I'm using the rep() function to repeat each element of a character vector a number of times. Each character vector I have contains information for a state, and I need the first three elements repeated three times and the fourth element repeated six times.
So let's say I have the following character vectors.
al <- c("AlabamaCity", "AlabamaCityST", "AlabamaCityState", "AlabamaZipCode")
ak <- c("AlaskaCity", "AlaskaCityST", "AlaskaCityState", "AlaskaZipCode")
az <- c("ArizonaCity", "ArizonaCityST", "ArizonaCityState", "ArizonaZipCode")
ar <- c("ArkansasCity", "ArkansasCityST", "ArkansasCityState", "ArkansasZipCode")
I want to end up having the following output.
AlabamaCity
AlabamaCity
AlabamaCity
AlabamaCityST
AlabamaCityST
AlabamaCityST
AlabamaCityState
AlabamaCityState
AlabamaCityState
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
...
I was able to get the desired output with the following command, but it's a little inconvenient when I'm running through all fifty states. Plus, I might have another column with 237 cities in Alabama, and I'll inevitably run into problems matching up the names in the first column with the values in the second column.
dat = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)))
dat
dat2 = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)),
city=c(rep("x",each=15), rep("y",each=15)))
dat2
Of course, in real life, the 'x' and 'y' won't be single values.
So my question is whether there is a more efficient way of performing this task. Closely related: when does it become important to ditch procedural programming in favor of OOP in R? (I'm not a programmer, so the second part may be a really stupid question.) More importantly, is this a task where I should look for an OOP-related solution?
According to ?rep, times= can be a vector. So, how about this:
dat <- data.frame(name=rep(al, times=c(3,3,3,6)))
It would also be more convenient if your "state" data were in a list.
stateData <- list(al,ak,az,ar)
Data <- lapply(stateData, function(x) data.frame(name=rep(x, times=c(3,3,3,6))))
Data <- do.call(rbind, Data)
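To make explicit what a vector-valued times= does, here is a small Python sketch of the same pairing: each element is repeated by the count in the matching position. rep_times is a hypothetical helper written for this illustration, not an R or Python built-in.

```python
def rep_times(values, times):
    """Repeat values[i] times[i] times, like R's rep(x, times=c(...))."""
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    return [v for v, n in zip(values, times) for _ in range(n)]


al = ["AlabamaCity", "AlabamaCityST", "AlabamaCityState", "AlabamaZipCode"]
names = rep_times(al, [3, 3, 3, 6])

print(len(names))   # 15
print(names[8:11])  # ['AlabamaCityState', 'AlabamaZipCode', 'AlabamaZipCode']
```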
I think you can combine the times argument of rep with sapply() to work through a list. So first, we need to make our list object:
vars <- list(al, ak, az, ar)
# Iterate through each object in vars. By default, this returns a column for each list item.
# Convert to vector and then to data.frame...This is probably not that efficient.
as.data.frame(as.vector(sapply(vars, function(x) rep(x, times = c(3,3,3,6)))))
1 AlabamaCity
2 AlabamaCity
3 AlabamaCity
4 AlabamaCityST
....snip....
....snip....
57 ArkansasZipCode
58 ArkansasZipCode
59 ArkansasZipCode
60 ArkansasZipCode
You might consider using expand.grid, then paste on the results from that.
