R coding string - string

I have a list of 96 files that I would like to open and perform some functions on the data. I am VERY new to R, and am unsure how to manipulate strings to open the sequential file names. Here is my code below, which clearly does not work:
for (N in (1:96)){
if (N > 10)
TrackID <- "000$N"
}
else{
TrackID <- "00$N"
}
fname_in <- 'input/intersections_track_calibrated_jma_from1951_$TrackID.csv'
fname_out <- 'output/tracks_crossing_regional_polygon_$TrackID.csv'
......data manipulations.....
}
So basically I just need to be able to, for instance, when N=1, reference a file called intersections_track_calibrated_jma_from1951_0001.csv.
Thanks in advance!
Kimberly

I think that what you are looking for is the sprintf() function...
sprintf will save you from having the tests on n, to known how many leading zeros are needed.
Combined with the paste() or paste0 function, producing the desired file name becomes a one-liner.
Indeed it would be possibly to just use the sprintf() function alone, as in
sprintf("intersections_track_calibrated_jma_from1951_%04d.csv", n) but having a function to produce the file names and/or the "TrakID" may allow to hide all these file naming convention details away.
Below, see sprintf() and paste0() in action, in the context of a convenience function created to produce the filename given a number n.
> GetFileName <- function(n)
paste0("intersections_track_calibrated_jma_from1951_",
sprintf("%04d", n),
".csv")
> GetFileName(1)
[1] "intersections_track_calibrated_jma_from1951_0001.csv"
> GetFileName(13)
[1] "intersections_track_calibrated_jma_from1951_0013.csv"
> GetFileName(321)
[1] "intersections_track_calibrated_jma_from1951_0321.csv"
>
Of course, you could make the GetFileName function more versatile by adding parameters, some of which with a default value. In that fashion it could be used for both input and output file name (or any other file prefix/extension). For example:
GetFileName <- function(n, prefix=NA, ext="csv") {
if (is.na(prefix)) {
prefix <- "intersections_track_calibrated_jma_from1951_"
}
paste0(prefix, sprintf("%04d", n), ".", ext)
}
> GetFileName(12)
[1] "intersections_track_calibrated_jma_from1951_0012.csv"
> GetFileName(12, "output/tracks_crossing_regional_polygon_", "txt")
[1] "output/tracks_crossing_regional_polygon_0012.txt"
> GetFileName(12, "output/tracks_crossing_regional_polygon_")
[1] "output/tracks_crossing_regional_polygon_0012.csv"
>

Try using paste and paste0 to generate strings like this instead.
for (N in (1:96)){
if (N > 10)
TrackID <- paste0("000",N)
}
else{
TrackID <- paste0("00",N)
}
fname_in <- paste0('input/intersections_track_calibrated_jma_from1951_',
TrackID.'.csv')
fname_out <- paste0('output/tracks_crossing_regional_polygon_',
TrackID,'.csv')
......data manipulations.....
}
paste0 just saves you from writing sep="" if you don't require a seperator (as in your case)

Related

assign global variable in R function using tm: bug on linux?

Using the package tm in R, I want to transform a corpus with a pretty complicated function, and I need some side effects for storing pertinent information. Since content_transformer requires a specific function format, the easy way is to use <<- in my function. The problem occurs with the code below:
library(tm)
a <- 4
n <- 2
corp<-VCorpus(VectorSource(rep("fish",n)))
(func<-content_transformer(
function(x) {
a <<- 42
return(x)
}))
corp<-tm_map(corp,func)
print(a)
It prints the wrong answer, i.e. 4. But with n=1, it prints the right one. So I assume it is the multi-threading that tm does that fails to behave as scalar R. I guess it's a bug since on windows, it works. (Note: I use R 3.1.2 on linux and R 3.1.1 on windows).
Questions: is it a bug? if yes, a known bug? Is there an easy solution that does not require to refactor the code?
Thanks!
edit: additional example using assing
rm(list=ls())
library(tm)
env <- new.env()
a <- 1
n <- 2
corp<-VCorpus(VectorSource(rep("fish",n)))
(func<-content_transformer(
function(x,e) {
assign("a", 42, envir=e)
print(e)
print(ls.str(e))
return(x)
}))
corp<-tm_map(corp,func,env)
print(env)
print(ls.str(env))
In fact this works accidentally under Windows because mclapply is not defined under windows and it is just a call to to an lapply.
Indeed, when you are calling tm_map , you are using this function :
tm:::tm_map.VCorpus
function (x, FUN, ..., lazy = FALSE)
{
if (lazy) {
fun <- function(x) FUN(x, ...)
if (is.null(x$lazy))
x$lazy <- list(index = rep(TRUE, length(x)), maps = list(fun))
else x$lazy$maps <- c(x$lazy$maps, list(fun))
}
else x$content <- mclapply(content(x), FUN, ...) ## this the important line
x
}
So you can reproduce the "odd/normal" behavior by calling mclapply:
library(parallel)
res <- mclapply(1:2, function(x){a<<- 20;x})
a
[1] 4
a is unchanged and is still equal to 4. This is the normal paralel behavior since we avoid to have side effect. Under windows mcapply is just a call to lapply so the value of the global variable is correctly changed.
pseudo solution
Here better to use lapply if you want to the global side effect, but you can emulate the global variable in a reading specially if you add a as a second argument your function...
func <-
function(x,a) {
a <- 42 ## use a here
x$a <- a ## assign it to the x environment
return(x) ## but the new value of a can not be used by others documents..
}
(Func<-content_transformer(func))
res <- tm_map(corp,Func,a=4)

How to extract different part of 2 strings

I have
x<-c('abczzzdef','abcxxdef')
I want a function
fn(x)
that returns a length 2 vector
[1] 'zzz' 'xx'
How?
(I have tried searching for an answer but search terms like 'partial matching' give me something quite different)
Update
'length 2 vector' means length(fn(x)) is 2 and fn(x)[1] give "zzz" while fn(x)[2] gives "xx".
After trying out the answers provided, I realize I haven't been specific enough.
There will only be 2 strings (in a vector) that I am comparing.
The location of the different parts (zzz and xx) can be anywhere in the string. i.e. it could be x<-c('zzzabcdef','xxabcdef') or it could be at the end. But the 2 strings are always at the same respective place (i.e. both at the beginning, or both at the middle, or both at the end).
zzz and xx are obviously generic names. They could be different things (numbers, alphabet, symbols) and of different length (not necessarily 3 and 2).
Same comment applies to abc and def.
I have got some test cases
x1<-c('abcxxxttt','abczzttt')
x2<-c('abcxxxdef','abczz126gsdef')
x3<-c('xx_x123../t','z_z126gs123../t')
fn(x1) should give "xxx" "zz"
fn(x2) should give "xxx" "zz126gs"
fn(x3) should give "xx_x" "z_z126gs"
x<-c('abczzzdef','abcxxdef')
fn <- function(x) unlist(regmatches(x, gregexpr("(.)\\1+", x)))
fn(x)
# [1] "zzz" "xx"
First of all, it would have been better to include all that detail in the first version of the question. No need to waste people's time coming up with solutions that wont work for you just because you didn't clearly explain what you needed. If you need to change a question that much after it's already been answered, it probably would be best to ask a new question rather than completely changing your first one.
What you are tying to do, find the largest non-shared portion of a string, can be a pretty messy process for a computer. A somewhat standard measure of string dissimilarity is the generalized Levenshtein distance which R has implemented in the adist function. It can produce a string which tells you how to transform one string into another via matches, insertions, deletions, and substitutions. If I find the longest string of matches, I'll have a pretty good idea of where to extract the unique information.
So this method basically focuses on extracting the regions outside of the best matches. Here's the function that does the matching
fn <- function(x) {
ld <- attr(adist(x[1], x[2], counts=T,
costs=c(substitutions=500)),"trafos")[1,1]
starts <- gregexpr("M+", ld)[[1]]
lens <- attr(starts,"match.length")
starts <- as.vector(starts)
ends <- starts + lens - 1
bm <- which.max(lens)
if (starts[bm]==1 | ends[bm]==nchar(ld)) {
#beg/end
for( i in which(starts==1 | ends==nchar(ld))) {
substr(ld, starts[i], ends[i]) <-
paste(rep("X", lens[i]), collapse="")
}
} else {
#middle
substr(ld, starts[bm], ends[bm]) <-
paste(rep("X", lens[bm]), collapse="")
}
tr <- strsplit(ld,"")[[1]]
x1 <- cumsum(tr %in% c("D","M","X"))[!tr %in% c("X","I")]
x2 <- cumsum(tr %in% c("I","M","X"))[!tr %in% c("X","D")]
c(substr(x[1], min(x1), max(x1)), substr(x[2], min(x2), max(x2)))
}
Now we can apply it to your test data
x1 <- c('abcxxxttt','abczzttt')
x2 <- c('abcxxxdef','abczz126gsdef')
x3 <- c('xx_x123../t','z_z126gs123../t')
fn(x1)
# [1] "xxx" "zz"
fn(x2)
# [1] "xxx" "zz126gs"
fn(x3)
# [1] "xx_x" "z_z126gs"
So we get the results you expect. Here I do little error checking. I assume there will always be some overlap and some non-overlapping regions. If that's not true, the function will likely produce an error or unexpected results.
gsub("([^xz]*)([xz]*)([^xz]*)", "\\2", x)
[1] "zzz" "xx"
> getxz <- function(x, str) gsub(paste0("([^",str, ']*)([', str, ']*)([^', str, ']*)'),
"\\2", x)
> getxz(x=x,"xz")
[1] "zzz" "xx"
In response to the new examples I offer these tests which I think provides three successes:
> getxz(x=x1,"xz_")
[1] "xxx" "zz"
> getxz(x=x2,"xz_")
[1] "xxx" "zz"
> getxz(x=x3,"xz_")
[1] "xx_x" "z_z"

Automatically add variable names to elements of a list [duplicate]

This question already has answers here:
Can lists be created that name themselves based on input object names?
(4 answers)
Closed 6 years ago.
I have a list of models, and to make the code easiser to maintain (so roubst to adding and removing models) I'd like to have a single place where I store them and their names. To do this I have to solve the following naming problem.
Upstream, i have generated models in a way that's less efficient than the following (if it was this compressed, i would assign them to their own env).
lmNms <- c( "mod1", "mod2", "mod3", "mod4", "mod5", "mod6")
lapply(lmNms, function(N) assign(N, lm(runif(10) ~ rnorm(10)), env = .GlobalEnv))
Downstream, i have collected the mess into a list:
modelList <- list(mod1, mod2, mod3, mod4, mod5, mod6)
I have an (un-named) lists of variable output, and attach the names as follows:
output <- list(1, 2, 3, 4, 5, 6)
names(output) <- lmNms
I'd like to be able to use the model names from modelList:
modelList <- list(mod1, mod2, mod3, mod4, mod5, mod6)
names(output) <- someFun(modelList)
I'm sure there exists someFun -- but I cannot figure it out ... can this be done?
To be clear, the aim is to do this without using lmNms -- i want to get the names either from modelList, or have them attach at the point that i build modelList (the point is to avoid list(a = a, b=b ...) boilerplate.
The key to this is to re-make the list function to stick on the names when you don't supply the names as well.
listN <- function(...){
anonList <- list(...)
names(anonList) <- as.character(substitute(list(...)))[-1]
anonList
}
With this, you make modelList as follows:
modelList <- listN(mod1, mod2, mod3, mod4, mod5, mod6)
With the names attached:
R> names(modelList)
[1] "mod1" "mod2" "mod3" "mod4" "mod5" "mod6"
A fuller solution is given here, which is robust to the use of a mixture of anonymous and named arguments to list.
listN2 <- function(...){
dots <- list(...)
inferred <- sapply(substitute(list(...)), function(x) deparse(x)[1])[-1]
if(is.null(names(inferred))){
names(dots) <- inferred
} else {
names(dots)[names(inferred) == ""] <- inferred[names(inferred) == ""]
}
dots
}
You can do this with environments:
e <- new.env()
output <- list(1,2,3,4,5,6)
nms <- c( "mod1", "mod2", "mod3", "mod4", "mod5", "mod6")
for(i in 1:length(output)) {
nm <- nms[i]
e[[nm]] <- output[[i]]
}
You can reference items in the environment like any list, or coerce it to a list
> ls(e)
[1] "mod1" "mod2" "mod3" "mod4" "mod5" "mod6"
> e[['mod1']]
[1] 1
> e$mod1
[1] 1
> new_output <- as.list(e)
Since environments act a lot like lists, there is probably an easy way to do it with your original list as well.
Use sapply with simplify=FALSE. This will assign names to the result.
sapply(lmNms, get, simplify=FALSE)

Selecting output lines in chunk output by modifying the default output hook

The knitr book, p. 118, \S 12.3.5, has an example of how to suppress long output by modifying
the output chunk hook, but it isn't at all general because it does it globally for all chunks.
I've tried to generalize that, to allow a chunk option, output.lines, which, if NULL, has no
effect, but otherwise selects and prints only the first output.lines lines. However, this version
seems to have no effect when I try it, and I can't figure out how to tell why.
More generally, I think this is useful enough to be included in knitr, and would be better if one
could specify a range of lines, e.g., output.lines=3:15, as is possible with echo=.
# get the default output hook
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
lines <- options$output.lines
if (is.null(lines)) {
hook_output(x, options) # pass to default hook
}
else {
x <- unlist(stringr::str_split(x, "\n"))
if (length(x) > lines) {
# truncate the output, but add ....
x <- c(head(x, lines), "...\n")
}
# paste these lines together
x <- paste(x, collapse = "\n")
hook_output(x, options)
}
})
Test case:
<<print-painters, output.lines=8>>=
library(MASS)
painters
#
Actually, this solution does work. My actual test example was flawed. Maybe others will find this helpful.

Could I get help implementing a concept, "When a String changes, its type changes"

One day on #haskell, someone mentioned the concept of how a string's type should change when the string changes. This reminded me of some code I have in my project. It keeps bugging me, and I couldn't articulate why. The reason, I now surmise, is because I am not implementing this concept. Here's the code below, followed by some ideas of how I can begin to change it for the better. What I would like is some input to the effect of , "You're on the right track." or , "No, way off.", or "Here's this other thing you should be mindful of.".
> processHTML :: String -> [[String]]
> processHTML htmlFILE =
> let parsedHTML = parseTags htmlFILE
> allTagOpens = sections (~== TagOpen "a" [("href","")]) parsedHTML
> taggedTEXT = head $ map (filter isTagOpen) allTagOpens
> allHREFS = map (fromAttrib "href") taggedTEXT
> allPotentials = map (dropWhile (/= '?')) allHREFS
> removedNulls = filter (not . null) allPotentials
> removedQs = map (drop 1) removedNulls
> in map (splitOn "&") removedQs
The idea here is I'm taking raw HTML and filtering out everything I don't want until I get what I do want. Each let binding represents a stage in filtering. This could be the foundation of a data structure, like so:
> data Stage = Stage1 Foo
> | Stage2 Bar
> | Stage3 Baz
Where Foo Bar and Baz are the appropriate datatype; a String, or TagOpen for example, depending on what stage I am at in the filtering process. I could use this data type to get precise information when I add in the error handling code. Plus, it could help me keep track of what is happening when.
Feedback appreciated.
You're on the right track.
First of all, when you're building a long pipeline like this, you may prefer to compose functions directly:
> processHTML :: String -> [[String]]
> processHTML =
> parseTags
> >>> sections (~== TagOpen "a" [("href","")])
> >>> head $ map (filter isTagOpen)
> >>> map (fromAttrib "href")
> >>> map (dropWhile (/= '?'))
> >>> filter (not . null)
> >>> map (drop 1)
> >>> map (splitOn "&")
This uses Control.Category.(>>>), which is just (at least in this case) flipped function composition.
Now for your actual question, it looks like you're using the tagsoup package for parsing tags. This already does some type changing throughout the pipeline: parseTags generates a Tag, some functions operate on it, and then fromAttrib goes back to a String.
Depending on how much work you'll be doing, I might create a newtype:
newtype QueryElement = QE { unQE :: String } deriving (Eq, Show)
> processHTML :: String -> [[QueryElement]]
> processHTML =
> parseTags
> >>> sections (~== TagOpen "a" [("href","")])
> >>> head $ map (filter isTagOpen)
> >>> map (fromAttrib "href")
> >>> map (dropWhile (/= '?'))
> >>> filter (not . null)
> >>> map (drop 1)
> >>> map (splitOn "&" >>> map QE)
Only the last line has changed here, to add the QE newtype tags to each element.
Depending on your use case, you could take a difference approach. For example, you may want to add more information to the URI instead of just collecting the query variables. Or you might want to fold over the query items and produce a Map String String directly.
Finally, if you're trying to gain type safety, you usually wouldn't make a sum type such as your Stage. This is because each constructor creates a value of the same type, so the compiler can't do any extra checking. Instead you'd create a separate type for each stage:
data Stage1 = Stage1 Foo
data Stage2 = Stage2 Bar
data Stage3 = Stage3 Baz
doStage1 :: Stage1 -> Stage2
doStage2 :: Stage2 -> Stage3
It's easy to create very fine-grained classes and data structures, but at some point they get out of hand. For example, in your functions allPotentials, removedNulls, and removedQs, you may want to just work on Strings. There isn't a lot of semantic meaning that can be attached to the output of those stages, especially as they're partial steps within a slightly larger process.
This page talks about using types to enforce safety of operations, and causing common errors to show up at compile-time. I'm not sure, but I think this is along the lines of what you're trying to implement.
An example of the problem:
You're running a web application that needs to use a database. It generates an SQL query from the username and password (for example) and sends it off to the database server, gets a response, and presents it to the user. This works great for a while. But then a very rude user types in " OR 1 = 1; -- for the username. Can you imagine sending that string to the following query:
SELECT * FROM users WHERE password = "$" AND username = "$";
Disaster!
The basic solution:
1) create a type for strings that are safe to send to the database server (i.e. GoodSQLString)
2) make sure that all GoodSQLString's really are safe (perhaps the constructor passes the argument query string through an escaping function)
3) only allow GoodSQLString's to be sent to the database server from an application
That said, it's hard to say how that translates to your processHTML problem. Perhaps the type signature should be processHTML :: HTML -> [Tags] -- unless it's meaningful to pass in String's that are invalid HTML.

Resources