I am populating a new variable of a dataframe, based on string conditions from another variable. I receive the following error msg:
Error in Source == "httpWWW.BGDAILYNEWS.COM" | Source == :
operations are possible only for numeric, logical or complex types
My code is as follows:
County <- ifelse(Source == 'httpWWW.BGDAILYNEWS.COM' | 'WWW.BGDAILYNEWS.COM', 'Warren',
ifelse(Source == 'httpWWW.HCLOCAL.COM' | 'WWW.HCLOCAL.COM', 'Henry',
ifelse(Source == 'httpWWW.KENTUCKY.COM' | 'WWW.KENTUCKY.COM', 'Fayette',
ifelse(Source == 'httpWWW.KENTUCKYNEWERA.COM' | 'WWW.KENTUCKYNEWERA.COM', 'Christian')
)))
I suggest you break down that deeply nested ifelse statement into more manageable chunks.
But the error is telling you that you cannot use | like that. 'a' | 'b' doesn't make sense since its a logical comparison. Instead use %in%:
Source %in% c('htpWWW.BGDAILYNEWS.com', 'WWW.BGDAILYNEWS.COM')
I think... If I understand what you're doing, you will be much better off using multiple assignments:
County = vector(mode='character', length=length(Source))
County[County %in% c('htpWWW.BGDAILYNEWS.com', 'WWW.BGDAILYNEWS.COM')] <- 'Warren'
etc.
You can also use a switch statement for this type of thing:
myfun <- function(x) {
switch(x,
'httpWWW.BGDAILYNEWS.COM'='Warren',
'httpWWW.HCLOCAL.COM'='Henry',
etc...)
}
Then you want to do a simple apply (sapply) passing each element in Source to myfun:
County = sapply(Source, myfun)
Or finally, you can use factors and levels, but I'll leave that as an exercise to the reader...
A different approach:
county <- c("Warren","Henry","Fayette","Christian")
sites <- c("WWW.BGDAILYNEWS.COM","WWW.HCLOCAL.COM","WWW.KENTUCKY.COM","WWW.KENTUCKYNEWERA.COM")
County <- county[match(gsub("^http","",Source), sites)]
This will return NA for strings that do no match any of the given inputs.
Using Hadley's suggestion (lookup-tables-character-subsetting):
lookup <- c(WWW.BGDAILYNEWS.COM="Warren", WWW.HCLOCAL.COM="Henry", WWW.KENTUCKY.COM="Fayette", WWW.KENTUCKYNEWERA.COM="Christian")
County <- unname(lookup[gsub("^http","",Source)])
Related
I'm trying to practice Haskell returns and datatypes. I'm trying to pass the following information into the program:
worm = 1:2:3:worm
eel = [1,2,3,1,2,3,1,2,3]
snake = 3:2:1:snake
whale = [1..100]
And i want to create a function that has a switch function to get the data and match it to its definition. For example, in Python:
def compare(str): #for one case and using string to clarify
if str == "1:2:3:worm":
return "worm"
I know the datatypes are lists but causes a lot of confusion. My code is giving me an error of Could not deduce (Num Char) Arising from use of worm
My code:
which :: [a] -> String
which x | x == [1,2,3,1,2,3,1,2,3] = "worm" | x == 3:2:1:snake = "snake" | otherwise = "F"
Is there another approach i'm missing? and why is my function giving me that error?
Two problems:
You can't have a function that returns a list of numbers sometimes and a string other times. That's literally the entire point of a strongly typed language. If you want something like that, you need to use a sum type.
You can't compare infinite lists. You can try, but your program will never finish.
I'm studying project euler solutions and this is the solution of problem 4, which asks to
Find the largest palindrome made from the product of two 3-digit
numbers
problem_4 =
maximum [x | y<-[100..999], z<-[y..999], let x=y*z, let s=show x, s==reverse s]
I understand that this code creates a list such that x is a product of all possible z and y.
However I'm having a problem understanding what does s do here. Looks like everything after | is going to be executed everytime a new element from this list is needed, right?
I don't think I understand what's happening here. Shouldn't everything to the right of | be constraints?
A list comprehension is a rather thin wrapper around a do expression:
problem_4 = maximum $ do
y <- [100..999]
z <- [y..999]
let x = y*z
let s = show x
guard $ s == reverse s
return x
Most pieces translate directly; pieces that aren't iterators (<-) or let expressions are treated as arguments to the guard function found in Control.Monad. The effect of guard is to short-circuit the evaluation; for the list monad, this means not executing return x for the particular value of x that led to the false argument.
I don't think I understand what's happening here. Shouldn't everything to the right of | be constraints?
No, at the right part you see an expression that is a comma-separated (,) list of "parts", and every part is one of the following tree:
an "generator" of the form somevar <- somelist;
a let statement which is an expression that can be used to for instance introduce a variable that stores a subresult; and
expressions of the type boolean that act like a filter.
So it is not some sort of "constraint programming" where one simply can list some constraints and hope that Haskell figures it out (in fact personally that is the difference between a "programming language" and a "specification language": in a programming language you have "control" how the data flows, in a specification language, that is handled by a system that reads your specifications)
Basically an iterator can be compared to a "foreach" loop in many imperative programming languages. A "let" statement can be seen as introducing a temprary variable (but note that in Haskell you do not assign variable, you declare them, so you can not reassign values). The filter can be seen as an if statement.
So the list comprehension would be equivalent to something in Python like:
for y in range(100, 1000):
for z in range(y, 1000):
x = y * z
s = str(x)
if x == x[::-1]:
yield x
We thus first iterate over two ranges in a nested way, then we declare x to be the multiplication of y and z, with let s = show x, we basically convert a number (for example 15129) to its string counterpart (for example "15129"). Finally we use s == reverse s to reverse the string and check if it is equal to the original string.
Note that there are more efficient ways to test Palindromes, especially for multiplications of two numbers.
I am working with a very long list of commodity names (var1). I would like to extract information from this list by creating a second variable (var2) that is equal to 1 if var1 contains certain keywords.
I was using the following code:
g soy = strpos(productsproduced, "Soybeans, ") | strpos(productsproduced, "Soybean, ") | strpos(productsproduced, "soybeans, ")| strpos(productsproduced, "soybean, ") | productsproduced == "Soybeans"
The list is much longer, given that the data was not properly coded, and each name appears in many different ways (as the excerpt in the code sample shows).
I believe that it would be much easier to work with a list (easier to look through the list certainly, and see if I am missing anything, etc.)
Unfortunately, it has been a while since I have worked with loops, but I was thinking something of the sort:
local mylist Soybean soybean Soybeans soybeans Soybeans, soybeans,
forval i = mylist {
g soy = strpos(var1, "`i'")
}
This doesn't quite work, but I am not sure how to code it. One definite issue is that Stata would not know in this case whether I would like it to use the or operator (yes, I would) or the and operator.
The spirit is evident; the details need various fixes.
local mywords Soybean soybean Soybeans soybeans Soybeans, soybeans,
gen soy = 0
foreach w of local mywords {
replace soy = soy | strpos(var1, "`w'")
}
What's crucial is that you need replace inside the loop; otherwise the loop will fail second time round on a generate as the variable already exists.
In fact this example reduces to
gen soy = strpos(var1, "oybean") > 0
on the assumption that oybean wouldn't match anything not wanted.
Standardising to lower case is often helpful
local mywords soybean soybeans soybeans,
gen soy = 0
foreach w of local mywords {
replace soy = soy | strpos(lower(var1), "`w'")
}
I have
x<-c('abczzzdef','abcxxdef')
I want a function
fn(x)
that returns a length 2 vector
[1] 'zzz' 'xx'
How?
(I have tried searching for an answer but search terms like 'partial matching' give me something quite different)
Update
'length 2 vector' means length(fn(x)) is 2 and fn(x)[1] give "zzz" while fn(x)[2] gives "xx".
After trying out the answers provided, I realize I haven't been specific enough.
There will only be 2 strings (in a vector) that I am comparing.
The location of the different parts (zzz and xx) can be anywhere in the string. i.e. it could be x<-c('zzzabcdef','xxabcdef') or it could be at the end. But the 2 strings are always at the same respective place (i.e. both at the beginning, or both at the middle, or both at the end).
zzz and xx are obviously generic names. They could be different things (numbers, alphabet, symbols) and of different length (not necessarily 3 and 2).
Same comment applies to abc and def.
I have got some test cases
x1<-c('abcxxxttt','abczzttt')
x2<-c('abcxxxdef','abczz126gsdef')
x3<-c('xx_x123../t','z_z126gs123../t')
fn(x1) should give "xxx" "zz"
fn(x2) should give "xxx" "zz126gs"
fn(x3) should give "xx_x" "z_z126gs"
x<-c('abczzzdef','abcxxdef')
fn <- function(x) unlist(regmatches(x, gregexpr("(.)\\1+", x)))
fn(x)
# [1] "zzz" "xx"
First of all, it would have been better to include all that detail in the first version of the question. No need to waste people's time coming up with solutions that wont work for you just because you didn't clearly explain what you needed. If you need to change a question that much after it's already been answered, it probably would be best to ask a new question rather than completely changing your first one.
What you are tying to do, find the largest non-shared portion of a string, can be a pretty messy process for a computer. A somewhat standard measure of string dissimilarity is the generalized Levenshtein distance which R has implemented in the adist function. It can produce a string which tells you how to transform one string into another via matches, insertions, deletions, and substitutions. If I find the longest string of matches, I'll have a pretty good idea of where to extract the unique information.
So this method basically focuses on extracting the regions outside of the best matches. Here's the function that does the matching
fn <- function(x) {
ld <- attr(adist(x[1], x[2], counts=T,
costs=c(substitutions=500)),"trafos")[1,1]
starts <- gregexpr("M+", ld)[[1]]
lens <- attr(starts,"match.length")
starts <- as.vector(starts)
ends <- starts + lens - 1
bm <- which.max(lens)
if (starts[bm]==1 | ends[bm]==nchar(ld)) {
#beg/end
for( i in which(starts==1 | ends==nchar(ld))) {
substr(ld, starts[i], ends[i]) <-
paste(rep("X", lens[i]), collapse="")
}
} else {
#middle
substr(ld, starts[bm], ends[bm]) <-
paste(rep("X", lens[bm]), collapse="")
}
tr <- strsplit(ld,"")[[1]]
x1 <- cumsum(tr %in% c("D","M","X"))[!tr %in% c("X","I")]
x2 <- cumsum(tr %in% c("I","M","X"))[!tr %in% c("X","D")]
c(substr(x[1], min(x1), max(x1)), substr(x[2], min(x2), max(x2)))
}
Now we can apply it to your test data
x1 <- c('abcxxxttt','abczzttt')
x2 <- c('abcxxxdef','abczz126gsdef')
x3 <- c('xx_x123../t','z_z126gs123../t')
fn(x1)
# [1] "xxx" "zz"
fn(x2)
# [1] "xxx" "zz126gs"
fn(x3)
# [1] "xx_x" "z_z126gs"
So we get the results you expect. Here I do little error checking. I assume there will always be some overlap and some non-overlapping regions. If that's not true, the function will likely produce an error or unexpected results.
gsub("([^xz]*)([xz]*)([^xz]*)", "\\2", x)
[1] "zzz" "xx"
> getxz <- function(x, str) gsub(paste0("([^",str, ']*)([', str, ']*)([^', str, ']*)'),
"\\2", x)
> getxz(x=x,"xz")
[1] "zzz" "xx"
In response to the new examples I offer these tests which I think provides three successes:
> getxz(x=x1,"xz_")
[1] "xxx" "zz"
> getxz(x=x2,"xz_")
[1] "xxx" "zz"
> getxz(x=x3,"xz_")
[1] "xx_x" "z_z"
I'm using the rep() function to repeat each element in a string a number of times. Each character I have contains information for a state, and I need the first three elements of the character vector repeated three times, and the fourth element repeated five times.
So lets say I have the following character vectors.
al <- c("AlabamaCity", "AlabamaCityST", "AlabamaCityState", "AlabamaZipCode")
ak <- c("AlaskaCity", "AlaskaCityST", "AlaskaCityState", "AlaskaZipCode")
az <- c("ArizonaCity", "ArizonaCityST", "ArizonaCityState", "ArizonaZipCode")
ar <- c("ArkansasCity", "ArkansasCityST", "ArkansasCityState", "ArkansasZipCode")
I want to end up having the following output.
AlabamaCity
AlabamaCity
AlabamaCity
AlabamaCityST
AlabamaCityST
AlabamaCityST
AlabamaCityState
AlabamaCityState
AlabamaCityState
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
AlabamaZipCode
...
I was able to get the desired output with the following command, but it's a little inconvenient when I'm running through all fifty states. Plus, I might have another column with 237 cities in Alabama, and I'll inevitably run into problems matching up the names in the first column with the values in the second column.
dat = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)))
dat
dat2 = data.frame(name=c(rep(al[1:3],each=3), rep(al[4],each=6),
rep(ak[1:3],each=3), rep(ak[4],each=6)),
city=c(rep("x",each=15), rep("y",each=15)))
dat2
Of course, in real life, the 'x' and 'y' won't be single values.
So my question concerns if there is a more efficient way of performing this task. And closely related to the question, when does it become important to ditch procedural programming in favor of OOP in R. (not a programmer, so the second part may be a really stupid question) More importantly, is this a task where I should look for a oop related solution.
According to ?rep, times= can be a vector. So, how about this:
dat <- data.frame(name=rep(al, times=c(3,3,3,6)))
It would also be more convenient if your "state" data were in a list.
stateData <- list(al,ak,az,ar)
Data <- lapply(stateData, function(x) data.frame(name=rep(x, times=c(3,3,3,6))))
Data <- do.call(rbind, Data)
I think you can combine the times() argument of rep to work through a list with sapply(). So first, we need to make our list object:
vars <- list(al, ak, az, ar)
# Iterate through each object in vars. By default, this returns a column for each list item.
# Convert to vector and then to data.frame...This is probably not that efficient.
as.data.frame(as.vector(sapply(vars, function(x) rep(x, times = c(3,3,3,6)))))
1 AlabamaCity
2 AlabamaCity
3 AlabamaCity
4 AlabamaCityST
....snip....
....snip....
57 ArkansasZipCode
58 ArkansasZipCode
59 ArkansasZipCode
60 ArkansasZipCode
You might consider using expand.grid, then paste on the results from that.