apply functions for xts - excel

My data is currently an xts or zoo object with daily stock prices in the rows and a different company in each column.
library(quantmod)
getSymbols("AAPL;MSFT;YHOO")
closePrices <- merge(Cl(AAPL),Cl(MSFT),Cl(YHOO))
I am still new to R and need some assistance reproducing this Excel function. My first thought was to split the function into numerator and denominator, and then compute the index:
dailyDiff <- abs(diff(closePrices,1))
numerJ <- diff(closePrices,10)
denomJ <- as.xts(rollapply(dailyDiff,11, sum))
idx <- abs(numerJ/denomJ)
This was great because the values for each portion were accurate, but denomJ ends up aligned to the wrong dates. For example, the tail of numerJ goes to 6/21/2012, while the tail of denomJ only goes to 6/14/2012.
The output that I am looking for is:
6/21/2012 = .11
6/20/2012 = .27
6/19/2012 = .46
6/18/2012 = .39
6/15/2012 = .22

It's hard to tell exactly what your problem is without the exact data, but the issue appears to be with rollapply: rollapply will only apply the function to whole intervals unless the argument partial is set to TRUE. Consider the following example:
require(zoo)
#make up some data
mat <- matrix(1:100,ncol=2)
colnames(mat) <- c("x1","x2")
dates <- seq.Date(from=as.Date("2010-01-01"),length.out=50,by="1 day")
zoo.obj <- zoo(mat,dates)
#apply the functions
numerJ <- diff(zoo.obj,10) #dates okay
denomJ <- rollapply(zoo.obj,11, sum,partial=TRUE) #right dates
denomJ2 <- rollapply(zoo.obj,11,sum) #wrong dates
index <- abs(numerJ/denomJ) #right dates

You can use a combination of diff and either runSum or rollapplyr.
#Get the data
library(quantmod)
getSymbols("AAPL")
I think this is what you're trying to do (note the use of the lag argument to diff.xts and the n argument to runSum):
out <- diff(Cl(AAPL), lag=10) / runSum(abs(diff(Cl(AAPL))), n=11)
tail(out['/2012-06-21'])
# AAPL.Close
#2012-06-14 -0.1047297
#2012-06-15 0.2176938
#2012-06-18 0.3888185
#2012-06-19 0.4585821
#2012-06-20 0.2653782
#2012-06-21 0.1117371
Edit
Upon closer review of your question, I do not understand why rollapplyr is not the answer you're looking for. If I take your code, exactly as is, except I change rollapply to rollapplyr, it looks to me like it's exactly the output you're looking for.
dailyDiff <- abs(diff(closePrices,1))
numerJ <- diff(closePrices,10)
denomJ <- as.xts(rollapplyr(dailyDiff,11, sum))
idx <- abs(numerJ/denomJ)
# AAPL.Close MSFT.Close YHOO.Close
#2012-06-14 0.1047297 0.03826531 0.06936416
#2012-06-15 0.2176938 0.35280899 0.25581395
#2012-06-18 0.3888185 0.33161954 0.31372549
#2012-06-19 0.4585821 0.47096774 0.34375000
#2012-06-20 0.2653782 0.32644628 0.23750000
#2012-06-21 0.1117371 0.18997912 0.10256410
Also, note that numerJ and denomJ both end on the same date if you use rollapplyr (which is the same as using rollapply with align="right"):
end(numerJ); end(denomJ)
#[1] "2012-07-20"
#[1] "2012-07-20"
Yahoo Bug
Maybe the problem you're seeing is the Yahoo bug where sometimes -- for example, right now -- Yahoo duplicates the last (chronologically speaking) row of data. If so, try deleting the duplicated row before attempting to use the data for your calculations.
tidx <- tail(index(closePrices), 2)
if(tidx[1] == tidx[2]) {
  closePrices <- closePrices[-NROW(closePrices), ]
}


How to retrieve bbox for osmdata from spatial feature?

How to define the bbox to download OSM data based on the extent of a spatial file?
The following example returns an error message:
...the only allowed values are floats between -90.0 and 90.0
This shows that the bbox values are out of the allowed range. It also shows that the conversion between NAD27 and EPSG:3857 did not place the spatial data where it should be.
I had similar problems with other spatial data. Even though the values were within the allowed range, the data did not appear at the expected place: the downloaded OSM data appeared in a different place than the input spatial file.
library(sf)
library(raster)
library(osmdata)
osm_proj <- "+init=epsg:3857"
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- st_transform(nc, osm_proj)
bbox.nc <- as.vector(extent(nc[22,]))/100000
q <- opq(bbox = bbox.nc) %>%
add_osm_feature(key = 'natural', value = 'water')
osm.water <- osmdata_sf(q)
How do I prepare the bbox so that the downloaded OSM data matches the spatial extent of the input spatial file?
OSM works in lat-lon, which means EPSG:4326. You need to transform the coordinates accordingly. You also don't need raster::extent(); sf::st_bbox() will be sufficient in this use case.
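A minimal sketch of that fix applied to your code, keeping your nc[22, ] selection (the object names are just for illustration):
library(sf)
library(osmdata)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc_wgs84 <- st_transform(nc, 4326)   # OSM expects lat-lon (EPSG:4326)
bbox.nc <- st_bbox(nc_wgs84[22, ])   # bounding box of the same row you selected
q <- opq(bbox = bbox.nc) %>%
  add_osm_feature(key = 'natural', value = 'water')
osm.water <- osmdata_sf(q)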
Or consider the following, fuller example; since this is only a toy example I am not using the whole NC state, but a single county (otherwise timeout errors may occur, which would be a separate kind of problem - this question is about bounding boxes).
library(sf)
library(osmdata)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
strelitz <- st_transform(nc, 4326) %>%
  dplyr::filter(NAME == "Mecklenburg") # as in Charlotte of Mecklenburg-Strelitz
q <- opq(bbox = sf::st_bbox(strelitz)) %>%
  add_osm_feature(key = 'natural', value = 'water') %>%
  osmdata_sf()
plot(st_geometry(strelitz))
plot(st_geometry(q$osm_lines), col = 'blue', add = T)
A shameless plug: I wrote about querying OSM for points of interest a while back; you may find this post interesting :)
https://www.jla-data.net/eng/finding-pois-along-a-route/

Need help working with lists within lists

I'm taking a programming class and have our first assignment. I understand how it's supposed to work, but apparently I haven't hit upon the correct terms to search to get help (and the book is less than useless).
The assignment is to take a provided data set (names and numbers) and perform some manipulation and computation with it.
I'm able to get the names into a list, and I know the general format of the commands I'm giving, but the specifics are evading me. I know that you refer to the numbers as names[0][1], names[1][1], etc., but not how to refer to just the record that is being changed. For example, we have to have the program check if a name begins with a letter that is Q or later; if it does, we double the number associated with that name.
This is what I have so far, with ??? indicating where I know something goes, but not sure what it's called to search for it.
It's homework, so I'm not really looking for answers, but guidance to figure out the right terms to search for my answers. I already found some stuff on the site (like the statistics functions), but just can't find everything the book doesn't even mention.
names = [("Jack",456),("Kayden",355),("Randy",765),("Lisa",635),("Devin",358),("LaWanda",452),("William",308),("Patrcia",256)]
length = len(names)
count = 0
while True
count < length:
if ??? > "Q" # checks if first letter of name is greater than Q
??? # doubles number associated with name
count += 1
print(names) # self-check
numberNames = names # creates new list
import statistics
mean = statistics.mean(???)
median = statistics.median(???)
print("Mean value: {0:.2f}".format(mean))
alphaNames = sorted(numberNames) # sorts names list by name and creates new list
print(alphaNames)
First of all, you need to iterate over your names list. To do so, use a for loop:
for person in names:
print(person)
But names is a list of tuples, so you will need to get the person's name by accessing the first item of the tuple. You do this just like you do with lists:
name = person[0]
score = person[1]
Finally, to get the ASCII code of a character, use the ord() function. That will be helpful for checking whether a name starts with Q or a later letter.
print(ord('A'))
print(ord('Q'))
print(ord('R'))
This should be enough information to get you started.
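Putting those pieces together, a small sketch on a made-up list (just the check, not your full assignment):
people = [("Jack", 456), ("Randy", 765)]   # made-up example data
for person in people:
    name = person[0]
    if ord(name[0]) >= ord('Q'):           # first letter is Q or later
        print(name, "starts with Q or a later letter")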
I see a few parts to your question, so I'll try to separate them out in my response.
check if first letter of name is greater than Q
Hopefully this will help you with the syntax here. Like list, str also supports element access by index with the [] syntax.
$ names = [("Jack",456),("Kayden",355)]
$ names[0]
('Jack', 456)
$ names[0][0]
'Jack'
$ names[0][0][0]
'J'
$ names[0][0][0] < 'Q'
True
$ names[0][0][0] > 'Q'
False
double number associated with name
$ names[0][1]
456
$ names[0][1] * 2
912
"how to refer to just that record that is being changed"
We are trying to update the value associated with the name.
In keeping with my previous code examples, we want to update the value at index 1 of the tuple stored at index 0 in the list called names.
However, tuples are immutable so we have to be a little tricky if we want to use the data structure you're using.
$ names = [("Jack",456), ("Kayden", 355)]
$ names[0]
('Jack', 456)
$ tpl = names[0]
$ tpl = (tpl[0], tpl[1] * 2)
$ tpl
('Jack', 912)
$ names[0] = tpl
$ names
[('Jack', 912), ('Kayden', 355)]
Do this for all tuples in the list
We need to do this for the whole list, and it looks like you were onto that with your while loop. Your counter variable for indexing the list is named count, so just use that to index a specific tuple: names[count][0] for the count-th name or names[count][1] for the count-th number.
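As a sketch of that indexing idea on a made-up list (here every number is doubled; your assignment adds the letter check first):
names = [("Jack", 456), ("Kayden", 355)]   # made-up example data
count = 0
while count < len(names):
    tpl = names[count]
    names[count] = (tpl[0], tpl[1] * 2)    # rebuild the tuple with the doubled number
    count += 1
print(names)  # [('Jack', 912), ('Kayden', 710)]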
using statistics for calculating mean and median
I recommend looking at the documentation for a module when you want to know how to use it. Here is an example for mean:
mean(data)
Return the sample arithmetic mean of data.
$ mean([1, 2, 3, 4, 4])
2.8
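For your data you would first need a plain list of just the numbers; a hypothetical way to build one (variable names are only for illustration):
import statistics
names = [("Jack", 456), ("Kayden", 355), ("Randy", 765)]   # made-up example data
numbers = [person[1] for person in names]                  # just the numbers
print(statistics.mean(numbers))    # roughly 525.33
print(statistics.median(numbers))  # 456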
Hopefully these examples help you with the syntax for continuing your assignment, although this could turn into a long discussion.
The title of your post is "Need help working with lists within lists" ... well, your code example uses a list of tuples
$ names = [("Jack",456),("Kayden",355)]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'tuple'>
$ names = [["Jack",456], ["Kayden", 355]]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'list'>
Notice the difference between the [] and ().
If you are free to structure the data however you like, then I would recommend using a dict (read: dictionary).
I know that you refer to the numbers as names[0][1], names[1][1], etc, but
not how to refer to just that record that is being changed. For
example, we have to have the program check if a name begins with a
letter that is Q or later; if it does, we double the number associated
with that name.
It's not entirely clear what else you have to do in this assignment, but regarding your concerns above, to reference the i-th "record that is being changed" in your names list, simply use names[i]. So, if you want to access the first record in names, use names[0], since indexing in Python begins at zero.
Since each element in your list is a tuple (which can also be indexed), using constructs like names[0][0] and names[0][1] are ways to index the values within the tuple, as you pointed out.
I'm unsure why you're using while True if you're trying to iterate through each name and check whether it begins with "Q". It seems like a for loop would be better, unless your class hasn't gotten there yet.
As for checking whether the first letter is 'Q', str (string) objects are indexed similarly to lists and tuples. To access the first letter in a string, for example, see the following:
>>> my_string = 'Hello'
>>> my_string[0]
'H'
If you give more information, we can help guide you with the statistics piece, as well. But I would first suggest you get some background around mean and median (if you're unfamiliar).

How to find correlation between any combination of arrays

I have 10 data sets and I want to check the correlation between all possible pairs. For example, if I had:
A, B, C, D
I want to check the correlation between AB, AC, AD, BC, etc.
I've been using the CORREL function in Excel, which is fine for small data sets, but if I had 1000 data sets instead of 10, how would I do this?
This solution assumes your datasets are in your global environment and can be "scraped" based on some criterion; in my case, I opted for a ".string" suffix. If not, you have to come up with your own way of putting the names into a string. Another way would be to put all the datasets into a list and work with indices (see the sketch after the example below).
A.string <- runif(5)
B.string <- runif(5)
C.string <- runif(5)
# find variables based on a common string
pairs <- combn(ls(pattern = "\\.string"), 2)
# for each pair, fetch variable and use function cor()
apply(pairs, MARGIN = 2, FUN = function(x) {
cor(get(x[1]), get(x[2]))
})
[1] 0.2586141 0.7106571 0.7119712
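For the list-based alternative mentioned above, a sketch could look like this (it assumes you gather your datasets into a named list yourself):
# hypothetical: all datasets gathered in a named list
datasets <- list(A = runif(5), B = runif(5), C = runif(5))
# all pairs of indices, then cor() on each pair
pairs <- combn(seq_along(datasets), 2)
apply(pairs, MARGIN = 2, FUN = function(i) {
  cor(datasets[[i[1]]], datasets[[i[2]]])
})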

R - paste() invokes as.numeric() when passing data frame to it?

I have a particular problem with R's paste function in combination with the row and column selection of a data frame. It seems that paste always wraps its input arguments in as.numeric(), or something that does a similar job.
Here is a code snippet of what I am doing:
paste(df[1, c("entry1", "entry2")], collapse="; ")
This passes the first row of a data frame df, selecting the columns "entry1" and "entry2". I expected output like this:
"Auffuellung; Holozaen"
Instead I am receiving the concatenated number equivalents (not indices) of the passed data frame entries:
"1; 5"
Calling str(df[1, c("entry1", "entry2")]) on my real data results in the following output (the names are German, don't let that confuse you ;) ):
'data.frame': 1 obs. of 2 variables:
$ Hauptbestandteile: Factor w/ 38 levels "Auffuellung",..: 1
$ Chronografie : Factor w/ 18 levels "Devon","Famennium",..: 5
What am I doing wrong in this case? Until now, I never faced such a problem with the paste-function and I would have never expected something like this to happen. So, how do I solve the problem and get the correct output of concatenated strings instead of concatenated number equivalents?
Thank you in advance!
Your problem is related to the fact that your data are factor variables. paste is pasting the underlying integer codes. This is confusing, and it is not immediately obvious how to get around it. You need to turn the row into a vector using unlist() and it will work as expected...
Example
df <- data.frame( Month = factor(month.name) , Short = factor(month.abb) )
df[ 1 , ]
# Month Short
#1 January Jan
paste( df[ 1 , ] , collapse = "; " )
#[1] "5; 5"
paste( unlist( df[ 1 , ] ) , collapse = "; " )
#[1] "January; Jan"
Of course, when reading your data in, you can avoid strings being automatically converted to factors by passing the stringsAsFactors = FALSE argument to the read.* functions.
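For example, a minimal sketch (the file name is hypothetical; the second part shows converting factor columns of an already-loaded data frame):
# keep character columns as characters when reading (hypothetical file name)
df <- read.csv("my_data.csv", stringsAsFactors = FALSE)
# or convert existing factor columns in place
df[] <- lapply(df, function(x) if (is.factor(x)) as.character(x) else x)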

Calculate 95th percentile of values with grouping variable

I'm trying to calculate the 95th percentile for multiple water quality values grouped by watershed, for example:
Watershed WQ
50500101 62.370661
50500101 65.505046
50500101 58.741477
50500105 71.220034
50500105 57.917249
I reviewed the question Percentile for Each Observation w/r/t Grouping Variable. It seems very close to what I want to do, but it's for EACH observation; I need it for each grouping variable. So ideally:
Watershed WQ - 95th
50500101 x
50500105 y
This can be achieved using the plyr library. We specify the grouping variable Watershed and ask for the 95% quantile of WQ.
library(plyr)
#Random seed
set.seed(42)
#Sample data
dat <- data.frame(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))
#plyr call
ddply(dat, "Watershed", summarise, WQ95 = quantile(WQ, .95))
and the results
Watershed WQ95
1 a 1.353993
2 b 1.461711
I hope I understand your question correctly. Is this what you're looking for?
my.df <- data.frame(group = gl(3, 5), var = runif(15))
aggregate(my.df$var, by = list(my.df$group), FUN = function(x) quantile(x, probs = 0.95))
Group.1 x
1 1 0.6913747
2 2 0.8067847
3 3 0.9643744
EDIT
Based on Vincent's answer,
aggregate(my.df$var, by = list(my.df$group), FUN = quantile, probs = 0.95)
also works (you can skin a cat 1001 ways, I've been told). A side note: you can specify a vector of desired quantiles, say c(0.1, 0.2, 0.3, ...) for deciles (see the sketch after the next example). Or you can try the summary function for some predefined statistics:
aggregate(my.df$var, by = list(my.df$group), FUN = summary)
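As a sketch of the vector-of-quantiles idea on the same toy data, deciles per group:
aggregate(my.df$var, by = list(my.df$group), FUN = quantile, probs = seq(0.1, 0.9, by = 0.1))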
Use a combination of the tapply and quantile functions. For example, if your dataset looks like this:
DF <- data.frame('watershed'=sample(c('a','b','c','d'), 1000, replace=T), wq=rnorm(1000))
Use this:
with(DF, tapply(wq, watershed, quantile, probs=0.95))
In Excel, you're going to want to use an array formula to make this easy. I suggest the following:
{=PERCENTILE(IF($A$2:$A$6 = Watershed ID, $B$2:$B$6), 0.95)}
Column A would be the Watershed ids, and Column B would be the WQ values.
Also, be sure to enter the formula as an array formula. Do so by pressing Ctrl+Shift+Enter when entering the formula.
Using the data.table package you can do:
library(data.table)
set.seed(42)
#Sample data
dt <- data.table(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))
dt[, .(WQ95 = quantile(WQ, .95, na.rm = TRUE)), by = Watershed]
