I am trying to build a classification decision tree using rpart and partykit, and I am wondering whether there is any function within those packages (or any package, for that matter) that would allow me to create a dataset containing the data from a specific subtree or branch.
I know that I can manually create the subset from the original data set with the decision tree rules, but I am trying to automate certain processes, and finding such a function would help me immensely.
Example:
library(rpart)
library(partykit)
data("Titanic", package = "datasets")
ttnc <- as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ Gender + Age + Class, data = ttnc)
prp <- as.party(rp)
prp[5]
Let's say that I want to extract the data from subtree #5. Is there any function within those packages that allows me to do that?
Thank you!
In addition to the solution posted by @JakobGepp, you can use the data_party() function provided by partykit:
data_party(prp, id = 5)
Essentially, this does the same thing internally that Jakob did explicitly by hand.
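If you want to automate this over all the terminal nodes (as hinted at in the question), something along these lines should work; nodeids() is also exported by partykit (untested sketch):

## collect the data for every terminal node of the tree
terminal_ids <- nodeids(prp, terminal = TRUE)
node_data <- lapply(terminal_ids, function(i) data_party(prp, id = i))
names(node_data) <- terminal_ids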
I don't know if this is what you meant by using the DT rules, but you could use the predict() function from the partykit package to predict the node/branch and then split the data according to your subtree.
ttnc$Node <- predict(prp, newdata = ttnc, type = "node")
subtree <- subset(ttnc, Node == 5)
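If you need the rows for every node at once rather than just node 5, a quick follow-up sketch using base split() on the predicted node column:

## split the full data set by the terminal node each row lands in
node_data <- split(ttnc, ttnc$Node)
node_data[["5"]]  # the same rows as subtree above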
I'm interested in estimating a shared, global trend over time for counts monitored at several different sites using generalized additive models (GAMs). I've read this great introduction to hierarchical GAMs (HGAMs) by Pedersen et al. (2019), and I believe I can set up the model as follows (the Pedersen et al. (2019) GS model):
fit_model = gam(count ~ s(year, m = 2) + s(year, site, bs = 'fs', m = 2),
                data = count_df,
                family = nb(link = 'log'),
                method = 'REML')
I can plot the partial effect smooths, look at the fit diagnostics, and everything looks reasonable. My question is how to extract a non-centered annual relative count index? My first thought would be to add the estimated intercept (the average count across sites at the beginning of the time series) to the s(year) smooth (the shared global smooth). But I'm not sure if the uncertainty around that smooth already incorporates uncertainty in the estimated intercept? Or if I need to add that in? All of this was possible thanks to the amazing R libraries mgcv, gratia, and dplyr.
Your way doesn't include the uncertainty in the constant term; it just shifts everything around.
If you want to do this it would be easier to use the constant argument to gratia:::draw.gam():
draw(fit_model, select = "s(year)", constant = coef(fit_model)[1L])
which does what your code does, without as much effort (on your part).
A better way, with {gratia} (seeing as you are using it already), would be to create a data frame containing a sequence of values over the range of year and then use gratia::fitted_values() to generate estimates from the model for those values of year. To get what you want (which seems to be to exclude the random smooth component of the fit, i.e. to set the random component equal to 0 on the link scale), you need to pass that smooth to the exclude argument:
## data to predict at
new_year <- with(count_df,
                 tibble(year = gratia::seq_min_max(year, n = 100),
                        site = factor(levels(site)[1], levels = levels(site))))
## predict
fv <- fitted_values(fit_model, data = new_year, exclude = "s(year,site)")
If you want to read about exclude, see ?predict.gam
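If it helps, you can then plot the non-centred trend straight from fv. A rough sketch follows; note that the column names of the fitted_values() output (assumed here to be fitted, lower, and upper) have changed across gratia versions, so check names(fv) first:

library(ggplot2)

## column names below are assumptions; adjust to whatever names(fv) reports
ggplot(fv, aes(x = year, y = fitted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  geom_line() +
  labs(y = "estimated count (shared trend, site smooths excluded)")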
How to define the bbox to download OSM data based on the extent of a spatial file?
The following example returns an error message:
...the only allowed values are floats between -90.0 and 90.0
This shows that the bbox values are outside the allowed range. It also shows that the conversion between NAD27 and EPSG:3857 did not place the spatial data where it should be.
I had similar problems with other spatial data: even though the values were within the allowed range, the data didn't appear at the expected place, and the downloaded OSM data appeared at a different place than the input spatial file.
library(sf)
library(raster)
library(osmdata)
osm_proj <- "+init=epsg:3857"
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- st_transform(nc, osm_proj)
bbox.nc <- as.vector(extent(nc[22,]))/100000
q <- opq(bbox = bbox.nc) %>%
  add_osm_feature(key = 'natural', value = 'water')
osm.water <- osmdata_sf(q)
How do I prepare the bbox so that the downloaded OSM data matches the spatial extent of the input spatial file?
OSM works in lat-lon, which means EPSG:4326. You need to transform the coordinates accordingly. You also don't need raster::extent(); sf::st_bbox() will be sufficient in this use case.
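Applied to the query from the question, a minimal sketch would be to skip the EPSG:3857 transform entirely and pass st_bbox() of the lat-lon data straight to opq():

## keep the data in lat-lon (EPSG:4326); no projection to EPSG:3857 needed
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc_ll <- st_transform(nc, 4326)
q <- opq(bbox = st_bbox(nc_ll[22, ])) %>%
  add_osm_feature(key = 'natural', value = 'water')
osm.water <- osmdata_sf(q)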
Or consider this fuller example in your context; as this is only a toy example, I am not using the whole of NC state but a single county (otherwise timeout errors may occur, which would be a separate kind of problem; this question is about bounding boxes).
library(sf)
library(osmdata)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
strelitz <- st_transform(nc, 4326) %>%
  dplyr::filter(NAME == "Mecklenburg") # as in Charlotte of Mecklenburg-Strelitz
q <- opq(bbox = sf::st_bbox(strelitz)) %>%
  add_osm_feature(key = 'natural', value = 'water') %>%
  osmdata_sf()
plot(st_geometry(strelitz))
plot(st_geometry(q$osm_lines), col = 'blue', add = T)
A shameless plug: I wrote about querying OSM for points of interest a while back; you may find this post interesting :)
https://www.jla-data.net/eng/finding-pois-along-a-route/
I found that H2O has the function h2o.deepfeatures in R to pull the hidden-layer features:
https://www.rdocumentation.org/packages/h2o/versions/3.20.0.8/topics/h2o.deepfeatures
train_features <- h2o.deepfeatures(model_nn, train, layer=3)
But I didn't find any example in Python. Can anyone provide some sample code?
Most Python/R API functions are wrappers around REST calls. See http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/model_base.html#ModelBase.deepfeatures
So, to convert an R example to a Python one, make the model the object the method is called on, and all other args shuffle along. I.e. the example from the manual becomes (with dots in variable names changed to underscores):
prostate_hex = ...
prostate_dl = ...
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
Sometimes the function name changes slightly (e.g. h2o.importFile() vs. h2o.import_file()), so you need to hunt for it at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
How do I access elements of a named list by name?
I have 3 functions, all of which return a ListSexpVector of class htest. One of them has 5 elements, ['method', 'parameter', 'statistic', 'p.value', 'data.name']; the others have a different number and order. I am interested in extracting the p.value, statistic, and parameter from this list. In R I can use $, like so:
p.value <- fit$p.value
statistic <- fit$statistic
param <- fit$parameter
The best equivalent I found in rpy2 goes like:
p_val = fit[list(fit.do_slot('names')).index('p.value')]
stat = fit[list(fit.do_slot('names')).index('statistic')]
param = fit[list(fit.do_slot('names')).index('parameter')]
This is quite long-winded. Is there a better (shorter, sweeter, more Pythonic) way?
There is the good-old-fashioned integer based indexing:
p_val = fit[3]
stat = fit[2]
param = fit[1]
But it doesn't work when the positions change, which is a serious limitation because I am fitting 3 different functions and each returns the elements in a different order.
The high-level interface is meant to be friendlier, as the low-level interface is quite close to R's C API. With it one can do:
p_val = fit.rx2('p.value')
or
p_val = fit[fit.names.index('p.value')]
If working with the low-level interface, you will essentially have to implement your own convenience wrapper to reproduce these functionalities. For example:
def dollar(obj, name):
    """R's "$"."""
    return obj[list(obj.do_slot('names')).index(name)]
I'm new to programming in general, and I need some help with accessing a previously created instance of a class. I did some searching on SO but I could not find anything... Maybe it's just because I should not try to do that.
for s in servers:
    c = rconprotocol.Rcon(s[0], s[2], s[1])
    t = threading.Thread(target=c.connect)
    t.start()
    c.messengers(allmessages, 10)
Now, what can I do if I want to call a function on "c"?
Thanks, Hugo
You're creating several different objects that you briefly name c as you go through the loop. If you want to be able to access more than the last of them, you'll need to save them somewhere that won't be overwritten. Probably the best approach is to use a list to hold the successive values, but depending on your specific needs another data structure might make sense too (for instance, using a dictionary you could look up each value by a specific key).
Here's a trivial adjustment to your current code that will save the c values in a list:
c_list = []
for s in servers:
    c = rconprotocol.Rcon(s[0], s[2], s[1])
    t = threading.Thread(target=c.connect)
    t.start()
    c.messengers(allmessages, 10)
    c_list.append(c)
Later you can access any of the c values with c_list[index], or by iterating with for c in c_list.
A slightly more Pythonic version might use a list comprehension rather than append to create the list (this also shows what a loop over c_list later on might look like):
c_list = [rconprotocol.Rcon(s[0], s[2], s[1]) for s in servers]
for c in c_list:
    t = threading.Thread(target=c.connect)
    t.start()
    c.messengers(allmessages, 10)