Forest Plot for Cox Regression stratified by Treatment Effect on a subgroup - survival-analysis

Relative novice to R here.
I am looking to plot treatment effect by subgroup. I believe this was asked before, but without response (How to create forest plots of subgroups by treatment (ggforest))
So for example, using the 'colon' dataset, in the below code sex and rx are treated as separate predictors for status.
model <- coxph( Surv(time, status) ~ sex + rx + adhere,
data = colon )
However, I would like to see if there are different effects of rx on status stratified by sex.
Meaning, I would like to generate a forest plot where for female and for male sex, there will be three plots each - one for each treatment arm.
Simply adding an interaction term does not seem to make any difference.
I also tried creating an indicator variable, but this didn't seem to help either.
colon <- colon %>%
mutate(indicator = factor(case_when(sex==0 & rx=="Obs" ~ "Female-Obs",
sex==0 & rx=="Lev" ~ "Female-Lev",
sex==0 & rx=="Lev+5FU" ~ "Female-FU",
sex==1 & rx=="Obs" ~ "Male-Obs",
sex==1 & rx=="Lev" ~ "Male-Lev",
sex==1 & rx=="Lev+5FU" ~ "Male-FU"), levels=c("Female-Obs", "Female-Lev", "Female-FU", "Male-Obs", "Male-Lev", "Male-FU")))
model <- coxph(Surv(time, status) ~ indicator, data=colon)
I really appreciate your help!


Annual count index from GAM looking at long-term trends by site

I'm interested in estimating a shared, global trend over time for counts monitored at several different sites using generalized additive models (GAMs). I've read this great introduction to hierarchical GAMs (HGAMs) by Pedersen et al. (2019), and I believe I can set up the model as follows (the Pedersen et al. (2019) GS model),
fit_model = gam(count ~ s(year, m = 2) + s(year, site, bs = 'fs', m = 2),
data = count_df,
family = nb(link = 'log'),
method = 'REML')
I can plot the partial effect smooths, look at the fit diagnostics, and everything looks reasonable. My question is how to extract a non-centered annual relative count index? My first thought would be to add the estimated intercept (the average count across sites at the beginning of the time series) to the s(year) smooth (the shared global smooth). But I'm not sure if the uncertainty around that smooth already incorporates uncertainty in the estimated intercept? Or if I need to add that in? All of this was possible thanks to the amazing R libraries mgcv, gratia, and dplyr.
Your way doesn't include the uncertainty in the constant term; it just shifts everything around.
If you want to do this it would be easier to use the constant argument to gratia:::draw.gam():
draw(fit_model, select = "s(year)", constant = coef(fit_model)[1L])
which does what your code does, without as much effort (on your part).
A better way, with {gratia}, seeing as you are using it already, would be to create a data frame containing a sequence of values over the range of year and then use gratia::fitted_values() to generate estimates from the model for those values of year. To get what you want (which seems to be to exclude the random smooth component of the fit, such that you are setting the random component to equal 0 on the link scale) you need to pass that smooth to the exclude argument:
## data to predict at
new_year <- with(count_df,
                 tibble(year = gratia::seq_min_max(year, n = 100),
                        site = factor(levels(site)[1], levels = levels(site))))
## predict
fv <- fitted_values(fit_model, data = new_year, exclude = "s(year,site)")
If you want to read about exclude, see ?predict.gam
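To make the point about the uncertainty concrete, here is a short derivation sketch. The notation is introduced here for illustration and is not from the question: $\hat\beta_0$ is the estimated intercept, $\hat f(t)$ the estimated global smooth at year $t$, $\mathbf{x}_t$ the row of the prediction matrix for year $t$ with the columns of the excluded smooth set to zero, and $\mathbf{V}$ the covariance matrix of all model coefficients.

```latex
\begin{align}
\hat\eta(t) &= \hat\beta_0 + \hat f(t) = \mathbf{x}_t^{\top}\hat{\boldsymbol\beta} \\
\operatorname{Var}\{\hat\eta(t)\} &= \mathbf{x}_t^{\top}\mathbf{V}\,\mathbf{x}_t
  = \operatorname{Var}(\hat\beta_0) + \operatorname{Var}\{\hat f(t)\}
  + 2\operatorname{Cov}\{\hat\beta_0, \hat f(t)\}
\end{align}
```

Shifting the partial effect by a constant moves the curve but leaves its interval based on $\operatorname{Var}\{\hat f(t)\}$ alone. As far as I understand it, predicting from the model with the random smooth excluded uses the full quadratic form, so the intercept's variance and its covariance with the smooth are carried into the interval.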

How to retrieve bbox for osmdata from spatial feature?

How to define the bbox to download OSM data based on the extent of a spatial file?
The following example returns an error message:
...the only allowed values are floats between -90.0 and 90.0
This shows that the bbox values are outside the allowed range. It also shows that the conversion between NAD27 and EPSG:3857 did not put the spatial data at the place where it should be.
With other spatial data I had similar problems. Even though the values were within the allowed range, the data didn't appear at the expected place: the downloaded OSM data appeared at a different place than the input spatial file.
osm_proj <-("+init=epsg:3857")
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- st_transform(nc, osm_proj) <- as.vector(extent(nc[22,]))/100000
q <- opq(bbox = %>%
add_osm_feature(key = 'natural', value = 'water')
osm.water <- osmdata_sf(q)
How do I prepare the bbox so that the downloaded OSM data matches the spatial extent of the input spatial file?
OSM works in lat-lon, which means EPSG:4326. You need to transform the coordinates accordingly. You also don't need raster::extent(); sf::st_bbox() will be sufficient in this use case.
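To illustrate why projected coordinates trigger the "between -90.0 and 90.0" error, here is a minimal Python sketch of the inverse Web Mercator formula. The function names and the sample coordinates are hypothetical and chosen only for illustration. The point is picked to land roughly near Charlotte, NC.

```python
import math

# Radius of the sphere used by EPSG:3857 (Web Mercator), in metres
EARTH_RADIUS_M = 6378137.0

def web_mercator_to_lonlat(x, y):
    """Convert EPSG:3857 metres to (lon, lat) in degrees (EPSG:4326)."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat

def is_valid_lonlat_bbox(xmin, ymin, xmax, ymax):
    """True if the box is a plausible lon/lat bounding box in degrees."""
    return -180.0 <= xmin < xmax <= 180.0 and -90.0 <= ymin < ymax <= 90.0

# Hypothetical bounding box in projected metres (illustrative values only)
merc_bbox = (-9010000.0, 4190000.0, -8990000.0, 4210000.0)
lon_min, lat_min = web_mercator_to_lonlat(merc_bbox[0], merc_bbox[1])
lon_max, lat_max = web_mercator_to_lonlat(merc_bbox[2], merc_bbox[3])

print(is_valid_lonlat_bbox(*merc_bbox))                            # metres are rejected
print(is_valid_lonlat_bbox(lon_min, lat_min, lon_max, lat_max))    # degrees pass
```

In R none of this needs to be done by hand, since st_transform(nc, 4326) performs the conversion. The sketch only shows that metre values are orders of magnitude outside the range a lat-lon bbox allows.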
Or in your context consider this code; as this is only a toy example I am not using the whole NC state, but a single county (otherwise errors on timeout may occur, which would be a separate kind of a problem - this question is about bounding boxes).
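library(sf)
library(dplyr)
library(osmdata)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
strelitz <- st_transform(nc, 4326) %>%
  dplyr::filter(NAME == "Mecklenburg") # as in Charlotte of Mecklenburg-Strelitz
q <- opq(bbox = sf::st_bbox(strelitz)) %>%
  add_osm_feature(key = 'natural', value = 'water') %>%
  osmdata_sf() # run the query and return the result as sf objects
plot(st_geometry(strelitz)) # county outline as the base layer
plot(st_geometry(q$osm_lines), col = 'blue', add = TRUE)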
A shameless plug: I wrote about querying OSM for points of interest a while back, you may find this post interesting :)

Comparing marginal values across two sub-samples in logistic models

I have a logit model with an interaction term. I want to test this interaction coefficient across two sub-samples (divided based on some value of another variable). Since in logistic models the interpretation of interaction coefficients requires marginal analysis (Ai and Norton, 2003), I performed marginal analysis after running the model for each sub-sample. I don't know how to show whether the marginal values differ statistically across these two sub-samples. I am going to use the "birth-weight data" to clarify further what I mean.
webuse lbw
logit low i.smoke##c.lwt ftv ptl if ui==0
su ptl if e(sample), detail
global ptl_mean = r(mean)
display $ptl_mean
global ptl_sd = r(sd)
display $ptl_sd
global ptl_mean_plus_sigma = $ptl_mean + $ptl_sd
display $ptl_mean_plus_sigma
global ptl_mean_plus_twosigma = $ptl_mean + 2*$ptl_sd
display $ptl_mean_plus_twosigma
margins if e(sample), dydx(smoke) at( (mean) _all ptl=(0 $ptl_mean $ptl_mean_plus_sigma $ptl_mean_plus_twosigma )) predict(pr) saving(m1, replace)
logit low i.smoke##c.lwt ftv age ptl if ui==1
su ptl if e(sample), detail
global ptl_mean = r(mean)
display $ptl_mean
global ptl_sd = r(sd)
display $ptl_sd
global ptl_mean_plus_sigma = $ptl_mean + $ptl_sd
display $ptl_mean_plus_sigma
global ptl_mean_plus_twosigma = $ptl_mean + 2*$ptl_sd
display $ptl_mean_plus_twosigma
margins if e(sample), dydx(smoke) at( (mean) _all ptl=(0 $ptl_mean $ptl_mean_plus_sigma $ptl_mean_plus_twosigma )) predict(pr) saving(m2, replace)
Now, I use the user-written command --combomarginsplot-- to combine the two marginal plots:
graph set window fontface "Times New Roman"
#delimit ;
combomarginsplot m1 m2,
labels("ui==0" "ui==1" position(3))
title("Marginal Effects of ptl on low birth weight likelihood", size(medium))
xtitle("ptl", size(small))
ytitle("Change in likelihood", size(small))
noci file1opts(lpattern(shortdash) msymbol(i))
file2opts(lpattern(dash) msymbol(D));
#delimit cr
graph save "Graph" "marginal.gph", replace
The graph will be:
One might say that it is easy to say that the marginal values of the second sub-sample (i.e., ui==1) are not different from zero. Hence, we can conclude that the marginal values when ui==0 are greater than the marginal values when ui==1.
But what if the marginal values in both sub-samples were statistically significant? How can one test whether they differ from each other?

Geospatial fixed radius cluster hunting in python

I want to take an input of millions of lat long points (with a numerical attribute) and then find all fixed radius geospatial clusters where the sum of the attribute within the circle is above a defined threshold.
I started by using sklearn BallTree to sum the attribute within any defined circle, with the intention of then expanding this out to run across a grid or lattice of circles. The run time for one circle is around 0.01s, so this is fine for small lattices, but won't scale if I want to run 200m radius circles across the whole of the UK.
import numpy as np
import pandas
from sklearn.neighbors import BallTree

# Earth radius in km, used to convert km to radians (assumed value)
RADIANT_TO_KM_CONSTANT = 6371

#example data (use 2m rows from postcode centroid file)
df = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000000)
#this will be our grid of points (or lattice) use points from same file for example
df2 = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000)
#reorder lat long columns for balltree input
#(assumes the file stores longitude first, then latitude)
columnTitles = [df.columns[1], df.columns[0]]
df = df.reindex(columns=columnTitles)
df2 = df2.reindex(columns=columnTitles)
# assign new columns to existing dataframe. attribute will hold the data we want to sum over (set to 1 for now)
df['attribute'] = 1
df2['aggregation'] = 0

class BallTreeIndex:
    def __init__(self, lat_longs):
        # BallTree with the haversine metric expects [lat, long] in radians
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, metric='haversine')

    def query_radius(self, query, radius):
        # radius is given in metres and converted to radians on the sphere
        radius_km = radius / 1000
        radius_radiant = radius_km / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        indices = self.ball_tree_index.query_radius(query, r=radius_radiant)
        return indices[0]

#index the base data (first two columns are lat, long after the reorder)
a = BallTreeIndex(df.iloc[:, 0:2].values)
#begin to loop over the lattice to test performance
for i in range(0,100):
    b = df2.iloc[i,0:2]
    output = a.query_radius(b, 200)
    accumulation = sum(df.iloc[output, 2])
    df2.iloc[i,2] = accumulation
It feels as if the above code is really inefficient as I don't need to run the calculation across all circles on my lattice (as most will be well below my threshold - or will have no data points in at all).
Instead of this for loop, is there a better way of scaling this algorithm to give me the most dense circles?
I'm new to python, so any help would be massively appreciated!!
First, don't try to do this on a sphere! GB is small and we have a well-defined geographic projection that will work. So use the oseast1m and osnorth1m columns as X and Y. They are in metres, so there is no need to convert (roughly) to degrees and use Haversine. That should help.
Next add a spatial index to speed up lookups.
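As a rough illustration of the two suggestions so far, here is a minimal pure-Python sketch of a grid-based spatial index working on projected coordinates in metres with plain Euclidean distance. The function names and the tiny sample data are hypothetical. A real run would more likely use an existing index (an R-tree or k-d tree library) than this hand-rolled one.

```python
import math
from collections import defaultdict

def build_grid_index(points, cell_size):
    """Bucket (x, y, value) points into square cells of side cell_size (metres)."""
    grid = defaultdict(list)
    for x, y, value in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[key].append((x, y, value))
    return grid

def sum_within_radius(grid, cell_size, cx, cy, radius):
    """Sum the values of all points within radius of (cx, cy), Euclidean distance."""
    reach = math.ceil(radius / cell_size)   # how many cells the circle can span
    col, row = math.floor(cx / cell_size), math.floor(cy / cell_size)
    r2 = radius * radius
    total = 0
    for i in range(col - reach, col + reach + 1):
        for j in range(row - reach, row + reach + 1):
            # .get() avoids creating empty cells in the defaultdict
            for x, y, value in grid.get((i, j), ()):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r2:
                    total += value
    return total

# Hypothetical eastings/northings in metres with an attribute value
sample_points = [(0, 0, 1), (100, 0, 2), (0, 150, 3), (300, 300, 4), (-50, -50, 5)]
index = build_grid_index(sample_points, cell_size=200)
print(sum_within_radius(index, 200, 0, 0, radius=200))
```

Because only occupied cells are stored, iterating over the keys of the index visits just the parts of the country that contain data, which addresses the concern about evaluating circles that hold no points at all.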
If you need more speed there are various tricks like loading a 2R strip across the country into memory and then running your circles across that strip, then moving down a grid step and updating that strip (checking Y values against a fixed value is quick, especially if you store the data sorted on Y then X value). If you need still more speed, look at any of the papers that Stan Openshaw (and sometimes I) wrote about parallelising the GAM (Geographical Analysis Machine). There are examples of implementing GAM in Python (e.g. this paper, this paper) that may also point to better ways.
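The strip idea can be sketched in pure Python as follows. This is a simplified version under stated assumptions: the function name and sample data are hypothetical, and the strip is re-sliced for each lattice row with a binary search rather than updated incrementally as described above.

```python
from bisect import bisect_left, bisect_right

def sweep_rows(points, centre_xs, centre_ys, radius):
    """Return {(cx, cy): sum of values within radius}, one 2R strip per lattice row."""
    pts = sorted(points, key=lambda p: p[1])      # sort once on Y
    y_keys = [p[1] for p in pts]
    r2 = radius * radius
    result = {}
    for cy in centre_ys:
        # Only points with cy - R <= y <= cy + R can fall inside circles on this row
        lo = bisect_left(y_keys, cy - radius)
        hi = bisect_right(y_keys, cy + radius)
        strip = sorted(pts[lo:hi])                # (x, y, value) tuples sort on X first
        x_keys = [p[0] for p in strip]
        for cx in centre_xs:
            a = bisect_left(x_keys, cx - radius)
            b = bisect_right(x_keys, cx + radius)
            result[(cx, cy)] = sum(
                v for x, y, v in strip[a:b]
                if (x - cx) ** 2 + (y - cy) ** 2 <= r2
            )
    return result

# Hypothetical eastings/northings in metres with an attribute value
sample_points = [(0, 0, 1), (100, 0, 2), (0, 150, 3), (300, 300, 4), (-50, -50, 5)]
sums = sweep_rows(sample_points, centre_xs=[0, 300], centre_ys=[0, 300], radius=200)
dense = {centre: s for centre, s in sums.items() if s >= 4}   # threshold is illustrative
print(dense)
```

The binary searches cut each circle's candidates down to a 2R by 2R square before the exact distance test, so the cost per circle depends on local density rather than on the size of the whole dataset.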

How to deal with temporal correlation/trend in mppm

Good day,
I have been working through Baddeley et al. 2015 to fit a point process model to several point patterns using mppm {spatstat}.
My point patterns are annual count data of large herbivores (i.e. point localities (x, y) of male/female animals * 5 years) in a protected area (owin). I have a number of spatial covariates e.g. distance to rivers (rivD) and vegetation productivity (NDVI).
Originally I fitted a model where herbivore response was a function of rivD + NDVI and allowed the coefficients to vary by sex (see mppm1 in reproducible example below). However, my annual point patterns are not independent between years in that there is a temporally increasing trend (i.e. there are exponentially more animals in year 5 compared to year 1).
So I added year as a random effect, thinking that if I allowed the intercept to change per year I could account for this (see mppm2).
Now I'm wondering if this is the right way to go about it? If I was fitting a GAMM gamm {mgcv} I would add a temporal correlation structure e.g. correlation = corAR1(form=~year) but don't think this is possible in mppm (see mppm3)?
I would really appreciate any ideas on how to deal with this temporal correlation structure in a replicated point pattern with mppm {spatstat}.
Thank you very much
# R version 3.3.1 (64-bit)
library(spatstat) # spatstat version 1.45-2.008
#### Simulate point patterns
# multitype Neyman-Scott process (each cluster is a multitype process)
nclust2 = function(x0, y0, radius, n, types=factor(c("male", "female"))) {
  X = runifdisc(n, radius, centre=c(x0, y0))
  M = sample(types, n, replace=TRUE)
  marks(X) = M
  return(X)
}
year1 = rNeymanScott(5,0.1,nclust2, radius=0.1, n=5)
# plot(year1)
year2 = rNeymanScott(10,0.1,nclust2, radius=0.1, n=5)
# plot(year2)
year2 = rNeymanScott(15,0.1,nclust2, radius=0.1, n=10)
# plot(year2)
year3 = rNeymanScott(20,0.1,nclust2, radius=0.1, n=10)
# plot(year3)
year4 = rNeymanScott(25,0.1,nclust2, radius=0.1, n=15)
# plot(year4)
year5 = rNeymanScott(30,0.1,nclust2, radius=0.1, n=15)
# plot(year5)
#### Simulate distance to rivers
line <- psp(runif(10), runif(10), runif(10), runif(10), window=owin())
# plot(line)
# plot(year1, add=TRUE)
#------------------------ UPDATE ------------------------#
#### Create hyperframe
#---> NDVI simulated with distmap to point patterns (not ideal but just to test)
hyp.years = hyperframe(year=factor(2010:2014),
hyp.years$numYear = with(hyp.years,as.numeric(year)-1)
#### Run mppm models
# mppm1 = mppm(ppp~(NDVI+rivD)/marks,data=hyp.years); summary(mppm1)
# mppm2 = mppm(ppp~(NDVI+rivD)/marks,random = ~1|year,data=hyp.years); summary(mppm2)
# correlation = corAR1(form=~year)
# mppm3 = mppm(ppp~(NDVI+rivD)/marks,correlation = corAR1(form=~year),use.gam = TRUE,data=hyp.years); summary(mppm3)
###---> Run mppm model with annual trend and random variation in growth
mppmCorr = mppm(ppp~(NDVI+rivD+numYear)/marks,random = ~1|year,data=hyp.years)
If there's a trend in population size over time, then it might make sense to include this trend in the systematic part of the model. I would suggest you add a new numeric variable NumYear to the data frame (eg giving the number of years since 2010). Then try adding simple trend terms such as +NumYear to the model formula (this would correspond to the exponential growth in population that you observed.) You can keep the 1|year random effect term which will then allow for random variation in population size around the long term growth trend.
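To see why a linear term in the formula corresponds to exponential growth, recall that the intensity in these models is log-linear in the covariates, so adding beta * NumYear on the log scale multiplies the intensity by exp(beta) every year. A minimal Python sketch of that arithmetic, using hypothetical numbers rather than estimates from the data:

```python
import math

def expected_intensity(baseline, beta_year, num_year):
    """Intensity under a log-linear trend: log(lambda) = log(baseline) + beta_year * num_year."""
    return baseline * math.exp(beta_year * num_year)

# Hypothetical values for illustration, not fitted coefficients
baseline, beta_year = 50.0, 0.4
intensities = [expected_intensity(baseline, beta_year, t) for t in range(5)]

# Ratio between consecutive years is constant and equal to exp(beta_year)
ratios = [later / earlier for earlier, later in zip(intensities, intensities[1:])]
print(ratios)
```

The year random effect then lets each year's intensity deviate from this smooth growth curve by a random multiplicative factor.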
There's no need to split the data patterns for each year into separate male and female patterns. The variable marks in the model formula can be used to specify any model that depends on sex.
I'm pretty sure that mppm with use.gam=TRUE does not recognise the argument correlation and this is probably just ignored. (It depends what happens inside gam).
