Annual count index from GAM looking at long-term trends by site - gam

I'm interested in estimating a shared, global trend over time for counts monitored at several different sites using generalized additive models (gams). I've read this great introduction to hierarchical gams (hgams) by Pederson et al. (2019), and I believe I can setup the model as follows (the Pederson et al. (2019) GS model),
fit_model = gam(count ~ s(year, m = 2) + s(year, site, bs = 'fs', m = 2),
data = count_df,
family = nb(link = 'log'),
method = 'REML')
I can plot the partial effect smooths, look at the fit diagnostics, and everything looks reasonable. My question is how to extract a non-centered annual relative count index? My first thought would be to add the estimated intercept (the average count across sites at the beginning of the time series) to the s(year) smooth (the shared global smooth). But I'm not sure if the uncertainty around that smooth already incorporates uncertainty in the estimated intercept? Or if I need to add that in? All of this was possible thanks to the amazing R libraries mgcv, gratia, and dplyr.

Your way doesn't include the uncertainty in the constant term, it just shifts everything around.
If you want to do this it would be easier to use the constant argument to gratia:::draw.gam():
draw(fit_model, select = "s(year)", constant = coef(fit_model)[1L])
which does what your code does, without as much effort (on your part).
An better way — with {gratia}, seeing as you are using it already — would be to create a data frame containing a sequence of values over the range of year and then use gratia::fitted_values() to generate estimates from the model for those values of year. To get what you want (which seems to be to exclude the random smooth component of the fit, such that you are setting the random component to equal 0 on the link scale) you need to pass that smooth to the exclude argument:
## data to predict at
new_year <- with(count_df,
tibble(year = gratia::seq_min_max(year, n = 100),
site = factor(levels(site)[1], levels = levels(site)))
## predict
fv <- fitted_values(fit_model, data = new_year, exclude = "s(year,site)")
If you want to read about exclude, see ?predict.gam

Related

Contribution analysis versus lca.score

I am interested in which processes/activities contribute most to the Life Cycle Impact Assessment (LCIA) that I am conducting. For this, I run a contribution analysis (see code below). To crosscheck the results of my contribution analysis and to ensure that I get everything right, I wanted to compare the returned contributions with the impact assessment result (lca.score).
The documentation of ca.annotated_top_processes(lca) says: "Returns a list of tuples: (lca score, supply, activity)."
In my understanding, lca.score should be the same value as the sum of all the first values in the tuples that are returned by ca.annotated_top_processes(lca) (the printed values). However, this is not the case. What am I missing? Is there some sort of cut-off applied or did I misunderstand something?
import bw2analyzer as bwa
random_act = db_ei381.random()
lca = bw2data.LCA(
{random_act: 1},
('ReCiPe Midpoint (H) V1.13', 'water depletion', 'WDP')
)
lca.lci()
lca.lcia()
print(lca.score)
# %% Contribution analysis
ca = bwa.ContributionAnalysis()
contributions = ca.annotated_top_processes(lca)
print(sum([i[0] for i in contributions]))
It is not well documented, but you can introduce an argument limit that specifies the number of activities that are considered in the contribution analysis. The default value I think is 25. It is sorted so the most important activities come first. If you write something like this you should see how the result converges to the total score as the number of activities increase:
import matplotlib.pyplot as plt
cutoff = [25,50,100,500,1000,1200]
scores = []
for n in cutoff:
contributions = ca.annotated_top_processes(lca,limit=n)
contr_sum = sum([i[0] for i in contributions])
scores.append(contr_sum)
plt.plot(cutoff,scores)
plt.axhline(lca.score,ls='--',color='r');

How can I use VIPS for image normalization?

I want to normalize the exposure and color palettes of a set of images. For context, this is for training a neural net in image classification on medical images. I'm also doing this for hundreds of thousands of images, so efficiency is very important.
So far I've been using VIPS, specifically PyVIPS, and would prefer a solution using that library. After finding this answer and looking through the documentation, I tried
x = pyvips.Image.new_from_file('test.ndpi')
x = x.hist_norm()
x.write_to_file('test_normalized.tiff')
but that seems to always produce a pure-white image.
You need hist_equal for histogram equalisation.
The main docs are here:
https://libvips.github.io/libvips/API/current/libvips-histogram.html
However, that will be extremely slow for large slide images. It will need to scan the whole slide once to build the histogram, then scan again to equalise it. It would be much faster to find the histogram of a low-res layer, then use that to equalise the high-res one.
For example:
#!/usr/bin/env python3
import sys
import pyvips
# open the slide image and get the number of layers ... we are not fetching
# pixels, so this is quick
x = pyvips.Image.new_from_file(sys.argv[1])
levels = int(x.get("openslide.level-count"))
# find the histogram of the highest level ... again, this should be quick
x = pyvips.Image.new_from_file(sys.argv[1],
level=levels - 1)
hist = x.hist_find()
# from that, compute the transform for histogram equalisation
equalise = hist.hist_cum().hist_norm()
# and use that on the full-res image
x = pyvips.Image.new_from_file(sys.argv[1])
x = x.maplut(equalise)
x.write_to_file(sys.argv[2])
Another factor is that histogram equalisation is non-linear, so it will distort lightness relationships. It can also distort colour relationships and make noise and compression artifacts look crazy. I tried that program on an image I have here:
$ ~/try/equal.py bild.ndpi[level=7] y.jpg
The stripes are from the slide scanner and the ugly fringes from compression.
I think I would instead find image max and min from the low-res level, then use them to do a simple linear stretch of pixel values.
Something like:
x = pyvips.Image.new_from_file(sys.argv[1])
levels = int(x.get("openslide.level-count"))
x = pyvips.Image.new_from_file(sys.argv[1],
level=levels - 1)
mn = x.min()
mx = x.max()
x = pyvips.Image.new_from_file(sys.argv[1])
x = (x - mn) * (256 / (mx - mn))
x.write_to_file(sys.argv[2])
Did you find the new Region feature in pyvips? It makes generating patches for training MUCH faster, up to 100x faster in some cases:
https://github.com/libvips/pyvips/issues/100#issuecomment-493960943

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code in relation to the Kalman Filter and am attempting to iterate through each column of data. What I would like to have happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, I crash the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# intial parameters
for column in data:
n_iter = len(data.index) #number of iterations equal to sample numbers
sz = (n_iter,) # size of array
z = data[column] # observations
Q = 1e-5 # process variance
# allocate space for arrays
xhat=np.zeros(sz) # a posteri estimate of x
P=np.zeros(sz) # a posteri error estimate
xhatminus=np.zeros(sz) # a priori estimate of x
Pminus=np.zeros(sz) # a priori error estimate
K=np.zeros(sz) # gain or blending factor
R = 1.0**2 # estimate of measurement variance, change to see effect
# intial guesses
xhat[0] = z[0]
P[0] = 1.0
for k in range(1,n_iter):
# time update
xhatminus[k] = xhat[k-1]
Pminus[k] = P[k-1]+Q
# measurement update
K[k] = Pminus[k]/( Pminus[k]+R )
xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
P[k] = (1-K[k])*Pminus[k]
# add new data to created dataframe
filtered.assign(a = [xhat])
#create visualization of noise reduction
plt.rcParams['figure.figsize'] = (10, 8)
plt.figure()
plt.plot(z,'k+',label='noisy measurements')
plt.plot(xhat,'b-',label='a posteri estimate')
plt.legend()
plt.title('Estimate vs. iteration step', fontweight='bold')
plt.xlabel('column data')
plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have attempted to plot more figures than the current limit before a warning is created (a parameter you can change but which by default is set to 20). This is because in each iteration of your for loop, you create a new figure. Depending on the size of n_iter, you are opening potentially hundreds or thousands of figures. Each of these figures takes resources to generate and show, so you are creating a very large resource load on your system. Either it is processing very slowly due or is crashing altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop but it seems like each iteration of your loop corresponds to one time step and at each time step you'd like to plot the estimated and actual values. In this case, you need to define a figure and figure options once, outside of the loop, rather than at each iteration. But a better way to do this is probably to generate all of the data you want to plot ahead of time and store it in an easy-to-plot datatype like lists, then plot it once at the end.

Geospatial fixed radius cluster hunting in python

I want to take an input of millions of lat long points (with a numerical attribute) and then find all fixed radius geospatial clusters where the sum of the attribute within the circle is above a defined threshold.
I started by using sklearn BallTree to sum the attribute within any defined circle, with the intention of then expanding this out to run across a grid or lattice of circles. The run time for one circle is around 0.01s, so this is fine for small lattices, but won't scale if I want to run 200m radius circles across the whole of the UK.
#example data (use 2m rows from postcode centroid file)
df = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000000)
#this will be our grid of points (or lattice) use points from same file for example
df2 = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000)
#reorder lat long columns for balltree input
columnTitles=["Y","X"]
df = df.reindex(columns=columnTitles)
df2 = df2.reindex(columns=columnTitles)
# assign new columns to existing dataframe. attribute will hold the data we want to sum over (set to 1 for now)
df['attribute'] = 1
df2['aggregation'] = 0
RADIANT_TO_KM_CONSTANT = 6367
class BallTreeIndex:
def __init__(self, lat_longs):
self.lat_longs = np.radians(lat_longs)
self.ball_tree_index =BallTree(self.lat_longs, metric='haversine')
def query_radius(self,query,radius):
radius_km = radius/1000
radius_radiant = radius_km / RADIANT_TO_KM_CONSTANT
query = np.radians(np.array([query]))
indices = self.ball_tree_index.query_radius(query,r=radius_radiant)
return indices[0]
#index the base data
a=BallTreeIndex(df.iloc[:,0:2])
#begin to loop over the lattice to test performance
for i in range(0,100):
b = df2.iloc[i,0:2]
output = a.query_radius(b, 200)
accumulation = sum(df.iloc[output, 2])
df2.iloc[i,2] = accumulation
It feels as if the above code is really inefficient as I don't need to run the calculation across all circles on my lattice (as most will be well below my threshold - or will have no data points in at all).
Instead of this for loop, is there a better way of scaling this algorithm to give me the most dense circles?
I'm new to python, so any help would be massively appreciated!!
First don't try to do this on a sphere! GB is small and we have a well defined geographic projection that will work. So use the oseast1m and osnorth1m columns as X and Y. They are in metres so no need to convert (roughly) to degrees and use Haversine. That should help.
Next add a spatial index to speed up lookups.
If you need more speed there are various tricks like loading a 2R strip across the country into memory and then running your circles across that strip, then moving down a grid step and updating that strip (checking Y values against a fixed value is quick, especially if you store the data sorted on Y then X value). If you need more speed then look at any of the papers the Stan Openshaw (and sometimes I) wrote about parallelising the GAM. There are examples of implementing GAM in python (e.g. this paper, this paper) that may also point to better ways.

How to deal with temporal correlation/trend in mppm

Good day,
I have been working through Baddeley et al. 2015 to fit a point process model to several point patterns using mppm {spatstat}.
My point patterns are annual count data of large herbivores (i.e. point localities (x, y) of male/female animals * 5 years) in a protected area (owin). I have a number of spatial covariates e.g. distance to rivers (rivD) and vegetation productivity (NDVI).
Originally I fitted a model where herbivore response was a function of rivD + NDVI and allowed the coefficients to vary by sex (see mppm1 in reproducible example below). However, my annual point patterns are not independent between years in that there is a temporally increasing trend (i.e. there are exponentially more animals in year 1 compared to year 5).
So I added year as a random effect, thinking that if I allowed the intercept to change per year I could account for this (see mppm2).
Now I'm wondering if this is the right way to go about it? If I was fitting a GAMM gamm {mgcv} I would add a temporal correlation structure e.g. correlation = corAR1(form=~year) but don't think this is possible in mppm (see mppm3)?
I would really appreciate any ideas on how to deal with this temporal correlation structure in a replicated point pattern with mppm {spatstat}.
Thank you very much
Sandra
# R version 3.3.1 (64-bit)
library(spatstat) # spatstat version 1.45-2.008
#### Simulate point patterns
# multitype Neyman-Scott process (each cluster is a multitype process)
nclust2 = function(x0, y0, radius, n, types=factor(c("male", "female"))) {
X = runifdisc(n, radius, centre=c(x0, y0))
M = sample(types, n, replace=TRUE)
marks(X) = M
return(X)
}
year1 = rNeymanScott(5,0.1,nclust2, radius=0.1, n=5)
# plot(year1)
#-------------------
year2 = rNeymanScott(10,0.1,nclust2, radius=0.1, n=5)
# plot(year2)
#-------------------
year2 = rNeymanScott(15,0.1,nclust2, radius=0.1, n=10)
# plot(year2)
#-------------------
year3 = rNeymanScott(20,0.1,nclust2, radius=0.1, n=10)
# plot(year3)
#-------------------
year4 = rNeymanScott(25,0.1,nclust2, radius=0.1, n=15)
# plot(year4)
#-------------------
year5 = rNeymanScott(30,0.1,nclust2, radius=0.1, n=15)
# plot(year5)
#### Simulate distance to rivers
line <- psp(runif(10), runif(10), runif(10), runif(10), window=owin())
# plot(line)
# plot(year1, add=TRUE)
#------------------------ UPDATE ------------------------#
#### Create hyperframe
#---> NDVI simulated with distmap to point patterns (not ideal but just to test)
hyp.years = hyperframe(year=factor(2010:2014),
ppp=list(year1,year2,year3,year4,year5),
NDVI=list(distmap(year5),distmap(year1),distmap(year2),distmap(year3),distmap(year4)),
rivD=distmap(line),
stringsAsFactors=TRUE)
hyp.years$numYear = with(hyp.years,as.numeric(year)-1)
hyp.years
#### Run mppm models
# mppm1 = mppm(ppp~(NDVI+rivD)/marks,data=hyp.years); summary(mppm1)
#..........................
# mppm2 = mppm(ppp~(NDVI+rivD)/marks,random = ~1|year,data=hyp.years); summary(mppm2)
#..........................
# correlation = corAR1(form=~year)
# mppm3 = mppm(ppp~(NDVI+rivD)/marks,correlation = corAR1(form=~year),use.gam = TRUE,data=hyp.years); summary(mppm3)
###---> Run mppm model with annual trend and random variation in growth
mppmCorr = mppm(ppp~(NDVI+rivD+numYear)/marks,random = ~1|year,data=hyp.years)
summary(mppm1)
If there's a trend in population size over time, then it might make sense to include this trend in the systematic part of the model. I would suggest you add a new numeric variable NumYear to the data frame (eg giving the number of years since 2010). Then try adding simple trend terms such as +NumYear to the model formula (this would correspond to the exponential growth in population that you observed.) You can keep the 1|year random effect term which will then allow for random variation in population size around the long term growth trend.
There's no need to split the data patterns for each year into separate male and female patterns. The variable marks in the model formula can be used to specify any model that depends on sex.
I'm pretty sure that mppm with use.gam=TRUE does not recognise the argument correlation and this is probably just ignored. (It depends what happens inside gam).

Resources