How to account for autocorrelation in a mixed-effects model with nlme

I need to fit a mixed-effects model and account for autocorrelation, as the data form a time series. I have a data set that consists of a behaviour value (Activity) measured per day in different individuals over several years. I've grouped the behaviour values into 4 different periods.
I want to test whether there are differences in behaviour values between the 4 periods, including id and year as random effects. I am also interested in the interaction between periods and Sex. Right now I am using "nlme", as "lme4" doesn't allow accounting for autocorrelation as far as I know.
model31 <- lme(Activity ~ periods * Sex, random = ~1|Year/Individual,
data = mydata)
However, when I try to account for autocorrelation I am a bit lost and I am not sure how to do it. So far this is what I have tried:
model32 <- lme(Activity ~ periods * Sex, random = ~1|Year/Individual,
data = mydata, correlation = corAR1()) #what does this do?
Also, I want to account for temporal autocorrelation (activity at time t is influenced by the previous value at t-1). I have dates for each sampled value, but they are not included in the original model and I don't know how to add them. Rather hopelessly, I thought of this code, but it doesn't work:
model33 <- lme(Activity ~ periods * Sex, random = ~ 1|Year/Individual,
data = mydata, correlation= corCAR1(form = ~ date|periods))
And I get this error:
incompatible formulas for groups in 'random' and 'correlation'
But of course I am not interested in the autocorrelation of the variables included under random effects.
I am a bit lost here and any guidance on this would be very much appreciated.
Thank you

Related

Assessing features to labelencode or get_dummies() on dataset in Python

I'm working on the heart attack analysis dataset on Kaggle in Python.
I am a beginner and I'm trying to figure out whether it's still necessary to one-hot encode or LabelEncode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but it makes no sense to me to do so. Would using only StandardScaler be enough?
When you apply StandardScaler, the columns end up with values in the same range. That helps the model keep its weights bounded, and gradient descent will not shoot off while converging, so the model converges faster.
Independently, in order to decide between ordinal values and one-hot encoding, consider whether the distance between the category values is meaningful. If it is, choose ordinal values: when you know the hierarchy of the categories you can assign the ordinal values manually, otherwise you can use LabelEncoder. It seems like the heart attack data is already given with ordinal values manually assigned; for example, higher chest pain = 4.
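For illustration, a minimal scikit-learn sketch of that advice (the file name heart.csv and the classifier choice are my own assumptions; the binary and ordinal columns are passed through untouched while the continuous ones are scaled):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")                  # hypothetical file name
X, y = df.drop(columns="target"), df["target"]

numeric = ["age", "thalach", "oldpeak"]        # continuous columns -> scale
# sex, cp, exang, slope, ca, thal are already 0/1 or ordinal integers -> leave as-is
pre = ColumnTransformer([("num", StandardScaler(), numeric)],
                        remainder="passthrough")

model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)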
Also, it is important to refer to notebooks that perform better. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score

Correct (move) samples based on new mean and variance

I have a set of samples drawn from a specific Gaussian mixture. Then I update the mixture parameters (means, variances, weights). Now I want to correct (move) these samples to match the new parameters. I don't want to randomly sample from the new mixture, just move the old samples so that they reflect the new parameters. How can I do that?
Use Expectation Maximization
After updating the mixture parameters (the M-step), you want to compute the expected membership values for every sample (the E-step). An example can look like this:
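For a two-component mixture, the usual responsibility formula is (reconstructed here in plain text, with f_A and f_B denoting the component densities under the current parameters):
P(A | x_i) = Pa * f_A(x_i) / (Pa * f_A(x_i) + Pb * f_B(x_i))
P(B | x_i) = 1 - P(A | x_i)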
where A, B are the mixture components (there can be more components, of course) and Pa, Pb are their relative frequencies, with the constraint Pa + Pb = 1. Those two steps can be repeated until the change in the mixture parameters drops below any small enough epsilon.
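A minimal numpy sketch of that E-step (the function name and the 1-D Gaussian components are my own assumptions):
import numpy as np
from scipy.stats import norm

def responsibilities(x, means, stds, weights):
    # E-step: expected membership of each sample in each component.
    # x: 1-D array of samples; means, stds, weights: one entry per component.
    dens = np.array([w * norm.pdf(x, m, s)
                     for m, s, w in zip(means, stds, weights)])
    return dens / dens.sum(axis=0)  # each column sums to 1 across components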

STL decomposition getting rid of NaN values

Following links were investigated but didn't provide me with the answer I was looking for/fixing my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition, but I can show my current code and give the length of the data set; if this isn't enough I will remove the question.
import numpy as np
from statsmodels.tsa import seasonal

def stl_decomposition(data):
    data = np.array(data)
    data = [item for sublist in data for item in sublist]
    decomposed = seasonal.seasonal_decompose(x=data, freq=12)
    seas = decomposed.seasonal
    trend = decomposed.trend
    res = decomposed.resid
In a plot it shows that the series decomposes correctly according to an additive model. However, the trend and residual lists have NaN values for the first and last 6 months. The current data set is of size 10*12. Ideally this should also work for something as small as two years.
Is this still too small, as said in the first link? I.e. do I need to extrapolate the extra points myself?
EDIT: It seems that half of the frequency is always NaN at both ends of the trend and residual. The same holds when decreasing the size of the data set.
According to this GitHub link, another user had a similar question. They 'fixed' the issue: to avoid NaNs, an extra parameter can be passed.
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
It will then use linear least squares to approximate the missing values. (Source)
Obviously the information was right there in the documentation and clearly explained, but I completely missed/misinterpreted it. Hence I am answering my own question, to save anyone with the same issue the adventure I had.
According to the parameter definition below, setting extrapolate_trend to anything other than 0 makes the trend estimation at the ends switch to linear least-squares extrapolation instead of returning NaN. I faced this issue when I had only a few observations for estimation.
extrapolate_trend : int or 'freq', optional
If set to > 0, the trend resulting from the convolution is
linear least-squares extrapolated on both ends (or the single one
if two_sided is False) considering this many (+1) closest points.
If set to 'freq', use `freq` closest points. Setting this parameter
results in no NaN values in trend or resid components.
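For completeness, a small sketch of the fix on synthetic data (note that newer statsmodels versions take period instead of freq):
import numpy as np
from statsmodels.tsa import seasonal

data = np.random.rand(24)  # hypothetical two years of monthly values
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
print(np.isnan(decomposed.trend).sum(), np.isnan(decomposed.resid).sum())  # both 0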

Get average values in some specific range based on spatial analysis with QGIS

I'm working in QGIS to compute average values attached to polygons around a line.
What I'd like to do is compute the average of the polygon values within a user-defined range from the line.
How would I go about doing this?
I have attached a picture below for reference:
There should be a couple of ways of doing this. The easiest might be the answer to this question, and then using the answer to this question.
But you should be able to do this with a little bit of Python. I've written a little script that uses just one layer (a point layer); it finds the average of the points within a certain radius, averaging a field named 'cost':
# get the selected feature
layer = iface.activeLayer()
features = layer.selectedFeatures()

# buffer the feature by 10 layer units
buffer = features[0].geometry().buffer(10, -1)

# will hold the features intersecting the buffer
inBuffer = []
for feature in layer.getFeatures():
    if feature.geometry().intersects(buffer):
        inBuffer.append(feature)

# for calculating the average
total = 0
number = 0

# sum the 'cost' field over the features that intersect the buffer of the selection
field = layer.fieldNameIndex('cost')
for feature in inBuffer:
    total += feature['cost']
    number += 1

# get the average
average = total / float(number)
print(average)
This takes only the first selected feature (features[0]) and applies the search radius to that; this limitation is irrelevant if you only select one feature in the active layer.
The code above can be compressed a fair bit, but I thought I'd break it out a little more, especially as my Python is fairly limited.
To find an average in a second layer based on the selection in a first layer, you could modify this slightly by grabbing all layers:
mapcanvas = iface.mapCanvas()
layers = mapcanvas.layers()
then using layers[0], layers[1] (or layers[i]) in place of layer at the appropriate places, something like:
features = layer.selectedFeatures() to features = layers[0].selectedFeatures()
for the source feature, and
for feature in layer.getFeatures(): to for feature in layers[1].getFeatures()
and
field = layer.fieldNameIndex('fieldname') to field = layers[1].fieldNameIndex('fieldname')
for the target feature (the one being averaged).
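Putting those substitutions together, a rough (untested) sketch of the two-layer version, still assuming the field being averaged is called 'cost':
mapcanvas = iface.mapCanvas()
layers = mapcanvas.layers()

# source layer: buffer the first selected feature by 10 layer units
features = layers[0].selectedFeatures()
buffer = features[0].geometry().buffer(10, -1)

# target layer: average the 'cost' field of the features intersecting the buffer
total = 0
number = 0
for feature in layers[1].getFeatures():
    if feature.geometry().intersects(buffer):
        total += feature['cost']
        number += 1

average = total / float(number) if number else None
print(average)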
Hopefully the code I have posted is easy enough to apply. I would probably ensure that both layers use the same SRS to avoid any issues with the intersection, and remember that the buffer units are in the units of that SRS.

Randomly select increasing subset of data to see where mean levels off

Could anyone please advise the best way to do the following?
I have three variables (X, Y & Z) and four groups (1, 2, 3 & 4). I have been using discriminant function analysis in SPSS to predict group membership of known grouped data for use with future ungrouped data.
Ideally I would like to be able to randomly sample increasingly large subsets of the data to see how many observations are required to hit a desired correct classification percentage.
However, I understand this might be difficult. Therefore, I'm looking to do this for the means.
For example, let's say variable X has a mean of 141 for group 1. This mean might have been calculated from 2000 observations, but it might be the case that the mean had already settled at, say, 700 observations. I would like to be able to calculate at what number of observations/cases the mean levels off in my data. For example, perhaps starting at 10 observations and repeating this randomly 50 or 100 times, then increasing to 20 observations, and so on.
I understand this is a form of Monte Carlo testing. I have access to SPSS 15, 17 and 18 and Excel. I also have access to Minitab 15 & 16 and Amos 17, and have downloaded R, but I'm not familiar with these; my experience is with SPSS and Excel. I have tried some syntax in SPSS modified from this: http://pages.infinit.net/rlevesqu/Syntax/RandomSampling/Select2CasesFromEachGroup.txt but it would still be quite time-consuming on my part to enter the subset numbers etc.
Hope someone can help.
Thanks for reading.
Andy
The text you linked to is a good start (you can also use the SAMPLE command in SPSS, but IMO the Raynald script you linked to is more flexible when you think about constructing the sample that way).
In pseudo-code, the process might look like:
do n for sample size (a to b)
    loop 100 times
        draw sample of size n
        compute (& save) statistics
Here is where SPSS's macro language comes into play (I think this document is a good introduction, plus you can examine other references on the SPSS tag wiki). Basically once you figure out how to draw the sample and compute the stats you want, you just need to figure out how to write a macro so you can loop through the process (and pass it the sample size parameter). I include the loop 100 times because you want to be able to make some type of estimate about the error associated with each sample size.
If you give an example of how you compute the statistics, I may be able to give an example of how to make that into a macro function and loop through it the desired number of times.
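(If you ever want to try this outside SPSS, the same process is only a few lines in a general-purpose language; here is a rough Python sketch of the pseudo-code above, with made-up names:)
import random
import statistics

def mean_by_sample_size(values, sizes, reps=100):
    # For each sample size n, draw reps random subsamples and record the mean
    # of the subsample means plus their spread, to see where the mean levels off.
    results = {}
    for n in sizes:
        means = [statistics.mean(random.sample(values, n)) for _ in range(reps)]
        results[n] = (statistics.mean(means), statistics.stdev(means))
    return results

# e.g. mean_by_sample_size(group1_x, sizes=range(10, 2001, 10))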
#Andy W
#Oliver
Thanks for your suggestions guys. I've managed to find a workaround using the following macro from http://www.spsstools.net/Syntax/Bootstrap/GetRandomSampleOfVariousSizeCalcStats.txt. However, for this I need to copy and paste the variable data for a given group into a new data window, which is not too much of a problem. To take this further, would anyone know how:
1. I could get other statistics recorded, e.g. std error, std dev, etc.?
2. I could use other analyses, ideally discriminant function analysis, and record the percentage of correct classifications in a new data window rather than having lots of output tables?
3. I could avoid copying and pasting the variables for each group, so I can just run the macro specifying n samples of variable X for groups 1, 2, 3 & 4?
Thanks again.
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
* myvar = the variable of interest (here we want the mean of salary)
* nbsampl = number of samples.
* size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='c:\Program Files\SPSS\employee data.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='c:\temp\sample.sav'.
!IFEND
SAVE OUTFILE='c:\temp\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=salary BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
* ----------------END OF MACRO ----------------------------------------------.
* Call the macro (parameters are the number of samples (here 20) and the sample sizes (here 5, 10, 15, 30, 50)).
* Thus 20 samples of size 5.
* Thus 20 samples of size 10, etc.
!sample myvar=salary nbsampl=20 size= 5 10 15 30 50.
