I want to run a mediation analysis to estimate the effect of exposure to a pollutant (continuous) on cancer type (categorical, 4 levels) via a blood biomarker as the mediator (continuous). The mediation diagram would look something like this:
E -> B -> C
For the mediator I fit a linear regression:
med.fit <- lm(blood_biomarker~exposure+age+sex, data=demographics)
but for the outcome variable, I read in the docs that the only appropriate model is a multinomial regression such as:
out.fit <- multinom(cancer_type~blood_biomarker+exposure+age+sex, data=demographics)
However, the mediate function does not accept a multinom object as input.
# this doesn't work
med.out <- mediate(med.fit, out.fit, treat="exposure", mediator="blood_biomarker")
All of the models above are simplified for this example; there are more confounders than age and sex.
I am new to mediation analysis, and I think my problem lies more in the regression method required than in the code itself. Is there a way to do the same analysis using glm() or lm() (or any other function that produces an object recognized by the mediate function) for this kind of data?
Thank you in advance.
Your problem may be that the exposure is continuous. Try the package medflex for continuous exposure (version 4.2).
Steen J, Loeys T, Moerkerke B, Vansteelandt S. 2017. medflex: An R Package for Flexible Mediation Analysis using Natural Effect Models. J. Stat. Softw. 76.
However, I am not sure whether the mediator can be continuous; that may also be a problem for the analysis.
I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, SQRT etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain scores (0 = no pain -> 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
As part of its output, SPSS provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
I am working in SPSS using the "Generalized Mixed" option under "Mixed Models" to run a GLMM. I have a repeated measures design. I am testing whether repeated sessions (5) had an effect on dogs' approach (Y/N) to three bowl locations. My outcome variable is binary: approach yes/no. Session number (1,2,3,4,5) and Bowl Position (A,B,C) are fixed effects and DogID is a random effect. I have run a Generalised Linear Mixed Model with a binary logistic regression link, but I cannot see anywhere in the SPSS model builder to run post-hoc tests. I would like to run Tukey's. I know that you can run post-hoc tests using "General Linear Model" > "Univariate", but as far as I can tell this option cannot account for the binary outcome variable. Does anyone know how to run post-hoc tests with this model builder in SPSS? I understand that many people use R for this type of analysis, but I am not proficient in it yet.
I am going through the samples for Azure Machine Learning. The examples give me the impression that ML is used for classification-style problems such as ranking, classifying, or detecting a category, using a model trained on sample data.
Now I am wondering whether ML can be trained on computational problems like multiplication, division, or other series problems. Does this kind of problem fit within the scope of ML?
MULTIPLICATION DATASET:
Num01,Num02,Result
1,1,1
1,2,2
1,3,3
1,4,4
1,5,5
1,6,6
1,7,7
1,8,8
1,9,9
1,10,10
1,11,11
1,12,12
1,13,13
1,14,14
2,1,2
2,2,4
2,3,6
2,4,8
2,5,10
2,6,12
2,7,14
2,8,16
2,9,18
2,10,20
2,11,22
2,12,24
2,13,26
2,14,28
3,1,3
3,2,6
SCORING DATASET:
Num01,Num02
1,5
3,1
2,16
3,15
1,32
It seems like you are looking for regression, which is supported by almost every machine learning library, including Azure's services. In layman's terms, the goal of regression is to approximate an unknown function that maps data X to a continuous value y.
This can be any function, including multiplication or division. However, note that these cases are usually far too simple to be worth solving with machine learning. Most machine learning algorithms (except maybe linear regression) do many more internal computations and will therefore be slower than a native implementation on your device.
As an extra point of clarification, most of the actual machine learning (ML) in Azure ML is done by great open source libraries such as scikit-learn or Keras. Azure mainly provides compute power and higher-level management tools, such as experiment tracking and efficient hyperparameter tuning.
If you are just getting started with ML and want to go more in depth, this extra functionality might be overkill or confusing. So I would advise starting with one of the packages mentioned above. You would also need to combine that with some more formal training, which will explain most of the important concepts.
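To make the regression idea concrete, here is a minimal sketch using only NumPy. It assumes the training grid is Num01 in 1..3 and Num02 in 1..14 (the dataset in the question is cut off, so the full grid is my assumption), and it fits an ordinary least-squares model on log-transformed inputs, since log(a*b) = log(a) + log(b) turns multiplication into a linear problem. The helper name `design` is made up for illustration; a plain linear model on the raw inputs would not recover multiplication.

```python
import numpy as np

# Assumed training grid: Num01 in 1..3, Num02 in 1..14 (the question's list is truncated).
train = np.array([(a, b) for a in range(1, 4) for b in range(1, 15)], dtype=float)
target = train[:, 0] * train[:, 1]


def design(pairs):
    """Build the feature matrix [log(a), log(b), 1] for each input pair."""
    return np.column_stack(
        [np.log(pairs[:, 0]), np.log(pairs[:, 1]), np.ones(len(pairs))]
    )


# Ordinary least squares on the log scale: log(result) ~ w1*log(a) + w2*log(b) + w0.
weights, *_ = np.linalg.lstsq(design(train), np.log(target), rcond=None)

# Scoring dataset from the question; several rows lie outside the training range.
scoring = np.array([[1, 5], [3, 1], [2, 16], [3, 15], [1, 32]], dtype=float)
predictions = np.exp(design(scoring) @ weights)

for (a, b), p in zip(scoring, predictions):
    print(f"{a:.0f} x {b:.0f} ~ {p:.3f}")
```

The model extrapolates beyond the training range only because the chosen features match the structure of the problem exactly; a generic model without that feature choice would typically not extrapolate this well.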
I want to build a Random Forest Regressor to model count data (Poisson distribution). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (sklearn, etc.)?
Is there any implementation to fit count data in Python in any packages?
In sklearn this is currently not supported. See the discussion in the corresponding issue here, or this one for another class, where the reasons are discussed in a bit more detail (mainly the large computational overhead of calling a Python function).
So it could be done as discussed in those issues: fork sklearn, implement the cost function in Cython, and then add it to the list of available 'criterion' options.
If the problem is that the counts c_i arise from different exposure times t_i, then indeed one cannot fit the counts directly, but one can still fit the rates r_i = c_i/t_i using the MSE loss function, provided one uses weights proportional to the exposures, w_i = t_i.
For a true Random Forest Poisson regression, I've seen that in R there is the rpart library for building a single CART tree, which has a Poisson regression option. I wish this kind of algorithm were available in scikit-learn.
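A small numeric sketch of why the exposure weights matter, using made-up counts and exposures (all values below are hypothetical). The constant that minimizes weighted MSE is the weighted mean, and with w_i = t_i the weighted mean of the rates equals total counts divided by total exposure, which is the Poisson maximum-likelihood estimate of a constant rate. A regression tree leaf fitted with such weights would predict this value; in scikit-learn the weights can be supplied through the `sample_weight` argument of `fit`.

```python
# Hypothetical counts c_i and exposure times t_i for the samples in one leaf.
counts = [3, 10, 0, 7]
exposures = [1.0, 4.0, 0.5, 2.5]

# Observed rates r_i = c_i / t_i.
rates = [c / t for c, t in zip(counts, exposures)]

# Exposure-weighted mean of the rates (the minimizer of weighted MSE with w_i = t_i).
weighted_mean = sum(t * r for t, r in zip(exposures, rates)) / sum(exposures)

# Poisson maximum-likelihood estimate of a constant rate: total counts / total exposure.
pooled_rate = sum(counts) / sum(exposures)

# Unweighted mean of the rates, for comparison; it treats short exposures as
# equally informative as long ones.
unweighted_mean = sum(rates) / len(rates)

print(weighted_mean, pooled_rate, unweighted_mean)
```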
In R, writing a custom objective function is fairly simple.
The randomForestSRC package in R lets you write your own custom split rule. The custom split rule, however, has to be written in pure C.
All you have to do is write your own custom split rule, register it, then compile and install the package.
The custom split rule has to be defined in the file called splitCustom.c in the randomForestSRC source code.
You can find more info here.
The file in which you define the split rule is this.
I would like to understand how to use the SALib Python toolbox to perform a Sobol sensitivity analysis (to study the influence of individual parameters and of parameter interactions).
From the original example, I am supposed to proceed this way:
from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami
import numpy as np
problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-np.pi, np.pi]]*3
}
# Generate samples
param_values = saltelli.sample(problem, 1000)
# Run model (example)
Y = Ishigami.evaluate(param_values)
# Perform analysis
Si = sobol.analyze(problem, Y, print_to_console=True)
# Returns a dictionary with keys 'S1', 'S1_conf', 'ST', and 'ST_conf'
# (first and total-order indices with bootstrap confidence intervals)
In my case the data come from experiments, so I don't have the model linking Xi and Yi. I just have an input matrix and an output matrix.
If we assume that my input data were generated from a Latin Hypercube (a good statistical distribution), how can I use SALib to evaluate the sensitivity of my parameters? From what I see in the code:
Si = sobol.analyze(problem, Y, print_to_console=True)
only the input parameter bounds and the output are used. With this approach, how is it possible to know which parameter changes between two samples?
Thanks for your help!
There is no direct way to compute the Sobol indices using SALib based on your description of the data. SALib computes the first- and total-order indices by generating two matrices (A and B) and then using additional rows generated by cross-sampling, where a value from matrix B is substituted into matrix A. The diagram below shows how this is done. When the code evaluates the indices, it expects the model output to be in this order. This way of computing the indices is based on the methods published by Saltelli et al. (2010). Because this is not a Latin hypercube sampling method, the experimental data will most likely not work.
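To illustrate the cross-sampling idea, here is a minimal NumPy sketch of the estimators described in Saltelli et al. (2010), applied to the Ishigami function from the question. It is not SALib's implementation: it uses plain pseudo-random sampling instead of a Sobol sequence, it does not reproduce SALib's row ordering, and the names `ishigami`, `A`, `B` and `AB_i` are chosen here for illustration. The point is that each AB_i block requires model evaluations at specific, constructed inputs, which is why arbitrary experimental points cannot be fed to the analysis directly.

```python
import numpy as np


def ishigami(X, a=7.0, b=0.1):
    """Ishigami test function evaluated row-wise on an (N, 3) array."""
    return np.sin(X[:, 0]) + a * np.sin(X[:, 1]) ** 2 + b * X[:, 2] ** 4 * np.sin(X[:, 0])


rng = np.random.default_rng(42)
N, D = 20000, 3

# Two independent sample matrices over the input bounds [-pi, pi]^3.
A = rng.uniform(-np.pi, np.pi, size=(N, D))
B = rng.uniform(-np.pi, np.pi, size=(N, D))

f_A, f_B = ishigami(A), ishigami(B)
var_Y = np.var(np.concatenate([f_A, f_B]))

S1 = np.empty(D)  # first-order indices
ST = np.empty(D)  # total-order indices
for i in range(D):
    # AB_i is matrix A with column i replaced by column i of B, so rows of A
    # and AB_i differ only in parameter i.
    AB_i = A.copy()
    AB_i[:, i] = B[:, i]
    f_AB = ishigami(AB_i)
    S1[i] = np.mean(f_B * (f_AB - f_A)) / var_Y
    ST[i] = 0.5 * np.mean((f_A - f_AB) ** 2) / var_Y

print("S1:", np.round(S1, 3))
print("ST:", np.round(ST, 3))
```

This also answers the question of how the method knows which parameter changes between two sets: by construction, the only difference between a row of A and the matching row of AB_i is parameter i.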
One possible way to still complete a sensitivity analysis is to build a surrogate (or meta) model from your experimental data. In this case you use the experimental data to fit an approximation of your true model. This approximation can then be analyzed by SALib or another sensitivity package. The surrogate model is typically a polynomial or based on kriging. Iooss et al. (2006) describe some methods. Software for this approach includes UQLab (http://www.uqlab.com/, MATLAB-based) and BASS (https://cran.r-project.org/web/packages/BASS/index.html, R package), among others, depending on the specific type of model and fitting techniques you want to use.
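As a rough sketch of the surrogate idea, the snippet below fits a full quadratic polynomial to input/output data by least squares with NumPy. The data here are synthetic stand-ins for experimental measurements, and the helper names `quadratic_features` and `fit_surrogate` are hypothetical; a quadratic may well be too simple for a real experiment, in which case kriging or another model would be needed.

```python
import numpy as np


def quadratic_features(X):
    """Return columns [1, x_i, x_i * x_j for i <= j] for each row of X."""
    n, d = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(d)]
    cols += [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack(cols)


def fit_surrogate(X, y):
    """Least-squares fit of a quadratic surrogate; returns a callable model."""
    coef, *_ = np.linalg.lstsq(quadratic_features(X), y, rcond=None)
    return lambda X_new: quadratic_features(np.atleast_2d(X_new)) @ coef


# Synthetic stand-in for an experimental input matrix and output vector.
rng = np.random.default_rng(0)
X_exp = rng.uniform(-1.0, 1.0, size=(200, 3))
y_exp = 2.0 * X_exp[:, 0] + X_exp[:, 1] * X_exp[:, 2] + rng.normal(0.0, 0.01, 200)

surrogate = fit_surrogate(X_exp, y_exp)

# The surrogate can now be evaluated at any input, including generated samples.
X_check = np.array([[0.5, -0.5, 0.2]])
print(surrogate(X_check))
```

With a surrogate in hand, the workflow from the question applies unchanged: generate `param_values` with the sampler and compute `Y = surrogate(param_values)` in place of `Ishigami.evaluate(param_values)`. The resulting indices describe the surrogate, so they are only as trustworthy as its fit to the experimental data.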
Another possibility is to find an estimator that is not based on the Saltelli et al. (2010) method. I am not sure whether such an estimator exists, but it would probably be better to post that question on the Math or Probability and Statistics Stack Exchange sites.
References:
Iooss, B., F. Van Dorpe, N. Devictor. 2006. "Response surfaces and sensitivity analyses for an environmental model of dose calculations". Reliability Engineering and System Safety 91:1241-1251.
Saltelli, A., P. Annoni, I. Azzini, F. Campolongo, M. Ratto, S. Tarantola. 2010. "Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index". Computer Physics Communications 181:259-270.