Log transforming predictor variables in survival analysis - survival-analysis

I am running shared gamma frailty models (i.e., Coxph survival analysis models with a random effect) and want to know if it is "acceptable" to log transform one of your continuous predictor variables. I found a website (http://www.medcalc.org/manual/cox_proportional_hazards.php) that said "The Cox proportional regression model assumes ... there should be a linear relationship between the endpoint and predictor variables. Predictor variables that have a highly skewed distribution may require logarithmic transformation to reduce the effect of extreme values. Logarithmic transformation of a variable var can be obtained by entering LOG(var) as predictor variable".
I would really appreciate a second opinion from someone with more statistical knowledge on this topic. In a nutshell: is it OK/commonplace/etc. to transform (specifically log transform) predictor variables in a survival analysis model (e.g., a Coxph model)?
Thanks.

You can log transform any predictor in Cox regression. This is frequently necessary but has some drawbacks.
Why log transform? There are a number of good reasons. It reduces the extent and influence of outliers, the data become closer to normally distributed, etc.
When is it possible? Apart from predictors with zero or negative values, for which the logarithm is undefined, I doubt that there are circumstances in which you cannot do it. I find it hard to believe that it would compromise the precision of your estimates.
Why not always do it? Because it becomes difficult to interpret the results for a predictor that has been log transformed. If you don't log transform, and your predictor is, for example, blood pressure and you obtain a hazard ratio of 1.05, that means a 5% increase in the risk of the event for each 1-unit increase in blood pressure. If you log transform blood pressure, a hazard ratio of 1.05 (it would most likely not land on 1.05 again after the log transform, but we'll stick to 1.05 for simplicity) means a 5% increase for each one-unit increase in log blood pressure, that is, each time blood pressure is multiplied by e (about 2.72) if the natural log is used. That is more difficult to grasp.
But if you are not interested in the particular variable that you are thinking about log transforming (i.e., you just need to adjust for it as a covariate), go ahead and do it.
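One way to make a hazard ratio on the log scale easier to read is to restate it per fold change in the original variable. The sketch below assumes the predictor was transformed with the natural log; the function name and the 1.05 value are purely illustrative, not output from any fitted model.

```python
import math

def hazard_ratio_for_fold_change(hr_per_log_unit: float, fold: float) -> float:
    """Hazard ratio associated with multiplying the predictor by `fold`,
    given the hazard ratio per one natural-log unit of the predictor."""
    beta = math.log(hr_per_log_unit)         # Cox coefficient on log(x)
    return math.exp(beta * math.log(fold))   # same as fold ** beta

# Illustrative value only: HR of 1.05 per log unit, as in the text above
hr_doubling = hazard_ratio_for_fold_change(1.05, 2.0)      # predictor doubles
hr_ten_percent = hazard_ratio_for_fold_change(1.05, 1.10)  # predictor rises 10%

print(f"HR per doubling:     {hr_doubling:.4f}")
print(f"HR per 10% increase: {hr_ten_percent:.4f}")
```

Reporting the effect "per doubling" or "per 10% increase" is often easier to communicate than "per log unit", and it does not change the model itself.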

Related

Min Max Normalization/Normal Distribution

I have a dataset with county-level data where N=3119 with 93 variables. I am trying to do a PCA, EFA and/or CFA. The data has been given to me already min/max normalized, ranging from 0 to 1. Theory states that the data should be normally distributed for CFA/SEM, but my understanding is that min/max normalization does not change the distribution of the data, only its scale.
It is clear to me that I do not have multivariate normality or univariate normality due to the skewness of the data. I guess what's confusing me is that people seemingly throw around the term normalization interchangeably with the meaning of normal distribution.
So can I go forward with my analysis since min/max normalization has been performed, or do I need to look more towards other log/Box-Cox transformations to adjust the distribution prior to running my analysis? Is it okay to log transform data that has already been min/max normalized?
"my understanding is that min/max normalization does not change the distribution of the data, only its scale."
Correct. If you print a hist()ogram of original and transformed data, they should look identical. Only the x-axis scale will change.
"the term normalization interchangeably with the meaning of normal distribution"
Indeed, these are completely separate issues.
"Is it okay to log transform data that has already been min/max normalized?"
Taking the log() would affect data on the 0-1 interval differently than data further up the real-number line: values between 0 and 1 map to negative numbers, and the minimum, which min/max normalization sets to exactly 0, has no logarithm. But I don't see why you need to transform the data when nonnormality corrections are available for SEs (in EFA or CFA) and model-fit test statistics (relevant for CFA). Independent-components analysis might be an alternative to PCA if your data are not normal.
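The point that min/max normalization leaves the shape of the distribution untouched can be checked numerically: skewness is unchanged by shifting and rescaling. This is a minimal sketch using made-up data; the `offset` added before taking the log is an arbitrary, hypothetical choice needed only because the normalized minimum is exactly 0.

```python
import math
import statistics

def skewness(values):
    """Moment-based (population) skewness."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 15.0, 40.0]  # made-up right-skewed data
scaled = min_max(raw)

# Same shape before and after normalization, so the skewness matches
print(skewness(raw), skewness(scaled))

# log(0) is undefined, so a small offset (arbitrary assumption) is needed first
offset = 0.01
logged = [math.log(v + offset) for v in scaled]
print(skewness(logged))
```

Note that the result of the log step depends on the chosen offset, which is one reason a log transform after min/max normalization is less clean than a log transform of the original values.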

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this may be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says it should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest. Namely, you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change.
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
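The comparison described in (1), fit on the full sample versus fit on the two sub-samples, can be sketched as a Chow-style F statistic. The answer does not name a specific test, so treating it as a Chow test is an assumption here, as are the made-up data and the known break position.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def chow_f_statistic(xs, ys, break_index):
    """Chow-style F statistic for a break at break_index (k = 2 parameters per line)."""
    k = 2
    n = len(xs)
    _, _, sse_full = fit_line(xs, ys)
    _, _, sse_before = fit_line(xs[:break_index], ys[:break_index])
    _, _, sse_after = fit_line(xs[break_index:], ys[break_index:])
    sse_split = sse_before + sse_after
    return ((sse_full - sse_split) / k) / (sse_split / (n - 2 * k))

# Made-up data: slope 2 for the first 10 points, slope 5 afterwards
xs = [float(i) for i in range(20)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2]
ys = [(2.0 * x if i < 10 else 5.0 * x - 30.0) + noise[i % 5]
      for i, x in enumerate(xs)]

f_stat = chow_f_statistic(xs, ys, break_index=10)
print(f"F statistic for a break at index 10: {f_stat:.1f}")
```

A large F statistic means that splitting the sample improves the fit far more than chance would suggest. When the break date is unknown and is searched over, the usual F critical values no longer apply directly.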
I hope this helps!

A method to find the inconsistency or variation in the data

I am running an experiment (it's an image processing experiment) in which I have a set of paper samples and each sample has a set of lines. For each line in the paper sample, its strength is calculated which is denoted by say 's'. For a given paper sample I have to find the variation amongst the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
1) I started with the standard deviation of the values, but the problem is that the order of magnitude of s can differ from sample to sample (because of properties of the lines such as length, sharpness, darkness etc.), so the calculated standard deviations also differ a lot in magnitude. I therefore can't really use a single standard-deviation limit across different samples.
Is there any way to find a suitable limit that is applicable to all samples?
I don't have any history of how the strength value should behave. For a given sample, more variation could be tolerated when the strength values are large, whereas a sample with smaller strength values should show less variation. So I think I first need a way of baselining the variation across different samples, but I don't know what approaches I could try to get started.
Please note that I have to tell variation between lines within a sample whereas the limit should be applicable for any good sample.
Please help me out.
You seem to have a set of samples. Then, for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample then it is understandable that the standard deviation was not a good metric. The standard deviation is notoriously sensitive to outliers. So, try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the Interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing. However, you have to choose a suitable threshold for the metric computed above. You might consider using another robust metric for thresholding the metric. You can compute a robust estimate of their mean (e.g., the median) and a robust estimate of their standard deviation (e.g., 1.4826 * MAD). Then identify outliers as metric values above some number of robust standard deviations above the robust mean.
Histogram. Another simple method is to histogram your computed metrics from step 1. This is non-parametric, so it doesn't require you to model your data. You can histogram your metric values and then use the top 1% (or some other value) as your threshold limit.
Triangle method. A neat and simple heuristic for thresholding is the triangle method, which performs binary classification of a skewed distribution.
Anomaly detection. Read up on other outlier detection methods.
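The MAD metric and the robust thresholding step can be sketched as follows. All numbers are made up for illustration, and the cut-off of 3 robust standard deviations is an arbitrary assumption, not a recommended value.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def flag_outliers(metrics, n_sigmas=3.0):
    """Flag metric values lying more than n_sigmas robust SDs above the median."""
    center = statistics.median(metrics)
    robust_sd = 1.4826 * mad(metrics)  # MAD rescaled to match the SD under normality
    return [m > center + n_sigmas * robust_sd for m in metrics]

# Step 1: a robust dispersion metric for one sample (hypothetical line strengths)
line_strengths = [10.0, 10.5, 9.8, 10.2, 10.1, 30.0]
print("SD: ", statistics.stdev(line_strengths))  # inflated by the single 30.0
print("MAD:", mad(line_strengths))               # barely affected by it

# Step 2: threshold the metric across samples (hypothetical per-sample metrics)
per_sample_metric = [0.8, 1.1, 0.9, 1.0, 1.2, 0.95, 4.0]
print(flag_outliers(per_sample_metric))
```

If most samples have nearly identical metric values, the MAD of the metrics can be close to zero and the threshold becomes too tight, so it is worth checking the spread before relying on this rule.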

How do I measure the distribution of an attribute of a given population?

I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole. (i.e. is it normal).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (hundreds or even thousands of observations). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
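The sort-and-compare-with-normal-quantiles idea can be sketched without a plotting library by computing the points of the plot and summarizing their straightness with a correlation. The catalog scores below are simulated placeholders, and using the correlation as a straightness summary is an added convenience, not part of the procedure described above.

```python
import random
import statistics

def normal_qq_points(sample):
    """Pair each sorted observation with the matching standard-normal quantile."""
    n = len(sample)
    std_normal = statistics.NormalDist()
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1)
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def qq_correlation(sample):
    """Correlation of the Q-Q points; values near 1 mean a nearly straight line."""
    theoretical, observed = zip(*normal_qq_points(sample))
    return pearson(theoretical, observed)

# Hypothetical catalog of 900 reliability scores (simulated, for illustration only)
rng = random.Random(42)
catalog = [rng.gauss(0.90, 0.03) for _ in range(900)]

sample = rng.sample(catalog, 100)  # measure only a random subset
print("Q-Q correlation:", qq_correlation(sample))
print("Estimated mean and SD:", statistics.fmean(sample), statistics.stdev(sample))
```

Plotting the pairs from `normal_qq_points` gives the usual normal Q-Q plot. Replacing the standard-normal quantiles with those of another candidate distribution gives the corresponding check for that distribution.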
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief would be that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range and tend to fall more towards one side or the other of that range. If they tend to the high range, I'd predict that they get lopped off at the end of the range and have a long tail to the low side, and vice versa if they tend to the low range.

using skewness to segment volume

My knowledge of statistics is minuscule, sorry. I have a large volume of measured amplitudes. In the absence of a signal, the noise is assumed to have a normal distribution. When a signal is present with higher amplitude than the surrounding noise, the distribution has a heavier tail on the positive side. I was thinking of using skewness to detect the signal. But the region of higher amplitude (cells in the volume) is rather small compared to the volume itself: on the order of hundreds of cells out of some thousands. If the skewness is zero for a normal distribution, how can I extract the cells in my volume that contribute to the non-zero skewness? If, say, my skewness value is 0.5, is there a way to drop all cells and keep only those that raised the skewness value? Perhaps I sound unclear, but that just shows how little I understand of the topic.
Thanks in advance.
It seems to me that the problem might best be modeled as a mixture model: we have a Gaussian background
B ~ N(0, sigma)
and a signal, about which the poster has not specified a particular model.
If we can assume that the signal also takes the form of one (or possibly a mixture of several) Gaussian(s), then Gaussian mixture modelling with the EM algorithm may be a good way to solve it (see Wikipedia).
A good paper in the context of segmentation is:
http://www.fil.ion.ucl.ac.uk/~karl/Unified%20segmentation.pdf
If we cannot make such an assumption, I would use a robust regression method to estimate the parameters of the Gaussian noise, where the signal is treated as an outlier, e.g. Least trimmed squares (again see Wikipedia).
The outlier cells can then be found via (Bonferroni-corrected) hypothesis testing, as described e.g. in this paper:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2900857/
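The second route, a robust estimate of the Gaussian noise followed by Bonferroni-corrected testing, can be sketched as below. Note the substitution: the noise parameters are estimated here with the median and the MAD, a simpler robust stand-in for least trimmed squares, and the procedure in the linked paper may differ. The data, the function name, and alpha = 0.05 are assumptions for illustration.

```python
import statistics

def detect_signal_cells(amplitudes, alpha=0.05):
    """Return (indices, threshold) for cells too high to be Gaussian noise.

    Noise location and scale come from the median and the MAD (assumes MAD > 0),
    so a minority of signal cells has little influence on the estimates.
    """
    center = statistics.median(amplitudes)
    mad = statistics.median([abs(a - center) for a in amplitudes])
    sigma = 1.4826 * mad  # MAD rescaled to match the SD under normality
    noise_model = statistics.NormalDist(center, sigma)
    # Bonferroni: one-sided test per cell at level alpha / number of cells
    threshold = noise_model.inv_cdf(1 - alpha / len(amplitudes))
    flagged = [i for i, a in enumerate(amplitudes) if a > threshold]
    return flagged, threshold

# Made-up volume: 200 noise cells (exact normal quantiles) plus 10 signal cells
std_normal = statistics.NormalDist()
noise_cells = [std_normal.inv_cdf((i - 0.5) / 200) for i in range(1, 201)]
signal_cells = [6.0] * 10
volume = noise_cells + signal_cells

flagged, threshold = detect_signal_cells(volume)
print("threshold:", threshold)
print("flagged cell indices:", flagged)
```

The Bonferroni correction is conservative, so weak signal cells close to the noise level will be missed. If the signal occupies a large share of the volume, the median and MAD themselves become biased and the mixture-model route may be preferable.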
