Multiple inputs for coxph - survival-analysis

Is there a way to run coxph with multiple inputs? Here I have used the input hsa_let_7b_5p.
coxph(Surv(Time, Status)~ hsa_let_7b_5p, data=as.data.frame(test))
Call:
coxph(formula = Surv(Time, Status) ~ hsa_let_7b_5p, data = as.data.frame(test))

                coef exp(coef) se(coef)    z    p
hsa_let_7b_5p  0.169     1.184    0.173 0.98 0.33

Likelihood ratio test=0.94  on 1 df, p=0.333
n= 91, number of events= 45

It's not too clear to me whether this answers the question you meant or the question you asked, but you can add more regression variables to the right-hand side of the formula (after the ~):
coxph(Surv(Time, Status)~ hsa_let_7b_5p + x + y, data=as.data.frame(test))
where x and y are the names of other variables (columns) in your data frame.
You may also wish to read up on interactions and stratification at some point.

Related

Why my fit for a logarithm function looks so wrong

I'm plotting this dataset and making a logarithmic fit, but for some reason the fit looks badly wrong. At some point I got a good enough fit, but then I re-plotted and that bad fit came back. At the very beginning of the data there was a point 0.0 0.0076, but I changed it to 0.001 0.0076 to avoid the asymptote.
I'm using this for the fit (not exactly this file for the image above, but I'm testing with this one now and it gives that bad fit as well):
f(x) = a*log(k*x + b)
fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this (screenshot not reproduced here).
Also, sometimes it reports 7 iterations, as in the screenshot above, other times only 1; and when it produced the "correct" fit it ran something like 35 iterations and got a = 32, if I remember correctly.
Edit: here is the good one again; the plot I got is this one. And again, I re-plotted and got that weird fit. Curiously, when the 0.0 0.0076 point is present and the good fit is about to be produced, gnuplot says "Undefined value during function evaluation", but that message does not appear when I get the bad one.
Do you know why I keep getting this inconsistency? Thanks for your help.
As I already mentioned in the comments, the method of fitting antiderivatives is much better than fitting derivatives, because numerically computed derivatives are strongly scattered even when the data itself is only slightly scattered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The application to the case y = a*ln(c*x + b) is shown below.
Numerical calculation: (the worked numerical values and the corresponding figure from the original answer are not reproduced here)
In order to get an even better result (according to some specified fitting criterion), one can use the above parameter values as initial values for an iterative method of nonlinear regression implemented in any convenient software.
NOTE: The integral equation used in the present case is obtained by integrating y = a*ln(c*x + b) from the first data point up to x (a sketch of this computation is given below).
NOTE: In the accompanying figure one can compare the result of the method of fitting an integral equation with the result of the method of fitting with derivatives.
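To make the idea concrete, here is a rough Python sketch of the integral-equation approach for y = a*ln(c*x + b). This is only an illustration under my own assumptions: the file name 'R_B.txt' is taken from the question, the trapezoidal integration and the second regression step are my own bookkeeping, and the exact formulation in the linked paper may differ in detail.

import numpy as np

# Two-column data file (x, y) from the question; only the raw points are needed.
x, y = np.loadtxt('R_B.txt', unpack=True)

# Numerical antiderivative S(x_i) = integral of y from x_0 to x_i (trapezoidal rule).
S = np.concatenate(([0.0], np.cumsum(0.5*(y[1:] + y[:-1])*np.diff(x))))

# Integrating y = a*ln(c*x + b) from x_0 to x gives
#   S = x*y - x_0*y_0 + (b/c)*(y - y_0) - a*(x - x_0),
# which is linear in the unknowns (b/c) and a, so they come from a linear fit.
lhs = S - x*y + x[0]*y[0]
F = np.column_stack([y - y[0], -(x - x[0])])
(b_over_c, a), *_ = np.linalg.lstsq(F, lhs, rcond=None)

# With a known, exp(y/a) = c*x + b is itself linear in c and b.
c, b = np.polyfit(x, np.exp(y / a), 1)
print(a, c, b)

The two-step result can then serve as the starting point for an ordinary nonlinear fit, exactly as described above.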
Acknowledgements: Alex Sveshnikov did very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximate parameter values to be used in nonlinear-regression software, the two methods are essentially equivalent. Nevertheless, the integral-equation method appears preferable in the case of scattered data.
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn using Alex Sveshnikov's result followed by a further iterative fitting step.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course it is not always this satisfactory; here that is due to the low scatter of the data.
IN ADDITION, an answer to a question raised in the comments by CosmeticMichu:
The problem here is that the fit algorithm starts with "wrong" approximations for the parameters a, k, and b, so during the minimization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values that are close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters            Asymptotic Standard Error
=======================            ==========================
a               = 35.0415          +/- 2.302        (6.57%)
k               = 0.372381         +/- 0.0461       (12.38%)
b               = 1.07012          +/- 0.02016      (1.884%)

correlation matrix of the fit parameters:
                a      k      b
a               1.000
k              -0.994  1.000
b               0.467 -0.531  1.000
The resulting plot (not reproduced here) shows a good fit.
Now the question is: how can you find "good" initial approximations for your parameters? Well, you start with
y = a*log(k*x + b)
If you differentiate this equation you get
dy/dx = a*k/(k*x + b)
or
1/(a*k) = (dx/dy)/(k*x + b)
The left-hand side of this equation is some constant 'C', so the expression on the right-hand side must be equal to this constant as well:
dx/dy = C*(k*x + b) = C*k*x + C*b
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and fit a straight line
dx/dy ≈ C*k*x + C*b
to these points.
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now we need to find the values of a and C. Let's say that we want the resulting graph to pass close to the starting and ending points of our dataset. That means that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally obtain k and b by dividing the fitted products C*k and C*b by C:
k = 0.226
b = 1.0016
These are the values used above for finding the better fit.
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1
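If you prefer to do this bookkeeping outside gnuplot, the same initial-guess procedure can be sketched in Python. This is just an illustration of the derivation above; it assumes a two-column, whitespace-separated 'R_B.txt' and uses scipy for the final nonlinear fit.

import numpy as np
from scipy.optimize import curve_fit

x, y = np.loadtxt('R_B.txt', unpack=True)   # assumed two-column data file

# Reciprocal derivatives (x[i+1]-x[i])/(y[i+1]-y[i]) paired with x[i],
# fitted by a straight line dx/dy ~ Ck*x + Cb, as in the derivation above.
dxdy = np.diff(x) / np.diff(y)
Ck, Cb = np.polyfit(x[:-1], dxdy, 1)

# Recover a, C, k and b exactly as derived above.
x1, y1, xn, yn = x[0], y[0], x[-1], y[-1]
a = (yn - y1) / np.log((Ck*xn + Cb) / (Ck*x1 + Cb))
C = (Ck*x1 + Cb) / np.exp(y1 / a)
k, b = Ck / C, Cb / C

# Use these values as starting points for the full nonlinear fit.
def f(x, a, k, b):
    return a * np.log(k*x + b)

(a, k, b), _ = curve_fit(f, x, y, p0=[a, k, b])
print(a, k, b)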

How fit_intercept parameter impacts linear regression with scikit learn

I am trying to fit a linear model, and my dataset is normalized: each feature is divided by its maximum possible value, so the values range from 0 to 1. From my previous post, Linear Regression vs Closed form Ordinary least squares in Python, I came to know that linear regression in scikit-learn produces the same result as closed-form OLS when the fit_intercept parameter is set to False. I am not quite getting how fit_intercept works.
For any linear problem, if y is the predicted value:
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p
Across the module, the vector w = (w_1, ..., w_p) is denoted as coef_ and w_0 as intercept_.
In closed-form OLS we also have a bias value w_0, and we introduce a vector X_0 = [1...1] before computing the dot product and solving using matrix multiplication and the (pseudo)inverse:
w = np.dot(X.T, X)
w1 = np.dot(np.linalg.pinv(w), np.dot(X.T, Y))
When fit_intercept is True, scikit-learn linear regression solves the problem where, if y is the predicted value,
y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p + b, where b is the intercept term.
What difference does it make to use fit_intercept in a model, and when should one set it to True/False? I was trying to look at the source code, and it seems that the coefficients are normalized by some scale:
if self.fit_intercept:
    self.coef_ = self.coef_ / X_scale
    self.intercept_ = y_offset - np.dot(X_offset, self.coef_.T)
else:
    self.intercept_ = 0
What does this scaling do exactly? I want to interpret the coefficients in both approaches (LinearRegression, closed-form OLS), but since just setting fit_intercept to True/False gives different results for LinearRegression, I can't quite work out the intuition behind them. Which one is better and why?
Let's take a step back and consider the following sentence you said:
since just setting fit_intercept to True/False gives different results for LinearRegression
That is not entirely true. It may or may not be different, and it depends entirely on your data. It would help to understand what goes into the calculation of regression weights. I mean this somewhat literally: what does your input (x) data look like?
Understanding your input data, and understanding why it matters, will help you realize why you sometimes get different results, and why at other times the results are the same.
Data setup
Let's set up some test data:
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(1243)
x = np.random.randint(0,100,size=10)
y = np.random.randint(0,100,size=10)
Our x and y variables look like this:
X Y
51 29
3 73
7 77
98 29
29 80
90 37
49 9
42 53
8 17
65 35
No-intercept model
Recall that the calculation of regression weights has a closed-form solution, which we can obtain using the normal equations:
w = (X^T X)^(-1) X^T y
Using this method, we get a single regression coefficient because we only have 1 predictor variable:
x = x.reshape(-1,1)
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 0.53297593]
Now, let's look at scikit-learn when we set fit_intercept = False:
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 0.53297593]
What happens when we set fit_intercept = True instead?
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[-0.35535884]
It would seem that setting fit_intercept to True and False gives different answers, and that the "correct" answer occurs only when we set it to False, but this is not entirely correct...
Intercept model
At this point we have to consider what our input data actually is. In the models above, our data matrix (also called a feature matrix, or design matrix in statistics) is just a single vector containing our x values. The y variable is not included in the design matrix. If we want to add an intercept to our model, one common approach is to add a column of 1's to the design matrix, so x becomes:
x_vals = x.flatten()
x = np.zeros((10, 2))
x[:,0] = 1
x[:,1] = x_vals
intercept x
0 1.0 51.0
1 1.0 3.0
2 1.0 7.0
3 1.0 98.0
4 1.0 29.0
5 1.0 90.0
6 1.0 49.0
7 1.0 42.0
8 1.0 8.0
9 1.0 65.0
Now, when we use this as our design matrix, we can try the closed form solution again:
w = np.dot(x.T, x)
w1 = np.dot(np.linalg.pinv(w), np.dot(x.T, y))
print(w1)
[ 59.60686058 -0.35535884]
Notice 2 things:
1. We now have 2 coefficients. The first is our intercept and the second is the regression coefficient for the x predictor variable.
2. The coefficient for x matches the coefficient from the scikit-learn output above when we set fit_intercept = True.
So in the scikit-learn models above, why was there a difference between True and False? Because in one case no intercept was modeled, while in the other case the underlying model included an intercept, which is confirmed when you manually add an intercept term/column when solving the normal equations.
If you were to use this new design matrix in scikit-learn, it doesn't matter whether you set fit_intercept to True or False: the coefficient for the predictor variable will not change (the intercept value will be different due to centering, but that's irrelevant for this discussion):
clf = LinearRegression(fit_intercept=False)
print(clf.fit(x, y).coef_)
[ 59.60686058 -0.35535884]
clf = LinearRegression(fit_intercept=True)
print(clf.fit(x, y).coef_)
[ 0. -0.35535884]
Summing up
The output (i.e. coefficient values) you get will be entirely dependent on the matrix that you put into these calculations (whether it's normal equations, scikit-learn, or anything else).
What difference does it make to use fit_intercept in a model, and when should one set it to True/False?
If your design matrix does not contain a 1's column, then normal equations and scikit-learn (fit_intercept = False) will give you the same answer (as you noted). However, if you set the parameter to True, the answer you get will actually be the same as normal equations if you calculated that with a 1's column.
When should you set True/False? As the name suggests, you set False when you don't want to include an intercept in your model. You set True when you do want an intercept, with the understanding that the coefficient values will change, but will match the normal-equations approach when your data includes a 1's column.
So True/False doesn't actually give you different results (compared to normal equations) when considering the same underlying model. The difference you observe is because you're looking at two different statistical models (one with an intercept term, and one without). The reason the fit_intercept parameter exists is so you can create an intercept model without the hassle of manually adding that 1's column. It effectively allows you to toggle between the two underlying statistical models.
Without going into the details of the mathematical formulation: when fit_intercept is set to False, the estimator deliberately sets the intercept to zero, which in turn affects the other regressors, as the 'responsibility' for reducing the error falls onto those factors. As a result, the results can be very different in the two cases if the model is sensitive to the presence of an intercept term. The centering/scaling shifts the origin, thereby allowing the same closed-form solution to be used for both the intercept and intercept-free models.
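To see concretely what that centering does, here is a small sketch with synthetic data (variable names are illustrative, not from the question): fitting with fit_intercept=True gives the same coef_ as manually centering X and y and fitting with fit_intercept=False, and the intercept is recovered exactly as in the source snippet quoted in the question.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = 10 * rng.rand(20, 2)
y = 3.0*X[:, 0] - 2.0*X[:, 1] + 5.0 + rng.randn(20)

# Intercept model: scikit-learn centers X and y internally, fits the slopes,
# then sets intercept_ = y_offset - X_offset @ coef_ (as in the quoted source).
m1 = LinearRegression(fit_intercept=True).fit(X, y)

# Centering by hand and fitting without an intercept gives the same slopes.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
m2 = LinearRegression(fit_intercept=False).fit(Xc, yc)

print(m1.coef_)    # slopes close to the true values 3 and -2
print(m2.coef_)    # identical to m1.coef_ up to floating-point error
print(m1.intercept_, y.mean() - X.mean(axis=0) @ m1.coef_)   # same value twice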

Fitting exponent with gnuplot

I am trying to fit the data below to the form given under "Formula". I am most interested in 'c' (I know that c ≈ 1/8 and b ≈ 3) but would like to extract all these values from the data.
Formula:
y = a*(x-b)**c
Values.txt:
# "values.txt"
2.000000e+00 6.058411e-04
2.200000e+00 5.335520e-04
2.400000e+00 3.509583e-03
2.600000e+00 1.655943e-03
2.800000e+00 1.995418e-03
3.000000e+00 9.437851e-04
3.200000e+00 5.516159e-04
3.400000e+00 6.765981e-04
3.600000e+00 3.860859e-04
3.800000e+00 2.942881e-04
4.000000e+00 5.039975e-04
4.200000e+00 3.962199e-04
4.400000e+00 4.659717e-04
4.600000e+00 2.892683e-04
4.800000e+00 2.248839e-04
5.000000e+00 2.536980e-04
I have tried using the following commands in gnuplot, but I am not getting meaningful results:
f(x) = a*(x-b)**c
b = 3
c = 1/8
fit f(x) "values.txt" via a,b,c
Does anyone know the best way to extract these values? I would rather not provide initial guesses for 'b' & 'c' if possible.
Thanks,
J
The main problem with your fitting function is finding b. You can express your equation as a linear function in log(x-b): log(y) = log(a) + c*log(x-b). After that, the fitting is trivial:
b = 3
f(x) = c0 + c1 * x
fit f(x) "values.txt" using (log($1-b)):(log($2)) via c0, c1
a = exp(c0)
c = c1
As you see, you need to provide b but do not need initial guesses for the other parameters because it's a trivial linear fit.
Now, I would suggest that you provide a series of values of b and check how good the fit is for each value. gnuplot gives you the error in each fitted parameter. You can then plot the overall error (error_c0 + error_c1) as a function of b and figure out for which b the error is minimal. Near the optimum b the curve of error_c0 + error_c1 vs. b should be roughly quadratic, with its minimum at b_opt. Then run the fit as in the code above with this b = b_opt and get a and c. A rough sketch of this scan is given below.
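Here is a rough Python sketch of that scan, purely as an illustration; it assumes the two-column values.txt shown above and, to keep the logarithms defined, only tries b below the smallest x value (for larger b you would have to drop the points with x <= b, as gnuplot silently does).

import numpy as np

x, y = np.loadtxt('values.txt', unpack=True)

best = None
for b in np.arange(0.0, 1.95, 0.05):                # candidate b values, keeping x - b > 0
    coeffs, cov = np.polyfit(np.log(x - b), np.log(y), 1, cov=True)
    err = np.sqrt(cov[0, 0]) + np.sqrt(cov[1, 1])   # error_c1 + error_c0
    if best is None or err < best[0]:
        best = (err, b, np.exp(coeffs[1]), coeffs[0])

err, b, a, c = best
print('b =', b, 'a =', a, 'c =', c)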

lsqcurvefit when expecting small coefficients

I've generated a plot of the attenuation seen in an electrical trace up to a frequency of 14e10 rad/s. The y data ranges from approximately 1 to 10 Np/m. I'm trying to generate a fit of the form
y = A*sqrt(x) + B*x + C*x^2.
I expect A to be around 10^-6, B to be around 10^-11, and C to be around 10^-23. However, the smallest coefficient lsqcurvefit will return is 10^-7. Also, it will only return a nonzero coefficient for A, while returning 0 for B and C. The fit actually looks really good; however, the physics indicates that B and C should not be 0.
Here is how I'm calling the function
% measurement estimate
x_alpha = [1e-6 1e-11 1e-23];
lb = [1e-7, 1e-13, 1e-25];
ub = [1e-3, 1e-6, 1e-15];
x_alpha = lsqcurvefit(@modelfun, x_alpha, omega, alpha_t, lb, ub)
Here is the model function
function [ yhat ] = modelfun( x, xdata )
    yhat = x(1)*xdata.^.5 + x(2)*xdata + x(3)*xdata.^2;
end
Is it possible to get lsqcurvefit to return such small coefficients? Is the error due to rounding or is it something else? Is there any way I can change the tolerance to get a fit closer to what I expect?
I found a Stack Overflow page that seems to address this issue:
fit using lsqcurvefit
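That page is not reproduced here, but a common remedy for coefficients of wildly different magnitudes is to rescale the independent variable so that the fitted parameters are all of order one, and convert back afterwards. Since this particular model is linear in A, B and C, the idea can be sketched with plain linear least squares in Python (synthetic data; every name here is illustrative, not from the question's code):

import numpy as np

# Synthetic data roughly matching the scales in the question:
# omega up to ~1.4e11 rad/s, attenuation of a few Np/m.
omega = np.linspace(1e9, 1.4e11, 200)
A_true, B_true, C_true = 1e-6, 1e-11, 1e-23
alpha = A_true*np.sqrt(omega) + B_true*omega + C_true*omega**2
alpha = alpha + 0.01*np.random.default_rng(0).standard_normal(omega.size)

# Work in scaled units w = omega/s so the three basis columns (and hence the
# fitted coefficients) are all of order one.
s = 1e11
w = omega / s
M = np.column_stack([np.sqrt(w), w, w**2])
coef_scaled, *_ = np.linalg.lstsq(M, alpha, rcond=None)

# Undo the scaling to get the coefficients in the original units.
A = coef_scaled[0] / np.sqrt(s)
B = coef_scaled[1] / s
C = coef_scaled[2] / s**2
print(A, B, C)    # should come out near 1e-6, 1e-11, 1e-23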

Evaluating and graphing functions in MATLAB

I am trying to graph the following Gaussian function in MATLAB (it should graph in 3 dimensions), but I am making a mistake somewhere. What is wrong?
sigma = 1
for i = 1:20
    for j = 1:20
        z(i,j) = (1/(2*pi*sigma^2))*exp(-(i^2+j^2)/(2*sigma^2));
    end
end
surf(z)
The problem you are likely having is that you are evaluating the Gaussian over the range of 1 to 20 for both i and j. Since sigma is 1, you are only going to see a segment of one side of the Gaussian (not including the center at [i,j] = [0,0]), and the values of z from 3 to 20 in each direction are very close to 0.
Instead of using for loops, you can do things "the MATLAB way" by creating matrices of x and y values using the function MESHGRID and performing element-wise operations on them to compute and plot z:
[x,y] = meshgrid(-4:0.1:4); %# Use values from -4 to 4 in x and y directions
z = (1/(2*pi*sigma^2)).*exp(-(x.^2+y.^2)./(2*sigma^2)); %# Compute z
surf(x,y,z); %# Plot z
