Is overwriting happening in the following code, and how can I avoid it?

I wrote the following piece of code (shown at the end of this question). It runs without errors, but I think it has an overwriting problem. During the program there are two cases where I want to draw graphs: first, the graphs of the curves drawn with ezplot, and second, the regression lines drawn with plotregression.
When I leave out the call plotregression(C_i, D_i), it has no problem displaying the graphs of all five logistic functions (a user here showed me the hold on/hold off commands to do that), but when I include plotregression(C_i, D_i), two things happen:
1. It shows me the regression lines, but instead of collecting all of them in the same figure, it keeps replacing the regression line as the regression coefficients vary. You can actually see this happening if you run the code.
2. The ezplot output is gone; it no longer plots the graphs of the five logistic functions.
I have two questions:
1. If I want to get two figures, one showing all five logistic curves and the other showing all five regression lines, how can I modify the program minimally to get the job done?
2. How can I stop the regression curves from being overwritten? I used hold on/hold off to avoid this for the logistic curves, but why is it not working for the regression curves?
Here's the code:
syms t;
X = []; Y = []; Z = []; G = [];              % initialise the accumulators used below
hold on;
for i = 1:5
    P_i = 0.009;
    r_i = abs(sin(i.^-1));
    y_i(t) = P_i*exp(r_i*t)/(1 + P_i*(exp(r_i*t) - 1));    % logistic curve for subject i
    t_1 = 1 + rand; t_2 = 16 + rand; t_3 = 31 + rand;      % three jittered time points
    time_points = [1, t_1; 1, t_2; 1, t_3];
    biomarker_values = double([y_i(t_1); y_i(t_2); y_i(t_3)]);
    X = vertcat(X, time_points);
    Z = blkdiag(Z, time_points);
    Y = vertcat(Y, biomarker_values);
    G = vertcat(G, [i, i, i]');
    ezplot(y_i, [-50, 100]);                 % adds the i-th logistic curve to the held figure
    C_i = time_points(:, 2);
    D_i = biomarker_values;
    plotregression(C_i, D_i)                 % draws into its own figure, redrawn on each pass
end
hold off;

Related

Separating points following different linear regressions

Given two variables with the same number of observations, the scatter plot apparently shows that they follow three different linear regressions. How could you separate the points into three groups with different linear fits?
There exist specialized clustering algorithms for this.
Google for "correlation clustering".
If they all go through 0, it may be easier to apply a suitable feature transformation to make them separable. So don't neglect preprocessing; it's the most important part.
I would calculate the slope of the segment between every pair of points, so with n points you get n(n-1)/2 slope values, and then use a clustering algorithm on those slopes.
It is the same idea that underlies the Theil–Sen estimator.
It just came to my mind and seems worth a try.
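To make the slope idea concrete, here is a rough Python sketch; it uses the simpler per-point slope y/x (which assumes the lines pass through 0, as mentioned above) rather than all n(n-1)/2 pairwise slopes, and the toy data and the choice of KMeans are mine, not part of the answer:

import numpy as np
from sklearn.cluster import KMeans

# toy data: three lines through the origin with different slopes, plus noise (assumed example)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 300)
true_group = rng.integers(0, 3, 300)
slopes = np.array([0.5, 2.0, 5.0])
y = slopes[true_group] * x + rng.normal(0, 0.2, 300)

# if the lines all pass through the origin, the per-point slope y/x is already a good feature
point_slope = (y / x).reshape(-1, 1)

# cluster the slopes into three groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(point_slope)

# fit a separate least-squares line through the origin to each cluster
for k in range(3):
    mask = labels == k
    slope_k = np.sum(x[mask] * y[mask]) / np.sum(x[mask] ** 2)
    print(f"cluster {k}: estimated slope {slope_k:.2f}")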
Seems to be a mixture of regressions. There are several packages to do this. One of them is FlexMix, though it is not entirely satisfying. I put what I got and what I expected below.
I think I partly solved the problem. We can use the R package flexmix to achieve this, as the lowest panel shows. The package works fine on two other groups of data with known fits. The separation ratio can reach as high as 90%, with fitted coefficients close to the known coefficients.
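If an R/flexmix setup is not available, the same mixture-of-regressions idea can be sketched with a small EM loop in Python. Everything below, including the two-component toy data, is an assumed illustration rather than the FlexMix code used in the answer:

import numpy as np

rng = np.random.default_rng(1)
n, K = 400, 2
x = rng.uniform(0, 10, n)
group = rng.integers(0, K, n)
true_coefs = np.array([[1.0, 0.5], [5.0, -0.3]])            # intercept, slope per component
y = true_coefs[group, 0] + true_coefs[group, 1] * x + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x])                         # design matrix with intercept
beta = rng.normal(size=(K, 2))                               # initial coefficients
sigma = np.ones(K)
pi = np.full(K, 1.0 / K)

for _ in range(200):
    # E-step: responsibility of each component for each point
    dens = np.stack([
        pi[k] * np.exp(-0.5 * ((y - X @ beta[k]) / sigma[k]) ** 2) / sigma[k]
        for k in range(K)
    ], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted least squares per component
    for k in range(K):
        w = resp[:, k]
        beta[k] = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        sigma[k] = np.sqrt(np.sum(w * (y - X @ beta[k]) ** 2) / w.sum())
        pi[k] = w.mean()

print("estimated (intercept, slope) per component:")
print(beta)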

How to visualize error surface in keras?

We see pretty pictures of an error surface with a global minimum and the convergence of a neural network in many books. How can I visualize something similar in Keras, i.e. the error surface and how my model converges towards the globally minimal error? Below is an example image of such illustrations, and this link has animated illustrations of different optimizers. I explored the TensorBoard log callback for this purpose but could not find any such thing. A little guidance would be appreciated.
The pictures and animations are made for didactic purposes, but the real error surface is completely unknown (or incredibly complex to understand or visualize). That's the whole idea behind using gradient descent.
We only know, at a single point, the direction in which the function increases, by computing the current gradient.
You could try to plot the path (line) you're following by recording the weight values and the error at each iteration, but then you'd face another problem: it's a massively multidimensional function, not actually a surface. The number of variables is the number of weights in the model (often thousands or even millions). This is practically impossible to visualize or even conceive of as a visual thing.
To plot such a surface, you'd have to manually change all those thousands of weights to get the error for each arrangement. Besides the "impossible to visualize" problem, this would be excessively time consuming.
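As a rough illustration of the "record the error along the way" suggestion, here is a minimal Keras sketch; the toy data, architecture, and optimizer are assumptions made for the example, not a prescription:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

class LossHistory(tf.keras.callbacks.Callback):
    """Stores the training loss after every batch: one point along the descent path."""
    def __init__(self):
        super().__init__()
        self.losses = []
    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])

# toy regression problem standing in for your own data
x = np.random.rand(1000, 10)
y = x @ np.random.rand(10, 1) + 0.1 * np.random.randn(1000, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

history = LossHistory()
model.fit(x, y, epochs=5, batch_size=32, callbacks=[history], verbose=0)

# the curve of loss versus iteration is the one-dimensional view of the descent you can actually get
plt.plot(history.losses)
plt.xlabel("iteration (batch)")
plt.ylabel("training loss")
plt.show()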

Getting distance to the hyperplane from sklearn's svm.svc

I'm currently using SVC to separate two classes of data (the features below are named data and the labels condition). After fitting the data using GridSearchCV I get a classification score of about .7 and I'm fairly happy with that number. After that, though, I went to get the relative distances from the hyperplane for the data of each class using grid.best_estimator_.decision_function() and plotted them in a boxplot and a histogram to get a better idea of how much overlap there is. My problem is that in the histogram and the boxplot these look perfectly separable, which I know is not the case. I'm sure I'm calling decision_function() incorrectly but not sure how to do this really.
from sklearn.svm import SVC
from sklearn.model_selection import KFold, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sb

cv = KFold(n_splits=4, shuffle=True)
svc = SVC(kernel='linear', probability=True, decision_function_shape='ovr')
C_range = [1, .001, .005, .01, .05, .1, .5, 5, 50, 10, 100]
param_grid = dict(C=C_range)
grid = GridSearchCV(svc, param_grid=param_grid, cv=cv, n_jobs=4, iid=False, refit=True)
grid.fit(data, condition)
print(grid.best_params_)
print(grid.best_score_)
x = grid.best_estimator_.decision_function(data)   # signed decision scores for every sample
plt.hist(x)
sb.boxplot(condition, x)
sb.swarmplot(condition, x)
In the histogram and box plots it looks like almost all of the points have a distance of exactly +1 or -1, with nothing in between.
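One thing that may help the visualization is splitting the scores by class before plotting. Here is a sketch that reuses the grid, data, and condition variables from the question; the per-class split and the division by the weight-vector norm are my additions, not a claim about what caused the ±1 pattern:

import numpy as np
import matplotlib.pyplot as plt

clf = grid.best_estimator_                       # the refitted linear SVC
scores = clf.decision_function(data)             # w.x + b for each sample
distances = scores / np.linalg.norm(clf.coef_)   # geometric distance to the hyperplane (linear kernel)

cond = np.asarray(condition)
# plot the distance distribution separately for each class to see the actual overlap
for label in np.unique(cond):
    plt.hist(distances[cond == label], bins=30, alpha=0.5, label=str(label))
plt.axvline(0, color='k', linestyle='--')        # the hyperplane itself
plt.xlabel('signed distance to the separating hyperplane')
plt.legend()
plt.show()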

Invalid Graphics State - ongoing issue for beginner RStudio/R user

I am working on an assignment for a course. The code creates variables for use in a datadist() call from rms. We then create a simple linear regression model using ols(). Printing/creating the first plot before the datadist() and ols() calls is simple. We use:
plot(x,y,pch='o')
lines(x,yTrue,lty=2,lwd=.5,col='red')
Then we create the datadist object and the ols() fit, here named fit0.
mydat=data.frame(x=x,y=y)
dd=datadist(mydat)
options(datadist='dd')
fit0=ols(y~x,data=mydat)
fit0
anova(fit0)
This all works smoothly, printing out the results of the linear regression and the ANOVA table. Then we want to predict based on the model and plot these predictions. The plot prints out nicely; however, the lines and points won't show up here. The code:
ff=Predict(fit0)
plot(ff)
lines(x,yTrue,lwd=2,lty=1,col='red')
points(x,y,pch='.')
Note: this works fine in R. I much prefer to use RStudio, though I can switch to R if there's no clear solution to this issue. I've tried dev.off() several times, I've tried closing RStudio and re-opening it, I've uninstalled and reinstalled R, RStudio, and the rms package (which includes ggplot2), updated the packages, and made my RStudio graphics window larger. No solution I've seen works. Help!

Excel Polynomial Regression with multiple variables

I saw a lot of tutorials online on how to use polynomial regression in Excel and how to do multiple regression, but none that explain how to deal with multiple variables AND polynomial regression at the same time.
In my sheet, the left columns contain all my variables X1, X2, X3, X4 (say they are features of a car), and Y1 is the price of the car I am looking for.
I have about 5000 rows of data from running a model with various values of X1, X2, X3, X4, and I am looking to fit a regression so that I can get a good estimate of the model's output without having to run it (saving me valuable computing time).
So far I've managed to do multiple linear regression using the Data Analysis pack in Excel, just by using X1, X2, X3, X4. I noticed, however, that the regression looks very messy and inaccurate in places, which is due to the fact that my variables X1, X2, X3, X4 affect my output Y1 non-linearly.
I had a look online and, to add polynomials to the mix, tutorials suggest adding an X^2 column. But when I do that (see the right part of the chart) my regression is much, much worse than when I use linear fits.
I know that polynomials can over-fit the data, but I thought that using a quadratic form was safe, since the regression would only have to return a coefficient of 0 to ignore any excess polynomial orders.
Any help would be very welcome.
For info, I get an adjusted R^2 of 0.91 for the linear fit and 0.66 when I add a few X^2 columns.
So far this is the best regression I can get (black line is 1:1):
As you can see I would like to increase the fit for the bottom left part and top right parts of the curve.
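For comparison outside Excel, here is a small Python sketch of a full quadratic fit in several variables; the data below is synthetic and only stands in for the 5000 rows of X1..X4 and Y1:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 4))            # stands in for X1..X4
y = 3*X[:, 0] + X[:, 1]**2 + 2*X[:, 2]*X[:, 3] + rng.normal(0, 0.05, 5000)

# degree=2 generates every quadratic term: the squares X1^2..X4^2 and all cross terms X1*X2, X1*X3, ...
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, y)
print("R^2 on the training data:", model.score(X, y))

One possible reason the Excel attempt got worse is that only squared columns were added: a full quadratic model in several variables also includes the cross-product columns (X1*X2, X1*X3, ...), which capture the interactions between the features.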
