Kaplan Meier survival curve evaluation - survival-analysis

I have generated a Kaplan-Meier survival curve on consumer data (the event of interest is 'Churn'). I have a survival curve for both buyers and non-buyers. Before putting the model to use, I want to know how I can evaluate how trustworthy the curve is.
I have already tried creating separate curves for two consumer cohorts (who joined in different years) over a span of 36 months. I noticed that these curves are not similar at all. I believe this is not the right way to evaluate the curve. Can somebody tell me what else can be tried to evaluate the survival curve, apart from the statistical methods?
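For concreteness, the product-limit (Kaplan-Meier) estimate behind such a curve can be sketched in plain Python. The function name `kaplan_meier` and the toy durations are illustrative assumptions, not the data from the question.

```python
from collections import Counter

def kaplan_meier(durations, churned):
    """Return [(t, S(t))] at every time t where at least one churn occurred.

    durations: months each customer was observed (hypothetical input).
    churned:   True if the customer churned at that time, False if censored.
    """
    events = Counter(t for t, e in zip(durations, churned) if e)
    exits = Counter(durations)      # everyone leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk customers who did not churn at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Toy cohort: churn at months 1, 2 and 3; censored at months 2 and 4.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Running the same function on each cohort separately gives the per-cohort curves described in the question.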

Related

How to use a residual plot to determine if the relationship looks linear

I was attempting some questions based on residplot() in seaborn. There were two residual plots for which I had to tell whether the relationship is linear. Can anyone explain how this is determined just by looking at the plot? Apparently:
1. This plot shows a linear relationship
2. This plot shows a non-linear relationship
Roughly speaking, these residual plots enable you to visually check whether the residuals still contain some nonlinear behaviour with respect to your explanatory variables. Two remarks for further explanation:
The residuals of a correctly specified model (e.g. the baseline linear model) should be similar to random noise. In the absence of remaining patterns in the residuals, we have no indication that important features have been omitted.
If the residuals suggest a pattern, then this means that we failed to take some (nonlinear) effects into account. You should reconsider the model specification. If the baseline model was linear, then including some nonlinear terms might "clean" the residuals.
This kind of visual inspection is often subjective. However, you can argue that 1. is just a random cloud of points whereas 2. shows some remaining curvature. There is also a statistical test that does this kind of assessment for you: the Ramsey Regression Equation Specification Error Test (RESET).
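As a rough numeric companion to eyeballing the plot, here is a minimal sketch that fits a straight line to synthetic data and checks whether the residuals still correlate with x squared. The data, the helper names and the "curvature score" are assumptions made for illustration; this is a crude stand-in, not the RESET test itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
noise = rng.normal(scale=0.5, size=x.size)
y_linear = 2.0 * x + 1.0 + noise                  # truly linear relationship
y_curved = 2.0 * x + 1.0 + 1.5 * x**2 + noise     # hidden quadratic term

def linear_residuals(x, y):
    """Residuals of an ordinary least-squares straight-line fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

def curvature_score(x, residuals):
    """Absolute correlation between the residuals and x**2 (near 0 = no curvature left)."""
    return abs(np.corrcoef(x**2, residuals)[0, 1])

score_linear = curvature_score(x, linear_residuals(x, y_linear))
score_curved = curvature_score(x, linear_residuals(x, y_curved))
```

In this setup the curved data should score close to 1 and the linear data much lower, which mirrors the "random cloud" versus "remaining curvature" reading of the two plots.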

Gaussian Mixture Models for pixel clustering

I have a small set of aerial images where the different terrains visible in each image have been labelled by human experts. For example, an image may contain vegetation, river, rocky mountains, farmland etc. Each image may have one or more of these labelled regions. Using this small labelled dataset, I would like to fit a Gaussian mixture model for each of the known terrain types. After this is complete, I would have N GMMs, one for each of the N terrain types that I might encounter in an image.
Now, given a new image, I would like to determine for each pixel, which terrain it belongs to by assigning the pixel to the most probable GMM.
Is this the correct line of thought? And if yes, how can I go about clustering an image using GMMs?
It's not clustering if you use labeled training data!
You can, however, use the labeling function of GMM clustering easily.
For this, compute the prior probabilities, means, and covariance matrices of each class, and invert the covariance matrices. Then classify each pixel of the new image by the maximum probability density (weighted by the prior probabilities) using the multivariate Gaussians from the training data.
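The recipe above can be sketched with NumPy as follows. This assumes a single Gaussian per terrain (a one-component mixture) rather than a full GMM per class, and the RGB values, terrain names and function names are made up for illustration.

```python
import numpy as np

def fit_class_gaussians(pixels, labels):
    """Estimate prior, mean and inverse covariance for every labelled terrain."""
    params = {}
    for terrain in np.unique(labels):
        x = pixels[labels == terrain]
        cov = np.cov(x, rowvar=False)
        params[terrain] = {
            "log_prior": np.log(len(x) / len(pixels)),
            "mean": x.mean(axis=0),
            "inv_cov": np.linalg.inv(cov),
            "log_det": np.linalg.slogdet(cov)[1],
        }
    return params

def classify(pixels, params):
    """Assign each pixel to the terrain with the largest prior-weighted log density."""
    terrains = list(params)
    scores = []
    for terrain in terrains:
        p = params[terrain]
        diff = pixels - p["mean"]
        mahalanobis = np.einsum("ij,jk,ik->i", diff, p["inv_cov"], diff)
        # The -d/2 * log(2*pi) term is identical for all classes, so it is dropped.
        scores.append(p["log_prior"] - 0.5 * (mahalanobis + p["log_det"]))
    return np.array(terrains)[np.argmax(scores, axis=0)]

rng = np.random.default_rng(0)
water = rng.normal([30, 60, 120], 5, size=(200, 3))    # made-up RGB training pixels
forest = rng.normal([40, 110, 50], 5, size=(200, 3))
train = np.vstack([water, forest])
labels = np.array(["water"] * 200 + ["forest"] * 200)

params = fit_class_gaussians(train, labels)
predicted = classify(np.array([[32.0, 58.0, 118.0], [41.0, 108.0, 52.0]]), params)
```

Working with log densities avoids underflow for pixels far from every class mean; the ranking of classes is the same as with the raw weighted densities.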
Intuitively, your thought process is correct. If you already have the labels that makes this a lot easier.
For example, let's pick a very well known, non-parametric algorithm like k-Nearest Neighbors https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
In this algorithm, for each new pixel you find the k training pixels closest to the one you are currently evaluating, where "closest" is determined by some distance function (usually Euclidean). You then assign the new pixel the most frequently occurring classification label among those k neighbors.
I am not sure if you are looking for a specific algorithm recommendation, but KNN would be a very good algorithm to begin testing this type of exercise on. I saw you tagged sklearn; scikit-learn has a very good KNN implementation that I suggest you read up on.
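To make the voting step concrete, here is a minimal pure-Python sketch of KNN with Euclidean distance. The pixel values, labels and the helper name `knn_predict` are invented for illustration; for real images, a library implementation such as the scikit-learn one mentioned above would be the more practical choice.

```python
import math
from collections import Counter

def knn_predict(train_pixels, train_labels, pixel, k=3):
    """Label `pixel` by majority vote among its k nearest training pixels."""
    by_distance = sorted(
        zip(train_pixels, train_labels),
        key=lambda pair: math.dist(pair[0], pixel),   # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up RGB training pixels with expert labels.
train_pixels = [(30, 60, 120), (28, 65, 125), (35, 58, 110),
                (40, 110, 50), (45, 105, 55), (38, 115, 48)]
train_labels = ["water", "water", "water", "forest", "forest", "forest"]

label_a = knn_predict(train_pixels, train_labels, (33, 62, 118))
label_b = knn_predict(train_pixels, train_labels, (42, 108, 52))
```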

Extrapolate bell shape from a section of curve

I am interested in extrapolating the curve of a population that I know is normally distributed. However, in my process, I am only able to get access to a section of the curve (from -3 standard deviations to -2 standard deviations). My question is: what is the best way to fit a curve to a section of a bell curve?
A normal distribution can be defined by an equation f(x) (its PDF, which is a little difficult to write without LaTeX; you can check out the Wikipedia page) with two parameters: the mean and the variance (or standard deviation).
Therefore, if you want to know which mean and variance define it, you just need to solve for the mean and the variance given two known points on the curve (of which you have infinitely many, even on a short interval).
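One way to carry this out numerically, sketched below, uses the fact that the logarithm of a normal PDF is a parabola in x: log f(x) = a x^2 + b x + c with a = -1/(2 sigma^2) and b = mu/sigma^2. Note that this swaps the exact two-point solve for a least-squares parabola fit over several points, and it assumes you have (near) exact density values on the section; the population parameters and the function name are made up.

```python
import numpy as np

def normal_from_density_section(x, density):
    """Recover (mean, std) from density values by fitting a parabola to the log-density."""
    a, b, _ = np.polyfit(x, np.log(density), deg=2)
    variance = -1.0 / (2.0 * a)        # from a = -1 / (2 * sigma^2)
    return -b / (2.0 * a), np.sqrt(variance)   # mean from b = mu / sigma^2

true_mean, true_std = 10.0, 2.0        # made-up population
# Only the section from -3 to -2 standard deviations is observed.
x = np.linspace(true_mean - 3 * true_std, true_mean - 2 * true_std, 20)
density = np.exp(-((x - true_mean) ** 2) / (2 * true_std**2)) / (true_std * np.sqrt(2 * np.pi))

mean, std = normal_from_density_section(x, density)
```

Because only the curvature and slope of the log-density are used, the fit does not depend on the overall scale of the density values. With noisy measurements from such a narrow tail section, the recovered parameters are likely to be much less stable than in this noise-free example.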

Excel Polynomial Regression with multiple variables

I saw a lot of tutorials online on how to use polynomial regression in Excel, and on multiple regression, but none which explain how to combine multiple variables AND polynomial terms.
In the chart, the left columns contain all my variables X1,X2,X3,X4 (say they are features of a car), and Y1 is the price of the car I am looking for.
I got about 5000 lines of data that I got from running a model with various values of X1,X2,X3,X4 and I am looking to make a regression so that I can get a best estimate of my model without having to run it (saving me valuable computing time).
So far I've managed to do multiple linear regression using the Data Analysis pack in Excel, just by using X1,X2,X3,X4. I noticed however that the regression looks very messy and inaccurate in places, which is due to the fact that my variables X1,X2,X3,X4 affect my output Y1 non-linearly.
I had a look online, and to add polynomials to the mix, tutorials suggest adding an X^2 column. But when I do that (see right part of the chart) my regression is much, much worse than when I use linear fits.
I know that polynomials can over-fit the data, but I thought that using a quadratic form was safe since the regression would only have to return a coefficient of 0 to ignore any excess polynomial orders.
Any help would be very welcome.
For info, I get an adjusted R^2 of 0.91 for linear fits and 0.66 when I add a few X^2 columns.
So far this is the best regression I can get (black line is 1:1):
As you can see, I would like to improve the fit for the bottom-left and top-right parts of the curve.
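The reasoning in the question (extra squared columns can always be ignored with zero coefficients) can be checked outside Excel with a small least-squares sketch. The synthetic data below is an assumption standing in for the 5000 model runs; with the same rows and an intercept, ordinary least squares cannot produce a lower plain R^2 after adding columns, so a drop as large as 0.91 to 0.66 may point to how the columns were fed to the regression rather than to the polynomial terms themselves.

```python
import numpy as np

def r_squared(design, y):
    """R^2 of an ordinary least-squares fit of y on the columns of `design`."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(5000, 4))           # stand-in for X1..X4
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 - X[:, 2] * X[:, 3] + rng.normal(size=5000)

ones = np.ones((len(X), 1))
linear = np.hstack([ones, X])                        # intercept + X1..X4
quadratic = np.hstack([linear, X**2])                # plus X1^2..X4^2 columns

r2_linear = r_squared(linear, y)
r2_quadratic = r_squared(quadratic, y)
```

The quadratic design here keeps the original linear columns next to the squared ones, which is what lets the fit fall back to the linear solution when the squares do not help.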

Average and Measure of Spread of 3D Rotations

I've seen several similar questions, and have some ideas of what I might try, but I don't remember seeing anything about spread.
So: I am working on a measurement system, ultimately computer vision based.
I take N captures, and process them using a library which outputs pose estimations in the form of 4x4 affine transformation matrices of translation and rotation.
There's some noise in these pose estimations. The standard deviation in Euler angles for each axis of rotation is less than 2.5 degrees, so all orientations are pretty close to each other (for a case where all Euler angles are close to 0 or 180). Standard errors of less than 0.25 degrees are important to me. But I have already run into the problems endemic to Euler angles.
I want to average all these pretty-close-together pose estimates to get a single final pose estimate. And I also want to find some measure of spread so that I can estimate accuracy.
I'm aware that "average" isn't actually well defined for rotations.
(For the record, my code is in Numpy-heavy Python.)
I also may want to weight this average, since some captures (and some axes) are known to be more accurate than others.
My impression is that I can just take the mean and standard deviation of the translation vector, and that for the rotation I can convert to quaternions, take the mean, and re-normalize with OK accuracy since these quaternions are pretty close together.
I've also heard mentions of least-squares across all the quaternions, but most of my research into how this would be implemented has been a dismal failure.
Is this workable? Is there a reasonably well-defined measure of spread in this context?
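The quaternion idea described above (average, then re-normalize) can be sketched in NumPy as follows, together with one possible spread measure: the angle between each capture and the mean rotation. The (w, x, y, z) ordering, the function names and the sample rotations are assumptions for illustration; the sign alignment step is there because q and -q describe the same rotation.

```python
import numpy as np

def mean_quaternion(quats, weights=None):
    """Weighted mean of nearby unit quaternions (w, x, y, z), re-normalized."""
    quats = np.asarray(quats, dtype=float)
    # q and -q encode the same rotation, so flip signs into one hemisphere first.
    signs = np.where(quats @ quats[0] < 0.0, -1.0, 1.0)
    aligned = quats * signs[:, None]
    mean = np.average(aligned, axis=0, weights=weights)
    return mean / np.linalg.norm(mean)

def angular_spread_deg(quats, mean):
    """Angle in degrees between each quaternion and the mean rotation."""
    dots = np.clip(np.abs(np.asarray(quats, dtype=float) @ mean), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dots))

# Three captures: rotations of -2, 0 and +2 degrees about the z axis.
half_angles = np.radians([-2.0, 0.0, 2.0]) / 2.0
quats = np.array([[np.cos(h), 0.0, 0.0, np.sin(h)] for h in half_angles])
quats[2] *= -1.0                       # same rotation, opposite sign

mean = mean_quaternion(quats)
spread = angular_spread_deg(quats, mean)
```

The `weights` argument is passed straight to `np.average`, so per-capture weights can be supplied; per-axis weighting is not covered by this sketch.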
Without more info about your geometry setup, it is hard to answer. Anyway, for rotations I would:
1. Create 3 unit vectors x=(1,0,0), y=(0,1,0), z=(0,0,1), apply the rotation to them, and call the outputs x(i), y(i), z(i). This is just applying matrix(i) with the position at (0,0,0).
2. Do this for all the measurements you have.
3. Average all the vectors:
X=avg(x(1),x(2),...x(n))
Y=avg(y(1),y(2),...y(n))
Z=avg(z(1),z(2),...z(n))
4. Correct the vector values: make each of X, Y, Z a unit vector again and take the axis that is closest to the rotation axis as the main axis. It stays as is; recompute the remaining two axes as cross products of the main axis and the other vector to ensure orthogonality. Beware of the multiplication order (the wrong order of operands will negate the output).
5. Construct the averaged transform matrix (see transform matrix anatomy). As the origin you can use the averaged origin of the measurement matrices.
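The steps above can be sketched in NumPy like this. It works on the 3x3 rotation blocks of the pose matrices (for a 4x4 matrix `T`, that would be `T[:3, :3]`), whose columns are exactly the rotated unit vectors. Keeping X as the main axis is an arbitrary choice made for this sketch, and the helper names and sample rotations are assumptions.

```python
import numpy as np

def average_rotation_axes(rotations):
    """Average the rotated basis vectors, then rebuild an orthonormal rotation matrix."""
    mean = np.mean(np.asarray(rotations, dtype=float), axis=0)  # columns: averaged x, y, z
    x, y = mean[:, 0], mean[:, 1]
    x = x / np.linalg.norm(x)          # X is kept as the main axis (arbitrary choice)
    z = np.cross(x, y)                 # operand order matters: y cross x would flip Z
    z = z / np.linalg.norm(z)
    y = np.cross(z, x)                 # unit length already, since z and x are orthonormal
    return np.column_stack([x, y, z])

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

captures = [rot_z(-2.0), rot_z(0.0), rot_z(2.0)]
averaged = average_rotation_axes(captures)
```

The translation part can be averaged separately and placed in the last column of the final 4x4 matrix.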
Moakher wrote a paper that explains there are basically two ways to take an average of rotation matrices. The first is a weighted average followed by a projection back to SO(3) using the SVD. The second is the Riemannian center of mass. That one is closer to the notion of a geometric mean, and it's more complicated to compute.
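A minimal sketch of the first option (weighted arithmetic mean, then projection onto SO(3) with the SVD) could look like the following. The function names and sample rotations are assumptions for illustration, and the Riemannian center of mass is not covered here.

```python
import numpy as np

def project_to_so3(matrix):
    """Closest rotation matrix (in the Frobenius norm) to a 3x3 matrix, via the SVD."""
    u, _, vt = np.linalg.svd(matrix)
    # Flip the last singular direction if needed so that the determinant is +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

def mean_rotation(rotations, weights=None):
    """Weighted arithmetic mean of rotation matrices, projected back onto SO(3)."""
    return project_to_so3(np.average(rotations, axis=0, weights=weights))

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

captures = [rot_z(10.0), rot_z(20.0), rot_z(30.0)]
mean = mean_rotation(captures)
```

For these three rotations about a common axis, the projected mean should coincide with the 20 degree rotation.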
