A strange warning from proc phreg (Survival Analysis) in SAS - statistics

I have been trying to fit a Cox regression on a small dataset, but I have come across a strange problem. Although the model runs fine, I am unable to get an output dataset from it. Instead, the log reads:
WARNING: The OUTPUT data set has no observations due to the presence of time-dependent explanatory
variables.
It's true that I have a time-dependent variable on the RHS, but this shouldn't be a problem, I think; many analyses use variables of this kind. Could you please help me understand why this happens and how I can get past it? There is plenty of information to be gained from this statement, and it would be really helpful to me. Here are my dataset and the code I have been using so far.
data surv;
input time event fin;
cards;
2 0 1
3 1 1
4 1 1
1 1 0
5 1 0
6 0 1
7 0 0
8 1 1
9 0 0
10 1 0
;
proc phreg data=surv;
model time*event(0)=fin ft;
ft=fin*log(time); * time-dependent covariate built by a programming statement;
output out=b; * this OUTPUT statement triggers the warning above;
run;
I wasn't sure whether I should post this here or on the stats Stack Exchange, but in any case I would really appreciate some help. Thank you.

SAS is just telling you that you have a time-dependent variable; it doesn't stop the model from fitting, but PHREG returns an empty OUTPUT data set when a covariate is defined by programming statements, which is exactly what the warning says. You are violating the proportional-hazards assumption for the Cox PH model, but the test is robust enough to handle it. There is really no "correct" answer here. You can perform some transformations and run the model after each transformation; whichever model returns the lowest AIC would be your best model. Check out this presentation; this lecture has some good information as well. If, however, the PH assumption is not important, you should switch to a parametric model. I hope this is what you were looking for, or at least somewhat close to it.
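If you want to sanity-check the "test PH, compare AIC" idea outside SAS, here is a minimal sketch using Python's lifelines package. This is my addition, not the SAS method above; the check_assumptions and AIC_partial_ members are lifelines features, and the grid of transformations to try is up to you.

import pandas as pd
from lifelines import CoxPHFitter

# The toy dataset from the question.
df = pd.DataFrame({
    "time":  [2, 3, 4, 1, 5, 6, 7, 8, 9, 10],
    "event": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "fin":   [1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

# Schoenfeld-residual test: a small p-value for `fin` suggests its
# effect changes over time, i.e. the PH assumption is violated.
cph.check_assumptions(df, p_value_threshold=0.05)

# Partial AIC: refit with different covariate transformations and
# keep the model with the lowest value.
print(cph.AIC_partial_)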


Decorrelating 3 categorical variables

I have a table of 3 categorical variables (salary, face_amount, and area_code) that I am looking to decorrelate from one another. In other words, I'm trying to find how much of some output can be attributed solely to each one of these variables. So I would want to see how much of this output is due to the salary variable alone, and not to salary's correlation with face_amount, for example, if that makes sense.
I noticed that Multiple Correspondence Analysis exists for this type of problem and will decorrelate the variables; however, the issue I'm having is that I need the original variables, not the ones produced by multiple correspondence analysis. I'm very confused about how to analyze this type of problem and would appreciate any help.
Sample of data:
salary face_amount area_code
'1-50' 1000 67
'1-50' 500 600
'1-50' 500 600
'51-200' 2000 623
'51-200' 1000 623
'201-500' 500 700
I'm not exactly sure how to go about this kind of problem.
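For concreteness, here is a rough sketch of one possible starting point: one-hot encode the categories and run a PCA as a crude stand-in for MCA, keeping the loadings so each component can still be read in terms of the original variables. The pandas/scikit-learn approach here is my assumption, not something from the question, and true MCA applies correspondence-analysis weighting that plain PCA omits.

import pandas as pd
from sklearn.decomposition import PCA

# The sample data from the question.
df = pd.DataFrame({
    "salary":      ["1-50", "1-50", "1-50", "51-200", "51-200", "201-500"],
    "face_amount": [1000, 500, 500, 2000, 1000, 500],
    "area_code":   [67, 600, 600, 623, 623, 700],
})

# Indicator (dummy) matrix with one column per category level.
X = pd.get_dummies(df.astype(str))

# PCA on the indicator matrix yields decorrelated components.
pca = PCA(n_components=2).fit(X)

# Loadings tie each component back to the original category levels.
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=["pc1", "pc2"])
print(loadings)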

Calculating accuracy from precision, recall, f1-score - scikit-learn

I made a huge mistake. I printed the output of a scikit-learn SVM classification report as:
str(metrics.classification_report(trainExpected, trainPredict, digits=6))
Now I need to calculate the accuracy from the following output:
precision recall f1-score support
1 0.000000 0.000000 0.000000 1259
2 0.500397 1.000000 0.667019 1261
avg / total 0.250397 0.500397 0.333774 2520
Is it possible to calculate accuracy from these values?
PS: I don't want to spend another day getting outputs from the model. I just realized this mistake and hopefully I don't need to start from the beginning.
You can compute the accuracy from precision, recall, and the number of true/false positives, or, in your case, from the support (even if precision or recall are 0 because of a 0 numerator or denominator).
TruePositive + FalseNegative = Support_True
TrueNegative + FalsePositive = Support_False
Precision = TruePositive / (TruePositive + FalsePositive) if TruePositive + FalsePositive != 0, else 0
Recall = TruePositive / (TruePositive + FalseNegative) if TruePositive + FalseNegative != 0, else 0
Accuracy = (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)
-or-
Given the TruePositive/TrueNegative counts, for example, then:
TPP = TruePositive / Precision = TruePositive + FalsePositive if Precision != 0 and TruePositive != 0, else TPP = 0
TPR = TruePositive / Recall = TruePositive + FalseNegative if Recall != 0 and TruePositive != 0, else TPR = 0
In the above, when TruePositive == 0, no computation is possible without more information about FalseNegative/FalsePositive; hence the support is more useful.
Accuracy = (TruePositive + TrueNegative) / (TPP + TPR - TruePositive + TrueNegative)
But in your case the support was given, so we use recall:
Recall = TruePositive / Support_True if Support_True != 0, else 0
so TruePositive = Recall * Support_True, and likewise TrueNegative = Recall_False * Support_False, in all cases.
Accuracy = (Recall * Support_True + Recall_False * Support_False) / (Support_True + Support_False)
In your case, (0 * 1259 + 1 * 1261) / (1259 + 1261) = 0.500397, which is exactly what you would expect when only one class is predicted; the precision score of that class becomes the accuracy.
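As a quick check of the arithmetic, here is the same computation in plain Python, with the recall and support values read off the report above:

# Per-class recall and support taken from the classification report.
support = {1: 1259, 2: 1261}
recall  = {1: 0.000000, 2: 1.000000}

# TruePositive_c = Recall_c * Support_c, and multiclass accuracy is the
# sum of per-class true positives over the total number of samples.
true_pos = {c: recall[c] * support[c] for c in support}
accuracy = sum(true_pos.values()) / sum(support.values())
print(accuracy)  # 0.5003968253968254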
As the other poster said, it is better to use the library, but since this also sounded like a mathematical question, the above can be used.
No need to spend more time on it. The metrics module has everything you need, and you have already computed the predicted values. It's a one-line change.
print(metrics.accuracy_score(trainExpected, trainPredict))
I suggest that you spend some time reading the linked page to learn more about evaluating models in general.
I do think you have a bigger problem at hand: you have zero predicted values for your 1 class, despite having balanced classes. You likely have a problem in your data, modeling strategy, or code that you'll have to deal with.

Sklearn TruncatedSVD() ValueError: n_components must be < n_features

Hi, I'm trying to run a script for a Kaggle competition.
You can see the whole script here.
But when I run this script I get a ValueError:
ValueError: n_components must be < n_features; got 1 >= 1
Can somebody please tell me how to find out how many features there are at this point?
I don't think it will be useful to set n_components to 0.
I also read the documentation, but I can't solve the issue.
Greetz,
Alex
It is highly likely that the shape of your data matrix is wrong: it seems to have only one column. That needs to be fixed. Use a debugger to figure out what goes into the fit method of the TruncatedSVD, or unravel the pipeline and do the steps by hand.
As for the error message: if it is due to a matrix with one column, it makes sense. You can have at most as many components as features, and since you are using TruncatedSVD, it additionally assumes that you don't want the full feature space, hence the strict inequality.
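A minimal sketch of that diagnosis (the array shapes and names here are illustrative, not taken from the Kaggle script):

import numpy as np
from sklearn.decomposition import TruncatedSVD

X_bad = np.random.rand(100, 1)        # one feature: the failing case
print(X_bad.shape)                    # (100, 1) -> no n_components < 1 exists

# TruncatedSVD needs n_components < n_features, so a one-column matrix
# cannot be decomposed at all; fix the upstream step that produced it.
X_ok = np.random.rand(100, 20)
svd = TruncatedSVD(n_components=min(5, X_ok.shape[1] - 1))
print(svd.fit_transform(X_ok).shape)  # (100, 5)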

SVM and cross validation

The problem is as follows. When I do support vector machine training, suppose I have already performed cross-validation on 10,000 training points with a Gaussian kernel and have obtained the best parameters C and \sigma. Now I have another 40,000 new training points, and since I don't want to waste time on cross-validation, I stick with the original C and \sigma obtained from the first 10,000 points and train on the entire 50,000 points with these parameters. Is there any potentially major problem with this? It seems that for C and \sigma in some range the final test error wouldn't be that bad, so the above process seems okay.
There is one major pitfall in such an approach. Both C and sigma are data dependent. In particular, it can be shown that the optimal C strongly depends on the size of the training set. So once you make your training data 5 times bigger, even if it brings no "new" knowledge, you should still find a new C to get the exact same model as before. So you can follow such a procedure, but keep in mind that the best parameters for the smaller training set do not have to be the best for the bigger one (even though they sometimes still are).
To better see the picture: if this procedure were fully "ok", then why not fit C on even smaller data? 5 times smaller? 25 times? Maybe on one single point per class? 10,000 may seem like "a lot", but it depends on the problem considered. In many real-life domains this is just a "regular" (biology) or even "very small" (finance) dataset, so you won't be sure your procedure is fine for this particular problem until you test it.
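In practice that means re-running the parameter search on the enlarged set. Here is a minimal scikit-learn sketch; the grid values and the synthetic data are my assumptions, and gamma plays the role of 1/(2*sigma^2) for the RBF kernel:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for the combined 50,000-point training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {
    "C":     [0.1, 1, 10, 100],
    "gamma": [1e-3, 1e-2, 1e-1, 1],
}

# Re-tune on the full data instead of reusing the old (C, sigma).
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)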

Randomly select increasing subset of data to see where mean levels off

Could anyone please advise the best way to do the following?
I have three variables (X, Y & Z) and four groups (1, 2, 3 & 4). I have been using discriminant function analysis in SPSS to predict group membership of known grouped data for use with future ungrouped data.
Ideally I would like to be able to randomly sample an increasingly large subset of the data to see how many observations are required to hit a desired correct-classification percentage.
However, I understand this might be difficult. Therefore, I'm looking to do this for the means.
For example, let's say variable X has a mean of 141 for group 1. This mean might have been calculated from 2000 observations. However, it might be the case that the mean had already settled by, say, 700 observations. I would like to be able to calculate at what number of observations/cases the mean levels off in my data. For example, perhaps starting at 10 observations and repeating this randomly, say, 50 or 100 times, then increasing to 20 observations... and so on.
I understand this is a form of Monte Carlo testing. I have access to SPSS 15, 17 and 18, and Excel. I also have access to Minitab 15 & 16 and Amos 17, and have downloaded R, but I'm not familiar with these; my experience is with SPSS and Excel. I have tried some syntax in SPSS modified from http://pages.infinit.net/rlevesqu/Syntax/RandomSampling/Select2CasesFromEachGroup.txt but it would still be quite time-consuming on my part to enter the subset numbers etc.
Hope someone can help.
Thanks for reading.
Andy
The text you linked to is a good start (you can also use the SAMPLE command in SPSS, but IMO the Raynald script you linked to is more flexible when you think about constructing the sample that way).
In pseudo-code, the process might look like:
do n for sample size (a to b)
    loop 100 times
        draw sample of size n
        compute (& save) statistics
Here is where SPSS's macro language comes into play (I think this document is a good introduction, and you can examine other references on the SPSS tag wiki). Basically, once you figure out how to draw the sample and compute the stats you want, you just need to write a macro so you can loop through the process (and pass it the sample-size parameter). I include the "loop 100 times" because you want to be able to estimate the error associated with each sample size.
If you give an example of how you compute the statistics I may be able to give examples of how to make that into a macro function and loop through the desired number of times.
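For illustration only, the same loop outside SPSS might look like this in Python; all of the names and the synthetic data here are assumptions, not part of the macro below:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=141, scale=15, size=2000)   # stand-in for variable X, group 1

# For each sample size, draw 100 random subsets and record the mean;
# the spread across draws shows where the mean levels off.
for n in range(10, 210, 10):
    means = [rng.choice(x, size=n, replace=False).mean() for _ in range(100)]
    print(n, round(np.mean(means), 2), round(np.std(means), 3))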
@Andy W, @Oliver: Thanks for your suggestions, guys. I've managed to find a workaround using the following macro from http://www.spsstools.net/Syntax/Bootstrap/GetRandomSampleOfVariousSizeCalcStats.txt However, for this I need to copy and paste the variable data for a given group into a new data window. That's not too much of a problem. To take this further, would anyone know how:
1. I could get other statistics recorded, e.g. std error, std dev, etc.?
2. I could use other analyses, ideally discriminant function analysis, and record the percentage of correct classifications in a new data window rather than producing lots of output tables?
3. I could avoid copying and pasting variables for each group, so I can just run the macro specifying n samples of variable X for groups 1, 2, 3 & 4?
Thanks again.
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
* myvar = the variable of interest (here we want the mean of salary)
* nbsampl = number of samples.
* size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='c:\Program Files\SPSS\employee data.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='c:\temp\sample.sav'.
!IFEND
SAVE OUTFILE='c:\temp\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=!myvar BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
* ----------------END OF MACRO ----------------------------------------------.
* Call the macro (parameters are the number of samples (here 20) and the sample sizes (here 5, 10, 15, 30, 50)).
* Thus 20 samples of size 5.
* Thus 20 samples of size 10, etc.
!sample myvar=salary nbsampl=20 size= 5 10 15 30 50.
