Dummy Coding Cox regression for survival analysis in Python - survival-analysis

I have two questions:
1) Is dummy coding needed when applying Cox regression in Python to a feature that has only 2 categories?
2) If a categorical feature does not satisfy the proportional hazards assumption, then strata may be an option, but in the example below, do I need to list each dummy category in the strata?
cph.fit(data, duration_col='tenure', event_col='Churn', strata=['Contract_One year', 'Contract_Two year'])
Thank you.
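As a minimal sketch of the dummy-coding part of the question (the data here is hypothetical, not from the thread): with pandas, `get_dummies(..., drop_first=True)` turns a k-category feature into k-1 indicator columns, so a 2-category feature becomes a single 0/1 column, which is the form a Cox model expects.

```python
import pandas as pd

# Hypothetical data with a two-category 'Contract' feature.
df = pd.DataFrame({
    "tenure": [5, 12, 3, 24],
    "Churn": [1, 0, 1, 0],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})

# drop_first=True keeps k-1 columns for a k-category feature,
# avoiding a redundant, perfectly collinear column.
encoded = pd.get_dummies(df, columns=["Contract"], drop_first=True)
print(encoded.columns.tolist())
```

Here the 2-category feature collapses to the single column `Contract_One year`; only that one column would enter the model.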

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate, and why? Is it best to do this before or after fitting the model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of the SPSS output it provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
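A minimal sketch of the residual check the last point describes, using synthetic data (not the poster's hospital data): in linear regression the normality assumption concerns the residuals, so the check is run on them after fitting, which is what SPSS's P-P plot of standardised residuals is doing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: a positively skewed predictor,
# normally distributed errors.
age = rng.lognormal(mean=3.5, sigma=0.3, size=200)
pain = 0.02 * age + rng.normal(scale=0.5, size=200)

# Ordinary least squares fit (intercept + slope) via least squares.
X = np.column_stack([np.ones_like(age), age])
beta, *_ = np.linalg.lstsq(X, pain, rcond=None)
residuals = pain - X @ beta

# The normality check applies to these residuals, e.g. a Shapiro-Wilk test
# (a numeric analogue of inspecting the P-P plot).
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```

Note the predictor itself is skewed by construction, yet the residuals can still pass the normality check; that is the distinction the question is circling.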

Random Forest Regressor using a custom objective/ loss function (Python/ Sklearn)

I want to build a Random Forest Regressor to model count data (Poisson distribution). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (Sklearn, etc..)?
Is there any implementation to fit count data in Python in any packages?
In sklearn this is currently not supported. See discussion in the corresponding issue here, or this for another class, where they discuss reasons for that a bit more in detail (mainly the large computational overhead for calling a Python function).
So it could be done as discussed within the issues, by forking sklearn, implementing the cost function in Cython and then adding it to the list of available 'criterion'.
If the problem is that the counts c_i arise from different exposure times t_i, then indeed one cannot fit the counts, but one can still fit the rates r_i = c_i/t_i using MSE loss function, where one should, however, use weights proportional to the exposures, w_i = t_i.
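That workaround can be sketched as follows, with synthetic counts and exposures standing in for real data (the variable names mirror the answer's notation, nothing here is from an actual dataset): fit the rates with the standard MSE-based regressor, passing the exposures as sample weights.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic count data: counts c_i drawn from a Poisson whose rate depends
# on one feature, observed over varying exposure times t_i.
X = rng.uniform(0, 1, size=(500, 1))
exposure = rng.uniform(0.5, 2.0, size=500)      # t_i
rate = np.exp(1.0 + 2.0 * X[:, 0])              # true r_i
counts = rng.poisson(rate * exposure)           # c_i

# Fit the rates r_i = c_i / t_i with the default MSE criterion,
# weighting each observation by its exposure, w_i = t_i.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, counts / exposure, sample_weight=exposure)

# Predicted counts are then rate * exposure.
pred_counts = model.predict(X) * exposure
```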
For a true Random Forest Poisson regression, I've seen that in R there is the rpart library for building a single CART tree, which has a Poisson regression option. I wish this kind of algorithm would have been imported to scikit-learn.
In R, writing a custom objective function is fairly simple.
The randomForestSRC package in R has provision for writing your own custom split rule. The custom split rule, however, has to be written in pure C.
All you have to do is write your own custom split rule, register the split rule, and compile and install the package.
The custom split rule has to be defined in the file called splitCustom.c in the randomForestSRC source code.
You can find more info here. The file in which you define the split rule is this.

Python AUC Calculation for Unsupervised Anomaly Detection (Isolation Forest, Elliptic Envelope, ...)

I am currently working on anomaly detection algorithms. I have read papers comparing unsupervised anomaly detection algorithms based on AUC values. For example, I have anomaly scores and anomaly classes from Elliptic Envelope and Isolation Forest. How can I compare these two algorithms based on AUC values?
I am looking for a Python code example.
Thanks
Problem solved. Steps I have done so far:
1) Gathering the class and score after the anomaly function
2) Converting the anomaly score to a 0-100 scale to better compare different algorithms
3) The AUC functions require these variables to be arrays. My mistake was to pass them as DataFrame columns, which returned "nan" all the time.
Python Script:
from sklearn import metrics

# outlier_class and outlier_score must be arrays, not DataFrame columns
fpr, tpr, thresholds_sorted = metrics.roc_curve(outlier_class, outlier_score)
aucvalue_sorted = metrics.auc(fpr, tpr)
aucvalue_sorted
Regards,
Seçkin Dinç
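Since the question asked for a Python code example, here is a self-contained sketch of the whole comparison on synthetic data (the dataset and its anomaly labels are illustrative; in practice you need ground-truth labels from somewhere to compute an AUC at all):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data: inliers around the origin, a few shifted outliers.
inliers = rng.normal(0, 1, size=(200, 2))
outliers = rng.normal(5, 1, size=(20, 2))
X = np.vstack([inliers, outliers])
y = np.r_[np.zeros(200), np.ones(20)]  # 1 = anomaly

results = {}
for name, model in [("IsolationForest", IsolationForest(random_state=0)),
                    ("EllipticEnvelope", EllipticEnvelope(random_state=0))]:
    model.fit(X)
    # score_samples gives higher values for more normal points,
    # so negate it to obtain an anomaly score before computing AUC.
    scores = -model.score_samples(X)
    results[name] = roc_auc_score(y, scores)
    print(name, results[name])
```

Because `roc_auc_score` only depends on the ranking of the scores, no rescaling to a 0-100 range is needed before comparing the two algorithms.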
Although you already solved your problem, here are my 2 cents :)
Once you've decided which method to use to compare them (your "evaluation protocol", so to say), you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)

How to conduct multivariate regression in Excel?

I've been searching the web for ways to conduct multivariate regression in Excel, and I've seen that the Analysis ToolPak can get the job done.
However, it seems that the Analysis ToolPak can handle multivariable linear regression but not multivariate linear regression (the latter meaning there is more than one dependent variable Y1, ..., Yk, while the former means a single dependent variable with multiple independent variables, Y = x1 + x2 + ... + xn).
Is there a way to conduct multivariate regression in Excel, or should I start looking at other programs like R?
Thanks in advance
Is this what you are looking for?
http://smallbusiness.chron.com/run-multivariate-regression-excel-42353.html

Is it possible to compare the classification ability of two sets of features by ROC?

I am learning about SVMs and ROC. As I know, people can usually use a ROC (receiver operating characteristic) curve to show the classification ability of an SVM (Support Vector Machine). I am wondering if I can use the same concept to compare two subsets of features.
Assume I have two subsets of features, subset A and subset B. They are chosen from the same training data by 2 different feature extraction methods, A and B. If I use these two subsets of features to train the same SVM using the LIBSVM svmtrain() function and plot the ROC curves for both of them, can I compare their classification ability by their AUC values? So if I have a higher AUC value for subset A than for subset B, can I conclude that method A is better than method B? Does it make any sense?
Thank you very much,
Yes, you are on the right track. However, you need to keep a few things in mind.
Often using the two features A and B with the appropriate scaling/normalization can give better performance than the features individually. So you might also consider the possibility of using both the features A and B together.
When training SVMs using features A and B, you should optimize for them separately, i.e. compare the best performance obtained using feature A with the best obtained using feature B. Often the features A and B might give their best performance with different kernels and parameter settings.
There are other metrics apart from AUC, such as the F1-score and Mean Average Precision (MAP), that can be computed once you have evaluated the test data; depending on the application you have in mind, they might be more suitable.
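The second point above can be sketched in scikit-learn (standing in here for LIBSVM's svmtrain(); the dataset and the two feature subsets are illustrative): tune each SVM separately on its own feature subset, then compare held-out AUCs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Illustrative data; the first 5 columns play the role of method A's
# features and the last 5 the role of method B's.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
subsets = {"A": slice(0, 5), "B": slice(5, 10)}

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

aucs = {}
for name, cols in subsets.items():
    # Optimize kernel and C separately per subset, as advised above:
    # each feature set may prefer different settings.
    grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]})
    grid.fit(X_train[:, cols], y_train)
    scores = grid.decision_function(X_test[:, cols])
    aucs[name] = roc_auc_score(y_test, scores)
    print(f"subset {name}: AUC = {aucs[name]:.3f}")
```

Computing the AUC on a held-out test set (rather than the training data) is what makes the comparison between the two extraction methods meaningful.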