KL Divergence vs Z Test for comparing two proportions (binary categorical variable)

I want to detect data drift on a dataset that has 54 binary categorical features.
Which method would work better if I want to compare the features separately: KL divergence or a Z test for proportions?
The dataset is very large.
Also, if I want to consider all the features together and detect data drift, which method would be the most appropriate?
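
For a single binary feature, both options boil down to comparing two Bernoulli proportions between a reference window and a current window. Below is a minimal sketch of both computations (my own illustration, not from the question; numpy and scipy assumed, window sizes and drift magnitude hypothetical). Note that with very large samples a z-test will flag even tiny shifts as significant, while the KL divergence reads more like an effect size.
# Sketch only: compare one 0/1 feature between a reference window and a current window
# with (a) a two-proportion z-test and (b) KL divergence between the two Bernoulli
# distributions. Sample sizes and the size of the shift below are hypothetical.
import numpy as np
from scipy.stats import norm

def two_proportion_z_test(x_ref, x_cur):
    # Two-sided z-test for equality of proportions of a binary feature.
    n1, n2 = len(x_ref), len(x_cur)
    p1, p2 = x_ref.mean(), x_cur.mean()
    p_pool = (x_ref.sum() + x_cur.sum()) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

def bernoulli_kl(p, q, eps=1e-12):
    # KL(P || Q) for Bernoulli distributions with success probabilities p and q.
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(0)
x_ref = rng.binomial(1, 0.30, size=100_000)   # reference window
x_cur = rng.binomial(1, 0.31, size=100_000)   # current window with a small shift

z, p_value = two_proportion_z_test(x_ref, x_cur)
kl = bernoulli_kl(x_ref.mean(), x_cur.mean())
print(f"z = {z:.2f}, p = {p_value:.4f}, KL = {kl:.6f}")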

Related

Unsupervised learning feature selection in Python

I need to select the most important features from my data frame before starting on a nearest-neighbours problem.
Which methods are best for doing this? My data frame has around 8 categorical features and 2 continuous features, but no target variable.
The problem is that I have three categorical features which can only be one-hot encoded, and once I do that the data explodes into 47 one-hot encoded variables.
Considering these constraints, what would be the best method for feature selection?
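
For scale, here is a small sketch (my own illustration, hypothetical column names and data) of how the one-hot explosion happens with pandas, followed by a variance-threshold filter, which is one common unsupervised baseline for dropping near-constant dummy columns; it is not the only option.
# Illustration only (not part of the question): one-hot expansion with pandas, then
# a variance-threshold filter as one common unsupervised baseline for rare levels.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "city":   ["A"] * 10 + ["B"] * 9 + ["C"],   # "C" is a rare level
    "device": ["ios", "android"] * 10,
    "age":    list(range(20, 40)),              # continuous feature kept as-is
})

ohe = pd.get_dummies(df, columns=["city", "device"])
print(ohe.shape)      # 3 original columns become 1 numeric + 5 dummy columns

selector = VarianceThreshold(threshold=0.1)     # drop near-constant dummy columns
reduced = selector.fit_transform(ohe)
print(reduced.shape)  # the rare "city_C" column is dropped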

Categorical variables in recursive feature elimination with random forest

I am trying to use recursive feature elimination with a random forest to find the optimal features. However, one thing I am confused about is what I should do with the categorical variables. Most of the time people use a one-hot encoder for categorical variables, but if I one-hot encode, how can I tell which original feature is important and which is not? After one-hot encoding, one feature may become multiple features.
My current approach is to label-encode all the categorical variables, i.e. encode each categorical variable as integers, and then use the following code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
rfc = RandomForestClassifier(random_state=101)
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(10), scoring='accuracy')
rfecv.fit(X, target)
One of the features contains 44 different county names; I am not sure if label-encoding is the right way to handle it.
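
If you do go the one-hot route, one way to keep the importances interpretable is to sum the importances of the dummy columns back to the original variable they came from. The sketch below is my own illustration, not part of the question: the toy data stands in for the real frame, and the column-prefix mapping is an assumption based on the names pd.get_dummies produces.
# Sketch: one-hot encode, fit the forest, then aggregate importances per source column.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for the real frame with a 44-level county column.
X_raw = pd.DataFrame({
    "county": ["Kent", "Essex", "Kent", "Surrey", "Essex", "Surrey"] * 10,
    "age":    [23, 35, 41, 29, 52, 33] * 10,
})
target = [0, 1, 0, 1, 1, 0] * 10

X_ohe = pd.get_dummies(X_raw)  # "county" -> "county_Kent", "county_Essex", ...
rfc = RandomForestClassifier(random_state=101).fit(X_ohe, target)

# Map each dummy column back to its source column, then sum importances per source.
importances = pd.Series(rfc.feature_importances_, index=X_ohe.columns)
source = {c: next((orig for orig in X_raw.columns if c == orig or c.startswith(orig + "_")), c)
          for c in X_ohe.columns}
print(importances.groupby(source).sum().sort_values(ascending=False))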

What are the ways of pre-processing categorical data before applying classification algorithms?

I am new to machine learning and I am working on a classification problem with categorical (nominal) data. I have tried applying BayesNet and a couple of tree- and rule-based classification algorithms to the raw data, and I am able to achieve an AUC of 0.85.
I want to improve the AUC further by pre-processing or transforming the data. However, since the data is categorical, I don't think that log transforms, addition, multiplication, etc. of different columns will work here.
Can somebody list the most common transformations applied to categorical datasets? (I tried one-hot encoding, but it takes a lot of memory!)
Categorical data is, in my experience, best dealt with by one-hot encoding (i.e. converting each category to a binary vector), as you've mentioned. If memory is an issue, it may be worthwhile to use an online classification algorithm and generate the encoded vectors on the fly.
Apart from this, if the categories represent ranges of values (for example age, height, or income brackets), it may be possible to treat the centre of each range (or some appropriate mean, if there is an intra-label distribution) as a real number.
If you were applying clustering, you could also treat the categorical labels as points on an axis (1, 2, 3, 4, 5, etc.), scaled appropriately relative to the other features.
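
To make the memory point concrete, here is a small sketch (my own illustration with synthetic data, scikit-learn assumed) of the route suggested above: one-hot encode into a sparse matrix and train an online linear classifier in mini-batches with partial_fit, so the full dense matrix never has to exist in memory.
# Sketch of the memory-friendly route: sparse one-hot encoding plus an online
# linear classifier trained in mini-batches. Data below is synthetic.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
categories = np.array(["red", "green", "blue", "yellow"])
X = rng.choice(categories, size=(10_000, 3))     # 3 nominal columns
y = (X[:, 0] == "red").astype(int)               # toy target

enc = OneHotEncoder(handle_unknown="ignore")     # sparse output by default
clf = SGDClassifier(loss="log_loss", random_state=0)

enc.fit(X)                                       # learn the category vocabulary once
for start in range(0, X.shape[0], 1_000):        # stream mini-batches
    batch = enc.transform(X[start:start + 1_000])   # scipy sparse matrix, low memory
    clf.partial_fit(batch, y[start:start + 1_000], classes=[0, 1])

print(clf.score(enc.transform(X), y))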

dimension reduction makes data non-linearly separable

I am working on a project to classify hearing disorders using an SVM. I have collected real data from the site (http://archive.ics.uci.edu/ml/machine-learning-databases/audiology/) and initially decided to go for two classes: patients with a normal ear and patients with any disorder. Varying the regularization parameter C from 0.1 to 10, I get one misclassification between the two classes (at C=10).
However, I want to plot the data with the decision boundary, but the dataset has around 68 features, so it cannot be plotted directly. I used PCA to reduce it to 2D and ran the SVM on this data to see the results. But after PCA the data is no longer linearly separable, and a linear decision boundary cannot separate the 2D PCA data. So I want to know if there is a way to reduce the dimensionality while retaining the nature of the data (nature as in separability by a linear decision boundary). Can anyone please help me?
Thanks
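
The workflow described above (fit a linear SVM in the full space, then project to 2D with PCA for plotting) can be sketched as below; synthetic data stands in for the audiology set, so this is an illustration rather than a reproduction. The underlying issue is that PCA keeps the directions of largest variance, which are not necessarily the directions that separate the classes, so separability can be lost in the 2D projection.
# Sketch of the described workflow with synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=68, n_informative=10, random_state=0)

# Linear SVM in the full 68-dimensional space.
full = make_pipeline(StandardScaler(), SVC(kernel="linear", C=10)).fit(X, y)
print("full-dimensional accuracy:", full.score(X, y))

# Same model after projecting to 2D for plotting; separability may degrade here.
reduced = make_pipeline(StandardScaler(), PCA(n_components=2),
                        SVC(kernel="linear", C=10)).fit(X, y)
print("2D-PCA accuracy:", reduced.score(X, y))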

Is it possible to compare the classification ability of two sets of features by ROC?

I am learning about SVMs and ROC analysis. As I understand it, people usually use an ROC (receiver operating characteristic) curve to show the classification ability of an SVM (support vector machine). I am wondering if I can use the same concept to compare two subsets of features.
Assume I have two subsets of features, subset A and subset B, chosen from the same training data by two different feature extraction methods, A and B. If I use these two subsets to train the same SVM with the LIBSVM svmtrain() function and plot the ROC curves for both, can I compare their classification ability by their AUC values? If subset A has a higher AUC than subset B, can I conclude that method A is better than method B? Does that make sense?
Thank you very much,
Yes, you are on the right track. However, you need to keep a few things in mind.
Often, using the two feature sets A and B together, with appropriate scaling/normalization, gives better performance than either set individually, so you might also consider combining A and B.
When training SVMs on feature sets A and B, you should optimize them separately, i.e. compare the best performance obtained with A against the best obtained with B. The two sets might reach their best performance with different kernels and parameter settings.
There are other metrics besides AUC, such as the F1-score and mean average precision (MAP), that can be computed once you have evaluated on the test data; depending on the application you have in mind, they might be more suitable.
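
As a concrete illustration of this comparison (my own sketch, using scikit-learn in place of LIBSVM's svmtrain(); the two feature subsets below are synthetic stand-ins), each subset gets its own hyper-parameter search and the tuned models are then compared by cross-validated AUC, in line with the points above.
# Sketch: tune an SVM separately on each feature subset, compare by cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_A, X_B = X[:, :10], X[:, 10:]        # stand-ins for the two extracted feature subsets

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

def tuned_auc(X_subset, y):
    # Tune the SVM on this subset, then report its cross-validated AUC.
    search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=5)
    search.fit(X_subset, y)
    return search.best_score_, search.best_params_

for name, X_subset in [("A", X_A), ("B", X_B)]:
    auc, params = tuned_auc(X_subset, y)
    print(f"subset {name}: CV AUC = {auc:.3f} with {params}")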
