Logistic Regression & SVC: Do we need to do scaling if features are BOW, tf-idf or doc2Vec? - svm

I know Logistic Regression & SVC usually require scaling of the features. However, if the features are generated by
BOW
tf-idf
doc2Vec
do we still need to scale the features?
Thank you

Strictly speaking, you never need to; but you should try it both ways and choose the approach that scores better for your data and goals.
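If you want to check empirically, here is a minimal sketch (assuming scikit-learn and the bundled 20 newsgroups sample, which only stands in for your own corpus): run the same tf-idf + logistic regression pipeline with and without a sparse-friendly scaler and compare cross-validated scores.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Small two-class sample; downloads the data on first run.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

unscaled = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scaled = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("scale", MaxAbsScaler()),  # keeps the matrix sparse (no centering)
    ("clf", LogisticRegression(max_iter=1000)),
])

for name, pipe in [("unscaled", unscaled), ("scaled", scaled)]:
    score = cross_val_score(pipe, data.data, data.target, cv=5).mean()
    print(name, round(score, 4))

For dense doc2vec vectors a StandardScaler would be the more usual candidate; either way, the point is simply to let the cross-validated score decide.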

Related

Find optimized threshold value

I have a dataset with a fraud_label column and a number of feature variables, e.g. number_of_site_visits, external_fraud_score, etc. How can I find the rule that identifies fraud_label correctly with the best precision and recall? I need to come up with a rule of the form: if number_of_site_visits is less than X and external_fraud_score is greater than Y, then we get the best precision and recall. I have to do this in Python, and any help or direction you can provide would be very welcome.
I have tried a Random Forest model, but that gives me feature importances rather than exact threshold values.
A good way to find such a rule is to use a supervised machine learning algorithm such as logistic regression or support vector machines. Train a model on your dataset, use it to predict fraud_label, and evaluate the predictions with metrics such as precision and recall.
You can also use grid search or cross-validation to find the optimal parameters for your model, which will help you identify the best thresholds for each feature variable and build a rule that gives the best precision and recall.
In Python, you can use the scikit-learn library to implement these algorithms.
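As a minimal sketch of that workflow (the synthetic frame below is only a placeholder modelled on the column names in the question; replace it with your real data), fit a logistic regression and then read a good probability threshold off precision_recall_curve:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "number_of_site_visits": rng.poisson(5, 1000),
    "external_fraud_score": rng.uniform(0, 100, 1000),
})
# Toy target with some noise, only so the example runs end to end.
df["fraud_label"] = ((df["external_fraud_score"] + rng.normal(0, 20, 1000)) > 70).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="fraud_label"), df["fraud_label"], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# Sweep the decision threshold and keep the one with the best F1,
# i.e. the best balance between precision and recall.
precision, recall, thresholds = precision_recall_curve(y_test, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = f1[:-1].argmax()  # the last precision/recall pair has no threshold
print("threshold:", thresholds[best],
      "precision:", precision[best], "recall:", recall[best])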

What should go first: automated XGBoost model parameter tuning (Hyperopt) or feature selection (Boruta)?

I classify clients with many small XGBoost models built from different parts of the dataset.
Since it is hard to maintain many models manually, I decided to automate hyperparameter tuning via Hyperopt and feature selection via Boruta.
Could you please advise me: what should go first, hyperparameter tuning or feature selection? Or perhaps the order does not matter?
After feature selection, the number of features decreases from 2,500 to 100 (actually, I have 50 true features plus 5 categorical features that turn into 2,400 via OneHotEncoding).
If some code is needed, please let me know. Thank you very much.
Feature selection (FS) can be considered a preprocessing activity whose aim is to identify features with low bias and low variance [1].
Meanwhile, the primary aim of hyperparameter optimization (HPO) is to automate the hyper-parameter tuning process and make it possible for users to apply Machine Learning (ML) models to practical problems effectively [2]. Some important reasons for applying HPO techniques to ML models are as follows [3]:
It reduces the human effort required, since many ML developers spend considerable time tuning the hyper-parameters, especially for large datasets or complex ML algorithms with a large number of hyper-parameters.
It improves the performance of ML models. Many ML hyper-parameters have different optima for achieving the best performance on different datasets or problems.
It makes models and research more reproducible. Only when the same level of hyper-parameter tuning is applied can different ML algorithms be compared fairly; hence, using the same HPO method on different ML algorithms also helps determine the most suitable ML model for a specific problem.
Given the above difference between the two, I think FS should be applied first, followed by HPO for the chosen algorithm; a rough sketch of that order follows the references below.
References
[1] Tsai, C.F., Eberle, W. and Chu, C.Y., 2013. Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 39, pp. 240-247.
[2] Kuhn, M. and Johnson, K., 2013. Applied Predictive Modeling. Springer. ISBN 9781461468493.
[3] Hutter, F., Kotthoff, L. and Vanschoren, J. (Eds.), 2019. Automated Machine Learning: Methods, Systems, Challenges. Springer. ISBN 9783030053185.
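A rough sketch of that order, assuming the third-party boruta, hyperopt and xgboost packages are installed (the random data, search space and CV settings are only placeholders): run Boruta on the full feature matrix first, then let Hyperopt tune XGBoost on the reduced feature set.

import numpy as np
from boruta import BorutaPy
from hyperopt import fmin, hp, tpe
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # placeholder target with some signal

# Step 1: feature selection with Boruta on all features.
selector = BorutaPy(RandomForestClassifier(max_depth=5, n_jobs=-1),
                    n_estimators="auto", random_state=0)
selector.fit(X, y)
X_sel = X[:, selector.support_]

# Step 2: hyperparameter tuning with Hyperopt on the reduced feature set.
space = {
    "max_depth": hp.choice("max_depth", [3, 5, 7]),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}

def objective(params):
    model = XGBClassifier(n_estimators=200, **params)
    # Minimize negative mean CV accuracy.
    return -cross_val_score(model, X_sel, y, cv=3).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)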

How to know which features contribute significantly in prediction models?

I am a novice in DS/ML. I am trying to solve the Titanic case study on Kaggle, but my approach has not been systematic so far. I have used correlation to find relationships between variables and have tried KNN and Random Forest classification, yet my models' performance has not improved. I have selected features based on the correlation between variables.
Please guide me to scikit-learn methods that can be used to identify the features which contribute significantly to the prediction.
Various boosting techniques can improve accuracy to approximately 99%; I suggest you use Gradient Boosting.
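To address the scikit-learn part of the question, here is a minimal sketch (using a bundled dataset in place of the Titanic frame) of ranking features with a gradient boosting model, both by impurity-based importances and by permutation importance:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances: fast, but can favour high-cardinality features.
impurity_rank = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Permutation importance on held-out data: slower, usually more reliable.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
perm_rank = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)

print(impurity_rank.head(10))
print(perm_rank.head(10))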

Logistic regression overfits even using cross validation in sklearn?

I am implementing a logistic regression model using sklearn, for a text classification competition on Kaggle.
When I use unigrams, there are 23,617 features. The best mean_test_score that cross-validation search (sklearn's GridSearchCV) gives me is similar to the score I got on Kaggle using the best model.
There are 1,046,524 features if I use bigrams. GridSearchCV gives me a better mean_test_score than with unigrams, but with this new model I got a much lower score on Kaggle.
I guess the reason might be overfitting, since I have too many features. I have tried GridSearchCV with 5-fold and even 2-fold cross-validation, but the scores are still inconsistent.
Does this really indicate that my second model is overfitting, even at the validation stage? If so, how can I tune the regularization term for my logistic model using sklearn? Any suggestions are appreciated!
Assuming you are using sklearn, you could try looking into the tuning parameters max_df, min_df, and max_features. Throwing these into a GridSearch may take a long time, but you will likely get some interesting results back. These parameters are implemented in sklearn.feature_extraction.text.TfidfVectorizer (and in CountVectorizer as well). Essentially, the idea is that including too many n-grams can lead to overfitting, as can keeping n-grams with very low or very high document frequencies.
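A minimal sketch of that idea (the toy texts and grid values are placeholders for your own corpus and search ranges): put the vectorizer and the classifier in one Pipeline so that GridSearchCV can tune the vocabulary limits together with logistic regression's regularization strength C.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["good movie", "bad movie", "great film", "terrible film"] * 50  # placeholder corpus
labels = [1, 0, 1, 0] * 50

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__min_df": [1, 2, 5],        # drop very rare n-grams
    "tfidf__max_df": [0.9, 1.0],       # drop very frequent n-grams
    "tfidf__max_features": [None, 50000],
    "clf__C": [0.01, 0.1, 1, 10],      # smaller C = stronger regularization
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)

Keeping the vectorizer inside the pipeline also means the vocabulary is rebuilt on each training fold, so the cross-validated score is not optimistically biased.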

auc_score in scikit-learn 0.14

I'm training a RandomForestClassifier on a binary classification problem in scikit-learn. I want to maximize my auc score for the model. I understand this is not possible in the 0.13 stable version but is possible in the 0.14 bleeding edge version.
I tried this but I seemed to get a worse result:
ic = RandomForestClassifier(n_estimators=100, compute_importances=True, criterion='entropy', score_func = auc_score);
Does this work as a parameter for the model, or only in GridSearchCV?
If I use it in GridSearchCV, will it make the model fit the data better with respect to auc_score? I also want to try maximizing recall_score.
I am surprised the above does not raise an error. You can use the AUC only for model selection, as in GridSearchCV.
If you use it there (scoring='roc_auc', iirc), the model with the best AUC will be selected. It does not make the individual models better with respect to this score.
It is still worth trying, though.
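A small sketch of using AUC for model selection (the data and grid are placeholders), which also shows where recall could be swapped in as the scoring string:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

# scoring='roc_auc' picks the candidate with the best cross-validated AUC;
# it does not change how any individual forest is fitted.
# Use scoring='recall' instead to select for recall.
search = GridSearchCV(
    RandomForestClassifier(criterion="entropy", random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)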
I have found a journal article that addresses highly imbalanced classes with random forests. Although it is aimed at running random forests on Hadoop clusters, the same techniques seem to work well on smaller problems as well:
del Río, S., López, V., Benítez, J. M., & Herrera, F. (2014). On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences, 285, 112-137.
http://sci2s.ugr.es/rf_big_imb/pdf/rio14_INS.pdf
