Can I extract significance values for Logistic Regression coefficients in pyspark - apache-spark

Is there a way to get the significance level of each coefficient after fitting a logistic regression model on training data?
I tried to find one but could not figure it out myself.
I think I could get a significance level for each feature by running a chi-squared test. However, I am not sure whether I can run the test on all features together, and my data is numeric, so I do not know whether the test would give the right result.
Right now I do the modeling with statsmodels and scikit-learn, but I would like to know how to get these results from PySpark ML or MLlib itself.
If anyone can shed some light, it would be helpful.

I only use MLlib. I think that once you have trained a model, you can use the toPMML method to export it in PMML format (an XML file), then parse the XML file to get the feature weights. Here is an example:
https://spark.apache.org/docs/2.0.2/mllib-pmml-model-export.html
Hope that helps.
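The exported weights alone do not say how significant each coefficient is. For an unregularized logistic regression, the usual route is a Wald test: the standard errors come from the inverse of the Fisher information matrix, and the two-sided p-value follows from z = coefficient / standard error. Below is a minimal pure-Python sketch of that calculation on synthetic data. The helper names (invert, wald_test) and the data are made up for illustration and are not part of Spark. As far as I know, Spark ML's GeneralizedLinearRegression with family="binomial" reports comparable values through model.summary.pValues, which may be the more direct route inside PySpark.

```python
import math
import random

def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def wald_test(rows, labels, n_iter=25):
    """Fit an unregularized logistic regression with Newton's method.

    Returns (coefficients, standard errors, two-sided p-values);
    the first entry of each list belongs to the intercept.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    beta = [0.0] * k

    def gradient_and_information(beta):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # Fisher information X^T W X
        for x, y in zip(X, labels):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    info[i][j] += w * x[i] * x[j]
        return grad, info

    for _ in range(n_iter):
        grad, info = gradient_and_information(beta)
        cov = invert(info)
        beta = [b + sum(cov[i][j] * grad[j] for j in range(k))
                for i, b in enumerate(beta)]

    # Covariance of the estimates is the inverse Fisher information at the optimum.
    _, info = gradient_and_information(beta)
    cov = invert(info)
    std_err = [math.sqrt(cov[i][i]) for i in range(k)]
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_values = [math.erfc(abs(b / s) / math.sqrt(2.0))
                for b, s in zip(beta, std_err)]
    return beta, std_err, p_values

# Synthetic data: x1 drives the label, x2 is pure noise.
random.seed(0)
rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(400)]
labels = [1 if random.random() < 1.0 / (1.0 + math.exp(-1.5 * x1)) else 0
          for x1, _ in rows]

coef, std_err, p_values = wald_test(rows, labels)
for name, b, s, p in zip(["intercept", "x1", "x2"], coef, std_err, p_values):
    print(f"{name:>9}: coef={b:+.3f}  se={s:.3f}  p={p:.4f}")
```

These p-values are only meaningful for an unregularized fit; with an L1 or L2 penalty the standard errors above no longer apply.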

Related

Find optimized threshold value

I have a dataset with a fraud_label and a set of other feature variables. How can I find the rule that identifies fraud_label correctly with the best precision and recall? Examples of features are number_of_site_visits, external_fraud_score, etc. I need to come up with a rule which says that if number_of_site_visits is less than X and external_fraud_score is greater than Y, then we get the best precision and recall. I have to do this in Python, and any help or direction you can provide would be very helpful.
I have tried Random Forest model but that gives me feature importances and not exact threshold values.
A good way to find a rule that identifies fraud_label with high precision and recall is to use a supervised machine learning algorithm such as logistic regression or a support vector machine. You train a model on your dataset, use the trained model to predict fraud_label, and evaluate the predictions with metrics such as precision and recall.
You can also use grid search or cross-validation to find the optimal parameters for your model, which will help you identify the best thresholds for each feature variable. This will allow you to create a rule that gives you the best precision and recall values.
In Python, you can use the scikit-learn library to implement these algorithms.
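If the rule must have exactly the form "number_of_site_visits < X and external_fraud_score > Y", one option is to grid-search X and Y directly and keep the pair with the best score. The sketch below is a minimal pure-Python version under a few assumptions that are mine, not the question's: F1 is used to combine precision and recall into one number, the data is synthetic with a planted rule (visits < 10 and score > 0.7), and best_rule is a hypothetical helper name.

```python
import random

def best_rule(visits, scores, labels, visit_grid, score_grid):
    """Try every (X, Y) pair and return the rule with the highest F1."""
    best = None
    for x in visit_grid:
        for y in score_grid:
            tp = fp = fn = 0
            for v, s, label in zip(visits, scores, labels):
                flagged = v < x and s > y
                if flagged and label == 1:
                    tp += 1
                elif flagged and label == 0:
                    fp += 1
                elif not flagged and label == 1:
                    fn += 1
            if tp == 0:
                continue  # precision and F1 are undefined or zero here
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f1 = 2 * precision * recall / (precision + recall)
            if best is None or f1 > best["f1"]:
                best = {"X": x, "Y": y, "precision": precision,
                        "recall": recall, "f1": f1}
    return best

# Synthetic stand-in for the real dataset (values are made up).
random.seed(0)
n = 1000
number_of_site_visits = [random.randint(0, 49) for _ in range(n)]
external_fraud_score = [random.random() for _ in range(n)]
fraud_label = [1 if v < 10 and s > 0.7 else 0
               for v, s in zip(number_of_site_visits, external_fraud_score)]

rule = best_rule(number_of_site_visits, external_fraud_score, fraud_label,
                 visit_grid=range(1, 50),
                 score_grid=[i / 20 for i in range(21)])
print(rule)
```

On real data the thresholds should be chosen on a training split and the resulting precision and recall checked on held-out data, since a rule tuned this way can overfit just like a model.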

How to get the logits for the T5 model when using the `generate` method for inference?

I'm currently using HuggingFace's T5 implementation for text generation purposes. More specifically, I'm using the T5ForConditionalGeneration to solve a text classification problem as generation.
The model's performance is overall very satisfactory after training, but what I am wondering is how I can get the logits for generation?
I'm currently performing inference as is suggested in the documentation via model.generate(**tokenizer_outputs), but this simply outputs the IDs themselves without anything else.
I want the logits because I want to measure the model's confidence in its generation. I'm not 100% certain my approach is correct, but I'm thinking that if I can get the logit values of each generated token and average them, I could get an overall confidence score for the generated sequence.
Would anybody know how I could do this? Thanks.
I was struggling with this because I wasn't familiar with how the Transformers library works, but after looking at the source code, it turns out that all you have to do is set the arguments output_scores and return_dict_in_generate to True.
For more information, take a look at the method transformers.generation.utils.GenerationMixin.generate.
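With those two arguments set, generate returns an object that carries the generated ids in sequences and, in scores, one tensor of vocabulary scores per generated step. Raw scores are unnormalized, so averaging them directly is hard to interpret. One common alternative is to turn each step's scores into a log-probability for the chosen token and average those. The sketch below shows only that arithmetic in plain Python, with made-up two-token "vocabularies" standing in for the real score tensors; token_log_prob and sequence_confidence are hypothetical helper names, and greedy decoding with batch size 1 is assumed.

```python
import math

def token_log_prob(logits, token_id):
    """Log-softmax probability of token_id given one step's raw scores."""
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return logits[token_id] - log_norm

def sequence_confidence(step_scores, token_ids):
    """Geometric mean of the per-token probabilities of the generated ids."""
    log_probs = [token_log_prob(scores, tok)
                 for scores, tok in zip(step_scores, token_ids)]
    return math.exp(sum(log_probs) / len(log_probs))

# Hypothetical stand-ins: in practice step_scores would come from
# outputs.scores and token_ids from outputs.sequences (for T5, skipping
# the leading decoder start token).
step_scores = [
    [math.log(0.9), math.log(0.1)],  # step 1: chosen token 0 has p = 0.9
    [math.log(0.4), math.log(0.6)],  # step 2: chosen token 1 has p = 0.6
]
token_ids = [0, 1]

confidence = sequence_confidence(step_scores, token_ids)
print(round(confidence, 4))
```

Averaging in log space keeps long sequences from being penalized simply for having more tokens, which a plain product of probabilities would do.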

What are the methods to check if my model fits the data (without using graphs)?

I am working on a binary logistic regression data set in Python. I want to know if there are any numerical methods to calculate how well the model fits the data.
Please don't include graphical methods like plotting.
Thanks :)
Read through section 3.3.2, Classification metrics, in the sklearn documentation:
http://scikit-learn.org/stable/modules/model_evaluation.html
Hope it helps.
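To make a few of those metrics concrete, here is a minimal pure-Python sketch of three numeric checks for a binary classifier: accuracy, log loss, and ROC AUC. The function names and the four example predictions are mine; in scikit-learn the corresponding functions are accuracy_score, log_loss, and roc_auc_score in sklearn.metrics, which are the better choice in real code.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print("accuracy:", accuracy(y_true, y_prob))
print("log loss:", round(log_loss(y_true, y_prob), 4))
print("ROC AUC :", roc_auc(y_true, y_prob))
```

Log loss evaluates the predicted probabilities themselves, while accuracy and AUC only look at the decisions or the ranking, so it is usually worth reporting more than one of them.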

Logistic regression overfits even using cross validation in sklearn?

I am implementing a logistic regression model using sklearn, for a text classification competition on Kaggle.
When I use unigrams, there are 23,617 features. The best mean_test_score that cross-validation search (sklearn's GridSearchCV) gives me is similar to the score I got from Kaggle using the best model.
There are 1,046,524 features if I use bigrams. GridSearchCV gives me a better mean_test_score than with unigrams, but with this new model I got a much lower score on Kaggle.
I guess the reason might be overfitting, since I have too many features. I have tried running GridSearchCV with 5-fold and even 2-fold cross-validation, but the scores are still inconsistent.
Does it really indicate my second model is overfitting, even in the validation stage? If so, how can I tune the regularization term for my logistic model using sklearn? Any suggestions are appreciated!
Assuming you are using sklearn, you could try tuning the parameters max_df, min_df, and max_features. Throwing these into a GridSearch may take a long time, but you will likely get some interesting results back. I know these parameters are implemented in sklearn.feature_extraction.text.TfidfVectorizer, but I am sure they are used elsewhere as well. Essentially, the idea is that including too many n-grams can lead to overfitting, as can having too many n-grams with very low or very high document frequencies.
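As a rough sketch of how that could look, the vectorizer and the classifier can be put in one Pipeline so that GridSearchCV tunes the vocabulary pruning and the regularization together. In LogisticRegression the regularization strength is controlled by C, where smaller values mean stronger regularization. The toy corpus and the grid values below are made up for illustration; sensible ranges for the Kaggle data would need to be found by experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus made up for illustration; replace with the competition data.
docs = [
    "great movie loved it",
    "loved the acting great film",
    "great fun loved every minute",
    "what a great film loved it",
    "great story great acting",
    "loved this great movie",
    "terrible movie hated it",
    "hated the acting terrible film",
    "terrible boring hated every minute",
    "what a terrible film hated it",
    "terrible story terrible acting",
    "hated this terrible movie",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-name prefixes route each parameter to the right pipeline step.
param_grid = {
    "tfidf__min_df": [1, 2],             # drop n-grams seen in too few documents
    "tfidf__max_features": [None, 50],   # cap the vocabulary size
    "clf__C": [0.1, 1.0, 10.0],          # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

Keeping the vectorizer inside the pipeline matters: the vocabulary is then rebuilt on each training fold, so the validation folds do not leak into feature selection.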

Is it possible to obtain class probabilities using GradientBoostedTrees with spark mllib?

I am currently working with spark mllib.
I have created a text classifier using the Gradient Boosting algorithm with the class GradientBoostedTrees:
Gradient Boosted Trees
Currently I obtain the predictions to know the class of new elements but I would like to obtain the class probabilities (the value of the output before the hard decision).
In other mllib algorithms like logistic regression you can remove the threshold from the classifier to obtain the class probabilities, but I cannot find a way to do the same with GradientBoostedTrees.
As far as I know, it's not currently possible but it is possible with random forest.
See this link, where I have explained a procedure:
Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output
To implement the predicted probabilities and thresholds, one needs to write a program using the trees from the
print(model.toDebugString)
output. I tried to understand how the trees produce a prediction, which is fairly simple to reproduce outside Spark.
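Once the per-tree outputs for a sample have been recovered from the debug string, the last step is small. The sketch below assumes, from my recollection of Spark's log-loss convention, that the model's raw margin F is the weighted sum of the tree outputs and that the positive-class probability is 1 / (1 + exp(-2F)). The factor of 2, the function name gbt_probability, and the example numbers are assumptions to verify against the Spark version in use.

```python
import math

def gbt_probability(tree_outputs, tree_weights):
    """Positive-class probability from per-tree outputs of a boosted ensemble.

    Assumes a log-loss-trained model whose margin F maps to a probability
    via 1 / (1 + exp(-2F)); check this against your Spark version.
    """
    margin = sum(w * o for w, o in zip(tree_weights, tree_outputs))
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

# Hypothetical values for one sample: the output of each tree (read off by
# walking the printed trees) and the weight of each tree in the ensemble.
tree_outputs = [1.0, 0.35, -0.1]
tree_weights = [1.0, 0.1, 0.1]

probability = gbt_probability(tree_outputs, tree_weights)
print(round(probability, 4))
```

A custom decision threshold can then be applied to this probability instead of relying on the hard class label that predict returns.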
It seems that in Spark MLlib it is not possible to obtain the class probabilities.
You can only obtain the final classification decision.
That's a pity, because that information would be very useful (classifying a sample as positive with 99.99% probability is not the same as with 51%), and it is not difficult to obtain once the model has been trained.
An alternative is to use different software, such as xgboost: https://github.com/dmlc/xgboost
