Finding the overall contribution of each original descriptor in a PLS model - scikit-learn

I am new to scikit-learn (using v0.20.2) and am developing PLS regression models. I would like to know how important each of the original predictors/descriptors is in predicting the response. The different matrices returned by scikit-learn for a fitted PLS model (X_loadings, X_weights, etc.) give descriptor-related values for each PLS component, but I am looking for a way to calculate/visualize the overall importance/contribution of each feature in the model. Can someone help me out here?
Also, which of the matrices shows me the coefficient assigned to each PLS component in the final linear model?
Thanks,
Yannick

The coef_ attribute of the fitted model gives the contribution of each descriptor to the response variable.

Related

How to get the logits for the T5 model when using the `generate` method for inference?

I'm currently using HuggingFace's T5 implementation for text generation purposes. More specifically, I'm using the T5ForConditionalGeneration to solve a text classification problem as generation.
The model's performance is overall very satisfactory after training, but what I am wondering is how I can get the logits for generation?
I'm currently performing inference as is suggested in the documentation via model.generate(**tokenizer_outputs), but this simply outputs the IDs themselves without anything else.
The reason why I want the logits is because I want to measure the model's confidence of generation. I'm not 100% certain if my approach is correct, but I'm thinking that if I can get the logit values of each generated token and average them, I could get the overall confidence score of the generated sequence.
Would anybody know how I could do this? Thanks.
I was struggling with this because I wasn't familiar with how the Transformers library works, but after looking at the source code all you have to do is set the arguments output_scores and return_dict_in_generate to True.
For more information, take a look at the method transformers.generation.utils.GenerationMixin.generate.
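To sketch the confidence idea itself: with those two flags set, `generate` returns an object whose `.scores` field is a tuple containing one logits tensor per generated step, and `.sequences` holds the chosen token IDs. Averaging the log-probabilities of the chosen tokens is one reasonable sequence-level confidence. The helper below is pure Python so it can be shown without loading a model — the toy `steps` lists are stand-ins for the per-step logits:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_confidence(step_logits, token_ids):
    """Average log-probability of the generated tokens.

    step_logits : one list of vocabulary logits per generated step
                  (what `outputs.scores` holds when output_scores=True
                  and return_dict_in_generate=True).
    token_ids   : the token chosen at each step (from outputs.sequences).
    """
    logps = [log_softmax(logits)[tok]
             for logits, tok in zip(step_logits, token_ids)]
    return sum(logps) / len(logps)

# Toy example with a 4-token vocabulary and 2 generated steps.
steps = [[5.0, 1.0, 0.0, 0.0],   # step 1: token 0 is very likely
         [0.0, 4.0, 0.0, 0.0]]   # step 2: token 1 is very likely
print(sequence_confidence(steps, [0, 1]))  # close to 0, i.e. high confidence
```

Averaging log-probabilities rather than raw logits keeps the score comparable across sequences of different lengths.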

How to do face recognition using euclidean distance in Python

I have a project where I need to include face recognition. I am referring to this article. The article uses OpenFace to get the face embeddings and saves all the embeddings in a pickle file. It then passes the face-embedding data to a support vector machine, which generates another pickle file. This file is later used to recognize and predict the face.
This has been working and gives me more than 80% accuracy. But the article does not explain how to calculate the euclidean distance, which I need for my own research work.
I can easily calculate the euclidean distance between the face embedding of a test image and the face embeddings present in the pickle file, but I am not able to work out how to set the threshold value so that any distance greater than it is tagged as unknown.
Can anyone please point me to an article where this is explained so I can follow up from there? I have searched many articles but didn't find much on this. Please help. Thanks
You can build 2 (normal) distributions:
distances between the same person's faces
distances between different people's faces
The intersection of these distributions will be the threshold.
Small illustration: (image not preserved; it showed the two overlapping distance distributions crossing at the threshold)
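The idea above can be sketched in code: fit a normal distribution to each set of distances and take the point between the two means where the densities cross. This is a minimal illustration on synthetic distances, not tuned for a real embedding model:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_normal(values):
    """Maximum-likelihood mean and standard deviation."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def crossing_threshold(same_dists, diff_dists, steps=10000):
    """Fit a normal to each distance set and return the point between
    the two means where the fitted densities intersect."""
    mu_s, sd_s = fit_normal(same_dists)
    mu_d, sd_d = fit_normal(diff_dists)
    lo, hi = min(mu_s, mu_d), max(mu_s, mu_d)
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(xs, key=lambda x: abs(normal_pdf(x, mu_s, sd_s) - normal_pdf(x, mu_d, sd_d)))

# Synthetic distances: same-person pairs cluster low, different-person high.
rng = random.Random(0)
same = [rng.gauss(0.4, 0.1) for _ in range(500)]
diff = [rng.gauss(0.9, 0.15) for _ in range(500)]
t = crossing_threshold(same, diff)
print(t)  # any test distance above t gets tagged "unknown"
```

A numeric scan is used instead of solving the quadratic for the Gaussian intersection analytically; for an illustration it is simpler and avoids the equal-variance edge case.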

What are the methods to check if my model fits the data (without using graphs)?

I am working on a binary logistic regression data set in Python. I want to know if there are any numerical methods to calculate how well the model fits the data.
Please don't include graphical methods like plotting, etc.
Thanks :)
Read through section 3.3.2, Classification metrics, in the sklearn documentation:
http://scikit-learn.org/stable/modules/model_evaluation.html
Hope it helps.
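That section covers purely numerical fit measures such as accuracy, log loss, and ROC AUC. A small sketch with hypothetical predicted probabilities (in practice `y_prob` would come from `model.predict_proba(X)[:, 1]`):

```python
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Hypothetical labels and predicted probabilities for 8 samples.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(accuracy_score(y_true, y_pred))  # fraction of correct labels
print(log_loss(y_true, y_prob))        # penalises confident mistakes
print(roc_auc_score(y_true, y_prob))   # threshold-free ranking quality
```

Log loss and ROC AUC use the probabilities directly, so they are often more informative for logistic regression than accuracy at a fixed 0.5 cutoff.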

StackingRegressor: how to define meta features?

I am having some difficulties with StackingRegressor.
Actually I trained a lot of regressors and got good results.
I thought that I would probably get a better result if I used StackingRegressor, but I did not, and I think that I am missing something...
Here is my code:
regressors=[rf, knn]
stregr = StackingRegressor(regressors=regressors, meta_regressor=LinearRegression())
Here is what I understand about stacking:
if random forest is better than knn when (for example) the observation is a woman and knn is better than random forest when the observation is a man, stacking will use this information and predict with random forest for women and with knn for men.
Sex is a meta-feature that will be used by the stacking model.
But, in my example, how can I define a meta-feature list? What meta-feature will be used by StackingRegressor if I do not explicitly tell it which one to use? All available variables? None?
I also tried stacking with all my regressors but got very, very bad results! As above, I think that I need to define some meta-features to use, but I do not know how...
regressors=[rf, knn, gb, lasso, ridge, lr, svm_rbf, svm_lin, ada, xgb]
stregr = StackingRegressor(regressors=regressors, meta_regressor=LinearRegression())
If anyone understands how it works, or has a link to a Python notebook or anything else that would help, it would be great.
Thanks in advance!
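One point worth making concrete: in both mlxtend's StackingRegressor and scikit-learn's, you do not pick meta-features like "sex" yourself — by default the meta-regressor's inputs are the base regressors' predictions, and those predictions are the meta-features (mlxtend's `use_features_in_secondary=True` or scikit-learn's `passthrough=True` additionally feeds in the original columns). A minimal sketch with scikit-learn's StackingRegressor on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# The final (meta) regressor is trained on the base models'
# cross-validated predictions -- those predictions ARE the
# meta-features.  passthrough=True would append the original columns.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("knn", KNeighborsRegressor())],
    final_estimator=LinearRegression(),
)
stack.fit(X, y)
print(stack.final_estimator_.coef_)  # one weight per base regressor
```

So adding many weak base regressors can hurt: the linear meta-model must weigh all of their (possibly noisy) predictions, which may explain the very bad results with the full list.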

Can I extract significance values for Logistic Regression coefficients in PySpark?

Is there a way to get the significance level of each coefficient we receive after we fit a logistic regression model on training data?
I was trying to find a way and could not figure it out myself.
I think I may get the significance level of each feature if I run a chi-squared test, but first of all I am not sure I can run the test on all features together, and secondly my data is numeric, so whether it would give me the right result remains a question as well.
Right now I am running the modelling part using statsmodels and scikit-learn, but I certainly want to know how I can get these results from PySpark ML or MLlib itself.
If anyone can shed some light, it would be helpful.
I use only MLlib. I think that when you train a model you can use the toPMML method to export it in PMML format (an XML file), then parse the XML file to get the feature weights. Here is an example:
https://spark.apache.org/docs/2.0.2/mllib-pmml-model-export.html
Hope that helps.
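Note this route recovers the coefficients, not their significance levels — PMML stores weights only. As a sketch of the parsing step, here is a stdlib-only example against a simplified inline stand-in for the PMML a regression model exports (real PMML carries an `xmlns` namespace attribute, which you would strip or register before these tag lookups work; the element names `RegressionTable`/`NumericPredictor` follow the PMML regression schema):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the XML that toPMML() writes.
pmml = """
<PMML>
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <RegressionTable intercept="-1.25" targetCategory="1">
      <NumericPredictor name="age" coefficient="0.031"/>
      <NumericPredictor name="income" coefficient="0.0007"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

root = ET.fromstring(pmml)
weights = {p.get("name"): float(p.get("coefficient"))
           for p in root.iter("NumericPredictor")}
print(weights)  # {'age': 0.031, 'income': 0.0007}
```

For actual p-values you would still need statsmodels (or compute standard errors yourself from the Hessian), since neither Spark ML nor MLlib exposes them directly.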
