Spark ML Logistic Regression in Python: Set the model threshold to maximize F-Measure - apache-spark

I've trained a logistic regression in Spark using a pipeline. It ran, and I am now looking at model diagnostics.
I created my model summary (lr_summary = lrModel.stages[-1].summary).
After that I pretty much copied the code from this webpage. It all works until I try to determine the best threshold based on F-measure using this example Python code:
# Set the model threshold to maximize F-Measure
fMeasure = lr_summary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']).select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
Unfortunately, I am getting an error in line 3 (bestThreshold = ):
TypeError: 'NoneType' object has no attribute '__getitem__'
Any advice?
Thank you so much!

I cannot reproduce this problem, but it is possible that the model doesn't have a summary (in that case I would expect an AttributeError on the maxFMeasure = ... line). You can check whether the model has one:
lrModel.stages[-1].hasSummary
Also you can make this code much simpler:
bestThreshold = fMeasure.orderBy(fMeasure['F-Measure'].desc()).first().threshold
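For context on the error itself: in PySpark, head() returns None when the DataFrame has no rows, so the TypeError means the where(...) filter matched nothing. Why it matched nothing cannot be determined from the question. The sketch below is plain Python, not Spark code; it only illustrates what "pick the threshold with the highest F-measure" computes. The function names, the sample scores and labels, and the rule that a score at or above the threshold counts as positive are all assumptions made for illustration.

```python
def f_measure_by_threshold(scores, labels, beta=1.0):
    """Return (threshold, F-measure) pairs, one per distinct score.

    Assumption: a score >= threshold is predicted positive (label 1).
    """
    rows = []
    b2 = beta ** 2
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        denom = (1 + b2) * tp + b2 * fn + fp
        rows.append((t, (1 + b2) * tp / denom if denom else 0.0))
    return rows


def best_threshold(rows):
    """Threshold with the largest F-measure, or None if there are no rows."""
    if not rows:
        return None
    # Taking the row with the largest F-measure avoids comparing floats for equality.
    return max(rows, key=lambda row: row[1])[0]


# Hypothetical scores and true labels, for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
rows = f_measure_by_threshold(scores, labels)
best = best_threshold(rows)
print(best)  # 0.4
```

Selecting the row with the maximum value directly, as the orderBy(...).first() one-liner does, sidesteps the equality filter, and the explicit empty check mirrors the case where Spark would hand back None.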

Related

MaskRCNN should find exactly one element

I trained a maskrcnn-model with matterport with one class to detect. It worked.
Now I want to predict some unseen images. I know that the object is present on each image and that it appears only once per image. How do I use my model to do so?
One possibility that came to my mind was:
num_results = 0
while num_results == 0:
    model = mrcnn.model.MaskRCNN(mode='inference', config=pred_config)
    model.load_weights('weight/path')
    results = model.detect([img], verbose=1)
    num_results = compute_num_of(results)
    # lower DETECTION_MIN_CONFIDENCE from pred_config
But I think this is very time consuming because I load that model and its weights at every step. What would be best practice here?
Thanks
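One possible way to avoid the reload, offered as a sketch rather than a confirmed best practice: build the model and load the weights once, set DETECTION_MIN_CONFIDENCE low once in the config, run detect a single time per image, and keep only the highest-scoring detection. The helper name top_detection_index and the sample result below are hypothetical. The sketch assumes each entry returned by model.detect is a dict with a "scores" entry, which in the matterport implementation is a NumPy array; plain lists are used here so the example runs on its own.

```python
def top_detection_index(scores):
    """Index of the highest-scoring detection, or None if nothing was detected."""
    if len(scores) == 0:
        return None
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical stand-in for one entry of the list returned by model.detect([img]).
fake_result = {"scores": [0.31, 0.92, 0.55], "class_ids": [1, 1, 1]}

best = top_detection_index(fake_result["scores"])
print(best)  # 1
```

The same index can then be used to pick the matching box, class id and mask from the other entries of the result.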

Unable to Reproduce Results while using Scikit-learn RFECV

I am trying to use Recursive Feature Elimination with cross-validation (RFECV) and get reproducible results. I have tried fixing the randomness by passing random_state=SEED to the components used, and I have also set the global random seed with np.random.seed(SEED). However, I am still unable to control the randomness and cannot reproduce results. The code segment is below.
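from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedStratifiedKFold
# SEED, STEP, df and y are defined earlier in the asker's script
estimator = GradientBoostingClassifier(random_state=SEED, n_estimators=2*df.shape[1])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=SEED)
selector = RFECV(estimator, n_jobs=-1, step=STEP, cv=cv)
selector = selector.fit(df, y)
df = df.loc[:, selector.support_]
print("Shape of final data AFTER FEATURE SELECTION")
print(df.shape, y.shape)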
Specifically, each run of this code segment returns a different number of selected features. Any help would be appreciated.

Getting an error while executing perplexity function to evaluate the LDA model

I am trying to evaluate a topic model (LDA). I get an error while executing the perplexity function: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘perplexity’ for signature ‘"LDA_Gibbs", "numeric"’. Could someone please help me solve this?
As you haven't provided any example of your code, it's difficult to know what your exact issue is. However, I found this question when I was facing the same error so I will provide the problem I faced and solution here in the hope that it may help someone else.
In the topicmodels package, when a model is fitted with the Gibbs sampler, the perplexity() function requires newdata to be supplied as a document-term matrix (dtm). If you give it something else, you get this error. Going by your error message, you were probably giving it something numeric instead of a dtm.
Here is a working example, using the newsgroups data from the lda package converted to the dtm format:
library(topicmodels)
# load the required data from lda package
data("newsgroup.train.documents", "newsgroup.test.documents", "newsgroup.vocab", package="lda")
# create document-term matrix using newsgroups training data
dtm <- ldaformat2dtm(documents = newsgroup.train.documents, vocab = newsgroup.vocab)
# fit LDA model using Gibbs sampler
fit <- LDA(x = dtm, k = 20, method="Gibbs")
# create document-term matrix using newsgroups test data
testdtm <- ldaformat2dtm(documents = newsgroup.test.documents, vocab = newsgroup.vocab)
# calculate perplexity
perplexity(fit, newdata = testdtm)

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work, it's complaining about y having zero shape. Does it mean there's no way to apply the trained and cross-validated model from cross_val_predict on new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross-validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit a model on some data and then generate predictions for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
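from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)  # hard class labels
probabilities = logreg.predict_proba(x_new)  # class probabilities, as asked in the question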
I have the same concern as @user3490622. If cross_val_predict can only be used on training and testing sets, why is None the default value of y (the target)? (sklearn page)
To partially achieve the desired result of multiple predicted probabilities, one could repeat the fit-then-predict approach to mimic cross-validation.
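A minimal sketch of that repeated fit-then-predict idea, in plain Python so it runs on its own. Everything here is hypothetical: the helper names, the contiguous fold split, and the stand-in model fit_base_rate, which simply predicts the positive rate of its training labels. With scikit-learn, the fit callable would presumably wrap LogisticRegression().fit and return something based on predict_proba.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train_idx, val_idx) pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def averaged_fold_predictions(fit, x_old, y_old, x_new, k=5):
    """Fit one model per fold's training part and average its predictions on x_new."""
    totals = [0.0] * len(x_new)
    for train, _ in kfold_indices(len(x_old), k):
        predict = fit([x_old[i] for i in train], [y_old[i] for i in train])
        for j, x in enumerate(x_new):
            totals[j] += predict(x)
    return [t / k for t in totals]


def fit_base_rate(xs, ys):
    """Hypothetical stand-in model: always predicts the training positive rate."""
    rate = sum(ys) / len(ys)
    return lambda x: rate


x_old = list(range(10))
y_old = [0, 1] * 5
x_new = [100, 200]
probs = averaged_fold_predictions(fit_base_rate, x_old, y_old, x_new, k=5)
print(probs)  # [0.5, 0.5]
```

Each of the k models sees a different subset of the old data, so the averaged output on the new data reflects all folds without any model being evaluated on data it was trained on.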

Use the previously trained model for further prediction in catboost

I want to find optimal parameters for classification using CatBoost.
I have training data and test data. I want to run the algorithm for, say, 500 iterations and then make predictions on the test data. Next, I want to repeat this for 600 iterations, then 700 iterations, and so on. I don't want to start from iteration 0 each time. Is there any way to do this in CatBoost?
Any help is highly appreciated!
You can run the algorithm for the maximum number of iterations and then use CatBoost.predict() with ntree_limit parameter or CatBoost.staged_predict() to try different number of iterations.
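The reason this works is that a boosted model is an additive ensemble: the prediction after n iterations is the sum of the first n trees' contributions, so one long training run already contains every shorter model. The sketch below illustrates that idea in plain Python. It is not CatBoost's implementation; the function name and the per-tree contribution values are hypothetical.

```python
def staged_predictions(tree_outputs, checkpoints):
    """Ensemble prediction for one sample after each requested number of trees.

    tree_outputs: per-tree contributions for the sample, in training order.
    checkpoints: tree counts at which to report the running sum.
    """
    wanted = set(checkpoints)
    results = {}
    running = 0.0
    for n, out in enumerate(tree_outputs, start=1):
        running += out
        if n in wanted:
            results[n] = running
    return results


# Hypothetical contributions of four trees for a single sample.
stages = staged_predictions([0.5, 0.25, 0.125, 0.125], checkpoints=[2, 4])
print(stages)  # {2: 0.75, 4: 1.0}
```

In this picture, evaluating the model at 500, 600 and 700 iterations amounts to reading the running sum at three checkpoints of a single 700-iteration run.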
First I created a predictive model in R using XGBoost. Now I want to build a regression model using CatBoost to improve the results.
The superconductors dataset is split into a training dataset and a test dataset:
dataset_catboost20<-read.csv("train.csv")
dataset_catboost20
rows<-nrow(dataset_catboost20)
f<-0.65
upper_bound_catboost20<- floor(f*rows)
permuted_dataset_catboost20<- dataset_catboost20[sample(rows),]
train_dataset_catboost20<-permuted_dataset_catboost20[1:upper_bound_catboost20,]
train_dataset_catboost20
There are 28 independent variables and one dependent variable. I use the same formula as in XGBoost, converted with sparse.model.matrix for both XGBoost and CatBoost. With XGBoost the formula worked, but with CatBoost it shows this error:
Unsupported data type, expecting data.frame, got: dgCMatrix
Formula
train_dataset_catboost2020
y_traincatboost20=train_dataset_catboost20$critical_temp
catboost_trcontrol20<-trainControl(method="cv", number = 5,allowParallel = TRUE,verboseIter = FALSE,returnData = FALSE)
catboostGrid20 <- expand.grid(depth= c(2,6,8), learning_rate=0.1, iterations=100,
l2_leaf_reg=.05, rsm=.95, border_count=65)
catboost_model20 = train(
train_dataset_catboost2020,y_traincatboost20,method = catboost.caret,
logging_level="Silent",preProc=NULL,
tuneGrid = catboostGrid20,trControl=catboost_trcontrol20 )
