How to get confidence score in keras_ocr - keras

I am using keras_ocr from here. How can I get an output confidence score from it? I went through the official keras_ocr documentation here and found that for some popular datasets they simply set the confidence score to 1. I also searched the internet and found that confidence score functionality will be implemented in the future. How can I get a confidence score currently?
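One hedged workaround, not an official keras_ocr feature: the recognizer's underlying CTC model emits per-timestep softmax probabilities over the alphabet before decoding, and aggregating the per-timestep maxima gives a rough word-level confidence. The attribute names below (e.g. `recognizer.model`) are assumptions that may differ between keras_ocr versions, so check your installed source before relying on them.

```python
# Rough sketch (not official keras_ocr API): approximate a per-word confidence
# by reading the recognizer's raw per-timestep softmax output before CTC decoding.
# `recognizer.model` and its output shape are assumptions -- verify against your version.
import numpy as np
import keras_ocr

recognizer = keras_ocr.recognition.Recognizer()

def word_confidence(word_image):
    """word_image: a single cropped word image, preprocessed to the recognizer's input shape."""
    # Assumed: the underlying Keras model returns (1, timesteps, alphabet_size)
    # softmax probabilities for one preprocessed crop.
    probs = recognizer.model.predict(word_image[np.newaxis, ...])[0]
    # Average the most likely character's probability at each timestep.
    # (Multiplying instead of averaging gives a stricter, sequence-level score.)
    return float(np.mean(np.max(probs, axis=-1)))
```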

Related

In DialogFlow-ES, how can I measure the accuracy of my agent apart from calculating the confidence score?

I have created a chatbot using DialogFlow-ES, but I cannot find a way to generate a confusion matrix from which we can measure the accuracy, precision, and recall of our chatbot.
Even if we cannot generate a confusion matrix, are there any other performance metrics, apart from the confidence score for a query, that DialogFlow-ES provides?
I tried searching Google for other metrics, and everywhere only the confidence score was mentioned. I also read some articles saying that Dialogflow chatbots are different from conventional chatbots, which might be why I have been unable to find any Python libraries for this task either.
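One pragmatic approach, sketched here rather than provided by DialogFlow-ES itself: keep a labelled set of test utterances, send each one to the agent, record the matched intent, and build the confusion matrix yourself with scikit-learn. The `detect_intent_for` helper below is hypothetical and stands in for however you call the detect-intent API.

```python
# Sketch: build a confusion matrix for a DialogFlow-ES agent from a labelled test set.
# `detect_intent_for` is a hypothetical helper wrapping your detect-intent API call.
from sklearn.metrics import confusion_matrix, classification_report

test_set = [
    ("what's the weather tomorrow", "weather.forecast"),  # (utterance, true intent)
    ("book a table for two", "restaurant.book"),
]

y_true = [intent for _, intent in test_set]
y_pred = [detect_intent_for(utterance) for utterance, _ in test_set]  # hypothetical call

labels = sorted(set(y_true) | set(y_pred))
print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```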

Confusion matrix for LDA

I’m trying to check the performance of my LDA model using a confusion matrix but I have no clue what to do. I’m hoping someone can maybe just point me in the right direction.
So I ran an LDA model on a corpus filled with short documents. I then calculated the average vector of each document and then proceeded with calculating cosine similarities.
How would I now get a confusion matrix? Please note that I am very new to the world of NLP. If there is some other/better way of checking the performance of this model please let me know.
What is your model supposed to be doing? And how is it testable?
In your question you haven't described a testable assessment of the model whose results could be represented in a confusion matrix.
A confusion matrix helps you represent and explore the different types of "accuracy" of a predictive system such as a classifier. It requires your system to make a choice (e.g. yes/no, or a multi-label classifier), and you must use known test data to be able to score it against how the system should have chosen. Then you count these results in the matrix as one of the combinations of possibilities, e.g. for binary choices there are two wrong and two correct outcomes.
For example, if your cosine similarities are trying to predict if a document is in the same "category" as another, and you do know the real answers, then you can score them all as to whether they were predicted correctly or wrongly.
The four possibilities for a binary choice are:
Positive prediction vs. positive actual = True Positive (correct)
Negative prediction vs. negative actual = True Negative (correct)
Positive prediction vs. negative actual = False Positive (wrong)
Negative prediction vs. positive actual = False Negative (wrong)
It's more complicated in a multi-label system as there are more combinations, but the correct/wrong outcome is similar.
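As a concrete illustration of the counting described above, here is a minimal sketch assuming you threshold your cosine similarities into a yes/no "same category" prediction and have known labels for a test set; the scores, labels, and threshold are made-up examples.

```python
# Sketch: turn cosine similarities into binary predictions and count the four outcomes.
import numpy as np
from sklearn.metrics import confusion_matrix

similarities = np.array([0.91, 0.12, 0.55, 0.78, 0.30])  # example similarity scores
y_true       = np.array([1,    0,    0,    1,    1])      # known labels (1 = same category)

threshold = 0.5                                 # assumed decision threshold
y_pred = (similarities >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```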
About "accuracy".
There are many ways to measure how well a system performs, so it's worth reading up on this before choosing how to score yours. The term "accuracy" means something specific in this field, and is sometimes confused with the general usage of the word.
How you would use a confusion matrix.
The confusion matrix sums (of total TP, FP, TN, FN) can be fed into some simple equations which give you these performance ratings (which are referred to by different names in different fields):
sensitivity, d' (dee-prime), recall, hit rate, or true positive rate (TPR)
specificity, selectivity or true negative rate (TNR)
precision or positive predictive value (PPV)
negative predictive value (NPV)
miss rate or false negative rate (FNR)
fall-out or false positive rate (FPR)
false discovery rate (FDR)
false omission rate (FOR)
Accuracy
F Score
So you can see that Accuracy is a specific thing, but it may not be what you think of when you say "accuracy"! The last two are more complex combinations of measures. The F Score is perhaps the most robust of these, as it's tuneable to represent your requirements by combining a mix of the other metrics.
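To make those definitions concrete, here is a minimal sketch of how the most common of these ratings fall out of the four counts, continuing the TP/FP/TN/FN notation above; the numbers in the example call are toy values for an imbalanced set.

```python
# Sketch: derive common performance ratings from the confusion-matrix counts.
def summarise(tp, tn, fp, fn):
    recall      = tp / (tp + fn)                  # sensitivity / hit rate / TPR
    specificity = tn / (tn + fp)                  # TNR
    precision   = tp / (tp + fp)                  # PPV
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1          = 2 * precision * recall / (precision + recall)
    return dict(recall=recall, specificity=specificity,
                precision=precision, accuracy=accuracy, f1=f1)

print(summarise(tp=40, tn=900, fp=10, fn=50))  # toy counts
```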
I found this Wikipedia article most useful; it helped me understand why it is sometimes best to choose one metric over another for your application (e.g. whether missing trues is worse than missing falses). There is a group of linked articles on the same topic from different perspectives, e.g. this one about search.
This is a simpler reference I found myself returning to: http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
This is about sensitivity, more from a statistical-science point of view, with links to ROC charts, which are related to confusion matrices and also useful for visualising and assessing performance: https://en.wikipedia.org/wiki/Sensitivity_index
This article is more specific to using these in machine learning, and goes into more detail: https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
So in summary confusion matrices are one of many tools to assess the performance of a system, but you need to define the right measure first.
Real world example
I worked through this process recently on a project where the goal was to find all of a few relevant documents in a large set (using cosine distances, like you). This was like a recommendation engine driven by manual labelling rather than an initial search query.
I drew up a list of goals with a stakeholder in their own terms, from the project's domain perspective, then tried to translate or map these goals into performance metrics and statistical terms. You can see it's not just a simple choice! The hugely imbalanced nature of our data set skewed the choice of metric, as some metrics assume balanced data and will otherwise give you misleading results.
Hopefully this example will help you move forward.

How to improve the confidence score of the intent in Rasa NLU?

I was working on Rasa NLU for intent classification (link); how can I improve the confidence score for a given intent?
I have tried giving more training data, but the confidence score still isn't increasing. Can anyone please let me know which parameters/hyperparameters I can tune in order to get a good confidence score.
I tried all possible combinations provided in this link, but there was hardly any improvement.
I also checked the suggestion provided here, but I am looking for more granular tuning of the model so that it can perform better.
Thanks.
Edit 1: Please provide a valid reason for down-voting the question.
You can use tensorflow_embedding, which gives a confidence score near 0.9, rather than spacy_sklearn, which gives one near 0.3.
Depending on whether spaCy provides a good language model for your language, you should either use the spaCy pipeline (as it comes with pretrained models) or the tensorflow_embedding pipeline, which works with any language but requires more training examples.
I think that your problems might be caused by overlapping training examples. An example to clarify:
## intent:ask_bot_name
- Tell me your name
- What is your name
- name please
## intent:ask_location_name
- Tell me the name
- What's the name
- name please
So I would suggest going through your training data and checking whether different intents contain the same or very similar examples.
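A quick way to spot such overlaps is sketched below; it assumes your training examples have already been loaded into a dict of intent name to list of example strings (the dict contents and the 0.8 threshold are illustrative), rather than parsed straight from the Rasa markdown file.

```python
# Sketch: flag identical or near-identical examples that appear under different intents.
from difflib import SequenceMatcher
from itertools import combinations

training_data = {  # assumed to be loaded from your NLU training file
    "ask_bot_name": ["Tell me your name", "What is your name", "name please"],
    "ask_location_name": ["Tell me the name", "What's the name", "name please"],
}

for (intent_a, examples_a), (intent_b, examples_b) in combinations(training_data.items(), 2):
    for a in examples_a:
        for b in examples_b:
            similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if similarity > 0.8:  # assumed similarity threshold
                print(f"{intent_a!r} / {intent_b!r}: {a!r} ~ {b!r} ({similarity:.2f})")
```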

How to debug scikit classifier that chooses wrong class with high confidence

I am using the LogisticRegression classifier to classify documents. The results are good (macro-avg. f1 = 0.94). I apply an extra step to the prediction results (predict_proba) to check whether a classification is "confident" enough (e.g. >0.5 confidence for the top class, >0.2 difference in confidence to the second class, etc.). Otherwise, the sample is discarded as "unknown".
The score that is most significant for me is the fraction of samples that, despite this additional step, are assigned to the wrong class. This score is unfortunately too high (~0.03). In many of these cases, the classifier is very confident (0.8 - 0.9999!) that it chose the correct class.
Changing parameters (C, class_weight, min_df, tokenizer) has so far only led to a small decrease in this score, but a significant decrease in correct classifications. However, looking at several samples and the most discriminative features of the respective classes, I cannot understand where this high confidence comes from. I would assume it is possible to discard most of these samples without discarding significantly more correct samples.
Is there a way to debug/analyze such cases? What could be the reason for these high confidence values?
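One way to start debugging, sketched under the assumption of a TF-IDF + LogisticRegression setup like the one described: collect the confidently wrong test samples and inspect which features pushed the decision, by multiplying each sample's TF-IDF values by the winning class's coefficients. Names such as `vectorizer`, `clf`, `X_test`, and `y_test` are placeholders for your own fitted pipeline objects.

```python
# Sketch: inspect test samples misclassified with high confidence and show the
# features that contributed most to the (wrong) winning class.
# Assumes a fitted TfidfVectorizer (`vectorizer`), a fitted multiclass
# LogisticRegression (`clf`, so coef_ has one row per class), a sparse X_test,
# and labels y_test -- all placeholders.
import numpy as np

y_test = np.asarray(y_test)
probas = clf.predict_proba(X_test)
preds = clf.classes_[probas.argmax(axis=1)]
confidence = probas.max(axis=1)

confidently_wrong = np.where((preds != y_test) & (confidence > 0.8))[0]
feature_names = vectorizer.get_feature_names_out()

for i in confidently_wrong[:5]:
    wrong_class = int(np.where(clf.classes_ == preds[i])[0][0])
    # Per-feature contribution to the winning class's decision function for this sample.
    contributions = X_test[i].toarray().ravel() * clf.coef_[wrong_class]
    top = np.argsort(contributions)[::-1][:10]
    print(f"sample {i}: true={y_test[i]} pred={preds[i]} conf={confidence[i]:.3f}")
    print("  top features:", [(feature_names[j], round(float(contributions[j]), 3)) for j in top])
```

Looking at those per-feature contributions often reveals spurious but strongly weighted terms (boilerplate, names, formatting artefacts) that explain the overconfidence better than the class's "most discriminative features" in aggregate.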

AIC and PSSE comparison

The Akaike Information Criterion (AIC) and the predictive sum of squares error (PSSE) are, respectively, an information-theoretic measure and a predictive-capability measure used to rank models.
I was trying to rank my models based on these two criteria. However, the rankings are completely contradictory: the six models ranked best by AIC are ranked worst by PSSE.
In this kind of situation, how do I decide which measure is best? I tried to look for articles or research papers but unfortunately could not find much. Any information would be appreciated.
Thank you!
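For context, here is a minimal sketch of how the two scores are typically computed for a least-squares model, assuming Gaussian errors for the AIC formula and a held-out set for PSSE. It also hints at why they can disagree: AIC is computed on the fitting data and penalises parameter count, while PSSE depends entirely on out-of-sample predictions.

```python
# Sketch: compute AIC (in-sample fit with a complexity penalty) and PSSE
# (out-of-sample squared error). Assumes Gaussian residuals for the AIC formula.
import numpy as np

def aic(y, y_fit, k):
    """k = number of fitted parameters; AIC = n*ln(RSS/n) + 2k (Gaussian errors)."""
    y, y_fit = np.asarray(y), np.asarray(y_fit)
    n = len(y)
    rss = np.sum((y - y_fit) ** 2)
    return n * np.log(rss / n) + 2 * k

def psse(y_holdout, y_pred):
    """Predictive sum of squared errors on held-out data."""
    y_holdout, y_pred = np.asarray(y_holdout), np.asarray(y_pred)
    return float(np.sum((y_holdout - y_pred) ** 2))
```

A model can win on AIC by fitting the training data well with few parameters yet still lose on PSSE if it generalises poorly, so it is worth checking that each score is computed on the data split you intend before comparing rankings.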
