what are the methods to estimate probabilities of production rules? - nlp

I know that n-gram is useful for finding the probability of words,I want to know, How to estimate probabilities of production rules? How many methods or rules to calculate probabilities of production rules?
I could not find any good blog or something on this topic.Now I am studying on probabilistic context free grammar & CKY parsing algorithm.

As I understand your question, you are asking how to estimate the parameters of a PCFG model from data.
In short, it's easy to make empirical production-rule probability estimates when you have ground-truth parses in your training data. If you want to estimate the probability that S -> NP VP, this is something like Count(S -> NP VP) / Count(S -> *), where * is any possible subtree.
You can find a much more formal statement in lots of places on the web (search for "PCFG estimation" or "PCFG learning"). Here's a nice one from Michael Collins' lecture notes: http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/pcfgs.pdf#page=9

Related

Confusion matrix for LDA

I’m trying to check the performance of my LDA model using a confusion matrix but I have no clue what to do. I’m hoping someone can maybe just point my in the right direction.
So I ran an LDA model on a corpus filled with short documents. I then calculated the average vector of each document and then proceeded with calculating cosine similarities.
How would I now get a confusion matrix? Please note that I am very new to the world of NLP. If there is some other/better way of checking the performance of this model please let me know.
What is your model supposed to be doing? And how is it testable?
In your question you haven't described your testable assessment of the model the results of which would be represented in a confusion matrix.
A confusion matrix helps you represent and explore the different types of "accuracy" of a predictive system such as a classifier. It requires your system to make a choice (e.g. yes/no, or multi-label classifier) and you must use known test data to be able to score it against how the system should have chosen. Then you count these results in the matrix as one of the combination of possibilities, e.g. for binary choices there's two wrong and two correct.
For example, if your cosine similarities are trying to predict if a document is in the same "category" as another, and you do know the real answers, then you can score them all as to whether they were predicted correctly or wrongly.
The four possibilities for a binary choice are:
Positive prediction vs. positive actual = True Positive (correct)
Negative prediction vs. negative actual = True Negative (correct)
Positive prediction vs. negative actual = False Positive (wrong)
Negative prediction vs. positive actual = False Negative (wrong)
It's more complicated in a multi-label system as there are more combinations, but the correct/wrong outcome is similar.
About "accuracy".
There are many kinds of ways to measure how well the system performs, so it's worth reading up on this before choosing the way to score the system. The term "accuracy" means something specific in this field, and is sometimes confused with the general usage of the word.
How you would use a confusion matrix.
The confusion matrix sums (of total TP, FP, TN, FN) can fed into some simple equations which give you, these performance ratings (which are referred to by different names in different fields):
sensitivity, d' (dee-prime), recall, hit rate, or true positive rate (TPR)
specificity, selectivity or true negative rate (TNR)
precision or positive predictive value (PPV)
negative predictive value (NPV)
miss rate or false negative rate (FNR)
fall-out or false positive rate (FPR)
false discovery rate (FDR)
false omission rate (FOR)
Accuracy
F Score
So you can see that Accuracy is a specific thing, but it may not be what you think of when you say "accuracy"! The last two are more complex combinations of measure. The F Score is perhaps the most robust of these, as it's tuneable to represent your requirements by combining a mix of other metrics.
I found this wikipedia article most useful and helped understand why sometimes is best to choose one metric over the other for your application (e.g. whether missing trues is worse than missing falses). There are a group of linked articles on the same topic, from different perspectives e.g. this one about search.
This is a simpler reference I found myself returning to: http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
This is about sensitivity, more from a science statistical view with links to ROC charts which are related to confusion matrices, and also useful for visualising and assessing performance: https://en.wikipedia.org/wiki/Sensitivity_index
This article is more specific to using these in machine learning, and goes into more detail: https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
So in summary confusion matrices are one of many tools to assess the performance of a system, but you need to define the right measure first.
Real world example
I worked through this process recently in a project I worked on where the point was to find all of few relevant documents from a large set (using cosine distances like yours). This was like a recommendation engine driven by manual labelling rather than an initial search query.
I drew up a list of goals with a stakeholder in their own terms from the project domain perspective, then tried to translate or map these goals into performance metrics and statistical terms. You can see it's not just a simple choice! The hugely imbalanced nature of our data set skewed the choice of metric as some assume balanced data or else they will give you misleading results.
Hopefully this example will help you move forward.

Python AUC Calculation for Unsupervised Anomaly Detection (Isolation Forest, Elliptic Envelope, ...)

I am currently working in anomaly detection algorithms. I read papers comparing unsupervised anomaly algorithms based on AUC values. For example i have anomaly scores and anomaly classes from Elliptic Envelope and Isolation Forest. How can i compare these two algorithms based on AUC values.
I am looking for a python code example.
Thanks
Problem solved. Steps i done so far;
1) Gathering class and score after anomaly function
2) Converting anomaly score to 0 - 100 scale for better compare with different algorihtms
3) Auc requires this variables to be arrays. My mistake was to use them like Data Frame column which returns "nan" all the time.
Python Script:
#outlier_class and outlier_score must be array
fpr,tpr,thresholds_sorted=metrics.roc_curve(outlier_class,outlier_score)
aucvalue_sorted=metrics.auc(fpr,tpr)
aucvalue_sorted
Regards,
Seçkin Dinç
Although you already solved your problem, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to say), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)

Binary semi-supervised classification with positive only and unlabeled data set

My data consist of comments (saved in files) and few of them are labelled as positive. I would like to use semi-supervised and PU classification to classify these comments into positive and negative classes. I would like to know if there is any public implementation for semi-supervised and PU implementations in python (scikit-learn)?
You could try to train a one-class SVM and see what kind of results that gives you. I haven't heard about the PU paper. I think for all practical purposes you will be much better of labelling some points and then using semi-supervised methods.
If finding negative points is hard, I would try to use heuristics to find putative negative points (which I think is similar to the techniques in the PU paper). You could either classify unlabelled vs positive and then only look at the ones that score strongly for unlabelled, or learn a one-class SVM or similar and then look for negative points in the outliers.
If you are interested in actually solving the task, I would much rather invest time in manual labelling than implementing fancy methods.

How to combine LIBSVM probability estimates from two (or three) two class SVM classifiers.

I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An Inverse-variance weighted average (since each task is repeated a few times and sample variance (VARe, VARm and VARd) can be calculated - in which case Pe would be a simple average of all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (( 1/VARe ) + ( 1/VARm ) + ( 1/VARd ))
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication would provide a very low number, so it's unclear how to interpret that as an overall probability when the results of the voting are very clear.
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the output from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer Neural Net where your inputs represent inputs to the intermediates, the (first) hidden layer represents outputs to the intermediate problem, and subsequent layer(s) represent the final decision you want. This way you get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate tht you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.

information criteria for confusion matrices

One can measure goodness of fit of a statistical model using Akaike Information Criterion (AIC), which accounts for goodness of fit and for the number of parameters that were used for model creation. AIC involves calculation of maximized value of likelihood function for that model (L).
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?
It is not possible to calculate the AIC from a confusion matrix since it doesn't contain any information about the likelihood. Depending on the model you are using it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.
Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier''s answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.

Resources