Binary semi-supervised classification with positive only and unlabeled data set

Binary semi-supervised classification with positive only and unlabeled data set - scikit-learn

My data consist of comments (saved in files) and few of them are labelled as positive. I would like to use semi-supervised and PU classification to classify these comments into positive and negative classes. I would like to know if there is any public implementation for semi-supervised and PU implementations in python (scikit-learn)?

You could try to train a one-class SVM and see what kind of results that gives you. I haven't heard about the PU paper. I think for all practical purposes you will be much better of labelling some points and then using semi-supervised methods.
If finding negative points is hard, I would try to use heuristics to find putative negative points (which I think is similar to the techniques in the PU paper). You could either classify unlabelled vs positive and then only look at the ones that score strongly for unlabelled, or learn a one-class SVM or similar and then look for negative points in the outliers.
If you are interested in actually solving the task, I would much rather invest time in manual labelling than implementing fancy methods.

Related

Bert fine-tuned for semantic similarity

I would like to apply fine-tuning Bert to calculate semantic similarity between sentences.
I search a lot websites, but I almost not found downstream about this.
I just found STS benchmark.
I wonder if I can use STS benchmark dataset to train a fine-tuning bert model, and apply it to my task.
Is it reasonable?
As I know, there are a lot method to calculate similarity including cosine similarity, pearson correlation, manhattan distance, etc.
How choose for semantic similarity?

In addition, if you're after a binary verdict (yes/no for 'semantically similar'), BERT was actually benchmarked on this task, using the MRPC (Microsoft Research Paraphrase Corpus).
The google github repo https://github.com/google-research/bert includes some example calls for this, see --task_name=MRPC in section Sentence (and sentence-pair) classification tasks.

As a general remark ahead, I want to stress that this kind of question might not be considered on-topic on Stackoverflow, see How to ask. There are, however, related sites that might be better for these kinds of questions (no code, theoretical PoV), namely AI Stackexchange, or Cross Validated.
If you look at a rather popular paper in the field by Mueller and Thyagarajan, which is concerned with learning sentence similarity on LSTMs, they use a closely related dataset (the SICK dataset), which is also hosted by the SemEval competition, and ran alongside the STS benchmark in 2014.
Either one of those should be a reasonable set to fine-tune on, but STS has run over multiple years, so the amount of available training data might be larger.
As a great primer on the topic, I can also highly recommend the Medium article by Adrien Sieg (see here, which comes with an accompanied GitHub reference.
For semantic similarity, I would estimate that you are better of with fine-tuning (or training) a neural network, as most classical similarity measures you mentioned have a more prominent focus on the token similarity (and thus, syntactic similarity, although not even that necessarily). Semantic meaning, on the other hand, can sometimes differ wildly on a single word (maybe a negation, or the swapped sentence position of two words), which is difficult to interpret or evaluate with static methods.

Confusion matrix for LDA

I’m trying to check the performance of my LDA model using a confusion matrix but I have no clue what to do. I’m hoping someone can maybe just point my in the right direction.
So I ran an LDA model on a corpus filled with short documents. I then calculated the average vector of each document and then proceeded with calculating cosine similarities.
How would I now get a confusion matrix? Please note that I am very new to the world of NLP. If there is some other/better way of checking the performance of this model please let me know.

What is your model supposed to be doing? And how is it testable?
In your question you haven't described your testable assessment of the model the results of which would be represented in a confusion matrix.
A confusion matrix helps you represent and explore the different types of "accuracy" of a predictive system such as a classifier. It requires your system to make a choice (e.g. yes/no, or multi-label classifier) and you must use known test data to be able to score it against how the system should have chosen. Then you count these results in the matrix as one of the combination of possibilities, e.g. for binary choices there's two wrong and two correct.
For example, if your cosine similarities are trying to predict if a document is in the same "category" as another, and you do know the real answers, then you can score them all as to whether they were predicted correctly or wrongly.
The four possibilities for a binary choice are:
Positive prediction vs. positive actual = True Positive (correct)
Negative prediction vs. negative actual = True Negative (correct)
Positive prediction vs. negative actual = False Positive (wrong)
Negative prediction vs. positive actual = False Negative (wrong)
It's more complicated in a multi-label system as there are more combinations, but the correct/wrong outcome is similar.
About "accuracy".
There are many kinds of ways to measure how well the system performs, so it's worth reading up on this before choosing the way to score the system. The term "accuracy" means something specific in this field, and is sometimes confused with the general usage of the word.
How you would use a confusion matrix.
The confusion matrix sums (of total TP, FP, TN, FN) can fed into some simple equations which give you, these performance ratings (which are referred to by different names in different fields):
sensitivity, d' (dee-prime), recall, hit rate, or true positive rate (TPR)
specificity, selectivity or true negative rate (TNR)
precision or positive predictive value (PPV)
negative predictive value (NPV)
miss rate or false negative rate (FNR)
fall-out or false positive rate (FPR)
false discovery rate (FDR)
false omission rate (FOR)
Accuracy
F Score
So you can see that Accuracy is a specific thing, but it may not be what you think of when you say "accuracy"! The last two are more complex combinations of measure. The F Score is perhaps the most robust of these, as it's tuneable to represent your requirements by combining a mix of other metrics.
I found this wikipedia article most useful and helped understand why sometimes is best to choose one metric over the other for your application (e.g. whether missing trues is worse than missing falses). There are a group of linked articles on the same topic, from different perspectives e.g. this one about search.
This is a simpler reference I found myself returning to: http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
This is about sensitivity, more from a science statistical view with links to ROC charts which are related to confusion matrices, and also useful for visualising and assessing performance: https://en.wikipedia.org/wiki/Sensitivity_index
This article is more specific to using these in machine learning, and goes into more detail: https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
So in summary confusion matrices are one of many tools to assess the performance of a system, but you need to define the right measure first.
Real world example
I worked through this process recently in a project I worked on where the point was to find all of few relevant documents from a large set (using cosine distances like yours). This was like a recommendation engine driven by manual labelling rather than an initial search query.
I drew up a list of goals with a stakeholder in their own terms from the project domain perspective, then tried to translate or map these goals into performance metrics and statistical terms. You can see it's not just a simple choice! The hugely imbalanced nature of our data set skewed the choice of metric as some assume balanced data or else they will give you misleading results.
Hopefully this example will help you move forward.

Python AUC Calculation for Unsupervised Anomaly Detection (Isolation Forest, Elliptic Envelope, ...)

I am currently working in anomaly detection algorithms. I read papers comparing unsupervised anomaly algorithms based on AUC values. For example i have anomaly scores and anomaly classes from Elliptic Envelope and Isolation Forest. How can i compare these two algorithms based on AUC values.
I am looking for a python code example.
Thanks

Problem solved. Steps i done so far;
1) Gathering class and score after anomaly function
2) Converting anomaly score to 0 - 100 scale for better compare with different algorihtms
3) Auc requires this variables to be arrays. My mistake was to use them like Data Frame column which returns "nan" all the time.
Python Script:
#outlier_class and outlier_score must be array
fpr,tpr,thresholds_sorted=metrics.roc_curve(outlier_class,outlier_score)
aucvalue_sorted=metrics.auc(fpr,tpr)
aucvalue_sorted
Regards,
Seçkin Dinç

Although you already solved your problem, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to say), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)

Multiclass text classification with python and nltk

I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.

In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.

Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.

text classification using svm

i read this article :A hybrid classification method of k nearest neighbor, Bayesian methods
and genetic algorithm
it's proposed to use genetic algorithm in order to improve text classification
i want to replace Genetic algorithm with SVM but i don't know if it works or not
i mean i do not know if the new idea and the result will be better than this article
i read somewhere Ga is better than SVM but i dono if it's right or not?

SVM and Genetic Algorithms are in fact completely different methods. SVM is basicaly a classification tool, while genetic algorithms are meta optimisation heuristic. Unfortunately I do not have access to the cited paper, but I can hardly imagine, how putting sVM in the place of GA could work.
i read somewhere Ga is better than SVM but i dono if it's right or not?
No, it is not true. These methods are not comparable as they are completely different tools.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string