How can I implement text classification for the purpose of matching using GPT-3?

I have tried fine-tuning a GPT-3 model for text classification, to classify whether two names match. For example, given 'William Jonathan' and 'William J', the label would be yes/no, with yes indicating that the two names match and no indicating that they don't. I created a large number of examples covering different scenarios such as names being spelled differently, abbreviations, missing tokens, etc. I then fine-tuned the model on GPT-3 using JSONL examples that look like this:
{"text": "Are the following two names the same?\nWilliam Jonathan\nWilliam J", "label": "Yes"}
However, the model does not perform binary classification; instead, it outputs a long run of labels concatenated together, for instance:
Prompt: Are the following two names the same?\nWilliam Jonathan \nWilliam J
Completion: YesYesNoYesNoYesYesNoYesNoYesNoYesNoYesYes
Any ideas on how I can perform binary text classification using GPT-3 on examples like the one above?
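For reference, here is a minimal sketch of one common way to set this up, assuming the legacy OpenAI completions fine-tuning format (prompt/completion keys, a fixed separator at the end of the prompt, a single-token label with a leading space, and a length limit at inference). The separator, file name, and model id below are placeholders, not the poster's actual setup:

    # Sketch only: reformat the training data with a separator and a
    # single-token completion, following OpenAI's (legacy) classification
    # fine-tuning guidelines.
    import json

    pairs = [
        ("William Jonathan", "William J", "yes"),
        ("William Jonathan", "Robert Smith", "no"),
    ]

    with open("name_match.jsonl", "w") as f:
        for name_a, name_b, label in pairs:
            record = {
                "prompt": f"Are the following two names the same?\n{name_a}\n{name_b}\n\n###\n\n",
                "completion": " " + label,   # leading space, single-token label
            }
            f.write(json.dumps(record) + "\n")

    # At inference time, limiting the output length keeps the model from
    # emitting a run of labels. `openai.Completion.create` is the legacy
    # completions endpoint of the openai Python package (assumed here).
    # import openai
    # response = openai.Completion.create(
    #     model="YOUR_FINE_TUNED_MODEL",   # placeholder model id
    #     prompt="Are the following two names the same?\nWilliam Jonathan\nWilliam J\n\n###\n\n",
    #     max_tokens=1,
    #     temperature=0,
    # )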

Related

Which classification model should I use for author attribution in machine learning?

I aim to have a training set of texts written by a specific author and a larger test set of texts of unknown authorship. I want to be able to predict whether or not each text in the test set was written by the author of the training texts. What classification model should I use to achieve this, and how might I implement it?
You could use a logistic regression model. Even though it has "regression" in the name, it is used for classification.
If the use of certain words is typical of your author, you could build a model based on the frequency of words in the texts:
Before you apply the model, you need to turn the texts into numerical values. To do this, you can assign a token (index) to each unique word.
You then create a feature vector by counting the frequency of each word.
Logistic regression model for text classification contains code where these steps are carried out to derive the judgement of a movie review; a minimal sketch also follows below.
If, for example, the sequence of words needs to be taken into account, you will need a modified approach.
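Here is a minimal sketch of the word-frequency plus logistic regression pipeline described above, using scikit-learn (an assumption; the linked post may use a different library). The tiny corpus and labels are placeholders for real training texts:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["a text known to be by the author ...", "a text by someone else ..."]
    train_labels = [1, 0]   # 1 = written by the author, 0 = not

    vectorizer = CountVectorizer()                     # assigns an index to each unique word
    X_train = vectorizer.fit_transform(train_texts)    # word-frequency feature vectors

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    test_texts = ["an unknown text to attribute ..."]
    X_test = vectorizer.transform(test_texts)
    print(clf.predict(X_test))                         # 1 if predicted to be by the author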

Cluster similar words using word2vec

I have various restaurant labels, and I also have some words that are unrelated to restaurants, like the ones below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick out the labels that are related to food choices and leave out words like 'oil and lube' and 'transportation'.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually, but I want to know whether there is a way to cluster all related labels together using NLP or Word2Vec.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often covers other forms of close relation, including oppositeness and other ways words can be interchangeable or used in similar contexts. So whether the word-vector similarity values provide a good threshold cutoff for your particular "related to food" test is something you'd have to try out and tinker with. (For example: whether words that are drop-in replacements for each other end up closest, or words that are common in the same topics end up closest, can be influenced by whether the window parameter is smaller or larger. So you may find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more specific recommendations would require more details on the training data you have available – where do these labels come from? what format are they in? how much data do you have? – and on your ultimate goals – why is it important to distinguish between restaurant and non-restaurant labels?
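As a rough sketch of the multi-word-token point, gensim (an assumption; no library is named above) can merge frequent collocations into single tokens before training, and the window parameter can then be tuned as described. The toy corpus and low thresholds below are placeholders:

    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    # toy corpus: one tokenized label/sentence per list
    corpus = [
        ["oil", "and", "lube", "service"],
        ["oil", "and", "lube", "station"],
        ["vegan", "pizza", "and", "burger", "place"],
    ]

    # a first pass merges frequent bigrams ('oil_and'); a second pass can merge
    # those into three-word tokens ('oil_and_lube'); thresholds are set low only
    # because this toy corpus is tiny
    bigram = Phraser(Phrases(corpus, min_count=1, threshold=1))
    trigram = Phraser(Phrases(bigram[corpus], min_count=1, threshold=1))
    phrased = [trigram[bigram[sent]] for sent in corpus]

    # smaller window -> more 'drop-in replacement' similarity,
    # larger window -> more topical similarity (as discussed above)
    model = Word2Vec(phrased, vector_size=50, window=3, min_count=1, epochs=50)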
OK, thank you for the details.
In order to train word2vec, you should take into account the following points:
1. You need a large and varied text dataset. Review your training set and make sure it contains the kind of data you need in order to obtain what you want.
2. Use one sentence/phrase per line.
3. For preprocessing, remove punctuation and lowercase all strings.
4. Do NOT lemmatize or stem, because the text would become less complex (you would lose information)!
5. Try different settings:
5.1 Algorithm: I used word2vec and I can say CBOW (continuous bag-of-words) provided better results than skip-gram on different training sets.
5.2 Number of layers: 200 layers provided good results.
5.3 Vector size: a vector length of 300 is OK.
Now run the training algorithm. Then, use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. their vectors) with cosine similarity. From my experience, cosine similarity provides a satisfactory result: the similarity between two words is given by a value between 0 and 1. Synonyms have high cosine values; you must find the threshold that separates words which are synonyms from those that are not.
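A minimal sketch of this recipe using gensim (again an assumption; the answer does not name a library), with CBOW, 300-dimensional vectors, and a cosine-similarity check between two words; the toy sentences are placeholders:

    from gensim.models import Word2Vec

    # one preprocessed (lowercased, punctuation-stripped) sentence per list
    sentences = [
        ["the", "vegan", "restaurant", "serves", "pizza", "and", "burgers"],
        ["the", "coffee", "shop", "is", "next", "to", "the", "bookstore"],
        ["the", "garage", "offers", "an", "oil", "change", "and", "lube", "service"],
    ]

    model = Word2Vec(
        sentences,
        sg=0,             # 0 = CBOW, 1 = skip-gram
        vector_size=300,  # vector length of 300, as suggested above
        window=5,
        min_count=1,
        epochs=100,       # tiny corpus, so iterate more
    )

    # cosine similarity between two words; high values suggest related words
    print(model.wv.similarity("pizza", "burgers"))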

Stanford NER lowercase entities

I am facing a problem detecting named entities that start with a lowercase letter. If I train the model with only lowercase words, then the accuracy is reasonable; however, when the model is trained with fully uppercase tokens, or even a mix of lowercase and uppercase, the results are very bad. I tried some of the features provided by the Stanford NLP Group's NERFeatureFactory class, as well as a variety of sentences, but I could not get the results that I expected.
An example of the problem I am facing is as follows:
"ali studied at university of michigan and now he works for us navy."
I expected the model to recognize the entities as follows:
"university" : "FACILITY",
"of michigan" : "FACILITY",
"ali" : "PERSON"
"us" : "ORGANIZATION"
"navy" : "ORGANIZATION"
If the .TSV file used as training data contains ONLY lowercase letters, then I can get the above result; otherwise the result is surprising.
Any help is highly appreciated.
If you have lowercase or mixed-case text, accuracy can suffer, as the Stanford NLP models are trained on standardly edited data. However, there are a couple of useful ways to approach this problem:
One way is to correctly capitalize the text with a true case annotator, and then process the resulting text with the regular NER model.
Another way is to explore caseless models including ones that are available as part of Stanford NER.
You can read more here.
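A rough sketch of the first suggestion (truecase, then run NER), assuming a Stanford CoreNLP server is already running locally on port 9000; the input is the example sentence from the question, and the 'truecase' annotator with truecase.overwriteText=true restores capitalization before the regular NER model runs:

    import json
    import requests

    text = "ali studied at university of michigan and now he works for us navy."

    props = {
        "annotators": "tokenize,ssplit,truecase,pos,lemma,ner",
        "truecase.overwriteText": "true",
        "outputFormat": "json",
    }

    # assumes a CoreNLP server started separately, e.g. on http://localhost:9000
    resp = requests.post(
        "http://localhost:9000",
        params={"properties": json.dumps(props)},
        data=text.encode("utf-8"),
    )

    for sentence in resp.json()["sentences"]:
        for token in sentence["tokens"]:
            if token["ner"] != "O":
                print(token["word"], token["ner"])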

Word sense disambiguation using WEKA

I've got a training dataset and a test dataset. How can I experiment and get results?
Can WEKA be used for this?
The topic is word sense disambiguation using a supervised support vector machine (SVM) learning approach.
The documents within both sets include the following file types:
1. 2 XML files
2. README file
3. SENSEMAP format
4. TRAIN format
5. KEY format
6. WORDS format
Machine learning approaches like SVM are not especially popular for word sense disambiguation.
Are you aware of Wikify? Mapping to Wikipedia can be considered a very fine-grained form of word-sense disambiguation.
To answer your question: in cases like these, any machine learning technique can give you the desired results. One should be more concerned with the features to extract, and with making sure the word features are distinctive enough to resolve the ambiguities at the level you chose. For example, in the sentence 'Wish you a very Happy Christmas', you might just want to disambiguate 'Happy Christmas' as either a book or a festival.
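As a sketch of that point about features, the snippet below trains a small SVM word-sense classifier whose features are simply the words in a context window around the ambiguous phrase. It uses scikit-learn rather than WEKA (an assumption, purely for illustration; WEKA's SMO classifier on ARFF data would play the same role), and the tiny training set and the two senses are made-up placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # context windows around the ambiguous phrase, with a made-up sense label each
    train_contexts = [
        "wish you a very happy christmas and a wonderful new year",
        "we decorated the tree before the happy christmas dinner",
        "she borrowed happy christmas from the library last week",
        "the novel happy christmas was reviewed in the paper",
    ]
    train_senses = ["festival", "festival", "book", "book"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_contexts)   # bag-of-words context features

    clf = LinearSVC()
    clf.fit(X, train_senses)

    test = ["happy christmas is my favourite holiday song"]
    print(clf.predict(vectorizer.transform(test)))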

Clustering phrases around a theme

I have encountered a very unusual problem. I have a set of phrases (noun phrases) extracted from a large corpus of documents. These phrases are 2 to 3 words long. They need to be clustered, because the number of extracted phrases is very large, and showing them as a simple list might not be useful for the user.
We are looking for very simple ways of clustering them. Is there a quick tool/software/method that I could use to cluster these so that all phrases inside a cluster belong to a particular theme/topic, if I fix the number of topics initially? I don't have any training set or any other clusters that I can use as a training set.
Topic classification is not an easy problem.
The conventional methods used to classify long documents (hundreds of words) are usually based on frequent words and are not suitable for very short texts. I believe that your problem is somewhat similar to tweet classification.
Two very interesting papers are:
Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia
(presented at HCI International 2011)
Eddi: Interactive Topic-based Browsing of Social Status Streams (presented at UIST'10)
If you want to include knowledge about the world so that, e.g., cat and dog will be clustered together, you can use WordNet's domains hierarchy.
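One quick baseline along these lines is to vectorize the phrases with TF-IDF and run k-means with a fixed number of clusters; the sketch below uses scikit-learn (an assumption), and the phrase list and cluster count are placeholders. The papers above describe richer, semantics-aware approaches:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    phrases = [
        "machine learning", "deep neural network", "stock market crash",
        "interest rate hike", "convolutional neural network", "bond yield curve",
    ]
    n_topics = 2   # fixed in advance, as the question suggests

    X = TfidfVectorizer().fit_transform(phrases)
    labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(X)

    for phrase, label in zip(phrases, labels):
        print(label, phrase)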
