Search through GPT-3's training data

I'm using GPT-3 for some experiments where I prompt the language model with tests from cognitive science. The tests have the form of short text snippets. Now I'd like to check whether GPT-3 has already encountered these text snippets during training. Hence my question: Is there any way to sift through GPT-3's training text corpora? Can one find out whether a certain string is part of these text corpora?
Thanks for your help!

I don't think that's possible, unfortunately. GPT-3's training corpora are private.
But if that were possible, it would be great for detecting plagiarism. Maybe ask it if it knows where a certain line of text came from?

Related

Text Classification - what can you do vs. what are your capabilities?

Text classification basically works on the input training sentences. Small variations in the sentences still work. But in a scenario like
What can you do <<==>> What are your capabilities
regular classification and bot-building platforms do not perform well.
Are there any classification approaches that would help me achieve this?
What you are trying to solve is called Semantic Textual Similarity, and it is a well-studied field.
There are many different ways to solve this, whether or not your data is tagged.
For example, Google has published the Universal Sentence Encoder (code example), which is intended to tell whether two sentences are similar, as in your case.
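As a minimal sketch of that approach (assuming TensorFlow and tensorflow_hub are installed; the module URL below is the public TF-Hub one):

```python
import numpy as np
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder from TF-Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["What can you do", "What are your capabilities"]
vectors = embed(sentences).numpy()

# Cosine similarity between the two sentence embeddings;
# values close to 1 suggest the sentences mean roughly the same thing.
cos_sim = np.dot(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)
print(f"similarity: {cos_sim:.3f}")
```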
Another example would be any solution you can find in the Quora Question Pairs Kaggle competition.
There are also datasets for this problem; for example, you can look at SemEval STS (STS stands for Semantic Textual Similarity) or the PAWS dataset.

Detecting questions in text

I have a project where I need to analyze a text to determine whether the user who posted it needs help with something. I tried sentiment analysis, but it didn't work as expected. My idea was to take the negative posts, extract the main words in each post, and suggest some articles about that subject to the user. If there is another approach that could help, please post it below. Thanks!
As for the dataset, I used one intended for sentiment analysis, but I have now found that it doesn't work for this, and I need a dataset suited to this task.
Apply standard NLP preprocessing before running the sentiment analysis. Use TF-IDF or Word2Vec to create vectors from the given dataset, and then try the sentiment analysis. You may also want to use GloVe vectors for the analysis.
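As a rough sketch of the TF-IDF route (the example posts and labels here are invented purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented dataset: 1 = the post asks for help, 0 = it does not.
posts = [
    "How do I fix this error? Nothing works and I need help",
    "Can anyone explain why my code keeps crashing?",
    "Just finished my project, really happy with the result",
    "Great weather today, going for a walk",
]
labels = [1, 1, 0, 0]

# TF-IDF vectors feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["please help me, my program keeps failing"]))
```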
For this topic, I found that the relevant field in machine learning is called "Natural Language Questions": models are trained to detect questions in text and suggest answers for them based on the dataset you are working with. Check this article for more detail.

Ensure the presence of a word/token/noun in Encoder-Decoder text generation deep learning models

I am stuck on a problem where I want to ensure that specific tokens/words are produced while decoding and generating abstractive-style sentences.
I am working with deep learning models such as LSTMs and Transformers to generate short sentences (100-200 characters). I want certain words, such as places or nouns (e.g., brand names), to be present in the generated texts.
I am not sure whether there has been any research on this; I couldn't find a paper even after an extensive search.
TIA, any leads or suggestions are appreciated. :)
I am not sure, but you can try to condition your output on those specific words. Your model can be like a seq2seq decoder, but instead of attending to the encoder outputs it can attend to those specific words.
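To sketch that idea (a toy illustration, not a tested recipe; all names are made up, and note that attending to keywords only encourages, rather than guarantees, their presence):

```python
import torch
import torch.nn as nn

class KeywordConditionedDecoder(nn.Module):
    """Toy decoder step that attends over embeddings of the required
    keywords instead of encoder outputs (all names here are invented)."""

    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(2 * emb_dim, hid_dim)
        self.query = nn.Linear(hid_dim, emb_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, state, keyword_ids):
        h, c = state
        kw = self.embed(keyword_ids)             # (K, emb_dim)
        # Dot-product attention of the hidden state over the keywords.
        scores = kw @ self.query(h).squeeze(0)   # (K,)
        weights = torch.softmax(scores, dim=0)
        context = weights @ kw                   # (emb_dim,)
        x = torch.cat([self.embed(prev_token).squeeze(0), context])
        h, c = self.rnn(x.unsqueeze(0), (h, c))
        return self.out(h), (h, c)               # logits, new state

# One decoding step with two required keyword ids.
dec = KeywordConditionedDecoder(vocab_size=1000)
state = (torch.zeros(1, 128), torch.zeros(1, 128))
logits, state = dec(torch.tensor([1]), state, torch.tensor([42, 99]))
```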

Do I need to provide sentences for training Spacy NER or are paragraphs fine?

I am trying to train a new spaCy model to recognize references to law articles. I start from a blank model and train the ner pipe according to the example given in the documentation.
The performance of the trained model is really poor, even with several thousand input points. I am trying to figure out why.
One possible answer is that I am training on full paragraphs instead of the sentences used in the examples. Each of these paragraphs can have multiple references to law articles. Is this a possible issue?
Turns out I was making a huge mistake in my code. There is nothing wrong with paragraphs, as long as your code actually supplies them to spaCy.
Paragraphs should be fine. Could you give an example input data point?
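For reference, a minimal spaCy v3 sketch that trains NER on a whole paragraph (the label, text, and character offsets are invented; real training would need far more examples):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("LAW_REF")  # invented label for law-article references

# One annotated paragraph with two entity spans (character offsets).
TRAIN_DATA = [
    ("The court relied on Article 6 of the ECHR. It also cited Article 8.",
     {"entities": [(20, 29, "LAW_REF"), (57, 66, "LAW_REF")]}),
]

optimizer = nlp.initialize()
for _ in range(20):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("The ruling referenced Article 6.")
print([(ent.text, ent.label_) for ent in doc.ents])
```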

Unguided speech to text conversion

I am trying to come up with a way to convert speech to text, and I am trying to use Sphinx for this. What I mean by unguided speech-to-text is that the speaker is not bound to a definite set of sentences; rather, he might speak any sentence. So it is not possible for me to have a grammar file where each word is one of the alternatives pre-written in the file. I understand that I would have to train Sphinx somehow to do this.
But I am a beginner with Sphinx. How do I start training Sphinx to convert unguided speech? Is it possible to achieve unguided conversion with Sphinx?
The task you are attempting is, as of right now, not yet possible to complete, at least not with satisfying accuracy.
As for the Sphinx-based solution: you will have to create a dictionary with all the words to be recognized. There is no other way.
Once you have the dictionary, you can generate a simple n-gram model based on it, with only unigrams: each unigram is one word. The probability of each may be the same, or you may attempt some statistical analysis of the words that will be used.
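To illustrate, a small sketch that writes such a uniform unigram model in the ARPA format that Sphinx can load (the helper and file names are hypothetical):

```python
import math

def write_uniform_unigram_lm(words, path):
    """Write a minimal ARPA-format language model in which every word
    (plus the sentence-boundary markers) has the same probability."""
    vocab = ["<s>", "</s>"] + sorted(set(words))
    logp = math.log10(1.0 / len(vocab))  # ARPA uses log10 probabilities
    with open(path, "w") as f:
        f.write("\\data\\\nngram 1=%d\n\n" % len(vocab))
        f.write("\\1-grams:\n")
        for word in vocab:
            f.write("%.4f\t%s\n" % (logp, word))
        f.write("\n\\end\\\n")

# Hypothetical word list; "unigram.lm" is a made-up output filename.
write_uniform_unigram_lm(["hello", "world", "sphinx"], "unigram.lm")
```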
