Customize OpenAI model: How to make sure answers are from customized data? - nlp

I'm using customized text with 'Prompt' and 'Completion' to train a new model.
Here's the tutorial I used to create a customized model from my data:
beta.openai.com/docs/guides/fine-tuning/advanced-usage
However, even after training the model and sending prompt text to it, I'm still getting generic results which are not always suitable for me.
How can I make sure completion results for my prompts come only from the text I used for the model and not from the generic OpenAI models?
Can I use some flags to eliminate results from generic models?

Wrong goal: OpenAI API should answer from the fine-tuning dataset if the prompt is similar to the one from the fine-tuning dataset
This is completely the wrong logic. Forget about fine-tuning. As stated on the official OpenAI website:
Fine-tuning lets you get more out of the models available through the API by providing:
Higher quality results than prompt design
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
Fine-tuning is not about answering with a specific answer from the fine-tuning dataset.
Fine-tuning helps the model gain more knowledge, but it has nothing to do with how the model answers. Why? The answer we get from the fine-tuned model is based on all knowledge (i.e., fine-tuned model knowledge = default knowledge + fine-tuning knowledge).
Although GPT-3 models have a lot of general knowledge, sometimes we want the model to answer with a specific answer (i.e., "fact").
Correct goal: Answer with a "fact" when asked about a "fact", otherwise answer with the OpenAI API
Note: For better (visual) understanding, the following code was run and tested in Jupyter.
STEP 1: Create a .csv file with "facts"
To keep things simple, let's add two companies (i.e., ABC and XYZ), each with some content. The content in our case will be a 1-sentence description of the company.
companies.csv
Run print_dataframe.ipynb to print the dataframe.
print_dataframe.ipynb
import pandas as pd
df = pd.read_csv('companies.csv')
df
We should get the following output:
STEP 2: Calculate an embedding vector for every "fact"
An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents (source).
Let's test the Embeddings endpoint first. Run get_embedding.ipynb with an input This is a test.
Note: In the case of Embeddings endpoint, the parameter prompt is called input.
get_embedding.ipynb
import openai
openai.api_key = '<OPENAI_API_KEY>'
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
print(get_embedding('text-embedding-ada-002', 'This is a test'))
We should get the following output:
What we see in the screenshot above is This is a test as an embedding vector. More precisely, we get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with a 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space which is very hard to imagine.
There are two things we need to understand at this point:
Why do we need to transform text into an embedding vector (i.e., numbers)? Because later on, we can compare embedding vectors and figure out how similar the two texts are. We can't compare texts as such.
Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.
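To make the comparison concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors and the measure used later in this answer. The 3-dimensional vectors are made up for illustration; real text-embedding-ada-002 vectors contain 1536 numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings)
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # points in the same direction as v1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

A value near 1 means the two vectors point in almost the same direction, which is why a higher score is read as "more similar".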
Now we can create an embedding vector for each "fact". Run get_all_embeddings.ipynb.
get_all_embeddings.ipynb
import openai
from openai.embeddings_utils import get_embedding
import pandas as pd
openai.api_key = '<OPENAI_API_KEY>'
df = pd.read_csv('companies.csv')
df['embedding'] = df['content'].apply(lambda x: get_embedding(x, engine = 'text-embedding-ada-002'))
df.to_csv('companies_embeddings.csv')
The code above will take the first company (i.e., x), get its 'content' (i.e., "fact") and apply the function get_embedding using the text-embedding-ada-002 model. It will save the embedding vector of the first company in a new column named 'embedding'. Then it will take the second company, the third company, the fourth company, etc. At the end, the code will automatically generate a new .csv file named companies_embeddings.csv.
Saving embedding vectors locally (i.e., in a .csv file) means we don't have to call the OpenAI API every time we need them. We calculate an embedding vector for a given "fact" once and that's it.
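As a sketch of the "calculate once, save locally" idea, the snippet below writes a made-up embedding to CSV text and reads it back. It stores the vector with json.dumps and parses it with json.loads, which is an alternative chosen for this illustration (an assumption, not what the notebooks in this answer do). The sentence, the vector values and the in-memory buffer are all placeholders.

```python
import csv
import io
import json

# Hypothetical "fact" with a tiny made-up embedding (real ones have 1536 values)
rows = [{"content": "ABC is a company.", "embedding": [0.1, -0.2, 0.3]}]

# Write: store the vector as a JSON string in the 'embedding' column
buffer = io.StringIO()  # stands in for a file such as companies_embeddings.csv
writer = csv.DictWriter(buffer, fieldnames=["content", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"content": row["content"],
                     "embedding": json.dumps(row["embedding"])})

# Read: parse the JSON string back into a list of floats
buffer.seek(0)
loaded = [{"content": r["content"], "embedding": json.loads(r["embedding"])}
          for r in csv.DictReader(buffer)]

print(loaded[0]["embedding"])
```

Because the column holds plain JSON, reading it back does not require evaluating the stored string as Python code.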
Run print_dataframe_embeddings.ipynb to print the dataframe with the new column named 'embedding'.
print_dataframe_embeddings.ipynb
import pandas as pd
import numpy as np
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df
We should get the following output:
STEP 3: Calculate an embedding vector for the input and compare it with embedding vectors from the companies_embeddings.csv using cosine similarity
We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. Actually, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc. Run get_cosine_similarity.ipynb.
get_cosine_similarity.ipynb
import openai
from openai.embeddings_utils import cosine_similarity
import numpy as np
import pandas as pd
openai.api_key = '<OPENAI_API_KEY>'
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'
def get_embedding(model: str, text: str) -> list[float]:
    # Use the function's own arguments rather than the globals
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
input_embedding_vector = get_embedding(my_model, my_input)
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
df
The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.
If my_input = 'Tell me something about company ABC':
If my_input = 'Tell me something about company XYZ':
If my_input = 'Tell me something about company Apple':
We can see that when we give Tell me something about company ABC as an input, it's the most similar to the first "fact". When we give Tell me something about company XYZ as an input, it's the most similar to the second "fact". Whereas if we give Tell me something about company Apple as an input, it is less similar to either of these two "facts" than the previous inputs were to their matching "fact".
STEP 4: Answer with the most similar "fact" if similarity is above our threshold, otherwise answer with the OpenAI API
Let's set our similarity threshold to >= 0.9. The code below should answer with the most similar "fact" if similarity is >= 0.9, otherwise answer with the OpenAI API. Run get_answer.ipynb.
get_answer.ipynb
# Imports
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np
# Insert your API key
openai.api_key = '<OPENAI_API_KEY>'
# Insert OpenAI text embedding model and input
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'
# Calculate embedding vector for the input using OpenAI Embeddings endpoint
def get_embedding(model: str, text: str) -> list[float]:
    # Use the function's own arguments rather than the globals
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
# Save embedding vector of the input
input_embedding_vector = get_embedding(my_model, my_input)
# Calculate similarity between the input and "facts" from companies_embeddings.csv file which we created before
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
# Find the highest similarity value in the dataframe column 'similarity'
highest_similarity = df['similarity'].max()
# If the highest similarity value is equal or higher than 0.9 then print the 'content' with the highest similarity
if highest_similarity >= 0.9:
    fact_with_highest_similarity = df.loc[df['similarity'] == highest_similarity, 'content']
    print(fact_with_highest_similarity)
# Else pass input to the OpenAI Completions endpoint
else:
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt = my_input,
        max_tokens = 30,
        temperature = 0
    )
    content = response['choices'][0]['text'].replace('\n', '')
    print(content)
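The decision rule in the code above can be exercised without calling the API. The following sketch uses made-up similarity scores and a hypothetical helper, pick_fact, to show the threshold logic in isolation: it returns the most similar "fact" when its score is at least the threshold, and None otherwise so that the caller can fall back to the Completions endpoint.

```python
def pick_fact(facts, similarities, threshold=0.9):
    """Return the fact with the highest similarity if it meets the threshold, else None."""
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    if similarities[best_index] >= threshold:
        return facts[best_index]
    return None

# Placeholder facts and made-up similarity scores (not real API output)
facts = ["fact about ABC", "fact about XYZ"]

print(pick_fact(facts, [0.93, 0.81]))  # fact about ABC
print(pick_fact(facts, [0.78, 0.77]))  # None
```

The right threshold depends on the data, so 0.9 here simply mirrors the value chosen in this answer.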
If my_input = 'Tell me something about company ABC' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company XYZ' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company Apple' and the threshold is >= 0.9 we should get the following answer from the OpenAI API:

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment sentence that goes along an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")
def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if(child.dep_ in ["nsubj"] and not child.is_stop): # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if(child.dep_ in ["acomp", "advcl"] and not child.is_stop): # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if(child.dep_ == "neg"): # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if (add_neg_pfx and M != "999999"):
            M = neg_prefix + " " + M
        if(A != "999999" and M != "999999"):
            rule3_pairs.append((A, M))
    return rule3_pairs
print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But this captures too little of the content of the text.
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recommend trying?
I tried the code given above. I also tried another library, stanza, which only gave similar results.
Unfortunately, if your task is to extract every expressive word from the text (every word that carries sentiment), that is not possible with the current state of the art. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. Words like "awful" are easy to classify as negative, but "demand" from your text is not as obvious. There are also edge cases where a seemingly positive word such as "incredible" reverses its sentiment when used as an intensifier: "incredibly stupid" should be classified as very negative, yet machines can normally only output one of two opposite labels for such words.
This is why, for sentiment analysis, the only reliable approach is to build a machine learning model that classifies whole texts, which means you should adapt your software to accept that final verdict and process it in some way.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier (a general-purpose classifier that is not limited to sentiment), which is implemented in NLTK:
from nltk import NaiveBayesClassifier, classify
#The training data is a list of (feature dictionary, label) pairs; dataset is assumed to be prepared beforehand.
train_data = dataset[:7000]
test_data = dataset[7000:]
#Train method returns the trained model.
classifier = NaiveBayesClassifier.train(train_data)
#To get accuracy, use classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
In order to make a prediction, we need to pass a list of words. It's preferable to remove any words that carry no sentiment, such as stop words and punctuation, so that they don't disturb our model:
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def clearLexemes(words):
    #Keep only tokens that are neither stop words nor punctuation
    stop_words = set(stopwords.words("english"))
    return [word for word in words
            if word.lower() not in stop_words and word not in punctuation]
text = "What a terrible day!"
tokens = clearLexemes(word_tokenize(text))
print("Text sentiment is " + str(classifier.classify(dict([token, True] for token in tokens))))
The output will be the sentiment of the text.
Important notes about this approach:
it requires few parameters to train and trains relatively fast;
it is highly efficient for working with natural languages (it is also used for gender identification and named entity recognition);
it is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example: "Sweetheart, I wish all of your fears would come true and you will be happy to live in such a world!" This sentence is negative and uses irony to mask a negative meaning behind positive expressions, and the model may not be able to detect this.
Logistic Regression
Another related method is to use a logistic regression algorithm from your favourite machine learning framework. In this notebook I used the Amazon food review dataset to measure how fast model accuracy increases as you feed it more and more data. The data you need to feed the model is the raw text and its score label (which in your case could be the sentiment).
import numpy as np #Used to cast each summary to a string
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report
#Preparing the data (reviews is assumed to be the dataframe loaded beforehand)
ys: pd.DataFrame = reviews.head(170536) #30% of the dataframe is test data
xs: pd.DataFrame = reviews[170536:] #70% of the dataframe is training data
#Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])
#Measuring accuracy:
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
#classification_report expects the true labels first, then the predictions
report = classification_report(ys["Score"], predictions,
                               target_names = labels, output_dict=True)
#Per-class precision
accuracy = [report[label]["precision"] for label in labels]
print(accuracy)
Conclusion
Sentiment analysis is a worthwhile area of academic and industrial research that relies entirely on machine learning and is bound by its limitations. It is a powerful topic that should be covered in the classical NLP suite. Unfortunately, understanding meaning closely enough to extract situational meaning is currently a feat close to inventing Artificial General Intelligence, although technology is rapidly moving in that direction.

Using pretrained word2vector model

I am trying to use a pretrained word2vec model to create word embeddings, but I am getting the following error when I try to create a weight matrix from the word2vec gensim model:
Code:
import gensim
import numpy as np
w2v_model = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz", binary=True)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
EMBEDDING_DIM=300
# Function to create weight matrix from word2vec gensim model
def get_weight_matrix(model, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = model[word]
    return weight_matrix
embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)
I'm getting the following error:
Error
As a note: it's better to paste the full error as formatted text than as an image of text. (See Why not upload images of code/errors when asking a question? for a full list of the reasons why.)
But regarding your question:
If you get a KeyError: word 'didnt' not in vocabulary error, you can trust that the word you've requested is not in the set-of-word-vectors you've requested it from. (In this case, the GoogleNews vectors that Google trained & released back around 2012.)
You could check before looking it up ('didnt' in w2v_model would return False) and then do something else. Or you could use a Python try: ... except: ... formulation to let it happen, but then do something else when it happens.
But it's up to you what your code should do if the model you've provided doesn't have the word-vectors you were hoping for.
(Note: the GoogleNews vectors do include a vector for "didn't", the contraction with its internal apostrophe. So in this one case, the issue may be that your tokenization is stripping such internal-punctuation-marks from contractions, but Google chose not to when making those vectors. But your code should be ready for handling missing words in any case, unless you're sure through other steps that can never happen.)
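As one possible way to handle missing words, the sketch below adapts the weight-matrix function from the question so that it checks membership before each lookup and leaves a row of zeros for any word it cannot find. The plain dict standing in for the KeyedVectors model, the hand-written word index and the 3-dimensional vectors are assumptions made so that the example runs without downloading the GoogleNews vectors. Returning the list of missing words is also an addition for illustration.

```python
import numpy as np

EMBEDDING_DIM = 3  # tiny for illustration; the GoogleNews vectors use 300

def get_weight_matrix(model, vocab):
    """Build an embedding matrix, leaving zero rows for out-of-vocabulary words."""
    weight_matrix = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
    missing = []
    for word, i in vocab.items():
        if word in model:           # membership check before the lookup
            weight_matrix[i] = model[word]
        else:
            missing.append(word)    # row i stays all zeros
    return weight_matrix, missing

# Stand-ins (assumptions): a dict instead of KeyedVectors, a hand-written word index
fake_vectors = {"hello": np.array([0.1, 0.2, 0.3]),
                "didn't": np.array([0.4, 0.5, 0.6])}
word_index = {"hello": 1, "didnt": 2}

matrix, missing = get_weight_matrix(fake_vectors, word_index)
print(missing)  # ['didnt']
```

Whether a zero row is the right fallback is a design choice; skipping the word or mapping it to a shared "unknown" vector are other options.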

how to use tokens with sklearn in LDA

I have a list of tokenized documents, containing both unigrams and bigrams, and I would like to perform sklearn LDA on it. I have tried the following code:
my_data =[['low-rank matrix','detection method','problem finding'],['probabilistic inference','problem finding','statistical learning','solution' ],['detection method','probabilistic inference','population','language']...]
tf_vectorizer = CountVectorizer(min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(my_data)
lda = LatentDirichletAllocation(n_topics=3, max_iter=5,random_state=10)
But when I print the output I get something like this:
topic 0:
detection,finding, solution ,method,problem
topic 1:
language, statistical , problem, learning,finding
and so on..
The bigrams are broken up and separated from one another. I have 10,000 documents and have already tokenized them. The method for finding the bigrams is not NLTK-based, so that step is already done.
Is there any way to improve this without changing the input?
I am very new to sklearn, so apologies in advance if I am making some obvious mistake.
CountVectorizer has an ngram_range param which decides whether the vocabulary will contain unigrams, bigrams, trigrams, etc.:
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
For example:
ngram_range=(1,1) => Will include only unigrams
ngram_range=(1,2) => Will include unigrams and bigrams
ngram_range=(2,2) => Will include only bigrams
and so on...
You have not defined that, so default ngram_range=(1,1) and hence only unigrams are used here.
tf_vectorizer = CountVectorizer(min_df=2,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range = (2,2)) # You need this
tf = tf_vectorizer.fit_transform(my_data)
Secondly, you say that you have already tokenized the data, and your code shows a list of lists (my_data). That doesn't work with CountVectorizer by default. It expects a simple list of strings and applies its own tokenization to them, so you will need to supply your own preprocessing steps instead. See the other params 'preprocessor', 'tokenizer' and 'analyzer' in the linked documentation.
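One way to keep already-tokenized documents intact is to pass a callable as the analyzer param, so that CountVectorizer counts the tokens it is given instead of re-tokenizing them. The following is a minimal sketch under that assumption, using a shortened version of the data from the question. Note that options such as stop_words and ngram_range are not applied when the analyzer is a callable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened version of the pre-tokenized documents from the question
my_data = [
    ['low-rank matrix', 'detection method', 'problem finding'],
    ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'],
    ['detection method', 'probabilistic inference', 'population', 'language'],
]

# A callable analyzer receives each document (here, a list of tokens) and
# returns the features to count, so multi-word tokens pass through untouched.
tf_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, min_df=2)
tf = tf_vectorizer.fit_transform(my_data)

print(sorted(tf_vectorizer.vocabulary_))
```

With min_df=2, only the tokens that occur in at least two documents remain in the vocabulary, and each bigram stays a single feature. The resulting matrix can then be passed to LatentDirichletAllocation as before.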

Doc2Vec.infer_vector keeps giving different result everytime on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for the given query, and every time I run it, I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run the piece of code, I get a different set of documents matching this query: "only you can prevent forest fires". The difference is stark and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking at the code, infer_vector uses parts of the algorithm that are non-deterministic. The initialization of the vector is deterministic (see the code of seeded_vector), but later steps, such as the random sampling of words and negative sampling (which updates only a sample of word vectors per iteration), can produce non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)
a = list('test sample')
b = list('testtesttest')
for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file and reuse it later in a script. However, there are certain high-cardinality fields which are categorical in nature and are translated to a one-hot matrix representation, which creates around 5000 features. The input that I get for predict will have only one of these features set, and all the rest will be zeroes. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
    rate: 10, // numeric
    flagged: 0, // binary
    host: 'somehost.com' // keeping this categorical
}
host can have around 5000 different values. Now I loaded the file to a pandas dataframe, used the get_dummies function to transform the host field to around 5000 new fields which are binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
    rate: 5,
    flagged: 1,
    host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
    rate: 5,
    flagged: 1,
    new_host: 1
}
But if I translate it to vector format, I don't know which index to place the new_host field. Also, I don't know in advance what other hosts are (unless I store it somewhere during the training phase)
I hope I am making some sense. Let me know if I am doing it the wrong way.
I don't know which index to place the new_host field
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training (mf is a file object opened in binary mode)
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)
# in production
model = pickle.load(mf)
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input is the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
    ('host', sklearn.preprocessing.LabelBinarizer()),
    # a list selector passes a 2-D column, which StandardScaler expects
    (['rate'], sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x_cols]
y = df[y_cols]
pipl.fit(X, y)
Note that x_cols and y_cols are the lists of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
# f is a file object opened in binary write mode
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
                         one_hot_data], axis=1)
clf.predict(X_test)
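For comparison, here is a sketch of the same idea using scikit-learn's ColumnTransformer and OneHotEncoder inside a single Pipeline. This is an alternative, not the method from either answer above. The encoder remembers the hosts seen during training, and handle_unknown='ignore' encodes an unseen host as all zeros, so the raw record can be passed straight to predict. The training rows and the target values are made up, since the question does not specify a target.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows using the field names from the question
train = pd.DataFrame({
    'rate': [10, 5, 7, 1],
    'flagged': [0, 1, 0, 1],
    'host': ['somehost.com', 'a.com', 'somehost.com', 'b.com'],
})
y_train = [1, 0, 1, 0]  # hypothetical target values

preprocess = ColumnTransformer([
    # hosts not seen during fit become an all-zero one-hot row at predict time
    ('host', OneHotEncoder(handle_unknown='ignore'), ['host']),
    ('rate', StandardScaler(), ['rate']),
], remainder='passthrough')  # 'flagged' is passed through unchanged

pipl = Pipeline(steps=[('preprocess', preprocess),
                       ('clf', SGDClassifier(random_state=0))])
pipl.fit(train, y_train)

# Round-trip through pickle, then predict on a raw record with an unseen host
restored = pickle.loads(pickle.dumps(pipl))
new_row = pd.DataFrame([{'rate': 5, 'flagged': 1, 'host': 'new_host.com'}])
print(restored.predict(new_row))
```

Because the fitted encoder is pickled together with the classifier, the column positions of the one-hot features never need to be stored or looked up separately.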

Resources