word frequency with TfidfVectorizer - python-3.x

I'm trying to calculate the word frequency for a messaging dataframe using TF-IDF. So far I have this
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
new_group['tokenized_sents'] = new_group.apply(lambda row: nltk.word_tokenize(row['message']),axis=1).astype(str).lower()
vectoriser=TfidfVectorizer()
new_group['tokenized_vector'] = list(vectoriser.fit_transform(new_group['tokenized_sents']).toarray())
However, with the code above I get a bunch of zeros instead of the word frequencies. How can I fix this to get the correct frequency counts for the messages? This is my dataframe:
user_id    date        message     tokenized_sents      tokenized_vector
X35WQ0U8S  2019-02-17  Need help   ['need','help']      [0.0,0.0]
X36WDMT2J  2019-03-22  Thank you!  ['thank','you','!']  [0.0,0.0,0.0]

First of all, for counts you don't want to use TfidfVectorizer, because its output is weighted and normalized. You want to use CountVectorizer. Second, you don't need to tokenize the words yourself, as sklearn has a built-in tokenizer in both TfidfVectorizer and CountVectorizer.
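from sklearn.feature_extraction.text import CountVectorizer
# add whatever settings you want
countVec = CountVectorizer()
# fit and transform the raw messages (df is the dataframe holding the 'message' column)
cv = countVec.fit_transform(df['message'].str.lower())
# feature names (get_feature_names_out() in scikit-learn >= 1.0; older versions use get_feature_names())
cv_feature_names = countVec.get_feature_names_out()
# total count of each feature across all messages
feature_count = cv.toarray().sum(axis=0)
# map each feature name to its count
dict(zip(cv_feature_names, feature_count))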

Related

Customize OpenAI model: How to make sure answers are from customized data?

I'm using customized text with 'Prompt' and 'Completion' to train a new model.
Here's the tutorial I used to create a customized model from my data:
beta.openai.com/docs/guides/fine-tuning/advanced-usage
However, even after training the model and sending prompt text to it, I'm still getting generic results which are not always suitable for me.
How can I make sure the completion results for my prompts come only from the text I used for the model and not from the generic OpenAI models?
Can I use some flags to eliminate results from generic models?
Wrong goal: OpenAI API should answer from the fine-tuning dataset if the prompt is similar to the one from the fine-tuning dataset
That is the wrong approach. Forget about fine-tuning. As stated on the official OpenAI website:
Fine-tuning lets you get more out of the models available through the
API by providing:
Higher quality results than prompt design
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
Fine-tuning is not about answering with a specific answer from the fine-tuning dataset.
Fine-tuning helps the model gain more knowledge, but it has nothing to do with how the model answers. Why? The answer we get from the fine-tuned model is based on all knowledge (i.e., fine-tuned model knowledge = default knowledge + fine-tuning knowledge).
Although GPT-3 models have a lot of general knowledge, sometimes we want the model to answer with a specific answer (i.e., "fact").
Correct goal: Answer with a "fact" when asked about a "fact", otherwise answer with the OpenAI API
Note: For better (visual) understanding, the following code was run and tested in Jupyter.
STEP 1: Create a .csv file with "facts"
To keep things simple, let's add two companies (i.e., ABC and XYZ), each with some content. The content in our case will be a one-sentence description of the company.
companies.csv
Run print_dataframe.ipynb to print the dataframe.
print_dataframe.ipynb
import pandas as pd
df = pd.read_csv('companies.csv')
df
We should get the following output:
STEP 2: Calculate an embedding vector for every "fact"
An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents (source).
Let's test the Embeddings endpoint first. Run get_embedding.ipynb with an input This is a test.
Note: In the case of Embeddings endpoint, the parameter prompt is called input.
get_embedding.ipynb
import openai
openai.api_key = '<OPENAI_API_KEY>'
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
print(get_embedding('text-embedding-ada-002', 'This is a test'))
We should get the following output:
What we see in the screenshot above is This is a test as an embedding vector. More precisely, we get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with a 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space which is very hard to imagine.
There are two things we need to understand at this point:
Why do we need to transform text into an embedding vector (i.e., numbers)? Because later on, we can compare embedding vectors and figure out how similar the two texts are. We can't compare texts as such.
Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.
Now we can create an embedding vector for each "fact". Run get_all_embeddings.ipynb.
get_all_embeddings.ipynb
import openai
from openai.embeddings_utils import get_embedding
import pandas as pd
openai.api_key = '<OPENAI_API_KEY>'
df = pd.read_csv('companies.csv')
df['embedding'] = df['content'].apply(lambda x: get_embedding(x, engine = 'text-embedding-ada-002'))
df.to_csv('companies_embeddings.csv')
The code above will take the first company (i.e., x), get its 'content' (i.e., "fact") and apply the function get_embedding using the text-embedding-ada-002 model. It will save the embedding vector of the first company in a new column named 'embedding'. Then it will take the second company, the third company, the fourth company, etc. At the end, the code will automatically generate a new .csv file named companies_embeddings.csv.
Saving embedding vectors locally (i.e., in a .csv file) means we don't have to call the OpenAI API every time we need them. We calculate an embedding vector for a given "fact" once and that's it.
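As a rough illustration of that save-and-reload round trip, the sketch below writes a made-up three-number vector to CSV text and parses it back. The data and column names are hypothetical. It uses ast.literal_eval to turn the stored string back into a list; that is a swapped-in alternative to the eval call used in the notebooks of this answer, chosen because literal_eval only accepts Python literals.

```python
import ast
import csv
import io

# Hypothetical toy data: real embeddings from the model have 1536 numbers, not 3.
rows = [{"content": "ABC makes widgets.", "embedding": [0.1, -0.2, 0.3]}]

# Write: the list is stored as its string form, e.g. "[0.1, -0.2, 0.3]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["content", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"content": row["content"], "embedding": str(row["embedding"])})

# Read: every CSV cell comes back as a string, so parse it into a list again.
buffer.seek(0)
loaded = [
    {"content": r["content"], "embedding": ast.literal_eval(r["embedding"])}
    for r in csv.DictReader(buffer)
]
print(loaded[0]["embedding"])
```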
Run print_dataframe_embeddings.ipynb to print the dataframe with the new column named 'embedding'.
print_dataframe_embeddings.ipynb
import pandas as pd
import numpy as np
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df
We should get the following output:
STEP 3: Calculate an embedding vector for the input and compare it with embedding vectors from the companies_embeddings.csv using cosine similarity
We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. Actually, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc. Run get_cosine_similarity.ipynb.
get_cosine_similarity.ipynb
import openai
from openai.embeddings_utils import cosine_similarity
import numpy as np
import pandas as pd
openai.api_key = '<OPENAI_API_KEY>'
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'
def get_embedding(model: str, text: str) -> list[float]:
    # Use the function's own parameters rather than the globals
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
input_embedding_vector = get_embedding(my_model, my_input)
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
df
The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.
If my_input = 'Tell me something about company ABC':
If my_input = 'Tell me something about company XYZ':
If my_input = 'Tell me something about company Apple':
We can see that when we give Tell me something about company ABC as an input, it's the most similar to the first "fact". When we give Tell me something about company XYZ as an input, it's the most similar to the second "fact". Whereas if we give Tell me something about company Apple as an input, it isn't particularly similar to either of these two "facts".
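For intuition about what the similarity column holds, here is a minimal sketch of the standard cosine similarity formula on made-up 3-dimensional vectors. It is a plain re-implementation for illustration, not the code of openai.embeddings_utils, and the vectors are hypothetical stand-ins for real 1536-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" of two facts and one input
fact_abc = [1.0, 0.0, 0.0]
fact_xyz = [0.0, 1.0, 0.0]
query = [0.9, 0.1, 0.0]

sim_abc = cosine_similarity(query, fact_abc)
sim_xyz = cosine_similarity(query, fact_xyz)
print(sim_abc, sim_xyz)
```

The input vector points almost in the same direction as the first fact, so sim_abc comes out much larger than sim_xyz.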
STEP 4: Answer with the most similar "fact" if similarity is above our threshold, otherwise answer with the OpenAI API
Let's set our similarity threshold to >= 0.9. The code below should answer with the most similar "fact" if similarity is >= 0.9, otherwise answer with the OpenAI API. Run get_answer.ipynb.
get_answer.ipynb
# Imports
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np
# Insert your API key
openai.api_key = '<OPENAI_API_KEY>'
# Insert OpenAI text embedding model and input
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'
# Calculate embedding vector for the input using OpenAI Embeddings endpoint
def get_embedding(model: str, text: str) -> list[float]:
    # Use the function's own parameters rather than the globals
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
# Save embedding vector of the input
input_embedding_vector = get_embedding(my_model, my_input)
# Calculate similarity between the input and "facts" from companies_embeddings.csv file which we created before
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
# Find the highest similarity value in the dataframe column 'similarity'
highest_similarity = df['similarity'].max()
# If the highest similarity value is equal or higher than 0.9 then print the 'content' with the highest similarity
if highest_similarity >= 0.9:
    fact_with_highest_similarity = df.loc[df['similarity'] == highest_similarity, 'content']
    print(fact_with_highest_similarity)
# Else pass input to the OpenAI Completions endpoint
else:
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt = my_input,
        max_tokens = 30,
        temperature = 0
    )
    content = response['choices'][0]['text'].replace('\n', '')
    print(content)
If my_input = 'Tell me something about company ABC' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company XYZ' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company Apple' and the threshold is >= 0.9 we should get the following answer from the OpenAI API:
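The decision logic of STEP 4 can also be tried without any API calls. The sketch below is only an illustration under stated assumptions: choose_answer is a hypothetical helper, the facts and similarity scores are made up, and the fallback callable stands in for the call to the Completions endpoint.

```python
def choose_answer(similarities, facts, threshold=0.9, fallback=None):
    """Return the best-matching fact, or the fallback's result if no fact clears the threshold."""
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    if similarities[best_index] >= threshold:
        return facts[best_index]
    # In the answer above, this is where the Completions endpoint is called.
    return fallback() if fallback is not None else None


# Hypothetical facts and similarity scores, for illustration only.
facts = ["ABC is a made-up widget company.", "XYZ is a made-up gadget company."]

close_match = choose_answer([0.93, 0.81], facts)
no_match = choose_answer([0.78, 0.77], facts, fallback=lambda: "generic model answer")
print(close_match)
print(no_match)
```

The threshold of 0.9 is the value chosen in this answer; a suitable value depends on your own facts and typical inputs, so it is worth checking against a few real queries.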

How to calculate TF-IDF values of noun documents excluding spaCy stop words?

I have a data frame, df with text, cleaned_text, and nouns as column names. text and cleaned_text contains string document, nouns is a list of nouns extracted from cleaned_text column. df.shape = (1927, 3).
I am trying to calculate TF-IDF values for all documents within df only for nouns, excluding spaCy stopwords.
What I have tried?
import spacy
from spacy.lang.en import English
nlp = spacy.load('en_core_web_sm')
# subclass to modify stop word lists recommended from spaCy version 3.0 onwards
excluded_stop_words = {'down'}
included_stop_words = {'dear', 'regards'}
class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words.copy()
    stop_words -= excluded_stop_words
    stop_words |= included_stop_words
class CustomEnglish(English):
    Defaults = CustomEnglishDefaults
# function to extract nouns from cleaned_text column, excluding spaCy stop words.
nlp = CustomEnglish()
def nouns(text):
    doc = nlp(text)
    return [t for t in doc if t.pos_ in ['NOUN'] and not t.is_stop and not t.is_punct]
# calculate TF-IDF values for nouns, excluding spaCy stopwords.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = df.cleaned_text
tfidf = TfidfVectorizer(stop_words=CustomEnglish)
X = tfidf.fit_transform(documents)
What I am expecting?
I am expecting to have an output as a list of tuples ranked in descending order;
nouns = [('noun_1', tf-idf_1), ('noun_2', tf-idf_2), ...]. All nouns in nouns should match those of df.nouns (this is to check whether I am on the right way).
What is my issue?
I am confused about how to apply TfidfVectorizer so that it calculates TF-IDF values only for the nouns extracted from cleaned_text. I am also not sure whether scikit-learn's TfidfVectorizer can calculate TF-IDF the way I am expecting.
Not sure if you're still looking for a solution. Here is an option you might want to go with.
First of all, by default TF-IDF takes into account the entire set of words, not just nouns. Hence, you would need to implement a custom TF-IDF function that computes values only for nouns. The following is a good reference on how TF-IDF works internally: https://www.askpython.com/python/examples/tf-idf-model-from-scratch
Instead of running the tf_idf function (as used at the above URL) for all words of a sentence/document, you can run it only on the list of nouns you've extracted, i.e., change the code from:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
        value = tf*idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec
to:
def tf_idf(sentence, nouns):
    values = []
    for word in nouns:
        tf = termfreq(sentence,word)
        idf = inverse_doc_freq(word)
        value = tf*idf
        values.append(value)
    # tf_idf_vec is no longer built here, so return only the per-noun values
    return values
You now have a "values" list corresponding to the list of "nouns" for each sentence. Hope this makes sense.
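For a version that runs on its own, here is a minimal sketch that works directly on the per-document noun lists and returns ranked (noun, score) tuples as asked in the question. The function name and sample data are hypothetical. It assumes tf = count / number of nouns in the document and idf = log(N / document frequency); scikit-learn and the linked article use slightly different smoothed variants, so the exact numbers will differ from theirs.

```python
import math


def tf_idf_for_nouns(noun_docs):
    """noun_docs: one list of nouns per document.

    Returns, per document, a list of (noun, score) tuples sorted by score, highest first.
    """
    n_docs = len(noun_docs)

    # Document frequency: in how many documents does each noun occur?
    doc_freq = {}
    for nouns in noun_docs:
        for noun in set(nouns):
            doc_freq[noun] = doc_freq.get(noun, 0) + 1

    ranked = []
    for nouns in noun_docs:
        if not nouns:  # a document without nouns has nothing to score
            ranked.append([])
            continue
        scores = {}
        for noun in set(nouns):
            tf = nouns.count(noun) / len(nouns)
            idf = math.log(n_docs / doc_freq[noun])
            scores[noun] = tf * idf
        ranked.append(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
    return ranked


# Hypothetical noun lists, e.g. taken from df.nouns after converting tokens to strings
noun_docs = [["invoice", "payment", "invoice"], ["payment", "refund"]]
result = tf_idf_for_nouns(noun_docs)
print(result)
```

A noun that appears in every document (payment here) gets an idf of zero under this formula, which is why it ranks last in both documents.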

Cannot reproduce pre-trained word vectors from its vector_ngrams

Just out of curiosity, I was debugging gensim's FastText code to replicate its handling of Out-of-Vocabulary (OOV) words, and I haven't been able to do it.
So, the process I'm following is training a tiny model with a toy corpus, and then comparing the resulting vectors of a word in the vocabulary. That means if the whole process is OK, the output arrays should be the same.
Here is the code I've used for the test:
from gensim.models import FastText
import numpy as np
# Default gensim's function for hashing ngrams
from gensim.models._utils_any2vec import ft_hash_bytes
# Toy corpus
sentences = [['hello', 'test', 'hello', 'greeting'],
             ['hey', 'hello', 'another', 'test']]
# Instantiate gensim's FastText class
ft = FastText(sg=1, size=5, min_count=1,
              window=2, hs=0, negative=20,
              seed=0, workers=1, bucket=100,
              min_n=3, max_n=4)
# Build vocab
ft.build_vocab(sentences)
# Fit model weights (vectors_ngram)
ft.train(sentences=sentences, total_examples=ft.corpus_count, epochs=5)
# Save model
ft.save('./ft.model')
del ft
# Load model
ft = FastText.load('./ft.model')
# Generate ngrams for test-word given min_n=3 and max_n=4
encoded_ngrams = [b"<he", b"<hel", b"hel", b"hell", b"ell", b"ello", b"llo", b"llo>", b"lo>"]
# Hash ngrams to its corresponding index, just as Gensim does
ngram_hashes = [ft_hash_bytes(n) % 100 for n in encoded_ngrams]
word_vec = np.zeros(5, dtype=np.float32)
for nh in ngram_hashes:
    word_vec += ft.wv.vectors_ngrams[nh]
# Compare both arrays
print(np.isclose(ft.wv['hello'], word_vec))
The output of this script is False for every dimension of the compared arrays.
It would be nice if someone could point out whether I'm missing something or doing something wrong. Thanks in advance!
The calculation of a full word's FastText word-vector is not just the sum of its character n-gram vectors; it also includes a raw full-word vector that is trained for in-vocabulary words.
The full-word vectors you get back from ft.wv[word] for known-words have already had this combination pre-calculated. See the adjust_vectors() method for an example of this full calculation:
https://github.com/RaRe-Technologies/gensim/blob/68ec5b8ed7f18e75e0b13689f4da53405ef3ed96/gensim/models/keyedvectors.py#L2282
The raw full-word vectors are in a .vectors_vocab array on the model.wv object.
(If this isn't enough to reconcile matters: ensure you're using the latest gensim, as there have been many recent FT fixes. And, ensure your list of ngram-hashes matches the output of the ft_ngram_hashes() method of the library – if not, your manual ngram-list-creation and subsequent hashing may be doing something different.)
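To make the combination concrete, here is a small NumPy sketch using random stand-in arrays instead of a trained model. It assumes the scheme used by adjust_vectors() at the commit linked above, as far as I can tell from that source: add the raw full-word vector to the n-gram vectors, then divide by the number of n-grams plus one. The function name is hypothetical, and other gensim versions may differ, so check the source for the version you run.

```python
import numpy as np


def compose_word_vector(raw_word_vec, ngram_vecs):
    """Average of the raw full-word vector and all of the word's n-gram vectors."""
    total = raw_word_vec + ngram_vecs.sum(axis=0)
    return total / (len(ngram_vecs) + 1)


rng = np.random.default_rng(0)
# Stand-in for ft.wv.vectors_vocab[index_of_word]
vectors_vocab_row = rng.normal(size=5).astype(np.float32)
# Stand-in for ft.wv.vectors_ngrams[ngram_hashes] (9 n-grams for 'hello')
ngram_rows = rng.normal(size=(9, 5)).astype(np.float32)

full = compose_word_vector(vectors_vocab_row, ngram_rows)
ngram_sum_only = ngram_rows.sum(axis=0)

# The plain n-gram sum from the question omits the raw vector and the division
print(np.isclose(full, ngram_sum_only))
```

In the question's script, that would mean starting word_vec from the word's row of ft.wv.vectors_vocab instead of zeros and dividing by len(ngram_hashes) + 1 after the loop.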

how to use tokens with sklearn in LDA

I have a list of tokenized documents containing both unigrams and bigrams, and I would like to run sklearn's LDA on it. I have tried the following code:
my_data =[['low-rank matrix','detection method','problem finding'],['probabilistic inference','problem finding','statistical learning','solution' ],['detection method','probabilistic inference','population','language']...]
tf_vectorizer = CountVectorizer(min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(my_data)
lda = LatentDirichletAllocation(n_topics=3, max_iter=5,random_state=10)
But when I print the output I get something like this:
topic 0:
detection,finding, solution ,method,problem
topic 1:
language, statistical , problem, learning,finding
and so on..
The bigrams are broken up and separated from one another. I have 10,000 documents that are already tokenized, and my method for finding bigrams is not nltk-based, so that step is already done.
Is there any way to fix this without changing the input?
I am very new to sklearn, so apologies in advance if I am making some obvious mistake.
CountVectorizer has an ngram_range param which decides whether the vocabulary will contain unigrams, bigrams, trigrams, etc.:
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the
range of n-values for different n-grams to be extracted. All values of
n such that min_n <= n <= max_n will be used.
For example:
ngram_range=(1,1) => Will include only unigrams
ngram_range=(1,2) => Will include unigrams and bigrams
ngram_range=(2,2) => Will include only bigrams
and so on...
You have not defined that, so the default ngram_range=(1,1) applies and hence only unigrams are used here.
tf_vectorizer = CountVectorizer(min_df=2,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range = (2,2)) # You need this
tf = tf_vectorizer.fit_transform(my_data)
Secondly, you say that you have already tokenized the data, and your code shows a list of lists (my_data). That doesn't work with CountVectorizer's default settings, which expect a simple list of strings and tokenize them automatically. To keep your own tokens, you need to supply your own processing steps. See the 'preprocessor', 'tokenizer' and 'analyzer' params in the linked documentation.

What is the use of 'max_features' in TfidfVectorizer

What I have understood is that if max_features = n, it selects the top n features on the basis of their Tf-Idf value. I went through the documentation of TfidfVectorizer on scikit-learn but didn't understand it properly.
If you want row-wise words which have the highest tfidf values, then you need to access the transformed tf-idf matrix from Vectorizer, access it row by row (doc by doc) and then sort the values to get those.
Something like this:
import numpy as np
# Assumes tfidf_vectorizer is a TfidfVectorizer and text_data is a list of strings
# TfidfVectorizer will by default output a sparse matrix
tfidf_data = tfidf_vectorizer.fit_transform(text_data).tocsr()
# get_feature_names_out() in scikit-learn >= 1.0; older versions use get_feature_names()
vocab = np.array(tfidf_vectorizer.get_feature_names_out())
# Replace this with the number of top words you want to get in each row
top_n_words = 5
# Loop all the docs present
for i in range(tfidf_data.shape[0]):
    doc = tfidf_data.getrow(i).toarray().ravel()
    sorted_index = np.argsort(doc)[::-1][:top_n_words]
    print(sorted_index)
    for word, tfidf in zip(vocab[sorted_index], doc[sorted_index]):
        print("%s - %f" %(word, tfidf))
If you can use pandas, then the logic becomes simpler:
import pandas as pd
for i in range(tfidf_data.shape[0]):
    doc_data = pd.DataFrame({'Tfidf': tfidf_data.getrow(i).toarray().ravel(),
                             'Word': vocab})
    doc_data.sort_values(by='Tfidf', ascending=False, inplace=True)
    print(doc_data.iloc[:top_n_words])
