How to iterate TfidfVectorizer() on pandas dataframe - python-3.x

I have a large pandas dataframe with 10 million records of news articles. So, this is how I have applied TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(df['articles'])
It took alot of time to process all documents. All I wants to iterate each article in dataframe one at a time or is it possible that I can pass documents in chunks and it keep updating existing vocabulary without overwriting old dictionary of vocabulary?
I have gone through this SO post but not exactly getting how to applied it on pandas. I have also heard about Python generators but not exactly whether its useful here.

You can iterate in chunks as below. The solution has been adapted from here
def ChunkIterator():
for chunk in pd.read_csv(csvfilename, chunksize=1000):
for doc in chunk['articles'].values:
yield doc
corpus = ChunkIterator()
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(corpus)

Related

Map BERTopic topic IDs back to the training dataframe

I have trained a BERTopic model on a dataframe of length of 400k. I want to map the topics of each document in a new column inside the dataframe. I could do that by running a for loop on all the documents and do topic_model.transform(doc) on them. The only problem is, it takes more than a second to transform each document into its topic and it would take days for the whole dataset.
Is there a way to achieve this faster since I want to map the topics on the training data.
I tried:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
topics = []
for text in df.texts:
tops = topic_model.transform(text)
topics.append(tops)
df['topics'] = topics
There is no need to recalculate the topics as you already retrieved them when using .fit_transform. There, the topics that you retrieve are in the exact same order as the input documents. Therefore, you can perform the following:
# The `topics` that you get here are in the exact same order as `docs`
# `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topic})
For those using .fit instead of .fit_transform, you can also access the topics and their documents as follows:
# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
From the source code, the transform() function of the BERTopic class is able accept a list of documents -- so you don't need to loop over your dataframe calling transform() multiple times for each document.
Secondly, it seems that if you don't pass your pre-trained document embeddings to the transform() function, embeddings will be set to None and you'll be calling _extract_embeddings() every single time which is likely what is causing the poor performance. The solution is to pass the embeddings to your transform() call. In the dummy example shown below, this improves speed of classification of 1,000 documents by approx. 1,555x (68.43 vs 0.044 seconds).
Example
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import random
import pandas as pd
# Create dummy data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
random.seed(756)
training_docs = random.sample(docs, 1000)
testing_docs = random.sample(docs, 1000)
# Instantiate and fit topic model to training docs
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(training_docs, show_progress_bar=True)
topic_model = BERTopic().fit(training_docs, embeddings)
topic_model.reduce_topics(training_docs, nr_topics=5) # Reduce num of topics, default = 20
# Determine topics on testing docs
topics, probs = topic_model.transform(testing_docs, embeddings)
# topics, probs = topic_model.transform(testing_docs) # ~1,555x slower
df = pd.DataFrame({"docs": testing_docs, "topics": topics})
print(df)
print(topic_model.get_topic_info())

In Pytorch, how can i shuffle a DataLoader?

I have a dataset with 10000 samples, where the classes are present in an ordered manner. First I loaded the data into an ImageFolder, then into a DataLoader, and I want to split this dataset into a train-val-test set. I know the DataLoader class has a shuffle parameter, but thats not good for me, because it only shuffles the data when enumeration happens on it. I know about the RandomSampler function, but with it, i can only take n amount of data randomly from the dataset, and i have no control of what is being taken out, so one sample might be present in the train,test and val set at the same time.
Is there a way to shuffle the data in a DataLoader? The only thing i need is the shuffle, after that i can subset the data.
The Subset dataset class takes indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably exploit that to get this functionality as below. Essentially, you can get away by shuffling the indices and then picking the subset of the dataset.
# suppose dataset is the variable pointing to whole datasets
N = len(dataset)
# generate & shuffle indices
indices = numpy.arange(N)
indices = numpy.random.permutation(indices)
# there are many ways to do the above two operation. (Example, using np.random.choice can be used here too
# select train/test/val, for demo I am using 70,15,15
train_indices = indices [:int(0.7*N)]
val_indices = indices[int(0.7*N):int(0.85*N)]
test_indices = indices[int(0.85*N):]
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) in this data while using as many regressors as possible.
Because of the size of data, I am using incremental training - where following sklearn API - .fit(X, y) I am not able to fit the entire matrix X into memory and therefore I am training the model in a couple of rows at the time. The problem is that in every batch, the model is expecting the same number of columns in X.
This is where it gets tricky because some variables are categorical it may be that one-hot encoding on a batch of data will same some shape (e.g. 20 columns). However, the next batch will have (26 columns) simply because in the previous batch not every unique level of the categorical feature was present. Sklearn allows for accounting for this and costume function can also be used: To keep some number of columns in matrix X.
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder
def one_hot_known(dataf, list_levels, col):
"""Creates a dummy coded matrix with as many columns as unique levels"""
return np.array(
[np.eye(len(list_levels))[list_levels.index(i)] for i in dataf[col]])
# Load Some Dataset with categorical variable
df_orig = sns.load_dataset('tips')
# List of unique levels - known apriori
day_level = list(df_orig['day'].unique())
# Image, we have a batch of data (subset of original data) and one categorical level (DAY) is not present here
df = df_orig.loc[lambda d: d['day'] != 'Sun']
# Missing category is filled with 0 and in next batch, if present its columns will have 1.
OneHotEncoder(categories = [day_level], sparse=False).fit_transform(np.array(df['day']).reshape(-1, 1))
#Costum function, can be used in incremental(data batches chunk fashion)
one_hot_known(df, day_level, 'day')
What I would like to do not is to utilize the TargerEncoding approach, so that we do not have matrix X with a huge number of columns. However, it still needs to be done in an Incremental fashion, just like the OneHot Encoding above.
I am writing this as a post because I know this is very useful to many people and would like to know how to utilize the same strategy for TargetEncoding.
I am aware that Deep Learning allows for Embedding layers, which represent categorical features in continuous space but I would like to apply TargetEncoding.

Stratified Sampling in python scikit-learn

I want to divide my dataset into train and test sets using stratified sampling(scikitlearn).my approach is as follows :
1) I'am reading a CSV file and loading it using pandas readCSV.so ultimately i'am storing the loaded csv in a dataframe names "dataset"
dataset = pd.readCSV('CSV_NAME)
2) Now i'am applying stratified sampling as :
train,test = train_test_split(dataset,test_size=0.20,stratify=True)
But it throwing the following error :
TypeError: Singleton array array(True, dtype=bool) cannot be considered a valid collection.
So please suggest me the correct way of doing to it.
'train_test_split' needs to know what the target variable is. Therefore, you should change your call to something like:
train,test = train_test_split(dataset[needed columns], dataset.target,test_size=0.20,stratify=True)
Btw, there is a missing single quote in your first line of code.
You could convert the pandas dataframe to a numpy array by the following
import numpy
dataset = pd.readCSV('CSV_NAME')
dataset = array(dataset)
like suggested in the second answer here: https://www.quora.com/How-does-python-pandas-go-along-with-scikit-learn-library-Has-anyone-doing-data-analysis-using-pandas-and-then-then-fit-models-using-scikit-learn
Or you could read the dataset into a numpy array directly.

K-Means clustering is biased to one center

I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tfidf and then through kmeans. After a couple issues to start (you can see my previous questions), I'm finally getting a KMeansModel...but when I try to predict, I keep getting the same center. Is this because of the small dataset, or because I'm comparing a multi-word document against a smaller amount of words(1-20) query? Or is there something else I'm doing wrong? See the below code:
//Preprocessing of data includes splitting into words
//and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]]
val hashingTF = new HashingTF(100000)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf).cache
val kMeansModel = KMeans.train(tfidf, 3, 10)
val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) //Always the same, no matter the term supplied
This question seems somewhat related to this one
More a checklist than an answer:
A single word query or a very short sentence is probably not a good choice especially when combined with a large feature vector. I would start with significant fragments of the documents from the corpus
Manually check similarity between query an each cluster. Is it even remotely similar to each cluster?
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import breeze.linalg.functions.cosineDistance
import org.apache.spark.mllib.linalg.{Vector, SparseVector, DenseVector}
def toBreeze(v: Vector): BV[Double] = v match {
case DenseVector(values) => new BDV[Double](values)
case SparseVector(size, indices, values) => {
new BSV[Double](indices, values, size)
}
}
val centers = kMeansModel.clusterCenters.map(toBreeze(_))
val query = toBreeze(queryTfidf)
centers.map(c => cosineDistance(query, c))
Does K-Means converge? Depending on a dataset and initial centroids ten or twenty iterations can be not enough. Try to increase this number to one thousand or so and see if the problem persist.
Is your corpus diverse enough to form meaningful clusters? Try to find centroids for each document in you corpus. Do you get a relatively uniform distribution or almost all documents are assigned to a single cluster.
Perform visual inspection. Take your tfidf RDD convert to a matrix, apply PCA, plot, color by cluster and see if you get a meaningful results.
Plot centroids as well and check if these cover possible cluster. If not check convergence once again.
You can also check similarities between centroids:
(0 until centers.size)
.toList
.flatMap(i => ((i + 1) until centers.size)
.map(j => (i, j, 1 - cosineDistance(centers(i), centers(j)))))
Is your pre-processing thorough enough? Simple removal of the short words most likely won't suffice. I would at lest extend it using with stopwords removal. Some stemming wouldn't hurt too.
K-Means results depend on the initial centroids. Try running an algorithm multiple times an see if problem persists.
Try more sophisticated algorithm like LDA

Resources