How to cluster "text document" with "spherical k-means" using Python? - python-3.x

I have finished implementing the traditional k-means text clustering. However, right now, I need to revise my program to "spherical k-means text clustering" but have not succeeded yet.
I've searched for solutions on sites but still cannot revise my program successfully.
The followings are the resources that should be helpful with my project but I still cannot figure out a way yet.
https://github.com/jasonlaska/spherecluster
https://github.com/khyatith/Clustering-newsgroup-dataset
Spherical k-means implementation in Python
This is my traditional K-means program:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.externals import joblib #store model
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tag_document) //tag_document is a list that contains many strings
true_k = 3 //assume that i want to have 3 clusters
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#store
joblib.dump(model,'save/cluster.pkl')
#restore
clu2 = joblib.load('save/cluster.pkl')
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
I expect to cluster text documents with "spherical k-means clustering".

First you need to check if your texts are similar when the cosine distance between two similar texts is small. After that, you can just normalize vectors and cluster with kmeans.
I did something like this:
k = 20
kmeans = KMeans(n_clusters=k,init='random', random_state=0)
normalizer = Normalizer(copy=False)
sphere_kmeans = make_pipeline(normalizer, kmeans)
sphere_kmeans = sphere_kmeans.fit_transform(word2vec-tfidf-vectors)

Related

'Subset' object is not an iterator for updating torch' legacy IMDB dataset

I'm updating a pytorch network from legacy code to the current code. Following documentation such as that here.
I used to have:
import torch
from torchtext import data
from torchtext import datasets
# setting the seed so our random output is actually deterministic
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# defining our input fields (text) and labels.
# We use the Spacy function because it provides strong support for tokenization in languages other than English
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
import random
train_data, valid_data = train_data.split(random_state = random.seed(SEED))
example = next(iter(test_data))
example.text
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data,
max_size = MAX_VOCAB_SIZE,
vectors = "glove.6B.100d",
unk_init = torch.Tensor.normal_) # how to initialize unseen words not in glove
LABEL.build_vocab(train_data)
Now in the new code I am struggling to add the validation set. All goes well until here:
from torchtext.datasets import IMDB
train_data, test_data = IMDB(split=('train', 'test'))
I can print the outputs, while they look different (problems later on?), they have all the info. I can print test_data fine with next(train_data.
Then after I do:
test_size = int(len(train_dataset)/2)
train_data, valid_data = torch.utils.data.random_split(train_dataset, [test_size,test_size])
It tells me:
next(train_data)
TypeError: 'Subset' object is not an iterator
This makes me think I am not correct in applying random_split. How to correctly create the validation set for this dataset? Without causing issues.
Try next(iter(train_data)). It seems one have to create iterator over dataset explicitly. And use Dataloader when effectiveness is required.

How to evaluate NMF Topic Modeling by using Confusion Matrix?

I am doing topic modeling using NMF model. I want to evaluate its performance by confusion matrix or if there are other better methods to evaluate NMF, I am ok with that also. I tried to find tutorials or other resources on internet but couldn't find anything that help me solve my problem. Below is the complete code which I am using for NMF topic modeling.
import pandas as pd
import numpy as np
dataset = pd.read_csv(r'Preprocess_Data.csv')
dataset = reviews_datasets.head(20000)
dataset.dropna()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = tfidf_vect.fit_transform(dataset['Text'].values.astype('U'))
from sklearn.decomposition import NMF
nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix)
import random
for i in range(10):
random_id = random.randint(0,len(tfidf_vect.get_feature_names()))
print(tfidf_vect.get_feature_names()[random_id])
first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]
for i in top_topic_words:
print(tfidf_vect.get_feature_names()[I])
for i,topic in enumerate(nmf.components_):
print(f'Top 10 words for topic #{i}:')
print([tfidf_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
print('\n')
Thanks in advance for the suggestions and advices.
If you have labels associated with documents, then you can train a classifier using the topic-document representations as document features and test on the topic-document representations of the testing set.
Otherwise, you need to stick to unsupervised metrics, e.g. the most well-known is topic coherence which measures how related the top-N words of the topics are.
You can find all these measures and many others here: https://github.com/mind-Lab/octis#available-metrics

Features for Support Vector Machine (SVM)

I have to classify some texts with support vector machine. In my train file I have 5 different categories. I have to do classify at first with "Bag of Words" feature, after with SVD feature by keeping 90% of the total variance.
I 'm using python and sklearn but I don't know how to create the above SVD feature.
My train set is separated with tab (\t), my texts are in 'Content' column and the categories are in 'Category' column.
The high level steps for a tf-idf/PCA/SVM workflow are as follows:
Load data (will be different in your case):
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_text = newsgroups_train.data
y = newsgroups_train.target
Preprocess features and train classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_text)
pca = PCA(.8)
X = pca.fit_transform(X_tfidf.todense())
clf = SVC(kernel="linear")
clf.fit(X,y)
Finally, do the same preprocessing steps for test dataset and make predictions.
PS
If you wish, you may combine preprocessing steps into Pipeline:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
preproc = Pipeline([('tfidf',TfidfVectorizer())
,('todense', FunctionTransformer(lambda x: x.todense(), validate=False))
,('pca', PCA(.9))])
X = preproc.fit_transform(train_text)
and use it later for dealing with test data as well.

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets. One for training my classifier and the other one is for testing. Both the datasets are text files with two columns separated by a ",". FIrst column (numbers) is for the independent variable (group) and the second column is for the dependent variable.
Training data set
(just few lines for example. there are no empty lines between each row):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data setup
(as you can see, i have picked data from the training data to be sure that the score surely can't be zero)
EMI3776438,1
Code in Python 3.6:
# #all the import statements have been ignored to keep the code short
# #loading the training data set
training_file_path=r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
header=None,
names=['numbers','group'],
sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x,testing_x, training_y, testing_y = train_test_split(x, y,
random_state=0,
test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(training_x, training_y)
# #loading the testing data
testing_final_path=r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data=pandas.read_table(testing_final_path,
sep=',',
header=None,
names=['numbers','group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# #finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the above data samples dont show how many classes are there in it. You need to describe more about it.
Secondly, you are calling le.fit_transform again on test data which will forget all the training samples mappings from strings to numbers. The LabelEncoder le will start encoding the test data again from scratch, which will not be equal to how it mapped training data. So the input to GaussianNB is now incorrect and hence incorrect results.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
UPDATE:
I'm sorry I overlooked the fact that you had two columns in your data. LabelEncoder only works on a single column of data. For making it work on multiple pandas columns at once, look at the answers of following question:
Label encoding across multiple columns in scikit-learn
If you are using the latest version of scikit (0.20) or can update to it, then you would not need any such hacks and directly use the OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
training_file_data = enc.transform(training_file_data)

Sklearn.mixture.dpgmm not functioning correctly

I'm having trouble with sklearn.mixture.dpgmm. The main issue is that it is not returning correct covariances for synthetic data (2 separated 2D gaussians), where it really should have no issue. In particular, when I do dpgmm._get_covars(), the covariance matrices have diagonal elements that are always exactly 1.0 too large, regardless of the input data distributions. This seems like a bug, as gmm works perfectly (when limiting to known exact number of groups)
Another issue is that dpgmm.weights_ makes no sense, they sum to one but the values appear meaningless.
Does anyone have a solution to this or see something clearly wrong with my example?
Here is the exact script I'm running:
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
import pdb
from sklearn import mixture
# Generate 2D random sample, two gaussians each with 10000 points
rsamp1 = np.random.multivariate_normal(np.array([5.0,5.0]),np.array([[1.0,-0.2],[-0.2,1.0]]),10000)
rsamp2 = np.random.multivariate_normal(np.array([0.0,0.0]),np.array([[0.2,-0.0],[-0.0,3.0]]),10000)
X = np.concatenate((rsamp1,rsamp2),axis=0)
# Fit a mixture of Gaussians with EM using 2
gmm = mixture.GMM(n_components=2, covariance_type='full',n_iter=10000)
gmm.fit(X)
# Fit a Dirichlet process mixture of Gaussians using 10 components
dpgmm = mixture.DPGMM(n_components=10, covariance_type='full',min_covar=0.5,tol=0.00001,n_iter = 1000000)
dpgmm.fit(X)
print("Groups With data in them")
print(np.unique(dpgmm.predict(X)))
##print the input and output covars as example, should be very similar
correct_c0 = np.array([[1.0,-0.2],[-0.2,1.0]])
print "Input covar"
print correct_c0
covars = dpgmm._get_covars()
c0 = np.round(covars[0],decimals=1)
print "Output Covar"
print c0
print("Output Variances Too Big by 1.0")
According to the dpgmm docs this Class is Deprecated in version 0.18 and will be removed in version 0.20
You should use BayesianGaussianMixture Class instead, with parameter weight_concentration_prior_type set with option "dirichlet_process"
Hope it helps
instead of writing
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
you should write:
from sklearn.mixture import BayesianGaussianMixture
gmm = BayesianGaussianMixture(2, covariance_type='full', random_state=0)

Resources