I am doing topic modeling using NMF model. I want to evaluate its performance by confusion matrix or if there are other better methods to evaluate NMF, I am ok with that also. I tried to find tutorials or other resources on internet but couldn't find anything that help me solve my problem. Below is the complete code which I am using for NMF topic modeling.
import pandas as pd
import numpy as np
dataset = pd.read_csv(r'Preprocess_Data.csv')
dataset = reviews_datasets.head(20000)
dataset.dropna()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = tfidf_vect.fit_transform(dataset['Text'].values.astype('U'))
from sklearn.decomposition import NMF
nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix)
import random
for i in range(10):
random_id = random.randint(0,len(tfidf_vect.get_feature_names()))
print(tfidf_vect.get_feature_names()[random_id])
first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]
for i in top_topic_words:
print(tfidf_vect.get_feature_names()[I])
for i,topic in enumerate(nmf.components_):
print(f'Top 10 words for topic #{i}:')
print([tfidf_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
print('\n')
Thanks in advance for the suggestions and advices.
If you have labels associated with documents, then you can train a classifier using the topic-document representations as document features and test on the topic-document representations of the testing set.
Otherwise, you need to stick to unsupervised metrics, e.g. the most well-known is topic coherence which measures how related the top-N words of the topics are.
You can find all these measures and many others here: https://github.com/mind-Lab/octis#available-metrics
Related
Im learning how to convert text into numbers for NLP problems and following a course Im learning about word vectors provided by Spacy package. the code works all fine from learning and evaluation but I have some problems regarding:
making prediction for new sentences, I cannot seems to make it work and most examples just fit the model then use X_test set for evaluation. ( Code below)
The person explaining stated that its bad( won't give good results) if I used
""
doc.vector over doc.vector.values
""
when trying both I don't see a difference, what is the difference between the two?
the example is to classify news title between fake and real
import spacy
import pandas as pd
df= pd.read_csv('Fake_Real_Data.csv')
print(df.head())
print(f"shape is: {df.shape}")
print("checking the impalance: \n ", df.label.value_counts())
df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())
nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors
df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)
x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D)
clf.fit(scaled_train_2d, y_train)
from sklearn.metrics import classification_report
y_pred=clf.predict(scaled_test_2d)
print(classification_report(y_test, y_pred))
I am applying the breast cancer dataset to a decision tree as simple as possible:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
cancer = load_breast_cancer()
#print(cancer.feature_names)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(X_train, y_train)
print(f"\nscore train: {tree.score(X_train, y_train)}")
print(f"score test : {tree.score(X_test, y_test)}")
>>>
score train: 0.9413145539906104
score test : 0.9370629370629371
export_graphviz(tree, out_file=f"./src/dot/testing/breast_cancer.dot", class_names=['malignant', 'benign'], feature_names=cancer.feature_names, impurity=False, filled=True)
with open(f"./src/dot/testing/breast_cancer.dot") as f:
dot_graph = f.read()
graphviz.Source(dot_graph)
Which lead to this graph:
Playing with feature selection, I want to get only the most important feature. In my understanding it should be the feature in the root-leaf, no? Unfortunately it's not, it's "worst concave points". Here is what I did to get the most important feature:
select = SelectKBest(k=1)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print("X_train.shape : {}".format(X_train.shape))
print("X_train_selected.shape: {}\n".format(X_train_selected.shape))
>>>
X_train.shape : (426, 30)
X_train_selected.shape: (426, 1)
mask = select.get_support()
# plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# plt.xlabel("Sample index")
print("most important features:")
for mask, feature in zip(mask, cancer.feature_names):
if mask: print(feature)
>>>
most important features:
worst concave points
I guess I am getting something wrong here. Could somebody clarify this? Any hint? Thanks
The most important feature does not necessarily mean that it will be the one used to make the first split. In fact, sklearn.tree.DecisionTreeClassifier uses entropy to decide which feature to use when making a split, so unless SelectKBest does this too, there is no need for both methods to reach the same conclusions in the same order. Even the same feature will reduce entropy differently in different stages of a tree classifier.
As a side note, trees do not always consider all features when making nodes. Take a look at max_features here. This means that, depending on your random-state and max_features hyper parameters, your tree may or may not have considered worst_concave_points when making the first split.
I have finished implementing the traditional k-means text clustering. However, right now, I need to revise my program to "spherical k-means text clustering" but have not succeeded yet.
I've searched for solutions on sites but still cannot revise my program successfully.
The followings are the resources that should be helpful with my project but I still cannot figure out a way yet.
https://github.com/jasonlaska/spherecluster
https://github.com/khyatith/Clustering-newsgroup-dataset
Spherical k-means implementation in Python
This is my traditional K-means program:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.externals import joblib #store model
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tag_document) //tag_document is a list that contains many strings
true_k = 3 //assume that i want to have 3 clusters
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#store
joblib.dump(model,'save/cluster.pkl')
#restore
clu2 = joblib.load('save/cluster.pkl')
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
I expect to cluster text documents with "spherical k-means clustering".
First you need to check if your texts are similar when the cosine distance between two similar texts is small. After that, you can just normalize vectors and cluster with kmeans.
I did something like this:
k = 20
kmeans = KMeans(n_clusters=k,init='random', random_state=0)
normalizer = Normalizer(copy=False)
sphere_kmeans = make_pipeline(normalizer, kmeans)
sphere_kmeans = sphere_kmeans.fit_transform(word2vec-tfidf-vectors)
I have to classify some texts with support vector machine. In my train file I have 5 different categories. I have to do classify at first with "Bag of Words" feature, after with SVD feature by keeping 90% of the total variance.
I 'm using python and sklearn but I don't know how to create the above SVD feature.
My train set is separated with tab (\t), my texts are in 'Content' column and the categories are in 'Category' column.
The high level steps for a tf-idf/PCA/SVM workflow are as follows:
Load data (will be different in your case):
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_text = newsgroups_train.data
y = newsgroups_train.target
Preprocess features and train classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_text)
pca = PCA(.8)
X = pca.fit_transform(X_tfidf.todense())
clf = SVC(kernel="linear")
clf.fit(X,y)
Finally, do the same preprocessing steps for test dataset and make predictions.
PS
If you wish, you may combine preprocessing steps into Pipeline:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
preproc = Pipeline([('tfidf',TfidfVectorizer())
,('todense', FunctionTransformer(lambda x: x.todense(), validate=False))
,('pca', PCA(.9))])
X = preproc.fit_transform(train_text)
and use it later for dealing with test data as well.
Is it possible to use Keras model objects with CalibratedClassifierCV from sklearn.calibration? Or is there another way to performa isotonic regression in sklearn/other python packages without having to pass it a model object.
I tried using the sklearn wrapper for Keras, but it didn't work. Here is the doc for the CalibratedClassifierCV class.
You can train an isotonic regression a posteriori, after prediction. Let 'file1' be a csv containing your predictions pred and real observed events obs on a subset of data. Ideally, this subset has never been used before (not even in Keras training). Let file2 contain the predictions you want to calibrate (Keras predictions for the test set).
import pandas as pd
from sklearn.isotonic import IsotonicRegression
never_seen=pd.read_csv('file1')
uncalibrated=pd.read_csv('file2')
ir = IsotonicRegression( out_of_bounds = 'clip' )
ir.fit( never_seen.pred,never_seen.obs )
p_calibrated = ir.transform( uncalibrated.pred )