The right way to make predictions using spaCy word vectors - nlp

I'm learning how to convert text into numbers for NLP problems, and in the course I'm following I'm learning about the word vectors provided by the spaCy package. The code all works for training and evaluation, but I have two problems:
Making predictions for new sentences: I cannot seem to make it work, and most examples just fit the model and then use the X_test set for evaluation (code below).
The person explaining stated that it is bad (it won't give good results) if I use
doc.vector over doc.vector.values
When I try both I don't see a difference. What is the difference between the two?
The example classifies news titles as fake or real.
import numpy as np  # needed for np.stack below
import pandas as pd
import spacy
df= pd.read_csv('Fake_Real_Data.csv')
print(df.head())
print(f"shape is: {df.shape}")
print("checking the imbalance: \n ", df.label.value_counts())
df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())
nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors
df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)
x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D)
clf.fit(scaled_train_2d, y_train)
from sklearn.metrics import classification_report
y_pred=clf.predict(scaled_test_2d)
print(classification_report(y_test, y_pred))
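For new sentences, one possible approach is to repeat exactly the steps applied to X_test: turn each string into a vector, stack the vectors into a 2D array, apply the already fitted scaler with transform (not fit_transform), and call predict. The sketch below is only an illustration. To run without downloading a spaCy model it uses a hypothetical toy_embed function and made-up titles; with the code above you would pass lambda t: nlp(t).vector instead. Note that doc.vector is already a NumPy array, while .values is a pandas attribute, as in df.Text_vector.values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler


def toy_embed(text, dim=16):
    """Hypothetical stand-in for nlp(text).vector: letter counts folded into `dim` bins."""
    vec = np.zeros(dim, dtype=np.float64)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) % dim] += 1.0
    return vec


def predict_new(texts, embed, scaler, clf):
    """Vectorize raw strings, apply the already fitted scaler, then predict."""
    X_new = np.stack([embed(t) for t in texts])   # shape (n_texts, dim)
    return clf.predict(scaler.transform(X_new))   # transform only, never re-fit


# Made-up training titles, labelled as in the question: 0 = Fake, 1 = Real.
train_titles = [
    "aliens secretly run the world bank",
    "miracle pill cures every disease overnight",
    "celebrity replaced by a robot double",
    "parliament passes the annual budget bill",
    "central bank keeps interest rates unchanged",
    "city council approves new bus routes",
]
train_labels = [0, 0, 0, 1, 1, 1]

scaler = MinMaxScaler()
X_train = scaler.fit_transform(np.stack([toy_embed(t) for t in train_titles]))
clf = MultinomialNB().fit(X_train, train_labels)

new_titles = ["government announces new tax policy", "moon landing filmed in a kitchen"]
preds = predict_new(new_titles, toy_embed, scaler, clf)
print(preds)
```

With the question's objects the call would look like predict_new(["some title"], lambda t: nlp(t).vector, scaler, clf), and the result can be mapped back to labels with {0: 'Fake', 1: 'Real'}.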

Related

Why is sklearn RandomForestClassifier root node different from the most important feature?

How is feature importance calculated in RandomForestClassifier in scikit-learn?
Here's a reproducible example. I run the classifier once with criterion set to gini and once with entropy. For each of them, I print the feature importances and plot the tree.
In neither case is the feature at the root node the same as the most important feature. Why is that?
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image, display
from subprocess import call
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.datasets import load_iris
wines = load_wine()
iris = load_iris()
def create_and_fit(clf, model_name):
    print(clf)
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=5, random_state=seed)
    # X, y = iris.data, iris.target
    # X, y = wines.data, wines.target
    # fit the model
    clf.fit(X, y)
    # get importance, sorted from most to least important
    importance = clf.feature_importances_
    indices = np.argsort(importance)[::-1]
    for f in range(X.shape[1]):
        print("feature {}: ({})".format(indices[f], importance[indices[f]]))
    # use the estimator passed in (clf), not the global loop variable
    filename = model_name + clf.criterion
    if model_name == 'forest_':
        print('forest')
        # a forest has many trees; only the first one is exported
        export_graphviz(clf.estimators_[0], out_file=filename+'.dot')
    else:
        export_graphviz(clf, out_file=filename+'.dot')
    call(['dot', '-Tpng', filename+'.dot', '-o', filename+'.png', '-Gdpi=600'])
seed=0
models = [
RandomForestClassifier(criterion='gini',max_depth=5, random_state=seed),
RandomForestClassifier(criterion='entropy',max_depth=5, random_state=seed),
]
names =['forest_', 'forest_']
for name, model in zip(names, models):
    create_and_fit(model, name)
Here's the snippet to load the image:
Image(filename = 'forest_gini'+'.png')
and for the entropy
Image(filename = 'forest_entropy'+'.png')
This behaviour seems to only happen with ensembles not trees (I'm generalizing as I only tried on Random forest and Decision Tree).
Here's the snippet for decision trees
models = [
DecisionTreeClassifier(criterion='gini',max_depth=5, random_state=seed),
DecisionTreeClassifier(criterion='entropy',max_depth=5, random_state=seed)
]
names =['tree_', 'tree_']
for name, model in zip(names, models):
    create_and_fit(model, name)
Here's the snippet to load the image:
Image(filename = 'tree_gini'+'.png')
and for the entropy
Image(filename = 'tree_entropy'+'.png')
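I think I found the answer, which is related to the max_features parameter of RandomForestClassifier. With the default max_features="sqrt", each split only considers a random subset of the features, so the root of a single tree need not be the feature with the highest importance, which is averaged over all trees in the forest. Here's the scikit-learn documentation: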
max_features : {“sqrt”, “log2”, None}, int or float, default=“sqrt”
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
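To see the effect of max_features, a small sketch like the following can help. It reuses the make_classification settings from the question; the helper root_features is a hypothetical name introduced here. It collects the root feature of every tree in a default forest and in a forest where each tree sees all features and all samples (max_features=None, bootstrap=False).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=5, random_state=0)


def root_features(forest):
    """Index of the feature used at the root (node 0) of every tree in the forest."""
    return [int(est.tree_.feature[0]) for est in forest.estimators_]


# Default forest: random feature subset per split, bootstrap sample per tree.
default_rf = RandomForestClassifier(max_depth=5, random_state=0).fit(X, y)
# Every tree considers all features at each split and sees the full data set.
full_rf = RandomForestClassifier(max_depth=5, max_features=None, bootstrap=False,
                                 random_state=0).fit(X, y)

print("top feature by importance:", int(np.argmax(default_rf.feature_importances_)))
print("distinct root features (default):", sorted(set(root_features(default_rf))))
print("distinct root features (all features, no bootstrap):",
      sorted(set(root_features(full_rf))))
```

In the default forest the root feature should vary from tree to tree, which is why the root of clf.estimators_[0] alone says little about the forest-wide importance ranking.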

Do I have to run tsne.fit_transform for each set of embeddings that I want to visualize?

I'm trying to use sklearn.manifold.TSNE to visualize data that I sample from a generative model and compare the distribution of generated data vs training data (to measure 'extrapolation').
Here's how I'm doing it:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import joblib
import numpy as np
import pandas as pd
tsne = TSNE(n_components=2, random_state=0)
x_train = tsne.fit_transform(embds_train)
x_generated = tsne.fit_transform(embds_generated)
My question is: is it necessary to call tsne.fit_transform() separately on the embeddings for the training and the generated samples? Or could I fit only once and then add other embeddings to the already fitted space?
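scikit-learn's TSNE has no separate transform method, so a fitted map cannot be reused for new points, and two independent fit_transform calls produce coordinate systems that are not comparable. One common workaround, sketched below under the assumption that both sets have the same dimensionality, is to stack them, run fit_transform once, and split the result. The random arrays are placeholders for embds_train and embds_generated.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data standing in for the real embeddings (assumption).
rng = np.random.default_rng(0)
embds_train = rng.normal(0.0, 1.0, size=(60, 8))
embds_generated = rng.normal(0.5, 1.0, size=(40, 8))

# Embed both sets together so they share one 2D space.
combined = np.vstack([embds_train, embds_generated])
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
coords = tsne.fit_transform(combined)

# Split the joint embedding back into the two groups for plotting.
x_train = coords[: len(embds_train)]
x_generated = coords[len(embds_train):]
print(x_train.shape, x_generated.shape)
```

The two slices can then be drawn on the same axes with different colours, for example with plt.scatter.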

scikit-learn: most important feature due to SelectKBest() is not the same feature of top node in DecisionTreeClassifier() with unedited data?

I am fitting as simple a decision tree as possible to the breast cancer dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
cancer = load_breast_cancer()
#print(cancer.feature_names)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(X_train, y_train)
print(f"\nscore train: {tree.score(X_train, y_train)}")
print(f"score test : {tree.score(X_test, y_test)}")
>>>
score train: 0.9413145539906104
score test : 0.9370629370629371
export_graphviz(tree, out_file=f"./src/dot/testing/breast_cancer.dot", class_names=['malignant', 'benign'], feature_names=cancer.feature_names, impurity=False, filled=True)
with open(f"./src/dot/testing/breast_cancer.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Which leads to this graph:
Playing with feature selection, I want to get only the most important feature. In my understanding it should be the feature at the root node, no? Unfortunately it's not; SelectKBest gives "worst concave points". Here is what I did to get the most important feature:
select = SelectKBest(k=1)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print("X_train.shape : {}".format(X_train.shape))
print("X_train_selected.shape: {}\n".format(X_train_selected.shape))
>>>
X_train.shape : (426, 30)
X_train_selected.shape: (426, 1)
mask = select.get_support()
# plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# plt.xlabel("Sample index")
print("most important features:")
for selected, feature in zip(mask, cancer.feature_names):
    if selected:
        print(feature)
>>>
most important features:
worst concave points
I guess I am getting something wrong here. Could somebody clarify this? Any hint? Thanks
The most important feature is not necessarily the one used for the first split. sklearn.tree.DecisionTreeClassifier picks each split by impurity reduction (Gini impurity by default, entropy if you set criterion='entropy'), whereas SelectKBest ranks features with a univariate score (the ANOVA F-value, f_classif, by default). Since the two use different criteria, there is no reason for them to reach the same conclusions in the same order. Even the same feature will reduce impurity by different amounts at different stages of a tree classifier.
As a side note, trees do not always consider all features when choosing a split; see the max_features parameter. DecisionTreeClassifier defaults to max_features=None, which considers every feature, but if you set it lower then, depending on your random_state and max_features hyperparameters, your tree may or may not have considered "worst concave points" when making the first split.
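A short sketch that puts the two rankings side by side, using the same split and tree settings as the question. It reads the root split from tree.tree_.feature[0] and makes the default f_classif score of SelectKBest explicit; no particular output is claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Feature chosen for the first split (node 0 is the root).
tree = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X_train, y_train)
root_feature = cancer.feature_names[tree.tree_.feature[0]]

# Feature with the best univariate ANOVA F-score.
select = SelectKBest(score_func=f_classif, k=1).fit(X_train, y_train)
kbest_feature = cancer.feature_names[select.get_support()][0]

print("root split feature:        ", root_feature)
print("tree top-importance feature:", cancer.feature_names[tree.feature_importances_.argmax()])
print("SelectKBest top feature:   ", kbest_feature)
```

The first two lines come from the tree's impurity criterion, the third from a per-feature statistical test, so they are free to disagree.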

train test data split using stratify on two columns in scikit-learn

I have a dataset that I want to split into train and test so that the test set contains data from each data source (specified in column "source") and from each class (specified in column "class"). I read about using the stratify parameter of scikit-learn's train_test_split function, but how can I use it on two columns?
Stratifying on multiple columns is easily done with sklearn's train_test_split since v0.19.0.
Proof
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_multilabel_classification
X, Y = make_multilabel_classification(1000000, 10, n_classes=2, n_labels=1)
train_X, test_X, train_Y, test_Y =train_test_split(X,Y,stratify=Y, train_size=.8, random_state=42)
Y.shape
(1000000, 2)
Then you can compare simple column means of resulting stratifications:
train_Y[:,0].mean(), test_Y[:,0].mean()
(0.45422, 0.45422)
train_Y[:,1].mean(), test_Y[:,1].mean()
(0.23472375, 0.234725)
Run statistical t-test on the equality of means:
from scipy.stats import ttest_ind
ttest_ind(train_Y[:,0],test_Y[:,0])
Ttest_indResult(statistic=0.0, pvalue=1.0)
And finally do the same for conditional means to prove that you indeed achieved what you wanted:
train_Y[train_Y[:,0].astype("bool"),1].mean(), test_Y[test_Y[:,0].astype("bool"),1].mean()
(0.43959149751221877, 0.43958874554180793)
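Applied to a DataFrame with the question's "source" and "class" columns, one explicit variant is to combine the two columns into a single key and stratify on that. The sketch below uses made-up toy data; the "feature" column and the variable name strata are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): 3 sources x 2 classes, 20 rows per combination.
df = pd.DataFrame({
    "source": ["a", "b", "c"] * 40,
    "class": [0] * 60 + [1] * 60,
    "feature": range(120),
})

# One stratification key per (source, class) combination.
strata = df["source"].astype(str) + "_" + df["class"].astype(str)

train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=strata, random_state=42)

# Every combination should be represented in the test set.
print(test_df.groupby(["source", "class"]).size())
```

Each combination needs at least two rows, otherwise train_test_split raises an error because it cannot place a member in both splits.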

Features for Support Vector Machine (SVM)

I have to classify some texts with a support vector machine. My training file has 5 different categories. I have to classify first with "Bag of Words" features, and then with SVD features that keep 90% of the total variance.
I'm using Python and sklearn, but I don't know how to create the SVD features described above.
My training set is tab-separated (\t); the texts are in the 'Content' column and the categories are in the 'Category' column.
The high level steps for a tf-idf/PCA/SVM workflow are as follows:
Load data (will be different in your case):
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_text = newsgroups_train.data
y = newsgroups_train.target
Preprocess features and train classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_text)
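pca = PCA(.9)  # keep enough components to explain 90% of the variance
X = pca.fit_transform(X_tfidf.toarray())  # PCA needs a dense ndarray, not a sparse matrix or np.matrix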
clf = SVC(kernel="linear")
clf.fit(X,y)
Finally, do the same preprocessing steps for test dataset and make predictions.
PS
If you wish, you may combine preprocessing steps into Pipeline:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
preproc = Pipeline([('tfidf',TfidfVectorizer())
,('todense', FunctionTransformer(lambda x: x.toarray(), validate=False))
,('pca', PCA(.9))])
X = preproc.fit_transform(train_text)
and use it later for dealing with test data as well.
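Since the question asks for SVD specifically, here is a possible variant with TruncatedSVD, which works on the sparse tf-idf matrix directly. TruncatedSVD takes a component count rather than a variance fraction, so this sketch fits the maximum number of components and keeps the smallest prefix whose cumulative explained_variance_ratio_ reaches 90%. The tiny corpus is made up, and if 90% is never reached all fitted components are kept.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for the 'Content' column (assumption).
texts = [
    "the team won the football match",
    "the striker scored two goals",
    "parliament voted on the new law",
    "the minister resigned after the vote",
    "the new phone has a faster chip",
    "the laptop ships with more memory",
    "stocks fell as the market closed",
    "the bank raised its interest rate",
]
X_tfidf = TfidfVectorizer().fit_transform(texts)  # sparse matrix

# Fit as many components as the data allows, staying below both dimensions.
max_components = min(X_tfidf.shape) - 1
svd = TruncatedSVD(n_components=max_components, random_state=0)
X_svd = svd.fit_transform(X_tfidf)

# Smallest number of leading components reaching 90% cumulative variance.
cumulative = np.cumsum(svd.explained_variance_ratio_)
k = min(int(np.searchsorted(cumulative, 0.90)) + 1, max_components)
X_reduced = X_svd[:, :k]
print(k, X_reduced.shape)
```

X_reduced can then be passed to SVC in place of the PCA output, and test data would go through vectorizer.transform, svd.transform and the same [:, :k] slice.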
