Fewer than expected purity scores in PCA analysis - python-3.x

I'm trying to plot a line graph of purity scores against the captured variances in PCA. The goal is to plot purity scores against the captured variances of 89% and 99% only. In my code, when the number of components/dimensions is 2 it captures 89% of the variance, and when it is 4 it captures 99% of the variance.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn import metrics

df = pd.read_csv("clustering.csv")
X10_df = df.drop("Class", axis=1)  # feature matrix
Y10_df = df["Class"]               # target vector
X10_df = np.array(X10_df)
Y10_df = np.array(Y10_df)

scaler = StandardScaler()          # standardizing the data
df_std = scaler.fit_transform(X10_df)

pca = PCA()
pca.fit(df_std)

purity = []
n_comp = range(2, 5)
for k in n_comp:
    pca = PCA(n_components=k)
    pca.fit(df_std)
    scores_pca = pca.transform(df_std)
    kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
    pred_y12 = kmeans_pca.fit_predict(scores_pca)
    purity13 = purity_score(Y10_df, pred_y12)
    purity.append(purity13)
The function below calculates the purity score:
def purity_score(y_true, y_pred):
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
However, while I have four variance scores, I only have three purity scores. I expected to have four purity scores so that I could create a plot of the variance vs purity.
Why are there only three purity scores?
Here is the link to my dataset file : https://gofile.io/d/3CgFTi

This is simply because when you use a for loop with a range, the last number in the range is excluded. So range(2, 5) iterates over 2, 3, and 4 and then exits the loop, giving you three purity scores. Please read up on for loops and range in Python.
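A quick way to see this, and a possible fix if four purity scores are wanted (a minimal sketch, assuming you want one score per component count), is:

print(list(range(2, 5)))   # [2, 3, 4] -- three values, hence three purity scores

# To get four purity scores, make the range cover four component counts,
# e.g. components 2 through 5 (adjust to whichever counts you need):
n_comp = range(2, 6)
print(list(n_comp))        # [2, 3, 4, 5]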

Related

sklearn.metrics r2_score negative

I can't understand r2_score in sklearn.metrics; it seems to return meaningless values. I followed all the "similar questions" proposed by Stack Overflow (some of which allude to a wrong argument order, which is why I include both orders below), but I'm still lost:
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score
data = [[0.70940504, 0.81604095],
        [0.69506565, 0.78922145],
        [0.66527803, 0.72174502],
        [0.75251691, 0.74893098],
        [0.72517034, 0.73999503],
        [0.68269306, 0.72230534],
        [0.75251691, 0.77163700],
        [0.78954422, 0.81163350],
        [0.83077994, 0.94561242],
        [0.74107290, 0.75122162]]
df = pd.DataFrame(data)
x = df[0].to_numpy().reshape(-1,1)
y = df[1].to_numpy()
print("r2 = ", r2_score(y, x))
print("r2 (wrong order) = ", r2_score(x, y))
lreg = linear_model.LinearRegression()
lreg.fit(x, y)
y_pred = lreg.predict(x)
print("predicted values: ", y_pred)
print("slope = ", lreg.coef_)
print("intercept = ", lreg.intercept_)
print("score = ", lreg.score(x, y))
returns
r2 = 0.01488309898850404 # surprise!!
r2 (wrong order) = -0.7313385423077101 # even more of a surprise!!
predicted values: [0.75664194 0.74219177 0.71217403 0.80008687 0.77252903 0.7297236 0.80008687 0.83740023 0.87895451 0.78855445]
slope = [1.00772544]
intercept = 0.04175643677503682
score = 0.5778168671193278
Plotting the data and the predicted values in Excel shows that the linear_model return values make sense (the orange dots fall on the Excel trend line), but the r2_score return values do not, in either argument order.
Your model explains nearly 60% of the target variance, which is much better than the mean predictor (which would explain 0%).
Why does your single feature explain less when passed to r2_score directly? Mainly because of the missing intercept in this case: r2_score(y, x + 0.042) would work nearly as well.
In a simplified way, you may think of R2 as 1 - (mean_squared_error(y, y_pred) / y.var()). Predictions that are not centered on the target mean inevitably inflate the sum of squared residuals, resulting in a poor R2.
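As a rough check of that formula, here is a minimal sketch using the x and y defined above (mean_squared_error is imported only to make the relationship explicit):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

x_flat = x.ravel()  # use the raw feature values directly as "predictions"

# R2 written as 1 - MSE / variance of the target (ddof=0, matching r2_score)
r2_manual = 1 - mean_squared_error(y, x_flat) / y.var()
print(r2_manual)               # ~0.0149, same as r2_score(y, x)
print(r2_score(y, x_flat))     # ~0.0149

# Shifting the "predictions" onto the target mean removes the offset penalty;
# since the fitted slope is close to 1, this nearly matches lreg.score(x, y):
x_shifted = x_flat + (y.mean() - x_flat.mean())
print(r2_score(y, x_shifted))  # ~0.58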

Sklearn's TfidfTransformer(use_idf=False, norm=None) returns the same output as CountVectorizer()

I am trying to understand the code behind TfidfTransformer(). From sklearn's documentation, I can get the term frequencies by setting use_idf=False. But when I checked the code on GitHub, I noticed that TfidfTransformer() returns the same values as CountVectorizer() when no normalization is used, which is just the count of each term.
Here is the code that is supposed to calculate the term frequencies:
def transform(self, X, copy=True):
    """Transform a count matrix to a tf or tf-idf representation.

    Parameters
    ----------
    X : sparse matrix of (n_samples, n_features)
        A matrix of term/token counts.
    copy : bool, default=True
        Whether to copy X and operate on the copy or perform in-place
        operations.

    Returns
    -------
    vectors : sparse matrix of shape (n_samples, n_features)
        Tf-idf-weighted document-term matrix.
    """
    X = self._validate_data(
        X, accept_sparse="csr", dtype=FLOAT_DTYPES, copy=copy, reset=False
    )
    if not sp.issparse(X):
        X = sp.csr_matrix(X, dtype=np.float64)

    if self.sublinear_tf:
        np.log(X.data, X.data)
        X.data += 1

    if self.use_idf:
        # idf being a property, the automatic attributes detection
        # does not work as usual and we need to specify the attribute
        # name:
        check_is_fitted(self, attributes=["idf_"], msg="idf vector is not fitted")

        # *= doesn't work
        X = X * self._idf_diag

    if self.norm is not None:
        X = normalize(X, norm=self.norm, copy=False)

    return X
To investigate further, I ran both classes and compared the outputs of CountVectorizer and TfidfTransformer using the following code; the outputs are equal.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             subset='train',
                             categories=['sci.electronics', 'rec.autos', 'rec.sport.hockey'])
train_documents = dataset.data

vectorizer = CountVectorizer()
train_documents_mat = vectorizer.fit_transform(train_documents)

tf_vectorizer = TfidfTransformer(use_idf=False, norm=None)
train_documents_mat_2 = tf_vectorizer.fit_transform(train_documents_mat)

equal = np.array_equal(
    train_documents_mat.toarray(),
    train_documents_mat_2.toarray()
)
print(equal)
I am trying to get the term frequencies for my documents rather than just the counts. Any ideas why sklearn implements TF-IDF this way?
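If the goal is relative term frequencies rather than raw counts, one option is per-document L1 normalization, which divides each count by the document's total number of terms. A minimal sketch on a hypothetical toy corpus (not the sklearn source):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog sat"]   # hypothetical toy corpus

counts = CountVectorizer().fit_transform(docs)     # raw term counts

# Option 1: per-document L1 normalization via TfidfTransformer
tf_l1 = TfidfTransformer(use_idf=False, norm='l1').fit_transform(counts)

# Option 2: divide each row of the count matrix by its total term count
row_sums = np.asarray(counts.sum(axis=1)).ravel()
tf_manual = counts.toarray() / row_sums[:, None]

print(np.allclose(tf_l1.toarray(), tf_manual))     # True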

Discounted Cumulative Gain dcg_score sklearn

from sklearn.metrics import ndcg_score, dcg_score
import numpy as np
actual = [3, 2, 0, 0, 1]
ideal = sorted(actual, reverse=True)
# list to np asarray
actualarr = np.asarray([actual])
idealarr = np.asarray([ideal])
print("actual score as array", actualarr)
print("ideal score as array", idealarr)
# Discounted Cumulative Gain
dcg = dcg_score(idealarr, actualarr)
print("DCG: ", dcg)
I don't understand why dcg_score takes y_score as a parameter. When I work out DCG longhand (sum of relevance/log2(i+1)) I get the same answer, ~4.6, and I can achieve this with just the true scores [3,2,0,0,1], so why does the function also require the ideal score [3,2,1,0,0]?
I understand that sklearn's dcg_score (and ndcg_score) computes its sum by taking values from y_true as if they were reordered according to y_score.
As explained inside the code: "Sum the true scores ranked in the order induced by the predicted scores"
This means the metric is computed on the induced ranking, using true relevance values.
A small example:
import numpy as np
from sklearn.metrics import dcg_score
def naive_dcg(y_score):
    score = 0
    for i, n in enumerate(y_score[0]):
        num = 2**n - 1
        den = np.log2(i + 1 + 1)
        score += num / den
    return score
y_true = [[1,0]]
y_score = [[0,1]]
print(f'sklearn: {dcg_score(y_true,y_score):.2}, naive: {naive_dcg(y_score):.2}')
y_score = [[0.1,0.2]]
print(f'sklearn: {dcg_score(y_true,y_score):.2}, naive: {naive_dcg(y_score):.2}')
outputs:
sklearn: 0.63, naive: 0.63
sklearn: 0.63, naive: 0.17
which shows that the naive version, computed directly from the scores, produces a different value for the same ranking order.

Model selection & Selecting the number of active components in Bayesian Gaussian Mixture Models

I have generated 2 groups of 1-D data points which are visually clearly separable and I want to use a Bayesian Gaussian Mixture Model (BGMM) to ideally recover 2 clusters.
Since BGMMs maximize a lower bound on the model evidence (ELBO) and given that the ELBO is supposed to combine notions of accuracy and complexity, I would expect more complex models to be penalized.
However, when running a grid search over the number of clusters, I often get a solution with more than 2 clusters. More specifically, I often get the maximal number of clusters on my grid. In the example below, I would expect the best model to define 2 clusters. Instead, the best model defines 4 but assigns minimal weights to 2 out of the 4 clusters.
I am really surprised, since 2 out of 4 clusters are therefore adding little information and this more complex model still gets selected as the best model.
Why is the BGMM then picking 4 clusters for the best model?
If this is indeed the behavior a BGMM should show, how can I then assess how many active components I actually have in my model? Visually? By defining an arbitrary threshold on the weights?
I have added the code to reproduce my example below.
# Import statements
import itertools
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from joblib import Parallel, delayed
from sklearn.mixture import BayesianGaussianMixture
from sklearn.utils import shuffle

def fitmodel(x, params):
    '''
    Instantiates and fits Bayesian GMM
    Used in the parallel for loop
    '''
    # Gaussian mixture model
    clf = BayesianGaussianMixture(**params)
    # Fit
    clf = clf.fit(x, y=None)
    return clf

def plot_results(X, means, covariances, title):
    plt.plot(X, np.random.uniform(low=0, high=1, size=len(X)), 'o',
             alpha=0.1, color='cornflowerblue', label='data points')
    for i, (mean, covar) in enumerate(zip(means, covariances)):
        # Get normal PDF
        n_sd = 2.5
        x = np.linspace(mean - n_sd*covar, mean + n_sd*covar, 300)
        x = x.ravel()
        y = stats.norm.pdf(x, mean, covar).ravel()
        if i == 0:
            label = 'Component PDF'
        else:
            label = None
        plt.plot(x, y, color='darkorange', label=label)
    plt.yticks(())
    plt.title(title)

# Generate data
g1 = np.random.uniform(low=-1.5, high=-1, size=(1, 100))
g2 = np.random.uniform(low=1.5, high=1, size=(1, 100))
X = np.append(g1, g2)

# Shuffle data
X = shuffle(X)
X = X.reshape(-1, 1)

# Define parameters for grid search
parameters = {
    'n_components': [1, 2, 3, 4],
    'weight_concentration_prior_type': ['dirichlet_distribution']
}

# Create permutations of parameter settings
keys, values = zip(*parameters.items())
param_grid = [dict(zip(keys, v)) for v in itertools.product(*values)]

# Run grid search using a parallel for loop
list_clf = [None] * len(param_grid)
num_cores = multiprocessing.cpu_count()
list_clf = Parallel(n_jobs=num_cores)(delayed(fitmodel)(X, params) for params in param_grid)

# Print best model (based on lower bound on model evidence)
lower_bounds = [x.lower_bound_ for x in list_clf]  # Extract lower bounds on model evidence
idx = int(np.where(lower_bounds == np.max(lower_bounds))[0])  # Find best model
best_estimator = list_clf[idx]
print(f'Parameter setting of best model: {param_grid[idx]}')
print(f'Components weights: {best_estimator.weights_}')

# Plot data points and gaussian components
plt.figure(figsize=(8, 6))
ax = plt.subplot(2, 1, 1)
if best_estimator.weight_concentration_prior_type == 'dirichlet_process':
    prior_label = 'Dirichlet process'
elif best_estimator.weight_concentration_prior_type == 'dirichlet_distribution':
    prior_label = 'Dirichlet distribution'
plot_results(X, best_estimator.means_, best_estimator.covariances_,
             f'Best Bayesian GMM | {prior_label} prior')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.legend(fontsize='small')

# Plot histogram of weights
ax = plt.subplot(2, 1, 2)
for k, w in enumerate(best_estimator.weights_):
    plt.bar(k, w,
            width=0.9,
            color='#56B4E9',
            zorder=3,
            align='center',
            edgecolor='black')
    plt.text(k, w + 0.01, "%.1f%%" % (w * 100.),
             horizontalalignment='center')
ax.get_xaxis().set_tick_params(direction='out')
ax.yaxis.grid(True, alpha=0.7)
plt.xticks(range(len(best_estimator.weights_)))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)
plt.ylabel('Component weight')
plt.ylim(0, np.max(best_estimator.weights_) + 0.25*np.max(best_estimator.weights_))
plt.yticks(())
plt.savefig('bgmm_clustering.png')
plt.show()
plt.close()
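One way to make the question about active components concrete, following the weight-threshold idea raised above, is to count the components whose fitted weights exceed some cutoff. A minimal sketch, continuing from the best_estimator fitted in the code above (the 0.01 cutoff is an arbitrary, hypothetical choice, not an sklearn default):

# Count components whose mixture weights exceed an (arbitrary) threshold;
# uses best_estimator and np from the code above.
weight_threshold = 0.01
active = best_estimator.weights_ > weight_threshold
n_active = int(np.sum(active))
print(f'Active components (weight > {weight_threshold}): {n_active}')
print(f'Active component means: {best_estimator.means_[active].ravel()}')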

scikit learn: how to check coefficients significance

I tried to fit a logistic regression with scikit-learn on a rather large dataset with ~600 dummy variables and only a few interval variables (and 300K rows), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from Statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
If you still want to stick to scikit-learn's LogisticRegression, you can use an asymptotic approximation to the distribution of the maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """ Calculate z-scores for scikit-learn LogisticRegression.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x:     matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.matrix(np.insert(np.array(x), 0, 1, axis=1))
    ans = np.zeros((m, m))
    for i in range(n):
        ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(np.matrix(ans))
    se = np.sqrt(np.diag(vcov))
    t = coefs / se
    p = (1 - norm.cdf(abs(t))) * 2
    return p

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.
