How to get KMeans's inertia_ value after using Pipeline - scikit-learn

I want to combine StandardScaler() and KMeans() using a Pipeline and also check the KMeans inertia_, because I want to determine which number of clusters is best.
The code is as follows:
ks = range(3, 5)
inertias = []
inertias_temp = 9999.0
for k in ks:
    scaler = StandardScaler()
    kmeans = KMeans(n_clusters=k, random_state=rng)
    pipeline = make_pipeline(scaler, kmeans)
    pipeline.fit(X_pca)
    labels = pipeline.predict(X_pca)
    np.round(kmeans.cluster_centers_, decimals=3)
    inertias.append(kmeans.inertia_)
    if kmeans.inertia_ < inertias_temp:
        n_clusters_min = k
        kmeans_min = kmeans
        inertias_temp = kmeans.inertia_
However, I suspect that the value of kmeans.inertia_ may not be correct, because I think it should be obtained after pipeline.predict(), and I have no way to get this value after pipeline.predict(). Can anyone help me with this?

It is possible to read the inertia of the fitted KMeans step directly from the make_pipeline instance, and it is not necessary to call .predict() first: inertia_ is already set when the pipeline is fitted. To access the inertia value in your case, you can type:
pipeline.named_steps['kmeans'].inertia_
And then process it as you like!
Moreover, I had some free time, so I rewrote the code for you a little bit to make it more interesting:
scaler = StandardScaler()
cluster = KMeans(random_state=1337)
pipe = make_pipeline(scaler, cluster)

centroids = []
inertias = []
min_ks = []
inertia_temp = 9999.0
for k in range(3, 5):
    # make_pipeline names each step after its class, so the KMeans step is 'kmeans'
    pipe.set_params(kmeans__n_clusters=k)
    pipe.fit(X_pca)
    centroid = pipe.named_steps['kmeans'].cluster_centers_
    inertia = pipe.named_steps['kmeans'].inertia_
    centroids.append(centroid)
    inertias.append(inertia)
    if inertia < inertia_temp:
        min_ks.append(k)
        inertia_temp = inertia
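Since the goal is to choose the best number of clusters, a common follow-up is to plot the collected inertias against k (the elbow heuristic). A minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

ks = list(range(3, 5))
plt.plot(ks, inertias, marker='o')   # inertia for each k collected in the loop above
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()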
Thank you for the question!

Related

Multiply every element of matrix with a vector to obtain a matrix whose elements are vectors themselves

I need help in speeding up the following block of code:
import numpy as np

x = 100
pp = np.zeros((x, x))
M = np.ones((x, x))
arrayA = np.random.uniform(0, 5, 2000)
arrayB = np.random.uniform(0, 5, 2000)

for i in range(x):
    for j in range(x):
        y = np.multiply(arrayA, np.exp(-1j*(M[j, i])*arrayB))
        p = np.trapz(y, arrayB)  # numerical evaluation/integration of y
        pp[j, i] = abs(p**2)
Is there a numpy function or another method with which this piece of code can be rewritten so that the nested for-loops can be omitted? My idea would be a function that multiplies every element of M with the vector arrayB, so we get a 100 x 100 matrix in which each element is a vector itself. Then each of those vectors gets multiplied by arrayA with the np.multiply() function, again giving a 100 x 100 matrix in which each element is a vector. Finally, numerical integration with np.trapz() would be performed for each of those vectors to obtain a 100 x 100 matrix in which each element is a scalar.
My problem, though, is that I don't know which functions would perform this.
Thanks in advance for your help!
Edit:
Using broadcasting with
M = np.asarray(M)[..., None]
y = 1000*arrayA*np.exp(-1j*M*arrayB)
return np.trapz(y,B)
works and I can omit the for-loops. However, this is not faster but a little bit slower in my case. This might be a memory issue.
y = np.multiply(arrayA, np.exp(-1j*(M[j,i])*arrayB))
can be written as
y = arrayA * np.exp(-1j*M[:,:,None]*arrayB)
producing a (x, x, 2000) array.
The next step may need adjustment. I'm not very familiar with np.trapz, but it integrates along the last axis by default, so
np.trapz(y, arrayB)
should then reduce this to the desired (x, x) result.
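Putting the pieces together, a minimal self-contained sketch of the vectorized version (using the same shapes as in the question):

import numpy as np

x = 100
M = np.ones((x, x))
arrayA = np.random.uniform(0, 5, 2000)
arrayB = np.random.uniform(0, 5, 2000)

# broadcast M as (x, x, 1) against arrayB (2000,) -> y has shape (x, x, 2000)
y = arrayA * np.exp(-1j * M[:, :, None] * arrayB)

# np.trapz integrates along the last axis by default, giving an (x, x) result
pp = np.abs(np.trapz(y, arrayB) ** 2)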

Get feature importances for dictionary of dataframes

I'm currently working on a use case using RandomForestRegressor. To get training and test data separately based on one column, let's say Home, the dataframe was split into a dictionary. I'm almost done with the modelling, but stuck on getting the feature importances for each of the keys in the dictionary (number of keys = 21). Please have a look at the code below:
hp = pd.get_dummies(hp)
hp = {i: g for i, g in hp.set_index(["Home"]).groupby(level = [0])}

feature = {}; feature_train = {}; feature_test = {}
target = {}; target_train = {}; target_test = {}; target_pred = {}
importances = {}

for k, v in hp.items():
    target[k] = np.array(v["HP"])
    feature[k] = v.drop(["HP", "Corr"], axis = 1)
    feature_list = list(feature[1].columns)

for k, v in zip(feature, target):
    feature[k] = np.array(feature[v])

for k, v in zip(feature_train, target_train):
    feature_train[k], feature_test[k], target_train[k], target_test[k] = train_test_split(
        feature[v], target[v], test_size = 0.25, random_state = 42)
What I've tried, after help from Random Forest Feature Importance Chart using Python, is:
for name, importance in zip(feature_list, list(rf.feature_importances_)):
    print(name, "=", importance)
but this prints the importances for only one of the dictionary keys (and I don't know which one). What I want is to get them printed for all the keys in the dictionary "importances". Thanks in advance!
If I understand you correctly, you want feature importances for both the train and test data.
That's not how it works: the random forest is first built from your training data, and only after that can it calculate the importance of each feature, based on how many times the feature was used to split the space (and how good those splits were, e.g. how low the Gini impurity was, averaged over many trees of course).
So you obtain feature importances from the training data; for the test data, the learned forest is only used to predict values.
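To actually collect the importances per key, here is a minimal sketch. It assumes one RandomForestRegressor is fitted per dictionary key on that key's training split, and that feature_train, target_train and feature_list from the question are populated as intended:

from sklearn.ensemble import RandomForestRegressor

importances = {}
for k in feature_train:
    # one forest per key, fitted on that key's training split only
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(feature_train[k], target_train[k])
    # map column names to their importance for this key
    importances[k] = dict(zip(feature_list, rf.feature_importances_))

for k, imp in importances.items():
    print("Home =", k)
    for name, value in sorted(imp.items(), key=lambda t: t[1], reverse=True):
        print("   ", name, "=", round(value, 4))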

Making a randomly generated 2d map in python is taking too long to process all of the map generation

import random
l = "lava"
d = "dessert"
f = "forest"
v = "village"
s = "sect"
w = "water"
c = "city"
m = "mountains"
p = "plains"
t = "swamp"
map_list = [l, d, f, v, s, w, c, m, p, t]
map = []
for i in range(50):
    map.append([])

def rdm_map(x):
    for i in range(50):
        map[x].append(random.choice(map_list))

def map_create():
    x = 0
    while x <= len(map):
        rdm_map(x)
        x + 1

map_create()
print(map[2][1])
I'm not getting any output, not even an error. I'm trying to create a randomly generated game map of decent size, but when I ran it nothing happened. I'm thinking that since my computer isn't that great, it's just taking way too long to process, but I wanted to post it here to double-check. If that is the issue, is there a way to lessen the load without reducing the map size?
You have the following bugs:
Inside map_create you must change x + 1 to x += 1. Because x is never incremented, your script runs forever.
After that, you should change while x <= len(map): to while x < len(map):. If you keep the previous version, you will get an IndexError once x reaches len(map).
In any case, your code can be further improved. Please try to read some pages of the tutorial first.
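For reference, here is a corrected sketch of the script with both fixes applied (renaming map to game_map so the built-in name is not shadowed; the terrain strings are kept as in the question):

import random

map_list = ["lava", "dessert", "forest", "village", "sect",
            "water", "city", "mountains", "plains", "swamp"]

# 50 empty rows
game_map = [[] for _ in range(50)]

def rdm_map(x):
    # fill row x with 50 random terrain tiles
    for _ in range(50):
        game_map[x].append(random.choice(map_list))

def map_create():
    x = 0
    while x < len(game_map):   # < keeps x a valid row index
        rdm_map(x)
        x += 1                 # advance the loop

map_create()
print(game_map[2][1])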

Incorporating uncertainty into a pymc3 model

I have a set of data for which I have the mean, standard deviation and number of observations for each point (i.e., I have knowledge regarding the accuracy of the measure). In a traditional pymc3 model where I look only at the means, I may do something along the lines of:
x = data['mean']

with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    y = a + b*x
    eps = pm.HalfNormal('eps', sd=1)
    likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=x)
What is the best way to incorporate the information regarding the variance of the observations into the model? Obviously the result should weight low-variance observations more heavily than high-variance (less certain) observations.
One approach a statistician suggested was to do the following:
x = data['mean']   # mean of observation
x_sd = data['sd']  # sd of observation
x_n = data['n']    # number of measures for observation
x_sem = x_sd/np.sqrt(x_n)

with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    y = a + b*x
    eps = pm.HalfNormal('eps', sd=1)
    obs = pm.Normal('obs', mu=x, sd=x_sem, shape=len(x))
    likelihood = pm.Normal('likelihood', mu=y, sd=eps, observed=obs)
However, when I run this I get:
TypeError: observed needs to be data but got: <class 'pymc3.model.FreeRV'>
I am running the master branch of pymc3 (3.0 has some performance issues resulting in very slow sample times).
You are close; you just need to make some small changes. The main point is that for PyMC3, observed data is always constant. Check the following code:
with pm.Model() as m:
    a = pm.Normal('a', mu=0, sd=1)
    b = pm.Normal('b', mu=1, sd=1)
    mu = a + b*x
    mu_est = pm.Normal('mu_est', mu, x_sem, shape=len(x))
    likelihood = pm.Normal('likelihood', mu=mu_est, sd=x_sd, observed=x)
Notice that I keep the data fixed and introduce the observed uncertainty at two points: in the estimation of mu_est and in the likelihood. Of course, you are free not to use x_sem or x_sd and to estimate them instead, as you did in your code with the variable eps.
On a historical note, code with "random data" used to work in PyMC3 (at least for some models), but since it was not really designed to work that way, the developers decided to prevent users from passing random data as observed, which explains the message you got.
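For completeness, a minimal sketch of how the model above might be sampled and summarised (the sampler settings here are just illustrative):

with m:
    trace = pm.sample(2000, tune=1000, random_seed=42)

# posterior summaries for a, b, mu_est
print(pm.summary(trace))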

How to add a confusion matrix to Theano examples?

I want to make use of Theano's logistic regression classifier, but I would like to make an apples-to-apples comparison with previous studies I've done, to see how deep learning stacks up. I recognize this would probably be a fairly simple task if I were more proficient in Theano, but this is what I have so far. From the tutorials on the website, I have the following code:
def errors(self, y):
    # check if y has same dimension of y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
I'm pretty sure this is where I need to add the functionality, but I'm not certain how to go about it. What I need is either access to y_pred and y for each and every run (to update my confusion matrix in python) or to have the C++ code handle the confusion matrix and return it at some point along the way. I don't think I can do the former, and I'm unsure how to do the latter. I've done some messing around with an update function along the lines of:
def confuMat(self, y):
    x = T.vector('x')
    classes = T.scalar('n_classes')
    onehot = T.eq(x.dimshuffle(0, 'x'), T.arange(classes).dimshuffle('x', 0))
    oneHot = theano.function([x, classes], onehot)
    yMat = T.matrix('y')
    yPredMat = T.matrix('y_pred')
    confMat = T.dot(yMat.T, yPredMat)
    confusionMatrix = theano.function(inputs=[yMat, yPredMat], outputs=confMat)

    def confusion_matrix(x, y, n_class):
        return confusionMatrix(oneHot(x, n_class), oneHot(y, n_class))

    t = np.asarray(confusion_matrix(y, self.y_pred, self.n_out))
    print(t)
But I'm not completely clear on how to get this to interface with the function in question and give me a numpy array I can work with.
I'm quite new to Theano, so hopefully this is an easy fix for one of you. I'd like to use this classifier as my output layer in a number of configurations, so that I can use the confusion matrix with other architectures.
I suggest a brute-force sort of approach. You need an output for the predictions first; create a function for it:
prediction = theano.function(
    inputs=[index],
    outputs=MLPlayers.predicts,
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size]})
In your test loop, gather the predictions...
# labels, predictions (lists) and wrong (int) are assumed to be initialised earlier
labels = labels + test_set_y.eval().tolist()
for mini_batch in xrange(n_test_batches):
    wrong = wrong + int(test_model(mini_batch))
    predictions = predictions + prediction(mini_batch).tolist()
Now create the confusion matrix this way:
correct = 0
confusion = numpy.zeros((outs, outs), dtype=int)
for index in xrange(len(predictions)):
    if labels[index] == predictions[index]:
        correct = correct + 1
    confusion[int(predictions[index]), int(labels[index])] += 1
You can find this kind of an implementation in this repository.
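Once labels and predictions are plain lists of integer class ids, the same matrix can also be built by a small standalone helper outside of Theano. A sketch (sklearn.metrics.confusion_matrix is an alternative, though it puts true labels on the rows rather than predictions):

import numpy as np

def build_confusion(labels, predictions, n_classes):
    # rows: predicted class, columns: true class (same convention as the loop above)
    conf = np.zeros((n_classes, n_classes), dtype=int)
    for true, pred in zip(labels, predictions):
        conf[int(pred), int(true)] += 1
    return conf

# conf = build_confusion(labels, predictions, outs)
# accuracy = np.trace(conf) / float(conf.sum())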
