Why does numpy vectorization not improve the speed of my code?

Here is the original code that does not use vectorize
import tensorflow as tf
import time
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    path='mnist.npz'
)
x_train = x_train.reshape(60000, -1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x_train)
pca = PCA(n_components=16)  # or 12 -> 3, 4 filter_size=3
X_pca = pca.fit_transform(X_scaled).reshape(60000, 4, 4, 1)
start = time.time()
X_pca_zero = X_pca[0]
for i in range(1, 60000):
    X_pca_expanded = X_pca[i]
    print(tf.image.ssim(X_pca_zero, X_pca_expanded, 255, filter_size=4))
print(time.time() - start)
It is essentially comparing the similarity between a reference image and a set of images. I feel it can be sped up by vectorization (so as to avoid the time wasted by the for loop). Therefore, I used the numpy vectorize function:
import numpy as np
import tensorflow as tf
import time
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    path='mnist.npz'
)

def my_func(x_train):
    x_train = x_train.reshape(60000, -1)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(x_train)
    pca = PCA(n_components=16)  # or 12 -> 3, 4 filter_size=3
    X_pca = pca.fit_transform(X_scaled).reshape(60000, 4, 4, 1)
    start = time.time()
    X_pca_zero = X_pca[0]
    for i in range(1, 6000):
        X_pca_expanded = X_pca[i]
        print(tf.image.ssim(X_pca_zero, X_pca_expanded, 255, filter_size=4))
    print(time.time() - start)
    return 0

np.vectorize(my_func(x_train))
However, there doesn't seem to be any speed improvements.

In Python, function-call arguments are evaluated before they are passed to the function. np.vectorize is a Python function that expects a function as its argument and returns another function.
np.vectorize(my_func(x_train))
runs my_func(x_train) first and passes its result to vectorize. That argument evaluation does all the prints and timing and returns 0. I doubted whether np.vectorize(0) would even run, but:
In [194]: np.vectorize(0)
Out[194]: <numpy.vectorize at 0x7ff091c36310>
So it does run without error, but it does nothing useful. All the timing happens before anything is passed to np.vectorize.
I suspect you read about the magic of "vectorization" and tried to use a similarly named function without reading its docs. Not only did you miss the performance disclaimer, you also didn't learn how to use it (it does have its uses). It is not some sort of "compiler" or "vectorizing magic".
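For reference, a minimal sketch of how np.vectorize is actually meant to be used: pass it the function itself, then call the returned wrapper on an array (the function and data here are made up for illustration):

import numpy as np

def add_one(x):
    # a scalar function; np.vectorize applies it elementwise
    return x + 1

vec_add_one = np.vectorize(add_one)      # pass the function, not its result
print(vec_add_one(np.array([1, 2, 3])))  # [2 3 4]

Even used correctly, np.vectorize is just a convenience wrapper around a Python-level loop, so it would not make the ssim calls any faster.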
Most of your code is sklearn, while the only thing you time is
X_pca_zero = X_pca[0]
for i in range(1, 60000):
    X_pca_expanded = X_pca[i]
    print(tf.image.ssim(X_pca_zero, X_pca_expanded, 255, filter_size=4))
I'm not familiar with what tf.image.ssim does, but it is a tensorflow function and may be complex. Running it 6000 times is bound to take noticeable time. If it doesn't let you provide all the X_pca values at once (as opposed to one by one), there's nothing you can do to speed it up.
I don't know what it returns, but usually we don't include a print inside a timing block; each print call adds to the measured time.
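That said, tf.image.ssim accepts batched 4-D inputs, so as a hedged sketch (assuming the reshaped X_pca from the question, and that broadcasting the reference over the batch is acceptable) the loop could be replaced by a single batched call:

import time
import tensorflow as tf

# Broadcast the reference image across the batch of remaining images,
# then compute all SSIM scores in one call instead of 59,999 calls.
start = time.time()
ref = tf.broadcast_to(X_pca[0:1], X_pca[1:].shape)          # [59999, 4, 4, 1]
scores = tf.image.ssim(ref, X_pca[1:], 255, filter_size=4)  # one score per image
print(scores.shape, time.time() - start)

Whether this actually helps depends on the tensorflow version; at minimum it removes the per-iteration Python and print overhead.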


How to get the threshold from a specific precision and recall

I'm trying to get the threshold for a specific precision and recall. Let's say I want to get the threshold at a precision of 60% and a recall of 40%. Is there any straightforward way to do it using the sklearn package?
precision, recall, threshold = precision_recall_curve(y_val, y_e)
df_pr = pd.DataFrame()
df_pr['precision'] = precision
df_pr['recall'] = recall
df_pr['threshold'] = list(threshold) + [1]
   precision    recall  threshold
0   0.247543  1.000000   0.059483
1   0.247486  0.999692   0.059489
2   0.247504  0.999692   0.059512
3   0.247523  0.999692   0.059542
Provided that I've properly understood your question, imo the point to highlight is that precision and recall are not necessarily coupled as you seem to imply. Here's a toy example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_scores = lr.predict_proba(X_test)
precision, recall, threshold = precision_recall_curve(y_test, y_scores[:, 1])
plt.plot(threshold, precision[:-1], 'b--', label='Precision')
plt.plot(threshold, recall[:-1], 'r--', label='Recall')
plt.xlabel('Threshold')
plt.legend(loc='lower left')
plt.ylim([0,1])
This said, the problem becomes something you can easily solve either with numpy or pandas, depending on your "setting". For instance, here's a toy function returning precision, recall and threshold at the index where the condition is attained.
def prt(arr, value):
    array = np.asarray(arr)
    idx = np.where(array[:-1] == value)[0][0]
    return precision[idx], recall[idx], threshold[idx]
prt(precision, 0.6)  # I checked ex-ante that precision=0.6 is attained; otherwise you'll have to go with something custom
(0.6, 0.9622641509433962, 0.052229434776723364)
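Since exact float equality is brittle, a variant of the same idea (my addition, not part of the original answer) picks the index where precision is closest to the target:

import numpy as np

def nearest_prt(target):
    # index of the precision value closest to the target; avoids the
    # exact-equality check, which can easily fail for floats
    idx = int(np.argmin(np.abs(precision[:-1] - target)))
    return precision[idx], recall[idx], threshold[idx]

nearest_prt(0.6)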
Otherwise, to resemble your setting with a pandas DataFrame:
df = pd.DataFrame()
df['precision'] = precision[:-1]
df['recall'] = recall[:-1]
df['threshold'] = threshold
df[df.loc[:, 'precision'] == 0.6]
I would also suggest the posts "sklearn precision_recall_curve and threshold", which tries to explain how precision_recall_curve() works under the hood, and "Why does precision_recall_curve() return different values than confusion matrix?", which might be somehow related.

Inverse Feature Scaling not working while predicting results

# Importing required libraries
import numpy as np
import pandas as pd
# Importing dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1: -1].values
y = dataset.iloc[:, -1].values
y = y.reshape(len(y), 1)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scy = StandardScaler()
scX = StandardScaler()
X = scX.fit_transform(X)
y = scy.fit_transform(y)
# Training SVR model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting results from SVR model
# this line is generating error
scy.inverse_transform(regressor.predict(scX.transform([[6.5]])))
I am trying to execute this code to predict values from the model but after running it I am getting errors like this:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.27861589].
Reshape your data either using an array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
My instructor is using the same code and his works, but mine doesn't. I am new to machine learning; can anybody tell me what I am doing wrong here? Thanks for your help.
It is because of the shape of your predictions: scy expects an input with shape (-1, 1). Change your last line to this:
scy.inverse_transform([regressor.predict(scX.transform([[6.5]]))])
You can also use this line to predict:
pred = regressor.predict(scX.transform([[6.5]]))
pred = pred.reshape(-1, 1)
scy.inverse_transform(pred)
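To see the shape requirement in isolation, here is a minimal sketch with toy numbers (not the poster's data); in recent scikit-learn versions StandardScaler.inverse_transform rejects 1-D input, which is exactly what regressor.predict returns:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))
flat = np.array([-0.27861589])    # 1-D, like the output of predict
# scaler.inverse_transform(flat)  # raises the ValueError shown above
print(scaler.inverse_transform(flat.reshape(-1, 1)))  # 2-D input works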

probability difference between categorical target and one-hot encoding target using OneVsRestClassifier

A bit confused about the probability difference between a categorical target and a one-hot encoded target with OneVsRestClassifier from sklearn. Using the iris data with a simple logistic regression as an example: when I use the original iris classes [0, 1, 2], the OneVsRestClassifier() probabilities for each observation always add up to 1. However, if I convert the target to dummies, this is not the case. I understand that OneVsRestClassifier() compares one vs rest (class 0 vs non-class 0, class 1 vs non-class 1, etc.), so it makes more sense that the sum of these probabilities has no relation to 1. Then why do I see the difference, and how does it arise?
import numpy as np
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
np.set_printoptions(suppress=True)
iris = datasets.load_iris()
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
X = iris.data[perm]
y = iris.target[perm]
# categorical target with no conversion
X_train, y_train1 = X[:80], y[:80]
X_test, y_test1 = X[80:], y[80:]
m3 = LogisticRegression(random_state=0)
clf1 = OneVsRestClassifier(m3).fit(X_train, y_train1)
y_pred1 = clf1.predict(X_test)
print(np.sum(y_pred1 == y_test1))
y_prob1 = clf1.predict_proba(X_test)
y_prob1[:5]
#output
array([[0.00014508, 0.17238549, 0.82746943],
       [0.03850173, 0.79646817, 0.1650301 ],
       [0.73981106, 0.26018067, 0.00000827],
       [0.00016332, 0.32231163, 0.67752505],
       [0.00029197, 0.2495404 , 0.75016763]])
# one hot encoding for categorical target
y2 = pd.get_dummies(y)
y_train2 = y2[:80]
y_test2 = y2[80:]
clf2 = OneVsRestClassifier(m3).fit(X_train, y_train2)
y_pred2 = clf2.predict(X_test)
y_prob2 = clf2.predict_proba(X_test)
y_prob2[:5]
#output
array([[0.00017194, 0.20430011, 0.98066319],
       [0.02152246, 0.44522562, 0.09225181],
       [0.96277892, 0.3385952 , 0.00001076],
       [0.00023024, 0.45436925, 0.95512082],
       [0.00036849, 0.31493725, 0.94676348]])
When you one-hot encode the targets, sklearn interprets your problem as multilabel rather than just multiclass; that is, it is possible for a point to have more than one true label. In that case, it is perfectly acceptable for the total sum of probabilities to be greater (or less) than 1. That's generally true for sklearn, but OneVsRestClassifier calls it out specifically in the docstring:
OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit.
As for the first approach, there are indeed three independent models, but the predictions are normalized; see the source code. Indeed, that's the only difference:
(y_prob2 / y_prob2.sum(axis=1)[:, None] == y_prob1).all()
# output
True
It's probably worth pointing out that LogisticRegression also natively supports multiclass. In that case the weights for each class are independent, so it's similar to three separate models, but the resulting probabilities come from a softmax application, and the loss function minimizes the loss for each class simultaneously, so the resulting coefficients (and hence predictions) can differ from those obtained from OneVsRestClassifier:
m3.fit(X_train, y_train1)
y_prob0 = m3.predict_proba(X_test)
y_prob0[:5]
# output:
array([[0.00000494, 0.01381671, 0.98617835],
       [0.02569699, 0.88835451, 0.0859485 ],
       [0.95239985, 0.04759984, 0.00000031],
       [0.00001338, 0.04195642, 0.9580302 ],
       [0.00002815, 0.04230022, 0.95767163]])
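As a hedged cross-check (my addition; it assumes the default multi_class setting resolves to the multinomial formulation here), these probabilities are just the softmax of the model's per-class logits:

import numpy as np
from scipy.special import softmax

# For multinomial logistic regression, predict_proba is a softmax over
# the per-class scores returned by decision_function.
logits = m3.decision_function(X_test)
print(np.allclose(softmax(logits, axis=1), m3.predict_proba(X_test)))  # True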

Fewer than expected purity scores in PCA analysis

I'm trying to plot a line graph of purity scores against the captured variances in PCA. The goal is to plot the line graph for the captured variances of 89% and 99% only. In my code, when the number of components/dimensions is 2 it captures 89% of the variance, and when it is 4 it captures 99%.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans  # needed for KMeans below
from sklearn import metrics         # needed by purity_score below

df = pd.read_csv("clustering.csv")
X10_df = df.drop("Class", axis=1)  # feature matrix
Y10_df = df["Class"]               # target vector
X10_df = np.array(X10_df)
Y10_df = np.array(Y10_df)
scaler = StandardScaler()  # standardizing the data
df_std = scaler.fit_transform(X10_df)
pca = PCA()
pca.fit(df_std)
purity = []
n_comp = range(2, 5)
for k in n_comp:
    pca = PCA(n_components=k)
    pca.fit(df_std)
    scores_pca = pca.transform(df_std)
    kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
    pred_y12 = kmeans_pca.fit_predict(scores_pca)
    purity13 = purity_score(Y10_df, pred_y12)
    purity.append(purity13)
The function below calculates the purity score:
def purity_score(y_true, y_pred):
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
However, while I have four variance scores, I only have three purity scores. I expected four purity scores so that I could plot variance vs purity. Why are there only three purity scores?
Here is the link to my dataset file : https://gofile.io/d/3CgFTi
This is simply because when you use a for loop with a range, the stop value of the range is excluded. So range(2, 5) yields 2, 3, 4 and then exits the loop. Please read up on for loops in Python.
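A minimal illustration: if four scores for k = 2 through 5 are what you want, the stop value needs to be one past the last k.

list(range(2, 5))   # stop value excluded: only three iterations
# [2, 3, 4]
list(range(2, 6))   # four iterations, k = 2, 3, 4, 5
# [2, 3, 4, 5]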

Padding sequences in tensorflow with tf.pad

I am trying to load the imdb dataset in Python. I want to pad the sequences so that each sequence is the same length. I am currently doing it with numpy. What is a good way to do it in tensorflow with tf.pad? I saw the example given here, but I don't know how to apply it to a 2-D matrix.
Here is my current code
import numpy as np
import tensorflow as tf
from keras.datasets import imdb

max_features = 5000
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

def padSequence(dataset, max_length):
    dataset_p = []
    for x in dataset:
        if len(x) <= max_length:
            dataset_p.append(np.pad(x, pad_width=(0, max_length - len(x)), mode='constant', constant_values=0))
        else:
            dataset_p.append(x[0:max_length])
    return np.array(dataset_p)  # was np.array(x_train_p), a bug

max_length = max(len(x) for x in x_train)
x_train_p = padSequence(x_train, max_length)
x_test_p = padSequence(x_test, max_length)
print("input x shape: ", x_train_p.shape)
Can someone please help? I am using tensorflow 1.0.
In response to the comment:
The padding dimensions are given by
# 'paddings' is [[1, 1,], [2, 2]].
I have a 2-D matrix where every row has a different length. I want to pad the rows to make them equal length. In my padSequence(dataset, max_length) function, I get the length of every row with the len(x) function. Should I just do the same with tf? Or is there a way to do it like the Keras function
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
If you want to use tf.pad, as far as I know you have to iterate over each row. The code will be something like this:
max_length = 250
number_of_samples = 5
padded_data = np.ndarray(shape=[number_of_samples, max_length], dtype=np.int32)
sess = tf.InteractiveSession()
for i in range(number_of_samples):
    reviewToBePadded = dataSet[i]  # dataSet is your numpy array of sequences
    paddings = [[0, 0], [0, max_length - len(reviewToBePadded)]]  # was maxLength, a typo
    data_tf = tf.convert_to_tensor(reviewToBePadded, tf.int32)
    data_tf = tf.reshape(data_tf, [1, len(reviewToBePadded)])
    data_tf = tf.pad(data_tf, paddings, 'CONSTANT')
    padded_data[i] = data_tf.eval()
print(padded_data)
sess.close()
I'm new to Python, so this is possibly not the best code, but I just want to explain the concept.
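For comparison, a minimal sketch of the Keras helper mentioned in the question (assuming a Keras version where keras.preprocessing.sequence is available), which pads a whole ragged list in one call:

from keras.preprocessing.sequence import pad_sequences

ragged = [[1, 2, 3], [4, 5], [6]]                         # toy sequences of unequal length
padded = pad_sequences(ragged, maxlen=4, padding='post')  # pad zeros at the end
print(padded)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 0 0 0]]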
