machine learning from sklearn - python-3.x

I am learning the sklearn module and how to split data.
I followed this instruction code:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

num_test = len(newsgroups_test.target)
test_data, test_labels = int(newsgroups_test.data[num_test/2:]), int(newsgroups_test.target[num_test/2:])
dev_data, dev_labels = int(newsgroups_test.data[:num_test/2]), int(newsgroups_test.target[:num_test/2])
train_data, train_labels = int(newsgroups_train.data), int(newsgroups_train.target)

print('training label shape:', train_labels.shape)
print('test label shape:', test_labels.shape)
print('dev label shape:', dev_labels.shape)
print('labels names:', newsgroups_train.target_names)
But I got an error like this:
TypeError Traceback (most recent call last)
in ()
8
9 num_test = len(newsgroups_test.target)
---> 10 test_data, test_labels = int(newsgroups_test.data[num_test/2:]), int(newsgroups_test.target[num_test/2:])
11 dev_data, dev_labels = int(newsgroups_test.data[:num_test/2]), int(newsgroups_test.target[:num_test/2])
12 train_data, train_labels = int(newsgroups_train.data), int(newsgroups_train.target)
TypeError: slice indices must be integers or None or have an index method
Not sure what's wrong.
Thanks guys

Although I'm not very familiar with scikit-learn's data loaders, your error is probably unrelated to them if you are using Python 3. You need integer division, because slice indices must be integers. Use the floor-division operator //, which returns an integer when both arguments are integers (it is essentially math.floor(a/b)). In Python 3, the ordinary division operator / returns a float even when both arguments are integers.
Try to change
num_test/2
to
num_test//2
Example:
newsgroups_test.target[num_test//2:]
The // operator is also available in Python 2 (from version 2.2 onward).
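As a rough sketch, the corrected split might look like this (keeping the question's structure; note that the int(...) wrappers from the question are also unnecessary, since slicing already returns lists/arrays):

num_test = len(newsgroups_test.target)
half = num_test // 2  # integer index, safe to use in a slice

test_data, test_labels = newsgroups_test.data[half:], newsgroups_test.target[half:]
dev_data, dev_labels = newsgroups_test.data[:half], newsgroups_test.target[:half]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target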

Related

TypeError: only integer scalar arrays can be converted to a scalar index (object detection)

I am struggling with this one part and am not sure how to fix it. It would be great if someone could tell me what I need to fix in the code. Below are the code and the error message I'm receiving.
This is the code:
categoriesList=["airplane","automobile","bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
import matplotlib.pyplot as plt
import random
def plotImages(x_test, images_arr, labels_arr, n_images=8):
fig, axes = plt.subplots(n_images, n_images, figsize=(9,9))
axes = axes.flatten()
for i in range(100):
rand = random.randint(0, x_test.shape[0] -1)
img = images_arr[rand]
ax = axes[i]
ax.imshow( img, cmap="Greys_r")
ax.set_xticks(())
ax.set_yticks(())
sample = x_test[rand].reshape((1,32,32,3))
predict_x = model2000.predict(sample)
label=categoriesList[predict_x[0]]
if labels_arr[rand][predictions[0]] == 0:
ax.set_title(label, fontsize=18 - n_images, color="red")
else:
ax.set_title(label, fontsize=18 - n_images)
plot = plt.tight_layout()
return plot
display (plotImages(x_test, data_test_picture, y_test, n_images=10))
This is the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-2104322840429397> in <module>
28 return plot
29
---> 30 display (plotImages(x_test, data_test_picture, y_test, n_images=10))
<command-2104322840429397> in plotImages(x_test, images_arr, labels_arr, n_images)
18 sample = x_test[rand].reshape((1,32,32,3))
19 predict_x = model2000.predict(sample)
---> 20 label=categoriesList[predict_x[0]]
21
22 if labels_arr[rand][predictions[0]] == 0:
TypeError: only integer scalar arrays can be converted to a scalar index
To fix the "only integer scalar arrays can be converted to a scalar index" error:
Concatenate arrays by passing a list
Here we have two arrays; pass them to numpy.concatenate() as a list, e.g. numpy.concatenate([ar1, ar2]).
import numpy

# Create 2 different arrays
ar1 = numpy.array(['Apple', 'Orange', 'Banana', 'Pineapple', 'Grapes'])
ar2 = numpy.array(['Onion', 'Potato'])

# Concatenate arrays ar1 & ar2 using numpy.concatenate()
ar3 = numpy.concatenate([ar1, ar2])
print(ar3)

# Output
['Apple' 'Orange' 'Banana' 'Pineapple' 'Grapes' 'Onion' 'Potato']
Concatenate arrays by passing a tuple
Alternatively, pass array 1 and array 2 to numpy.concatenate() as a tuple, e.g. numpy.concatenate((ar1, ar2)).
import numpy

# Create 2 different arrays
ar1 = numpy.array(['Apple', 'Orange', 'Banana', 'Pineapple', 'Grapes'])
ar2 = numpy.array(['Onion', 'Potato'])

# Concatenate arrays ar1 & ar2 using numpy.concatenate()
ar3 = numpy.concatenate((ar1, ar2))
print(ar3)

# Output
['Apple' 'Orange' 'Banana' 'Pineapple' 'Grapes' 'Onion' 'Potato']
If you index a plain Python list with a NumPy integer array, you will see the same error. To overcome this, convert the ordinary list into a NumPy array and then perform the required indexing operation.
categoriesList=numpy.array(["airplane","automobile","bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"])
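As a minimal sketch of why this helps (the index array below is hypothetical, standing in for whatever model2000.predict returns):

import numpy as np

categoriesList = ["airplane", "automobile", "bird", "cat", "deer"]
idx = np.array([2])  # a NumPy integer array, e.g. a predicted class index

# categoriesList[idx]                  # TypeError: only integer scalar arrays ...
label = np.array(categoriesList)[idx]  # fancy indexing on a NumPy array works
print(label)  # ['bird']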

How to predict unseen data?

Hi, I am practicing ML models and am facing an issue while trying to predict unseen data.
The error comes up while doing the one-hot encoding for the categorical data.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_x_1 = LabelEncoder() #will encode country
X[:,1] = labelencoder_x_1.fit_transform(X[:,1])
labelencoder_x_2 = LabelEncoder() #will encode Gender
X[:,2] = labelencoder_x_2.fit_transform(X[:,2])
onehotencoder_x = OneHotEncoder(categorical_features=[1])
X= onehotencoder_x.fit_transform(X).toarray()
X = X[:,1:]
My X has 11 columns, and columns 2 and 3 are categorical (Country and Gender).
The model runs fine, but when I try to test it against a random input it fails at the one-hot encoding step.
input = [[619], ['France'], ['Male'], [42], [2], [0.0], [1], [1], [1],[101348.88]]
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input= onehotencoder_x.fit_transform(input).toarray()
Error:
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:451:
DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20
and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-44-44a43edf17aa>", line 1, in <module>
input= onehotencoder_x.fit_transform(input).toarray()
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in
fit_transform
self._handle_deprecations(X)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in
_handle_deprecations
n_features = X.shape[1]
AttributeError: 'list' object has no attribute 'shape'
I believe this is because you have nested lists.
You should flatten your input list and use that for the prediction.
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input = [item for sublist in input for item in sublist]  # flatten the nested list
input = onehotencoder_x.fit_transform(input).toarray()
With a nested list, each inner one-element list is treated as a separate item going through the fit_transform function, so the data does not match the shape that fit_transform looks for, which is [1, 10] (1 row, 10 columns).
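A minimal sketch of the flattening idea, using a hypothetical already-encoded sample (the values below are made up for illustration):

import numpy as np

# a single sample written as nested one-element lists, as in the question
sample = [[619], [0], [1], [42], [2], [0.0], [1], [1], [1], [101348.88]]

# flatten the nested list, then reshape into the (1, n_features) layout
# that scikit-learn transformers and estimators expect for a single row
flat = [item for sublist in sample for item in sublist]
row = np.array(flat).reshape(1, -1)
print(row.shape)  # (1, 10)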

not able to use tf.metrics.recall

I am very new to TensorFlow and am just trying to understand how to use tf.metrics.recall.
I am doing the following:
true = tf.zeros([64, 1])
pred = tf.random_uniform([64, 1], -1.0, 1.0)

with tf.Session() as sess:
    t, p = sess.run([true, pred])
    # print(t)
    # print(p)
    rec, rec_op = tf.metrics.recall(labels=t, predictions=p)
    sess.run(rec_op, feed_dict={t: t, p: p})
    print(recall)
And that is giving me the following error:
TypeError Traceback (most recent call last)
<ipython-input-43-7245c92d724d> in <module>
25 # print(p)
26 rec, rec_op = tf.metrics.recall(labels=t, predictions=p)
---> 27 sess.run(rec_op,feed_dict={t: t,p: p})
28 print(recall)
TypeError: unhashable type: 'numpy.ndarray'
Please help me to understand this better.
Thank you in advance
The labels and predictions in your code are given the outputs of sess.run, which are numpy arrays rather than tensors (and numpy arrays cannot be used as feed_dict keys, which is why you get "unhashable type: 'numpy.ndarray'"). You can use numpy or your own implementation to calculate recall over those arrays if you wish; the benefit of using tf.metrics is that you can run everything uniformly in one go with just TensorFlow.
with tf.Session() as sess:
    rec, rec_op = tf.metrics.recall(labels=true, predictions=pred)
    sess.run(tf.local_variables_initializer())  # tf.metrics creates local variables
    batch_recall, _ = sess.run([rec, rec_op])   # no feed_dict needed: true and pred are graph tensors
    print(batch_recall)
Note that you pass the tensors true and pred, not the numpy arrays t and p, when constructing tf.metrics.recall.
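If you do want to compute recall directly over the numpy arrays instead, a minimal sketch could look like this (assuming predictions are thresholded at 0 to obtain binary labels, which is an assumption, not something the question specifies):

import numpy as np

t_bin = (t > 0).astype(int)  # ground-truth labels as 0/1
p_bin = (p > 0).astype(int)  # thresholded predictions as 0/1

true_positives = np.sum((p_bin == 1) & (t_bin == 1))
actual_positives = np.sum(t_bin == 1)
recall = true_positives / actual_positives if actual_positives else 0.0
print(recall)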

numpy code works in REPL, script says type error

Copy and pasting this code into the python3 REPL works, but when I run it as a script, I get a type error.
"""Softmax."""
scores = [3.0, 1.0, 0.2]
import numpy as np
from math import e
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
results = []
x = np.transpose(x)
for j in range(len(x)):
exps = [np.exp(s) for s in x[j]]
_sum = np.sum(np.exp(x[j]))
softmax = [i / _sum for i in exps]
results.append(softmax)
final = np.vstack(results)
return np.transpose(final)
# pass # TODO: Compute and return softmax(x)
print(softmax(scores))
# Plot softmax curves
import matplotlib.pyplot as plt
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
plt.plot(x, softmax(scores).T, linewidth=2)
plt.show()
The error I get running the script via CLI is the following:
bash$ python3 softmax.py
Traceback (most recent call last):
File "softmax.py", line 22, in <module>
print(softmax(scores))
File "softmax.py", line 13, in softmax
exps = [np.exp(s) for s in x[j]]
TypeError: 'numpy.float64' object is not iterable
This kind of crap makes me so nervous about running interpreted code in production with libraries like these; seriously, unreliable and undefined behaviour is totally unacceptable IMO.
At the top of your script, you define
scores = [3.0, 1.0, 0.2]
This is the argument in your first call of softmax(scores). When converted to a numpy array, scores is a 1-d array with shape (3,).
You pass scores into the function, and then it is converted to a numpy array by the call
x = np.transpose(x)
However, it is still 1-d, with shape (3,). The transpose function swaps dimensions, but it does not add a dimension to a 1-d array. In effect, transpose is a "no-op" when applied to a 1-d array.
Then, in the loop that follows, x[j] is a scalar of type numpy.float64, so it does not make sense to write [np.exp(s) for s in x[j]]. x[j] is a scalar, not a sequence, so you can't iterate over it.
In the bottom part of your script, you redefine scores as
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
Now scores is a 2-d array (scores.shape is (3, 80)), so you don't get an error when you call softmax(scores).
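A minimal sketch of the shape issue, together with one possible workaround (reshaping the 1-d input into a single column so that each set of scores is a sequence; the workaround is my suggestion, assuming the softmax function from the question is in scope):

import numpy as np

scores = np.array([3.0, 1.0, 0.2])
print(scores.shape)                # (3,)
print(np.transpose(scores).shape)  # (3,) -- transpose is a no-op on a 1-d array

# reshape to a (3, 1) column so the function's transpose yields a (1, 3) row
# and x[j] is a sequence of scores rather than a scalar
scores_2d = scores.reshape(-1, 1)
print(softmax(scores_2d))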

Tensorflow Adagrad optimizer isn't working

When I run the following script, I notice the following couple of errors:
import tensorflow as tf
import numpy as np
import seaborn as sns
import random

# set random seed:
random.seed(42)

def potential(N):
    points = np.random.rand(N, 2) * 10
    values = np.array([np.exp((points[i][0] - 5.0)**2 + (points[i][1] - 5.0)**2) for i in range(N)])
    return points, values

def init_weights(shape, var_name):
    """
    Xavier initialisation of neural networks
    """
    init = tf.contrib.layers.xavier_initializer()
    return tf.get_variable(initializer=init, name=var_name, shape=shape)

def neural_net(X):
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        w_h = init_weights([2, 10], "w_h")
        w_h2 = init_weights([10, 10], "w_h2")
        w_o = init_weights([10, 1], "w_o")

        ### bias terms:
        bias_1 = init_weights([10], "bias_1")
        bias_2 = init_weights([10], "bias_2")
        bias_3 = init_weights([1], "bias_3")

        h = tf.nn.relu(tf.add(tf.matmul(X, w_h), bias_1))
        h2 = tf.nn.relu(tf.add(tf.matmul(h, w_h2), bias_2))

        return tf.nn.relu(tf.add(tf.matmul(h2, w_o), bias_3))

X = tf.placeholder(tf.float32, [None, 2])

with tf.Session() as sess:
    model = neural_net(X)

    ## define optimizer:
    opt = tf.train.AdagradOptimizer(0.0001)

    values = tf.placeholder(tf.float32, [None, 1])
    squared_loss = tf.reduce_mean(tf.square(model - values))

    ## define model variables:
    model_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "model")
    train_model = opt.minimize(squared_loss, var_list=model_vars)

    sess.run(tf.global_variables_initializer())

    for i in range(10):
        points, val = potential(100)
        train_feed = {X: points, values: val.reshape((100, 1))}
        sess.run(train_model, feed_dict=train_feed)
        print(sess.run(model, feed_dict={X: points}))

    ### plot the approximating model:
    res = 0.1
    xy = np.mgrid[0:10:res, 0:10:res].reshape(2, -1).T
    values = sess.run(model, feed_dict={X: xy})
    sns.heatmap(values.reshape((int(10/res), int(10/res))), xticklabels=False, yticklabels=False)
On the first run I get:
[nan] [nan] [nan] [nan] [nan] [nan] [nan]]
Traceback (most recent call last):
  ...
  File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/seaborn/matrix.py", line 485, in heatmap
    yticklabels, mask)
  File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/seaborn/matrix.py", line 167, in init
    cmap, center, robust)
  File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/seaborn/matrix.py", line 206, in _determine_cmap_params
    vmin = np.percentile(calc_data, 2) if robust else calc_data.min()
  File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity
On the second run I have:
ValueError: Variable model/w_h/Adagrad/ already exists, disallowed.
Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope?
It's not clear to me why I get either of these errors. Furthermore, when I use:
for i in range(10):
    points, val = potential(10)
    train_feed = {X: points, values: val.reshape((10, 1))}
    sess.run(train_model, feed_dict=train_feed)
    print(sess.run(model, feed_dict={X: points}))
I find that on the first run, I sometimes get a network that has collapsed to the constant function with output 0. Right now my hunch is that this might simply be a numerics problem but I might be wrong.
If so, it's a serious problem as the model I have used here is very simple.
Right now my hunch is that this might simply be a numerics problem
Indeed, when running potential(100) I sometimes get values as large as 1e21. The largest values will dominate your loss function and will drive the network parameters.
Even when normalizing your target values e.g. to unit variance, the problem of the largest values dominating the loss would still remain (look e.g. at plt.hist(np.log(potential(100)[1]), bins = 100)).
If you can, try learning the log of val instead of val itself. Note however that then you are changing the assumption of the loss function from 'predictions follow a normal distribution around the target values' to 'log predictions follow a normal distribution around log of the target values'.
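A minimal sketch of that suggestion, reusing the names from the question (only the training target changes; the rest of the script is assumed unchanged):

import numpy as np

for i in range(10):
    points, val = potential(100)
    log_val = np.log(val)  # spans a far smaller numeric range than val itself
    train_feed = {X: points, values: log_val.reshape((100, 1))}
    sess.run(train_model, feed_dict=train_feed)

# predictions are then on the log scale; exponentiate to return to the original scale
log_pred = sess.run(model, feed_dict={X: points})
pred = np.exp(log_pred)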

Resources