Is this a bug in xgboost's XGBClassifier?

import numpy as np
from xgboost import XGBClassifier

model = XGBClassifier(
    use_label_encoder=False,
    label_lower_bound=0, label_upper_bound=1
    # setting the bounds doesn't seem to help
)
x = np.array([[1, 2, 3], [4, 5, 6]], 'ushort')
y = [1, 1]

try:
    model.fit(x, y)
    # this fails with ValueError:
    # "The label must consist of integer labels
    #  of form 0, 1, 2, ..., [num_class - 1]."
except Exception as e:
    print(e)

y = [0, 0]
# this works
model.fit(x, y)

model = XGBClassifier()
y = [1, 1]
# this works, but with UserWarning:
# "The use of label encoder in XGBClassifier is deprecated, etc."
model.fit(x, y)
Seems to me like the label encoder is deprecated, but we are FORCED to use it if our class labels don't happen to contain a zero.

I had the same problem. I solved it by passing use_label_encoder=False as a parameter, and the warning message disappeared.
I think in your case the problem is that your y contains only 1s, but XGBoost wants the target to start from 0. If you change y = [ 1, 1 ] to y = [ 0, 0 ], the UserWarning should disappear.
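If the real labels can't simply be renumbered by hand, one workaround (just a sketch, not an official XGBoost recommendation) is to remap them with scikit-learn's LabelEncoder before fitting, so they always end up as 0, 1, ..., num_class - 1, and to map predictions back afterwards:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

x = np.array([[1, 2, 3], [4, 5, 6]], 'ushort')
y_raw = [1, 1]

le = LabelEncoder()
y_enc = le.fit_transform(y_raw)           # remaps whatever labels exist to 0 .. n_classes-1

model = XGBClassifier(use_label_encoder=False)
model.fit(x, y_enc)

preds = le.inverse_transform(model.predict(x))   # back to the original label values
print(preds)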

Related

Is my Implementation of K-means with Manually Set Centroids Correct?

I am trying to solve this clustering problem that involves the K-means algorithm.
Question:
Considering the data inside the file - link below - execute the K-means algorithm where the initial centroids are positioned at:
[1,1,1,1],[-1,-1,-1,-1] and [1,-1,1,-1]. What is the position of each centroid after 10 iterations?
My solution, which I am not sure about:
Basic Code:
kmeans = KMeans(n_clusters = 3 , max_iter= 10, init = np.array([[1, 1, 1, 1],[-1, -1, -1, -1],[1, -1, 1, -1]], np.float64) , random_state = 42)
...
kmeans.cluster_centers_
Answer:
array([[ 1.02575735, -0.00207592, -0.02395886, 0.63623732],
[ 0.10361404, 0.00370027, 0.00669603, -0.03432606],
[ 0.99690983, 0.48052607, 0.94034839, -0.00726928]])
Data: https://drive.google.com/file/d/1DXlFR3Jc5cFiblMxD6Bl7f4p7u_qsX2S/view?usp=sharing
Google Colaboratory full code: https://colab.research.google.com/drive/1somvP3p7KES0NtBwnLYT6vpqSr3WQfgU?usp=sharing
I used my own code to check your answer and it was right.
import pandas as pd
import numpy as np

df = pd.read_csv('agrupamento_Q1.csv')
data = df.to_numpy()

centroids = np.array([[1.0, 1.0, 1.0, 1.0],
                      [-1.0, -1.0, -1.0, -1.0],
                      [1.0, -1.0, 1.0, -1.0]])
iterations = 10

for itr in range(iterations):
    # assignment step: label each point with the index of its nearest centroid
    assign = np.zeros([data.shape[0], ], dtype=int)
    for i in range(data.shape[0]):
        for c in range(1, 3):
            if np.linalg.norm(data[i] - centroids[c]) < np.linalg.norm(data[i] - centroids[assign[i]]):
                assign[i] = c
    # update step: move each centroid to the mean of its assigned points
    new_cent = np.zeros_like(centroids)
    cent_pop = np.zeros([centroids.shape[0], ])
    for i in range(data.shape[0]):
        new_cent[assign[i]] += data[i]
        cent_pop[assign[i]] += 1
    for i in range(centroids.shape[0]):
        centroids[i] = new_cent[i] / cent_pop[i]

print(centroids)
# [[ 1.02575735 -0.00207592 -0.02395886  0.63623732]
#  [ 0.10361404  0.00370027  0.00669603 -0.03432606]
#  [ 0.99690983  0.48052607  0.94034839 -0.00726928]]
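For comparison, the same run can be reproduced with scikit-learn's KMeans. This is a sketch assuming the same agrupamento_Q1.csv file; n_init=1 is added because the initial centroids are passed explicitly, and note that KMeans may stop before max_iter if it converges within its tolerance:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('agrupamento_Q1.csv').to_numpy()
init = np.array([[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]], np.float64)

kmeans = KMeans(n_clusters=3, init=init, n_init=1, max_iter=10, random_state=42)
kmeans.fit(data)
print(kmeans.cluster_centers_)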

scikit MLPClassifier couldn't fit xor problem

My teacher asked us to use scikit-learn's MLPClassifier to solve the XOR problem, with 2 hidden layers (the first with 4 units, the second with 2 units), the identity activation, and lbfgs as the solver.
from sklearn.neural_network import MLPClassifier

x = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
clf = MLPClassifier(hidden_layer_sizes=(4, 2), activation='identity', solver='lbfgs')
clf.fit(x, y)
clf.predict(x)
# output: array([1, 0, 0, 0])
I don't know why it fails to predict correctly; I thought any combination of linear functions could solve a non-linear problem.
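The reason is that a composition of identity (linear) layers collapses to a single affine function of the inputs, and XOR is not linearly separable, so no choice of weights can classify all four points correctly. Below is a minimal sketch of the same architecture with a non-linear activation; the choice of 'tanh', random_state and max_iter are assumptions added for illustration, not part of the assignment:

from sklearn.neural_network import MLPClassifier

x = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# same layer sizes, but a non-linear activation so the decision boundary can bend
clf = MLPClassifier(hidden_layer_sizes=(4, 2), activation='tanh',
                    solver='lbfgs', random_state=0, max_iter=1000)
clf.fit(x, y)
print(clf.predict(x))   # typically array([0, 1, 1, 0])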

Understanding pytorch autograd

I am trying to understand how pytorch autograd works. If I have functions y = 2x and z = y**2, if I do normal differentiation, I get dz/dx at x = 1 as 8 (dz/dx = dz/dy * dy/dx = 2y*2 = 2(2x)*2 = 8x). Or, z = (2x)**2 = 4x^2 and dz/dx = 8x, so at x = 1, it is 8.
If I do the same with pytorch autograd, I get 4
x = torch.ones(1,requires_grad=True)
y = 2*x
z = y**2
x.backward(z)
print(x.grad)
which prints
tensor([4.])
where am I going wrong?
You're using Tensor.backward wrong. To get the result you asked for, you should use
x = torch.ones(1,requires_grad=True)
y = 2*x
z = y**2
z.backward() # <-- fixed
print(x.grad)
The call to z.backward() invokes the back-propagation algorithm, starting at z and working back to each leaf node in the computation graph. In this case x is the only leaf node. After calling z.backward() the computation graph is reset and the .grad member of each leaf node is updated with the gradient of z with respect to the leaf node (in this case dz/dx).
What's actually happening in your original code? Well, what you've done is apply back-propagation starting at x. With no arguments x.backward() would simply result in x.grad being set to 1 since dx/dx = 1. The additional argument (gradient) is effectively a scale to apply to the resulting gradient. In this case z=4 so you get x.grad = z * dx/dx = 4 * 1 = 4. If interested, you can check out this for more information on what the gradient argument does.
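To make the scaling behaviour concrete, here is a small sketch (the value 3.0 is chosen only for illustration):

import torch

# differentiate z with respect to x: dz/dx = 8x, which is 8 at x = 1
x = torch.ones(1, requires_grad=True)
y = 2 * x
z = y ** 2
z.backward()
print(x.grad)                      # tensor([8.])

# backward(v) called on x itself just scales the trivial gradient dx/dx = 1 by v
x = torch.ones(1, requires_grad=True)
x.backward(torch.tensor([3.0]))
print(x.grad)                      # tensor([3.])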
If you still have some confusion about autograd in PyTorch, the following basic XOR-gate representation may also help:
import torch
import torch.nn.functional as F

# XOR truth table; floats so they can be combined with the float weights
inputs = torch.tensor(
    [
        [0., 0.],
        [0., 1.],
        [1., 0.],
        [1., 1.],
    ]
)
outputs = torch.tensor(
    [
        [0.],
        [1.],
        [1.],
        [0.],
    ]
)

weights = torch.randn(1, 2)
weights.requires_grad = True                # set to True for gradient computation
bias = torch.randn(1, requires_grad=True)   # set to True for gradient computation

preds = F.linear(inputs, weights, bias)     # a basic linear model: inputs @ weights.T + bias
loss = (outputs - preds).mean()             # toy loss, just something to backpropagate
loss.backward()
print(weights.grad)                         # gradient of the loss with respect to the weights

Issue when trying to plot after applying PCA on a dataset

I am trying to plot the results of PCA of the dataset pima-indians-diabetes.csv. My code shows a problem only in the plotting piece:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
Kindly, how can I fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
yields
(768, 1)
Which leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
as the first array is already multidimensional, therefore the error is indicating that you used too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
            principalComponents[(y == i).reshape(-1), 1],
            color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of shape (R, 1) and (R,), this question on Stack Overflow provides a nice starting point.
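An equivalent fix is to flatten the target once, before the loop, so every boolean mask built from it is already one-dimensional. A short sketch reusing the question's variables (the two-colour list is my own simplification, since the dataset only has two classes):

y = y.ravel()   # (768, 1) -> (768,), so y == i is already 1D

for color, i, target_name in zip(['navy', 'darkorange'], [0, 1], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0],
                principalComponents[y == i, 1],
                color=color, alpha=.8, lw=2, label=target_name)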

pytorch how to select channels by mask?

I want to know how to select channels by a mask in PyTorch.
[channel1 channel2 channel3 channel4] x [1,0,0,1] --> [channel1,channel4]
I tried torch.masked_select() and it didn't work. If the input has a shape like [B, C, H, W], the output's shape should be [B, masked_C, H, W]:
import torch
from torch import nn
input = torch.randn((1,5,3,3))
pool = nn.AdaptiveAvgPool2d(1)
w = torch.sigmoid(pool(input)).view(1,-1)
mask = torch.gt(w,0.5)
print(input)
print(w)
print(mask)
The output is as follows:
tensor([[[[ 0.9129, -0.9763,  1.4460],
          [ 0.3608,  0.5561, -1.4612],
          [ 1.4953, -1.2474,  0.4069]],

         [[-0.9121,  0.1261,  0.4661],
          [-1.1624, -1.0266, -1.5419],
          [ 1.0644,  1.0039, -0.4022]],

         [[-1.8454, -0.2150,  2.3703],
          [ 0.5224,  0.3366,  1.7545],
          [-0.4624,  1.2639,  1.8032]],

         [[-1.1558, -1.9985, -1.1336],
          [-0.4400, -0.2092,  0.0677],
          [-0.4172, -0.3614, -1.3193]],

         [[-0.9441, -0.2944,  0.3381],
          [ 1.6562, -0.5623,  0.0599],
          [ 0.7229,  0.0472, -0.5122]]]])
tensor([[0.5414, 0.4341, 0.6489, 0.3156, 0.5142]])
tensor([[1, 0, 1, 0, 1]], dtype=torch.uint8)
the result I want is like this:
tensor([[[[ 0.9129, -0.9763,  1.4460],
          [ 0.3608,  0.5561, -1.4612],
          [ 1.4953, -1.2474,  0.4069]],

         [[-1.8454, -0.2150,  2.3703],
          [ 0.5224,  0.3366,  1.7545],
          [-0.4624,  1.2639,  1.8032]],

         [[-0.9441, -0.2944,  0.3381],
          [ 1.6562, -0.5623,  0.0599],
          [ 0.7229,  0.0472, -0.5122]]]])
I believe you can simply do:
input[mask]
By the way, you don't need to call sigmoid and then .gt(0.5); since sigmoid(x) > 0.5 exactly when x > 0, you can directly do .gt(0.0) on the pooled values without calling the sigmoid.
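For the shapes in the question, note that boolean indexing drops the batch dimension, so input[mask] comes back as [masked_C, H, W]. A small sketch (the unsqueeze(0) at the end is my own addition to restore a [B, masked_C, H, W] shape for B = 1; for larger batches the number of kept channels can differ per sample, so a single dense tensor may not exist):

import torch
from torch import nn

input = torch.randn((1, 5, 3, 3))
pool = nn.AdaptiveAvgPool2d(1)
mask = torch.gt(pool(input).view(1, -1), 0.0)   # same channels as sigmoid(...).gt(0.5)

selected = input[mask]             # shape: [masked_C, H, W]
selected = selected.unsqueeze(0)   # shape: [1, masked_C, H, W]
print(selected.shape)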
