dimension related problem in training LightGBM for Multiclass Multilable Classification? - scikit-learn

I would like to classify by LightGBM algorithm for Multiclass Multilable Classification but I encounter a problem during training because of not being a list the input. DATA
is The length of real rows is 10000
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,np.r_[0:6, 7:27]].values
y = dataset.iloc[:,np.r_[6]].values
x_train, x_test, y_train, y_test = train_test_split(X, y,test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
import lightgbm as lgb
d_train = lgb.Dataset(x_train, label=y_train)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
clf = lgb.train(params, d_train, 100)
y_pred=clf.predict(x_test)
for i in range(0,99):
if y_pred[i]>=.5:
y_pred[i]=1
else:
y_pred[i]=0
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
I encounter this problem:
clf = lgb.train(params, d_train, 100)
File "..\lightgbm\engine.py", line 228, in train
...
File "..\lightgbm\basic.py", line 1336, in set_label
label = list_to_1d_numpy(_label_from_pandas(label), name='label')
File "..\lightgbm\basic.py", line 86, in list_to_1d_numpy
"It should be list, numpy 1-D array or pandas Series".format(type(data).__name__, name))
This error is found in basic.py in a function: """Convert data to numpy 1-D array.""" While when I have changed my data to 1D by
y_train = np.reshape(y_train, [1,trainsize])
x_train = np.reshape(x_train, [1,trainsize*26])
The problem is not solved!
Then I use ravel to make 1D for x_train, y_train
x_train = np.ravel(x_train)
y_train = np.ravel(y_train)
but new error is shown:
\lib\site-packages\lightgbm\basic.py", line 872, in __init_from_np2d
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional
What is wrong? How I can solve this?

Related

saving polynomial model , doesn't save polynomial degree

How can I deal with polynomial degree when I want to save a polynomial model, sicne this info is not being saved!
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
"a": np.random.uniform(0.0, 1.0, 1000),
"b": np.random.uniform(10.0, 14.0, 1000),
"c": np.random.uniform(100.0, 1000.0, 1000)})
def data():
X_train, X_val, y_train, y_val = train_test_split(df.iloc[:, :2].values,
df.iloc[:, 2].values,
test_size=0.2,
random_state=1340)
return X_train, X_val, y_train, y_val
X_train, X_val, y_train, y_val = data()
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
poly_reg_model = LinearRegression().fit(X_poly, y_train)
poly_model = joblib.dump(poly_reg_model, 'themodel')
y_pred = poly_reg_model.predict(poly_reg.fit_transform(X_val))
themodel = joblib.load('themodel')
Now, if I try to predict:
themodel.predict(X_val), I am receiving:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 6 is different from 2)
I have to do:
pol_feat = PolynomialFeatures(degree=2)
themodel.predict(pol_feat.fit_transform(X_val))
in order to work.
So, how can i store this info in order to be able to use the model for prediction?
You have to pickle trained PolynomialFeatures also:
# train and pickle
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
poly_reg_model = LinearRegression().fit(X_poly, y_train)
joblib.dump(poly_reg_model, 'themodel')
joblib.dump(poly_reg, 'poilynomia_features_model')
# load and predict
poilynomia_features_model = joblib.load('poilynomia_features_model')
themodel = joblib.load('themodel')
X_val_prep = poilynomia_features_model.transform(X_val)
predictions = themodel.predict(X_val_prep)
But better will wrap all the steps in the single pipeline:
pipeline = Pipeline(steps=[('poilynomia', PolynomialFeatures()),
('lr', LinearRegression())])
pipeline.fit(X_train, y_train)
pipeline.predict(X_val)

Unable to calculate Model performance for Decision Tree Regressor

Although my code run fine on repl and did giving me results but it miserably fails on the Katacoda testing environment.
I am attaching the repl file here for your review as well, which also contains the question which is commented just above the code I have written.
Kindly review and let me know what mistakes I am making here.
Repl Link
https://repl.it/repls/WarmRobustOolanguage
Also sharing code below
Commented is Question Instructions
#Import two modules sklearn.datasets, and #sklearn.model_selection.
#Import numpy and set random seed to 100.
#Load popular Boston dataset from sklearn.datasets module #and assign it to variable boston.
#Split boston.data into two sets names X_train and X_test. #Also, split boston.target into two sets Y_train and Y_test.
#Hint: Use train_test_split method from #sklearn.model_selection; set random_state to 30.
#Print the shape of X_train dataset.
#Print the shape of X_test dataset.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(100)
max_depth = range(2, 6)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
#Import required module from sklearn.tree.
#Build a Decision tree Regressor model from X_train set and #Y_train labels, with default parameters. Name the model as #dt_reg.
#Evaluate the model accuracy on training data set and print #it's score.
#Evaluate the model accuracy on testing data set and print it's score.
#Predict the housing price for first two samples of X_test #set and print them.(Hint : Use predict() function)
dt_reg = DecisionTreeRegressor(random_state=1)
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', cross_val_score(dt_reg, X_train,Y_train, cv=10 ))
print('Accuracy of Test Data :', cross_val_score(dt_reg, X_test,Y_test, cv=10 ))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
#Fit multiple Decision tree regressors on X_train data and #Y_train labels with max_depth parameter value changing from #2 to 5.
#Evaluate each model accuracy on testing data set.
#Hint: Make use of for loop
#Print the max_depth value of the model with highest accuracy.
dt_reg = DecisionTreeRegressor()
random_grid = {'max_depth': max_depth}
dt_random = RandomizedSearchCV(estimator = dt_reg, param_distributions = random_grid,
n_iter = 90, cv = 3, verbose=2, random_state=42, n_jobs = -1)
dt_random.fit(X_train, Y_train)
dt_random.best_params_
def evaluate(model, test_features, test_labels):
predictions = model.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * np.mean(errors / test_labels)
accuracy = 100 - mape
print('Model Performance')
print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
print('Accuracy = {:0.2f}%.'.format(accuracy))
return accuracy
best_random = dt_random.best_estimator_
random_accuracy = evaluate(best_random, X_test,Y_test)
print("Accuracy Scores of the Model ",random_accuracy)
best_parameters = (dt_random.best_params_['max_depth']);
print(best_parameters)
The question is asking for default values. Try to remove random_state=1
Current Line:
dt_reg = DecisionTreeRegressor(random_state=1)
Update Line:
dt_reg = DecisionTreeRegressor()
I think it should Work!!!
# ================================================================================
# Machine Learning Using Scikit-Learn | 3 | Decision Trees ================================================================================
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(100)
# Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
boston = datasets.load_boston()
# print(boston)
# Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(boston.data, boston.target, random_state=30)
# Print the shape of X_train dataset
print(X_train.shape)
# Print the shape of X_test dataset.
print(X_test.shape)
# Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg
dt_Regressor = DecisionTreeRegressor()
dt_reg = dt_Regressor.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
# Get the max depth
maxdepth = 2
maxscore = 0
for x in range(2, 6):
dt_Regressor = DecisionTreeRegressor(max_depth=x)
dt_reg = dt_Regressor.fit(X_train, Y_train)
score = dt_reg.score(X_test, Y_test)
if(maxscore < score):
maxdepth = x
maxscore = score
print(maxdepth)

How to preprocess and feed data to keras model?

I have a dataset with two columns, path and class. I'd like to fine tune VGGface with it.
dataset.head(5):
path class
0 /f3_224x224.jpg red
1 /bc_224x224.jpg orange
2 /1c_224x224.jpg brown
3 /4b_224x224.jpg red
4 /0c_224x224.jpg yellow
I'd like to use these paths to preprocess images and feed to keras. My preprocessing functions is below:
from keras.preprocessing.image import img_to_array, load_img
def prep_image(photo):
img = image.load_img(path + photo, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = utils.preprocess_input(x, version=1)
return x
I prepare my datasets with the following code:
from sklearn.model_selection import train_test_split
path = list(dataset.columns.values)
path.remove('class')
X = dataset[path]
y = dataset['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
I train my model with the following code:
nb_class = 4
hidden_dim = 512
vgg_model = VGGFace(include_top=False, input_shape=(224, 224, 3))
last_layer = vgg_model.get_layer('pool5').output
x = Flatten(name='flatten')(last_layer)
x = Dense(hidden_dim, activation='relu', name='fc6')(x)
x = Dense(hidden_dim, activation='relu', name='fc7')(x)
out = Dense(nb_class, activation='softmax', name='fc8')(x)
custom_vgg_model = Model(vgg_model.input, out)
custom_vgg_model.compile(
optimizer="adam",
loss="categorical_crossentropy"
)
custom_vgg_model.fit(X_train, y_train, epochs=50, batch_size=16)
test_loss, test_acc = model.evaluate(X_test, y_test)
However i get value error because i can't figure out how to preprocess images and feed the arrays. How can i transform the paths from X_train/test dataframes and replace them with the output of prep_image function?
ValueError: Error when checking input: expected input_2 to have 4 dimensions, but got array with shape (50297, 1)
So the shape should be (50297, 224, 224, 3).
X_train, X_test are basically just path names it seems. In your data preparation step you just need to modify your code like that adding those last two lines.
from sklearn.model_selection import train_test_split
path = list(dataset.columns.values)
path.remove('class')
X = dataset[path]
y = dataset['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train = np.array([prep_image(path)[0] for path in X_train])
X_test = np.array([prep_image(path)[0] for path in X_test])

What's the purpose of np_utils.to_categorical() in keras?

Hello I'm training my self in sentiment recognition using audio file and a code from a repository of git.
Code sample:
newdf1 = np.random.rand(len(rnewdf)) < 0.8
train = rnewdf[newdf1]
test = rnewdf[~newdf1]
trainfeatures = train.iloc[:, :-1]
trainlabel = train.iloc[:, -1:]
testfeatures = test.iloc[:, :-1]
testlabel = test.iloc[:, -1:]
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder
X_train = np.array(trainfeatures)
y_train = np.array(trainlabel)
X_test = np.array(testfeatures)
y_test = np.array(testlabel)
lb = LabelEncoder()
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
I'd like to understand what's this code do.
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
I ask this question because in training phase of the CNN, I've got an error in model.fit
Error when checking target: expected activation_26 to have shape (1,)...
understand this may help me to overcome the problem.
thanks

Python different results for manual and cross_val_score prediction

I have one question, I'm trying to implement KFold and cross_val_score.
My goal is to calculate mean_squared_errorand for this purpose I used the following code:
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
x = np.random.random((10000,20))
y = np.random.random((10000,1))
x_train = x[7000:]
y_train = y[7000:]
x_test = x[:7000]
y_test = y[:7000]
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train)
y_predicted = Model.predict(x_test)
MSE = mean_squared_error(y_test,y_predicted)
print(MSE)
kfold = KFold(n_splits = 100, random_state = None, shuffle = False)
results = cross_val_score(Model,x,y,cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())
I think it's all right here, I got the following results:
Results: 0.0828856459279 and -0.083069435946
But when I try to do this on some other example (datas from Kaggle House Prices), it does not work properly, at least I think so..
train = pd.read_csv('train.csv')
Insert missing values...
...
train = pd.get_dummies(train)
y = train['SalePrice']
train = train.drop(['SalePrice'], axis = 1)
x_train = train[:1000].values.reshape(-1,339)
y_train = y[:1000].values.reshape(-1,1)
y_train_normal = np.log(y_train)
x_test = train[1000:].values.reshape(-1,339)
y_test = y[1000:].values.reshape(-1,1)
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train_normal)
y_predicted = Model.predict(x_test)
y_predicted_transform = np.exp(y_predicted)
MSE = mean_squared_error(y_test, y_predicted_transform)
print(MSE)
kfold = KFold(n_splits = 10, random_state = None, shuffle = False)
results = cross_val_score(Model,train,y, cv = kfold, scoring = "neg_mean_squared_error")
print(results.mean())
Here I get the following results: 0.912874946869 and -6.16986926564e+16
Apparently, the mean_squared_error calculated 'manually' is not the same as the mean_squared_error calculated by the help of KFold.
I'm interested in where I made a mistake?
The discrepancy is because, in contrast to your first approach (training/test set), in your CV approach you use the unnormalized y data for fitting the regression, hence your huge MSE. To get comparable results, you should do the following:
y_normal = np.log(y)
y_test_normal = np.log(y_test)
MSE = mean_squared_error(y_test_normal, y_predicted) # NOT y_predicted_transform
results = cross_val_score(Model, train, y_normal, cv = kfold, scoring = "neg_mean_squared_error")

Resources