What's the purpose of np_utils.to_categorical() in keras?

Hello, I'm teaching myself sentiment recognition from audio files, using code from a Git repository.
Code sample:
newdf1 = np.random.rand(len(rnewdf)) < 0.8   # boolean mask, ~80% True
train = rnewdf[newdf1]                       # ~80% of rows for training
test = rnewdf[~newdf1]                       # remaining ~20% for testing
trainfeatures = train.iloc[:, :-1]           # every column except the last
trainlabel = train.iloc[:, -1:]              # last column = label
testfeatures = test.iloc[:, :-1]
testlabel = test.iloc[:, -1:]
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder
X_train = np.array(trainfeatures)
y_train = np.array(trainlabel)
X_test = np.array(testfeatures)
y_test = np.array(testlabel)
lb = LabelEncoder()
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
I'd like to understand what this code does:
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
I'm asking because in the training phase of the CNN I get an error in model.fit:
Error when checking target: expected activation_26 to have shape (1,)...
Understanding these two lines may help me overcome the problem.
Thanks
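For reference, a minimal sketch (toy labels of my own, not the asker's data) of what the two calls do: LabelEncoder.fit_transform maps each distinct label to an integer class index 0..n-1, and np_utils.to_categorical turns those integers into one-hot vectors. (In recent Keras versions np_utils is gone; to_categorical is imported from keras.utils directly.)

import numpy as np
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

labels = np.array(['angry', 'happy', 'sad', 'happy'])
lb = LabelEncoder()
encoded = lb.fit_transform(labels)         # array([0, 1, 2, 1]) -- integer class indices
onehot = np_utils.to_categorical(encoded)  # shape (4, 3): one one-hot row per sample
print(onehot)                              # 'happy' becomes [0., 1., 0.]

As for the model.fit error: one-hot targets have one column per class, so a message like "expected ... to have shape (1,)" usually means the loss expects integer labels (e.g. sparse_categorical_crossentropy) while one-hot targets were supplied, or the output layer size does not match the number of classes.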

Related

Dimension-related problem in training LightGBM for Multiclass Multilabel Classification

I would like to use the LightGBM algorithm for multiclass multilabel classification, but I encounter a problem during training because the input is not a list. The data has 10,000 rows.
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,np.r_[0:6, 7:27]].values
y = dataset.iloc[:,np.r_[6]].values
x_train, x_test, y_train, y_test = train_test_split(X, y,test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
import lightgbm as lgb
d_train = lgb.Dataset(x_train, label=y_train)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
clf = lgb.train(params, d_train, 100)
y_pred=clf.predict(x_test)
for i in range(0, 99):
    if y_pred[i] >= .5:
        y_pred[i] = 1
    else:
        y_pred[i] = 0
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
I encounter this problem:
clf = lgb.train(params, d_train, 100)
File "..\lightgbm\engine.py", line 228, in train
...
File "..\lightgbm\basic.py", line 1336, in set_label
label = list_to_1d_numpy(_label_from_pandas(label), name='label')
File "..\lightgbm\basic.py", line 86, in list_to_1d_numpy
"It should be list, numpy 1-D array or pandas Series".format(type(data).__name__, name))
This error comes from a function in basic.py whose docstring reads """Convert data to numpy 1-D array.""" But when I changed my data to 1-D with
y_train = np.reshape(y_train, [1,trainsize])
x_train = np.reshape(x_train, [1,trainsize*26])
the problem was not solved!
Then I used ravel to make x_train and y_train 1-D:
x_train = np.ravel(x_train)
y_train = np.ravel(y_train)
but a new error is shown:
\lib\site-packages\lightgbm\basic.py", line 872, in __init_from_np2d
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional
What is wrong? How can I solve this?
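A hedged sketch of the likely fix (synthetic arrays standing in for Data.csv): LightGBM wants the label as a 1-D array, while the feature matrix must stay 2-D, so ravel should be applied to y only.

import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 26)            # feature matrix: must stay 2-D
y = np.random.randint(0, 2, (100, 1))  # column vector, as iloc[:, np.r_[6]] returns

# np.reshape(y, [1, n]) is still 2-D (a 1-row matrix); np.ravel gives a true 1-D array.
d_train = lgb.Dataset(X, label=np.ravel(y))  # ravel the label only, never X
booster = lgb.train({'objective': 'binary'}, d_train, num_boost_round=10)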

Errors in RFE and KFOLD method using Logistic Regression

I am getting errors in the RFE and K-Fold methods in Python using logistic regression. How can I make this code run without errors?
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
accuracies = []
feature_set = []
max_accuracy_so_far = 0
for i in range(1, len(X[0]) + 1):
    selector = RFE(LogisticRegression(), i, verbose=1)
    selector = selector.fit(X, y)
    current_accuracy = selector.score(X, y)
    accuracies.append(current_accuracy)
    feature_set.append(selector.support_)
    if max_accuracy_so_far < current_accuracy:
        max_accuracy_so_far = current_accuracy
        selected_features = selector.support_
    print('End of iteration no. {}'.format(i))
X_sub = X[:,selected_features]
#KFOLD model score
scores = []
max_score = 0
from sklearn.model_selection import KFold
kf = KFold(n_splits=4,random_state=0,shuffle=True)
for train_index, test_index in kf.split(X_sub):
    X_train, X_test = X_sub[train_index], X_sub[test_index]
    y_train, y_test = y[train_index], y[test_index]
    current_model = LogisticRegression()
    # train the model
    current_model.fit(X_train, y_train)
    # see performance score (the original called `model.score`, but `model`
    # is undefined here; it should be `current_model`)
    current_score = current_model.score(X_test, y_test)
    scores.append(current_score)
    if max_score < current_score:
        max_score = current_score
        best_model = current_model
best_model.intercept_
best_model.coef_
What is the correct way to write this? I expect the code to run without errors and report the scores.
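For completeness, a minimal self-contained sketch (synthetic data via make_classification, my own variable names) of the RFE-then-KFold pattern the question is attempting:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# RFE pass: remember the support mask of the best-scoring feature count.
best_acc, selected = 0.0, None
for i in range(1, X.shape[1] + 1):
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=i).fit(X, y)
    acc = selector.score(X, y)
    if acc > best_acc:
        best_acc, selected = acc, selector.support_

# K-fold pass on the selected features only.
X_sub = X[:, selected]
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X_sub):
    fold_model = LogisticRegression(max_iter=1000).fit(X_sub[train_idx], y[train_idx])
    print(fold_model.score(X_sub[test_idx], y[test_idx]))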

Error encountered: Classification metrics can't handle a mix of multiclass-multioutput and binary targets

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
file = './BBC.csv'
df = read_csv(file)
array = df.values
X = array[:, 0:11]
Y = array[:, 11]
test_size = 0.30
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = RandomForestClassifier()
model.fit(X_train, Y_train)
result = model.score(X_test, X_test)
print("Accuracy: %.3f%%") % (result*100.0)
dataset: https://www.dropbox.com/s/ar1c9yuv5x774cv/BBC.csv?dl=0
I have encountered this error:
Classification metrics can't handle a mix of multiclass-multioutput and binary targets
If I'm not wrong, RandomForest should be able to handle both classes (classification) and means (regression). Am I wrong?
Edit:
I checked your dataset. For a classification task, the problem lies in your code:
result = model.score(X_test, X_test)
Note that the arguments here should be X_test and Y_test.
-----kind of off-topic-----
If you want to use RandomForest for regression, you should probably use RandomForestRegressor.
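A hedged sketch of the corrected version (random arrays standing in for BBC.csv, which is not reproduced here). Besides the score() arguments, note that the original print line puts the % formatting outside the print() call, which fails on Python 3:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 11)       # stands in for the 11 feature columns
Y = np.random.randint(0, 3, 200)  # a single multiclass target column
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=45)

model = RandomForestClassifier()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)          # labels, not (X_test, X_test)
print("Accuracy: %.3f%%" % (result * 100.0))  # % inside the call (Python 3 safe)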

Error in reshaping input tokenized text predicting the sentiments in a lstm rnn

I am new to neural networks and have been learning their application in text analytics, so I have used an LSTM RNN in Python.
After training the model on a dataset of dimension 20,000 x 1 (20,000 being the texts and 1 being the sentiment of each text), I got a good accuracy of 99%, after which I validated the model and it worked fine (using the model.predict() function).
Now, just to test my model, I have been trying to give it random text inputs, either from a dataframe or from variables containing some text, but I always land on an array-reshaping error which says the input to the RNN model must have dimension (1, 30).
But when I feed the training data back into the model for prediction, the model works absolutely fine. Why is this happening?
(Links in the original post: screenshot of the error, image of the model summary, training data.)
I am just stuck here, and any kind of suggestion will help me learn more about RNNs. I am attaching the error and the RNN model code with this request.
Thank you,
Regards,
Tushar Upadhyay
import numpy as np
import pandas as pd
import keras
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re
data=pd.read_csv('..../twitter_tushar_data.csv')
max_features = 4000  # spelled `max_fatures` in the original post
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(data['tweetText'].values)
X = tokenizer.texts_to_sequences(data['tweetText'].values)
X = pad_sequences(X)  # pad every sequence to the length of the longest one
embed_dim = 128
lstm_out = 196
model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))  # used to avoid overfitting; the original
                                  # instantiated this layer without model.add(), so it had no effect
model.add(LSTM(lstm_out, recurrent_dropout=0.2, dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
#splitting data in training and testing parts
Y = pd.get_dummies(data['SA']).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
batch_size = 128
model.fit(X_train, Y_train, epochs=7, batch_size=batch_size, verbose=2)
validation_size = 3500
X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = 128)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):
    result = model.predict(X_validate[x].reshape(1, X_test.shape[1]), batch_size=1, verbose=2)[0]
    if np.argmax(result) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1
    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1
print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")
I got the solution to my question: it was just a matter of tokenizing the input properly. Thanks!! The code below predicts on different user inputs:
text = np.array(['you are a pathetic awful movie'])
print(text.shape)
tk = Tokenizer(num_words=4000, lower=True, split=" ")
tk.fit_on_texts(text)
# pad_sequences was imported above; the original called sequence.pad_sequences
# without importing `sequence`. max_review_length must equal the padded
# sequence length used in training.
prediction = model.predict(pad_sequences(tk.texts_to_sequences(text), maxlen=max_review_length))
print(prediction)
print(np.argmax(prediction))
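One hedged caveat on that self-answer: fitting a fresh Tokenizer on the new text assigns word indices unrelated to the ones the model was trained on, so predictions on genuinely new inputs are usually made with the tokenizer fitted on the training corpus. A sketch, assuming `tokenizer`, `model`, `X`, and the imports from the code above:

new_text = ['you are a pathetic awful movie']
seqs = tokenizer.texts_to_sequences(new_text)    # indices consistent with training
padded = pad_sequences(seqs, maxlen=X.shape[1])  # same padded length as training
prediction = model.predict(padded)
print(np.argmax(prediction))                     # predicted class, 0 or 1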

Python different results for manual and cross_val_score prediction

I have one question: I'm trying to implement KFold and cross_val_score.
My goal is to calculate mean_squared_error, and for this purpose I used the following code:
from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
x = np.random.random((10000,20))
y = np.random.random((10000,1))
x_train = x[7000:]
y_train = y[7000:]
x_test = x[:7000]
y_test = y[:7000]
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train)
y_predicted = Model.predict(x_test)
MSE = mean_squared_error(y_test,y_predicted)
print(MSE)
kfold = KFold(n_splits = 100, random_state = None, shuffle = False)
results = cross_val_score(Model,x,y,cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())
I think it's all right here; I got the following results:
Results: 0.0828856459279 and -0.083069435946
But when I try to do this on some other example (data from Kaggle House Prices), it does not work properly, at least I think so...
train = pd.read_csv('train.csv')
# ... insert missing values ...
train = pd.get_dummies(train)
y = train['SalePrice']
train = train.drop(['SalePrice'], axis = 1)
x_train = train[:1000].values.reshape(-1,339)
y_train = y[:1000].values.reshape(-1,1)
y_train_normal = np.log(y_train)
x_test = train[1000:].values.reshape(-1,339)
y_test = y[1000:].values.reshape(-1,1)
Model = linear_model.LinearRegression()
Model.fit(x_train,y_train_normal)
y_predicted = Model.predict(x_test)
y_predicted_transform = np.exp(y_predicted)
MSE = mean_squared_error(y_test, y_predicted_transform)
print(MSE)
kfold = KFold(n_splits = 10, random_state = None, shuffle = False)
results = cross_val_score(Model,train,y, cv = kfold, scoring = "neg_mean_squared_error")
print(results.mean())
Here I get the following results: 0.912874946869 and -6.16986926564e+16.
Apparently, the mean_squared_error calculated 'manually' is not the same as the one calculated with KFold.
Where did I make a mistake?
The discrepancy is because, in contrast to your first approach (training/test set), in your CV approach you use the unnormalized y data for fitting the regression, hence your huge MSE. To get comparable results, you should do the following:
y_normal = np.log(y)
y_test_normal = np.log(y_test)
MSE = mean_squared_error(y_test_normal, y_predicted) # NOT y_predicted_transform
results = cross_val_score(Model, train, y_normal, cv = kfold, scoring = "neg_mean_squared_error")
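To make this concrete, a self-contained sketch (random data standing in for the Kaggle set) where both the manual split and cross_val_score are evaluated on the log-transformed target, so the two MSE figures are comparable:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

X = np.random.random((1500, 339))
y = np.exp(np.random.random(1500))  # positive target standing in for SalePrice
y_normal = np.log(y)                # both evaluations use the log target

X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y_normal[:1000], y_normal[1000:]

model = linear_model.LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))  # manual MSE

kfold = KFold(n_splits=10, shuffle=False)
results = cross_val_score(model, X, y_normal, cv=kfold, scoring='neg_mean_squared_error')
print(-results.mean())  # now on the same scale as the manual MSE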
