Choosing a multi-label classifier with high number of labels - keras

First of all, I'm very new to ML and I have this task:
I need to build a ML model to give clients a list of 10 professions for each of them which best fit with their data: bachelor's degree type, favourite subjects, favourite professional sectors, etc...
My team and I have already extracted the information from an SQL database, built a dataframe with all the relevant fields, and one-hot encoded it. This is the result:
df_all_dumm.shape
(773, 1029)
So I have 773 clients and 1029 columns (a lot, but we thought it was necessary because all the columns were numerically coded categorical variables). Most of the columns are one-hot encoded profession columns (positions 99 to 998), holding "1" if the profession has been suggested to the client and "0" if not.
I'm a bit lost about whether this dataset layout is suitable for multi-label classification, and about which method to use (a neural network, a random forest classifier, scikit-multilearn models...). I have already tried some scikit-multilearn models such as MLkNN and BRkNNaClassifier, but the results are very poor (F1 score = 0.1 - 0.2).
This is the dataset (it doesn't contain any private data, so I think there is no problem uploading it. Also, I don't know if this is the proper way to paste a link, sorry again)
https://drive.google.com/file/d/1nID4q7EfpoiNKdWz6N4FRUgIQEniwFRV/view?usp=sharing
EDIT:
I have created a Sequential Keras model:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Slicing target columns from the rest of the df
data = pd.read_csv('df_all_dumm.csv')
data_out = data.iloc[:, 99:999]  # profession (target) columns
# Inputs: everything except the targets and the client id
data_in = data.drop(columns=data_out.columns)
data_in = data_in.drop(columns=['id'])
X_train, X_test, y_train, y_test = train_test_split(data_in,
                                                    data_out,
                                                    test_size=0.3,
                                                    random_state=42)
print("{0:2.2f}% of data in train set".format(len(X_train)/len(data.index)*100))
print("{0:2.2f}% of data in test set".format(len(X_test)/len(data.index)*100))
# Dataframes to tensors
X_train_tf = tf.convert_to_tensor(X_train.values)
X_test_tf = tf.convert_to_tensor(X_test.values)
y_train_tf = tf.convert_to_tensor(y_train.values)
y_test_tf = tf.convert_to_tensor(y_test.values)
from numpy import asarray
from sklearn.datasets import make_multilabel_classification
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
# get the model
def get_model(n_inputs, n_outputs):
    model = Sequential()
    # input_dim is only needed on the first layer
    model.add(Dense(512, input_dim=n_inputs, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(600, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(700, activation='relu'))
    model.add(Dense(900, activation='relu'))
    # one sigmoid unit per label for multi-label output
    model.add(Dense(n_outputs, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# load dataset
X, y = X_train_tf, y_train_tf
n_inputs, n_outputs = X.shape[1], y.shape[1]
# get model
model = get_model(n_inputs, n_outputs)
# fit the model on all data
model.fit(X, y, validation_split=0.33, epochs=100, batch_size=10)
prec = model.evaluate(X_test_tf, y_test_tf)[1]
print("Network accuracy: {} %".format(round(prec*100,2)))
Network accuracy: 4.74 %
So, this is the model we created. I think the main problem here is that we have 900 different output labels and our data input size is 500...
We also thought about applying some clustering algorithm first to cluster the professions in, let's say 5 groups, train a NN for every group and then predict.
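Since the stated goal is a list of 10 professions per client, one possible way to use the sigmoid outputs is to rank them and keep the top k, then score that ranking with precision at k instead of a single accuracy figure. The following is only a minimal sketch on toy data: `top_k_labels`, `precision_at_k`, and the small `probs` and `y_true` arrays are hypothetical stand-ins, not part of the original code. With the real model, `probs` would be `model.predict(X_test_tf)`, `y_true` would be `y_test.values`, and k would be 10.

```python
import numpy as np

def top_k_labels(probs, k=10):
    """Indices of the k highest-scoring labels per row, best first."""
    # argsort is ascending, so take the last k columns and reverse them
    return np.argsort(probs, axis=1)[:, -k:][:, ::-1]

def precision_at_k(y_true, probs, k=10):
    """Share of the k suggested labels that are true positives, averaged over rows."""
    top = top_k_labels(probs, k)
    hits = np.take_along_axis(y_true, top, axis=1)
    return hits.sum(axis=1).mean() / k

# Toy stand-in (hypothetical values): 2 clients, 4 professions
probs = np.array([[0.9, 0.1, 0.8, 0.2],
                  [0.3, 0.7, 0.2, 0.6]])
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1]])

print(top_k_labels(probs, k=2))            # ranked label indices per client
print(precision_at_k(y_true, probs, k=2))  # average precision at k=2
```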

Related

what does [1] mean in model.evaluate(X, Y)[1]

The following code is from a textbook called 'Deeplearning for everybody'; it predicts diabetes from the Pima Indians dataset. I wonder what the [1] at the end of the code means.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
import tensorflow as tf
np.random.seed(3)
tf.random.set_seed(3)
# raw string so the backslashes in the Windows path are not treated as escapes
dataset = np.loadtxt(r'.\dataset\pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]
Y = dataset[:, 8]
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, Y, epochs=200, batch_size=10)
print('\n Accuracy: %.4f' % (model.evaluate(X, Y)[1])) # <---------
In Keras, model.evaluate() returns the loss followed by the aggregated values of the metrics passed to compile(). Say you want to measure the loss, the accuracy, and the F1 score on your test data; you would then compile your model something like this: model.compile(optimizer, loss, metrics=['accuracy', custom_f1_function], ...). These values are computed per batch over the dataset and then reduced, usually by averaging. In the end you get a list with three elements: aggregated loss, aggregated accuracy, aggregated F1 score. In your code you are accessing the second element of this list (index 1), namely the accuracy.
(The order in metrics=[...] determines the order in the output list, after the loss!)

Model trained using LSTM is predicting only same value for all

I have a dataset with 4000 rows and two columns. The first column contains some sentences and the second column contains some numbers for it.
There are some 4000 sentences and they are categorized by some 100 different numbers. For example:
Sentences                                           Codes
Google headquarters is in California                87390
Steve Jobs was a great man                          70214
Steve Jobs has done great technology innovations    70214
Google pixel is a very nice phone                   87390
Microsoft is another great giant in technology      67012
Bill Gates founded Microsoft                        67012
Similarly, there are a total of 4000 rows containing these sentences and these rows are classified with 100 such codes
I have tried the code below, but when I predict, it predicts the same value for every input. In other words, y_pred is an array of identical values.
May I know where the code is going wrong?
import pandas as pd
import numpy as np
xl = pd.ExcelFile("dataSet.xlsx")
df = xl.parse('Sheet1')
df = df.sample(frac=1).reset_index(drop=True)# shuffling the dataframe
X = df.iloc[:, 0].values
Y = df.iloc[:, 1].values
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pickle
count_vect = CountVectorizer()
X = count_vect.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(X)
X = X.toarray()
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
y = Y.reshape(-1, 1) # Because Y has only one column
onehotencoder = OneHotEncoder(categories='auto')
Y = onehotencoder.fit_transform(y).toarray()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
inputDataLength = len(X_test[0])
outputDataLength = len(Y[0])
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.layers import Dropout
# fitting the model
embedding_vector_length = 100
model = Sequential()
model.add(Embedding(outputDataLength,embedding_vector_length, input_length=inputDataLength))
model.add(Dropout(0.2))
model.add(LSTM(outputDataLength))
model.add(Dense(outputDataLength, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=20)
y_pred = model.predict(X_test)
invorg = onehotencoder.inverse_transform(y_test)
y_test = labelencoder_Y.inverse_transform(invorg)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
You are using binary_crossentropy even though you have 100 mutually exclusive classes, which is not the right loss for this task. You have to use categorical_crossentropy instead.
Compile your model like this:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Also, you are predicting with the model and converting to class labels like this:
y_pred = model.predict(X_test)
inv = onehotencoder.inverse_transform(y_pred)
y_pred = labelencoder_Y.inverse_transform(inv)
Since your model ends in a softmax layer, you have to take the argmax of the predictions in order to get the class label.
For example, if the prediction was [0.005, 0.004, 0.001, 0.99], argmax gives you 3, the index of the class with the highest probability.
So you have to modify the prediction code like this:
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)
y_pred = labelencoder_Y.inverse_transform(y_pred)
invorg = np.argmax(y_test, axis=1)
invorg = labelencoder_Y.inverse_transform(invorg)
Now you will have the actual class labels in invorg and the predicted class labels in y_pred.
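To make the decoding step concrete, here is a small sketch using scikit-learn's LabelEncoder and three of the codes from the question. The `probs` array is a hypothetical stand-in for what `model.predict(X_test)` would return; it is not real model output.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Codes taken from the example rows in the question
codes = np.array([87390, 70214, 70214, 87390, 67012, 67012])

encoder = LabelEncoder()
class_ids = encoder.fit_transform(codes)  # classes are sorted: 67012, 70214, 87390

# Hypothetical softmax outputs for three sentences (each row sums to 1)
probs = np.array([
    [0.05, 0.15, 0.80],
    [0.10, 0.85, 0.05],
    [0.70, 0.20, 0.10],
])

pred_ids = np.argmax(probs, axis=1)               # index of the most probable class
pred_codes = encoder.inverse_transform(pred_ids)  # map indices back to the original codes
print(pred_codes)
```

Note that the one-hot encoder is not needed on the way back: argmax already yields the integer class index that the LabelEncoder understands.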

Neural network only predicts one class from binary class

My task is to detect defective items in a factory, that is, to tell defective goods from fine goods. Because defective items are very rare, one class dominates the other (it makes up 99.7% of the data). Training accuracy is 0.9971 and validation accuracy is 0.9970, which sounds amazing.
But the problem is that the model predicts class 0 (fine goods) for everything. That means it fails to identify any defective goods.
How can I solve this problem? I have checked other questions and tried their suggestions, but the situation is the same. The dataset has 122400 rows and 5 features.
In the end, my confusion matrix on the test set looks like this:
array([[30508,     0],
       [   92,     0]], dtype=int64)
which does a terrible job.
My code is as below:
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(sparse=False)
y = y.reshape(-1,1)
y = ohe.fit_transform(y)
scaler = StandardScaler()
x = scaler.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.25, random_state = 777)
#DNN Modelling
epochs = 15
batch_size =128
Learning_rate_optimizer = 0.001
model = Sequential()
model.add(Dense(5,
                kernel_initializer='glorot_uniform',
                activation='relu',
                input_shape=(5,)))
model.add(Dense(5,
                kernel_initializer='glorot_uniform',
                activation='relu'))
model.add(Dense(8,
                kernel_initializer='glorot_uniform',
                activation='relu'))
model.add(Dense(2,
                kernel_initializer='glorot_uniform',
                activation='softmax'))
model.compile(loss='binary_crossentropy',
              optimizer=Adam(lr = Learning_rate_optimizer),
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
y_pred = model.predict(x_test)
confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
Thank you
It sounds like you have a highly imbalanced dataset, so the model is learning only how to classify fine goods.
You can try one of the approaches listed here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
A good first attempt is to take roughly equal portions of both classes, split them into train, validation, and test sets, train the classifier, and then test it thoroughly on your complete dataset. You can also apply data augmentation to the rare class to get more samples of it. Keep iterating, and maybe even change your loss function to suit your situation.
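As an illustration of why accuracy is misleading here, and of one common reweighting tactic, the sketch below works only from the confusion matrix in the question. The "balanced" weighting formula (total samples divided by number of classes times class count) is a widely used heuristic that I am swapping in as an example; it is not something the asker used. The counts come from the test set because the training counts are not given, so in practice the weights should be computed from the training labels.

```python
import numpy as np

# Confusion matrix from the question: rows = true class, columns = predicted class
cm = np.array([[30508, 0],
               [   92, 0]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall_defective = cm[1, 1] / cm[1].sum()  # defective items found / all defective items

# "Balanced" class weights: n_samples / (n_classes * count_per_class)
counts = cm.sum(axis=1)                    # true samples per class
weights = counts.sum() / (len(counts) * counts)
class_weight = {i: float(w) for i, w in enumerate(weights)}

print(f"accuracy = {accuracy:.4f}, recall on defective = {recall_defective:.4f}")
print(class_weight)
```

A dictionary of this shape can be passed as the `class_weight` argument of `model.fit`, so that errors on the rare class cost more during training. Whether that is enough for a 99.7% / 0.3% split is something you would have to check with recall or a precision-recall curve rather than accuracy.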

How do I implement multilabel classification neural network with keras

I am attempting to implement a neural network using Keras with a problem that involves multilabel classification. I understand that one way to tackle the problem is to transform it to several binary classification problems. I have implemented one of these, but am not sure how to proceed with the others, mostly how do I go about combining them? My data set has 5 input variables and 5 labels. Generally a single sample of data would have 1-2 labels. It is rare to have more than two labels.
Here is my code (thanks to machinelearningmastery.com):
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("Realdata.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:5].astype(float)
Y = dataset[:,5]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    scores = model.evaluate(X, encoded_Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    #Make predictions....change the model.predict to whatever you want instead of X
    predictions = model.predict(X)
    # round predictions
    rounded = [round(x[0]) for x in predictions]
    print(rounded)
    return model
# evaluate model with standardized dataset
estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Results: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
The approach you are referring to, training one binary classifier per label, is the one-versus-all strategy (often called binary relevance in the multi-label setting). However, when using a neural network, the easiest solution for a multi-label classification problem with 5 labels is to use a single model with 5 output nodes. With Keras:
model = Sequential()
model.add(Dense(5, input_dim=5, kernel_initializer='normal', activation='relu'))
model.add(Dense(5, kernel_initializer='normal', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')
You can provide the training labels as binary-encoded vectors of length 5. For instance, an example that corresponds to classes 2 and 3 would have the label [0 1 1 0 0].
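For illustration, the label vectors can be built and the sigmoid outputs decoded roughly as follows. This is a sketch with NumPy only: the helper `to_multi_hot`, the sample `label_sets`, the 0.5 threshold, and the `probs` values are all assumptions made for the example, not part of the question's data.

```python
import numpy as np

def to_multi_hot(label_sets, n_labels):
    """Turn a list of label-index sets into a binary matrix of shape (n_samples, n_labels)."""
    y = np.zeros((len(label_sets), n_labels), dtype=np.float32)
    for row, labels in enumerate(label_sets):
        y[row, list(labels)] = 1.0
    return y

# Indices are 0-based here, so "classes 2 and 3" become indices 1 and 2
label_sets = [{1, 2}, {0}, {3, 4}]
Y = to_multi_hot(label_sets, n_labels=5)
print(Y[0])  # first sample carries classes 2 and 3

# Hypothetical sigmoid outputs for one sample; each label is thresholded independently
probs = np.array([[0.1, 0.8, 0.7, 0.2, 0.1]])
predicted = (probs >= 0.5).astype(int)
print(predicted)
```

Because every output node is an independent sigmoid, a sample can end up with zero, one, or several labels, which matches the "1-2 labels per sample" situation described in the question.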

Categorical classification in Keras Python

I am doing multi-class classification of 5 classes. I am using Tensorflow with Keras. My code is like this:
# load dataset
dataframe = pandas.read_csv("Data5Class.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:47].astype(float)
Y = dataset[:,47]
print("Load Data.....")
encoder= to_categorical(Y)
def create_larger():
    model = Sequential()
    print("Create Dense Ip & HL 1 Model ......")
    model.add(Dense(47, input_dim=47, kernel_initializer='normal', activation='relu'))
    print("Add Dense HL 2 Model ......")
    model.add(Dense(40, kernel_initializer='normal', activation='relu'))
    print("Add Dense output Model ......")
    model.add(Dense(5, kernel_initializer='normal', activation='sigmoid'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
estimators = []
estimators.append(('rnn', KerasClassifier(build_fn=create_larger, epochs=60, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoder, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
The CSV file I have taken as an input contains the data with labels. The labels are like this 0, 1, 2, 3, 4 which represent 5 different classes.
Since the labels are already integers, do I need to use the LabelEncoder() function in my code?
Also, I have used the to_categorical(Y) function. Should I use it, or should I just pass the Y variable containing these labels to the classifier for training?
I got the error like this:
Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
This error occurred when I used encoder variable in the code
results = cross_val_score(pipeline, X, encoder, cv=kfold) where encoder variable represents the to_categorical(Y) data. How to solve this error?
As mentioned in the Keras documentation:
Note: when using the categorical_crossentropy loss, your targets
should be in categorical format (e.g. if you have 10 classes, the
target for each sample should be a 10-dimensional vector that is
all-zeros except for a 1 at the index corresponding to the class of
the sample). In order to convert integer targets into categorical
targets, you can use the Keras utility to_categorical:
from keras.utils.np_utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=None)
So this means that you need to apply to_categorical() to your y before training with categorical_crossentropy. There is no need to use LabelEncoder if y already contains integer labels.
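As for the error message itself, it appears to come from StratifiedKFold rather than from Keras: stratification needs one class label per sample, and a one-hot matrix looks like a multi-label indicator to scikit-learn. The sketch below reproduces this with scikit-learn alone. The toy `int_labels` and placeholder `X` are hypothetical, and `np.eye(5)[int_labels]` is used as a NumPy stand-in for `to_categorical` on integer labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 20 samples, 5 classes labelled 0..4
int_labels = np.array([0, 1, 2, 3, 4] * 4)
X = np.zeros((len(int_labels), 3))   # placeholder features
one_hot = np.eye(5)[int_labels]      # NumPy stand-in for to_categorical(int_labels)

kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=7)

# Integer labels: stratified splitting works
n_folds_ok = sum(1 for _ in kfold.split(X, int_labels))

# One-hot labels: scikit-learn sees a 'multilabel-indicator' target and refuses
try:
    list(kfold.split(X, one_hot))
    one_hot_error = None
except ValueError as err:
    one_hot_error = str(err)

print(n_folds_ok)
print(one_hot_error)
```

So one way around the error is to pass the integer Y to cross_val_score and keep the one-hot encoding out of the cross-validation call. As far as I know, the legacy KerasClassifier wrapper one-hot encodes 1-D integer targets internally when the loss is categorical_crossentropy, but treat that as an assumption to verify against the version you have installed.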

Resources