How to make prediction with single sample in sklearn model.predict? - scikit-learn

I trained a logistic regression model with some data.
I applied a StandardScaler to the train and test data and trained the model.
But if I want to make predictions with the model on data outside the train and test sets, I have to apply the StandardScaler to the new data. What if I have a single sample? How do I apply the StandardScaler to that one new sample that I want to give as input?
What should the procedure be to predict results on new data, especially a single sample at a time?

The predict() method always expects a 2D array of shape (n_samples, n_features). This means that if you want to predict even for a single data point, you will have to convert it into a 2D array.
Converting data into a 2D array using reshape
import numpy as np

# Sample data: a single observation as a 1D array
arr = np.array([1, 2, 3, 4])
print(arr)
# [1 2 3 4]

# Reshaping into 2D: one row, as many columns as needed
arr_2d = arr.reshape(1, -1)
print(arr_2d)
# [[1 2 3 4]]
This array can now be transformed with the already-fitted StandardScaler's transform() method before being passed to the model for a prediction.
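Putting it together, a minimal sketch of the whole procedure, assuming X_train and y_train already exist (the names scaler and new_sample are illustrative, not from the original post). The key point is to reuse the scaler fitted on the training data via transform(); you never fit a new scaler on the single sample:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Train the model on the scaled training data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# A single new sample: reshape to 2D, then reuse the fitted scaler
new_sample = np.array([1, 2, 3, 4]).reshape(1, -1)
new_sample_scaled = scaler.transform(new_sample)  # transform only, no fitting
prediction = model.predict(new_sample_scaled)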

Related

LSTM input shape through json file

I am working on an LSTM, and after pre-processing the data I get the data X as a list of 100 samples; each sample contains 3 lists of features, and each feature list contains a sequence of 50 points.
X = [list:100 [list:3 [list:50]]]
Y = [list:100]
Since it's a multivariate LSTM, I am not sure how to give all 3 sequences as input to the Keras LSTM. Do I need to convert X to a Pandas DataFrame?
model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32,
                                           input_shape=(?, ?, ?))))
You can do the following to convert the lists into NumPy arrays:
X = np.array(X)
Y = np.array(Y)
Calling the following after this conversion:
print(X.shape)
print(Y.shape)
should output: (100, 3, 50) and (100,), respectively. Finally, the input_shape of the LSTM layer can be (None, 50).
LSTM Call arguments Doc:
inputs: A 3D tensor with shape [batch, timesteps, feature].
You would have to transform that list into a numpy array to work with Keras.
As per the shape of X you have provided, it should work in theory. However, you do have to figure out what the 3 dimensions of your array actually contain.
The 1st dimension is the batch dimension, i.e. how many samples of data you have.
The 2nd dimension is your timestep data.
Ex: for the words in a sentence such as "cat sat on dog", 'cat' is timestep 1, 'sat' is timestep 2, 'on' is timestep 3, and so on.
The 3rd dimension represents the features of your data at each timestep. For our sentence earlier, we would vectorize each word into a fixed-length feature vector, as in the sketch below.
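Assuming the dimensions line up as (samples, timesteps, features) = (100, 3, 50), here is a minimal sketch of the conversion and model definition (the layer sizes, the sigmoid output and the binary loss are illustrative assumptions, since the post does not say what Y contains):
import numpy as np
from keras import models, layers

X = np.array(X)  # shape: (100, 3, 50)
Y = np.array(Y)  # shape: (100,)

model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32),
                               input_shape=(3, 50)))  # (timesteps, features)
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, Y, epochs=5, batch_size=16)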

Imbalance Learn "balanced_batch_generator" for > 2 dimensional data

I am using imbalanced-learn's balanced_batch_generator to try to perform undersampling on an image array which has 4 dimensions. I ran the code below:
training_generator, steps_per_epoch = balanced_batch_generator(x_train, y_train, sampler=NearMiss(), batch_size=10)
and got the following error:
ValueError: Found array with dim 4. Estimator expected <= 2.
I am aware that this function does not accept data with more than 2 dimensions, but I am wondering if there is a workaround. I could perform the under-/over-sampling myself by manually splitting the data, but I want to make use of imbalanced-learn's nicely implemented samplers, such as NearMiss, to intelligently sample my data.
You need to reshape your x_train array to 2D first:
x_train.shape # so you know the values of all 4 dims (1st, 2nd, 3rd, 4th)
x_train = x_train.reshape(x_train.shape[0], -1)
Then use the balanced_batch_generator. Afterwards, take a batch from the generator and reshape it back to the original dimensions:
for x_train, y_train in training_generator:
    break
x_train = x_train.reshape(x_train.shape[0], dim2, dim3, dim4) # the original values of the 2nd, 3rd and 4th dims
Your x_train now contains a balanced batch with its initial dimensions. You don't need to reshape y_train because it only contains the labels (dim <= 2).
If you do not want to type in the dimension values every time, store them in variables, as in the sketch below.
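A minimal end-to-end sketch under these assumptions (the array shapes, the labels and the dim2/dim3/dim4 names are illustrative):
import numpy as np
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import NearMiss

# Illustrative 4D image data: (samples, height, width, channels)
x_train = np.random.rand(100, 32, 32, 3)
y_train = np.array([0] * 90 + [1] * 10)  # imbalanced labels

# Remember the original dimensions, then flatten to 2D
_, dim2, dim3, dim4 = x_train.shape
x_flat = x_train.reshape(x_train.shape[0], -1)

training_generator, steps_per_epoch = balanced_batch_generator(
    x_flat, y_train, sampler=NearMiss(), batch_size=10)

# Each yielded batch is 2D; reshape it back to 4D before feeding it to a model
for x_batch, y_batch in training_generator:
    x_batch = x_batch.reshape(x_batch.shape[0], dim2, dim3, dim4)
    break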

Embedding vs inserting word vectors directly to input layer

I used gensim to build a word2vec embedding of my corpus.
Currently I'm converting my (padded) input sentences to the word vectors using the gensim model.
These vectors are used as input for the model.
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(MAX_SEQUENCE_LENGTH, dim)))
model.add(Bidirectional(
LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
...
model.fit(training_sentences_vectors, training_labels, validation_data=validation_data)
Are there any drawbacks using the word vectors directly without a keras embedding layer?
I'm also currently adding additional (one-hot encoded) tags to the input tokens by concatenating them to each word vector, does this approach make sense?
In your current setup, the drawback is that you will not be able to set your word vectors to be trainable, so you cannot fine-tune them for your task.
What I mean by this is that gensim has only learned a "language model": it understands your corpus and its contents, but it does not know how to optimize for whatever downstream task you are using Keras for. You will likely see an increase in performance if you extract the embeddings from gensim, use them to initialize a Keras embedding layer, and then pass in indices instead of word vectors at your input layer.
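As a sketch of that approach (gensim_model, vocab_size and dim are illustrative names; wv.vectors is the embedding matrix attribute in gensim 3.x, and passing weights=[...] at construction is a Keras 2.x pattern for initializing a layer):
from keras.layers import Embedding

# Embedding matrix learned by gensim: shape (vocab_size, dim)
weights = gensim_model.wv.vectors
vocab_size, dim = weights.shape

# Initialize a trainable Keras embedding layer with the gensim vectors,
# so they can be fine-tuned for the downstream task
embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=dim,
                            weights=[weights],
                            trainable=True)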
There's an elegant way to do what you need.
The problems with your solution are that:
the size of the input is large, (batch_size, MAX_SEQUENCE_LENGTH, dim), and may not fit in memory;
you won't be able to train and update the word vectors for your task.
You can instead get away with just: (batch_size, MAX_SEQUENCE_LENGTH). The keras embedding layer allows you to pass in a word index and get a vector. So, 42 -> Embedding Layer -> [3, 5.2, ..., 33].
Conveniently, gensim's w2v model has a function get_keras_embedding which creates the needed embedding layer for you with the trained weights.
gensim_model = ...  # train it or load it
embedding_layer = gensim_model.wv.get_keras_embedding(train_embeddings=True)
embedding_layer.mask_zero = True # No need for a masking layer
model = Sequential()
model.add(embedding_layer) # your embedding layer
model.add(Bidirectional(
LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
But you have to make sure the index for a word in your data is the same as its index in the word2vec model:
word2index = {}
for index, word in enumerate(gensim_model.wv.index2word):
    word2index[word] = index
Use the above word2index dictionary to convert your input data to have the same index as the gensim model.
For example, your data might be:
X_train = [["hello", "there"], ["General", "Kenobi"]]
new_X_train = []
for sent in X_train:
    temp_sent = []
    for word in sent:
        temp_sent.append(word2index[word])
    # Add the padding for each sentence. Here I am padding with 0
    temp_sent += [0] * (MAX_SEQUENCE_LENGTH - len(temp_sent))
    new_X_train.append(temp_sent)
X_train = numpy.asarray(new_X_train)
Now you can use X_train, and it will look like [[23, 34, 0, 0], [21, 63, 0, 0]] (here with MAX_SEQUENCE_LENGTH = 4).
The Embedding Layer will map the index to that vector automatically and train it if needed.
I think this is the best way of doing it but I'll dig into how gensim wants it to be done and update this post if needed.

Keras SimpleRNN - Shape MFCC vectors

I'm currently trying to implement a recurrent neural network in Keras. The data consists of a collection of 45,000 entries, where each entry is a sequence (of variable length) of MFCC vectors with 13 coefficients each:
spoken = numpy.load('spoken.npy')
print(spoken[0]) # Gives:
example_row = [
[
5.67170000e-01 -1.79430000e-01 -7.27360000e+00 -9.59300000e-02
-9.30140000e-02 -1.62960000e-01 4.11620000e-01 3.00590000e-01
6.86360000e-02 1.07130000e+00 1.07090000e-01 5.00890000e-01
7.51750000e-01],
[.....]
]
print(spoken.shape) # Gives: (45000,)
print(spoken[0].shape) # Gives (N, 13) --> N amount of MFCC vectors
I'm struggling to understand how I need to reshape this Numpy array in order to feed it to the SimpleRNN of Keras:
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=?))
.....
Therefore, my question is how do I need to reshape a collection of variable length MFCC vectors so that I can feed it to the SimpleRNN object of Keras?
It was actually quite simple, since Keras has a built-in function for reformatting the array and padding zeros to get a static length:
spoken_train = pad_sequences(spoken_train, maxlen=100)
See github issue
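For completeness, a hedged sketch of the padding plus model definition: dtype='float32' matters because pad_sequences casts to int32 by default, which would truncate the MFCC coefficients; maxlen=100 and the layer sizes are illustrative.
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

# Pad/truncate every entry to 100 MFCC vectors of 13 coefficients each
spoken_train = pad_sequences(spoken_train, maxlen=100, dtype='float32')

model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=(100, 13)))
model.add(Dense(1, activation='sigmoid'))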

LSTM's for Binary classification in Keras?

Suppose I have the following dataset X with 2 features and labels Y.
X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]
Y = [0, 1, 0]
# split into input (X) and output (Y) variables
X = dataset[:, 0:2] # X features are the first two columns
Y = dataset[:, 2]
model = Sequential()
model.add(Embedding(2, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y)
It works, but I wanted to know more about parameter_1, parameter_2, parameter_3 that go in
Embedding(parameter_1, parameter_2, input_length=parameter_3)
P.S, I just put in random stuff and don't know what I am doing.
What would be the proper parameters to fill in Embedding() given the data set I described above?
Alright, following the more precise questions in the comments, here is the explanation.
An embedding layer is usually used to embed words, so I will use a running example with words, but you can think of them as categorical features.
The embedding layer is useful indeed to represent words (categorical features) as vectors in a continuous vector space.
When you have a text, you tokenize the words and assign each a number. They then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3], where a dictionary maps each index to a word: {1: "embed", 2: "I", 3: "stuff", 4: "some_other_words", 0: "<pad>"}.
When you use a neural network or any other continuous mathematical framework, those discrete objects (= categories) are unordered: there is no sense in 2 > 1 when you talk about your words; those are not "numerical values", they are categories. So you want to turn them into numbers, i.e. to embed them in a vector space.
This is precisely what the Embedding() layer does: it maps every index to a vector. To do that, there are three main parameters to define:
How many indices you want to use in total. This is the number of words you have in your vocabulary, or the number of categories that the categorical feature you want to encode has. This is the input_dim parameter. In our little example, we have 5 words in the vocabulary (indices 0 to 4), so input_dim = 5. The reason it is called a "dimension" is that, under the hood, Keras transforms the index into a one-hot vector of dimension = the number of different elements. For example, the word "stuff", which has index 3, is transformed into the 5-dimensional vector [0 0 0 1 0] before being embedded. This is why your inputs should be integers: they are indices representing where the 1 is in the one-hot vector.
How big you want your output vectors to be. This is the size of the vector space your features will live in. The parameter is output_dim. If you don't have a lot of words in your vocabulary (not many different categories for your features), this number should be low; in our case we will set output_dim = 2, so our 5 words will live in a 2D space.
As embedding layers are often the first layer in a neural network, you need to specify the number of words per sample. This will be input_length. Our sample was a 3-word phrase, so input_length = 3.
The reason you usually have the embedding layer as the first layer is that it takes integer inputs; other layers in neural networks return real values, so it wouldn't work further in.
So to summarize: what comes into the layer is a sequence of indices, [2, 1, 3] in our example, and what comes out is the embedded vector corresponding to each index. That might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
And to come back to your example, the input should be a list of samples, each sample being a list of indices. There is no use embedding features that are already real values: they already have a mathematical meaning the model can understand, as opposed to categorical values. To make this concrete, see the sketch below.
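A minimal sketch with the toy vocabulary above (the layer is randomly initialized, so the printed vectors will differ from the example values):
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
# 5 indices in the vocabulary, embedded in a 2D space, 3 words per sample
model.add(Embedding(input_dim=5, output_dim=2, input_length=3))
model.compile('rmsprop', 'mse')

# "I embed stuff" -> [2, 1, 3]; one sample of three indices
out = model.predict(np.array([[2, 1, 3]]))
print(out.shape)  # (1, 3, 2): one sample, 3 timesteps, 2-dim vectors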
Is it clearer now? :)
