Train an autoencoder in script mode on AWS SageMaker - Keras

I want to train an autoencoder using Keras where X_train is an m×n matrix and y_train is also an m×n matrix.
For example:
X_train = np.array([[1, 2],
                    [3, 4]])
y_train = np.array([[5, 6],
                    [7, 8]])
I concatenate the two matrices into train_set and save it to a single file, training.npy:
train_set = np.concatenate([X_train, y_train], axis=1)
print(train_set)
array([[1, 2, 5, 6],
       [3, 4, 7, 8]])
Later I upload it to S3:
training_path_input = sess.upload_data('/tmp/training.npy', key_prefix=prefix+'/training')
Now when I fit the model
model.fit({'train': training_path_input })
I wonder how the estimator will find the indices for X_train and y_train, since y_train is not a vector as in other cases. Is there any way to specify this in the fit() method?
Or is there an alternative way to do it?

The fit method does two things: (1) it copies your data from training_path_input (on S3) to /opt/ml/input/data/<channel> on the SageMaker training instance (/opt/ml/input/data/train in your case), and (2) it launches your code with any hyperparameters you specified. You need to make sure your training code knows how to read the type of files being copied to the machine: it must include code that reads the copied files locally.
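For example, here is a minimal sketch of what the training script could do, assuming the file name training.npy from the question and that X_train occupies the first two columns of the concatenated array (SM_CHANNEL_TRAIN is the environment variable SageMaker sets for the train channel):
import os
import numpy as np

# Directory where SageMaker copied the 'train' channel data
train_dir = os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')
train_set = np.load(os.path.join(train_dir, 'training.npy'))

# Undo the concatenation; assumes you know how many columns belong to X_train
n_x_cols = 2
X_train = train_set[:, :n_x_cols]
y_train = train_set[:, n_x_cols:]

# ...build the autoencoder and call model.fit(X_train, y_train, ...) as usual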

Related

How do I solve a ValueError in TensorFlow?

I am running a CNN in Google Colab using TensorFlow/Keras. However, I received this error:
Negative dimension size caused by subtracting 3 from 2 for '{{node conv2d_11/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true](Placeholder, conv2d_11/Conv2D/ReadVariableOp)' with input shapes: [?,2,2,394], [3,3,394,394].
Call arguments received:
• inputs=tf.Tensor(shape=(None, 2, 2, 394), dtype=float32)
does this have to do with my input data or my parameters? Thanks
Try inserting a number instead of using None when specifying the shape. As said in the documentation here, you run into not-fully-specified shapes when using None.
In this case you have defined a model that contains many MaxPool or AvgPool layers, so as the images pass through the model's layers their spatial size keeps decreasing. I think it would help to set the padding parameter of the convolution layers to 'same'; for more details, read about the strides and padding parameters of conv layers.
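For illustration, a minimal sketch of that suggestion, using the 2x2x394 feature-map shape from the error message (the other layer sizes are placeholders, not the asker's actual model):
from tensorflow.keras import layers, models

model = models.Sequential([
    # padding='same' keeps the 2x2 spatial size, so a 3x3 kernel no longer fails
    layers.Conv2D(394, (3, 3), padding='same', activation='relu',
                  input_shape=(2, 2, 394)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax'),   # placeholder number of classes
])
model.summary()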

PyTorch: How to do inference in batches (inference in parallel)

How do I do inference in batches in PyTorch? And how can I run inference in parallel to speed up that part of the code?
I've started with the standard way of doing inference:
with torch.no_grad():
    for inputs, labels in dataloader['predict']:
        inputs = inputs.to(device)
        output = model(inputs)
        output = output.to(device)
I've done some research, and the only mention of doing inference in parallel (on the same machine) seems to be with the Dask library: https://examples.dask.org/machine-learning/torch-prediction.html
I'm currently trying to understand that library and create a working example. In the meantime, do you know of a better way?
In PyTorch, input tensors always have the batch dimension as the first dimension, so inference in batches is the default behavior; you just need to make the batch dimension larger than 1.
For example, if your single input is [1, 1], its input tensor is [[1, 1]] with shape (1, 2). If you have two inputs, [1, 1] and [2, 2], you generate the input tensor [[1, 1], [2, 2]] with shape (2, 2). This is usually done in the batch generator, such as your dataloader.
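As a minimal, self-contained sketch of that idea (the data, model, and batch size below are placeholders, not anything from the question):
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(2, 3).to(device)      # placeholder model
model.eval()

inputs = torch.randn(1000, 2)                 # placeholder data: 1000 samples, 2 features
loader = DataLoader(TensorDataset(inputs), batch_size=64)  # each forward pass handles 64 samples

outputs = []
with torch.no_grad():
    for (batch,) in loader:
        batch = batch.to(device)
        outputs.append(model(batch).cpu())    # move results back to the CPU as you go
outputs = torch.cat(outputs)                  # shape (1000, 3)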

Multilabel classification of a sequence, how to do it?

I am quite new to the deep learning field, especially Keras. I have a simple classification problem and I don't know how to solve it. What I don't understand is the general process of classification: converting the input data into tensors, the labels, and so on.
Let's say we have three classes, 1, 2, 3.
There is a sequence of classes that need to be classified as one of those classes. The dataset is for example
Sequence 1, 1, 1, 2 is labeled 2
Sequence 2, 1, 3, 3 is labeled 1
Sequence 3, 1, 2, 1 is labeled 3
and so on.
This means the input dataset will be
[[1, 1, 1, 2],
[2, 1, 3, 3],
[3, 1, 2, 1]]
and the label will be
[[2],
[1],
[3]]
Now one thing that I do understand is to one-hot encode the class. Because we have three classes, every 1 will be converted into [1, 0, 0], 2 will be [0, 1, 0] and 3 will be [0, 0, 1]. Converting the example above will give a dataset of 3 x 4 x 3, and a label of 3 x 1 x 3.
Another thing that I understand is that the last layer should be a softmax layer. This way, if test data (e.g. [1, 2, 3, 4]) comes in, it will be softmaxed and the probabilities of this sequence belonging to class 1, 2, or 3 will be calculated.
Am I right? If so, can you give me an explanation/example of the process of classifying these sequences?
Thank you in advance.
Here are a few clarifications that you seem to be asking about.
If your input data has the shape (4,), then your input tensor will have the shape (batch_size, 4).
Softmax is the correct activation for your prediction (last) layer given your desired output, because you have a classification problem with multiple classes. This yields output of shape (batch_size, 3): the probabilities of each potential class, summing to one across all classes. For example, if the prediction is class 0, a single prediction might look something like [0.9714, 0.01127, 0.01733].
Batch size isn't hard-coded into the network, which is why it is shown in model.summary() as None; e.g. the network's last-layer output shape can be written (None, 3).
Unless you have an applicable alternative, a softmax prediction layer requires a categorical_crossentropy loss function.
The architecture of a network remains up to you, but you'll at least need a way in and a way out. In Keras (as you've tagged), there are a few ways to do this. Here are some examples:
Example with Keras Sequential:
from keras.models import Sequential
from keras.layers import Dense, InputLayer

model = Sequential()
model.add(InputLayer(input_shape=(4,)))    # sequence of length four
model.add(Dense(3, activation='softmax'))  # three possible classes
Example with the Keras Functional API:
from keras.models import Model
from keras.layers import Dense, Input

input_tensor = Input(shape=(4,))
x = Dense(3, activation='softmax')(input_tensor)
model = Model(input_tensor, x)
Example specifying the input shape directly in the first layer instead of a separate input layer (works with either Sequential or Functional models):
model = Sequential()
model.add(Dense(666, activation='relu', input_shape=(4,)))
model.add(Dense(3, activation='softmax'))
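To tie it together, here is a minimal sketch of compiling and fitting a small Sequential model like the ones above on the toy sequences from the question, feeding the raw class ids as input; the optimizer and epoch count are arbitrary choices:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Toy data from the question: sequences of length 4, labels in {1, 2, 3}
X = np.array([[1, 1, 1, 2],
              [2, 1, 3, 3],
              [3, 1, 2, 1]], dtype='float32')
y = to_categorical(np.array([2, 1, 3]) - 1, num_classes=3)  # shift labels to 0..2, one-hot -> shape (3, 3)

model = Sequential()
model.add(Dense(3, activation='softmax', input_shape=(4,)))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=10)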
Hope that helps!

Keras model: the list of Numpy arrays that you are passing to your model is not the size the model expected

I have a problem with a Keras model.
I get this error:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 165757 arrays: [array([[0],
[1]]), array([[2],
[3]]), array([[4],
[5]]), array([[6],
[7]]), array([[8],
[9]]), array([[10],
[11]]), array([[12],
[13]]), array([[14],
...
It occurs in the training part:
model.fit(X_train, y_train,
          batch_size=64,
          epochs=7,
          validation_data=(X_dev, y_dev),
          verbose=1)
scores = model.evaluate(X_test, y_test, batch_size=64)
print("Accuracy is: %.2f%%" % (scores[1] * 100))
The problem is in X_train. My data consists of pairs of words that may or may not be related to each other. I represented the words as IDs:
[[0, 1],
[2, 3],
[4, 5],
[4, 6],
[7, 8]]
According to the error, the model wants the data to be a single array. The problem is that I need to pass pairs. Does anyone know what to do in this case?
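One common shape of a fix, sketched here with placeholder data standing in for the real 165757 pairs, is to stack the list of per-pair arrays into a single NumPy array so Keras sees one input:
import numpy as np

# Placeholder list mimicking the arrays shown in the traceback, each of shape (2, 1)
pairs = [np.array([[0], [1]]), np.array([[2], [3]]), np.array([[4], [5]])]

X_train = np.array(pairs)                   # shape (3, 2, 1); (165757, 2, 1) for the real data
X_train = X_train.reshape(len(X_train), 2)  # one row per word-id pair
print(X_train.shape)                        # (3, 2)
The model's input layer would then be expected to take shape (2,) per sample.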

scikit learn: train_test_split, can I ensure same splits on different datasets

I understand that the train_test_split method splits a dataset into random train and test subsets, and that using random_state=int ensures we get the same split on a dataset each time the method is called.
My problem is slightly different.
I have two datasets, A and B. They contain identical sets of examples, and the order in which these examples appear in each dataset is also identical. But the key difference is that the examples in each dataset use a different set of features.
I would like to test whether the features used in A lead to better performance than the features used in B. So I would like to ensure that when I call train_test_split on A and B, I get the same splits on both datasets, so that the comparison is meaningful.
Is this possible? Do I simply need to ensure that the random_state in both method calls is the same?
Thanks
Yes, using the same random_state is enough.
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X,X))
>>> X_train, X_test, _, _ = train_test_split(X,y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2,y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])
Looking at the code for the train_test_split function, it seeds the random number generator from random_state inside the function on every call, so it will produce the same split every time. We can check that this works pretty simply:
import numpy as np
from sklearn import model_selection

X1 = np.random.random((200, 5))
X2 = np.random.random((200, 5))
y = np.arange(200)
X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(X1, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(X2, y,
                                                                        test_size=0.1,
                                                                        random_state=42)
print(np.all(y1_train == y2_train))
print(np.all(y1_test == y2_test))
Which outputs:
True
True
Which is good! Another way to handle this problem is to create one train/test split on all of your features and then split the features up before training. However, if you're in a situation where you need to do both at once (for example with similarity matrices, where you don't want test features in your training set), you can use the StratifiedShuffleSplit function to return the indices of the data that belong to each set. For example:
n_splits = 1
sss = model_selection.StratifiedShuffleSplit(n_splits=n_splits,
                                             test_size=0.1,
                                             random_state=42)
train_idx, test_idx = list(sss.split(X, y))[0]
Since sklearn.model_selection.train_test_split(*arrays, **options) accepts a variable number of arguments, you can simply do this:
A_train, A_test, B_train, B_test, _, _ = train_test_split(A, B, y,
                                                          test_size=0.33,
                                                          random_state=42)
As mentioned above, you can use the random_state parameter. But if you want to generate the same results globally, i.e. set the random state for all future calls, you can use:
np.random.seed(42)  # any integer seed
