Keras input not matching output confusion

I am trying to implement a silly learn to rank example. Essentially, I have 2 descriptions of a location, size and number of bathrooms. I want to "combine" them to create a score. Then I wish to compare the scores for the "best". I will always be comparing 3 locations at a time.
The neural network I expect to do this:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, Lambda, Concatenate
from tensorflow.keras.models import Model

# 3 locations with 2 descriptions.
rinputs = Input(shape=(3, 2), name='inputlayer')
# take my 3 expected inputs, split them
split = Lambda( lambda x: tf.split(x,num_or_size_splits=3,axis=1))(rinputs)
input_one_tensor = split[0]
input_two_tensor = split[1]
input_three_tensor = split[2]
# combine each set of location elements into 1 "score"
layer2 = Dense(1, name = 'Layer2', use_bias = True, activation = 'sigmoid') # 60 was better than 100
layer2a = layer2(input_one_tensor)
layer2b = layer2(input_two_tensor)
layer2c = layer2(input_three_tensor)
concatLayer = Concatenate(name = 'ConcatLayer2')([layer2a,layer2b, layer2c])
# softmax my score to get "best selection"
softmaxLayer = Dense(3, activation='softmax', name = 'softmax', use_bias = False)
softmaxLayer = softmaxLayer(concatLayer)
model = Model(inputs=rinputs, outputs=softmaxLayer)
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(),metrics=['accuracy'])
I now create my test data:
loc1 = [1, 5]
loc2 = [4, 1]
loc3 = [6, 7]
# create two entries for my trial run
inputs = np.asarray([[loc1, loc2, loc3], [loc3,loc3,loc1]]).reshape(2,3,2)
ytrue = np.asarray([[1, 0, 0], [0, 0, 1]]).reshape(2,3)
model.fit(inputs, ytrue,verbose=True,)
But then I get the following error about my outputs. That I am not understanding.
File "/.virtualenvs/python310/lib/python3.10/site-packages/keras/losses.py", line 1990, in categorical_crossentropy
return backend.categorical_crossentropy(
File "/.virtualenvs/python310/lib/python3.10/site-packages/keras/backend.py", line 5529, in categorical_crossentropy
target.shape.assert_is_compatible_with(output.shape)
ValueError: Shapes (None, 3) and (None, 1, 3) are incompatible
I'm not entirely understanding why the shapes don't match. I expect my softmax layer to output 3 numbers that sum to 1 and can be compared to my ytrue.
Any insights appreciated.

Just from the model architecture itself, it seems you need two-dimensional data to be fed into Layer2.
One may use a Reshape/Flatten layer to fix it.
By reshaping the output of the Lambda layer from (None, 1, 2) to (None, 2), the final output's shape becomes compatible as well: (None, 3).
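For instance, a minimal sketch of that fix (only the lines that change from the question's model; a Flatten() layer would work just as well here):
from tensorflow.keras.layers import Reshape
split = Lambda(lambda x: tf.split(x, num_or_size_splits=3, axis=1))(rinputs)
reshape = Reshape((2,))  # (None, 1, 2) -> (None, 2), shared across the three splits
input_one_tensor = reshape(split[0])
input_two_tensor = reshape(split[1])
input_three_tensor = reshape(split[2])
# the rest of the model (Layer2, Concatenate, softmax) stays unchanged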
Additional notes:
As an example borrowed (with some modifications) from the TensorFlow website, let's assume we want to split an input tensor of shape (3, 2) into 3 smaller tensors along axis=0:
x = tf.Variable(tf.random.uniform([3, 2], -1, 1))
s0, s1, s2 = tf.split(x, num_or_size_splits=3, axis=0)
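The printed tensors are omitted here; their values depend on the random initialization, but the shapes are fixed. Checking them (a small addition to the snippet above):
print(s0.shape, s1.shape, s2.shape)  # (1, 2) (1, 2) (1, 2)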
Now we can see that each split has shape (1, 2), i.e. a 2D tensor consistent with the tensor it was derived from, and not a vector of shape (2,). In the context of your problem, with a batch dimension, that would be (None, 1, 2).

Related

How does calculation in a GRU layer take place

So I want to understand exactly how the outputs and hidden state of a GRU cell are calculated.
I obtained the pre-trained model from here and the GRU layer has been defined as nn.GRU(96, 96, bias=True).
I looked at the PyTorch documentation and confirmed the dimensions of the weights and biases as:
weight_ih_l0: (288, 96)
weight_hh_l0: (288, 96)
bias_ih_l0: (288)
bias_hh_l0: (288)
My input size and output size are (1000, 8, 96). I understand that there are 1000 tensors, each of size (8, 96). The hidden state is (1, 8, 96), which is one tensor of size (8, 96).
I have also printed the variable batch_first and found it to be False. This means that:
Sequence length: L=1000
Batch size: B=8
Input size: Hin=96
Now going by the equations from the documentation, for the reset gate, I need to multiply the weight by the input x. But my weights are 2-dimensional and my input has 3 dimensions.
Here is what I've tried, I took the first (8, 96) matrix from my input and multiplied it with the transpose of my weight matrix:
Input (8, 96) x Weight (96, 288) = (8, 288)
Then I add the bias by replicating the (288) eight times to give (8, 288). This would give the size of r(t) as (8, 288). Similarly, z(t) would also be (8, 288).
This r(t) is used in n(t); since the Hadamard product is used, both matrices being multiplied have to be the same size, that is (8, 288). This implies that n(t) is also (8, 288).
Finally, h(t) is computed from Hadamard products and matrix addition, which would give the size of h(t) as (8, 288), which is wrong.
Where am I going wrong in this process?
TLDR; This confusion comes from the fact that the layer's weights and biases are the concatenation of the three gates' parameters (reset, update, and new), for both the input-hidden and the hidden-hidden terms.
- nn.GRU layer weight/bias layout
You can take a closer look at what's inside the GRU layer implementation torch.nn.GRU by peeking through its weights and biases.
>>> gru = nn.GRU(input_size=96, hidden_size=96, num_layers=1)
First the parameters of the GRU layer:
>>> gru._all_weights
[['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']]
You can look at gru.state_dict() to get the dictionary of weights of the layer.
We have two weights and two biases, _ih stands for 'input-hidden' and _hh stands for 'hidden-hidden'.
For more efficient computation the parameters have been concatenated together, as the documentation page clearly explains (| means concatenation). In this particular example num_layers=1 and k=0:
GRU.weight_ih_l[k] – the learnable input-hidden weights of the layer (W_ir | W_iz | W_in), of shape (3*hidden_size, input_size).
GRU.weight_hh_l[k] – the learnable hidden-hidden weights of the layer (W_hr | W_hz | W_hn), of shape (3*hidden_size, hidden_size).
GRU.bias_ih_l[k] – the learnable input-hidden bias of the layer (b_ir | b_iz | b_in), of shape (3*hidden_size).
GRU.bias_hh_l[k] – the learnable hidden-hidden bias of the layer (b_hr | b_hz | b_hn), of shape (3*hidden_size).
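With hidden_size=96 this matches the shapes listed in the question (3*96 = 288); a quick check on the gru layer defined above:
>>> gru.weight_ih_l0.shape, gru.weight_hh_l0.shape
(torch.Size([288, 96]), torch.Size([288, 96]))
>>> gru.bias_ih_l0.shape, gru.bias_hh_l0.shape
(torch.Size([288]), torch.Size([288]))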
For further inspection we can get those split up with the following code:
>>> H_in = 96  # here input_size == hidden_size == 96, so each chunk has 96 rows
>>> W_ih, W_hh, b_ih, b_hh = gru._flat_weights
>>> W_ir, W_iz, W_in = W_ih.split(H_in)
>>> W_hr, W_hz, W_hn = W_hh.split(H_in)
>>> b_ir, b_iz, b_in = b_ih.split(H_in)
>>> b_hr, b_hz, b_hn = b_hh.split(H_in)
Now we have the 12 tensor parameters sorted out.
- Expressions
The four expressions for a GRU layer: r_t, z_t, n_t, and h_t, are computed at each timestep.
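For reference, the update equations from the PyTorch documentation, written in the same notation as the code below (σ is the sigmoid, * the element-wise/Hadamard product, h_(t-1) the previous hidden state):
r_t = σ(W_ir@x_t + b_ir + W_hr@h_(t-1) + b_hr)
z_t = σ(W_iz@x_t + b_iz + W_hz@h_(t-1) + b_hz)
n_t = tanh(W_in@x_t + b_in + r_t*(W_hn@h_(t-1) + b_hn))
h_t = (1 - z_t)*n_t + z_t*h_(t-1)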
The first operation is r_t = σ(W_ir@x_t + b_ir + W_hr@h + b_hr). I used the @ sign to designate the matrix multiplication operator (__matmul__). Remember W_ir is shaped (hidden_size, input_size) while x_t contains the element at step t from the x sequence: the tensor x_t = x[t] is shaped (N=batch_size, H_in=input_size). At this point, it's simply a matrix multiplication between the input x[t] and the weight matrix. The resulting tensor r is shaped (N, hidden_size=H_out):
>>> (x[t]@W_ir.T).shape
(8, 96)
The same is true for all other weight multiplication operations performed. As a result, you end up with an output tensor shaped (N, H_out=hidden_size).
In the following expressions h is the tensor containing the hidden state of the previous step for each element in the batch, i.e. shaped (N, hidden_size=H_out), since num_layers=1, i.e. there's a single hidden layer.
>>> r_t = torch.sigmoid(x[t]@W_ir.T + b_ir + h@W_hr.T + b_hr)
>>> r_t.shape
(8, 96)
>>> z_t = torch.sigmoid(x[t]@W_iz.T + b_iz + h@W_hz.T + b_hz)
>>> z_t.shape
(8, 96)
The output of the layer is the concatenation of the computed h tensors at
consecutive timesteps t (between 0 and L-1).
- Demonstration
Here is a minimal example of an nn.GRU inference manually computed:
Parameter | Description       | Value
H_in      | feature size      | 3
H_out     | hidden size       | 2
L         | sequence length   | 3
N         | batch size        | 1
k         | number of layers  | 1
Setup:
import torch
from torch import nn

H_in, H_out, L, N, k = 3, 2, 3, 1, 1  # values from the table above

gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=k)
W_ih, W_hh, b_ih, b_hh = gru._flat_weights
W_ir, W_iz, W_in = W_ih.split(H_out)
W_hr, W_hz, W_hn = W_hh.split(H_out)
b_ir, b_iz, b_in = b_ih.split(H_out)
b_hr, b_hz, b_hn = b_hh.split(H_out)
Random input:
x = torch.rand(L, N, H_in)
Inference loop:
output = []
h = torch.zeros(1, N, H_out)
for t in range(L):
    r = torch.sigmoid(x[t]@W_ir.T + b_ir + h@W_hr.T + b_hr)
    z = torch.sigmoid(x[t]@W_iz.T + b_iz + h@W_hz.T + b_hz)
    n = torch.tanh(x[t]@W_in.T + b_in + r*(h@W_hn.T + b_hn))
    h = (1-z)*n + z*h
    output.append(h)
The final output is given by stacking the tensors h at consecutive timesteps:
>>> torch.vstack(output)
tensor([[[0.1086, 0.0362]],
[[0.2150, 0.0108]],
[[0.3020, 0.0352]]], grad_fn=<CatBackward>)
In this case the output shape is (L, N, H_out), i.e. (3, 1, 2).
Which you can compare with output, _ = gru(x).
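As a sanity check (a small addition, assuming the same gru, x, and output as above; nn.GRU's default initial hidden state is zero, matching the loop):
out_ref, h_ref = gru(x)
print(torch.allclose(torch.vstack(output), out_ref, atol=1e-6))  # expected: True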

Input problem with siamese network with customize datagenerator

Hello everyone and thank you in advance for the help.
I'm trying to implement a siamese network (for the first time) for my image recognition project, but I'm not able to overcome this error:
"Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays"
I think my datagenerator is the problem, but I don't know how to fix it.
Here are some information:
##################### MODEL
input_dim = (200, 200, 1)
img_a = Input(shape = input_dim)
img_b = Input(shape = input_dim)
base_net = build_base_network(input_dim)
features_a = base_net(img_a)
features_b = base_net(img_b)
distance = Lambda(euclidean_distance, output_shape = eucl_dist_output_shape)([features_a, features_b])
model = Model(inputs=[img_a, img_b], outputs=distance)
##################### NETWORK
def build_base_network(input_shape):
    seq = Sequential()
    # Layer_1
    seq.add(Conv2D(96, (11, 11), subsample=(4, 4), input_shape=(input_shape), init='glorot_uniform', dim_ordering='tf'))
    seq.add(Activation('relu'))
    seq.add(BatchNormalization())
    seq.add(MaxPooling2D((3,3), strides=(2, 2)))
    seq.add(Dropout(0.4))
    .
    .
    .
    .
    # Flatten
    seq.add(Flatten())
    seq.add(Dense(1024, activation='relu'))
    seq.add(Dropout(0.5))
    seq.add(Dense(1024, activation='relu'))
    seq.add(Dropout(0.5))
    return seq
##################### LAST PART OF DATAGENERATOR
.
.
.
if len(Pair_equal) > len(Pair_diff):
    Pair_equal = Pair_equal[0:len(Pair_diff)]
    y_equal = y_equal[0:len(y_diff)]
elif len(Pair_equal) < len(Pair_diff):
    Pair_diff = Pair_diff[0:len(Pair_equal)]
    y_diff = y_diff[0:len(y_equal)]
Pair_equal = np.array(Pair_equal)  # contains pairs of the same image
Pair_diff = np.array(Pair_diff)  # contains pairs of different images
y_equal = np.array(y_equal)
y_diff = np.array(y_diff)
X = np.concatenate([Pair_equal, Pair_diff], axis=0)
y = np.concatenate([y_equal, y_diff], axis=0)
return X, y
##################### SHAPES
(16, 2, 200, 200, 1) --> Pair_equal
(16, 2, 200, 200, 1) --> Pair_diff
(16,) --> y_equal
(16,) --> y_diff
If you need anything else ask and I will provide it.
You can solve this by changing what your generator returns to:
return [X[:,0,...], X[:,1,...]], y
Yours is a siamese network that expects 2 inputs and produces 1 output, but in general this applies to all Keras models that expect multiple inputs (or outputs).
To feed this kind of model you have to pass one array for each input: in your case, an array for image_a and another for image_b. With multiple inputs (or outputs), the arrays must be put inside a list.
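For illustration, a minimal sketch with dummy arrays shaped like the generator's output (shapes taken from the question; the model.fit call is only indicative):
import numpy as np
X = np.zeros((32, 2, 200, 200, 1))  # a batch of image pairs
y = np.zeros((32,))                 # one label per pair
pair_a = X[:, 0, ...]               # shape (32, 200, 200, 1), fed to img_a
pair_b = X[:, 1, ...]               # shape (32, 200, 200, 1), fed to img_b
# model.fit([pair_a, pair_b], y)    # one array per Input layer, wrapped in a list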

Numpy and tensorflow RNN shape representation mismatch

I'm building my first RNN in tensorflow. After understanding all the concepts regarding the 3D input shape, I came across with this issue.
In my numpy version (1.15.4), the shape representation of 3D arrays is the following: (panel, row, column). I will make each dimension different so that it is clearer:
In [1]: import numpy as np
In [2]: arr = np.arange(30).reshape((2,3,5))
In [3]: arr
Out[3]:
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],

       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])
In [4]: arr.shape
Out[4]: (2, 3, 5)
In [5]: np.__version__
Out[5]: '1.15.4'
Here my understanding is: I have two timesteps with each timestep having 3 observations with 5 features in each observation.
However, in tensorflow "theory" (which I believe it is strongly based in numpy) RNN cells expect tensors (i.e. just n-dimensional matrices) of shape [batch_size, timesteps, features], which could be translated to: (row, panel, column) in the numpy "jargon".
As can be seen, the representation doesn't match, leading to errors when feeding numpy data into a placeholder, which in most of the examples and theory is defined like:
x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
np.reshape() doesn't solve the issue because it just rearranges the dimensions, but messes up with the data.
This is my first time using the Dataset API, but I encounter the problems once inside the session, not in the Dataset API ops.
I'm using the static_rnn method, and everything works well until I have to feed the data into the placeholder, which obviously results in a shape error.
I have tried to change the placeholder shape to shape=[N_TIMESTEPS_X, None, N_FEATURES]. HOWEVER, I'm using the dataset API, and I get errors when making the initializer if I change the Xplaceholder to the shape=[N_TIMESTEPS_X, None, N_FEATURES].
So, to summarize:
First problem: Shape errors with different shape representations.
Second problem: Dataset error when equating the shape representations (I think that either static_rnn or dynamic_rnn would function if this is resolved).
My question is:
Is there anything I'm missing in regard to this different representation logic which makes the practice confusing?
Could the solution be attained by switching to dynamic_rnn? (Although the shape problems I encounter are related to the dataset API initializer being fed with shape [N_TIMESTEPS_X, None, N_FEATURES], not to the RNN cell itself.)
Thank you very much for your time.
Full code:
'''The idea is to create xt, yt, xval and yval. My numpy arrays to
be fed are of the following shapes:
The 3D xt array has a shape of: (11, 69579, 74)
The 3D xval array has a shape of: (11, 7732, 74)
The yt array has a shape of: (69579, 3)
The yval array has a shape of: (7732, 3)
'''
N_TIMESTEPS_X = xt.shape[0] ## The stack number
BATCH_SIZE = 256
#N_OBSERVATIONS = xt.shape[1]
N_FEATURES = xt.shape[2]
N_OUTPUTS = yt.shape[1]
N_NEURONS_LSTM = 128 ## Number of units in the LSTMCell
N_NEURONS_DENSE = 64 ## Number of units in the Dense layer
N_EPOCHS = 600
LEARNING_RATE = 0.1
### Define the placeholders and gather the data.
train_data = (xt, yt)
validation_data = (xval, yval)
## We define the placeholders as a trick so that we do not run into memory problems associated with feeding the data directly.
'''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
batch_size = tf.placeholder(tf.int64)
x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')
# Creating the two different dataset objects.
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE)
# Creating the Iterator type that permits to switch between datasets.
itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)
next_features, next_labels = itr.get_next()
### Create the graph
cellType = tf.nn.rnn_cell.LSTMCell(num_units=N_NEURONS_LSTM, name='LSTMCell')
inputs = tf.unstack(next_features, N_TIMESTEPS_X, axis=0)
'''inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size]'''
RNNOutputs, _ = tf.nn.static_rnn(cell=cellType, inputs=inputs, dtype=tf.float32)
predictionsLayer = tf.layers.dense(inputs=tf.layers.batch_normalization(RNNOutputs[-1]), units=N_NEURONS_DENSE, activation=None, name='Dense_Layer')
### Define the cost function, that will be optimized by the optimizer.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=predictionsLayer, labels=next_labels, name='Softmax_plus_Cross_Entropy'))
optimizer_type = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, name='AdamOptimizer')
optimizer = optimizer_type.minimize(cost)
### Model evaluation
correctPrediction = tf.equal(tf.argmax(predictionsLayer,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correctPrediction,tf.float32))
#confusionMatrix = tf.confusion_matrix(next_labels, predictionsLayer, num_classes=3, name='ConfMatrix')
N_BATCHES = train_data[0].shape[0] // BATCH_SIZE
## Saving variables so that we can restore them afterwards.
saver = tf.train.Saver()
save_dir = '/home/zmlaptop/Desktop/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
os.mkdir(save_dir)
varDict = {'nTimeSteps':N_TIMESTEPS_X, 'BatchSize': BATCH_SIZE, 'nFeatures':N_FEATURES,
'nNeuronsLSTM':N_NEURONS_LSTM, 'nNeuronsDense':N_NEURONS_DENSE, 'nEpochs':N_EPOCHS,
'learningRate':LEARNING_RATE, 'optimizerType': optimizer_type.__class__.__name__}
varDicSavingTxt = save_dir + '/varDict.txt'
modelFilesDir = save_dir + '/modelFiles'
os.mkdir(modelFilesDir)
logDir = save_dir + '/TBoardLogs'
os.mkdir(logDir)
acc_summary = tf.summary.scalar('Accuracy', accuracy)
loss_summary = tf.summary.scalar('Cost_CrossEntropy', cost)
summary_merged = tf.summary.merge_all()
with open(varDicSavingTxt, 'w') as outfile:
    outfile.write(repr(varDict))

with tf.Session() as sess:
    tf.set_random_seed(2)
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter(logDir + '/train', sess.graph)
    validation_writer = tf.summary.FileWriter(logDir + '/validation')
    # initialise iterator with train data
    sess.run(train_init_op, feed_dict = {x : train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
    print('¡Training starts!')
    for epoch in range(N_EPOCHS):
        batchAccList = []
        tot_loss = 0
        for batch in range(N_BATCHES):
            optimizer_output, loss_value, summary = sess.run([optimizer, cost, summary_merged])
            accBatch = sess.run(accuracy)
            tot_loss += loss_value
            batchAccList.append(accBatch)
            if batch % 10 == 0:
                train_writer.add_summary(summary, batch)
        epochAcc = tf.reduce_mean(batchAccList)
        if epoch%10 == 0:
            print("Epoch: {}, Loss: {:.4f}, Accuracy: {}".format(epoch, tot_loss / N_BATCHES, epochAcc))
        #confM = sess.run(confusionMatrix)
        #confDic = {'confMatrix': confM}
        #confTxt = save_dir + '/confMDict.txt'
        #with open(confTxt, 'w') as outfile:
        #    outfile.write(repr(confDic))
        #print(confM)
    # initialise iterator with validation data
    sess.run(validation_init_op, feed_dict = {x : validation_data[0], y: validation_data[1], batch_size: len(validation_data[0])})
    print('Validation Loss: {:4f}, Validation Accuracy: {}'.format(sess.run(cost), sess.run(accuracy)))
    summary_val = sess.run(summary_merged)
    validation_writer.add_summary(summary_val)
    saver.save(sess, modelFilesDir)
Is there anything I'm missing in regard to this different representation logic which makes the practice confusing?
In fact, you made a mistake about the input shapes of static_rnn and dynamic_rnn. The input shape of static_rnn is [timesteps, batch_size, features] (link), which is a list of 2D tensors of shape [batch_size, features]. But the input shape of dynamic_rnn is either [timesteps, batch_size, features] or [batch_size, timesteps, features], depending on whether time_major is True or False (link).
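A small shape illustration of that difference (a sketch in the TF1 style of the question; the numbers are arbitrary):
x = tf.placeholder(tf.float32, [None, 11, 74])  # [batch_size, timesteps, features]
as_list = tf.unstack(x, 11, axis=1)             # list of 11 tensors of shape [batch_size, 74], as static_rnn expects
# tf.nn.dynamic_rnn(cell, x, dtype=tf.float32) can consume x directly, since time_major defaults to False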
Could the solution be attained by switching to dynamic_rnn?
The key is not whether you use static_rnn or dynamic_rnn, but that your data shape matches the required shape. The general placeholder format, as in your code, is [None, N_TIMESTEPS_X, N_FEATURES]. It's also convenient for you to use the Dataset API.
You can use transpose() (link) instead of reshape(). transpose() permutes the dimensions of an array and won't mess up the data.
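To illustrate the difference on a small array (a sketch with made-up shapes, not the question's data):
import numpy as np
arr = np.arange(12).reshape(2, 3, 2)   # (timesteps, batch, features)
swapped = arr.transpose(1, 0, 2)       # (batch, timesteps, features)
print(swapped.shape)                   # (3, 2, 2)
print(swapped[0])                      # first sample's two timesteps: [[0 1] [6 7]]
# arr.reshape(3, 2, 2)[0] would instead give [[0 1] [2 3]], mixing different samples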
So your code needs to be modified.
# permute the dimensions
xt = xt.transpose([1,0,2])
xval = xval.transpose([1,0,2])
# adjust shape,axis=1 represents timesteps
inputs = tf.unstack(next_features, axis=1)
Other errors should have nothing to do with rnn shape.

Keras "Shapes must be of equal rank" error when trying to slice tensor with another tensor

Building off some of the other questions I've asked, I'm trying to define a custom loss function that allows me to slice the contents of an input tensor using the contents of another tensor:
def innerLoss(z):
    y_pred = z[0]
    patch_x = z[1][0]
    patch_y = z[1][1]
    patch_true = y_pred[patch_y:patch_y+10, patch_x:patch_x+10, 0]
    return 0
originalInputs = Input(shape=(128, 128, 1))
featureInputs = Input(shape=(2,), dtype="int64")
originalOutputs = Input(shape=(128, 128, 1))
loss = Lambda(innerLoss)([originalOutputs, featureInputs])
outerModel = Model(inputs=[originalInputs, featureInputs], outputs=loss)
I'm getting the following error:
ValueError: Shapes must be equal rank, but are 1 and 0
From merging shape 1 with other shapes. for 'lambda_3/strided_slice_2/stack_1' (op: 'Pack') with input shapes: [2], [2], [].
Here, featureInputs will consist of a pair of coordinates telling us where to begin slicing the image in originalInputs.

multi-level feature fusion in tensorflow

I want to know how I can combine two layers with different spatial dimensions in TensorFlow.
For example:
import tensorflow as tf

batch_size = 3
input1 = tf.ones([batch_size, 32, 32, 3], tf.float32)
input2 = tf.ones([batch_size, 16, 16, 3], tf.float32)
filt1 = tf.constant(0.1, shape = [3,3,3,64])
filt1_1 = tf.constant(0.1, shape = [1,1,64,64])
filt2 = tf.constant(0.1, shape = [3,3,3,128])
filt2_2 = tf.constant(0.1, shape = [1,1,128,128])
#first layer
conv1 = tf.nn.conv2d(input1, filt1, [1,2,2,1], "SAME")
pool1 = tf.nn.max_pool(conv1, [1,2,2,1],[1,2,2,1], "SAME")
conv1_1 = tf.nn.conv2d(pool1, filt1_1, [1,2,2,1], "SAME")
deconv1 = tf.nn.conv2d_transpose(conv1_1, filt1_1, pool1.get_shape().as_list(), [1,2,2,1], "SAME")
#second layer
conv2 = tf.nn.conv2d(input2, filt2, [1,2,2,1], "SAME")
pool2 = tf.nn.max_pool(conv2, [1,2,2,1],[1,2,2,1], "SAME")
conv2_2 = tf.nn.conv2d(pool2, filt2_2, [1,2,2,1], "SAME")
deconv2 = tf.nn.conv2d_transpose(conv2_2, filt2_2, pool2.get_shape().as_list(), [1,2,2,1], "SAME")
The deconv1 shape is [3, 8, 8, 64] and the deconv2 shape is [3, 4, 4, 128]. Here I cannot use tf.concat to combine deconv1 and deconv2. So how can I do this?
Edit
This is an image of the architecture that I tried to implement; it is related to this paper:
He, W., Zhang, X. Y., Yin, F., & Liu, C. L. (2017). Deep Direct Regression for Multi-Oriented Scene Text Detection. arXiv preprint arXiv:1703.08289.
I checked the paper you pointed to, and there it is. Consider an input image to this network of size H x W (height and width); I wrote the size of the output next to each layer. Now look at the bottom-most layer, whose input arrows I circled, and let's check it. This layer has two inputs: the first from the previous layer, which has shape H/2 x W/2, and the second from the first pooling layer, which also has size H/2 x W/2. These two inputs are merged together (not concatenated, but added together, as in the paper) and go into the last Upsample layer, which outputs an image of size H x W.
The other Upsample layers have the same kind of inputs. As you can see, all merging operations have matching shapes. Also, the filter number for all merging layers is 128, which is consistent across the network.
You can also use concat instead of merging (adding), but it results in a larger filter number, so be careful about that. I.e. adding two matrices of shape H/2 x W/2 x 128 results in the same shape H/2 x W/2 x 128, but concatenating two matrices of shape H/2 x W/2 x 128 on the last axis results in H/2 x W/2 x 256.
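As a rough sketch with the tensors from the question (the 1x1 projection filter filt_proj and the bilinear resize are assumptions, added only so the shapes and channel counts line up; they are not prescribed by the paper):
# upsample deconv2 from (3, 4, 4, 128) to (3, 8, 8, 128)
deconv2_up = tf.image.resize_bilinear(deconv2, [8, 8])
# project deconv1 from 64 to 128 channels so the two tensors can be added
filt_proj = tf.constant(0.1, shape=[1, 1, 64, 128])
deconv1_proj = tf.nn.conv2d(deconv1, filt_proj, [1, 1, 1, 1], "SAME")
# option 1: element-wise addition (as in the paper), shape stays (3, 8, 8, 128)
fused_add = deconv1_proj + deconv2_up
# option 2: concatenation on the channel axis, shape becomes (3, 8, 8, 256)
fused_concat = tf.concat([deconv1_proj, deconv2_up], axis=-1)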
I tried to guide you as much as possible, hope that was useful.
