I am developing a code to use the pre-trained GPT2 model for a machine translation task. The length of my data's word-to-id is 91, and I developed the following code for my model:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers.models.gpt2.modeling_gpt2 import GPT2Model
# data preparation code
def batch_sequences(x, y, env):
    """
    Take as input a list of n sequences (torch.LongTensor vectors) and return
    a tensor of size (slen, n) where slen is the length of the longest
    sentence, and a vector lengths containing the length of each sentence.
    """
    lengths_x = torch.LongTensor([len(s) + 2 for s in x])
    lengths_y = torch.LongTensor([len(s) + 2 for s in y])
    max_length = max(lengths_x.max().item(), lengths_y.max().item())
    sent_x = torch.LongTensor(
        max_length, lengths_x.size(0)).fill_(env.pad_index)
    sent_y = torch.LongTensor(
        max_length, lengths_y.size(0)).fill_(env.pad_index)
    assert lengths_x.min().item() > 2
    assert lengths_y.min().item() > 2
    sent_x[0] = env.eos_index
    for i, s in enumerate(x):
        sent_x[1:lengths_x[i] - 1, i].copy_(s)
        sent_x[lengths_x[i] - 1, i] = env.eos_index
    sent_y[0] = env.eos_index
    for i, s in enumerate(y):
        sent_y[1:lengths_y[i] - 1, i].copy_(s)
        sent_y[lengths_y[i] - 1, i] = env.eos_index
    return sent_x, sent_y, max_length
def collate_fn(elements):
    """
    Collate samples into a batch.
    """
    x, y = zip(*elements)
    x = [torch.LongTensor([env.word2id[w]
                           for w in seq if w in env.word2id]) for seq in x]
    y = [torch.LongTensor([env.word2id[w]
                           for w in seq if w in env.word2id]) for seq in y]
    x, y, length = batch_sequences(x, y, env)
    # return exactly the two pairs that the training loop unpacks
    return (x, length), (y, length)
loader = DataLoader(data, batch_size=1, shuffle=False, collate_fn=collate_fn)
gpt2 = GPT2Model.from_pretrained('gpt2')
in_layer = nn.Embedding(len(env.word2id), 768)
out_layer = nn.Linear(768, len(env.word2id))
parameters = list(gpt2.parameters()) + list(in_layer.parameters()) + list(out_layer.parameters())
optimizer = torch.optim.Adam(parameters)
loss_fn = nn.CrossEntropyLoss()
for layer in (gpt2, in_layer, out_layer):
    layer.train()
accuracies = list()
n_epochs = 5
for i in range(n_epochs):
    for (x, x_len), (y, y_len) in loader:
        x = x.to(device=device)
        y = y.to(device=device)
        embeddings = in_layer(x.reshape(1, -1))
        hidden_state = gpt2(inputs_embeds=embeddings).last_hidden_state[:, :]
        logits = out_layer(hidden_state)[0]
        loss = loss_fn(logits, y.reshape(-1))
        accuracies.append(
            (logits.argmax(dim=-1) == y.reshape(-1)).float().mean().item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if len(accuracies) % 500 == 0:
            accuracy = sum(accuracies[-50:]) / len(accuracies[-50:])
            print(f'Samples: {len(accuracies)}, Accuracy: {accuracy}')
This code works pretty well when the batch size is 1, but it is very slow. I wanted to increase the batch size from 1 to 32, but I get dimension compatibility problems. How can I increase the batch size without errors?
My data consists of pairs of sentences: the first is a sentence in the first language and the second is its translation in the second language.
For example, assume that x.shape is (batch_size, 12) (meaning we have 'batch_size' sentences of length 12 as input) and y.shape is also (batch_size, 12) (the translations). We also have a word-to-id dictionary of length 90 that maps each word in a sentence to its index.
This problem can be solved using padding. We need two special symbols:
code 0 in inputs (x) will denote "blank" tokens that should not be translated.
code -100 in outputs (y) will denote "blank" tokens that should not participate in the calculation of loss. nn.CrossEntropyLoss() ignores this value by default (its ignore_index argument defaults to -100).
The batch of size 3 could look like this:
x:
[[1, 2, 3, 0, 0],
[ 4, 5, 6, 7, 8],
[ 9, 8, 0, 0, 0]]
y:
[[1, 2, 3, -100, -100],
[ 4, 5, 6, 7, 8],
[ 9, 8, -100, -100, -100]]
You could generate it with code such as:
def pad_sequences(batch, pad_value=0):
    n = max(len(v) for v in batch)
    return torch.tensor([v + [pad_value] * (n - len(v)) for v in batch])
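To make the padding idea concrete, here is a minimal sketch that builds the example batch above and checks that the padded target positions really do not contribute to the loss. The random logits stand in for the model output and `vocab_size = 91` is taken from the question; `attention_mask` is a name I introduce here, and the comment about passing it to GPT-2 is an assumption about how you would wire it in, not something I have run against your data.

```python
import torch
import torch.nn as nn

def pad_sequences(batch, pad_value=0):
    # right-pad every sequence to the length of the longest one
    n = max(len(v) for v in batch)
    return torch.tensor([v + [pad_value] * (n - len(v)) for v in batch])

sentences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 8]]
x = pad_sequences(sentences, pad_value=0)      # inputs, shape (3, 5)
y = pad_sequences(sentences, pad_value=-100)   # targets, shape (3, 5)

# 1 for real tokens, 0 for padding; this could be passed as
# gpt2(inputs_embeds=..., attention_mask=attention_mask)
attention_mask = (x != 0).long()

vocab_size = 91  # from the question
# stand-in for out_layer(hidden_state): (batch, seq_len, vocab_size)
logits = torch.randn(x.shape[0], x.shape[1], vocab_size)

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100
loss = loss_fn(logits.reshape(-1, vocab_size), y.reshape(-1))
```

Note that this layout is (batch, seq_len), which is what GPT-2 expects. The tensors produced by batch_sequences in the question are (slen, n), so with more than one sentence per batch they would probably need to be transposed rather than flattened with reshape(1, -1).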
However, I feel there is an issue with your problem statement. If you perform machine translation, then your inputs and outputs can have different lengths, but your architecture only allows x and y to have the same length. If you want to support x and y of different lengths, I would suggest using a seq2seq architecture such as T5 instead.
Another issue is that GPT is autoregressive, so if y is completely aligned with x, then we cannot use the suffix of x while generating the left part of y. So if you wish your x and y to be perfectly aligned, but still would like to use the full information about x when generating y, I would recommend using a bidirectional encoder such as BERT.
I have 3 parallel MLPs and want to obtain the following in Keras:
Out = W1 * Out_MLP1 + W2 * Out_MLP2 + W3 * Out_MLP3
where each Out_MLP is the output layer of one MLP, with shape (10,), and W1, W2 and W3 are three trainable weights (floats) that satisfy the following condition:
W1 + W2 + W3 = 1
What is the best way to implement this with Keras functional API? What if we had N parallel layers?
What you need is to apply a softmax to a set of learnable weights, in order to guarantee that they sum up to 1.
We initialize our learnable weights in a custom layer. This layer receives the outputs of our MLPs and combines them following our logic W1 * Out_MLP1 + W2 * Out_MLP2 + W3 * Out_MLP3. The output will be a tensor of shape (10,).
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer, Input, Dense, Concatenate
from tensorflow.keras.models import Model

class W_ADD(Layer):
    def __init__(self, n_output):
        super(W_ADD, self).__init__()
        # n_output is the number of inputs to combine: one learnable weight per input
        self.W = tf.Variable(initial_value=tf.random.uniform(shape=[1,1,n_output], minval=0, maxval=1),
            trainable=True) # (1,1,n_inputs)
    def call(self, inputs):
        # inputs is a list of tensor of shape [(n_batch, n_feat), ..., (n_batch, n_feat)]
        # expand last dim of each input passed [(n_batch, n_feat, 1), ..., (n_batch, n_feat, 1)]
        inputs = [tf.expand_dims(i, -1) for i in inputs]
        inputs = Concatenate(axis=-1)(inputs) # (n_batch, n_feat, n_inputs)
        weights = tf.nn.softmax(self.W, axis=-1) # (1,1,n_inputs)
        # weights sum up to one on last dim
        return tf.reduce_sum(weights*inputs, axis=-1) # (n_batch, n_feat)
In this dummy example, I create a network that has 3 parallel MLPs:
inp1 = Input((100,))
inp2 = Input((100,))
inp3 = Input((100,))
x1 = Dense(32, activation='relu')(inp1)
x2 = Dense(32, activation='relu')(inp2)
x3 = Dense(32, activation='relu')(inp3)
x1 = Dense(10, activation='linear')(x1)
x2 = Dense(10, activation='linear')(x2)
x3 = Dense(10, activation='linear')(x3)
mlp_outputs = [x1,x2,x3]
out = W_ADD(n_output=len(mlp_outputs))(mlp_outputs)
m = Model([inp1,inp2,inp3], out)
m.compile('adam','mse')
X1 = np.random.uniform(0,1, (1000,100))
X2 = np.random.uniform(0,1, (1000,100))
X3 = np.random.uniform(0,1, (1000,100))
y = np.random.uniform(0,1, (1000,10))
m.fit([X1,X2,X3], y, epochs=10)
As you can see, this generalizes easily to the case of N parallel layers.
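To illustrate the N-input case outside of Keras, here is a small NumPy sketch of the same computation: unconstrained raw weights go through a softmax and are then used to mix N outputs. The helper names `softmax` and `weighted_sum` and the choice of N = 5 are mine, for illustration only; this mirrors the arithmetic of the layer above, not its training behaviour.

```python
import numpy as np

def softmax(v):
    # subtract the max for numerical stability
    e = np.exp(v - np.max(v))
    return e / e.sum()

def weighted_sum(outputs, raw_weights):
    """outputs: list of N arrays of shape (n_batch, n_feat);
    raw_weights: array of shape (N,), unconstrained."""
    stacked = np.stack(outputs, axis=-1)   # (n_batch, n_feat, N)
    weights = softmax(raw_weights)         # (N,), positive, sums to 1
    return (stacked * weights).sum(axis=-1), weights

rng = np.random.default_rng(0)
n_inputs = 5
outputs = [rng.normal(size=(4, 10)) for _ in range(n_inputs)]
raw = rng.normal(size=n_inputs)
combined, w = weighted_sum(outputs, raw)
print(combined.shape, w.sum())
```

One design consequence worth knowing: the softmax also forces every weight to be strictly positive, which is stronger than the W1 + W2 + W3 = 1 condition in the question. If negative weights should be allowed, a different normalization would be needed.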
This question is about TensorFlow (and TensorBoard) version 2.2rc3, but I have experienced the same issue with 2.1.
Consider the following weird code:
from datetime import datetime
import tensorflow as tf
from tensorflow import keras
inputs = keras.layers.Input(shape=(784, ))
x1 = keras.layers.Dense(32, activation='relu', name='Model/Block1/relu')(inputs)
x1 = keras.layers.Dropout(0.2, name='Model/Block1/dropout')(x1)
x1 = keras.layers.Dense(10, activation='softmax', name='Model/Block1/softmax')(x1)
x2 = keras.layers.Dense(32, activation='relu', name='Model/Block2/relu')(inputs)
x2 = keras.layers.Dropout(0.2, name='Model/Block2/dropout')(x2)
x2 = keras.layers.Dense(10, activation='softmax', name='Model/Block2/softmax')(x2)
x3 = keras.layers.Dense(32, activation='relu', name='Model/Block3/relu')(inputs)
x3 = keras.layers.Dropout(0.2, name='Model/Block3/dropout')(x3)
x3 = keras.layers.Dense(10, activation='softmax', name='Model/Block3/softmax')(x3)
x4 = keras.layers.Dense(32, activation='relu', name='Model/Block4/relu')(inputs)
x4 = keras.layers.Dropout(0.2, name='Model/Block4/dropout')(x4)
x4 = keras.layers.Dense(10, activation='softmax', name='Model/Block4/softmax')(x4)
outputs = x1 + x2 + x3 + x4
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=keras.optimizers.RMSprop(),
              metrics=['accuracy'])
logdir = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
model.fit(x_train, y_train,
          batch_size=64,
          epochs=5,
          validation_split=0.2,
          callbacks=[tensorboard_callback])
When running it and looking at the graph created in TensorBoard, you will see the following.
As can be seen, the addition operations are really ugly.
When replacing the line
outputs = x1 + x2 + x3 + x4
with the lines:
outputs = keras.layers.add([x1, x2], name='Model/add/add1')
outputs = keras.layers.add([outputs, x3], name='Model/add/add2')
outputs = keras.layers.add([outputs, x4], name='Model/add/add3')
a much nicer graph is created by TensorBoard (in this second screenshot, the Model as well as one of the inner blocks are shown in detail).
The difference between the two representations of the model is that in the second one, we could name the addition operations and group them.
I could not find any way to name these operations other than using keras.layers.add(). In this model the problem does not look that critical, as the model is simple and it is easy to replace + with keras.layers.add(). However, in more complex models it can become a real pain. For example, operations such as t[:, start:end] are translated into complex calls to tf.strided_slice(), so my model representations are quite messy, with plenty of cryptic gather, stride and concat operations.
I wonder if there is a way to wrap / group such operations to allow nicer graphs in TensorBoard.
outputs = keras.layers.Add()([x1, x2, x3, x4])
Following the hint from Marco Cerliani, a Lambda layer is indeed very useful here. The following code groups the + operations nicely:
outputs = keras.layers.Lambda(lambda x: x[0] + x[1], name='Model/add/add1')([x1, x2])
outputs = keras.layers.Lambda(lambda x: x[0] + x[1], name='Model/add/add2')([outputs, x3])
outputs = keras.layers.Lambda(lambda x: x[0] + x[1], name='Model/add/add3')([outputs, x4])
Or, if you need to wrap slices, the following code groups the t[] operations nicely:
x1 = keras.layers.Lambda(lambda x: x[:, 0:5], name='Model/stride_concat/stride1')(x1) # instead of x1 = x1[:, 0:5]
x2 = keras.layers.Lambda(lambda x: x[:, 5:10], name='Model/stride_concat/stride2')(x2) # instead of x2 = x2[:, 5:10]
outputs = keras.layers.concatenate([x1, x2], name='Model/stride_concat/concat')
This answers the question asked. But actually, there is still an open issue that is described in another question: 'TensorFlowOpLayer messes up the TensorBoard graphs'
It's going to be a long post, sorry in advance...
I'm working on a denoising algorithm and my goal is to:
Use PyTorch to design / train the model
Convert the PyTorch model into a CoreML model
The denoising algorithm consists in the following 3 parts:
A "down-sampling" + noise level map
A regular convnet
An "up-sampling"
The first part is quite simple in its idea, but not so easy to explain. Its inputs are, for instance, a color image and a value "sigma" that represents the standard deviation of the image noise.
The "down-sampling" part is in fact a space-to-depth. In short, for a given channel and for a subset of 2x2 pixels, the space-to-depth creates a single pixel composed of 4 channels. The number of channels is multiplied by 4 while the height and width are divided by 2. The data is simply reorganized.
The noise level map consists of 3 channels filled with the standard deviation value, so that the convnet knows how strongly to denoise the input image.
This may be clearer with some code:
def downsample_and_noise_map(input, sigma):
    # Input tensor size (batch, channels, height, width)
    in_n, in_c, in_h, in_w = input.size()
    # Output tensor size
    out_h = in_h // 2
    out_w = in_w // 2
    sigma_c = in_c # nb of channels of the standard deviation tensor
    image_c = in_c * 4 # nb of channels of the image tensor
    # Standard deviation tensor
    output_sigma = sigma.view(1, 1, 1, 1).repeat(in_n, sigma_c, out_h, out_w)
    # Image tensor
    output_image = torch.zeros((in_n, image_c, out_h, out_w))
    output_image[:, 0::4, :, :] = input[:, :, 0::2, 0::2]
    output_image[:, 1::4, :, :] = input[:, :, 0::2, 1::2]
    output_image[:, 2::4, :, :] = input[:, :, 1::2, 0::2]
    output_image[:, 3::4, :, :] = input[:, :, 1::2, 1::2]
    # Concatenate standard deviation and image tensors
    return torch.cat((output_sigma, output_image), dim=1)
This function is then called as the first step in the model's forward function:
def forward(self, x, sigma):
    x = downsample_and_noise_map(x, sigma)
    x = self.convnet(x)
    x = upsample(x)
    return x
Let's consider an input tensor of size 1x3x100x100 (PyTorch standard: batch, channels, height, width) and a sigma value of 0.1. The output tensor has the following properties:
Tensor's shape is 1x15x50x50
Tensor's values for channels 0, 1 and 2 are all equal to sigma = 0.1
Tensor's values for channels 3, 4, 5, 6 are composed of the input image values of channel 0
Tensor's values for channels 7, 8, 9, 10 are composed of the input image values of channel 1
Tensor's values for channels 11, 12, 13, 14 are composed of the input image values of channel 2
If this code is not clear enough, I can post an even more naive version.
The up-sampling part is the reciprocal function of the downsampling one.
I was able to use this function for training and testing in PyTorch.
Then, I tried to convert the model to CoreML with ONNX as an intermediate step.
The conversion to ONNX generated "TracerWarning". Conversion from ONNX to CoreML failed (TypeError: 1.0 has type numpy.float64, but expected one of: int, long). The problem came from the down-sampling + noise level map (and from up-sampling too).
When I removed the down-sampling + noise level map and up-sampling layers, I was able to convert to ONNX and to CoreML very easily since only a simple convnet remained. This means I have a solution to my problem: implement these 2 layers using 2 shaders on the mobile side. But I'm not satisfied with this solution as I want my model to contain all layers ^^
Before writing a post here, I searched the Internet for an answer and was able to write a better version of the previous function using reshape and permute. This version removed all ONNX warnings, but the CoreML conversion still failed...
def downsample_and_noise_map(input, sigma):
    # Input image size
    in_n, in_c, in_h, in_w = input.size()
    # Output tensor size
    out_n = in_n
    out_h = in_h // 2
    out_w = in_w // 2
    # Create standard deviation tensor
    output_sigma = sigma.view(out_n, 1, 1, 1).repeat(out_n, in_c, out_h, out_w)
    # Split RGB channels
    channels_rgb = torch.split(input, 1, dim=1)
    # Reshape (space-to-depth) each image channel
    channels_reshaped = []
    for channel in channels_rgb:
        channel = channel.reshape(1, out_h, 2, out_w, 2)
        channel = channel.permute(2, 4, 0, 1, 3)
        channel = channel.reshape(1, 4, out_h, out_w)
        channels_reshaped.append(channel)
    # Concatenate all reshaped image channels together
    output_image = torch.cat(channels_reshaped, dim=1)
    # Concatenate standard deviation and image tensors
    output = torch.cat([output_sigma, output_image], dim=1)
    return output
So here are (some of) my questions:
What is the preferred PyTorch way to implement a function such as downsample_and_noise_map function within a model?
Same question but when the conversion to ONNX and then to CoreML is part of the equation?
Is PyTorch -> ONNX -> CoreML still the best path to deploy the model for iOS production?
Thanks for your help (and your patience) ^^
Disclaimer: I'm not familiar with CoreML or deploying to iOS, but I do have experience deploying PyTorch models in TensorRT and OpenVINO via ONNX.
The main issue I've faced when deploying to other frameworks is that operations like slicing and repeating tensors tend to have limited support there. Often we can construct equivalent conv or transpose-conv operations which achieve the desired behavior.
To ensure we don't export the logic used to construct the conv weights, I've separated the weight initialization from the application of the weights. This makes the ONNX export much more straightforward, since all it sees is some constant tensors being applied.
import torch

class DownsampleAndNoiseMap():
    def __init__(self):
        self.initialized = False
        self.weight = None
        self.zeros = None
    def init_weights(self, input):
        with torch.no_grad():
            in_n, in_c, in_h, in_w = input.size()
            out_h = int(in_h // 2)
            out_w = int(in_w // 2)
            sigma_c = in_c
            image_c = in_c * 4
            # conv weights used for downsampling
            self.weight = torch.zeros(image_c, in_c, 2, 2).to(input)
            for c in range(in_c):
                self.weight[4 * c, c, 0, 0] = 1
                self.weight[4 * c + 1, c, 0, 1] = 1
                self.weight[4 * c + 2, c, 1, 0] = 1
                self.weight[4 * c + 3, c, 1, 1] = 1
            # zeros used to replace repeat
            self.zeros = torch.zeros(in_n, sigma_c, out_h, out_w).to(input)
        self.initialized = True
    def __call__(self, input, sigma):
        assert self.initialized
        output_sigma = self.zeros + sigma
        output_image = torch.nn.functional.conv2d(input, self.weight, stride=2)
        return torch.cat((output_sigma, output_image), dim=1)
class Upsample():
    def __init__(self):
        self.initialized = False
        self.weight = None
    def init_weights(self, input):
        with torch.no_grad():
            in_n, in_c, in_h, in_w = input.size()
            image_c = in_c * 4
            self.weight = torch.zeros(in_c + image_c, in_c, 2, 2).to(input)
            for c in range(in_c):
                self.weight[in_c + 4 * c, c, 0, 0] = 1
                self.weight[in_c + 4 * c + 1, c, 0, 1] = 1
                self.weight[in_c + 4 * c + 2, c, 1, 0] = 1
                self.weight[in_c + 4 * c + 3, c, 1, 1] = 1
        self.initialized = True
    def __call__(self, input):
        assert self.initialized
        return torch.nn.functional.conv_transpose2d(input, self.weight, stride=2)
I made the assumption that upsample was the reciprocal of downsample in the sense that x == upsample(downsample_and_noise_map(x, sigma)) (correct me if I'm wrong in this assumption). I also verified that my version of downsample agrees with yours.
# consistency checking code
x = torch.randn(1, 3, 100, 100)
sigma = torch.randn(1)
# OP downsampling
y1 = downsample_and_noise_map(x, sigma)
ds = DownsampleAndNoiseMap()
ds.init_weights(x)
y2 = ds(x, sigma)
print('downsample diff:', torch.sum(torch.abs(y1 - y2)).item())
us = Upsample()
us.init_weights(x)
x_recov = us(ds(x, sigma))
print('recovery error:', torch.sum(torch.abs(x - x_recov)).item())
which results in
downsample diff: 0.0
recovery error: 0.0
Exporting to ONNX
When exporting, we need to invoke init_weights for the new classes before using torch.onnx.export. For example:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.downsample = DownsampleAndNoiseMap()
        self.upsample = Upsample()
        self.convnet = lambda x: x # placeholder
    def init_weights(self, x):
        self.downsample.init_weights(x)
        self.upsample.init_weights(x)
    def forward(self, x, sigma):
        x = self.downsample(x, sigma)
        x = self.convnet(x)
        x = self.upsample(x)
        return x
x = torch.randn(1, 3, 100, 100)
sigma = torch.randn(1)
model = Model()
# ... load state dict here
model.init_weights(x)
torch.onnx.export(model, (x, sigma), 'deploy.onnx', verbose=True, input_names=["input", "sigma"], output_names=["output"])
which gives the ONNX graph
graph(%input : Float(1, 3, 100, 100)
      %sigma : Float(1)) {
  %2 : Float(1, 3, 50, 50) = onnx::Constant[value=<Tensor>](), scope: Model
  %3 : Float(1, 3, 50, 50) = onnx::Add(%2, %sigma), scope: Model
  %4 : Float(12, 3, 2, 2) = onnx::Constant[value=<Tensor>](), scope: Model
  %5 : Float(1, 12, 50, 50) = onnx::Conv[dilations=[1, 1], group=1, kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%input, %4), scope: Model
  %6 : Float(1, 15, 50, 50) = onnx::Concat[axis=1](%3, %5), scope: Model
  %7 : Float(15, 3, 2, 2) = onnx::Constant[value=<Tensor>](), scope: Model
  %output : Float(1, 3, 100, 100) = onnx::ConvTranspose[dilations=[1, 1], group=1, kernel_shape=[2, 2], pads=[0, 0, 0, 0], strides=[2, 2]](%6, %7), scope: Model
  return (%output);
}
As for the last question, about the recommended way to deploy on iOS, I can't answer that since I don't have experience in that area.
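As a side note on the "preferred PyTorch way" part of the question: recent PyTorch versions (1.8 and later, as far as I know) ship pixel_unshuffle and pixel_shuffle, which perform the same space-to-depth rearrangement with the same channel ordering as the slicing version in the question. The sketch below is only a check of that equivalence; the function names `space_to_depth_with_noise_map` and `depth_to_space` are mine, and I have not verified whether these ops survive the ONNX -> CoreML conversion, so treat that part as an open assumption.

```python
import torch
import torch.nn.functional as F

def space_to_depth_with_noise_map(x, sigma):
    # x: (n, c, h, w) with even h and w; sigma: tensor holding one value
    n, c, h, w = x.shape
    image = F.pixel_unshuffle(x, downscale_factor=2)          # (n, 4c, h/2, w/2)
    noise = sigma.view(1, 1, 1, 1).expand(n, c, h // 2, w // 2)
    return torch.cat((noise, image), dim=1)                   # (n, 5c, h/2, w/2)

def depth_to_space(y, noise_channels):
    # drop the noise-map channels, then undo the space-to-depth step
    return F.pixel_shuffle(y[:, noise_channels:], upscale_factor=2)

x = torch.randn(1, 3, 100, 100)
sigma = torch.tensor([0.1])
out = space_to_depth_with_noise_map(x, sigma)

# reference built with the slicing from the question
ref = torch.zeros(1, 12, 50, 50)
ref[:, 0::4] = x[:, :, 0::2, 0::2]
ref[:, 1::4] = x[:, :, 0::2, 1::2]
ref[:, 2::4] = x[:, :, 1::2, 0::2]
ref[:, 3::4] = x[:, :, 1::2, 1::2]

print(out.shape, torch.equal(out[:, 3:], ref))
x_recovered = depth_to_space(out, noise_channels=3)
```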
I am trying to write, from scratch, a neural network that learns the XOR function. The full code is here (in Python 3).
I am currently getting the error:
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients
I am new to TensorFlow and I don't understand why this happens. Can anyone help me correct my code? Thanks in advance.
P.S. If more details are required in the question, do let me know before downvoting. Thanks again!
Edit: relevant part of code:
def initialize_parameters():
    # Create Weights and Biases for Hidden Layer and Output Layer
    W1 = tf.get_variable("W1", [2, 2], initializer = tf.contrib.layers.xavier_initializer())
    b1 = tf.get_variable("b1", [2, 1], initializer = tf.zeros_initializer())
    W2 = tf.get_variable("W2", [1, 2], initializer = tf.contrib.layers.xavier_initializer())
    b2 = tf.get_variable("b2", [1, 1], initializer = tf.zeros_initializer())
    parameters = {
        "W1" : W1,
        "b1" : b1,
        "W2" : W2,
        "b2" : b2
    }
    return parameters
def forward_propogation(X, parameters):
    threshold = tf.constant(0.5, name = "threshold")
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    Z1 = tf.add(tf.matmul(W1, X), b1)
    A1 = tf.nn.relu(Z1)
    tf.squeeze(A1)
    Z2 = tf.add(tf.matmul(W2, A1), b2)
    A2 = tf.round(tf.sigmoid(Z2))
    print(A2.shape)
    tf.squeeze(A2)
    A2 = tf.reshape(A2, [1, 1])
    print(A2.shape)
    return A2
def compute_cost(A, Y):
    logits = tf.transpose(A)
    labels = tf.transpose(Y)
    cost = tf.nn.sigmoid_cross_entropy_with_logits(logits = logits, labels = labels)
    return cost
def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001, num_epochs = 1500):
    ops.reset_default_graph()
    (n_x, m) = X_train.shape
    n_y = Y_train.shape[0]
    costs = []
    X, Y = create_placeholders(n_x, n_y)
    parameters = initialize_parameters()
    A2 = forward_propogation(X, parameters)
    cost = compute_cost(A2, Y)
    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    with tf.Session() as session:
        session.run(init)
        for epoch in range(num_epochs):
            epoch_cost = 0
            _, epoch_cost = session.run([optimizer, cost], feed_dict = {X : X_train, Y : Y_train})
        parameters = session.run(parameters)
        correct_prediction = tf.equal(tf.argmax(A2), tf.argmax(Y))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        print("Training Accuracy is {0} %...".format(accuracy.eval({X : X_train, Y : Y_train})))
        print("Test Accuracy is {0} %...".format(accuracy.eval({X : X_test, Y : Y_test})))
        return parameters
The error is caused by the use of tf.round when you define A2 (a known issue, by the way).
In this particular task, the solution is simply not to use tf.round at all. Remember that the output of tf.sigmoid is a value between 0 and 1, which can be interpreted as the probability of the result being 1. The cross-entropy loss measures the distance to the target, 0 or 1, and computes the needed update to the weights based on this distance. Calling tf.round before the cross-entropy squeezes the probability to either 0 or 1, which makes the cross-entropy pretty meaningless and leaves no gradient to propagate back to the weights.
By the way, tf.nn.sigmoid_cross_entropy_with_logits expects raw logits and applies the sigmoid internally, so you should pass Z2 to it directly rather than tf.sigmoid(Z2); otherwise the sigmoid is applied twice.
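To see why rounding kills the gradient, here is a small framework-free sketch that uses finite differences instead of TensorFlow. The helper names (`sigmoid`, `bce_with_logits`, `numeric_grad`) and the sample values z = 0.3, y = 1 are mine, chosen only for illustration. It compares the loss computed the way the question does (rounded sigmoid fed in as the "logit") with the loss computed directly on the raw pre-activation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logit, y):
    # binary cross-entropy that applies the sigmoid itself
    p = sigmoid(logit)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def numeric_grad(f, z, eps=1e-4):
    # central finite difference
    return (f(z + eps) - f(z - eps)) / (2 * eps)

y = 1.0   # target
z = 0.3   # pre-activation of the output unit (Z2 in the question)

# as in the question: round(sigmoid(z)) is passed on as if it were a logit
rounded_loss = lambda v: bce_with_logits(float(round(sigmoid(v))), y)
# suggested fix: pass the raw pre-activation
smooth_loss = lambda v: bce_with_logits(v, y)

g_rounded = numeric_grad(rounded_loss, z)
g_smooth = numeric_grad(smooth_loss, z)
print(g_rounded, g_smooth)
```

The rounded version is piecewise constant in z, so a small change in z does not change the loss at all, whereas the smooth version has the familiar gradient sigmoid(z) - y. TensorFlow reports the first situation as "No gradients provided for any variable".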