Log_probabilities returned by tf.nn.ctc_beam_search_decoder - python-3.x

I am training a LSTM-CTC speech recognition system with using beam search decoding in the following configuration:
decoded, log_prob =
tf.nn.ctc_beam_search_decoder(
inputs,
sequence_length,
beam_width=100,
top_paths=3,
merge_repeated=True
)
The output of log_probabilities for a batch by the above decoder are like:
[[ 14.73168373, 14.45586109, 14.35735512],
[ 20.45407486, 20.44991684, 20.41798401],
[ 14.9961853 , 14.925807 , 14.88066769],
...,
[ 18.89863396, 18.85992241, 18.85712433],
[ 3.93567419, 3.92791557, 3.89198923],
[ 14.56258488, 14.55923843, 14.51092243]],
So how do these scores represent log probabilities and if I want to compare confidence for top paths among examples then what will be the normalisation factor?

Related

Change trialId in Google AI Platform hyperparameter tuning

I'm trying to follow this tutorial on hyperparameter tuning on AI Platform: https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-on-google-cloud-platform-is-now-faster-and-smarter.
My configuration yaml file looks like this:
trainingInput:
hyperparameters:
goal: MINIMIZE
hyperparameterMetricTag: loss
maxTrials: 4
maxParallelTrials: 2
params:
- parameterName: learning_rate
type: DISCRETE
discreteValues:
- 0.0005
- 0.001
- 0.0015
- 0.002
The expected output:
"completedTrialCount": "4",
"trials": [
{
"trialId": "3",
"hyperparameters": {
"learning_rate": "2e-03"
},
"finalMetric": {
"trainingStep": "123456",
"objectiveValue": 0.123456
},
},
Is there any way to customize the trialId instead the defaults numeric values (e.g. 1,2,3,4...)?
It is not possible to customize the trialId as it is dependent on the parameter maxTrials in your hyperparameter tuning config.
maxTrials only accepts integers, so the assigned value to trialId will be a range from 1 to your defined maxTrials.
Also as mentioned in the example in your post where maxTrials: 40 is set and it yields a json that shows trialId: 35 which is within the range of maxTrials.
This indicates that 40 trials have been completed, and the best so far
is trial 35, which achieved an objective of 1.079 with the
hyperparameter values of nembeds=18 and nnsize=32.
Example output:

Gradients vanishing despite using Kaiming initialization

I was implementing a conv block in pytorch with activation function(prelu). I used Kaiming initilization to initialize all my weights and set all the bias to zero. However as I tested these blocks (by stacking 100 such conv and activation blocks on top of each other), I noticed that the output I am getting values of the order of 10^(-10). Is this normal, considering I am stacking upto 100 layers. Adding a small bias to each layer fixes the problem. But in Kaiming initialization the biases are supposed to be zero.
Here is the conv block code
from collections import Iterable
def convBlock(
input_channels, output_channels, kernel_size=3, padding=None, activation="prelu"
):
"""
Initializes a conv block using Kaiming Initialization
"""
padding_par = 0
if padding == "same":
padding_par = same_padding(kernel_size)
conv = nn.Conv2d(input_channels, output_channels, kernel_size, padding=padding_par)
relu_negative_slope = 0.25
act = None
if activation == "prelu" or activation == "leaky_relu":
nn.init.kaiming_normal_(conv.weight, a=relu_negative_slope, mode="fan_in")
if activation == "prelu":
act = nn.PReLU(init=relu_negative_slope)
else:
act = nn.LeakyReLU(negative_slope=relu_negative_slope)
if activation == "relu":
nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
act = nn.ReLU()
nn.init.constant_(conv.bias.data, 0)
block = nn.Sequential(conv, act)
return block
def flatten(lis):
for item in lis:
if isinstance(item, Iterable) and not isinstance(item, str):
for x in flatten(item):
yield x
else:
yield item
def Sequential(args):
flattened_args = list(flatten(args))
return nn.Sequential(*flattened_args)
This is the test Code
ls=[]
for i in range(100):
ls.append(convBlock(3,3,3,"same"))
model=Sequential(ls)
test=np.ones((1,3,5,5))
model(torch.Tensor(test))
And the output I am getting is
tensor([[[[-1.7771e-10, -3.5088e-10, 5.9369e-09, 4.2668e-09, 9.8803e-10],
[ 1.8657e-09, -4.0271e-10, 3.1189e-09, 1.5117e-09, 6.6546e-09],
[ 2.4237e-09, -6.2249e-10, -5.7327e-10, 4.2867e-09, 6.0034e-09],
[-1.8757e-10, 5.5446e-09, 1.7641e-09, 5.7018e-09, 6.4347e-09],
[ 1.2352e-09, -3.4732e-10, 4.1553e-10, -1.2996e-09, 3.8971e-09]],
[[ 2.6607e-09, 1.7756e-09, -1.0923e-09, -1.4272e-09, -1.1840e-09],
[ 2.0668e-10, -1.8130e-09, -2.3864e-09, -1.7061e-09, -1.7147e-10],
[-6.7161e-10, -1.3440e-09, -6.3196e-10, -8.7677e-10, -1.4851e-09],
[ 3.1475e-09, -1.6574e-09, -3.4180e-09, -3.5224e-09, -2.6642e-09],
[-1.9703e-09, -3.2277e-09, -2.4733e-09, -2.3707e-09, -8.7598e-10]],
[[ 3.5573e-09, 7.8113e-09, 6.8232e-09, 1.2285e-09, -9.3973e-10],
[ 6.6368e-09, 8.2877e-09, 9.2108e-10, 9.7531e-10, 7.0011e-10],
[ 6.6954e-09, 9.1019e-09, 1.5128e-08, 3.3151e-09, 2.1899e-10],
[ 1.2152e-08, 7.7002e-09, 1.6406e-08, 1.4948e-08, -6.0882e-10],
[ 6.9930e-09, 7.3222e-09, -7.4308e-10, 5.2505e-09, 3.4365e-09]]]],
grad_fn=<PreluBackward>)
Amazing question (and welcome to StackOverflow)! Research paper for quick reference.
TLDR
Try wider networks (64 channels)
Add Batch Normalization after activation (or even before, shouldn't make much difference)
Add residual connections (shouldn't improve much over batch norm, last resort)
Please check this out in this order and give a comment what (and if) any of that worked in your case (as I'm also curious).
Things you do differently
Your neural network is very deep, yet very narrow (81 parameters per layer only!)
Due to above, one cannot reliably create those weights from normal distribution as the sample is just too small.
Try wider networks, 64 channels or more
You are trying much deeper network than they did
Section: Comparison Experiments
We conducted comparisons on a deep but efficient model with 14 weight
layers (actually 22 was also tested in comparison with Xavier)
That was due to date of release of this paper (2015) and hardware limitations "back in the days" (let's say)
Is this normal?
Approach itself is quite strange with layers of this depth, at least currently;
each conv block is usually followed by activation like ReLU and Batch Normalization (which normalizes signal and helps with exploding/vanishing signals)
usually networks of this depth (even of depth half of what you've got) use also residual connections (though this is not directly linked to vanishing/small signal, more connected to degradation problem of even deep networks, like 1000 layers)

Integrate tflite model with tensorflow object-detection example code

I have a tflite model which I procured by converting my TFmodel( MobileNet Single Shot Detector (v2) ).
I have successfully converted my model into tflite format using the code below.
!tflite_convert \
--input_shape=1,300,300,3 \
--input_arrays=normalized_input_image_tensor \
--output_arrays=TFLite_Detection_PostProcess,TFLite_Detection_PostProcess:1,TFLite_Detection_PostProcess:2,TFLite_Detection_PostProcess:3 \
--allow_custom_ops \
--graph_def_file=/content/models/research/fine_tuned_model/tflite/tflite_graph.pb \
--output_file="/content/models/research/fine_tuned_model/final_model.tflite"
And have tried to integrate it into the object-detection code which is provided by the tensorflow team.But the output is not visible.
The Steps taken from my end for integrating were as follows:
1.Commenting the below line from build.gradle(app)
apply from:'download_model.gradle'
I added my tflite model in the assets folder and modified the label.txt with my own labels.
In the Detector Activity,
private static final boolean TF_OD_API_IS_QUANTIZED = true;
I have set the above boolean to false
and reduced the probability to 0.2
private static final float MINIMUM_CONFIDENCE_TF_OD_API = 0.5f;
But it didn't worked.
The github link to the object-detection code :-
https://github.com/tensorflow/examples/blob/master/lite/examples/object_detection/android
Also ,please also let know how to test the working of the tflite model using the test images.
These are the values after debugging the model
[[[ 0.15021165 0.45557776 0.99523586 1.009417 ]
[ 0.4825344 0.18693507 0.9941584 0.83610606]
[ 0.36018616 0.612343 1.0781565 1.1020089 ]
[ 0.47380492 0.03632754 0.99250865 0.5964786 ]
[ 0.15898478 0.12117874 0.94728076 0.8854655 ]
[ 0.44774154 0.41910237 0.9966481 0.9704595 ]
[ 0.06241751 -0.02005028 0.93670964 0.3915068 ]
[ 0.1917564 0.00806974 1.0165613 0.5287838 ]
[ 0.20279509 0.738887 0.95690674 1.0022873 ]
[ 0.7434618 0.07342905 0.9969055 0.6412263 ]]]
First train your model . Better use a trained model . After training the model get the tflite graphs and convert them to tflite model using Bazel(Better use Ubuntu). Then get the Tensorflow/examples and open the object detection android folder on your Android Studio .
Remove the metadata code from the tflite interpreter class , since it demands it and there is no official way declared by the officials to make it happen . And then you can make it work .

hell going on with stochastic gradient descent

I am working with multivariate linear regression and using stochastic gradient descent to optimize.
Working on this dataSet
http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
for every run all hyperParameters and all remaining things are same, epochs=200 and alpha=0.1
when I first run then I got final_cost=0.0591, when I run the program again keeping everything same I got final_cost=1.0056
, running again keeping everything same I got final_cost=0.8214
, running again final_cost=15.9591, running again final_cost=2.3162 and so on and on...
As you can see that keeping everything same and running, again and again, each time the final cost changes by large amount sometimes so large like from 0.8 to direct 15.9 , 0.05 to direct 1.00 and not only this the graph of final cost after every epoch within the same run is every zigzag unlike in batch GD in which the cost graph decreases smoothly.
I can't understand that why SGD is behaving so weirdly, different results in the different run.
I tried the same with batch GD and everything is absolutely fine and smooth as per expectations. In case of batch GD no matter how many times I run the same code the result is exactly the same every time.
But in the case of SGD, I literally cried,
class Abalone :
def __init__(self,df,epochs=200,miniBatchSize=250,alpha=0.1) :
self.df = df.dropna()
self.epochs = epochs
self.miniBatchSize = miniBatchSize
self.alpha = alpha
print("abalone created")
self.modelTheData()
def modelTheData(self) :
self.TOTAL_ATTR = len(self.df.columns) - 1
self.TOTAL_DATA_LENGTH = len(self.df.index)
self.df_trainingData =
df.drop(df.index[int(self.TOTAL_DATA_LENGTH * 0.6):])
self.TRAINING_DATA_SIZE = len(self.df_trainingData)
self.df_testingData =
df.drop(df.index[:int(self.TOTAL_DATA_LENGTH * 0.6)])
self.TESTING_DATA_SIZE = len(self.df_testingData)
self.miniBatchSize = int(self.TRAINING_DATA_SIZE / 10)
self.thetaVect = np.zeros((self.TOTAL_ATTR+1,1),dtype=float)
self.stochasticGradientDescent()
def stochasticGradientDescent(self) :
self.finalCostArr = np.array([])
startTime = time.time()
for i in range(self.epochs) :
self.df_trainingData =
self.df_trainingData.sample(frac=1).reset_index(drop=True)
miniBatches=[self.df_trainingData.loc[x:x+self.miniBatchSize-
((x+self.miniBatchSize)/(self.TRAINING_DATA_SIZE-1)),:]
for x in range(0,self.TRAINING_DATA_SIZE,self.miniBatchSize)]
self.epochCostArr = np.array([])
for j in miniBatches :
tempMat = j.values
self.actualValVect = tempMat[ : , self.TOTAL_ATTR:]
tempMat = tempMat[ : , :self.TOTAL_ATTR]
self.desMat = np.append(
np.ones((len(j.index),1),dtype=float) , tempMat , 1 )
del tempMat
self.trainData()
currCost = self.costEvaluation()
self.epochCostArr = np.append(self.epochCostArr,currCost)
self.finalCostArr = np.append(self.finalCostArr,
self.epochCostArr[len(miniBatches)-1])
endTime = time.time()
print(f"execution time : {endTime-startTime}")
self.graphEvaluation()
print(f"final cost :
{self.finalCostArr[len(self.finalCostArr)-1]}")
print(self.thetaVect)
def trainData(self) :
self.predictedValVect = self.predictResult()
diffVect = self.predictedValVect - self.actualValVect
partialDerivativeVect = np.matmul(self.desMat.T , diffVect)
self.thetaVect -=
(self.alpha/len(self.desMat))*partialDerivativeVect
def predictResult(self) :
return np.matmul(self.desMat,self.thetaVect)
def costEvaluation(self) :
cost = sum((self.predictedValVect - self.actualValVect)**2)
return cost / (2*len(self.actualValVect))
def graphEvaluation(self) :
plt.title("cost at end of all epochs")
x = range(len(self.epochCostArr))
y = self.epochCostArr
plt.plot(x,y)
plt.xlabel("iterations")
plt.ylabel("cost")
plt.show()
I kept epochs=200 and alpha=0.1 for all runs but I got a totally different result in each run.
The vector mentioned below is the theta vector where the first entry is the bias and remaining are weights
RUN 1 =>>
[[ 5.26020144]
[ -0.48787333]
[ 4.36479114]
[ 4.56848299]
[ 2.90299436]
[ 3.85349625]
[-10.61906207]
[ -0.93178027]
[ 8.79943389]]
final cost : 0.05917831328836957
RUN 2 =>>
[[ 5.18355814]
[ -0.56072668]
[ 4.32621647]
[ 4.58803884]
[ 2.89157598]
[ 3.7465471 ]
[-10.75751065]
[ -1.03302031]
[ 8.87559247]]
final cost: 1.0056239103948563
RUN 3 =>>
[[ 5.12836056]
[ -0.43672936]
[ 4.25664898]
[ 4.53397465]
[ 2.87847224]
[ 3.74693215]
[-10.73960775]
[ -1.00461585]
[ 8.85225402]]
final cost : 0.8214901206702101
RUN 4 =>>
[[ 5.38794798]
[ 0.23695412]
[ 4.43522951]
[ 4.66093372]
[ 2.9460605 ]
[ 4.13390252]
[-10.60071883]
[ -0.9230675 ]
[ 8.87229324]]
final cost: 15.959132174895712
RUN 5 =>>
[[ 5.19643132]
[ -0.76882106]
[ 4.35445135]
[ 4.58782119]
[ 2.8908931 ]
[ 3.63693031]
[-10.83291949]
[ -1.05709616]
[ 8.865904 ]]
final cost: 2.3162151072779804
I am unable to figure out what is going Wrong. Does SGD behave like this or I did some stupidity while transforming my code from batch GD to SGD. And if SGD behaves like this then how I get to know that how many times I have to rerun because I am not so lucky that every time in the first run I got such a small cost like 0.05 sometimes the first run gives cost around 10.5 sometimes 0.6 and maybe rerunning it a lot of times I got cost even smaller than 0.05.
when I approached the exact same problem with exact same code and hyperParameters just replacing the SGD function with normal batch GD I get the expected result i.e, after each iteration over the same data my cost is decreasing smoothly i.e., a monotonic decreasing function and no matter how many times I rerun the same program I got exactly the same result as this is very obvious.
"keeping everything same but using batch GD for epochs=20000 and alpha=0.1
I got final_cost=2.7474"
def BatchGradientDescent(self) :
self.costArr = np.array([])
startTime = time.time()
for i in range(self.epochs) :
tempMat = self.df_trainingData.values
self.actualValVect = tempMat[ : , self.TOTAL_ATTR:]
tempMat = tempMat[ : , :self.TOTAL_ATTR]
self.desMat = np.append( np.ones((self.TRAINING_DATA_SIZE,1),dtype=float) , tempMat , 1 )
del tempMat
self.trainData()
if i%100 == 0 :
currCost = self.costEvaluation()
self.costArr = np.append(self.costArr,currCost)
endTime = time.time()
print(f"execution time : {endTime - startTime} seconds")
self.graphEvaluation()
print(self.thetaVect)
print(f"final cost : {self.costArr[len(self.costArr)-1]}")
SomeBody help me figure out What actually is going on. Every opinion/solution is big revenue for me in this new field :)
You missed the most important and only difference between GD ("Gradient Descent") and SGD ("Stochastic Gradient Descent").
Stochasticity - Literally means "the quality of lacking any predictable order or plan". Meaning randomness.
Which means that while in the GD algorithm, the order of the samples in each epoch remains constant, in SGD the order is randomly shuffled at the beginning of every epochs.
So every run of GD with the same initialization and hyperparameters will produce the exact same results, while SGD will most defiantly not (as you have experienced).
The reason for using stochasticity is to prevent the model from memorizing the training samples (which will results in overfitting, where accuracy on the training set will be high but accuracy on unseen samples will be bad).
Now regarding to the big differences in final cost values between runs at your case, my guess is that your learning rate is too high. You can use a lower constant value, or better yet, use a decaying learning rate (which gets lower as epochs get higher).

Hierarchical Clustering with cosine similarity metric in fcluster package

I use scipy.cluster.hierarchy to do a hierarchical clustering on a set of points using "cosine" similarity metric. As an example, I have:
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt
Points =
np.array([[ 0. , 0.23508573],
[ 0.00754775 , 0.26717266],
[ 0.00595464 , 0.27775905],
[ 0.01220563 , 0.23622067],
[ 0.00542628 , 0.14185873],
[ 0.03078922 , 0.11273108],
[ 0.06707743 ,-0.1061131 ],
[ 0.04411757 ,-0.10775407],
[ 0.01349434 , 0.00112159],
[ 0.04066034 , 0.11639591],
[ 0. , 0.29046682],
[ 0.07338036 , 0.00609912],
[ 0.01864988 , 0.0316196 ],
[ 0. , 0.07270636],
[ 0. , 0. ]])
z = hac.linkage(Points, metric='cosine', method='complete')
labels = hac.fcluster(z, 0.1, criterion="distance")
plt.scatter(Points[:, 0], Points[:, 1], c=labels.astype(np.float))
plt.show()
Since I use cosine metric, in some cases the dot product of two vectors can be negative or norm of some vectors can be zero. It means z output will have some negative or infinite elements which is not valid for fcluster (as below):
z =
[[ 0.00000000e+00 1.00000000e+01 0.00000000e+00 2.00000000e+00]
[ 1.30000000e+01 1.50000000e+01 0.00000000e+00 3.00000000e+00]
[ 8.00000000e+00 1.10000000e+01 4.26658708e-13 2.00000000e+00]
[ 1.00000000e+00 2.00000000e+00 2.31748880e-05 2.00000000e+00]
[ 3.00000000e+00 4.00000000e+00 8.96700489e-05 2.00000000e+00]
[ 1.60000000e+01 1.80000000e+01 3.98805492e-04 5.00000000e+00]
[ 1.90000000e+01 2.00000000e+01 1.33225099e-03 7.00000000e+00]
[ 5.00000000e+00 9.00000000e+00 2.41120340e-03 2.00000000e+00]
[ 6.00000000e+00 7.00000000e+00 1.52914684e-02 2.00000000e+00]
[ 1.20000000e+01 2.20000000e+01 3.52441432e-02 3.00000000e+00]
[ 2.10000000e+01 2.40000000e+01 1.38662986e-01 1.00000000e+01]
[ 1.70000000e+01 2.30000000e+01 6.99056531e-01 4.00000000e+00]
[ 2.50000000e+01 2.60000000e+01 1.92543748e+00 1.40000000e+01]
[ -1.00000000e+00 2.70000000e+01 inf 1.50000000e+01]]
To solve this problem, I checked linkage() function and inside it I needed to check _hierarchy.linkage() method. I use pycharm text editor and when I asked for "linkage" source code, it opened up a python file namely "_hierarchy.py" inside the directory like the following:
.PyCharm40/system/python_stubs/-1247972723/scipy/cluster/_hierarchy.py
This python file doesn't have any definition for all included functions.
I am wondering what is the correct source of this function to revise it or is there another way to solve this problem.
I would be appreciated for your helps and hints.
You have a zero vector 0 0 in your data set. For such data, cosine distance is undefined, so you are using an inappropriate distance function!
This is a definition gap that cannot be trivially closed. inf is as incorrect as 0. The distance to 0 0 with cosine cannot be defined without contraditions. You must not use cosine on such data.
Back to your actual question: _hierarchy is a Cython module. It is not pure python, but it is compiled to native code. You can easily see the source code on Github:
https://github.com/scipy/scipy/blob/master/scipy/cluster/_hierarchy.pyx

Resources