Computing gradients twice for two different losses in PyTorch

I want to compute the gradients twice for two different losses in the same iteration.
Code:
batch_output0, batch_output1 = get_output_from_model(model=model, data=batch[0])
train_loss0 = loss_fun0(batch_output0, batch_labels0.float().view(-1, 1))
train_loss0.backward()
grad0_conv_w = model.conv1.conv1.weight.grad

batch_output0, batch_output1 = get_output_from_model(model=model, data=batch[0])
train_loss1 = loss_fun1(batch_output1, batch_labels1.float().view(-1, 1))
train_loss1.backward()
grad1_conv_w = model.conv1.conv1.weight.grad
Outputs:
train_loss0: tensor(0.6950, grad_fn=<BinaryCrossEntropyBackward>)
train_loss1: tensor(25.5431, grad_fn=<MseLossBackward>)
Grad0: tensor([-2.4883e-05, 3.7842e-05, 1.2635e-04, ..., -1.6413e-04,
-1.8419e-04, -1.7884e-04])
Grad1: tensor([-2.4883e-05, 3.7842e-05, 1.2635e-04, ..., -1.6413e-04,
-1.8419e-04, -1.7884e-04])
You may note that even though the two losses are quite different, the gradients for the corresponding losses are exactly the same.
Please help me to diagnose the problem.
Thank you.
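One likely explanation (an assumption, since the rest of the training loop isn't shown) is that gradients accumulate across backward() calls and that both variables refer to the same .grad tensor object, so printing them at the end shows the same accumulated buffer twice. A minimal sketch of how the two gradients could be captured separately, reusing the placeholders from the question (get_output_from_model, loss_fun0/loss_fun1, the label tensors):

# Zero gradients before each backward pass and clone the result, so that
# grad0_conv_w and grad1_conv_w are independent snapshots instead of two
# references to the same accumulated .grad tensor.
model.zero_grad()
batch_output0, batch_output1 = get_output_from_model(model=model, data=batch[0])
train_loss0 = loss_fun0(batch_output0, batch_labels0.float().view(-1, 1))
train_loss0.backward()
grad0_conv_w = model.conv1.conv1.weight.grad.clone()

model.zero_grad()
batch_output0, batch_output1 = get_output_from_model(model=model, data=batch[0])
train_loss1 = loss_fun1(batch_output1, batch_labels1.float().view(-1, 1))
train_loss1.backward()
grad1_conv_w = model.conv1.conv1.weight.grad.clone()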

Related

Number of hidden layers and units in an AutoKeras dense block

I am training a model with AutoKeras. So far my best model is this:
structured_data_block_1/normalize:false
structured_data_block_1/dense_block_1/use_batchnorm:true
structured_data_block_1/dense_block_1/num_layers:2
structured_data_block_1/dense_block_1/units_0:32
structured_data_block_1/dense_block_1/dropout:0
structured_data_block_1/dense_block_1/units_1:32
dense_block_2/use_batchnorm:true
dense_block_2/num_layers:2
dense_block_2/units_0:128
dense_block_2/dropout:0
dense_block_2/units_1:16
dense_block_3/use_batchnorm:false
dense_block_3/num_layers:1
dense_block_3/units_0:32
dense_block_3/dropout:0
dense_block_3/units_1:32
regression_head_1/dropout:0
optimizer:"adam"
learning_rate:0.1
dense_block_2/units_2:32
structured_data_block_1/dense_block_1/units_2:256
dense_block_3/units_2:128
My first dense_block_1 has 2 layers (num_layers: 2), so how can it have three unit counts? It says units_0: 32, units_1: 32 and units_2: 256, which implies to me that there are three layers, so why is num_layers: 2?
If I wanted to recreate the above model with the code below, how would I do it properly?
input_node = ak.StructuredDataInput()
output_node = ak.StructuredDataBlock(categorical_encoding=False, normalize=False)(input_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.RegressionHead()(output_node)
Thanks for any input.
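One likely explanation for the extra entries (an assumption, as it depends on the AutoKeras/KerasTuner version): units_2 is an inactive hyperparameter that the tuner still records in the trial; with num_layers: 2 only units_0 and units_1 are actually built. To recreate the searched architecture, recent AutoKeras releases let you pin block hyperparameters directly; the keyword arguments below (num_layers, use_batchnorm, dropout) should be treated as an assumption for your particular version:

import autokeras as ak

input_node = ak.StructuredDataInput()
output_node = ak.StructuredDataBlock(categorical_encoding=False, normalize=False)(input_node)
# dense_block_2 from the trial: 2 layers, batchnorm, no dropout
output_node = ak.DenseBlock(num_layers=2, use_batchnorm=True, dropout=0)(output_node)
# dense_block_3 from the trial: 1 layer, no batchnorm, no dropout
output_node = ak.DenseBlock(num_layers=1, use_batchnorm=False, dropout=0)(output_node)
output_node = ak.RegressionHead(dropout=0)(output_node)
reg = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=1, overwrite=True)

As far as I know this pins the layer counts but not the individual unit counts (32/128/16); to reproduce those exactly, one option is to call export_model() on the fitted AutoModel and rebuild the returned Keras model directly.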

Gradients vanishing despite using Kaiming initialization

I was implementing a conv block in PyTorch with an activation function (PReLU). I used Kaiming initialization to initialize all my weights and set all the biases to zero. However, when I tested these blocks (by stacking 100 such conv + activation blocks on top of each other), I noticed that the output values I get are of the order of 10^(-10). Is this normal, considering I am stacking up to 100 layers? Adding a small bias to each layer fixes the problem, but in Kaiming initialization the biases are supposed to be zero.
Here is the conv block code
from collections.abc import Iterable  # collections.Iterable was removed in Python 3.10

import torch.nn as nn


def convBlock(
    input_channels, output_channels, kernel_size=3, padding=None, activation="prelu"
):
    """
    Initializes a conv block using Kaiming initialization.
    """
    padding_par = 0
    if padding == "same":
        padding_par = same_padding(kernel_size)  # helper defined elsewhere in the project
    conv = nn.Conv2d(input_channels, output_channels, kernel_size, padding=padding_par)
    relu_negative_slope = 0.25
    act = None
    if activation == "prelu" or activation == "leaky_relu":
        nn.init.kaiming_normal_(conv.weight, a=relu_negative_slope, mode="fan_in")
        if activation == "prelu":
            act = nn.PReLU(init=relu_negative_slope)
        else:
            act = nn.LeakyReLU(negative_slope=relu_negative_slope)
    if activation == "relu":
        nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
        act = nn.ReLU()
    nn.init.constant_(conv.bias.data, 0)
    block = nn.Sequential(conv, act)
    return block


def flatten(lis):
    for item in lis:
        if isinstance(item, Iterable) and not isinstance(item, str):
            for x in flatten(item):
                yield x
        else:
            yield item


def Sequential(args):
    flattened_args = list(flatten(args))
    return nn.Sequential(*flattened_args)
This is the test code
import numpy as np
import torch

ls = []
for i in range(100):
    ls.append(convBlock(3, 3, 3, "same"))
model = Sequential(ls)

test = np.ones((1, 3, 5, 5))
model(torch.Tensor(test))
And the output I am getting is
tensor([[[[-1.7771e-10, -3.5088e-10, 5.9369e-09, 4.2668e-09, 9.8803e-10],
[ 1.8657e-09, -4.0271e-10, 3.1189e-09, 1.5117e-09, 6.6546e-09],
[ 2.4237e-09, -6.2249e-10, -5.7327e-10, 4.2867e-09, 6.0034e-09],
[-1.8757e-10, 5.5446e-09, 1.7641e-09, 5.7018e-09, 6.4347e-09],
[ 1.2352e-09, -3.4732e-10, 4.1553e-10, -1.2996e-09, 3.8971e-09]],
[[ 2.6607e-09, 1.7756e-09, -1.0923e-09, -1.4272e-09, -1.1840e-09],
[ 2.0668e-10, -1.8130e-09, -2.3864e-09, -1.7061e-09, -1.7147e-10],
[-6.7161e-10, -1.3440e-09, -6.3196e-10, -8.7677e-10, -1.4851e-09],
[ 3.1475e-09, -1.6574e-09, -3.4180e-09, -3.5224e-09, -2.6642e-09],
[-1.9703e-09, -3.2277e-09, -2.4733e-09, -2.3707e-09, -8.7598e-10]],
[[ 3.5573e-09, 7.8113e-09, 6.8232e-09, 1.2285e-09, -9.3973e-10],
[ 6.6368e-09, 8.2877e-09, 9.2108e-10, 9.7531e-10, 7.0011e-10],
[ 6.6954e-09, 9.1019e-09, 1.5128e-08, 3.3151e-09, 2.1899e-10],
[ 1.2152e-08, 7.7002e-09, 1.6406e-08, 1.4948e-08, -6.0882e-10],
[ 6.9930e-09, 7.3222e-09, -7.4308e-10, 5.2505e-09, 3.4365e-09]]]],
grad_fn=<PreluBackward>)
Amazing question (and welcome to Stack Overflow)! The Kaiming initialization paper, "Delving Deep into Rectifiers" (He et al., 2015), is worth keeping at hand for quick reference.
TLDR
Try wider networks (64 channels)
Add Batch Normalization after activation (or even before, shouldn't make much difference)
Add residual connections (shouldn't improve much over batch norm, last resort)
Please check these out in this order and leave a comment about what (if anything) worked in your case, as I'm also curious.
Things you do differently
Your neural network is very deep, yet very narrow (only 81 parameters per layer!)
Because of the above, one cannot reliably draw those weights from a normal distribution, as the sample is just too small.
Try wider networks, 64 channels or more
You are trying a much deeper network than they did
Section: Comparison Experiments
We conducted comparisons on a deep but efficient model with 14 weight
layers (actually 22 was also tested in comparison with Xavier)
That was due to the paper's release date (2015) and the hardware limitations "back in the day" (so to speak)
Is this normal?
The approach itself is quite unusual for layers of this depth, at least currently;
each conv block is usually followed by an activation like ReLU and by Batch Normalization (which normalizes the signal and helps with exploding/vanishing signals)
networks of this depth (even half of what you've got) usually also use residual connections (though these are not directly linked to vanishing/small signals, and are more connected to the degradation problem of very deep networks, e.g. 1000 layers)
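As a concrete illustration of the first two suggestions, here is a minimal sketch (not the asker's original helper, and the 64-channel width is just an example value) of a wider conv block with Batch Normalization added after the activation:

import torch
import torch.nn as nn


def conv_bn_block(in_channels, out_channels, kernel_size=3, negative_slope=0.25):
    """Conv -> PReLU -> BatchNorm block with Kaiming (fan_in) initialization."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
    nn.init.kaiming_normal_(conv.weight, a=negative_slope, mode="fan_in")
    nn.init.constant_(conv.bias, 0)
    return nn.Sequential(conv, nn.PReLU(init=negative_slope), nn.BatchNorm2d(out_channels))


# 100 blocks as in the question, but 64 channels wide and with BatchNorm
blocks = [conv_bn_block(3, 64)] + [conv_bn_block(64, 64) for _ in range(99)]
model = nn.Sequential(*blocks)

with torch.no_grad():
    out = model(torch.ones(1, 3, 5, 5))
print(out.abs().mean())  # should stay at an O(1) scale rather than ~1e-10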

How does sklearn.linear_model.LinearRegression work with insufficient data?

To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8-parameter model (7 slopes and 1 intercept).
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assignment to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method does it use to find the parameters of an under-specified model?
You don't need to reduce the degrees of freedom: it simply finds a solution to the least-squares problem min_beta sum_i (dot(beta, x_i) + beta_0 - y_i)**2. In the non-sparse case it uses scipy.linalg.lstsq, whose default solver for this optimization problem is the gelsd LAPACK driver. If
A = np.concatenate((ones_v, X), axis=1)
is the augmented array with ones as its first column, then your solution is given by
beta = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank. Of course, the solver doesn't actually evaluate this formula; it uses a singular value decomposition of A to compute the solution.
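A quick way to check this numerically (a sketch reusing x and y from the question; the exact-fit checks assume the data are in general position):

import numpy as np
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(x, y)

# Augmented design matrix with a leading column of ones for the intercept.
A = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)

# Minimum-norm least-squares solution (SVD-based, like the gelsd driver).
beta, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

print(rank)                           # 6 < 8: the system is under-determined
print(np.allclose(A @ beta, y))       # an exact fit exists, so residuals are ~0
print(np.allclose(lr.predict(x), y))  # sklearn's solution also fits the data exactly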

Top 4 Prediction Using Keras Model

I made my own Keras CNN and used the code below to predict. The prediction prints all 143 classes, while I only want the four classes with the highest percentages.
Code:
preds = model.predict(imgs)
for cls in train_generator.class_indices:
    x = preds[0][train_generator.class_indices[cls]]
    x_pred = "{:.1%}".format(x)
    value = (cls + ":" + x_pred)
    print(value)
Prediction:
Acacia_abyssinica:0.0%
Acacia_kirkii:0.0%
Acacia_mearnsii:0.0%
Acacia_melanoxylon:0.0%
Acacia_nilotica:0.0%
Acacia_polyacantha:0.0%
Acacia_senegal:0.0%
Acacia_seyal:0.0%
Acacia_xanthophloea:0.0%
Afrocarpus_falcatus:0.0%
Afzelia_quanzensis:0.0%
Albizia_gummifera:0.0%
Albizia_lebbeck:0.0%
Allanblackia_floribunda:0.0%
Artocarpus_heterophyllus:0.0%
Azadirachta_indica:0.0%
Balanites_aegyptiaca:0.0%
Bersama_abyssinica:0.0%
Bischofia_javanica:0.0%
Brachylaena_huillensis:0.0%
Bridelia_micrantha:0.0%
Calodendron_capensis:0.0%
Calodendrum_capense:0.0%
Casimiroa_edulis:0.0%
Cassipourea_malosana:0.0%
Casuarina_cunninghamiana:0.0%
Casuarina_equisetifolia:4.8%
Catha_edulis:0.0%
Cathium_Keniensis:0.0%
Ceiba_pentandra:39.1%
Celtis_africana:0.0%
Chionanthus_battiscombei:0.0%
Clausena_anisat:0.0%
Clerodendrum_johnstonii:0.0%
Combretum_molle:0.0%
Cordia_africana:0.0%
Cordia_africana_Cordia:0.0%
Cotoneaster_Pannos:0.0%
Croton_macrostachyus:0.0%
Croton_megalocarpus:0.0%
Cupressus_lusitanica:0.0%
Cussonia_Spicata:0.2%
Cussonia_holstii:0.0%
Diospyros_abyssinica:0.0%
Dodonaea_angustifolia:0.0%
Dodonaea_viscosa:0.0%
Dombeya_goetzenii:0.0%
Dombeya_rotundifolia:0.0%
Dombeya_torrida:0.0%
Dovyalis_abyssinica:0.0%
Dovyalis_macrocalyx:0.0%
Drypetes_gerrardii:0.0%
Ehretia_cymosa:0.0%
Ekeber_Capensis:0.0%
Erica_arborea:0.0%
Eriobotrya_japonica:0.0%
Erythrina_abyssinica:0.0%
Eucalyptus_camaldulensis:0.0%
Eucalyptus_globulus:55.9%
Eucalyptus_grandis:0.0%
Eucalyptus_grandis_saligna:0.0%
Eucalyptus_hybrids:0.0%
Eucalyptus_saligna:0.0%
Euclea_divinorum:0.0%
Ficus_indica:0.0%
Ficus_natalensi:0.0%
Ficus_sur:0.0%
Ficus_sycomorus:0.0%
Ficus_thonningii:0.0%
Flacourtia_indica:0.0%
Flacourtiaceae:0.0%
Fraxinus_pennsylvanica:0.0%
Grevillea_robusta:0.0%
Hagenia_abyssinica:0.0%
Jacaranda_mimosifolia:0.0%
Juniperus_procera:0.0%
Kigelia_africana:0.0%
Macaranga_capensis:0.0%
Mangifera_indica:0.0%
Manilkara_Discolor:0.0%
Markhamia_lutea:0.0%
Maytenus_senegalensis:0.0%
Melia_volkensii:0.0%
Meyna_tetraphylla:0.0%
Milicia_excelsa:0.0%
Moringa_Oleifera:0.0%
Murukku_Trichilia_emetica:0.0%
Myrianthus_holstii:0.0%
Newtonia_buchananii:0.0%
Nuxia_congesta:0.0%
Ochna_holstii:0.0%
Ochna_ovata:0.0%
Ocotea_usambarensis:0.0%
Olea_Europaea:0.0%
Olea_africana:0.0%
Olea_capensis:0.0%
Olea_hochstetteri:0.0%
Olea_welwitschii:0.0%
Osyris_lanceolata:0.0%
Persea_americana:0.0%
Pinus_radiata:0.0%
Podocarpus _falcatus:0.0%
Podocarpus_latifolius:0.0%
Polyscias_fulva:0.0%
Polyscias_kikuyuensis:0.0%
Pouteria_adolfi_friedericii:0.0%
Prunus_africana:0.0%
Psidium_guajava:0.0%
Rauvolfia_Vomitoria:0.0%
Rhus_natalensis:0.0%
Rhus_vulgaris:0.0%
Schinus_molle:0.0%
Schrebera_alata:0.0%
Sclerocarya_birrea:0.0%
Scolopia_zeyheri:0.0%
Senna_siamea:0.0%
Sinarundinaria_alpina:0.0%
Solanum_mauritianum:0.0%
Spathodea_campanulata:0.0%
Strychnos_usambare:0.0%
Syzygium_afromontana:0.0%
Syzygium_cordatum:0.0%
Syzygium_cuminii:0.0%
Syzygium_guineense:0.0%
Tamarindus_indica:0.0%
Tarchonanthus_camphoratus:0.0%
Teclea_Nobilis:0.0%
Teclea_simplicifolia:0.0%
Terminalia_brownii:0.0%
Terminalia_mantaly:0.0%
Toddalia_asiatica:0.0%
Trema_Orientalis:0.0%
Trichilia_emetica:0.0%
Trichocladus_ellipticus:0.0%
Trimeria_grandifolia:0.0%
Vangueria_madagascariensis:0.0%
Vepris_nobilis:0.0%
Vepris_simplicifolia:0.0%
Vernonia_auriculifera:0.0%
Vitex_keniensis:0.0%
Warburgia_ugandensis:0.0%
Zanthoxylum_gilletii:0.0%
Mahogany_tree:0.0%
You can just get all your predictions, sort them, and take the top four:
preds = model.predict(imgs)
sorted_preds = []
for cls in train_generator.class_indices:
    x = preds[0][train_generator.class_indices[cls]]
    x_pred = "{:.1%}".format(x)
    sorted_preds.append([x, x_pred, cls])
top_4 = sorted(sorted_preds, reverse=True)[:4]
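Equivalently (a sketch assuming, as in the question, that preds[0] is a 1-D score vector and that train_generator.class_indices maps class names to column indices), numpy's argsort can pull out the top four directly:

import numpy as np

# Invert the name -> index mapping so indices can be turned back into labels.
index_to_class = {i: c for c, i in train_generator.class_indices.items()}

top4_idx = np.argsort(preds[0])[::-1][:4]  # indices of the 4 highest scores
for i in top4_idx:
    print("{}:{:.1%}".format(index_to_class[i], preds[0][i]))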

scikit-learn roc_curve: why does it return a threshold value = 2 some time?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers in [0, 1]. However, it sometimes gives me an array whose first number is close to 2. Is it a bug or did I do something wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used, for example in calculating the area under the curve, or plotting the False Positive Rate against the True Positive Rate.
Yet to plot what looks like a reasonable curve, one needs a threshold that includes no data points at all, so that the curve starts at (0, 0). Since scikit-learn's ROC curve function does not require normalised probabilities as thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf would be sensible, but coders often expect finite values (and the implementation should also work for integer scores). Instead the implementation uses max(score) + epsilon, where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's a problem!
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.
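A quick check of that relationship (reusing fpr, tpr, thresholds from the session above; note that newer scikit-learn releases replace this placeholder with an infinite value, so the exact +1 relation only holds for older versions like the one in the question):

# The placeholder threshold sits one unit above the largest score, and at that
# threshold nothing is predicted positive, so the curve starts at (0, 0).
print(thresholds[0] == thresholds[1] + 1)  # True
print(fpr[0], tpr[0])                      # 0.0 0.0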
This seems like a bug to me: in roc_curve(aa, bb), 1 is added to the first threshold. You should create an issue here: https://github.com/scikit-learn/scikit-learn/issues
