Ambiguity in understanding SMAPE - Keras

I have implemented several regression forecast approaches and now I want to compare them. I picked the MAE, RMSE and SMAPE metrics. My results look as follows:
Approach 1: MAE = 0.6, RMSE = 0.9, SMAPE = 531
Approach 2: MAE = 3.0, RMSE = 6.1, SMAPE = 510
Approach 3: MAE = 10.1, RMSE = 17.0, SMAPE = 420
When I plot my predictions and compare them with my test set, I can see that Approach 1 > Approach 2 > Approach 3. This is also evident from the values of MAE and RMSE. But I thought that the lower the resulting SMAPE, the better the prediction.
Did I misunderstand SMAPE?
Since there is no predefined method in Python, my SMAPE calculation looks like this:
import numpy as np

def smape(A, F):
    return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))
Or is the calculation wrong?
thanks in advance

Okay, maybe the method was wrong. Instead I used this one from Kaggle:
from numba import jit
import math

@jit
def smape_fast(y_true, y_pred):
    out = 0
    for i in range(y_true.shape[0]):
        a = y_true[i]
        b = y_pred[i]
        c = a + b
        if c == 0:
            continue
        out += math.fabs(a - b) / c
    out *= (200.0 / y_true.shape[0])
    return out
URL
Now the SMAPE results look more plausible compared to MAE and RMSE.
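For reference, here is a vectorized sketch of my own (not from the Kaggle kernel) that follows the usual SMAPE definition but masks out zero denominators, which is what the loop version's continue achieves:

import numpy as np

def smape_masked(y_true, y_pred):
    # SMAPE = 200/N * sum(|F - A| / (|A| + |F|)), skipping terms where the
    # denominator is 0 (both the actual value and the forecast are 0).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    mask = denom != 0
    return 200.0 / len(y_true) * np.sum(np.abs(y_pred - y_true)[mask] / denom[mask])

Note that the Kaggle loop divides by a + b rather than |a| + |b|, so the two versions only agree when the data is non-negative.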

Related

A batch-normalization error in the epsilon of TensorFlow

We found that the implementation of tf.keras.layers.BatchNormalization does not conform to its mathematical model. The cause of the problem may lie in its epsilon or variance parameter. The error can be reproduced in four steps:
(1) Initialize a BN operator (i.e. source_model), feed it a random input (i.e. data), and get an output (i.e. source_result);
(2) Randomly generate a perturbation (i.e. delta). Add delta to epsilon and subtract delta from the variance of source_model (so that variance + epsilon stays the same), giving a new BN operator (i.e. follow_model);
(3) Feed data to follow_model and get follow_result;
(4) Calculate the distance between source_result and follow_result. Theoretically it should be small or even 0; in practice it can be greater than 1.
# from tensorflow.keras.layers import BatchNormalization, Input
# from tensorflow.keras.models import Model, clone_model
from tensorflow._api.v1.keras.layers import BatchNormalization, Input
from tensorflow._api.v1.keras.models import Model, clone_model
import os
import re
import numpy as np

def SourceModel(shape):
    x = Input(shape=shape[1:])
    y = BatchNormalization(axis=-1)(x)
    return Model(x, y)

def FollowModel_1(source_model):
    follow_model = clone_model(source_model)
    # read weights
    weights = source_model.get_weights()
    weights_names = [weight.name for layer in source_model.layers for weight in layer.weights]
    variance_idx = FindWeightsIdx("variance", weights_names)
    # mutation operator
    # delta = np.random.uniform(-1e-3, 1e-3, 1)[0]
    follow_model.layers[1].epsilon += delta  # mutation epsilon
    weights[variance_idx] -= delta
    follow_model.set_weights(weights)
    return follow_model

def FindWeightsIdx(name, weights_names):
    # find layer index by name
    for idx, names in enumerate(weights_names):
        if re.search(name, names):
            return idx
    return -1

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

shape = (10, 32, 32, 3)
data = np.random.uniform(-1, 1, shape)
delta = -1
source_model = SourceModel(shape)
follow_model = FollowModel_1(source_model)
source_result = source_model.predict(data)
follow_result = follow_model.predict(data)
dis = np.sum(abs(source_result - follow_result))
print("delta:", delta, "; dis:", dis)
No matter how large delta is, dis should be small, but it is not. This suggests there may be a bug in TensorFlow's batch-norm operator. The problem occurs in both TF 1.x and TF 2.x:
delta: -1 ; dis: 4497.482
I wouldn't call this a bug so much as undocumented behavior.
I noticed that, with your code, I did not get any difference for delta > 0, or in fact any delta > -0.001 -- the default epsilon is 0.001, so using a larger delta means we still have epsilon > 0. Any larger negative delta (in particular -1 as in your example) will cause epsilon < 0.
Is epsilon < 0 a problem? Yes, because epsilon is there to prevent division by 0 when dividing by the variance. Variance is always >= 0, so epsilon > 0 guarantees the denominator stays positive; making epsilon negative can lead to a division by 0. The more pertinent issue is that epsilon is added to the variance and then the square root is taken to get the standard deviation, which breaks when variance + epsilon is < 0; this can happen for small variance and epsilon < 0.
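For intuition, here is a minimal NumPy sketch of the (non-fused) inference-time batch-norm formula, written by me for illustration, showing what happens once variance + epsilon goes negative:

import numpy as np

def batch_norm_inference(x, mean, var, gamma, beta, eps):
    # Standard inference-time batch norm: scale by 1 / sqrt(var + eps).
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([0.5, -0.2, 0.1])
# With a small moving variance and eps = 0.001 - 1 = -0.999,
# var + eps is negative, so the square root is NaN.
print(batch_norm_inference(x, mean=0.0, var=0.04, gamma=1.0, beta=0.0, eps=-0.999))
# -> [nan nan nan] (NumPy warns about an invalid value in sqrt)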
My hunch was that somewhere in the code, they do something like taking abs(epsilon) to prevent such issues. However, I couldn't find anything in the layer implementation and also not in the op that is used by the layer.
However, BN by default uses a "fused" implementation which is faster. This is here. And there we see these lines:
# Set a minimum epsilon to 1.001e-5, which is a requirement by CUDNN to
# prevent exception (see cudnn.h).
min_epsilon = 1.001e-5
epsilon = epsilon if epsilon > min_epsilon else min_epsilon
So indeed, epsilon is always positive. The only thing I don't understand is that you can pass fused=False to the BN constructor, which should use the more basic implementation, where I couldn't find anything that modifies epsilon. But when I tested it, the issue remained. Not sure what the problem is...
tl;dr: You have epsilon < 0. Don't do that, it's bad.
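If you still want to run the epsilon-mutation experiment, one option (a sketch of my own, mirroring the clamp in the fused kernel rather than anything in the original code) is to keep epsilon above that minimum when perturbing it:

# Hypothetical guard for FollowModel_1: keep epsilon above the CUDNN minimum
# so the follow model never ends up with a non-positive epsilon.
MIN_EPSILON = 1.001e-5

def mutate_epsilon(bn_layer, delta):
    bn_layer.epsilon = max(bn_layer.epsilon + delta, MIN_EPSILON)
    return bn_layer.epsilon

With this guard, large negative deltas simply saturate at the minimum instead of producing a negative epsilon.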

Optuna pruning for validation loss

I introduced the following lines into my deep learning project in order to stop training early when the validation loss has not improved for 10 epochs:
if best_valid_loss is None or valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    counter = 0
else:
    counter += 1
    if counter == 10:
        break
Now I want to use Optuna to tune some hyperparameters, but I don't really understand how pruning works in Optuna. Is it possible for Optuna pruners to act the same way as in the code above? I assume I have to use the following:
optuna.pruners.PatientPruner(???, patience=10)
But I don't know which pruner I could use inside PatientPruner. By the way, in Optuna I'm minimizing the validation loss.
Short answer: Yes.
Hi, I'm one of the authors of PatientPruner in Optuna. If we perform vanilla early stopping, wrapped_pruner=None works as expected. For example:
import optuna

def objective(t):
    for step in range(30):
        if step == 5:
            t.report(0., step=step)
        else:
            t.report(step * 0.1, step=step)
        if t.should_prune():
            print("pruned at {}".format(step))
            raise optuna.exceptions.TrialPruned()
    return 1.

study = optuna.create_study(pruner=optuna.pruners.PatientPruner(None, patience=9), direction="minimize")
study.optimize(objective, n_trials=1)
The output will be pruned at 15.
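To connect this back to the training loop in the question, here is a sketch of how the reporting could look with PatientPruner(None, patience=10); the training/validation code inside the loop is a placeholder, not part of Optuna's API:

import random
import optuna

def objective(trial):
    best_valid_loss = float("inf")
    for epoch in range(100):
        # Placeholder for your real training + validation code:
        valid_loss = random.random()
        trial.report(valid_loss, step=epoch)
        # With PatientPruner(None, patience=10), the trial is pruned once more
        # than 10 consecutive reported values pass without improvement.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()
        best_valid_loss = min(best_valid_loss, valid_loss)
    return best_valid_loss

study = optuna.create_study(
    pruner=optuna.pruners.PatientPruner(None, patience=10),
    direction="minimize",
)
study.optimize(objective, n_trials=5)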

Logistic regression cost function returning nan

I learnt logistic regression recently, and I wanted to practice it. I am currently using this dataset from Kaggle. I tried to define a cost function in this manner (I made all necessary imports):
# Defining the hypothesis
sigmoid = lambda x: 1 / (1 + np.exp(-x))
predict = lambda trainset, parameters: sigmoid(trainset @ parameters)

# Defining the cost
def cost(theta):
    # print(X.shape, y.shape, theta.shape)
    preds = predict(X, theta.T)
    errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
    return np.mean(errors)

theta = []
for i in range(13):
    theta.append(1)
theta = np.array([theta])
cost(theta)
and when I run this cell I get:
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log
if __name__ == '__main__':
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in multiply
if __name__ == '__main__':
nan
When I searched online, I got the advice to normalise the data and then try it. So this is how I did it:
df = pd.read_csv("/home/jovyan/work/heart.csv")
df.head()
# The dataset is 303x14 in size (using df.shape)
length = df.shape[0]
# Output vector
y = df['target'].values
y = np.array([y]).T
# We name trainingset as X for convenience
trainingset = df.drop(['target'], axis = 1)
#trainingset = df.insert(0, 'bias', 1)
minmax_normal_trainset = (trainingset - trainingset.min())/(trainingset.max() - trainingset.min())
X = trainingset.values
I really don't know where the division by zero error is occurring and how to fix it. If I made any mistakes in this implementation please correct me. I am sorry if this has been asked before, but all I could find was the tip to normalise the data. Thanks in advance!
np.log(0) raises a divide by zero error. So it's this part that's causing the problems:
errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
(specifically the np.log(preds) and np.log(1 - preds) terms)
preds can be 0 or 1 when the absolute value of x is greater than 709 (because of floating point math, at least on my machine), which is why normalizing x to be between 0 and 1 solves the problem.
EDIT:
You may want to normalize to a larger range than (0, 1) - your sigmoid function as currently set is pretty much linear in that range. Maybe use:
minmax_normal_trainset = c * (trainingset - trainingset.mean()) / trainingset.std()
And tune c for better convergence.
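Another common workaround, independent of normalization (my own addition, not from the answer above), is to clip the predictions away from exactly 0 and 1 before taking the log, so the cost never evaluates log(0):

import numpy as np

def safe_cost(theta, X, y, eps=1e-12):
    preds = 1 / (1 + np.exp(-(X @ theta.T)))
    # Clip so np.log never sees exactly 0 or 1.
    preds = np.clip(preds, eps, 1 - eps)
    errors = -y * np.log(preds) - (1 - y) * np.log(1 - preds)
    return np.mean(errors)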

Binomial Distribution using scipy.stats package

In each of 4 different competitions, Jin has a 60% chance of winning. Assuming that the competitions are independent of each other, what is the probability that Jin will win at least 1 race?
Binomial Distribution Parameters:
n=4
p=0.60
Display the probability in decimal.
Hint:
P(x>=1)=1-P(x=0)
Use the binom.pmf() function of scipy.stats package to calculate the probability.
#n=4
#p=0.60
#k=1
from scipy import stats
probability=stats.binom.pmf(1,4,0.60)
print(probability)
#0.15360000000000007
What should the value of k be here? My output is not correct.
I will first explain the solution in mathematical terms:
The probability that Jin will win at least 1 race = 1 - the probability that Jin will win no race.
In each of the 4 races Jin has a 60 percent chance of winning, which means he has a 40 percent chance of losing.
If the probability of success on an individual trial is p, then the binomial probability of x successes in n repeated trials is C(n, x) · p^x · (1 − p)^(n − x).
Hence,
the probability that Jin will win no race out of the 4 races = C(4, 0) × 0.6^0 × 0.4^4 = 0.0256
Hence, the probability that Jin will win at least 1 race = 1 - 0.0256 = 0.9744
The Code:
from scipy import stats

def binomial():
    ans = 1 - round(stats.binom.pmf(0, 4, 0.6), 2)
    return ans

if __name__ == '__main__':
    print(binomial())
# n=4
# p=0.60
# k=1
from scipy import stats

# P(x>=1) = 1 - P(x=0), so first find the probability with k=0
probability = stats.binom.pmf(0, 4, 0.60)
# then do 1 - probability
actual_probability = 1 - probability
print(actual_probability)
from scipy import stats
from scipy.stats import binom

# Option 1: P(X >= 1) = 1 - P(X = 0)
def binomial():
    n = 4
    p = 0.6
    k = 0
    prob = binom.pmf(k, n, p)
    ans = round(1 - prob, 2)  # round off to 2 decimal places
    return ans

# Option 2: sum P(X = k) for k = 1..4 (this definition overrides the one above)
def binomial():
    li = [1, 2, 3, 4]
    lis = [stats.binom.pmf(k, 4, 0.6) for k in li]
    an = sum(lis)
    ans = round(an, 2)
    return ans

if __name__ == '__main__':
    print(binomial())
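As a side note (not part of the answers above), scipy.stats also exposes the survival function and the CDF, which give P(X >= 1) directly and are equivalent to the 1 - pmf(0, n, p) approach:

from scipy import stats

# P(X >= 1) = P(X > 0), i.e. the survival function evaluated at 0
print(stats.binom.sf(0, 4, 0.6))       # ~0.9744
print(1 - stats.binom.cdf(0, 4, 0.6))  # same value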

Gradient descent cost increases at each iteration in linear regression with one feature

Hi, I am learning some machine learning algorithms and, for the sake of understanding, I was trying to implement a linear regression algorithm with one feature, using the residual sum of squares (RSS) as the cost function for the gradient descent method, as below:
My pseudocode:
while not converged:
    w <- w - step * gradient
Python code (Linear.py):
import math
import numpy as num

def get_regression_predictions(input_feature, intercept, slope):
    predicted_output = [intercept + xi * slope for xi in input_feature]
    return predicted_output

def rss(input_feature, output, intercept, slope):
    return sum([(output.iloc[i] - (intercept + slope * input_feature.iloc[i]))**2 for i in range(len(output))])

def train(input_feature, output, intercept, slope):
    file = open("train.csv", "w")
    file.write("ID,intercept,slope,RSS\n")
    i = 0
    while True:
        print("RSS:", rss(input_feature, output, intercept, slope))
        file.write(str(i) + "," + str(intercept) + "," + str(slope) + "," + str(rss(input_feature, output, intercept, slope)) + "\n")
        i += 1
        gradient = [derivative(input_feature, output, intercept, slope, n) for n in range(0, 2)]
        step = 0.05
        intercept -= step * gradient[0]
        slope -= step * gradient[1]
    return intercept, slope

def derivative(input_feature, output, intercept, slope, n):
    if n == 0:
        return sum([-2 * (output.iloc[i] - (intercept + slope * input_feature.iloc[i])) for i in range(0, len(output))])
    return sum([-2 * (output.iloc[i] - (intercept + slope * input_feature.iloc[i])) * input_feature.iloc[i] for i in range(0, len(output))])
With the main program:
import Linear as lin
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("test2.csv")
train = df
lin.train(train["X"],train["Y"], 0, 0)
The test2.csv:
X,Y
0,1
1,3
2,7
3,13
4,21
I logged the value of RSS to a file and noticed that it got worse at each iteration, as follows:
ID,intercept,slope,RSS
0,0,0,669
1,4.5,14.0,3585.25
2,-7.25,-18.5,19714.3125
3,19.375,58.25,108855.953125
Mathematically I think it doesn't make any sense. I have reviewed my own code many times and I think it is correct. Am I doing something else wrong?
If your cost isn't decreasing, that's usually a sign you're overshooting with your gradient descent approach, meaning too large of a step size.
A smaller step size can help. You can also look into methods for variable step sizes, which can change each iteration to get you nice convergence properties and speed; usually, these methods change the step size with some proportionality to the gradient. Of course, the specifics depend on each problem.
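As an illustration (my own sketch, not the original code), the same sum-of-squares gradient converges on the posted data once the step size is small enough:

import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1, 3, 7, 13, 21], dtype=float)

intercept, slope = 0.0, 0.0
step = 0.01  # step = 0.05 overshoots on this data and the RSS blows up
for _ in range(1000):
    residuals = y - (intercept + slope * x)
    intercept -= step * (-2 * residuals.sum())
    slope -= step * (-2 * (residuals * x).sum())

# Converges toward the least-squares fit (intercept ~ -1, slope ~ 5).
print(intercept, slope, ((y - (intercept + slope * x)) ** 2).sum())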
