NaN loss from Keras Sequential Model Training - python-3.x

I have a TensorFlow Sequential network which consistently returns a loss value of NaN during training.
I am using pandas and Keras.
An example of the data is:
Actual_GP1 Budgeted_GP_Value_Cleanup Budgeted_GP_Value_New \
0 2.0 2.0 95.00
1 2.0 2.0 63684.55
3 2.0 2.0 26022.57
4 2.0 2.0 440759.17
6 2.0 2.0 95.00
7 2.0 2.0 3519120.00
9 2.0 2.0 4.00
12 2.0 2.0 4.00
13 2.0 2.0 355960.00
14 2.0 2.0 62745.00
Costing_Date Created_Time Date_Time_16 Delivery_Date Engineering_Date \
0 4 1.579523 4.0 4.0 4
1 4 1.575390 4.0 4.0 4
3 4 1.575471 4.0 4.0 4
4 4 1.575020 4.0 4.0 4
6 4 1.579508 4.0 4.0 4
7 4 1.578304 4.0 4.0 4
9 4 1.574600 4.0 4.0 4
12 4 1.570805 4.0 4.0 4
13 4 1.573831 4.0 4.0 4
14 4 1.576153 4.0 4.0 4
Exchange_Rate GP ... Last_Activity_Time Modified_Time \
0 2.0 100.0 ... 4.000000 1.579523
1 2.0 30.0 ... 1.579519 1.579519
3 2.0 44.0 ... 1.579516 1.579516
4 2.0 37.0 ... 1.579516 1.579516
6 2.0 100.0 ... 4.000000 1.579508
7 2.0 44.0 ... 1.579507 1.579507
9 2.0 100.0 ... 1.579506 1.579506
12 2.0 32.0 ... 1.579506 1.579506
13 2.0 44.0 ... 1.579506 1.579506
14 2.0 44.5 ... 1.579506 1.579506
Next_step_actioned_by PO_Date PO_Week Production_End_Date \
0 4.0 1.580429 4.000000 4
1 4.0 1.579824 1.579478 4
3 4.0 1.575850 1.575850 4
4 4.0 1.575418 1.575245 4
6 4.0 1.580429 4.000000 4
7 4.0 1.583798 1.583798 4
9 4.0 1.579219 1.578874 4
12 4.0 1.580429 1.580083 4
13 4.0 1.585613 1.585526 4
14 4.0 1.580429 1.580083 4
Production_Start_Date Project_Value Prototype_Date \
0 4 95.00 4
1 4 212281.82 4
3 4 3.00 4
4 4 4.00 4
6 4 95.00 4
7 4 7998000.00 4
9 4 4.00 4
12 4 4.00 4
13 4 809000.00 4
14 4 141000.00 4
Revenue_Forecast_Probability_Weighting
0 1.0
1 2.0
3 3.0
4 4.0
6 1.0
7 5.0
9 4.0
12 4.0
13 7.0
14 8.0
I understand some of the dates in this sample are categorically labelled, but that is due to missing values.
The target value for this model is a probability of success, which is based on historical data, and I have left it out of this question. It is a value in [0, 100].
The network configuration is:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
train_dataset = dataset.shuffle(len(df)).batch(1)
print(df.shape)

def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(24, activation='relu', input_shape=(df.shape[-1],)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

model = get_compiled_model()
model.fit(train_dataset, epochs=20)
model.save("keras_saved_model.h5")
with an output of
(574, 24)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 574 steps
Epoch 1/20
574/574 [==============================] - 2s 3ms/step - loss: nan - acc: 0.3275
Epoch 2/20
574/574 [==============================] - 1s 1ms/step - loss: nan - acc: 0.6655
Epoch 3/20
574/574 [==============================] - 1s 1ms/step - loss: nan - acc: 0.6655
Epoch 4/20
574/574 [==============================] - 1s 1ms/step - loss: nan - acc: 0.6655
Epoch 5/20
574/574 [==============================] - 1s 1ms/step - loss: nan - acc: 0.6655
Epoch 7/20
574/574 [==============================] - 1s 1ms/step - loss: nan - acc: 0.6655
and so on.
Could someone please point me in the right direction regarding this constant accuracy and these NaN loss values?
EDIT:
The solution was to divide the target value by 100 so it fits in the range [0, 1], since the final layer's activation is a sigmoid.
Thanks to Matias Valdenegro for pointing this out.
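For reference, a minimal sketch of that fix (assuming target is the pandas Series of success probabilities in [0, 100] described above):
# Rescale the target into [0, 1] so it matches the sigmoid output of the final layer.
target = target / 100.0
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
train_dataset = dataset.shuffle(len(df)).batch(1)
model = get_compiled_model()
model.fit(train_dataset, epochs=20)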

Providing the answer here for the community, even though it was given in the comments.
Since the target value ranges over [0, 100], the user normalized it by dividing by 100 so that it matches the sigmoid activation of the output layer, which resolved the issue.
You can apply a similar normalization to an input feature using the code below.
To get the min and max values of a numerical column and use them to scale it:
def _min_max_params(column):
    min_value = traindf[column].min()
    max_value = traindf[column].max()
    return {'min': min_value, 'max': max_value}

def min_max_scale(col, column_name):
    params = _min_max_params(column_name)
    return (col - params['min']) / (params['max'] - params['min'])

feature_name = 'column_name_to_normalize'
normalized_feature = tf.feature_column.numeric_column(
    feature_name,
    normalizer_fn=lambda col: min_max_scale(col, feature_name))
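A possible usage sketch (my assumption, not part of the original answer): if the training data is fed as a dictionary of features keyed by column name, the normalized column can be consumed through a DenseFeatures layer, which applies normalizer_fn whenever the layer is evaluated:
feature_columns = [normalized_feature]
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])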

Related

Finding the Max of Excel Matrix Data based on Criteria from Matrix

I have data in a matrix and I also have criteria data in a matrix, as shown below.
Data from the Matrix
Period    0.0    30     45     60     75     90     105    120    135    150    180
6.0     0.356  0.443  0.469  0.505  0.579  0.525  0.516  0.475  0.342  0.271  0.171
7.0     0.439  0.541  0.558  0.678  0.802  0.642  0.747  0.499  0.436  0.336  0.232
8.0     0.505  0.544  0.591  0.694  0.759  0.747  0.736  0.584  0.560  0.467  0.269
9.0     0.489  0.614  0.618  0.630  0.791  0.687  0.631  0.577  0.507  0.562  0.340
10.0    0.538  0.603  0.572  0.580  0.703  0.643  0.619  0.556  0.489  0.459  0.399
11.0    0.503  0.491  0.513  0.578  0.585  0.630  0.587  0.542  0.439  0.459  0.345
12.0    0.517  0.446  0.539  0.588  0.546  0.564  0.552  0.497  0.411  0.412  0.355
13.0    0.470  0.439  0.545  0.534  0.530  0.482  0.510  0.470  0.422  0.404  0.329
14.0    0.399  0.427  0.469  0.442  0.462  0.434  0.409  0.425  0.382  0.395  0.340
15.0    0.370  0.390  0.388  0.397  0.421  0.393  0.355  0.387  0.355  0.341  0.331
Criteria for the matrix
Period   0.0   30   45   60   75   90  105  120  135  150  180
6.0        3    5    5    6    7    6    6    5    3    2    0
7.0        5    6    7    9   10    8   10    6    5    3    1
8.0        6    6    7    9   10   10    9    7    7    5    2
9.0        6    8    8    8   10    9    8    7    6    7    3
10.0       6    7    7    7    9    8    8    7    6    5    4
11.0       6    6    6    7    7    8    7    6    5    5    3
12.0       6    5    6    7    6    7    7    6    4    4    3
13.0       5    5    6    6    6    5    6    5    4    4    3
14.0       4    5    5    5    5    5    4    5    4    4    3
15.0       4    4    4    4    4    4    3    4    3    3    3
Is there any way to find, for a given number such as 3 or 10 in the criteria matrix, the maximum value from the data matrix, taking the candidate values from the data matrix at the positions holding that number in the criteria matrix?
So from the above, for No 10 the maximum should be taken from the data matrix at [7, 75], [7, 105], [8, 75], [8, 90] or [9, 75].
I am expecting an Excel function or VBA to find the max of those numbers.
Thanks a lot for your help and thought on it.
Excel Function or Excel VBA
Assume tables start (with header row and column) in cell A1 of two sheets named Criteria and Data:
=SUMPRODUCT(MAX((Criteria!B2:L11=10)*(Data!B2:L11)))
Max in Matrix Using Criteria Matrix
If you have Microsoft 365 and if the criteria are in the range N2:N12, in cell O2 of sheet Criteria you could use:
=MAX(TOCOL(($B$2:$L$11=N2)*Data!$B$2:$L$11))
or (the same idea, but using the LET function)
=LET(tCriteria,$B$2:$L$11,tData,Data!$B$2:$L$11,Criteria,N2,
MAX(TOCOL((tCriteria=Criteria)*tData)))
used in cell P2 of the screenshot, and copy down.
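If a pandas route is also acceptable, a minimal sketch of the same lookup (assuming the two tables have been read into DataFrames named data and criteria, with Period as the index and the angle columns as headers):
# Maximum of the data-matrix cells whose positions hold 10 in the criteria matrix.
max_for_10 = data.values[criteria.values == 10].max()
print(max_for_10)  # 0.802 for the sample above ([7, 75] is the largest matching cell)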

Comparing one month's value of the current year with previous year values, adding or subtracting multiple parameters

Given the following dataframe df:
date mom_pct
0 2020-1-31 1.4
1 2020-2-29 0.8
2 2020-3-31 -1.2
3 2020-4-30 -0.9
4 2020-5-31 -0.8
5 2020-6-30 -0.1
6 2020-7-31 0.6
7 2020-8-31 0.4
8 2020-9-30 0.2
9 2020-10-31 -0.3
10 2020-11-30 -0.6
11 2020-12-31 0.7
12 2021-1-31 1.0
13 2021-2-28 0.6
14 2021-3-31 -0.5
15 2021-4-30 -0.3
16 2021-5-31 -0.2
17 2021-6-30 -0.4
18 2021-7-31 0.3
19 2021-8-31 0.1
20 2021-9-30 0.0
21 2021-10-31 0.7
22 2021-11-30 0.4
23 2021-12-31 -0.3
24 2022-1-31 0.4
25 2022-2-28 0.6
26 2022-3-31 0.0
27 2022-4-30 0.4
28 2022-5-31 -0.2
I want to compare the month-over-month value of a month in the current year with the value of the same month in the previous year. Assume the value of the same period last year is y_t-1 and the current value of this year is y_t. I will create a new column according to the following rules:
If y_t = y_t-1, returns 0 for new column;
If y_t ∈ (y_t-1, y_t-1 + 0.3], returns 1;
If y_t ∈ (y_t-1 + 0.3, y_t-1 + 0.5], returns 2;
If y_t > (y_t-1 + 0.5), returns 3;
If y_t ∈ [y_t-1 - 0.3, y_t-1), returns -1;
If y_t ∈ [y_t-1 - 0.5, y_t-1 - 0.3), returns -2;
If y_t < (y_t-1 - 0.5), returns -3
The expected result:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
I attempted to create multiple columns holding the range bounds and then check which range mom_pct falls into. Is it possible to do this in a more efficient way? Thanks.
df1['mom_pct_zero'] = df1['mom_pct'].shift(12)
df1['mom_pct_pos1'] = df1['mom_pct'].shift(12) + 0.3
df1['mom_pct_pos2'] = df1['mom_pct'].shift(12) + 0.5
df1['mom_pct_neg1'] = df1['mom_pct'].shift(12) - 0.3
df1['mom_pct_neg2'] = df1['mom_pct'].shift(12) - 0.5
I would do it as follows:
import numpy as np

def categorize(v):
    if np.isnan(v) or v == 0.:
        return v
    sign = -1 if v < 0 else 1
    eps = 1e-10
    if abs(v) <= 0.3 + eps:
        return sign * 1
    if abs(v) <= 0.5 + eps:
        return sign * 2
    return sign * 3

df['categorial_mom_pct'] = df['mom_pct'].diff(12).map(categorize)
print(df)
Note that I added a very small eps to the thresholds to counter precision issues with floating-point arithmetic:
abs(-0.3) <= 0.3 # True
abs(-0.4 + 0.1) <= 0.3 # False
abs(-0.4 + 0.1) <= 0.3 + 1e-10 # True
Out:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
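For comparison, a sketch that spells out the original rules directly with np.select over the same 12-month difference (keeping the eps guard; this is only an alternative reading of the rules, not a replacement for the categorize approach above):
import numpy as np
eps = 1e-10
d = df['mom_pct'].diff(12)
conditions = [
    d == 0,
    (d > 0) & (d <= 0.3 + eps),
    (d > 0.3 + eps) & (d <= 0.5 + eps),
    d > 0.5 + eps,
    (d < 0) & (d >= -0.3 - eps),
    (d < -0.3 - eps) & (d >= -0.5 - eps),
    d < -0.5 - eps,
]
choices = [0, 1, 2, 3, -1, -2, -3]
# Rows with no prior-year value stay NaN via the default.
df['categorial_mom_pct'] = np.select(conditions, choices, default=np.nan)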

Creating multiple cohorts from the pivot table

I have a requirement as described below.
The initial information is a list of gross adds:
201910  201911  201912  202001  202002
 20000   30000   32000   40000   36000
I have a pivot table as below.
201910  201911  201912  202001  202002
  1000    2000    2400    3200    1800
   500     400     300     200     nan
   200     150     100     nan     nan
   200     100     nan     nan     nan
   160     nan     nan     nan     nan
The report needs to be generated like below.
Cohort01:  5%  3%  3%  1%  1%  1%
From Cohort02 onwards, the missing value is filled with the average of the last values of the preceding cohorts.
Similarly, for Cohort03 both nan values are filled with the average of the corresponding values of Cohort01 and Cohort02.
Again, when calculating Cohort04, the average of the previous two cohorts (the Cohort02 and Cohort03 values) is used to fill all three nan values.
Can anyone provide a solution for this in Python?
The report should be generated as below.
All cohorts should be created separately.
You could try it like this:
res = df.apply(lambda x: round(100/(df_gross.iloc[0]/x),1),axis=1)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.5 8.0 5.0
1 2.5 1.3 0.9 0.5 NaN
2 1.0 0.5 0.3 NaN NaN
3 1.0 0.3 NaN NaN NaN
4 0.8 NaN NaN NaN NaN
for idx, col in enumerate(res.columns[1:], 1):
    res[col] = res[col].fillna((res.iloc[:, max(idx-2, 0)] + res.iloc[:, idx-1]) / 2)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.50 8.000 5.0000
1 2.5 1.3 0.90 0.500 0.7000
2 1.0 0.5 0.30 0.400 0.3500
3 1.0 0.3 0.65 0.475 0.5625
4 0.8 0.8 0.80 0.800 0.8000
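For completeness, a sketch of the inputs assumed above (df_gross holding the gross adds and df holding the pivot table, both keyed by the month columns from the question):
import numpy as np
import pandas as pd

cols = ['201910', '201911', '201912', '202001', '202002']
df_gross = pd.DataFrame([[20000, 30000, 32000, 40000, 36000]], columns=cols)
df = pd.DataFrame([
    [1000, 2000, 2400, 3200, 1800],
    [500, 400, 300, 200, np.nan],
    [200, 150, 100, np.nan, np.nan],
    [200, 100, np.nan, np.nan, np.nan],
    [160, np.nan, np.nan, np.nan, np.nan],
], columns=cols)
# Running the two steps above on these frames reproduces the printed output.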

Replacing values in a string with NaN

Faced a simple task, but I cannot solve it. There is a table in df:
Date X1 X2
02.03.2019 2 2
03.03.2019 1 1
04.03.2019 2 3
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
And for rows where Date < 05.03.2019 I need to set X1=NaN and X2=NaN:
Date X1 X2
02.03.2019 NaN NaN
03.03.2019 NaN NaN
04.03.2019 NaN NaN
05.03.2019 1 12
06.03.2019 2 2
07.03.2019 3 3
08.03.2019 4 1
09.03.2019 1 2
First convert column Date to datetimes and then set values by DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
df.loc[df['Date'] < '2019-03-05', ['X1','X2']] = np.nan
print (df)
Date X1 X2
0 2019-03-02 NaN NaN
1 2019-03-03 NaN NaN
2 2019-03-04 NaN NaN
3 2019-03-05 1.0 12.0
4 2019-03-06 2.0 2.0
5 2019-03-07 3.0 3.0
6 2019-03-08 4.0 1.0
7 2019-03-09 1.0 2.0
If there is DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
#change datetime to 2019-03-04
df.loc[:'2019-03-04'] = np.nan
print (df)
X1 X2
Date
2019-03-02 NaN NaN
2019-03-03 NaN NaN
2019-03-04 NaN NaN
2019-03-05 1.0 12.0
2019-03-06 2.0 2.0
2019-03-07 3.0 3.0
2019-03-08 4.0 1.0
2019-03-09 1.0 2.0
Or:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y')
df.loc[df.index < '2019-03-05'] = np.nan
Don't use this solution; it is just another possible approach (-: (this will affect all columns)
df.mask(df.Date < '05.03.2019').combine_first(df[['Date']])
Date X1 X2
0 02.03.2019 NaN NaN
1 03.03.2019 NaN NaN
2 04.03.2019 NaN NaN
3 05.03.2019 1.0 12.0
4 06.03.2019 2.0 2.0
5 07.03.2019 3.0 3.0
6 08.03.2019 4.0 1.0
7 09.03.2019 1.0 2.0

Pandas merge two dataframes by taking the mean between the columns

I am working with pandas DataFrames and looking to take the mean of two of them, pairing columns with the same names.
For example
df1
time x y z
0 1 1.25 2.5 0.75
1 2 2.75 2.5 3.00
2 3 1.50 2.5 1.25
3 4 3.00 2.5 3.50
4 5 0.50 2.5 2.25
df2
time x y z
0 2 0.75 2.5 1.75
1 3 3.00 2.5 3.00
2 4 1.25 2.5 0.25
3 5 3.50 2.5 2.00
4 6 2.25 2.5 2.25
and the result I am looking for is
df3
time x y z
0 1 1.25 2.5 0.75
1 2 1.75 2.5 2.375
2 3 2.25 2.5 2.125
3 4 2.125 2.5 1.875
4 5 2.00 2.5 2.125
5 6 2.25 2.5 2.25
Is there a simple way in pandas to do this, using the merge function or similar?
I am looking for a way of doing it without having to specify the column names.
I think you need concat + groupby and aggregate mean:
df = pd.concat([df1, df2]).groupby('time', as_index=False).mean()
print (df)
time x y z
0 1 1.250 2.5 0.750
1 2 1.750 2.5 2.375
2 3 2.250 2.5 2.125
3 4 2.125 2.5 1.875
4 5 2.000 2.5 2.125
5 6 2.250 2.5 2.250
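If you also want to avoid hard-coding the key column name, a small sketch (assuming, as in the example, the first column of df1 is the join key):
key = df1.columns[0]
df = pd.concat([df1, df2]).groupby(key, as_index=False).mean()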
