Trapezoidal Kernel in python - python-3.x

I wanted to implement a trapezoidal kernel in Python (probably using NumPy or SciPy) for convolution, just like the Trapezoid1DKernel that comes in the astropy module. I have tried convolving with a trapezoidal waveform, but the results were not satisfactory.
import numpy as np
from scipy import signal

def trapzoid_signal(t, width=2., slope=1., amp=1., offs=0):
    # build a trapezoid from a triangle wave: scale it, shift it up, then clip at amp
    trasig = slope*width*signal.sawtooth(2*np.pi*t/width, width=0.5)/4.
    trasig += slope*width/4.
    trasig[trasig > amp] = amp
    return trasig + offs

t = np.linspace(0, 32, 34)
trasig = trapzoid_signal(t, width=32, slope=1, amp=0.0322)
print(trasig)
z = signal.convolve(trasig, new)  # new is the other input signal, defined elsewhere
If I print z it gives:
[ nan nan nan ..., nan nan nan]
I tried plotting z, but it shows nothing. Any help?

Eureka! I did it. The reason z printed as [ nan nan nan ..., nan nan nan] and would not plot was the NaN values at the edges of new. I removed them with the following code, which I found on StackOverflow itself:
ind = np.where(~np.isnan(new))[0]
first, last = ind[0], ind[-1]
new[:first] = new[first]    # back-fill leading NaNs with the first valid value
new[last + 1:] = new[last]  # forward-fill trailing NaNs with the last valid value
That solved my problem: I not only got the values of z but also got my plot. Thanks to stackoverflow.com.
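For reference, here is a minimal sketch of building the kernel directly with astropy instead of hand-rolling the waveform (this assumes astropy is installed; the width, slope, and data values are illustrative, not from the question). Astropy's convolve also treats NaN entries as missing data and interpolates over them by default, which sidesteps the edge-filling workaround above:
import numpy as np
from astropy.convolution import Trapezoid1DKernel, convolve

kernel = Trapezoid1DKernel(width=10, slope=1)  # trapezoid with a flat top and linear sides
data = np.random.rand(100)                     # stand-in for the signal to smooth
smoothed = convolve(data, kernel)              # NaNs in data would be interpolated over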

Related

Stratifying folds with StratifiedKFold in sklearn

I do not quite understand the logic behind sklearn's train_test_split and StratifiedKFold when it comes to obtaining splits that are balanced according to multiple "columns" and not only according to the target distribution. I know the previous sentence is a bit obscure, so I hope the following code helps.
import numpy as np
import pandas as pd
import random
n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos
target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)
ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())
This is a 100-example dataset whose target distribution is governed by prob; in this case we have 20% positive examples. There is a binary categorical column cat, perfectly balanced. The output of the previous code is:
target cat f1 f2
0 0 a 0.970585 0.134268
1 0 a 0.410689 0.225524
2 0 a 0.638111 0.273830
3 0 b 0.594726 0.579668
4 0 a 0.737440 0.667996
With train_test_split(), stratifying on both target and cat, we can study the frequencies:
from sklearn.model_selection import train_test_split, StratifiedKFold

# with train_test_split
training, valid = train_test_split(range(n_samples),
                                   test_size=20,
                                   stratify=ds[["target", "cat"]])
print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training))  # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid))  # balanced
we get this:
* dataset
0 0.8
1 0.2
Name: target, dtype: float64
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
---
* training
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
* validation
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
It is perfectly stratified.
Now with StratifiedKFold:
# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except:
    print("! does not work")
for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")
output:
! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating
How do I obtain with StratifiedKFold what I got with train_test_split? I know there might be data distributions that do not allow such stratification in k-fold cross-validation, but I cannot understand why train_test_split accepts two or more columns while the other method does not.
This doesn't seem readily possible currently.
Multilabel isn't exactly what you're looking for, but it is related. That has been asked here before, and was an issue on sklearn's GitHub (not sure why it got closed).
As a bit of a hack, you should be able to just combine your two columns into a new one holding the ordered pairs, and stratify on that.
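A minimal sketch of that hack (the strat_key column name is my own; the snippet reuses the ds DataFrame built in the question):
from sklearn.model_selection import StratifiedKFold

# combine target and cat into one composite key, e.g. "1_a", "0_b", ...
ds["strat_key"] = ds["target"].astype(str) + "_" + ds["cat"].astype(str)

skf = StratifiedKFold(n_splits=5)
for train, valid in skf.split(X=ds[["f1", "f2"]], y=ds["strat_key"]):
    # each fold now approximately preserves the joint target/cat distribution
    print(ds.iloc[valid][["target", "cat"]].value_counts() / len(valid))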

How to implement dynamic parameter estimation with missing data in Gekko?

Going back and forth through the documentation, I was able to set up a dynamic parameter estimation in Gekko.
Here's the code, with measurement values shown below (the file is named MeasuredAlgebrProductionRate_30min_18h.csv on my system and uses ';' as separator):
import numpy as np
import matplotlib.pyplot as plt
from gekko import GEKKO
#%% Read measurement data from CSV file
t_x_q_obs = np.genfromtxt('MeasuredAlgebrProductionRate_30min_18h.csv', delimiter=';')
#t_obs, x_obs, q_obs = t_xq_obs[:,0:3]
#%% Initialize Model
m = GEKKO(remote=False)
m.time = t_x_q_obs[:,0] #np.arange(0, 18/24+1e-6, 1/2*1/24)
# Declare parameter
V_liq = m.Param(value = 159.0)
# Declare FVs
k_1 = m.FV(value = 0.80)
k_1.STATUS = 1
f_1 = m.FV(value = 10.0)
f_1.STATUS = 1
# Diff. Variables
X = m.Var(value = 80.0) # at t=0
Y = m.Var(value = 80.0*0.2)
rho_1 = m.Intermediate(k_1*X)
#q_prod = m.Intermediate(0.52*f_1*X/24)
#X = m.CV(value = t_x_q_obs[:,1])
q_prod = m.CV(value = t_x_q_obs[:,2])
#%% Equations
m.Equations([X.dt() == -rho_1, Y.dt() == 0, q_prod == 0.52*f_1*X/24])
m.options.IMODE = 5
m.solve(disp=False)
#%% Plot some results
plt.plot(m.time, np.array(X.value)/10, label='X')
plt.plot(t_x_q_obs[:,0], t_x_q_obs[:,2], label='q_prod Meas.')
plt.plot(m.time, q_prod.value, label='q_prod Sim.')
plt.xlabel('time')
plt.ylabel('X / q_prod')
plt.grid()
plt.legend(loc='best')
plt.show()
0.0208333333 NaN 30.8306036
0.0416666667 NaN 29.1200832
0.0625 74.866 28.7700549
0.0833333333 NaN 29.2318865
0.104166667 NaN 30.7727362
0.125 NaN 29.8743804
0.145833333 NaN 29.9923447
0.166666667 NaN 30.9169679
0.1875 NaN 28.5956184
0.208333333 NaN 27.7361632
0.229166667 NaN 26.6669496
0.25 NaN 27.17477
0.270833333 75.751 23.6270346
0.291666667 NaN 23.0646928
0.3125 NaN 23.6442113
0.333333333 NaN 23.089118
0.354166667 NaN 22.9101616
0.375 NaN 22.7453854
0.395833333 NaN 23.2182759
0.416666667 NaN 21.4901903
0.4375 NaN 21.1449899
0.458333333 NaN 20.7093537
0.479166667 NaN 20.3109086
0.5 NaN 20.6825141
0.520833333 NaN 19.199583
0.541666667 NaN 19.6173416
0.5625 NaN 19.5543139
0.583333333 NaN 20.4501879
0.604166667 NaN 18.7678061
0.625 NaN 18.4629262
0.645833333 NaN 18.3730322
0.666666667 NaN 19.5375442
0.6875 NaN 18.1975297
0.708333333 NaN 18.0370627
0.729166667 NaN 17.5734727
0.75 NaN 18.8632046
So far, so good. Suppose I also have measurements of X (second column) at some time points (first column); the rest are not available (therefore NaN).
I would like to adjust k_1 and f_1, so that simulated and observed variables X and q_prod match as closely as possible.
Is this feasible with Gekko? If so, how?
Another question: Gekko throws an error if m.time has more elements than there are time points of observed variables. However, my initial values of X and Y refer to t=0, not t=0.0208333333; hence the commented-out part after m.time =, see above. (Measurements at t=0 are not available.) Do initial conditions in Gekko refer to the first element of m.time, as they do in Matlab, or to t=0?
If you have a missing measurement then you can include a non-numeric value such as NaN and Gekko ignores that entry in the objective function. Here is a test case with one NaN value in ym:
Nonlinear Regression with NaN Data Value
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
xm = np.array([0,1,2,3,4,5])
ym = np.array([0.1,0.2,np.nan,0.5,0.8,2.0])
m = GEKKO(remote=False)
x = m.Param(value=xm,name='x')
a = m.FV()
a.STATUS=1
y = m.CV(value=ym,name='y')
y.FSTATUS=1
m.Equation(y==0.1*m.exp(a*x))
m.options.IMODE = 2
m.options.SOLVER = 1
m.solve(disp=True)
print('Optimized, a = ' + str(a.value[0]))
plt.plot(xm,ym,'bo')
plt.plot(xm,y.value,'r-')
m.open_folder()
plt.show()
When you open the run folder with m.open_folder() and look at the data file gk_model0.csv, there is the NaN in the y value column.
y,x
0.1,0
0.2,1
nan,2
0.5,3
0.8,4
2.0,5
This is IMODE=2, so it is a steady-state regression problem, but it shows the same thing that happens with dynamic estimation problems. There is more information on the estimation objective function with m.options.EV_TYPE=1 (default) or m.options.EV_TYPE=2, and on how bad values in a data file are handled. When a measurement is a non-numeric value, that bad value is dropped from the objective function summation. Here is a version with a dynamic model:
Dynamic Regression with Fixed Initial Condition
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
xm = np.array([0,1,2,3,4,5])
ym = np.array([2.0,1.5,np.nan,2.2,3.0,5.0])
m = GEKKO(remote=False)
m.time = xm
a = m.FV(lb=0.1,ub=2.0)
a.STATUS=1
y = m.CV(value=ym,name='y',fixed_initial=False)
y.FSTATUS=1
m.Equation(y.dt()==a*y)
m.options.IMODE = 5
m.options.SOLVER = 1
m.solve(disp=True)
print('Optimized, a = ' + str(a.value[0]))
plt.figure(figsize=(6,2))
plt.plot(xm,ym,'bo',label='Meas')
plt.plot(xm,y.value,'r-',label='Pred')
plt.ylabel('y')
plt.ylim([0,6])
plt.legend()
plt.show()
As you observed, m.time needs to have the same length as your measurement values. If you are missing values, you can append a np.nan to the beginning of the data horizon. By default, Gekko uses the first value specified in the value property to set the initial condition. If you don't want Gekko to use that value, then set fixed_initial=False for your CV.
Dynamic Regression with Free Initial Condition
y = m.CV(value=ym,name='y',fixed_initial=False)
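Applied back to the original question, a hedged sketch (it reuses the question's variable names and the same technique, but has not been tested against that model): declare X as a CV as well and load its sparse measurement column, so that both X and q_prod enter the objective while their NaN entries are ignored:
# X measurements are in column 1 of the CSV (NaN where missing)
X = m.CV(value=t_x_q_obs[:,1])
X.FSTATUS = 1           # include the X measurements in the objective

q_prod = m.CV(value=t_x_q_obs[:,2])
q_prod.FSTATUS = 1      # include the q_prod measurements as well
With FSTATUS=1 on both CVs, solving with IMODE=5 adjusts k_1 and f_1 (the FVs with STATUS=1) to fit both measurement series at once.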

My input contains NaN, infinity, or a value too large for dtype

I am trying to run PCA on a dataset, but I am running into an issue involving NaN. I tried dropping multiple columns and changing the datatypes of my dataframe, but none of this worked.
A piece of my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

features = ['caloroies','protein','fat','sodium','fiber','carbo','sugars','potass','vitamins','shelf','weight','cups']
x = df.loc[:, features].values
y = df.loc[:, ['rating_bucketed']].values
x = StandardScaler().fit_transform(x)

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
The error I receive from this is as follows:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
If I check my x variable, I receive the following:
print(x)
[[ nan 4.00e+00 1.00e+00 1.30e+02 1.00e+01 5.00e+00 6.00e+00 2.80e+02
2.50e+01 3.00e+00 1.00e+00 3.30e-01]
[ nan 3.00e+00 5.00e+00 1.50e+01 2.00e+00 8.00e+00 8.00e+00 1.35e+02
0.00e+00 3.00e+00 1.00e+00 1.00e+00]
[ nan 4.00e+00 1.00e+00 2.60e+02 9.00e+00 7.00e+00 5.00e+00 3.20e+02
2.50e+01 3.00e+00 1.00e+00 3.30e-01]
[ nan 4.00e+00 0.00e+00 1.40e+02 1.40e+01 8.00e+00 0.00e+00 3.30e+02
2.50e+01 3.00e+00 1.00e+00 5.00e-01]
Just so you can have an idea of my starting dataset:
[screenshot of the starting dataset, omitted]
Python 3.7.1
Numpy 1.15.4
Pandas 0.23.4
Sklearn 0.20.1
Can anyone point me in the right direction as to where I am going wrong?
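A hedged debugging sketch (my own suggestion, not an accepted answer): the printed x shows an all-NaN first column, which lines up with the first requested feature. It is worth verifying that every feature name actually exists in df (in older pandas, df.loc with a misspelled label in a list could silently yield an all-NaN column instead of raising) and counting NaNs per column before scaling:
# check the requested feature names against the real columns
missing = [f for f in features if f not in df.columns]
print("features not found in df:", missing)

# count NaNs per existing feature before scaling
present = [f for f in features if f in df.columns]
print(df[present].isna().sum())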

How to ensure centroids of the clusters in the k-means algorithm don't switch every time?

I have a csv file which looks like below
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 244.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 127.45
... ...
... ...
Now I try to apply k-means to the mse values to get 2 clusters, which gives me 2 centroids, one for each. Given an mse value, I need to find which of the two centroids is nearer to it. I do something like this:
from sklearn.cluster import KMeans
import pandas as pd

centroid_list = []
result = []
given_mse = 7.382409087

kmeans = KMeans(n_clusters=2)
df = pd.read_csv("data.csv", parse_dates=["date"])
kmeans.fit_predict(df[['mse']])
centroid_list.append(kmeans.cluster_centers_.ravel())
#print(centroid_list) # [array([ 153.27996598, 19810.6925875 ])]

for i in centroid_list:
    t1 = abs(given_mse - i[0])
    t2 = abs(given_mse - i[1])
    if t1 < t2:
        result.append("label 1")
    else:
        result.append("label 2")
print(result) # ['label 1']
As you can see, I get two centroid values, 153.27996598 and 19810.6925875, one per cluster.
The problem is that their order often switches between runs [(x, y) or (y, x)], which is why the end result comes out as either label 1 or, at times, label 2.
Any idea how this can be fixed? Is there any scikit-learn technique to prevent this switching?
As mentioned by @Vivek Kumar, I needed to pass the additional parameter random_state when setting up the k-means. The value of random_state can be any integer.
kmeans = KMeans(n_clusters=2, random_state=1)
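A hedged alternative sketch (my own addition, not part of the accepted fix): random_state pins the initialization, but you can also make the labeling order-independent by sorting the centroids, so that "label 1" always refers to the smaller centroid no matter how the run went:
import numpy as np

centers = np.sort(kmeans.cluster_centers_.ravel())  # ascending: centers[0] < centers[1]
if abs(given_mse - centers[0]) < abs(given_mse - centers[1]):
    print("label 1")  # nearer the smaller centroid
else:
    print("label 2")  # nearer the larger centroid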

scikit-learn StratifiedShuffleSplit KeyError with index

This is my pandas dataframe lots_not_preprocessed_usd:
<class 'pandas.core.frame.DataFrame'>
Index: 78718 entries, 2017-09-12T18-38-38-076065 to 2017-10-02T07-29-40-245031
Data columns (total 20 columns):
created_year 78718 non-null float64
price 78718 non-null float64
........
decade 78718 non-null int64
dtypes: float64(8), int64(1), object(11)
memory usage: 12.6+ MB
head(1):
artist_name_normalized house created_year description exhibited_in exhibited_in_museums height images max_estimated_price min_estimated_price price provenance provenance_estate_of sale_date sale_id sale_title style title width decade
key
2017-09-12T18-38-38-076065 NaN c11 1862.0 An Album and a small Quantity of unframed Draw... NaN NaN NaN NaN 535.031166 267.515583 845.349242 NaN NaN 1998-06-21 8033 OILS, WATERCOLOURS & DRAWINGS FROM 18TH - 20TH... watercolor painting An Album and a small Quantity of unframed Draw... NaN 186
My script:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.loc[train_index]
    strat_test_set = lots_not_preprocessed_usd.loc[test_index]
I'm getting the error message
KeyError Traceback (most recent call last)
<ipython-input-224-cee2389254f2> in <module>()
3 split = StratifiedShuffleSplit(n_splits=1, test_size =0.2, random_state=42)
4 for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
----> 5 strat_train_set = lots_not_preprocessed_usd.loc[train_index]
6 strat_test_set = lots_not_preprocessed_usd.loc[test_index]
......
KeyError: 'None of [[32199 67509 69003 ..., 44204 2809 56726]] are in the [index]'
There seems to be a problem with my index (e.g. 2017-09-12T18-38-38-076065) which I don't understand. Where is the issue?
If I use another split it works as expected:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(lots_not_preprocessed_usd, test_size=0.2, random_state=42)
When you use .loc you need to pass labels that exist in the DataFrame's index, so use .iloc when you want an ordinary positional (integer) indexer instead of .loc. In the for loop, train_index and test_index are not datetime labels, since split.split(X, y) returns arrays of random positional indices.
...
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.iloc[train_index]
    strat_test_set = lots_not_preprocessed_usd.iloc[test_index]
Sample example:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

lots_not_preprocessed_usd = pd.DataFrame({'some': np.random.randint(5, 10, 100),
                                          'decade': np.random.randint(5, 10, 100)},
                                         index=pd.date_range('5-10-15', periods=100))
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.iloc[train_index]
    strat_test_set = lots_not_preprocessed_usd.iloc[test_index]
Sample output:
strat_train_set.head()
decade some
2015-08-02 6 7
2015-06-14 7 6
2015-08-14 7 9
2015-06-25 9 5
2015-05-15 7 9
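An alternative sketch (my own addition): if you would rather keep .loc, map the positional indices back to index labels first; lots_not_preprocessed_usd.index[train_index] converts positions into the datetime-like labels that .loc expects:
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    train_labels = lots_not_preprocessed_usd.index[train_index]
    test_labels = lots_not_preprocessed_usd.index[test_index]
    strat_train_set = lots_not_preprocessed_usd.loc[train_labels]
    strat_test_set = lots_not_preprocessed_usd.loc[test_labels]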
