scikit-learn StratifiedShuffleSplit KeyError with index - python-3.x

This is my pandas dataframe lots_not_preprocessed_usd:
<class 'pandas.core.frame.DataFrame'>
Index: 78718 entries, 2017-09-12T18-38-38-076065 to 2017-10-02T07-29-40-245031
Data columns (total 20 columns):
created_year 78718 non-null float64
price 78718 non-null float64
........
decade 78718 non-null int64
dtypes: float64(8), int64(1), object(11)
memory usage: 12.6+ MB
head(1), transposed for readability:
key                       2017-09-12T18-38-38-076065
artist_name_normalized    NaN
house                     c11
created_year              1862.0
description               An Album and a small Quantity of unframed Draw...
exhibited_in              NaN
exhibited_in_museums      NaN
height                    NaN
images                    NaN
max_estimated_price       535.031166
min_estimated_price       267.515583
price                     845.349242
provenance                NaN
provenance_estate_of      NaN
sale_date                 1998-06-21
sale_id                   8033
sale_title                OILS, WATERCOLOURS & DRAWINGS FROM 18TH - 20TH...
style                     watercolor painting
title                     An Album and a small Quantity of unframed Draw...
width                     NaN
decade                    186
My script:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.loc[train_index]
    strat_test_set = lots_not_preprocessed_usd.loc[test_index]
I'm getting the error message
KeyError Traceback (most recent call last)
<ipython-input-224-cee2389254f2> in <module>()
3 split = StratifiedShuffleSplit(n_splits=1, test_size =0.2, random_state=42)
4 for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
----> 5 strat_train_set = lots_not_preprocessed_usd.loc[train_index]
6 strat_test_set = lots_not_preprocessed_usd.loc[test_index]
......
KeyError: 'None of [[32199 67509 69003 ..., 44204 2809 56726]] are in the [index]'
There seems to be a problem with my index (e.g. 2017-09-12T18-38-38-076065) which I don't understand. Where is the issue?
If I use another split it works as expected:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(lots_not_preprocessed_usd, test_size=0.2, random_state=42)

When you use .loc you need to pass index labels as the row indexer, so use .iloc when you want an ordinary positional (integer) indexer instead of .loc. In the for loop, train_index and test_index are not datetime labels, since split.split(X, y) returns arrays of random integer positions.
...
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.iloc[train_index]
    strat_test_set = lots_not_preprocessed_usd.iloc[test_index]
Sample example:
import numpy as np
import pandas as pd
lots_not_preprocessed_usd = pd.DataFrame({'some': np.random.randint(5, 10, 100), 'decade': np.random.randint(5, 10, 100)}, index=pd.date_range('5-10-15', periods=100))
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.iloc[train_index]
    strat_test_set = lots_not_preprocessed_usd.iloc[test_index]
Sample output:
strat_train_set.head()
decade some
2015-08-02 6 7
2015-06-14 7 6
2015-08-14 7 9
2015-06-25 9 5
2015-05-15 7 9
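If you want to keep using .loc, a minimal sketch (reusing the split and dataframe defined above) is to translate the positional indices into the dataframe's own labels first:
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']):
    strat_train_set = lots_not_preprocessed_usd.loc[lots_not_preprocessed_usd.index[train_index]]
    strat_test_set = lots_not_preprocessed_usd.loc[lots_not_preprocessed_usd.index[test_index]]
Both variants select the same rows; .iloc is simply the more direct tool for the positional output of split.split.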

Related

"Input contains NaN, infinity or a value too large for dtype('float32')" when I train a DecisionTreeClassifier [closed]

I'm trying to code a Decision Tree method for the data in an exoplanet catalogue. It's a workshop for one of the courses of my Master's studies.
I have written this in a Jupyter Notebook:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
data = pd.read_csv('exoplanet.eu_catalog_2021.12.15.csv')
data_new = data.select_dtypes(include=['float64'])  # select only dtype float64 data
data_new[~data_new.isin([np.nan, np.inf, -np.inf]).any(1)]  # note: the filtered result is not assigned back
data_new_2 = data_new.loc[:,('mass', 'mass_error_min')]
data_new_2.dropna(subset =["mass_error_min"], inplace = True)
data_new_2.info()
print(data_new_2)
with this result
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1425 entries, 1 to 4892
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mass 1425 non-null float64
1 mass_error_min 1425 non-null float64
dtypes: float64(2)
memory usage: 33.4 KB
As you can see, there are no empty cells. Besides, I wrote this to convert all the numbers to float64 (just in case!):
data_new_2['mass'] = data_new_2['mass'].astype(float)
data_new_2['mass_error_min'] = data_new_2['mass_error_min'].astype(float)
Then I split the data into training and test subsets:
from sklearn.model_selection import train_test_split
X = data_new_2.drop(["mass"], axis = 1)
y = data_new_2["mass"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 42)
And there is no problem... until this part
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train_2)
because I get this error message
ValueError Traceback (most recent call last)
<ipython-input-327-7b81afce3234> in <module>
1 from sklearn.tree import DecisionTreeClassifier
2 classifier = DecisionTreeClassifier()
----> 3 classifier.fit(X_train, y_train_2)
.
.
.
~/.local/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
104 msg_err.format
105 (type_err,
--> 106 msg_dtype if msg_dtype is not None else X.dtype)
107 )
108 # for object dtype data, we only check for NaNs (GH-13254)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I don't understand why this error message appears, because I have no NaN, infinity or "too large" data in X_train or y_train.
What can I do?
There are some infinite values in your mass_error_min column:
data_new_2.describe()
              mass  mass_error_min
count  1425.000000       1425.0000
mean      6.060956             inf
std      13.568726             NaN
min       0.000002          0.0000
25%       0.054750          0.0116
50%       0.725000          0.0700
75%       3.213000          0.5300
max     135.300000             inf
So you have to fill those inf values with some finite value; use this code:
value = data_new_2['mass_error_min'].quantile(0.98)
data_new_2 = data_new_2.replace(np.inf, value)
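To see where the non-finite entries come from before fitting, a quick check (a minimal sketch using only standard numpy/pandas calls on the data_new_2 frame from the question) is:
import numpy as np
print(np.isinf(data_new_2).sum())  # per-column count of +inf/-inf entries
print(data_new_2.isna().sum())     # per-column count of NaN entries
This also explains the confusion above: df.info() only counts nulls, so inf values pass through it unnoticed.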

lightgbm || ValueError: Series.dtypes must be int, float or bool

The dataframe has its NA values filled. The schema of the dataset has no object dtype, as specified in the documentation.
df.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 429 entries, 351 to 559
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 429 non-null category
1 Married 429 non-null category
2 Dependents 429 non-null category
3 Education 429 non-null category
4 Self_Employed 429 non-null category
5 ApplicantIncome 429 non-null int64
6 CoapplicantIncome 429 non-null float64
7 LoanAmount 429 non-null float64
8 Loan_Amount_Term 429 non-null float64
9 Credit_History 429 non-null float64
10 Property_Area 429 non-null category
dtypes: category(6), float64(4), int64(1)
memory usage: 23.3 KB
I have the following code, where I am trying to classify the dataset using lightgbm:
import lightgbm as lgb
train_data=lgb.Dataset(x_train,label=y_train,categorical_feature=cat_cols)
#define parameters
params = {'learning_rate':0.001}
model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)
I am getting the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-178-aaa91a2d8719> in <module>
6
7
----> 8 model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)
~\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
229 # construct booster
230 try:
--> 231 booster = Booster(params=params, train_set=train_set)
232 if is_valid_contain_train:
233 booster.set_train_data_name(train_data_name)
~\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, model_str, silent)
1981 break
1982 # construct booster object
-> 1983 train_set.construct()
1984 # copy the parameters from train_set
1985 params.update(train_set.get_params())
~\Anaconda3\lib\site-packages\lightgbm\basic.py in construct(self)
1319 else:
1320 # create train
-> 1321 self._lazy_init(self.data, label=self.label,
1322 weight=self.weight, group=self.group,
1323 init_score=self.init_score, predictor=self._predictor,
~\Anaconda3\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
1133 raise TypeError('Cannot initialize Dataset from {}'.format(type(data).__name__))
1134 if label is not None:
-> 1135 self.set_label(label)
1136 if self.get_label() is None:
1137 raise ValueError("Label should not be None")
~\Anaconda3\lib\site-packages\lightgbm\basic.py in set_label(self, label)
1648 self.label = label
1649 if self.handle is not None:
-> 1650 label = list_to_1d_numpy(_label_from_pandas(label), name='label')
1651 self.set_field('label', label)
1652 self.label = self.get_field('label') # original values can be modified at cpp side
~\Anaconda3\lib\site-packages\lightgbm\basic.py in list_to_1d_numpy(data, dtype, name)
88 elif isinstance(data, Series):
89 if _get_bad_pandas_dtypes([data.dtypes]):
---> 90 raise ValueError('Series.dtypes must be int, float or bool')
91 return np.array(data, dtype=dtype, copy=False) # SparseArray should be supported as well
92 else:
ValueError: Series.dtypes must be int, float or bool
Did anyone help you yet? If not: the answer lies in transforming your variable.
Go to this link: GitHub Discussion lightGBM
The creators of LightGBM were confronted with that same question once.
In the link above they (STRIKER) tell you that you should transform your variables with astype("category") (pandas/scikit) AND label-encode them, because you need an INT value (specifically an INT32) in your feature column.
However, label encoding and astype('category') should normally do the same thing:
Encoding
Another useful link is this advanced doc about the categorical feature: Categorical feature light gbm homepage, where they tell you that LightGBM can't deal with object (string) dtypes such as those in your data.
If you are still feeling uncomfortable with this explanation, here is my code snippet from the kaggle space_race_set. If you are still having problems, just ask away.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb

cat_feats = ['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country']
labelencoder = LabelEncoder()
for col in cat_feats:
    train_df[col] = labelencoder.fit_transform(train_df[col])
for col in cat_feats:
    train_df[col] = train_df[col].astype('int')
y = train_df[["Status Mission"]]
X = train_df.drop(["Status Mission"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_data = lgb.Dataset(X_train,
                         label=y_train,
                         categorical_feature=cat_feats,
                         free_raw_data=False)
test_data = lgb.Dataset(X_test,
                        label=y_test,
                        categorical_feature=cat_feats,
                        free_raw_data=False)
I had the same problem. My y_train was in int64 dtype. This solved my problem:
model_LGB.fit(
    X=X_train,
    y=y_train.astype('int32'))
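Note that in the df.info() output above several columns, and typically also the label column, are pandas 'category' dtype, which LightGBM rejects as a label. A minimal sketch of the label fix, reusing the x_train/y_train/cat_cols names from the question (.cat.codes is the standard pandas accessor for integer category codes):
# convert a category-dtype label Series to integer codes before building the Dataset
if str(y_train.dtype) == 'category':
    y_train = y_train.cat.codes
train_data = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)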

How to implement dynamic parameter estimation with missing data in Gekko?

Going back and forth through the documentation, I was able to set up a dynamic parameter estimation in Gekko.
Here's the code, with the measurement values shown below (the file is named MeasuredAlgebrProductionRate_30min_18h.csv on my system and uses ';' as separator):
import numpy as np
import matplotlib.pyplot as plt
from gekko import GEKKO
#%% Read measurement data from CSV file
t_x_q_obs = np.genfromtxt('MeasuredAlgebrProductionRate_30min_18h.csv', delimiter=';')
#t_obs, x_obs, q_obs = t_xq_obs[:,0:3]
#%% Initialize Model
m = GEKKO(remote=False)
m.time = t_x_q_obs[:,0] #np.arange(0, 18/24+1e-6, 1/2*1/24)
# Declare parameter
V_liq = m.Param(value = 159.0)
# Declare FVs
k_1 = m.FV(value = 0.80)
k_1.STATUS = 1
f_1 = m.FV(value = 10.0)
f_1.STATUS = 1
# Diff. Variables
X = m.Var(value = 80.0) # at t=0
Y = m.Var(value = 80.0*0.2)
rho_1 = m.Intermediate(k_1*X)
#q_prod = m.Intermediate(0.52*f_1*X/24)
#X = m.CV(value = t_x_q_obs[:,1])
q_prod = m.CV(value = t_x_q_obs[:,2])
#%% Equations
m.Equations([X.dt() == -rho_1, Y.dt() == 0, q_prod == 0.52*f_1*X/24])
m.options.IMODE = 5
m.solve(disp=False)
#%% Plot some results
plt.plot(m.time, np.array(X.value)/10, label='X')
plt.plot(t_x_q_obs[:,0], t_x_q_obs[:,2], label='q_prod Meas.')
plt.plot(m.time, q_prod.value, label='q_prod Sim.')
plt.xlabel('time')
plt.ylabel('X / q_prod')
plt.grid()
plt.legend(loc='best')
plt.show()
0.0208333333 NaN 30.8306036
0.0416666667 NaN 29.1200832
0.0625 74.866 28.7700549
0.0833333333 NaN 29.2318865
0.104166667 NaN 30.7727362
0.125 NaN 29.8743804
0.145833333 NaN 29.9923447
0.166666667 NaN 30.9169679
0.1875 NaN 28.5956184
0.208333333 NaN 27.7361632
0.229166667 NaN 26.6669496
0.25 NaN 27.17477
0.270833333 75.751 23.6270346
0.291666667 NaN 23.0646928
0.3125 NaN 23.6442113
0.333333333 NaN 23.089118
0.354166667 NaN 22.9101616
0.375 NaN 22.7453854
0.395833333 NaN 23.2182759
0.416666667 NaN 21.4901903
0.4375 NaN 21.1449899
0.458333333 NaN 20.7093537
0.479166667 NaN 20.3109086
0.5 NaN 20.6825141
0.520833333 NaN 19.199583
0.541666667 NaN 19.6173416
0.5625 NaN 19.5543139
0.583333333 NaN 20.4501879
0.604166667 NaN 18.7678061
0.625 NaN 18.4629262
0.645833333 NaN 18.3730322
0.666666667 NaN 19.5375442
0.6875 NaN 18.1975297
0.708333333 NaN 18.0370627
0.729166667 NaN 17.5734727
0.75 NaN 18.8632046
So far, so good. Suppose I also have measurements of X (second column) at some time points (first column); the rest are not available (therefore NaN).
I would like to adjust k_1 and f_1, so that simulated and observed variables X and q_prod match as closely as possible.
Is this feasible with Gekko? If so, how?
Another question: Gekko throws an error if m.time has more elements than there are time points of observed variables. However, my initial values of X and Y refer to t=0, not t=0.0208333333, hence the commented-out part after m.time =, see above. (Measurements at t=0 are not available.) Do initial conditions in Gekko refer to the first element of m.time, as they do in Matlab, or to t=0?
If you have a missing measurement then you can include a non-numeric value such as NaN and Gekko ignores that entry in the objective function. Here is a test case with one NaN value in ym:
Nonlinear Regression with NaN Data Value
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
xm = np.array([0,1,2,3,4,5])
ym = np.array([0.1,0.2,np.nan,0.5,0.8,2.0])
m = GEKKO(remote=False)
x = m.Param(value=xm,name='x')
a = m.FV()
a.STATUS=1
y = m.CV(value=ym,name='y')
y.FSTATUS=1
m.Equation(y==0.1*m.exp(a*x))
m.options.IMODE = 2
m.options.SOLVER = 1
m.solve(disp=True)
print('Optimized, a = ' + str(a.value[0]))
plt.plot(xm,ym,'bo')
plt.plot(xm,y.value,'r-')
m.open_folder()
plt.show()
When you open the run folder with m.open_folder() and look at the data file gk_model0.csv, there is the NaN in the y value column.
y,x
0.1,0
0.2,1
nan,2
0.5,3
0.8,4
2.0,5
This is IMODE=2, so it is a steady-state regression problem, but it shows the same thing that happens with dynamic estimation problems. There is more information on the objective function with m.options.EV_TYPE=1 (default) or m.options.EV_TYPE=2 for estimation, and on how bad values are handled in a data file. When a measurement is a non-numeric value, that bad value is dropped from the objective-function summation.
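For instance, switching the estimation objective is a one-line option change (a minimal sketch; EV_TYPE is the Gekko option named above):
m.options.EV_TYPE = 2  # 1 = l1-norm objective (default), 2 = squared-error objective
Here is a version with a dynamic model: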
Dynamic Regression with Fixed Initial Condition
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
xm = np.array([0,1,2,3,4,5])
ym = np.array([2.0,1.5,np.nan,2.2,3.0,5.0])
m = GEKKO(remote=False)
m.time = xm
a = m.FV(lb=0.1,ub=2.0)
a.STATUS=1
y = m.CV(value=ym,name='y',fixed_initial=False)
y.FSTATUS=1
m.Equation(y.dt()==a*y)
m.options.IMODE = 5
m.options.SOLVER = 1
m.solve(disp=True)
print('Optimized, a = ' + str(a.value[0]))
plt.figure(figsize=(6,2))
plt.plot(xm,ym,'bo',label='Meas')
plt.plot(xm,y.value,'r-',label='Pred')
plt.ylabel('y')
plt.ylim([0,6])
plt.legend()
plt.show()
As you observed, you need m.time to have the same length as your measurement values. If you are missing values, you can append a np.nan to the beginning of the data horizon. By default, Gekko uses the first value specified in the value property to set the initial condition. If you don't want Gekko to use that value, set fixed_initial=False for your CV.
Dynamic Regression with Free Initial Condition
y = m.CV(value=ym,name='y',fixed_initial=False)
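Applied to the data in the question, a minimal sketch of that prepend step (assuming the t_x_q_obs array loaded above; np.insert is plain numpy) is:
import numpy as np
t_new = np.insert(t_x_q_obs[:, 0], 0, 0.0)      # add t=0 to the horizon
x_meas = np.insert(t_x_q_obs[:, 1], 0, np.nan)  # no X measurement at t=0
q_meas = np.insert(t_x_q_obs[:, 2], 0, np.nan)  # no q_prod measurement at t=0
m.time = t_new  # the initial condition refers to the first element of m.time
As in Matlab, the initial condition refers to the first element of m.time, so adding the t=0 point makes X = m.Var(value=80.0) apply at t=0 as intended.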

How to ensure the centroids of the clusters in the k-means algorithm don't switch every time?

I have a csv file which looks like the one below:
date mse
2018-02-11 14.34
2018-02-12 7.24
2018-02-13 244.5
2018-02-14 3.5
2018-02-16 12.67
2018-02-21 45.66
2018-02-22 15.33
2018-02-24 98.44
2018-02-26 23.55
2018-02-27 45.12
2018-02-28 78.44
2018-03-01 34.11
2018-03-05 23.33
2018-03-06 127.45
... ...
... ...
Now I try to apply k-means to the mse values to get 2 clusters, which gives me 2 centroids, one for each cluster. Now I am given a mse value, and I need to find which of the two centroids is nearer to it. I do something like this:
from sklearn.cluster import KMeans
import pandas as pd

centroids_list = []
result = []
given_mse = 7.382409087
kmeans = KMeans(n_clusters=2)
df = pd.read_csv("data.csv", parse_dates=["date"])
kmeans.fit_predict(df[['mse']])
centroids_list.append(kmeans.cluster_centers_.ravel())
#print(centroids_list) # array([ 153.27996598, 19810.6925875 ]
for i in centroids_list:
    t1 = abs(given_mse - i[0])
    t2 = abs(given_mse - i[1])
    if t1 < t2:
        result.append("label 1")
    else:
        result.append("label 2")
print(result) # ['label1']
Now, as you can see, I get two centroid values, 153.27996598 and 19810.6925875, one assigned to each cluster.
The problem is that the program keeps switching the two values [(x,y) or (y,x)] between runs, because of which I get the end result as either label1 or, at times, label2.
Any idea how this can be fixed? Is there any scikit-learn technique to prevent this switching?
As mentioned by @Vivek Kumar, I needed to pass an additional parameter, random_state, when setting up the k-means. The value for random_state can be any integer:
kmeans = KMeans(n_clusters=2, random_state=1)
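random_state makes the run deterministic, but the order of the centroids is still an implementation detail. A more robust variant (a small sketch reusing the names from the question) is to sort the centroids so that "label 1" always refers to the smaller one:
centroids = sorted(kmeans.cluster_centers_.ravel())  # ascending order, run-independent
if abs(given_mse - centroids[0]) < abs(given_mse - centroids[1]):
    result.append("label 1")
else:
    result.append("label 2")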

Trapezoidal Kernel in python

I wanted to implement a trapezoidal kernel in Python (probably using numpy or scipy) for convolution, just like the Trapezoid1DKernel that comes with the astropy module. I have tried convolving with a trapezoidal waveform, but the results were not satisfactory:
import numpy as np
from scipy import signal

def trapzoid_signal(t, width=2., slope=1., amp=1., offs=0):
    global trasig
    trasig = slope*width*signal.sawtooth(2*np.pi*t/width, width=0.5)/4.
    trasig += slope*width/4.
    trasig[trasig > amp] = amp  # clip the peak to form the flat top
    return trasig + offs

t = np.linspace(0, 32, 34)
trapzoid_signal(t, width=32, slope=1, amp=0.0322)
print(trasig)
z = signal.convolve(trasig, new)  # 'new' is the data array defined elsewhere in the notebook
If I print z, it gives:
[ nan nan nan ..., nan nan nan]
I tried plotting z; it shows nothing. Any help?
Eureka!!! I did it. The reason it was not plotting, and was printing the values as [ nan nan nan ..., nan nan nan], was the NaNs in new; they were removed by using the following code, which I found on Stack Overflow itself:
ind = np.where(~np.isnan(new))[0]
first, last = ind[0], ind[-1]
new[:first] = new[first]
new[last + 1:] = new[last]
That solved my problem: I not only got the values of z but also got my plot. Thanks, stackoverflow.com.
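For building the trapezoidal kernel itself, a minimal numpy-only sketch (the names here are illustrative, not from the question) uses the fact that convolving two rectangular windows of different widths yields a trapezoid:
import numpy as np
box_wide = np.ones(24)                    # the longer boxcar sets the base width
box_narrow = np.ones(8)                   # the shorter boxcar sets the slope length
trap_kernel = np.convolve(box_wide, box_narrow)
trap_kernel /= trap_kernel.sum()          # normalize so convolution preserves scale
The resulting trap_kernel can then be passed to scipy.signal.convolve in place of a hand-built waveform, with no NaN handling needed.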
