SARIMAX ValueError: xnames and params do not have the same length - statistics

I am running SARIMAX and I am getting the following error:
ValueError: xnames and params do not have the same length.
I need help understanding what this error means and how I can avoid it.
This is the model that I am trying to run using statsmodels in python:
mod = sm.tsa.statespace.SARIMAX(y,order=(1, 1, 1), seasonal_order=(1, 1, 1, 12), enforce_stationarity=False, enforce_invertibility=False)

Related

How to pass group information in sklearn Random Grid Search for XGBRanker?

When I try to perform a randomized grid search on an XGBRanker model, I keep getting the following error:
/workspace/src/objective/rank_obj.cc:52: Check failed: gptr.size() != 0 && gptr.back() == info.labels_.Size(): group structure not consistent with #rows
The error seems to be regarding the structure of the group information passed. I'm passing the size of each group. If there are N rows and 2 groups then the array passed would be [g1_size, g2_size].
I'm not sure where I'm going wrong, since I'm able to fit the model on its own without any issues. I only get this error when I run RandomizedSearchCV. The code snippet is as follows:
import xgboost as xgb
from sklearn.metrics import make_scorer, ndcg_score
from sklearn.model_selection import RandomizedSearchCV

model = xgb.XGBRanker(
    objective="rank:ndcg",
    max_depth=10,
    n_estimators=100,
    verbosity=1)
param_dist = {'n_estimators': [100, 200, 300],
              'learning_rate': [1e-3, 1e-4, 1e-5],
              'subsample': [0.8, 0.9, 1],
              'max_depth': [5, 6, 7]
              }
# groups holds the size of each query group, e.g. [g1_size, g2_size]
fit_params = {"group": groups}
scoring = make_scorer(ndcg_score, greater_is_better=True)
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
clf.fit(X_train, Y_train, **fit_params)
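The check in the error message compares the sum of the group sizes with the number of rows. A likely cause, sketched below with NumPy only, is that each CV fold trains on a subset of the rows while the group-size array is passed through unchanged. The data, the fold indices and the names group_ids and fold_groups are made up for illustration, and the sketch assumes the rows of each group are stored contiguously.

```python
import numpy as np

# Hypothetical data: 10 rows in 3 query groups of sizes 4, 3 and 3.
groups = np.array([4, 3, 3])
n_rows = 10

# The consistency check: group sizes must add up to the number of rows.
print(groups.sum() == n_rows)            # True when fitting on all rows

# A 5-fold split trains on only 8 of the 10 rows (here rows 2..9),
# but the group array still describes all 10 rows.
train_idx = np.arange(2, n_rows)
print(groups.sum() == len(train_idx))    # False: 10 != 8

# Expanding the sizes to one group id per row makes it possible
# to recompute the group sizes for the rows of a given fold.
group_ids = np.repeat(np.arange(len(groups)), groups)
_, fold_groups = np.unique(group_ids[train_idx], return_counts=True)
print(fold_groups, fold_groups.sum() == len(train_idx))
```

If this is indeed the cause, splitting by whole groups (so that no query is cut in half) and recomputing the sizes per fold would keep the group structure consistent with the rows.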

StratifiedKFold split doesn't seem to work

I am trying to perform KFold cross-validation for a Keras model, but for some reason the KFold split isn't working.
from sklearn.model_selection import StratifiedKFold
X = train_data[features]
y = train_data['price']
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, test in kfold.split(X,y):
    print(X[train])
I was actually fitting the model subsequently but that didn't work, so I tried printing the results, which produced the following warning and output.
Warning: /opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672: UserWarning: The least populated class in y has only 1 member, which is less than n_splits=10.
% (min_groups, self.n_splits)), UserWarning)
Error: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 9,\n 10,\n ...\n 39989, 39990, 39991, 39992, 39993, 39994, 39995, 39996, 39997,\n 39998],\n dtype='int64', length=36000)] are in the [columns]"
The warning is self-explanatory:
Warning:
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py:672:
UserWarning: The least populated class in y has only 1 member, which
is less than n_splits=10. % (min_groups, self.n_splits)), UserWarning)
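This means that at least one class in y has only one sample, so it cannot be distributed across 10 stratified folds. Because price is the target here, each distinct price is treated as a separate class.
I recommend that you check your dataset again in order to verify/correct the labels.
The error itself comes from X[train]: on a DataFrame, that expression looks up column labels, while kfold.split returns row positions. Using X.iloc[train] (and X.iloc[test]) worked for me.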
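A minimal sketch of the positional-indexing fix, assuming price is a continuous target so that plain KFold is a reasonable substitute for StratifiedKFold. The DataFrame contents and the column names area and rooms are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for train_data: 20 rows, two features, continuous price.
rng = np.random.default_rng(0)
train_data = pd.DataFrame({
    "area": rng.uniform(30, 200, size=20),
    "rooms": rng.integers(1, 6, size=20),
    "price": rng.uniform(1e5, 5e5, size=20),
})
X = train_data[["area", "rooms"]]
y = train_data["price"]

# KFold does not stratify, so it does not count samples per class.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_shapes = []
for train_idx, test_idx in kfold.split(X):
    # split() yields integer row positions, so select rows with .iloc.
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    fold_shapes.append((X_train.shape, X_test.shape))

print(fold_shapes[0])
```

Each fold then holds 16 training rows and 4 test rows, and the model can be fitted on X_train and y_train inside the loop.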

Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Has anyone used optimization models on fitted sklearn models?
What I'd like to do is fit a model on training data and then use it to find the combination of input values for which the model predicts the largest value.
Some example, simplified code:
import pandas as pd
df = pd.DataFrame({
'temperature': [10, 15, 30, 20, 25, 30],
'working_hours': [10, 12, 12, 10, 30, 15],
'sales': [4, 7, 6, 7.3, 10, 8]
})
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y);
Our baseline is a simple loop that predicts every combination of the variables:
results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
import numpy as np
for temp in np.arange(1,100.01,1):
    for work_hours in np.arange(1,60.01,1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1,-1))
            })
        ])
print(results.sort_values(by='sales_predicted', ascending=False))
With this approach it's difficult or impossible to:
* do it fast (it is a brute-force method)
* implement a constraint that links two or more variables
We tried the PuLP and Pyomo libraries, but neither allows model.predict to be used as an objective function. Both return the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have an idea how we can get rid of the loop and use something else?
When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy/performance metrics. If you are instead trying to maximize the predicted value, you can definitely make your code more efficient, like below.
You are collecting all the predictions in a big results dataframe and then sorting it in descending order. Instead, you can track the largest value of your target variable (sales_predicted) on the fly, using a simple if statement. So just change your loop into this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        # predict() returns a one-element array here; take the scalar
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))[0]
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
This way you only record a combination when its prediction exceeds the current maximum, and otherwise do nothing.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the combination that produces that maximum. Hope this helps.
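The loop can also be dropped entirely by predicting the whole grid in one call, which is usually much faster than one predict() call per combination. The sketch below is still a brute-force search, not a solver-based optimization. The constraint temperature + working_hours <= 120 and the random_state value are made-up assumptions, only there to show how a rule linking two variables can be applied as a mask.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    'temperature': [10, 15, 30, 20, 25, 30],
    'working_hours': [10, 12, 12, 10, 30, 15],
    'sales': [4, 7, 6, 7.3, 10, 8]
})
X = df.drop(['sales'], axis=1)
y = df['sales']
model = RandomForestRegressor(random_state=0).fit(X, y)

# Build every (temperature, working_hours) combination at once.
temps = np.arange(1, 101)
hours = np.arange(1, 61)
tt, hh = np.meshgrid(temps, hours, indexing='ij')
grid = pd.DataFrame({'temperature': tt.ravel(), 'working_hours': hh.ravel()})

# Hypothetical constraint linking both variables, applied as a boolean mask.
feasible = grid[grid['temperature'] + grid['working_hours'] <= 120].copy()

# One predict() call for all feasible rows instead of one call per row.
feasible['sales_predicted'] = model.predict(feasible[['temperature', 'working_hours']])
best = feasible.loc[feasible['sales_predicted'].idxmax()]
print(best)
```

Because a random forest's prediction is piecewise constant in its inputs, many neighbouring grid points share the same maximum, so best is just the first of them.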

XGBRegressor not fitting data

I would like to run a CV for an XGBoost tree regression on my X_train, y_train data. My target is of integer values from 25 to 40. I tried to run this code on my training dataset
# A parameter grid for XGBoost
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
cv_params = {
    'min_child_weight': [1, 3, 5],
    'gamma': [0.5, 1, 2, 3],
    'subsample': [i/10.0 for i in range(6,11)],
    'colsample_bytree': [i/10.0 for i in range(6,11)],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.02, 0.1]
}
# Initialize XGB
xgb_for_gridsearch = XGBRegressor(
    n_estimators = 1000,
    objective = 'reg:logistic',
    seed = 7
)
# Initialize GridSearch
xgb_grid = GridSearchCV(
    estimator = xgb_for_gridsearch,
    param_grid = cv_params,
    scoring = 'explained_variance',
    cv = 5,
    n_jobs = -1
)
xgb_grid.fit(X_train, y_train)
xgb_grid.grid_scores_
I get an error on the fit() call.
I kind of expected that the CV would just take forever, but not an actual error. The error output is a couple of thousand lines long, so I will only include the part that relates to my code:
During handling of the above exception, another exception occurred:
JoblibXGBoostError Traceback (most recent call last)
<ipython-input-44-a5c1d517107d> in <module>()
25 )
26
---> 27 xgb_grid.fit(X_train, y_train)
Does anyone know what this relates to?
Am I using conflicting parameters?
Would it be better to use xgboost.cv()?
I can also add the whole error code if that would help, should I just add it at the bottom of this question?
UPDATE: added error to a Gist, as suggested XGRegressor_not_fitting_data, since the error is too long.
Thanks for adding the full error output; it makes it easier to help you.
A GitHub repo is fine, yet you may find it easier to use https://gist.github.com/ or https://pastebin.com/
Note that the most helpful line of the full error is generally the last one, which here reads:
label must be in [0,1] for logistic regression
It seems you have used the logistic objective (objective = 'reg:logistic' in your code), which expects every value in y_train to lie in [0, 1], whereas your targets range from 25 to 40.
If the target can be treated as binary, you can fix it with something like
y_train_bin = (y_train == 1).astype(int)
xgb_grid.fit(X_train, y_train_bin)
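If the target should stay numeric rather than binary, one option is to rescale it into [0, 1] before fitting and to map the predictions back afterwards. The sketch below uses NumPy only. The sample values and the helper names to_unit_interval and from_unit_interval are hypothetical. Another option is a regression objective without this restriction, such as 'reg:squarederror' in recent XGBoost versions.

```python
import numpy as np

# Hypothetical integer targets in the 25..40 range described in the question.
y_train = np.array([25, 31, 40, 28, 36])

y_min, y_max = y_train.min(), y_train.max()

def to_unit_interval(y):
    """Min-max scale targets into [0, 1] using the training range."""
    return (y - y_min) / (y_max - y_min)

def from_unit_interval(y_scaled):
    """Map scaled values (e.g. model predictions) back to the original range."""
    return y_scaled * (y_max - y_min) + y_min

y_scaled = to_unit_interval(y_train)
print(y_scaled.min(), y_scaled.max())
print(from_unit_interval(y_scaled))
```

The model would then be fitted on y_scaled, and from_unit_interval would be applied to its predictions.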

Why tensor.view() is not working in pytorch?

I have the following piece of code.
embedded = self.embedding(input).view(1, 1, -1)
embedded = self.drop(embedded)
print(embedded[0].size(), hidden[0].size())
concatenated_output = torch.cat((embedded[0], hidden[0]), 1)
The last line of the code is giving me the following error.
RuntimeError: inconsistent tensor sizes at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.9_1487344852722/work/torch/lib/THC/generic/THCTensorMath.cu:141
Please note, when I am printing the tensor shapes at line no. 3, I am getting the following output.
torch.Size([1, 300]) torch.Size([1, 1, 300])
Why am I getting a [1, 300] shape for the embedded tensor even though I have used the view method as view(1, 1, -1)?
Any help would be appreciated!
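embedded was a 3d-tensor and hidden was a tuple of two elements (hidden states and cell states), where each element is a 3d-tensor. hidden was the output of an LSTM layer. In PyTorch, LSTM returns hidden states [h] and cell states [c] as a tuple, which is what confused me about the error: embedded[0] indexes into the tensor and yields a 2d-tensor of shape [1, 300], while hidden[0] only selects h from the tuple and is still a 3d-tensor of shape [1, 1, 300].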
So, I updated the last line of the code as follows and it solved the problem.
concatenated_output = torch.cat((embedded, hidden[0]), 1)
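The shapes involved can be reproduced with a small standalone sketch. The vocabulary size of 10, the token id 3 and the layer sizes are made up for illustration, and the dropout step is left out. The last line shows an alternative concatenation along the feature dimension, which is an assumption about what may have been intended rather than part of the fix above.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocabulary of 10 tokens, 300-dimensional embeddings.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=300)
lstm = nn.LSTM(input_size=300, hidden_size=300)

token = torch.tensor([3])
embedded = embedding(token).view(1, 1, -1)   # shape (1, 1, 300)
output, hidden = lstm(embedded)              # hidden is the tuple (h, c)
h, c = hidden

print(embedded[0].size(), hidden[0].size())  # 2d tensor vs. 3d tensor

# The fix above: both tensors are 3d, concatenated along dimension 1.
concatenated_output = torch.cat((embedded, hidden[0]), 1)
print(concatenated_output.size())            # shape (1, 2, 300)

# Alternative: index into h as well and concatenate the feature vectors.
concatenated_features = torch.cat((embedded[0], h[0]), 1)
print(concatenated_features.size())          # shape (1, 600)
```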
