Scikit-Learn Multiple Regression Fails with ElasticNetCV - scikit-learn

According to the documentation and other SO questions, ElasticNetCV accepts multiple output regression. When I try it, though, it fails. Code:
from sklearn import linear_model
import numpy as np
import numpy.random as rnd
nsubj = 10
nfeat_train = 5
nfeat_predict = 20
x = rnd.random((nsubj, nfeat_train))
y = rnd.random((nsubj, nfeat_predict))
lm = linear_model.LinearRegression()
lm.fit(x,y) # works
el = linear_model.ElasticNetCV()
el.fit(x,y) # fails
Error message:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
This is with scikit-learn version 0.14.1. Is this a mismatch between the documentation and implementation?

You may want to take a look at sklearn.linear_model.MultiTaskElasticNetCV. But beware, this object assumes that your multiple targets share features. Thus, a feature is either active for all tasks (with variable activation for each, which can be small), or active for none of them. Before using this object, make sure this is the functionality you need.
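A minimal sketch of that suggestion, assuming a scikit-learn release that ships MultiTaskElasticNetCV. The data is random and the sample count is raised to 40 (my choice, not the question's) so that each cross-validation fold has a reasonable number of rows:

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

# Hypothetical random data: 40 samples, 5 features, 20 targets.
rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 40, 5, 20
x = rng.random((n_samples, n_features))
y = rng.random((n_samples, n_targets))

# The multi-task variant accepts a 2-D target matrix directly.
model = MultiTaskElasticNetCV(cv=3)
model.fit(x, y)

# One row of coefficients per target, one column per feature.
print(model.coef_.shape)
print(model.predict(x).shape)
```

Because the penalty is applied jointly across tasks, a feature that is zeroed out is zeroed out for every target at once, which is the shared-feature assumption described above.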

Related

Pycaret predict error in multiclassification using Colab

I'm using the Pycaret library in Colab to make a simple prediction on this dataset:
https://www.kaggle.com/andrewmvd/fetal-health-classification
When I run my code:
from pycaret.utils import enable_colab
enable_colab()
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
from pycaret.classification import *
from pandas_profiling import ProfileReport
df= pd.read_csv("/content/drive/MyDrive/Pycaret/fetal_health.csv")
df2 = df.iloc[:,:11]
df2['fetal_health'] = df['fetal_health']
test = df2.sample(frac=0.10, random_state=42, weights='fetal_health')
train = df2.drop(test.index)
test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)
clf = setup(data =train, target = 'fetal_health', session_id=42,
log_experiment=True, experiment_name='fetal', normalize=True)
best = compare_models(sort="Accuracy")
rf = create_model('rf', fold=30)
tuned_rf = tune_model(rf, optimize='Accuracy')
predict_model(tuned_rf)
I get this error:
I think this is because my target variable is imbalanced (see img) and is causing the predictions to be incorrect.
Can someone please help me understand?
Thanks in advance.
Have you run each step in a separate cell to check the outputs?
Run
clf = setup(data =train, target = 'fetal_health', session_id=42,
log_experiment=True, experiment_name='fetal', normalize=True)
and check:
Are all variable types correctly inferred? (E.g., using your code with the Kaggle dataset of the same name, all variables show as numeric except for severe_decelerations, which shows as "Categorical". Is that correct?)
Is there any preprocessing configuration that needs to change? I'm sure your issue has nothing to do with an imbalanced target variable, but you can test yourself by changing your setup (adding fix_imbalance = True to change the default -- it shows as False when you check the setup output).
You can learn more about the available preprocessing configurations here:
https://pycaret.gitbook.io/docs/get-started/preprocessing
Also, while troubleshooting, you can save yourself some work by using
best_model = create_model(best, fold=30)
predict_model(best_model)
(No need to look up the best model to add manually to create_model(),
or to use tune_model() until you get the model working.)
I found what the problem was:
My target variable starts at 1 and has 3 distinct values. This causes an error when PyCaret builds a list comprehension over the classes (because it indexes from zero).
To solve that, I just transformed my target variable to start at zero, and it worked fine.
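A minimal sketch of that transformation, assuming the target column holds the values 1.0, 2.0, and 3.0 as in the Kaggle file. The small DataFrame below is made-up stand-in data, not the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the training frame; only the target matters here.
train = pd.DataFrame({
    "baseline_value": [120.0, 132.0, 133.0, 134.0],
    "fetal_health": [1.0, 2.0, 3.0, 1.0],
})

# Shift the labels from {1, 2, 3} to {0, 1, 2} before calling setup().
train["fetal_health"] = train["fetal_health"].astype(int) - 1

print(sorted(train["fetal_health"].unique().tolist()))
```

The same shift would need to be applied to the held-out test frame so both use the same label encoding.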
Leandro,
thank you so much for your solution! I was having the same problem with the same dataset!
A. Beal, I tried your solution, but still the same error message appeared, so I tried Leandro's solution, and the problem was, in fact, the target beginning with 1, and not 0. Thank you for your suggestion on how to reduce the code!

Unable to Reproduce Results while using Scikit-learn RFECV

I am trying to use Recursive Feature Elimination with CV and get reproducible results. I have tried fixing the randomness by passing random_state=SEED to the components used, and I have also set the global seed with np.random.seed(SEED). However, I am still unable to control the randomness and cannot reproduce results. The code segment is below.
estimator = GradientBoostingClassifier(random_state=SEED, n_estimators=2*df.shape[1])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=SEED)
selector = RFECV(estimator, n_jobs=-1,step=STEP, cv=cv)
selector = selector.fit(df, y)
df = df.loc[:, selector.support_]
print("Shape of final data AFTER FEATURE SELECTION")
print(df.shape, y.shape)
Specifically, if I run this segment of code, it returns a different number of selected features on each run. Any help would be appreciated.

Saving LinearRegression (from sklearn.linear_model) coefficients in a list

I'm stuck on a problem that should be very simple. I'm running four simple linear regressions (changing only the x variable) and I need to store both the intercept and the slope coefficient in a list, for all regressions.
I thought it would be very easy, but it seems I'm not good at handling lists. The list ends up holding the same coefficients for all four models.
This is my code:
from sklearn.linear_model import LinearRegression
variables = ['Number_of_likes','Number_of_comments','Number_of_followers','Number_of_repplies']
models = [None] * 4
lm = LinearRegression()
#Fit regressions
models[0] = lm.fit(X[[variables[0]]],y)
models[1] = lm.fit(X[[variables[1]]],y)
models[2] = lm.fit(X[[variables[2]]],y)
models[3] = lm.fit(X[[variables[3]]],y)
When I look at "models", it seems to be storing the results only for the last regression, in all four slots.
Hope I explained well my problem.
lm.fit() will modify the existing instance, not create a new copy of it. Also, the models list will store these instances by reference, which yields the behavior you are seeing.
To solve this, you need to create a new LinearRegression every time you want to fit it to a new input, not re-use the same old model. For example:
models = []  # just an empty list; we will append our models to it one by one
for var in variables:
    lm = LinearRegression()  # create a new object
    lm.fit(X[[var]], y)  # fit it
    models.append(lm)  # add it to the list
Or, a version closer to your original code would be (using sklearn.base.clone):
from sklearn.base import clone # to create a new copy of the lm object
lm = LinearRegression()
#Fit regressions
models[0] = clone(lm).fit(X[[variables[0]]],y)
models[1] = clone(lm).fit(X[[variables[1]]],y)
models[2] = clone(lm).fit(X[[variables[2]]],y)
models[3] = clone(lm).fit(X[[variables[3]]],y)
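Since the question asks for the intercept and slope of each regression, here is a minimal sketch that collects them as tuples. The DataFrame and target are synthetic stand-ins (an assumption; the original X and y are not shown), built so that y depends exactly on the first column:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

variables = ['Number_of_likes', 'Number_of_comments',
             'Number_of_followers', 'Number_of_repplies']

# Hypothetical data: y = 3 + 2 * Number_of_likes, other columns are noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((50, 4)), columns=variables)
y = 3.0 + 2.0 * X['Number_of_likes']

coefficients = []
for var in variables:
    lm = LinearRegression().fit(X[[var]], y)  # a fresh model per variable
    # With a 1-D target, intercept_ is a float and coef_ has one entry.
    coefficients.append((lm.intercept_, lm.coef_[0]))

for var, (intercept, slope) in zip(variables, coefficients):
    print(var, intercept, slope)
```

Storing plain numbers rather than model objects also sidesteps the shared-reference problem entirely.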

Reprojecting Xarray Dataset

I'm trying to reproject a Lambert Conformal dataset to Plate Carree. I know that this can easily be done visually using cartopy. However, I'm trying to create a new dataset rather than just show a reprojected image. Below is methodology I have mapped out but I'm unable to subset the dataset properly (Python 3.5, MacOSx).
from siphon.catalog import TDSCatalog
import xarray as xr
from xarray.backends import NetCDF4DataStore
import numpy as np
import cartopy.crs as ccrs
from scipy.interpolate import griddata
import numpy.ma as ma
from pyproj import Proj, transform
import metpy
# Declare bounding box
min_lon = -78
min_lat = 36
max_lat = 40
max_lon = -72
boundinglat = [min_lat, max_lat]
boundinglon = [min_lon, max_lon]
# Load the dataset
cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/HRRR/CONUS_2p5km/latest.xml')
dataset_name = sorted(cat.datasets.keys())[-1]
dataset = cat.datasets[dataset_name]
ds = dataset.remote_access(service='OPENDAP')
ds = NetCDF4DataStore(ds)
ds = xr.open_dataset(ds)
# parse the temperature at 850 and # 0z reftime
tempiso = ds.metpy.parse_cf('Temperature_isobaric')
t850 = tempiso[0][2]
# transform bounding lat/lons to src_proj
src_proj = tempiso.metpy.cartopy_crs #aka lambert conformal conical
extents = src_proj.transform_points(ccrs.PlateCarree(), np.array(boundinglon), np.array(boundinglat))
# subset the data using the indexes of the closest values to the src_proj extents
t850_subset = t850[(np.abs(tempiso.y.values - extents[1][0])).argmin():(np.abs(tempiso.y.values - extents[1][1])).argmin()][(np.abs(tempiso.x.values - extents[0][1])).argmin():(np.abs(tempiso.x.values - extents[0][0])).argmin()]
# t850_subset should be a small, reshaped dataset, but its shape is 0x2145
# now use np.linspace, np.meshgrid & scipy interpolate to reproject
My approach (transform the bounding points, then subset by finding the nearest values) isn't working. It claims the closest points are outside the extent of the dataset. As noted, I plan to use np.linspace, np.meshgrid and scipy interpolate to create a new, square lat/lon dataset from t850_subset.
Is there an easier way to resize & reproject an xarray dataset?
Your easiest path forward is to take advantage of xarray's ability to do pandas-like data selection; this is IMO the best part of xarray. Replace your last two lines with:
# By transposing the result of transform_points, we can unpack the
# x and y coordinates into individual arrays.
x_lim, y_lim, _ = src_proj.transform_points(ccrs.PlateCarree(),
                                            np.array(boundinglon),
                                            np.array(boundinglat)).T
t850_subset = t850.sel(x=slice(*x_lim), y=slice(*y_lim))
You can find more information in the documentation on xarray's selection and indexing functionality. You would probably also be interested in xarray's built-in support for interpolation. And if interpolation methods beyond SciPy's are of interest, MetPy also has a suite of other interpolation methods.
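To illustrate the label-based slicing used above without a network connection, here is a minimal sketch on a synthetic DataArray. The coordinate values and limits are made up, and the coordinates are assumed to be ascending, which is what slice(low, high) expects:

```python
import numpy as np
import xarray as xr

# Hypothetical 5 x 6 grid with projected x/y coordinates in ascending order.
da = xr.DataArray(
    np.arange(30).reshape(5, 6),
    dims=("y", "x"),
    coords={"y": np.arange(5) * 10.0, "x": np.arange(6) * 10.0},
)

# Made-up limits standing in for the transformed bounding box.
x_lim = (10.0, 30.0)
y_lim = (0.0, 20.0)

# Label-based slices include both endpoints.
subset = da.sel(x=slice(*x_lim), y=slice(*y_lim))
print(subset.shape)
```

If a coordinate is stored in descending order, the slice bounds would need to be reversed to match, otherwise the selection comes back empty.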
We have various "regridding" methods in Iris, if that isn't too much of a context switch for you.
Xarray explains its relationship to Iris here, and provides a to_iris() method.

How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can I manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package I was able to create the same results I got from predict_proba().
It appears that sklearn uses its own softmax() function in their code. It is mathematically the usual softmax, but it subtracts each row's maximum before exponentiating to avoid overflow.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict_proba(X) or manually, using the sklearn function mentioned above, like this:
from sklearn.utils.extmath import softmax
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
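The same calculation can be sketched with NumPy alone and checked against predict_proba(). This assumes a recent scikit-learn release in which LogisticRegression with the default lbfgs solver fits a multinomial model for multiclass targets; the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 4-class problem.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear scores: one column per class.
scores = X @ model.coef_.T + model.intercept_

# Numerically stable softmax: subtract each row's maximum first.
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
manual_proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)

print(np.allclose(manual_proba, model.predict_proba(X)))
```

Subtracting the row maximum does not change the result, because the common factor cancels between numerator and denominator.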
Replicate sklearn calcs (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P=A/(1+A)
P /= P.sum(axis=1).reshape((-1, 1))
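This differs slightly from the softmax calculation and from the UCLA Stata example, but it works: it applies a sigmoid to each class score and then normalizes each row to sum to 1, which is what predict_proba does for one-vs-rest (multi_class="ovr") models. For a multinomial model, use the softmax calculation instead.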
