Using make_classification from sklearn for a binary classification problem, I get labels 0 and 1. Is there any way I can change it to get -1 instead of 0?
You can use np.where:
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification()
y_transformed = np.where(y == 0, -1, 1)  # map label 0 to -1, keep 1 as-is
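Since the labels are exactly 0 and 1, a simple arithmetic alternative gives the same result:

y_transformed = 2 * y - 1  # 0 becomes -1, 1 stays 1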
I'm trying to use sklearn.manifold.TSNE to visualize data that I sample from a generative model and compare the distribution of generated data vs training data (to measure 'extrapolation').
Here's how I'm doing it:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import joblib
import numpy as np
import pandas as pd
tsne = TSNE(n_components=2, random_state=0)
x_train = tsne.fit_transform(embds_train)
x_generated = tsne.fit_transform(embds_generated)
My question is: is it necessary to call tsne.fit_transform() on both the training and the generated embeddings, or could I fit only once and then project the other embeddings into the already fitted space?
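For what it's worth, scikit-learn's TSNE has no transform() method for unseen data, so a common workaround is to embed both sets jointly and split the result afterwards. A minimal sketch, assuming embds_train and embds_generated are 2-D arrays with the same number of columns:

import numpy as np
from sklearn.manifold import TSNE

# Embed both sets in one call so they share a single t-SNE space,
# then split the rows back apart.
combined = np.vstack([embds_train, embds_generated])
x_combined = TSNE(n_components=2, random_state=0).fit_transform(combined)
x_train = x_combined[:len(embds_train)]
x_generated = x_combined[len(embds_train):]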
I am struggling with a machine learning project in which I am trying to combine:
a sklearn ColumnTransformer to apply different transformers to my numerical and categorical features
a pipeline to apply my different transformers and estimators
a GridSearchCV to search for the best parameters.
As long as I fill in the parameters of my different transformers manually in my pipeline, the code works perfectly.
But as soon as I try to pass lists of different values to compare in my grid search parameters, I get all kinds of "invalid parameter" error messages.
Here is my code:
First, I divide my features into numerical and categorical:
import numpy as np
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)
Then I create 2 different preprocessing pipelines for numerical and categorical features:
numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))
I combined both into another pipeline, set my parameters, and ran my GridSearchCV code:
model = make_pipeline(preprocessor, LinearRegression())
params = {
    'columntransformer__numerical_pipeline__knnimputer__n_neighbors': [1, 2, 3, 4, 5, 6, 7]
}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv, scoring='r2')
I tried different ways to declare the parameters but never found the proper one; I always get an "invalid parameter" error message.
Could you please help me understand what went wrong?
Many thanks for your support, and take good care!
I am assuming that you might have defined preprocessor as follows:
preprocessor = Pipeline([('numerical_pipeline', numerical_pipeline),
                         ('cat_pipeline', cat_pipeline)])
In that case you need to change your parameter name to:
pipeline__numerical_pipeline__knnimputer__n_neighbors
But there are a couple of other problems with the code:
You don't have to call cross_val_score after performing GridSearchCV; the output of GridSearchCV itself holds the cross-validation result for each combination of hyperparameters.
KNNImputer will not work when your data contains string columns. You need to apply cat_pipeline before numerical_pipeline.
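If you are ever unsure what prefix GridSearchCV expects, you can list every valid parameter name of the composed model and search for the one you need:

# Every key printed here is a legal entry for param_grid; look for the
# one ending in 'knnimputer__n_neighbors' to get the exact nested name.
for name in sorted(model.get_params().keys()):
    print(name)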
Complete example:
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data with one categorical and one numerical column
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'rating': [5, 3, 4, 5]})
y = [1, 0, 1, 1]

numerical_features = make_column_selector(dtype_include=np.number)
cat_features = make_column_selector(dtype_exclude=np.number)

numerical_pipeline = make_pipeline(KNNImputer())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore', sparse=False))

# cat_pipeline comes first so that KNNImputer only ever sees numeric data
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
                         ('numerical_pipeline', numerical_pipeline)])
model = make_pipeline(preprocessor, LinearRegression())

# make_pipeline names the preprocessor step 'pipeline', hence the prefix
params = {
    'pipeline__numerical_pipeline__knnimputer__n_neighbors': [1, 2]
}
grid = GridSearchCV(model, param_grid=params, scoring='r2', cv=2)
grid.fit(X, y)
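Once fitted, the grid object already holds the cross-validation results, so no separate cross_val_score call is needed:

# Best hyperparameter combination and its mean cross-validated R^2
print(grid.best_params_)
print(grid.best_score_)
# Mean test score for every parameter combination tried
print(grid.cv_results_['mean_test_score'])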
I am working on this problem and am unsure how to proceed.
Load the R data set mtcars as a pandas dataframe.
Build a linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with data.
Perform ANOVA on the linear model obtained in the previous step. (Hint: use anova.anova_lm.)
Display the F-statistic value.
I saw that a solution was provided in another post below, but it doesn't seem to work:
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
Fixed it:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import anova

# Load the R mtcars data set as a pandas DataFrame
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data

# Regress log(mpg) on log(wt) and fit the model
model = smf.ols(formula='np.log(mpg) ~ np.log(wt)', data=mtcars).fit()

# ANOVA table for the fitted model, then the F statistic for np.log(wt)
print(anova.anova_lm(model))
print(anova.anova_lm(model).F["np.log(wt)"])
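Since anova_lm returns a pandas DataFrame indexed by model term, the F statistic can equivalently be read with .loc:

table = anova.anova_lm(model)
print(table.loc["np.log(wt)", "F"])  # same value as the .F accessor above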
I want to get the distribution of each feature in the cancer dataset using ggplot, but it's giving me an error.
# pip install plotnine
from plotnine import *
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
for i in cancer.feature_names:
    ggplot(cancer.data) + aes(x=i) + geom_bar(size=10)
This is the error message I got:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I would recommend using seaborn for that. Here is an example of plotting the distribution of each feature in the cancer dataset by target:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
# loading data
cancer = load_breast_cancer()
data = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
columns= np.append(cancer['feature_names'], ['target']))
df = data.melt(['target'], var_name='cols', value_name='vals')
g = sns.FacetGrid(df, col='cols', hue="target", palette="Set1", col_wrap=4)
g = g.map(sns.distplot, "vals", hist=True)
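Note that recent seaborn releases deprecate distplot; if you hit a warning, histplot is the drop-in replacement for this use case:

# Same grid as above, using the modern histplot API with a KDE overlay
g = sns.FacetGrid(df, col='cols', hue="target", palette="Set1", col_wrap=4)
g.map(sns.histplot, "vals", kde=True)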
Alternatively, with plotnine itself: convert cancer.data to a DataFrame first, since ggplot expects one, and print() each plot so it renders inside the loop.
from plotnine import *
from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

for i in data.columns:
    print(ggplot(data) + aes(x=i) + geom_density(size=1))
    print(ggplot(data) + aes(x=i) + geom_bar(size=10))
from sklearn import tree
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
dir(iris)

# data to train on: setosa, versicolor and virginica
x = iris.data
# fetching the input data, removing every 50th sample
x = np.delete(x, np.s_[::50], 0)
# print(x)
y = iris.target
# fetching the output labels, removing the same samples
y = np.delete(y, np.s_[::50], 0)

algo = tree.DecisionTreeClassifier
When I try to use fit, it does not work:
train = algo.fit(x, y)
res = train.predict([test_setosa])
print(res)
You need to change something in your code: DecisionTreeClassifier is a class, and the way you call it in your code is wrong.
Replace
algo=tree.DecisionTreeClassifier
with
algo=tree.DecisionTreeClassifier()
Full code (with a hypothetical test_setosa sample, since it is not defined in the question):
from sklearn import tree
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
dir(iris)

# data to train on: setosa, versicolor and virginica
x = iris.data
# fetching the input data, removing every 50th sample
x = np.delete(x, np.s_[::50], 0)
# print(x)
y = iris.target
# fetching the output labels, removing the same samples
y = np.delete(y, np.s_[::50], 0)

# note the parentheses: this creates an instance of the classifier
algo = tree.DecisionTreeClassifier()
train = algo.fit(x, y)

# test_setosa is an assumed example: sepal/petal measurements of a setosa flower
test_setosa = [5.1, 3.5, 1.4, 0.2]
res = train.predict([test_setosa])
print(res)