I want to use lmfit in order to fit my data.
The function I am using has only one argument, features. The content of features will be different (both columns and values), so I can't initialize parameters.
I tried to create a DataFrame as shown here, but I can't use the guess method because that is for LorentzianModel and I just want to use Model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit
from sklearn.linear_model import LinearRegression
df = {'a': [0, 0.2, 0.3], 'b':[14, 10, 9], 'target':[100, 200, 300]}
df = pd.DataFrame(df)
X = df[['a', 'b']]
y = df[['target']]
model = LinearRegression().fit(X, y)
features = pd.DataFrame({"a": np.array([0, 0.11, 0.36]),
                         "b": np.array([10, 14, 8])})

def eval_custom(features):
    res = model.predict(features)
    return res

x_val = features[["a"]].values

def calling_func(features, x_val):
    pred_custom = eval_custom(features)
    df = pd.DataFrame({'x': np.squeeze(x_val), 'y': np.squeeze(pred_custom)})

    themodel = lmfit.Model(eval_custom)
    params = themodel.guess(df['y'], x=df['x'])
    result = themodel.fit(df['y'], params, x=df['x'])
    result.plot_fit()

calling_func(features, x_val)
The model function needs to take independent variables and the individual model parameters as arguments. You're wrapping all of that into a single pandas DataFrame and then sending that. Don't do that.
If you need to create a dataframe from the current values of the model, do that inside your model function.
Also: a generic Model does not have a working guess method. Use model.make_params() and definitely, definitely (no exceptions, nope not ever) provide actual initial values for every parameter.
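For illustration, a minimal sketch of what that looks like, assuming a simple linear model with made-up slope and intercept parameters (the names and data are illustrative, not from the question):
import numpy as np
import lmfit

# The model function takes the independent variable and each parameter explicitly
def linear_model(x, slope, intercept):
    return slope * x + intercept

x = np.array([0.0, 0.11, 0.36])
y = np.array([100.0, 200.0, 300.0])

themodel = lmfit.Model(linear_model)
# A generic Model has no guess(); create the parameters explicitly
# and give every one of them an initial value
params = themodel.make_params(slope=1.0, intercept=0.0)

result = themodel.fit(y, params, x=x)
print(result.fit_report())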
I believe the error is telling me I have null values in my data and I've tried fixing it but the error keeps appearing. I don't want to delete the null data because I consider it relevant to my analysis.
The columns of my data are in this order: 'Titulo', 'Autor', 'Género', 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', **'Estado'**. The ones in bold are string data.
Code:
import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)
#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
#Handling missing values
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')
#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)
# Create a LabelEncoder instance
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])
labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])
labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])
labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])
#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)
y_pred = knn.predict(X)
print(y_pred)
Error Message:
ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
You have to fit and transform the data with the SimpleImputer you created. From the documentation:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean') # Here the imputer is created
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]]) # Here the imputer is fitted, i.e. learns the mean
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X)) # Here the imputer is applied, i.e. the missing values are filled in with the means
The crucial parts here are imp_mean.fit() and imp_mean.transform(X).
Additionally I'd use another technique to handle categorical data since LabelEncoder is not suitable here:
This transformer should be used to encode target values, i.e. y, and not the input X.
For alternatives see here: How to consider categorical variables in distance based algorithms like KNN or SVM?
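As an illustration of that alternative, here is a hedged sketch using one-hot encoding inside a ColumnTransformer; the split of the question's columns into categorical and numeric features is an assumption, and the pipeline is not the asker's code:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed split of the question's columns into categorical and numeric features
cat_cols = ['Titulo', 'Autor', 'Género']
num_cols = ['Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas']

preprocess = ColumnTransformer([
    # One-hot encode the string features instead of using LabelEncoder on X
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    # Fill missing numeric values with the column mean
    ('num', SimpleImputer(strategy='mean'), num_cols),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

# X_df would be the feature columns as a DataFrame and y the 'Estado' column, e.g.:
# pipe.fit(X_df, y)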
You need SimpleImputer to impute the missing values in X. We fit the imputer on X and then transform X to replace the NaN values with the mean of the column. After imputing the missing values, we encode the target variable using LabelEncoder.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)
I'm just looking for a little bit of help. I'm struggling to work out whether what I'm doing is right or not, or even whether Naive Bayes is the right way to do this.
I want the user to be able to input their Elo, and the 'app' to suggest an opening move set based on win rate at that Elo. For this I am using the following dataset: https://www.kaggle.com/datasnaek/chess
The important data out of this are the opening name (what I'm trying to suggest), the average rating (what the user can input), and the winner (we need to see if white wins).
This is my code so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from matplotlib.colors import ListedColormap
from sklearn import preprocessing
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Read in dataset
data = pd.read_csv("games.csv")
# set new column that is true/false depending on if white wins
data['white_wins'] = (data['winner'] == "white")
# Create new columns, average rating (based on white rating and black rating) and category (categorization of rating for Naive Bayes)
data['average_rating'] = data.apply(lambda row: (row['white_rating'] + row['black_rating']) / 2, axis=1)
data['category'] = data['average_rating'] // 100 + 1
# Drop unnecessary columns
data = data.drop(['turns', 'moves', 'victory_status', 'id', 'winner', 'rated', 'created_at', 'last_move_at', 'opening_ply', 'white_id', 'black_id', 'increment_code', 'opening_eco', 'white_rating', 'black_rating'], axis=1)
#Label Encoder Initialisation
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
opening_name_encoded=le.fit_transform(data.opening_name)
category_encoded=le.fit_transform(data.category)
label=le.fit_transform(data.white_wins)
#Package features together
features=zip(opening_name_encoded, category_encoded)
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
And I currently get the error:
error
Also, I'm not even convinced this is correct, as if I continue down this path I am only going to be predicting whether white wins based on the opening move set and Elo. I'm really unsure where to take this to get it to the point I need.
Thanks for any help!
zip returns an iterator, so your code is not doing what you expect. My guess is that you intended features to be a list of 2-tuples. If that is the case, then adjust your code to features = list(zip(opening_name_encoded, category_encoded))
In [31]: zip([1, 2, 3], ['a', 'b', 'c'])
Out[31]: <zip at 0x25d61abfd80>
In [32]: list(zip([1, 2, 3], ['a', 'b', 'c']))
Out[32]: [(1, 'a'), (2, 'b'), (3, 'c')]
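Applied to the question's code, the fix could then look like this sketch (opening_name_encoded, category_encoded and label are the arrays built earlier in the question; stacking them into a NumPy array is an alternative to a list of tuples):
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Stack the two encoded columns into a 2-D feature array of shape (n_samples, 2)
features = np.column_stack((opening_name_encoded, category_encoded))

model = GaussianNB()
model.fit(features, label)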
I was looking to implement a custom GLM using sklearn/scikit-learn. The same is possible with statsmodels; for example, using statsmodels we could use the code below:
import pandas as pd
import statsmodels.api as sm
data = [(300,1),(200,0),(170,1),(420,1),(240,1),(133,0),(323,1),(150,0),(230,0),(499,0)]
Labels = ['datapoint','value']
df = pd.DataFrame.from_records(data, columns=Labels)
glm_linear = sm.GLM(df.value, df.datapoint, family=sm.families.Gaussian(sm.families.links.identity()))
res = glm_linear.fit()
print(res.summary())
Here, as we can see, we can pass any link function and distribution using the family argument of sm.GLM.
I was looking for something similar in sklearn.
You can use sklearn TweedieRegressor with parameter power=0 to specify the normal distribution:
from sklearn.linear_model import TweedieRegressor
import pandas as pd
data = [(300,1), (200,0), (170,1), (420,1), (240,1), (133,0), (323,1), (150,0), (230,0), (499,0)]
Labels = ['datapoint','value']
df = pd.DataFrame.from_records(data, columns=Labels)
X, y = df.datapoint, df.value
glm_gaussian = TweedieRegressor(power=0, fit_intercept=False)
glm_gaussian.fit(X.to_numpy()[:, None], y)
print(glm_gaussian.coef_)
array([0.00173114])
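If other distributions are needed later, the same class covers them through its power and link parameters; a quick sketch, reusing X and y from above:
from sklearn.linear_model import TweedieRegressor

# power=0 -> Normal, power=1 -> Poisson, power=2 -> Gamma, power=3 -> Inverse Gaussian;
# link can be 'auto', 'identity' or 'log'
glm_poisson = TweedieRegressor(power=1, link='log')
glm_poisson.fit(X.to_numpy()[:, None], y)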
I am manually trying to build a linear regression model for understanding purposes, without using the built-in function. I am getting an error while plotting the regression line. Kindly help me fix it.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
data = {'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(np.ones(10), columns = ['ones'])
df_new = pd.concat([df2,df], axis = 1)
X = df_new.loc[:, ['ones', 'X']].values
Y = df_new['Y'].values.reshape(-1,1)
theta = np.array([0.5, 0.2]).reshape(-1,1)
Y_pred = X.dot(theta)
sb.lineplot(df['X'].values.reshape(-1,1),Y_pred)
plt.show()
Error message:
If using all scalar values, you must pass an index
You are passing a 2D array, while seaborn's lineplot expects a 1D array (or a pandas column, which is basically the same). So change it to:
sb.lineplot(df['X'].values,Y_pred.reshape(-1))
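Note that newer seaborn releases expect the data to be passed as keyword arguments, so an equivalent call (a sketch, assuming seaborn 0.12 or later) would be:
sb.lineplot(x=df['X'].values, y=Y_pred.reshape(-1))
plt.show()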
I am working on a project, and while getting the dummy values I was getting a memory exception.
I have tried using .astype(np.int8) and I have also tried writing exception-handling code by importing psutil.
I am using the code below:
dummy_cols = ['emp_title','grade','home_ownership','verification_status','addr_state','pub_rec','application_type']
df_dummies = pd.get_dummies(df[dummy_cols], drop_first = True)
It's not working and throws an error.
pandas.get_dummies creates a dense representation of the dummy variables, which may require a lot of memory depending on the number of levels in the categorical features.
I would prefer sklearn.preprocessing.OneHotEncoder, which outputs sparse matrices.
The code would look like this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create a fake dataframe
df = pd.DataFrame(
    {
        "df1": np.random.choice(["a", "b"], 100),
        "df2": np.random.choice(["c", "d"], 100)
    }
)
dummy_cols = ["df1", "df2"]
# LabelEncode categoricals
for f in dummy_cols:
    df[f] = LabelEncoder().fit_transform(df[f])
# Transform to dummies in sparse representation (csr_matrix)
df_dummies = OneHotEncoder().fit_transform(df[dummy_cols])
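The resulting df_dummies is a SciPy sparse matrix (csr_matrix), which many scikit-learn estimators accept directly, so, as a sketch with a hypothetical target just for illustration, it can be used without densifying it:
from sklearn.linear_model import LogisticRegression

# Hypothetical binary target, only to show that the sparse matrix can be fed to a model as-is
y = np.random.choice([0, 1], 100)
clf = LogisticRegression().fit(df_dummies, y)
If staying within pandas is preferred, pd.get_dummies also accepts sparse=True, which returns a DataFrame backed by sparse columns.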