I am running a panel regression using Python's linearmodels, something like:
import pandas as pd
from linearmodels.panel import PanelOLS
data = pd.read_csv('data.csv', sep=',')
data = data.set_index(['panel_id', 'date'])
controls = data[['A', 'B', 'C']]
controls['const'] = 1
model = PanelOLS(data.Y, controls, entity_effects=True)
result = model.fit(use_lsdv=True)
I really need to pull out the coefficient on the constant, but it looks like this does not work:
intercept = result.summary.const
I could not find the answer in linearmodels' documentation on GitHub.
More generally, does anyone know how to pull out the estimated coefficients from a linearmodels summary? Thank you!
result.params['const']
gives the intercept; in general, result.params returns the Series of estimated regression coefficients in linearmodels.
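For example (a quick sketch; the standard errors are exposed the same way on the results object):
intercept = result.params['const']   # estimated constant
coefs = result.params                # pandas Series of all coefficients
ses = result.std_errors              # matching standard errors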
I'm using the PyCaret library in Colab to make a simple prediction on this dataset:
https://www.kaggle.com/andrewmvd/fetal-health-classification
When I run my code:
from pycaret.utils import enable_colab
enable_colab()
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
from pycaret.classification import *
from pandas_profiling import ProfileReport
df = pd.read_csv("/content/drive/MyDrive/Pycaret/fetal_health.csv")
df2 = df.iloc[:,:11]
df2['fetal_health'] = df['fetal_health']
test = df2.sample(frac=0.10, random_state=42, weights='fetal_health')
train = df2.drop(test.index)
test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)
clf = setup(data=train, target='fetal_health', session_id=42,
            log_experiment=True, experiment_name='fetal', normalize=True)
best = compare_models(sort="Accuracy")
rf = create_model('rf', fold=30)
tuned_rf = tune_model(rf, optimize='Accuracy')
predict_model(tuned_rf)
I get this error:
I think this is because my target variable is imbalanced (see img), and that is causing the predictions to be incorrect.
Can someone please help me understand?
Thanks in advance.
Have you run each step in a separate cell to check the outputs?
Run
clf = setup(data=train, target='fetal_health', session_id=42,
            log_experiment=True, experiment_name='fetal', normalize=True)
and check:
Are all variable types correctly inferred? (E.g., using your code with the Kaggle dataset of the same name, all variables show as numeric except severe_decelerations, which shows as "Categorical". Is that correct?)
Is there any preprocessing configuration that needs to change? I'm sure your issue has nothing to do with an imbalanced target variable, but you can test that yourself by changing your setup, adding fix_imbalance=True to change the default (it shows as False when you check the setup output), as in the sketch below.
You can learn more about the available preprocessing configurations here:
https://pycaret.gitbook.io/docs/get-started/preprocessing
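For example, the imbalance check is just your same setup call with the flag added:
clf = setup(data=train, target='fetal_health', session_id=42,
            log_experiment=True, experiment_name='fetal',
            normalize=True, fix_imbalance=True)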
Also, while troubleshooting, you can save yourself some work by using
best_model = create_model(best, fold=30)
predict_model(best_model)
(No need to look up the best model to add it manually to create_model(),
or to use tune_model(), until you have the model working.)
I found what the problem was:
My target variable begins at value 1 and has 3 different values. This causes an error when PyCaret builds a list comprehension (because it is zero-indexed).
To solve it, I just transformed my variable to begin at zero, and it worked fine.
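In my case that just meant shifting the labels down by one (column name as in this dataset):
# remap the target labels {1, 2, 3} -> {0, 1, 2} before calling setup()
df2['fetal_health'] = df2['fetal_health'] - 1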
Leandro,
thank you so much for your solution! I was having the same problem with the same dataset!
A. Beal, I tried your solution, but still the same error message appeared, so I tried Leandro's solution, and the problem was, in fact, the target beginning with 1, and not 0. Thank you for your suggestion on how to reduce the code!
As part of a school assignment on DSLs and code generation, I have to translate the following program, written in Python/scikit-learn, into R (the topic of the exercise is a hypothetical machine-learning DSL).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('boston.csv', sep=',')
df.head()
y = df["medv"]
X = df.drop(columns=["medv"])
clf = DecisionTreeRegressor()
scoring = ['neg_mean_absolute_error','neg_mean_squared_error']
results = cross_validate(clf, X, y, cv=6, scoring=scoring)
print('mean_absolute_errors = '+str(results['test_neg_mean_absolute_error']))
print('mean_squared_errors = '+str(results['test_neg_mean_squared_error']))
Since I'm a complete newbie in machine learning, and especially in R, I can't do it.
Could someone help me?
Sorry for the late answer; you have probably already finished your school assignment. Of course, we cannot just do it for you; you will probably have to figure it out by yourself. Moreover, I don't know exactly what you need to do, but here are some tips:
Read a csv file
data <- read.csv(file="name_of_the_file", header=TRUE, sep=",")
data <- as.data.frame(data)
header=TRUE indicates that the file has one row containing the column names; sep="," is the same as in Python (the separator in the file is ',').
as.data.frame makes sure that your data is kept in a data frame format.
Add/delete a column
data <- data[, names(data) != "name_of_the_column_to_be_deleted"]  # delete a column
data$name_of_column_to_be_added <- c(1:10)  # add a column
In order to add a column, you need to supply the elements it will contain. Also, the # symbol indicates the beginning of a comment.
Modelling
For the modelling part, I am not sure what you want to achieve, but R offers a huge selection of algorithms to choose from. For example, if you want to grow a tree, take a look at https://www.statmethods.net/advstats/cart.html, which uses the following script:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method="class", data=kyphosis)
Suppose I have the following DataFrame:
import pandas as pd, numpy as np, statsmodels.formula.api as smf

# Generate the data
Stocks = 100
mean = [0.5, 1000, 10]
Var = [0.5, 60, 3]
A = np.random.normal(loc=0.5, scale=0.5, size=(Stocks, 1))
for a, b in zip(mean, Var):
    A = np.concatenate((A, np.random.normal(loc=a, scale=b, size=(Stocks, 1))), axis=1)
df1 = pd.DataFrame(A, columns=['Betas', 'M/B', 'Size', 'P/E'])
df1['PAR_stock'] = 0.08 + 0.801*df1['Size'] + 0.321*df1['M/B'] + 0.164*df1['P/E'] - 0.084*df1['Betas']
This gives me my DataFrame. I now want to select the combination of Betas, Size, P/E, and M/B that gives the best fit.
formula = 'PAR_stock ~ Betas + Size + Q("P/E") + Q("M/B")'
results = smf.ols(formula, df1).fit()
print(results.summary())
I want Python to try each combination and tell me which variables are the best to use in an OLS regression, i.e. which specification is the best model.
Is there a way to do this in Python using machine-learning code?
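For concreteness, a brute-force version of what I mean might look like this rough sketch (my own illustration, reusing df1 and smf from above, with AIC as an arbitrary selection criterion):
from itertools import combinations

predictors = ['Betas', 'Size', 'Q("P/E")', 'Q("M/B")']
best = None
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        formula = 'PAR_stock ~ ' + ' + '.join(subset)
        fit = smf.ols(formula, df1).fit()
        if best is None or fit.aic < best[0]:
            best = (fit.aic, formula)
print(best)  # (lowest AIC, its formula)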
To the best of my knowledge, there is a library in R called glmulti; is there something similar in Python?
PS: I am still new to this, so please do not be harsh in your comments. If you have any suggestions, or a book that explains these things explicitly, feel free to share it. Thank you for your cooperation.
I have a quick question. My code looks like this:
import quandl
names_of_company = ['KGHM','INDYKPOL','KRUK','KRUSZWICA']
for names in names_of_company:
    x = quandl.get('WSE/{names_of_company}', start_date='2018-11-26',
                   end_date='2018-11-29')
I am trying to get all the data in one output, but I can't substitute each company's name into the query one after another. Do you have any ideas?
Thanks for the help.
Unless I'm missing something, it looks like you should just be able to do a pretty basic for loop; it was the string syntax that was incorrect.
import quandl
import pandas as pd

names_of_company = ['KGHM', 'INDYKPOL', 'KRUK', 'KRUSZWICA']
frames = []
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-26',
                   end_date='2018-11-29')
    x['company'] = names       # tag each frame with its company name
    frames.append(x)
results = pd.concat(frames).reset_index(drop=True)
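On Python 3.6+, an f-string does the same substitution:
x = quandl.get(f'WSE/{names}', start_date='2018-11-26',
               end_date='2018-11-29')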
I am practicing machine learning and working with a movie/rating dataset. I am trying to create a new column in the dataframe which numerically identifies each genre (around 1,300 of them). My logic was to create a dictionary of the unique genres, labelling each with an integer, then write a for loop that iterates through each row of the dataframe, checks the genre, and assigns the appropriate value to a new column named "genre_id". However, this causes what looks like an infinite loop, which I cannot even break with Ctrl-C. The same happens in Jupyter (Interrupt Kernel fails to stop it). Below is a summarized version of my approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
movies_data = pd.read_csv("C://mypython/moviedata/movies.csv")
ratings_data = pd.read_csv("C://mypython/moviedata/ratings.csv")
joined = pd.merge(movies_data, ratings_data, how='inner', on=['movieId'])
print(joined.head())
pd.options.display.float_format = '{:,.2f}'.format
genres = joined['genres'].unique()
genre_dict = {}
Id = 1
for i in genres:
    genre_dict[i] = Id
    Id += 1
joined['genre_id'] = 0
increment = 0
for i in joined['genres']:
    if i in genre_dict:
        joined['genre_id'][increment] = genre_dict[i]
    increment += 1
I know I should probably take a smaller sample to work with, as there are about 20,000,000 rows in the dataset, but I figured I'd try this as an exercise.
I also receive the SettingWithCopyWarning, though this hasn't caused me issues in my past projects. Any thoughts on how to do this would be greatly appreciated.
EDIT: Found a solution using the Series map feature:
joined['genre_id'] = joined.genres.map(genre_dict)
I don't have enough reputation to comment, so I'll post this as a suggestion: the standard way to handle categorical values in a dataset is the built-in sklearn.preprocessing.OneHotEncoder, which does the work you are trying to do.
For a better understanding with examples, check One Hot Encode Sequence Data in Python. Let me know if this works for you.
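A minimal sketch (assuming the joined dataframe from your code; OneHotEncoder expects a 2-D input):
from sklearn.preprocessing import OneHotEncoder

# one indicator column per unique genre, instead of a single integer id
enc = OneHotEncoder(handle_unknown='ignore')
genre_matrix = enc.fit_transform(joined[['genres']])  # sparse matrix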