Sum of residuals of scipy regression model - python-3.x

I am going through a stats workbook with Python, and I am stuck on a hands-on practice question. It is related to Poisson regression; here is the problem statement:
Perform the following tasks:
1) Load the R data set Insurance from the MASS package and capture the data as a pandas data frame.
2) Build a Poisson regression model with the log of an independent variable, Holders, and the dependent variable Claims.
3) Fit the model with data.
4) Find the sum of residuals.
I am stuck on point 4 above. Can anyone help with this step?
Here is what I have done so far:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df = sm.datasets.get_rdataset('Insurance', package='MASS', cache=False).data
poisson_model = smf.poisson('np.log(Holders) ~ -1 + Claims', df)
poisson_result = poisson_model.fit()
print(poisson_result.summary())
Now, how do I get the sum of residuals?

np.sum(poisson_result.resid)
works fine.
You have used the wrong variables to build the Poisson model, as pointed out by Karthikeyan: the dependent variable Claims belongs on the left-hand side of the formula.
Use this instead:
poisson_model = smf.poisson('Claims ~ np.log(Holders)', df)
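As a quick sanity check that does not need the R data set download, here is a minimal sketch on synthetic data. The column names match the question, but the values are made up (the 0.2 claim rate, the sample size and the seed are assumptions for illustration only). Because the model includes an intercept, the fitted Poisson means should add up to the observed counts, so the residual sum should come out at essentially zero, up to the optimizer's tolerance.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Insurance data (values are made up).
rng = np.random.default_rng(0)
holders = rng.integers(20, 200, size=64)
claims = rng.poisson(0.2 * holders)
df = pd.DataFrame({"Holders": holders, "Claims": claims})

# Dependent variable on the left, log of the predictor on the right.
result = smf.poisson("Claims ~ np.log(Holders)", data=df).fit(disp=0, maxiter=100)

# For count models, resid is the observed count minus the predicted mean.
resid_sum = np.sum(result.resid)
print(resid_sum)
```

If the grader expects a value that is far from zero, it is worth double-checking which kind of residual it has in mind.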

Try the code below for Fresco Play:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
df_insurance=sm.datasets.get_rdataset("Insurance","MASS")
df_data=df_insurance.data
insurance_model=smf.poisson('Claims ~ np.log(Holders)', df_data).fit()
print(np.cumsum(insurance_model.resid))

1.a) Load the R data set Insurance from MASS package
1.b) and Capture the data as pandas data frame
2) Build a Poisson regression model with a log of an independent variable, Holders and dependent variable Claims.
3) Fit the model with data.
4) Find the sum of residuals.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
# load the R data set Insurance from the MASS package
ins = sm.datasets.get_rdataset('Insurance','MASS').data
# capture the data as a pandas data frame
# (.data is already a pandas DataFrame, so this wrapper is optional)
ins_pd = pd.DataFrame(ins)
# build a poisson regressions model with
# a log of an independent variable "Holders"
# and dependent variable "Claims"
# fit the model with data
result = smf.poisson('Claims ~ np.log(Holders)',data=ins).fit()
# you can also use
# model = smf.poisson('Claims ~ np.log(Holders)',data=ins)
# result = model.fit()
# Find the sum of residuals
print('Sum of the residuals:',np.sum(result.resid))
I'm new to this, so I don't know whether capturing the data as a pandas DataFrame like this is fine or not, but let me know.
Greetings

Fresco Mex
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
df_data=sm.datasets.get_rdataset("Insurance","MASS").data
df_dataf= pd.DataFrame(df_data)
insurance_model=smf.poisson('Claims ~ np.log(Holders)',df_data)
insurance_model_result=insurance_model.fit()
print(np.sum(insurance_model_result.resid))

In the statement poisson_model = smf.poisson('np.log(Holders) ~ -1 + Claims', df), the dependent variable "Claims" should be on the left-hand side of the formula:
poisson_model = smf.poisson('Claims ~ np.log(Holders)-1 ', df)

This passed in "Fresco", if anyone is looking for the solution:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df_insurance=sm.datasets.get_rdataset("Insurance","MASS")
df_data=df_insurance.data
insurance_model=smf.poisson('Claims ~ np.log(Holders)',df_data)
insurance_model_result=insurance_model.fit()
res=(insurance_model_result.resid)
print(np.sum(res))

I don't know whether this will work or not, but I referred to these docs:
https://vincentarelbundock.github.io/Rdatasets/doc/MASS/Insurance.html
https://vincentarelbundock.github.io/Rdatasets/datasets.html
So I hope this will work too:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
data=pd.DataFrame(sm.datasets.get_rdataset("Insurance","MASS",cache=True).data)
model=smf.poisson('Claims ~ District + Group + Age + np.log(Holders)',data).fit()
print(np.sum(model.resid))

Try np.cumsum(model.resid) for this question.
Ideally, np.sum(model.resid) should be the right answer, since np.cumsum returns the running totals (whose last element is the sum) rather than a single number. But if the system does not accept it, try the cumsum.
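To make the difference between the two calls concrete, here is a small sketch on a made-up residual vector (the three numbers are purely illustrative and are not taken from the Insurance model):

```python
import numpy as np

# Hypothetical residuals, chosen only to illustrate the two functions.
resid = np.array([1.5, -0.5, -1.0])

total = np.sum(resid)       # a single number
running = np.cumsum(resid)  # one running total per element

print(total)
print(running)
```

The last entry of the cumulative sum always equals the plain sum, so a grader that accepts the cumsum output is effectively looking at the same quantity plus the intermediate totals.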

Related

Annotating clustering from DBSCAN to original Pandas DataFrame

I have working code that is utilizing dbscan to find tight groups of sparse spatial data imported with pd.read_csv.
I am maintaining the original spatial data locations and would like to annotate the labels returned by dbscan for each data point to the original dataframe and then write a csv with the same information.
So the code below is doing exactly what I would expect it to at this point, I would just like to extend it to import the label for each row in the original dataframe.
import argparse
import string
import os, subprocess
import pathlib
import glob
import gzip
import re
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.cluster import DBSCAN
X = pd.read_csv(tmp_csv_name)
X = X.drop('Name', axis = 1)
X = X.drop('Type', axis = 1)
X = X.drop('SomeValue', axis = 1)
# only columns 'x' and 'y' now remain
db=DBSCAN(eps=EPS, min_samples=minSamples, metric='euclidean', algorithm='auto', leaf_size=30).fit(X)
labels = db.labels_
unique_labels = set(labels)
# maxX , maxY are manual inputs temporarily
while sizeX > 16 or sizeY > 16 :
    sizeX=sizeX*0.8 ; sizeY=sizeY*0.8
fig, ax = plt.subplots(figsize=(sizeX,sizeY))
plt.xlim(0,maxX)
plt.ylim(0,maxY)
plt.scatter(X['x'], X['y'], c=colors, marker="o", picker=True)
# hackX , hackY are manual inputs temporarily
# which represent the boundaries defined in the original dataset
poly = patches.Polygon(xy=list(zip(hackX,hackY)), fill=False)
ax.add_patch(poly)
plt.show()
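One possible way to do the annotation, sketched under assumptions: DBSCAN's labels_ array has one entry per input row, in input order, so it can be assigned straight back as a new column as long as the clustering was run on a column subset of the same, unfiltered frame. The data, the eps and min_samples values and the column name "cluster" below are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Made-up stand-in for the CSV: two tight groups and one isolated point.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e", "f"],
    "x": [0.0, 0.0, 0.1, 5.0, 5.0, 10.0],
    "y": [0.0, 0.1, 0.0, 5.0, 5.1, 0.0],
})

# Cluster on the coordinate columns only, so df keeps every original column.
coords = df[["x", "y"]]
db = DBSCAN(eps=0.5, min_samples=2, metric="euclidean").fit(coords)

# labels_ is aligned with the input rows; -1 marks noise points.
df["cluster"] = db.labels_

# Passing a file path to to_csv would write a file instead of returning text.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Selecting the columns with df[["x", "y"]] instead of dropping the others from X is a design choice here: it leaves the original frame intact, so there is nothing to merge back afterwards.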

A function to insert data in dataset using python

I created a program that predicts digits from a dataset. When it makes a prediction, I want two cases: if the prediction is right, the data should be added to the dataset automatically; otherwise, the program should take the right answer from the user and insert that into the dataset.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pt
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("train.csv").values
clf = DecisionTreeClassifier()
xtrain = data[0:21000,1:]
train_label=data[0:21000,0]
clf.fit(xtrain,train_label)
xtest = data[21000: ,1:]
actual_label=data[21000:,0]
d = xtest[9]
d.shape = (28,28)
pt.imshow(d,cmap='gray')
print(clf.predict([xtest[9]]))
pt.show()
I'm not sure I'm following your question, but if you want to distinguish between right and wrong predictions and handle them differently, you have to do that explicitly:
predictions = clf.predict(xtest)
good_predictions = xtest[pd.Series(predictions == actual_label)]
bad_predictions = xtest[pd.Series(predictions != actual_label)]
So good_predictions will contain all the rows of xtest that were predicted correctly, and bad_predictions the rest.
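The same split can be sketched with plain NumPy boolean masks. The arrays below are tiny made-up stand-ins for xtest, actual_label and the classifier's predictions:

```python
import numpy as np

# Made-up stand-ins: four samples with two features each.
xtest = np.array([[0, 1], [2, 3], [4, 5], [6, 7]])
actual_label = np.array([0, 1, 1, 0])
predictions = np.array([0, 1, 0, 0])  # pretend output of clf.predict(xtest)

# Boolean mask: True where the prediction matches the true label.
correct = predictions == actual_label

good_predictions = xtest[correct]   # rows predicted correctly
bad_predictions = xtest[~correct]   # rows that need a corrected label

print(good_predictions)
print(bad_predictions)
```

From there, the rows in bad_predictions are the ones for which the program would have to ask the user for the right label before adding them to the dataset.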

Python - Need help in solving "Load the R data set mtcars as a pandas dataframe." problem

I am working on this problem and unsure on how to proceed.
Load the R data set mtcars as a pandas dataframe.
Build a linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with data.
Perform ANOVA on the linear model obtained in the previous step.(Hint:Use anova.anova_lm)
Display the F-statistic value.
I saw the solution below in another post, but it doesn't seem to work.
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
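print(liner_result.rsquared)
Fixed it. The dependent variable np.log(mpg) goes on the left-hand side of the formula, and anova_lm reports the F-statistic: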
import statsmodels.api as sm
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import anova
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
model = smf.ols(formula='np.log(mpg) ~ np.log(wt)', data=mtcars).fit()
print(anova.anova_lm(model))
print(anova.anova_lm(model).F["np.log(wt)"])
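For a version that runs without downloading mtcars, here is a sketch on synthetic data. The column names wt and mpg mirror the question, but the numbers are generated and the power-law relationship is an assumption made only to give the regression something to fit. With a single regressor, the F value in the ANOVA table should coincide with the model's overall F-statistic, which gives a simple cross-check.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in for mtcars (values are made up).
rng = np.random.default_rng(1)
wt = rng.uniform(1.5, 5.5, size=32)
mpg = 40.0 * wt ** -0.8 * np.exp(rng.normal(0.0, 0.1, size=32))
cars = pd.DataFrame({"wt": wt, "mpg": mpg})

# Log of the dependent variable on the left, log of the predictor on the right.
model = smf.ols("np.log(mpg) ~ np.log(wt)", data=cars).fit()

# The ANOVA table has one row per model term plus a Residual row.
table = anova_lm(model)
f_stat = table.loc["np.log(wt)", "F"]
print(f_stat, model.fvalue)
```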

Why Python is not responding "windows can try to restore the program. If you restore or close the program, you might lose information."

I am using Python 3.x in a Jupyter notebook. What I want to do is draw some R plots in a shell in the Jupyter notebook. However, the problem is that when I do so, the plot is drawn in another window rather than in the shell. When I close that window, the Jupyter notebook gives me a "Dead Kernel" error.
My code is:
# To fit a restricted VAR model we will have to call the function from R
# Call function from R
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
# Calling packages
import pandas as pd, numpy as np
# Test for serial correlation
MTS = importr("MTS", lib_loc = "C:/Users/Rami Chehab/Documents/R/win-library/3.3")
RMTSmq=MTS.mq
# Create data
df = pd.DataFrame(np.random.random((108, 2)), columns=['Number1','Number2'])
# Test for data
RMTSmq(df, adj=4)
After I close the program, I get this:
Can someone help me? I would like to plot the graph inside the Jupyter notebook, if possible.
Thank you
I would like to thank my great friend Sarunas for his wild idea. He suggested a way out: "Instead of showing the picture (in the window), you could try to write it as a PNG or another image format, then open it with some photo viewer."
That is exactly what I have done!
For example, suppose I would like to show a figure from R, say the plot of sin(x) from x = -pi to 2*pi:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
grdevices = importr('grDevices')
grdevices.png(file="Rami1.png", width=512, height=512)
p = ro.r('curve(sin, -pi, 2*pi)')
grdevices.dev_off()
print()
from IPython.display import Image
Image("Rami1.png")
The output is
Here is another answer to the question above, this time using the same function and arguments as in the question:
import rpy2.robjects as ro, pandas as pd, numpy as np
from rpy2.robjects.packages import importr
# Call function from R
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.numpy2ri import numpy2ri
from rpy2.robjects.packages import importr
# To plot drawings in R
grdevices = importr('grDevices')
# Save the figure as Rami1.png
grdevices.png(file="Rami1.png", width=512, height=512)
# We are interested in finding out whether there is any serial correlation in the multivariate residuals
# Fitting a VAR here would be cumbersome, so assume
# the residuals resi are as follows
resi = pd.DataFrame(np.random.random((108, 2)), columns=['Number1','Number2'])
# first convert the values of the dataframe to a numpy array
resi1=np.array(resi, dtype=float)
# Taking the variable from Python to R
r_resi = numpy2ri(resi1)
# Creating this variable in R (from python)
r.assign("resi", r_resi)
# Calling libraries in R for mq to function which is MTS
r('library("MTS")')
# Calling a function in R (from python)
p = ro.r('result <-mq(resi,adj=4)')
grdevices.dev_off()
from IPython.display import Image
Image("Rami1.png")

Sklearn.mixture.dpgmm not functioning correctly

I'm having trouble with sklearn.mixture.dpgmm. The main issue is that it is not returning correct covariances for synthetic data (2 separated 2D gaussians), where it really should have no issue. In particular, when I do dpgmm._get_covars(), the covariance matrices have diagonal elements that are always exactly 1.0 too large, regardless of the input data distributions. This seems like a bug, as gmm works perfectly (when limiting to known exact number of groups)
Another issue is that dpgmm.weights_ makes no sense, they sum to one but the values appear meaningless.
Does anyone have a solution to this or see something clearly wrong with my example?
Here is the exact script I'm running:
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
import pdb
from sklearn import mixture
# Generate 2D random sample, two gaussians each with 10000 points
rsamp1 = np.random.multivariate_normal(np.array([5.0,5.0]),np.array([[1.0,-0.2],[-0.2,1.0]]),10000)
rsamp2 = np.random.multivariate_normal(np.array([0.0,0.0]),np.array([[0.2,-0.0],[-0.0,3.0]]),10000)
X = np.concatenate((rsamp1,rsamp2),axis=0)
# Fit a mixture of Gaussians with EM using 2
gmm = mixture.GMM(n_components=2, covariance_type='full',n_iter=10000)
gmm.fit(X)
# Fit a Dirichlet process mixture of Gaussians using 10 components
dpgmm = mixture.DPGMM(n_components=10, covariance_type='full',min_covar=0.5,tol=0.00001,n_iter = 1000000)
dpgmm.fit(X)
print("Groups With data in them")
print(np.unique(dpgmm.predict(X)))
##print the input and output covars as example, should be very similar
correct_c0 = np.array([[1.0,-0.2],[-0.2,1.0]])
print("Input covar")
print(correct_c0)
covars = dpgmm._get_covars()
c0 = np.round(covars[0],decimals=1)
print("Output Covar")
print(c0)
print("Output Variances Too Big by 1.0")
According to the DPGMM docs, this class is deprecated in version 0.18 and will be removed in version 0.20.
You should use the BayesianGaussianMixture class instead, with the parameter weight_concentration_prior_type set to "dirichlet_process".
Hope it helps.
Instead of writing
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
you should write:
from sklearn.mixture import BayesianGaussianMixture
gmm = BayesianGaussianMixture(n_components=2, covariance_type='full', random_state=0)
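Putting the two suggestions together, a minimal sketch of the Dirichlet-process variant on data shaped like the question's two Gaussians could look as follows. The sample size, the cap of 5 components, max_iter and the seeds are assumptions chosen to keep the example quick; no particular covariance values are claimed here, so inspect the printed output yourself.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two separated 2D Gaussians, as in the question, but with fewer points.
rng = np.random.default_rng(0)
a = rng.multivariate_normal([5.0, 5.0], [[1.0, -0.2], [-0.2, 1.0]], size=500)
b = rng.multivariate_normal([0.0, 0.0], [[0.2, 0.0], [0.0, 3.0]], size=500)
X = np.concatenate((a, b), axis=0)

# n_components is an upper bound; the Dirichlet process prior can leave
# some components with negligible weight.
bgm = BayesianGaussianMixture(
    n_components=5,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

# Public attributes replace the private _get_covars() call.
print(np.round(bgm.weights_, 3))
print(np.round(bgm.covariances_, 2))
used = np.unique(bgm.predict(X))
print(used)
```

Comparing the covariances of the components that actually receive data against the input matrices is then the same check the question performs with the old class.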
