ANOVA Feature Selection in python - python-3.x

data=pd.read_csv("https://raw.githubusercontent.com/sharmaroshan/Online-Shoppers-Purchasing- Intention/master/online_shoppers_intention.csv")
I'm trying to perform feature selection based on ANOVA (Categorical vs Numerical variable).
dependent variable: Revenue
independent variable: Administrative,Administrative_Duration
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
model = ols('Revenue ~ Informational',data = data).fit()
anova_table=anova_lm(model)
But getting an following error,
Value Error(Shape issue)

The problem is related to the column Revenue in data, because it is boolean.
In fact, if you convert from boolean to integer then it works:
data.Revenue = data.Revenue.astype(int)

Related

What would the equivalent machine learning program in R language of this Python one?

As part of a school assignment on DSL and code generation, I have to translate the following program written in Python/Scikit-learn into R language (the topic of the exercise is an hypothetic Machine Learning DSL).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('boston.csv', sep=',')
df.head()
y = df["medv"]
X = df.drop(columns=["medv"])
clf = DecisionTreeRegressor()
scoring = ['neg_mean_absolute_error','neg_mean_squared_error']
results = cross_validate(clf, X, y, cv=6,scoring=scoring)
print('mean_absolute_errors = '+str(results['test_neg_mean_absolute_error']))
print('mean_squared_errors = '+str(results['test_neg_mean_squared_error']))
Since I'm a perfect newbie in Machine Learning, and especially in R, I can't do it.
Could someone help me ?
Sorry for the late answer, probably you have already finished your school assignment. Of course we cannot just do it for you, you probably have to figure it out by yourself. Moreover, I don't get exactly what you need to do. But some tips are:
Read a csv file
data <-read.csv(file="name_of_the_file", header=TRUE, sep=",")
data <-as.data.frame(data)
The header=TRUE indicates that the file has one row which includes the names of the columns, the sep=',' is the same as in python (the seperator in the file is ',')
The as.data.frame makes sure that your data is kept in a dataframe format.
Add/delete a column
data<- data[,-"name_of_the_column_to_be_deleted"] #delete a column
data$name_of_column_to_be_added<- c(1:10) #add column
In order to add a column you will need to add the elements it will include. Also the # symbol indicates the beginning of a comment.
Modelling
For the modelling part I am not sure about what you want to achieve, but R offers a huge selection of algorithms to choose from (i.e. if you want to grow a tree take a look into the page https://www.statmethods.net/advstats/cart.html where it uses the following script to grow a tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis))

Reprojecting Xarray Dataset

I'm trying to reproject a Lambert Conformal dataset to Plate Carree. I know that this can easily be done visually using cartopy. However, I'm trying to create a new dataset rather than just show a reprojected image. Below is methodology I have mapped out but I'm unable to subset the dataset properly (Python 3.5, MacOSx).
from siphon.catalog import TDSCatalog
import xarray as xr
from xarray.backends import NetCDF4DataStore
import numpy as np
import cartopy.crs as ccrs
from scipy.interpolate import griddata
import numpy.ma as ma
from pyproj import Proj, transform
import metpy
# Declare bounding box
min_lon = -78
min_lat = 36
max_lat = 40
max_lon = -72
boundinglat = [min_lat, max_lat]
boundinglon = [min_lon, max_lon]
# Load the dataset
cat = TDSCatalog('https://thredds.ucar.edu/thredds/catalog/grib/NCEP/HRRR/CONUS_2p5km/latest.xml')
dataset_name = sorted(cat.datasets.keys())[-1]
dataset = cat.datasets[dataset_name]
ds = dataset.remote_access(service='OPENDAP')
ds = NetCDF4DataStore(ds)
ds = xr.open_dataset(ds)
# parse the temperature at 850 and # 0z reftime
tempiso = ds.metpy.parse_cf('Temperature_isobaric')
t850 = tempiso[0][2]
# transform bounding lat/lons to src_proj
src_proj = tempiso.metpy.cartopy_crs #aka lambert conformal conical
extents = src_proj.transform_points(ccrs.PlateCarree(), np.array(boundinglon), np.array(boundinglat))
# subset the data using the indexes of the closest values to the src_proj extents
t850_subset = t850[(np.abs(tempiso.y.values - extents[1][0])).argmin():(np.abs(tempiso.y.values - extents[1][1])).argmin()][(np.abs(tempiso.x.values - extents[0][1])).argmin():(np.abs(tempiso.x.values - extents[0][0])).argmin()]
# t850_subset should be a small, reshaped dataset, but it's shape is 0x2145
# now use nplinspace, npmeshgrid & scipy interpolate to reproject
My transform point > find nearest value subsetting isn't working. It's claiming the closest points are outside the realm of the dataset. As noted, I plan to use nplinspace, npmeshgrid and scipy interpolate to create a new, square lat/lon dataset from t850_subset.
Is there an easier way to resize & reproject an xarray dataset?
Your easiest path forward is to take advantage of xarray's ability to do pandas-like data selection; this is IMO the best part of xarray. Replace your last two lines with:
# By transposing the result of transform_points, we can unpack the
# x and y coordinates into individual arrays.
x_lim, y_lim, _ = src_proj.transform_points(ccrs.PlateCarree(),
np.array(boundinglon), np.array(boundinglat)).T
t850_subset = t850.sel(x=slice(*x_lim), y=slice(*y_lim))
You can find more information in the documentation on xarray's selection and indexing functionality. You would probably also be interested in xarray's built-in support for interpolation. And if interpolation methods beyond SciPy's are of interest, MetPy also has a suite of other interpolation methods.
We have various "regridding" methods in Iris, if that isn't too much of a context switch for you.
Xarray explains its relationship to Iris here, and provides a to_iris() method.

Not able to make a class in Python and output it to excel

I'm fairly new in Python and working in a inventory management position.
One important thing in inventory management is calculating the safety stock.
So, this is what I'm trying to achieve.
I have imported a file with 3 columns; FR, sigma and LT for 3 rows. See hereunder the code and the output:
code:
import pandas as pd
df = pd.read_excel("Desktop\\TestPython4.xlsx")
xcol=["FR","sigma","LT"]
x=df[xcol].values
output:
snapshot
To calculate the safety stock, I have the following (simplified) formula of it;
CDF(FR)*sigma*sqrt(LT)
where CDF is the cumulative distribution function of the normal distribution and FR is a number between 0 and 1 (thus the well-knowned z-value is the output).
I want to output a the file with an extra column that displays the safety stock.
For this I made a class safetystock with the following code:
class Safetystock:
def __init__(self,FR,sigma,LT):
self.FR = FR
self.sigma = sigma
self.LT = LT
pass
def calculate():
SS=st.norm.ppf(FR)
return print(SS*sigma*np.sqrt(LT))
pass
Then I made the variable: "output"
Output = Safetystock(df.FR,df.sigma,df.LT)
I said that the data in the file needs to be taken into account.
Then I added a column to df, named output that needs to contain the variable "Output":
df["output"]=Output
Now, when I want to call df, it gives me this:
actual output
What am I doing wrong?
Cheers,
Steven
What about
import pandas as pd
import numpy as np
import scipy.stats as st
df = pd.read_excel("Desktop\\TestPython4.xlsx")
df["output"] = st.norm.ppf(df.FR)*df.sigma*np.sqrt(df.LT)

RuntimeWarning: divide by zero encountered in log when using pvlib

I'm using PVLib to model a PV system. I'm pretty new to coding and Python, and this is my first time using PVLib, so not surprisingly I've hit some difficulties.
Specifically, I've got created the following code using the extensive readthedocs examples at http://pvlib-python.readthedocs.io/en/latest/index.html
import pandas as pd
import numpy as np
from numpy import isnan
import datetime
import pytz
# pvlib imports
import pvlib
from pvlib.forecast import GFS, NAM, NDFD, HRRR, RAP
from pvlib.pvsystem import PVSystem, retrieve_sam
from pvlib.modelchain import ModelChain
# set location (Royal Greenwich Observatory, London, UK)
latitude, longitude, tz = 51.4769, 0.0005, 'Europe/London'
# specify time range.
start = pd.Timestamp(datetime.date.today(), tz=tz)
end = start + pd.Timedelta(days=5)
periods = 8 # number of periods that the GFS model and/or the model chain allows us to forecast power output.
# specify what irradiance variables we want
irrad_vars = ['ghi', 'dni', 'dhi']
# Use Global Forecast System model. The GFS is the US model that provides forecasts for the entire globe.
fx_model = GFS() # note: gives output in 3-hourly intervals
# retrieve data in processed format (convert temps from Kelvin to Celsius, combine elements of wind speed, complete irradiance data)
# Returns pandas.DataFrame object
fx_data = fx_model.get_processed_data(latitude, longitude, start, end)
# load module and inverter specifications
sandia_modules = pvlib.pvsystem.retrieve_sam('SandiaMod')
cec_inverters = pvlib.pvsystem.retrieve_sam('cecinverter')
module = sandia_modules['SolarWorld_Sunmodule_250_Poly__2013_']
inverter = cec_inverters['ABB__PVI_3_0_OUTD_S_US_Z_M_A__240_V__240V__CEC_2014_']
# model a fixed system in the UK. 10 strings of 250W panels, with 40 panels per string. Gives a nominal 100kW array
system = PVSystem(module_parameters=module, inverter_parameters=inverter, modules_per_string=40, strings_per_inverter=10)
# use a ModelChain object to calculate modelling intermediates
mc = ModelChain(system, fx_model.location, orientation_strategy='south_at_latitude_tilt')
# extract relevant data for model chain
mc.run_model(fx_data.index, weather=fx_data)
# OTHER CODE AFTER THIS TO DO SOMETHING WITH THE DATA
Having used a lot of print() statements in the console to debug, I can see that at the final line
mc.run_model(fx_data.index....
I get the following error:
/opt/pyenv/versions/3.6.0/lib/python3.6/site-packages/pvlib/pvsystem.py:1317:
RuntimeWarning: divide by zero encountered in log
module['Voco'] + module['Cells_in_Series']*delta*np.log(Ee) +
/opt/pyenv/versions/3.6.0/lib/python3.6/site-packages/pvlib/pvsystem.py:1323:
RuntimeWarning: divide by zero encountered in log
module['C3']*module['Cells_in_Series']*((delta*np.log(Ee)) ** 2) +
As a result, when I then go on to look at the ac_power outputs, I get what looks like erroneous data (every hour with a forecast that is not NaN = 3000 W).
I'd really appreciate any help you can give as I don't know what's causing it. Maybe I'm specifying the system incorrectly?
Thanks, Matt
I think the warnings you're seeing are ok to ignore. A handful of pvlib algorithms spit out warnings due to things like 0 values at night.
I think your problem with the non-NaN values is unrelated to the warnings. Study the other modeling results (stored as mc attributes -- see documentation and source code) to see if you can track down the source of your problem.

(Python3) Conditional Mean in Garch Model

I am using "arch" package of python . I am fitting a GARCH(1,1) model with mean model ARX. After the fitting, we can call the conditional volatility directly. However, I don't know how to call the modeled conditional mean values
Any help?
If you use the attribute resid you can compute fitted values. For example
import datetime as dt
import pandas_datareader.data as web
st = dt.datetime(1990,1,1)
en = dt.datetime(2016,1,1)
data = web.get_data_yahoo('^GSPC', start=st, end=en)
returns = 100 * data['Adj Close'].pct_change().dropna()
from arch import arch_model
am = arch_model(returns, mean='AR',lags=1)
res = am.fit(update_freq=5)
fitted = returns - res.resid
fitted.plot()

Resources