Excel to Python for big panel data, regression ready format? - excel

Trying to get some big panel data from Excel into Python so I can do some GMM/cross-sectional panel data regression analysis (think the scikit-learn package). I have moved my data from Excel into Python, but the format is not right for regression analysis (see below). The scikit-learn website has some datasets to play with, but it is not much help for understanding what format the data needs to be in or how to get my own data into that shape.
Does anyone have any experience using Excel (.xlsx) data and getting it into Python, 'regression-ready'?
I have already done the regression analysis I need in R and Stata, but I would like to get better at using Python for regression analysis, since it has some nice attributes.
Here is my dataframe format so far, from Excel to Python (truncated from a 10,000 x 60 dataset):
BANKS YEARS CIR DSF EQCUS EQLI EQNT EQUITY
0 CR1 2005 65.46 927915.00 28.553 23.948 37.542 264946.50
1 CR1 2006 65.98 1026491.00 30.491 26.584 36.143 312986.00
2 CR1 2007 60.26 1437615.00 27.003 23.413 28.238 388197.20
3 CR1 2008 58.08 1605464.00 24.024 20.160 25.828 385696.80
4 CR1 2009 65.21 1538570.00 28.160 22.850 27.907 433267.30
5 CR1 2010 54.45 1822863.00 31.009 24.555 28.274 565254.60
6 CR1 2011 57.38 2075505.00 30.905 24.861 29.618 641440.50
7 CR1 2012 62.12 2533641.00 29.595 24.509 28.883 749821.50
Data types:
>>>df.dtypes
BANKS object
YEARS int64
CIR float64
DSF float64
EQCUS float64
EQLI float64
EQNT float64
EQUITY float64
Unicode column names (I don't think scikit-learn likes that!)
>>>df.columns.tolist()
[u'BANKS', u'YEARS', u'CIR', u'DSF', u'EQCUS', u'EQLI', u'EQNT', u'EQUITY']

I'm not sure which columns you're including in the regression, or what errors you're getting, but you can't feed a string categorical variable (like 'BANKS') straight into a regression. You need to convert the categorical variable into dummy variables (binary 0/1 columns) and exclude the original categorical column from the regression.
I also don't believe you can include rows with missing data points, so you either need to impute the missing values or drop those rows (df.fillna or df.dropna in pandas).
You may want to consider using pandas to manage datasets in Python. It's a package you can install and import, and it makes Python behave more like R or Stata. There's a nice tutorial here: http://pandas.pydata.org/pandas-docs/stable/10min.html
Pandas even has a function for converting categorical variables into dummy variables: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
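Putting those pieces together, here is a minimal sketch of one possible pipeline. The file name banks.xlsx, the sheet name, and the choice of CIR as the dependent variable are illustrative assumptions, not taken from the question, and LinearRegression just stands in for whatever estimator you end up using:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the sheet (file and sheet names are placeholders)
df = pd.read_excel('banks.xlsx', sheet_name='Sheet1')

# Drop rows with missing values (or impute with df.fillna(...) instead)
df = df.dropna()

# Turn the categorical BANKS column into 0/1 dummy columns,
# dropping one level to avoid perfect collinearity with the intercept
dummies = pd.get_dummies(df['BANKS'], prefix='BANK', drop_first=True)

# Design matrix: numeric regressors plus the bank dummies; CIR is the target here
X = pd.concat([df[['YEARS', 'DSF', 'EQCUS', 'EQLI', 'EQNT', 'EQUITY']], dummies], axis=1)
y = df['CIR']

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))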
Hope that helps...

Related

Cleaner workflow with panel data in pandas and statsmodels: panel, dataframe with multiindex, or plain dataframe?

Imagine I have an unbalanced panel such as:
Firm Year y x
A 1990 1.1 2.2
A 1991 2.1 3.2
B 1990 4.1 9.2
B 1991 5.2 10.1
B 1992 5.3 10.1
etc...
I'm reasonably new to pandas and statsmodels and am curious which structure gives a cleaner workflow for typical panel data:
A dataframe with no special indexing?
A dataframe with a multiindex over firm and year?
eg. set_index(['year','firm'])
A pandas.Panel?
If I were in SQL, (year, firm) would be the primary key. My instinct is that a multiindex best captures the structure of the data and the various use cases. Is that a sensible data structure if I want to run a panel regression with firm or year fixed effects in statsmodels? Or is using a multiindex so clunky that it's not worth the bother when performance isn't an issue?
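For what it's worth, a minimal sketch of the multiindex option on the toy panel above, plus one common way to get firm fixed effects (dummy coding via C(Firm) in a statsmodels formula); this is illustrative, not the only workflow:

import pandas as pd
import statsmodels.formula.api as smf

# Toy unbalanced panel matching the example above
df = pd.DataFrame({
    'Firm': ['A', 'A', 'B', 'B', 'B'],
    'Year': [1990, 1991, 1990, 1991, 1992],
    'y':    [1.1, 2.1, 4.1, 5.2, 5.3],
    'x':    [2.2, 3.2, 9.2, 10.1, 10.1],
})

# Option 2: multiindex over (year, firm), the SQL-primary-key analogue
panel = df.set_index(['Year', 'Firm']).sort_index()

# Firm fixed effects via dummy coding; the formula API accepts the flat dataframe
fe = smf.ols('y ~ x + C(Firm)', data=df).fit()
print(fe.params)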

Python pandas: Best way to normalize data? [duplicate]

I have a large pandas dataframe with about 80 columns. Each of the 80 columns reports daily traffic statistics for a website (the columns are the websites).
Since I don't want to work with the raw traffic statistics, I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.
Date A B ...
10/10/2010 100.0 402.0 ...
11/10/2010 250.0 800.0 ...
12/10/2010 800.0 2000.0 ...
13/10/2010 400.0 1800.0 ...
That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry I am not able to provide the full data.
First, turn your Date column into an index.
dates = df.pop('Date')
df.index = dates
Then either use z-score normalizing:
df1 = (df - df.mean())/df.std()
or min-max scaling:
df2 = (df-df.min())/(df.max()-df.min())
I would probably advise z-score normalization here, because min-max scaling is highly susceptible to outliers: a single extreme value stretches the range and compresses everything else into a narrow band.
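A small end-to-end sketch using the sample rows above; the 0-100 variant the question asks for is just the min-max formula multiplied by 100:

import pandas as pd

df = pd.DataFrame({
    'Date': ['10/10/2010', '11/10/2010', '12/10/2010', '13/10/2010'],
    'A': [100.0, 250.0, 800.0, 400.0],
    'B': [402.0, 800.0, 2000.0, 1800.0],
}).set_index('Date')

z_scored = (df - df.mean()) / df.std()                         # mean 0, std 1 per column
scaled_0_100 = 100 * (df - df.min()) / (df.max() - df.min())   # each column spans 0 to 100
print(scaled_0_100)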

Import 5 dimensions from Excel into GAMS

I have a parameter with 5 dimensions that I would like to import from Excel.
Currently I have a series of Excel sheets, each with over 500,000 rows. The columns are:
Parcel,
Farm,
Year,
Species,
Class,
Surface
Ideally I would like to have a parameter in GAMS like: Surface(Parcel,Farm,Year,Species,Class)
Is there an elegant way to do this?
Problem solved.
For those interested, the solution is as simple as this:
$CALL GDXXRW.EXE DeParcels2012_GAMS.xlsx par=input2012 rdim=5
The number of row dimensions can be specified with the rdim option of the gdxxrw tool.

How do I define a Standard Deviation function in Pentaho Schema Workbench

I'm building an OLAP Analysis with Pentaho's BI Suite (Community Edition). Many of my measures are standard deviations of the variables in my fact tables.
Does anyone have a tip on how to define a Standard Deviation aggregation function in Schema Workbench? Lots of my jobs could benefit from it.
Thanks in advance!
You could use a MeasureExpression
There is a guide on how to do this in PostgreSQL here; what is your underlying DB?
http://blog.endpoint.com/2009/07/subverting-postgresql-aggregates-for.html
There has long been a request to support custom aggregators, but it has not been done yet.
In my case the database has 3 million rows, while the MDX cube has 3,124 cells.
So the MDX function would calculate the standard deviation from the 3,124 cell values, whereas a "real" statistician would usually use all 3 million rows.
To get the statistician's STDDEV, I added a column in the database holding the square of each row value.
Then in Mondrian I defined a new measure, the standard deviation, as:
the square root of (the average of the squared values minus the square of the average value)
This has some consequences for hierarchies, but that is another story.
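In symbols, that measure is the usual population-variance identity (the sample version would divide by N-1 instead of N):

\sigma = \sqrt{ \frac{1}{N}\sum_{i=1}^{N} x_i^{2} \;-\; \left( \frac{1}{N}\sum_{i=1}^{N} x_i \right)^{2} }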
How I'm calculating standard deviations now:
I created an ID dimension, not to explore, just to make sure that Mondrian isn't calculating Standard Deviation of values already aggregated.
Then I created a new Calculated Member using the MDX formula:
Stddev(Descendants([ID_Dimension.ID_Hierarchy],,Leaves),[Measures].[Measure with values to be aggregated]).
Performance sucks.
The idea came from this very old forum post.

How to plot a pre-binned histogram in Excel 2007, without Analysis Toolpak

I have two columns of data. Column A has values. Column B has the bin associated with each value in Column A (let's say bins 1 to 10). For example:
A B
-- ----
.43 1
.29 4
.23 1
.11 8
... ...
Is it possible to create a histogram chart without using the Analysis Toolpak, and ideally without having to post-process this data into frequency counts?
The answer is to use a PivotChart. That way I don't need the Analysis Toolpak and I don't need to pre-process the data; it can be used as-is.
