Interpretation of variables in multi-level regression with random effects - python-3.x

I have a dataset that looks like the one below (first few rows shown). CPA is an observed result from an experiment (treatment) on different advertising flights. Flights are hierarchically grouped in campaigns.
campaign_uid flight_uid treatment CPA
0 0C2o4hHDSN 0FBU5oULvg control -50.757370
1 0C2o4hHDSN 0FhOqhtsl9 control 10.963426
2 0C2o4hHDSN 0FwPGelRRX exposed -72.868952
3 0C5F8ZNKxc 0F0bYuxlmR control 13.356081
4 0C5F8ZNKxc 0F2ESwZY22 control 141.030900
5 0C5F8ZNKxc 0F5rfAOVuO exposed 11.200450
I fit a model like the following one:
model.fit('CPA ~ treatment', random=['1|campaign_uid'])
To my knowledge, this model simply says:
We have a slope for treatment
We have a global intercept
We also have an intercept per campaign
so one would just get one posterior for each such variable.
However, looking at the results below, I also get posteriors for the following variable: 1|campaign_uid_offset. What does it represent?
Code for fitting the model and the plot:
from bambi import Model
import pymc3 as pm

model = Model(df)
results = model.fit('{} ~ treatment'.format(metric),
                    random=['1|campaign_uid'],
                    samples=1000)

# Plotting the result
pm.traceplot(model.backend.trace)

1|campaign_uid
These are the random intercepts for campaigns that you mentioned in your list of parameters.
1|campaign_uid_sd
This is the standard deviation of the aforementioned random campaign intercepts.
CPA_sd
This is the residual standard deviation. That is, your model can be written (in part) as CPA_ij ~ Normal(b0 + b1*treatment_ij + u_j, sigma^2), and CPA_sd represents the parameter sigma.
1|campaign_uid_offset
This is an alternative parameterization of the random intercepts. bambi uses this transformation internally in order to improve the MCMC sampling efficiency. Normally this transformed parameter is hidden from the user by default; that is, if you make the traceplot using results.plot() rather than pm.traceplot(model.backend.trace) then these terms are hidden unless you specify transformed=True (it's False by default). It's also hidden by default from the results.summary() output. For more information about this transformation, see this nice blog post by Thomas Wiecki.
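For instance, with the results object fitted above, the behavior described here looks roughly like this (a minimal sketch based on the options mentioned in this answer):

results.plot()                    # offset terms hidden by default (transformed=False)
results.plot(transformed=True)    # also show the 1|campaign_uid_offset posteriors
results.summary()                 # summary output likewise hides the offset terms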

Related

Scale before PCA

I'm using PCA from sckit-learn and I'm getting some results which I'm trying to interpret, so I ran into question - should I subtract the mean (or perform standardization) before using PCA, or is this somehow embedded into sklearn implementation?
Moreover, which of the two should I perform, if so, and why is this step needed?
I will try to explain it with an example. Suppose you have a dataset with many features about housing, and your goal is to classify whether a purchase is good or bad (a binary classification). The dataset includes some categorical variables (e.g. location of the house, condition, access to public transportation, etc.) and some floats or integers (e.g. market price, number of bedrooms, etc.). The first thing you may do is encode the categorical variables. For instance, if you have 100 locations in your dataset, the common way is to encode them from 0 to 99. You may even end up one-hot encoding these variables (i.e. a column of 1s and 0s for each location), depending on the classifier you plan to use.

Now, if you keep the price on its dollar scale (values in the hundreds of thousands or millions), the price feature will have a much larger variance and thus a much larger standard deviation. Remember that variance is computed from squared deviations from the mean: a larger scale produces larger deviations, and squaring makes them grow even faster. But that does not mean the price carries significantly more information than, say, location. In this example, however, PCA would give a very high weight to the price feature, and the weights of the categorical features would drop to almost zero. Normalizing your features allows a fair comparison of the explained variance across the dataset. So it is good practice to center the mean and scale the features before using PCA.
Before PCA, you should,
Mean normalize (ALWAYS)
Scale the features (if required)
Note: Please remember that steps 1 and 2 are technically not the same thing.
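As a minimal sketch of the scale-dominance point above (synthetic numbers and scikit-learn, not data from the question):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical housing-style features: a price column on a large dollar
# scale next to a 0/1 location indicator on a tiny scale.
price = rng.normal(500_000, 150_000, size=200)
location = rng.integers(0, 2, size=200).astype(float)
X = np.column_stack([price, location])

# Without scaling, the first principal component is dominated by price.
print(PCA(n_components=1).fit(X).components_)

# After standardization (centering + unit variance), both features
# contribute on a comparable footing.
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_)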
This is a really non-technical answer, but my method is to try both and then see which one accounts for more variation on PC1 and PC2. However, if the attributes are on different scales (e.g. cm vs. feet vs. inches), then you should definitely scale to unit variance. In every case, you should center the data.
Here's the iris dataset with centering only and with centering + scaling. In this case, centering alone led to higher explained variance, so I would go with that one. The data came from sklearn.datasets.load_iris. Then again, with centering only, PC1 carries most of the weight, so I wouldn't treat patterns found in PC2 as significant. With centering + scaling, on the other hand, the weight is split between PC1 and PC2, so both axes should be considered.
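A rough sketch of that comparison (scikit-learn and the bundled iris data; the exact numbers are illustrative and may vary slightly by version):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Centering only (PCA centers the data internally)
pca_center = PCA(n_components=2).fit(X)

# Centering + scaling to unit variance
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("center only:     ", pca_center.explained_variance_ratio_)
print("center + scaling:", pca_scaled.explained_variance_ratio_)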

How to find a regression line for a closed set of data with 4 parameters in matlab or excel?

I have a set of data I have acquired from simulations. There are 3 parameters that go into my simulations and I get one result out.
I can graph the data from the small subset I have and see the trends for each input, but I need to be able to extrapolate this and get some form of a regression equation, seeing as the simulation takes a long time.
In matlab or excel, is it possible to list the inputs and outputs to obtain a 4 parameter regression line for a given set of information?
Before this gets flagged as a duplicate: I understand polyfit will give me an equation of best fit and will be as accurate as I want it, but I need the equation to correspond to the inputs, not just a regression line.
In other words, if I have 20 simulations with inputs a, b, c and output y, is there a way to obtain a "best fit":
y=B0+B1*a+B2*b+B3*c
using the data?
My usual recommendation for higher-dimensional curve fitting is to pose the problem as a minimization problem (that may be unneeded here with the nice linear model you've proposed, but I'm a hammer-nail guy sometimes).
It starts by creating a correlation function (the functional form you think maps your inputs to the output) given a vector of fit parameters p and input data xData:
correl = @(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:,2) + p(4)*xData(:,3)
Then you need to define a function to minimize given the parameter vector, which I call the objective; this is typically your correlation minus your output data.
The details of this function depend on the solver you'll use (see below).
All of the methods need a starting vector pGuess, which depends on the trends you see.
For a nonlinear correlation function, finding a good pGuess can be a trial, but it is necessary for a good solution.
fminsearch
To use fminsearch, the data must be collapsed to a scalar value using some norm (2 here):
x = [a,b,c]; % your input data as columns of x
objective = @(p) norm(correl(p,x) - y,2);
p = fminsearch(objective,pGuess); % you need to define a good pGuess
lsqnonlin
To use lsqnonlin (which solves the same problem as above in different ways), the norm-ing of the objective is not needed:
objective = @(p) correl(p,x) - y;
p = lsqnonlin(objective,pGuess); % you need to define a good pGuess
(You can also specify lower and upper bounds on the parameter solution, which is nice.)
lsqcurvefit
To use lsqcurvefit (which is simply a wrapper for lsqnonlin), only the correlation function is needed along with the data:
p = lsqcurvefit(correl,pGuess,x,y); % you need to define a good pGuess
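As a side note, since the proposed model y = B0 + B1*a + B2*b + B3*c is linear in its parameters, a direct least-squares solve also works. Here is a minimal sketch in Python/NumPy (with made-up stand-in data for the 20 simulation runs):

import numpy as np

rng = np.random.default_rng(1)
# Stand-in for the simulation inputs a, b, c and output y (20 runs)
a, b, c = rng.random(20), rng.random(20), rng.random(20)
y = 2.0 + 1.5 * a - 3.0 * b + 0.5 * c + rng.normal(0, 0.01, 20)

# Design matrix with a column of ones for the intercept B0
A = np.column_stack([np.ones_like(a), a, b, c])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # approximately [B0, B1, B2, B3]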

Statistics Dummy Variable as Dependent Variable Regression

I have a bunch of independent variables (height, weight, etc.) that I want to regress a dummy variable onto. For instance, if I wanted to regress diabetes (0 if the patient doesn't have diabetes, 1 if the patient does) and I wanted to figure out the effect of an increase of 1 pound of weight on the probability of having diabetes, how would I do that? I'm sure there are multiple ways of doing it, but I've just never heard of a model that does this. I thought it was the probit model but I'm not sure. Any thoughts?
The problem you are describing is known as logistic regression; a web search for that should turn up a lot of hits. Most commonly, the response is some function of a linear combination of inputs, but more generally, the response could be a nonlinear function of inputs.
The dependence of the response on an input (e.g. weight) is interesting, but not exactly well-posed, since the change in the probability of the response varies over the range of the input variable: the change is very small for very large or very small values of the input, and it reaches a maximum somewhere in between.
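A minimal sketch of that idea in Python with statsmodels (the data frame and column names here are hypothetical placeholders, not from the question):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: weight (lbs), height (inches), diabetes (0/1)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "weight": rng.normal(170, 30, 500),
    "height": rng.normal(67, 4, 500),
})
true_logit = -12 + 0.06 * df["weight"].to_numpy() + 0.01 * df["height"].to_numpy()
df["diabetes"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

fit = smf.logit("diabetes ~ weight + height", data=df).fit()
print(fit.summary())

# The change in probability per additional pound is not constant;
# the average marginal effect summarizes it over the observed data.
print(fit.get_margeff().summary())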

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
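To make the "purity"/information-gain idea concrete, here is a toy sketch in Python (a simplified illustration, not the exact criterion every implementation uses):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    # Gain from splitting the rows into mask-True and mask-False subsets
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

# Toy data: is a trade profitable (1) or not (0), candidate split X > 25
X = np.array([10, 20, 25, 30, 40, 50])
profitable = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(profitable, X > 25))  # 1.0 bit: a perfectly pure split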
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
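A minimal scikit-learn sketch of reading importance off a fitted tree (synthetic stand-in columns, not the OP's 10 market measurements):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Synthetic stand-in for the trading table: three market measurements
# and a binary "profitable" response.
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["X1", "X2", "X3"])
df["profitable"] = (df["X1"] + 0.3 * rng.normal(size=1000) > 0).astype(int)

features = ["X1", "X2", "X3"]
tree = DecisionTreeClassifier(max_depth=3).fit(df[features], df["profitable"])

# The root split and feature_importances_ both point at the most
# informative variable (X1 here, by construction).
print(export_text(tree, feature_names=features))
print(dict(zip(features, tree.feature_importances_)))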
One suggestion: decision trees handle both categorical and continuous data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your response variable consists of continuous values from 1 to 100, you might sensibly bin them into 5 bins: 0-20, 21-40, 41-60, and so on.
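A small sketch of that binning step with pandas (the edges and labels below are arbitrary placeholders; pick ones meaningful for your profit/loss scale):

import pandas as pd

# profit: continuous profit/loss values (placeholder numbers)
profit = pd.Series([-120.0, -15.0, 3.5, 48.0, 260.0])

bins = [float("-inf"), -50, 0, 50, float("inf")]
labels = ["big loss", "small loss", "small gain", "big gain"]
print(pd.cut(profit, bins=bins, labels=labels))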
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable at the third value (25) results in two nearly pure subsets, one low-value and one high-value. As long as this purity is higher than for the sub-sets obtained from splitting on the other candidate values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note: I have not used the decision tree module in RapidMiner, nor have I used RapidMiner at all.)
The other set of techniques I would consider is usually grouped under the rubric of Dimension Reduction. Feature Extraction and Feature Selection are perhaps the two most common terms after D/R. The most widely used technique is PCA, or principal-component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result from this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
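A sketch of that calculation with scikit-learn (any numeric, standardized feature matrix X will do; 0.95 is just the example threshold from above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 10)))  # stand-in data

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative)
print("components needed for 95% of the variance:", n_dims)

# Equivalently, PCA(n_components=0.95) keeps just enough components
# to reach that fraction of explained variance.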
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R inside RapidMiner. R has plenty of PCA libraries (packages). The ones I mention below are all available on CRAN, which means any of the PCA packages there satisfy the minimum package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is actually a tutorial for Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.

Reverse engineer a new set of points from an original set by altering moments, skew, and/or Kurtosis?

I don't know if this is even possible but I'd like to be able to take a set of points, run something on them that calculates the moments, skew and kurtosis values, and have another function that would take those elements and reverse engineer a new set of points using modified values for the moments, skew and/or kurtosis. I already have the analytical function in Delphi Pro 6 which is:
procedure MomentSkewKurtosis(const Data: array of Double; var M1, M2, M3, M4, Skew, Kurtosis: Extended);
I'm looking for a partner function that could return a new Data array after I make alterations to any of the output parameters "var" in MomentSkewKurtosis() and pass them back in to the partner function as input parameters. For example, suppose I wanted to increase the Skew of the data and get a new set of points back that would be the original set of points altered just enough to generate the new Skew value.
The problem is not easy, and probably better targeted at stats, but I'll give you a pointer to a paper that I think is very good and straight to the mark: Towards the Optimal Reconstruction of a Distribution from its Moments
Hope this helps!
Obviously you can't reconstruct an arbitrary density distribution from a finite number of summary values. You can create a distribution which fits the parameters, but it's not necessarily the original distribution.
And as far as I remember, Mean, Variance, Skew and Kurtosis are just functions of the first 4 moments, so you can't choose them independently of the moments.
On the other hand, there does exist a function which you can apply to each data member to produce a new dataset with the desired properties. Since you are fixing the first 4 moments, I suspect it's a polynomial of degree 3.
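A rough sketch of that last idea in Python (my own illustration, not from the answer above, and Delphi would work the same way): numerically fit the coefficients of a cubic y = a + b*x + c*x^2 + d*x^3 so the transformed data matches target mean, variance, skew, and kurtosis. A better optimizer or starting point may be needed for extreme targets.

import numpy as np
from scipy import stats
from scipy.optimize import minimize

def sample_moments(x):
    # mean, variance, skew, (excess) kurtosis of a sample
    return np.array([x.mean(), x.var(), stats.skew(x), stats.kurtosis(x)])

def transform_to_moments(data, target):
    # Search for cubic coefficients whose output matches the target moments
    def loss(p):
        y = p[0] + p[1] * data + p[2] * data**2 + p[3] * data**3
        return np.sum((sample_moments(y) - target) ** 2)
    res = minimize(loss, x0=[0.0, 1.0, 0.0, 0.0], method="Nelder-Mead")
    a, b, c, d = res.x
    return a + b * data + c * data**2 + d * data**3

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

# Ask for the same mean and variance but noticeably more skew
target = sample_moments(x) + np.array([0.0, 0.0, 0.8, 0.0])
y = transform_to_moments(x, target)
print(sample_moments(y))  # should land close to the target, not exactly on it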
