Building an error component logit using mlogit

I want to build an error component logit using R's mlogit library.
I have treated my dataset as a panel dataset (i.e. each row corresponds to an alternative) and now want to build an error component logit model.
I understand that in order to build a mixed logit model I need to pass the list of covariates to the rpar argument. However, I do not want to estimate random parameters for the covariates but for the intercept terms.

In a multinomial logit model you can estimate J-1 alternative specific constants (intercepts). The easiest way to make them random is to create alternative specific indicators.
For example, let's say that you have three alternatives: 1, 2 and 3, and that these are stored in the variable alt. Now you can create two new variables called alt_1 and alt_2, which are equal to 1 for alternative 1 and 2, respectively, and 0 otherwise.
data$alt_1 <- ifelse(data$alt == 1, 1, 0)
data$alt_2 <- ifelse(data$alt == 2, 1, 0)
Now use the mlogit.data() function.
In your model you would then specify alt_1 and alt_2 as random parameters following some distribution. You now have random alternative-specific constants, and you estimate a mean and standard deviation for each. If you want them to be simple error components with zero mean and unit standard deviation, you can fix the mean and sd parameters for these intercepts to 0 and 1, respectively.
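A minimal sketch of what this could look like in code (the column names choice, alt and id, the covariates cost and time, and the number of draws are all placeholders for your own setup):
library(mlogit)
# reshape to long format: one row per alternative, grouped by decision maker
dat <- mlogit.data(data, choice = "choice", shape = "long",
                   alt.var = "alt", id.var = "id")
# alt_1 and alt_2 enter as ordinary covariates but are declared random via rpar;
# "| 0" drops the usual fixed alternative-specific constants so they do not clash
m <- mlogit(choice ~ cost + time + alt_1 + alt_2 | 0, data = dat,
            rpar = c(alt_1 = "n", alt_2 = "n"),
            R = 500, halton = NA, panel = TRUE)
summary(m)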

How to generate a random stochastic matrix or ndarray?

I was looking for a crate that would allow me to easily and randomly generate probability vectors, stochastic matrices or, in general, ndarrays that are stochastic. For people not familiar with these concepts, a probability vector v is defined as follows
0 <= v[i] <= 1, for all i
sum(v[i]) = 1
Similarly, a stochastic matrix is a matrix where each column (or row) satisfies the conditions above.
More generally, a ndarray A would be stochastic if
0 <= A[i, j, k, ..., h] <= 1, for all indices
sum(A[i, j, k, ..., :]) = 1, for all indices
Here, ... just means other indices between k and the last index h. : is a notation to indicate all elements of that dimension.
Is there a crate that does this easily (i.e. you just need to call a function or something like that)? If not, how would you do it? I suppose one could just generate a random ndarray and then normalize it by dividing each slice along the last dimension by the sum of its elements; for a 1d array (a probability vector), we could do something like this:
use ndarray::Array1;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;

fn main() {
    let mut a = Array1::random(10, Uniform::new(0.0, 1.0));
    a = &a / a.sum();
    println!("The sum is {:?}", a.sum());
}
But how would you do it for higher-dimensional arrays? We could use a for loop and iterate over all indices, but that doesn't look like it would be efficient. I suppose there must be a way to do this operation in vectorized form. Is there a function (in the standard library, in the ndarray crate or some other crate) that does this for us? Could we use ndarray-rand to do this without having to divide by the sum?
Requirements
Efficiency is not strictly necessary, but it would be nice.
I am more looking for a simple solution (no more complicated than what I wrote above).
Numerical stability would also be great (e.g. generating random integers and then dividing by the sum may be a better idea than generating random floats and then doing the same thing).
I would like to use ndarray and the related crates, but it's ok if you also share other solutions (which may be useful to others that don't use ndarray).
I would argue that sampling from whatever distribution you have on hand (U(0,1), Exponential, absolute Normal, ...) and then dividing by the sum is the wrong way to go.
Start with a distribution whose values are in the [0, 1] range and sum to 1 by construction.
Fortunately, there is such a distribution: the Dirichlet distribution.
And, apparently, there is a Rust lib to do Dirichlet sampling. I cannot say anything about the lib's quality.
https://docs.rs/rand_distr/latest/rand_distr/struct.Dirichlet.html
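A small sketch of how that could be used (assuming the rand_distr 0.4-style API, where Dirichlet::new takes a slice of concentration parameters and sample returns a Vec<f64>; the dimensions are arbitrary):
use ndarray::{Array2, Axis};
use rand::thread_rng;
use rand_distr::{Dirichlet, Distribution};

fn main() {
    let mut rng = thread_rng();
    // all concentration parameters equal to 1.0 => draws are uniform on the simplex
    let dir = Dirichlet::new(&[1.0, 1.0, 1.0, 1.0]).unwrap();

    // build a 5 x 4 row-stochastic matrix: each row is one Dirichlet draw
    let rows: usize = 5;
    let samples: Vec<f64> = (0..rows).flat_map(|_| dir.sample(&mut rng)).collect();
    let m = Array2::from_shape_vec((rows, 4), samples).unwrap();

    println!("{:?}", m);
    println!("row sums: {:?}", m.sum_axis(Axis(1))); // each ~ 1.0
}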
UPDATE
Wrt sampling and then normalizing, the problem is that no one knows what the distribution of the RVs
U(0,1)/(U(0,1) + U(0,1) + ... + U(0,1))
would be. Mean value? Median? Variance? Is there anything to say at all?
You could even construct it like
[U(0,1); Exp(2); |N(0,1)|; U(0,88); Exp(4.5); ...] and as soon as you divide it by the sum, the values in the vector would be between 0 and 1 and sum to 1. There is even less to say about the properties of such RVs.
I assume you want to generate random vectors/matrices for some purpose, like Monte Carlo etc. Dealing with a known distribution with well-defined properties, mean values and variance looks like the right way to go.
If I understand correctly, the Dirichlet distribution allows you to generate a probability vector where the probabilities depend on the initial parameters that you pass, but you would still need to pass these parameters (manually).
Yes, the concentration parameters. By default they are all ones, which makes the RVs uniformly distributed in the simplex.
So, are you suggesting the Dirichlet distribution because it was designed on purpose to generate probability vectors?
I'm suggesting Dirichlet because by default it will produce RVs uniformly distributed in the simplex, summing to 1 and with well-known statistical properties, starting with the PDF, CDF, mean, median, variance, ...
UPDATE II
For the Dirichlet distribution,
PDF = Prod_i( x_i^(a_i - 1) ) / B(a)
so for the case where all a_i = 1,
PDF = 1 / B(a)
which, given the constraint defining the simplex, Sum_i(x_i) = 1, is as uniform as it gets.

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about entropy.
I've used kmeans to cluster a bunch of samples, and I want an entropy greater than 0.9 (stats and psychology are not my expertise and this problem is both). I have 59 samples; each sample has 3 features in it. I look for the best covariance type via
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(data3)
where the n_components_range is just [2] (later I'll check 2 through 5).
Then I take the GMM with the lowest AIC or BIC of the four, saved as best_eitherAB (not shown). I want to see if the label assignments of the predictions are stable across time (I want to run for 1000 iterations), so I know I then need to calculate the entropy, which needs class assignment probabilities. So I predict the probabilities of the class assignment via gmm's method,
probabilities = best_eitherAB.predict_proba(data3)
all_probabilities.append(probabilities)
After all the iterations, I have an array of 1000 arrays, each contains 59 rows (sample size) by 2 columns (for the 2 classes). Each inner row of two sums to 1 to make the probability.
Now, I'm not entirely sure what to do regarding the entropy. I can just feed the whole thing into scipy.stats.entropy,
entr = scipy.stats.entropy(all_probabilities)
and it spits out numbers: one 2-item numpy array for each of my samples. I could feed in just one of the 1000 tests and get a single small array of two items, or I could feed in just a single column and get a single value back. But I don't know what these numbers represent, and they are between 1 and 3.
So my questions are -- am I totally misunderstanding how I can use scipy.stats.entropy to calculate the stability of my classes? If I'm not, what's the best way to find a single number entropy that tells me how good my model selection is?
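For reference on what that call computes: scipy.stats.entropy treats its input as (possibly unnormalized) probability distributions along axis 0 by default, which helps explain the shapes you are seeing. A minimal sketch with stand-in data (the Dirichlet draw is just a placeholder for one iteration's predict_proba output):
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
# stand-in for one iteration of predict_proba(): 59 samples x 2 classes
probabilities = rng.dirichlet([1, 1], size=59)

# default behaviour: one entropy per *column* (axis=0), computed across samples
print(entropy(probabilities).shape)              # (2,)

# per-sample class-assignment entropy, in bits; for 2 classes it lies in [0, 1]
per_sample = entropy(probabilities, base=2, axis=1)
print(per_sample.shape)                          # (59,)
print(per_sample.mean())                         # one summary number per run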

scikit-learn Decision trees Regression: retrieve all samples for leaf (not mean)

I have started using scikit-learn Decision Trees and so far it is working out quite well, but one thing I need to do is retrieve the set of sample Y values for a leaf node, especially when running a prediction. That is, given an input feature vector X, I want to know the set of corresponding Y values at the leaf node instead of just the regression value, which is the mean (or median) of those values. Of course one would want the sample mean to have a small variance, but I do want to extract the actual set of Y values and do some statistics / create a PDF. I have used code like this: how to extract the decision rules from scikit-learn decision-tree?
to print the decision tree, but the output of 'value' is a single float representing the mean. I have a large dataset, so I limit the leaf size to e.g. 100; I want to access those 100 values...
Another solution is to use an (undocumented?) feature of the sklearn DecisionTreeRegressor object, .tree_.impurity:
it returns the impurity of each leaf (for the default squared-error criterion, that is the variance of the Y values in that leaf).
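If you need the actual Y values rather than a summary statistic, one common approach (a sketch, not from the original answer; the toy data is made up) is to use the tree's apply() method to map every training sample to its leaf and group the targets by leaf index:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data standing in for the real X, y
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# leaf index of every training sample
train_leaves = tree.apply(X)
# group the training targets by leaf
leaf_values = {leaf: y[train_leaves == leaf] for leaf in np.unique(train_leaves)}

# at prediction time, look up the full set of Y values in the predicted leaf
X_new = rng.normal(size=(1, 3))
leaf = tree.apply(X_new)[0]
samples_in_leaf = leaf_values[leaf]
print(len(samples_in_leaf), samples_in_leaf.mean(), samples_in_leaf.std())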

Interpretation of variables in multi-level regression with random effects

I have a dataset that looks like the one below (first 5 rows shown). CPA is an observed result from an experiment (treatment) on different advertising flights. Flights are hierarchically grouped in campaigns.
campaign_uid flight_uid treatment CPA
0 0C2o4hHDSN 0FBU5oULvg control -50.757370
1 0C2o4hHDSN 0FhOqhtsl9 control 10.963426
2 0C2o4hHDSN 0FwPGelRRX exposed -72.868952
3 0C5F8ZNKxc 0F0bYuxlmR control 13.356081
4 0C5F8ZNKxc 0F2ESwZY22 control 141.030900
5 0C5F8ZNKxc 0F5rfAOVuO exposed 11.200450
I fit a model like the following one:
model.fit('CPA ~ treatment', random=['1|campaign_uid'])
To my knowledge, this model simply says:
We have a slope for treatment
We have a global intercept
We also have an intercept per campaign
so one would just get one posterior for each such variable.
However, looking at the results below, I also get posteriors for the following variable: 1|campaign_uid_offset. What does it represent?
Code for fitting the model and the plot:
model = Model(df)
results = model.fit('{} ~ treatment'.format(metric),
                    random=['1|campaign_uid'],
                    samples=1000)

# Plotting the result
pm.traceplot(model.backend.trace)
1|campaign_uid
These are the random intercepts for campaigns that you mentioned in your list of parameters.
1|campaign_uid_sd
This is the standard deviation of the aforementioned random campaign intercepts.
CPA_sd
This is the residual standard deviation. That is, your model can be written (in part) as CPA_ij ~ Normal(b0 + b1*treatment_ij + u_j, sigma^2), and CPA_sd represents the parameter sigma.
1|campaign_uid_offset
This is an alternative parameterization of the random intercepts. bambi uses this transformation internally in order to improve the MCMC sampling efficiency. Normally this transformed parameter is hidden from the user by default; that is, if you make the traceplot using results.plot() rather than pm.traceplot(model.backend.trace) then these terms are hidden unless you specify transformed=True (it's False by default). It's also hidden by default from the results.summary() output. For more information about this transformation, see this nice blog post by Thomas Wiecki.
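Concretely, the offset terms come from the standard non-centered parameterization of the random intercepts. A rough PyMC3-style sketch of the idea (variable names are illustrative, not bambi's exact internals; df is the data frame from the question):
import pymc3 as pm

n_campaigns = df['campaign_uid'].nunique()

with pm.Model():
    # group-level standard deviation of the campaign intercepts
    sd_campaign = pm.HalfNormal('1|campaign_uid_sd', sd=10)
    # standard-normal "offset" draws -- this is what shows up in the trace
    offset = pm.Normal('1|campaign_uid_offset', mu=0, sd=1, shape=n_campaigns)
    # the actual random intercepts are the scaled offsets
    intercepts = pm.Deterministic('1|campaign_uid', sd_campaign * offset)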

How to find a regression line for a closed set of data with 4 parameters in matlab or excel?

I have a set of data I have acquired from simulations. There are 3 parameters that go into my simulations and I get one result out.
I can graph the data from the small subset I have and see the trends for each input, but I need to be able to extrapolate this and get some form of a regression equation, seeing as the simulation takes a long time.
In matlab or excel, is it possible to list the inputs and outputs to obtain a 4 parameter regression line for a given set of information?
Before this gets flagged as a duplicate: I understand polyfit will give me an equation of best fit and will be as accurate as I want it, but I need the equation to correspond to the inputs, not just a regression line.
In other words, if I have 20 simulations with inputs a, b, c and output y, is there a way to obtain a best fit of the form
y = B0 + B1*a + B2*b + B3*c
using the data?
My usual recommendation for higher-dimensional curve fitting is to pose the problem as a minimization problem (that may be unneeded here with the nice linear model you've proposed, but I'm a hammer-nail guy sometimes).
It starts by creating a correlation function (the functional form you think maps your inputs to the output) given a vector of fit parameters p and input data xData:
correl = @(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:,2) + p(4)*xData(:,3);
Then you need to define a function to minimize given the parameter vector, which I call the objective; this is typically your correlation minus your output data.
The details of this function depend on the solver you'll use (see below).
All of the methods need a starting vector pGuess, which you choose based on the trends you see.
For a nonlinear correlation function, finding a good pGuess can be a trial, but it is necessary for a good solution.
fminsearch
To use fminsearch, the data must be collapsed to a scalar value using some norm (2 here):
x = [a,b,c]; % your input data as columns of x
objective = @(p) norm(correl(p,x) - y,2);
p = fminsearch(objective,pGuess); % you need to define a good pGuess
lsqnonlin
To use lsqnonlin (which solves the same problem as above in different ways), the norm-ing of the objective is not needed:
objective = @(p) correl(p,x) - y;
p = lsqnonlin(objective,pGuess); % you need to define a good pGuess
(You can also specify lower and upper bounds on the parameter solution, which is nice.)
lsqcurvefit
To use lsqcurvefit (which is simply a wrapper for lsqnonlin), only the correlation function is needed along with the data:
p = lsqcurvefit(correl,pGuess,x,y); % you need to define a good pGuess
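Putting the pieces together for the lsqcurvefit route (a sketch, assuming column vectors a, b, c and y from the 20 simulations; since the proposed model is linear in the parameters, an ordinary least-squares solve via the backslash operator is shown as a cross-check):
% assumed data: column vectors a, b, c and output y from the simulations
x = [a, b, c];
correl = @(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:,2) + p(4)*xData(:,3);

pGuess = [mean(y); 0; 0; 0];           % crude but adequate start for a linear model
p = lsqcurvefit(correl, pGuess, x, y)  % p = [B0; B1; B2; B3]

% cross-check: the same coefficients from ordinary least squares
pLinear = [ones(size(y)), a, b, c] \ y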
