Does MiniZinc have any syntax to specify the distribution used for value selection of a variable? For example:
var 0..100: X ;
I would like to specify that X takes values in the range 0..50 90% of the time and values in 51..100 10% of the time. The syntax
solve :: int_search([X], first_fail, indomain_random, complete) satisfy;
specifies that X may get any value in 0..100 with the same probability.
MiniZinc doesn't have any syntax to state such a random distribution.
One way might be to change the FlatZinc solver so that it behaves this way when "indomain_random" is used, though this requires access to the solver's source code.
That said, what exactly is your use case for wanting this distribution?
My question is, when building a decision tree in sklearn, if I have a categorical variable, is there a problem if I manually input the values of the variable as numbers? (assuming the dataframe is small)
And will there be a difference in results if my variable is nominal or ordinal?
I don't think there should be much difference since the theory says that you should look for the best combination in terms of entropy and other metrics, so it shouldn't care if one value is smaller than another.
Thank you very much
There are differences if your categorical variable is ordinal or nominal:
If your variable is ordinal, you can just replace each category with a number (for example: bad, normal, good can be replaced with 1, 2, 3). Note that you are keeping only one column. You can do it manually if you have few samples. You can also use OrdinalEncoder from sklearn with an explicit category order; LabelEncoder assigns integers in sorted (alphabetical) order, which need not match the ranking.
If your variable is not ordinal, you have to add new columns to your dataset, one for each category. You can do it manually, but I would recommend using pd.get_dummies().
To sum up, you have to be very careful about whether the categorical variable is ordinal or not. You can deal with either kind manually (you would obtain the same results), but it's recommended to use predefined functions to avoid mistakes.
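As a rough illustration of the two encodings described above, here is a minimal pandas sketch. The toy DataFrame, its column names ("quality", "color") and the ranking dictionary are all hypothetical, made up for this example only.

```python
import pandas as pd

# Hypothetical toy data: "quality" is ordinal, "color" is nominal.
df = pd.DataFrame({
    "quality": ["bad", "good", "normal", "bad"],
    "color": ["red", "blue", "green", "red"],
})

# Ordinal: map each category to its rank, keeping a single column.
quality_order = {"bad": 1, "normal": 2, "good": 3}
df["quality"] = df["quality"].map(quality_order)

# Nominal: replace the column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```

The resulting frame can then be passed to a tree model such as sklearn's DecisionTreeClassifier as the feature matrix.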
I am trying to simulate a number of different distribution types for a project using Excel. Right now, I have generated a normal distribution with a mean of 35 and a standard deviation of 3.33. So far so good.
I would like to also generate some other distribution types.
One I have tried is a lognormal. To get that, I am using the following code:
=LOGNORM.INV(RAND(),LN(45^2/SQRT(45^2+3.33^2)),SQRT(LN((45^2+3.33^2)/45^2)))
It produces some output, but I would welcome anyone's input on the syntax.
What I really want to try to do is a power law distribution. From what I can tell, Excel does not have a built-in function to randomly generate this data. Does anyone know of a way to do it, besides switching software packages?
Thanks for any help you can provide.
E
For the (type I) Pareto distribution, if the parameters are a min value xm and an exponent alpha then the cdf is given by
p = 1 - (xm/x)^alpha
This gives the probability, p, that the random variable takes on a value which is <= x. This is easy to invert, so you can use inverse sampling to generate random variables which follow that distribution:
x = xm/(1-p)^(1/alpha) = xm*(1-p)^(-1/alpha)
If p is uniform over [0,1] then so is 1-p, so in the above you can just use RAND() to simulate 1-p. Thus, in Excel, if you wanted to simulate a type I Pareto distribution with, say, xm = 2 and alpha = 3, you would use the formula:
= 2 * RAND()^(-1/3)
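For readers who want to sanity-check the inverse-sampling formula outside Excel, here is a minimal Python sketch using the same xm = 2 and alpha = 3. The helper name sample_pareto is made up for this example. The check relies on the standard result that a type I Pareto with alpha > 1 has mean alpha*xm/(alpha-1), which is 3 here.

```python
import random

def sample_pareto(xm, alpha):
    """Draw one type I Pareto variate by inverting the CDF."""
    # 1 - random() lies in (0, 1], which avoids raising 0 to a negative power.
    u = 1.0 - random.random()
    return xm * u ** (-1.0 / alpha)

random.seed(0)
samples = [sample_pareto(2, 3) for _ in range(100_000)]

# Every draw should be at least xm, and the sample mean should be close to 3.
sample_min = min(samples)
sample_mean = sum(samples) / len(samples)
print(sample_min, sample_mean)
```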
If you are going to be doing this sort of thing a lot with different distributions, you might want to consider using R, which can be called directly from Excel using the RExcel add-in. R has a very large number of built-in distributions that it can directly sample from (and it also uses a better underlying random number generator than Excel does).
I have a set of data I have acquired from simulations. There are 3 parameters that go into my simulations and I get one result out.
I can graph the data from the small subset I have and see the trends for each input, but I need to be able to extrapolate this and get some form of regression equation, seeing as the simulation takes a long time.
In MATLAB or Excel, is it possible to list the inputs and outputs to obtain a 4-parameter regression line for a given set of information?
Before this gets flagged as a duplicate: I understand polyfit will give me an equation of best fit and will be as accurate as I want it, but I need the equation to correspond to the inputs, not just a regression line.
In other words, if I have 20 simulations of inputs a, b, c and output y, is there a way to obtain a "best fit":
y=B0+B1*a+B2*b+B3*c
using the data?
My usual recommendation for higher-dimensional curve fitting is to pose the problem as a minimization problem (that may be unneeded here with the nice linear model you've proposed, but I'm a hammer-nail guy sometimes).
It starts by creating a correlation function (the functional form you think maps your inputs to the output) given a vector of fit parameters p and input data xData:
correl = @(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:,2) + p(4)*xData(:,3);
Then you need to define a function to minimize given the parameter vector, which I call the objective; this is typically your correlation minus your output data.
The details of this function are determined from the solver you'll use (see below).
All of the methods need a starting vector pGuess, which depends on the trends you see.
For a nonlinear correlation function, finding a good pGuess can be a trial, but it is necessary for a good solution.
fminsearch
To use fminsearch, the data must be collapsed to a scalar value using some norm (2 here):
x = [a,b,c]; % your input data as columns of x
objective = @(p) norm(correl(p,x) - y,2);
p = fminsearch(objective,pGuess); % you need to define a good pGuess
lsqnonlin
To use lsqnonlin (which solves the same problem as above in different ways), the norm-ing of the objective is not needed:
objective = @(p) correl(p,x) - y;
p = lsqnonlin(objective,pGuess); % you need to define a good pGuess
(You can also specify lower and upper bounds on the parameter solution, which is nice.)
lsqcurvefit
To use lsqcurvefit (which is simply a wrapper for lsqnonlin), only the correlation function is needed along with the data:
p = lsqcurvefit(correl,pGuess,x,y); % you need to define a good pGuess
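Since the proposed model y = B0 + B1*a + B2*b + B3*c is linear in its coefficients, it can also be solved directly as an ordinary least-squares problem without an initial guess. Here is a minimal sketch in Python with NumPy rather than MATLAB. The data are synthetic stand-ins for the 20 simulation runs, and the "true" coefficients are invented purely for this example.

```python
import numpy as np

# Hypothetical stand-in for 20 simulation runs with inputs a, b, c.
rng = np.random.default_rng(0)
a, b, c = rng.uniform(0, 10, size=(3, 20))
y = 1.0 + 2.0 * a - 0.5 * b + 3.0 * c + rng.normal(0, 0.01, size=20)

# Design matrix: a column of ones for B0, then one column per input.
X = np.column_stack([np.ones_like(a), a, b, c])

# Least-squares solution of X @ B ~= y.
B, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(B)  # estimates of B0, B1, B2, B3
```

In MATLAB the analogous step would be to build the same matrix of ones and input columns and solve with the backslash operator.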
In Maple, there is some feature that allows you to calculate the pdf of a function of a random variable. For example, if X is exponentially distributed, and you want to know the distribution of X^2, then there is a function that will do that for you.
My question is: is there functionality in MATLAB that allows you to do so? I have looked through MATLAB's guide, but I didn't see it.
The Statistics toolbox includes many probability distributions for you to choose from, both parametric and non-parametric. For each it provides functions for PDF, CDF, fitting, random number generation, etc.
I suggest you start with the "Distribution Fitting app": dfittool.
EDIT:
In addition, MuPAD has support for a number of distributions, which you can manipulate symbolically. Example:
The function intlib::changevar might be of interest here, though it seems intended for integrals...
Also, if you're interested in getting the values of the PMF, or discrete PDF, then, given x some RV with some distribution,
my_pmf = hist(x)/numel(x); % bin counts divided by the number of samples
So try,
doc hist
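An empirical PMF is the count of each observed value divided by the number of observations. Here is a minimal Python sketch of that idea, using simulated fair-die rolls as a hypothetical sample; it is not a translation of any particular MATLAB call.

```python
from collections import Counter
import random

random.seed(0)
# Hypothetical discrete sample: 10,000 rolls of a fair six-sided die.
x = [random.randint(1, 6) for _ in range(10_000)]

counts = Counter(x)
# Dividing each count by the number of observations makes the PMF sum to 1.
pmf = {value: count / len(x) for value, count in sorted(counts.items())}
print(pmf)
```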
Let's say I have two random variables, x and y, each with n observations. I've used a forecasting method to estimate x_{n+1} and y_{n+1}, and I also have the standard error of each estimate. What are the formulas for the standard errors of x_{n+1} + y_{n+1}, x_{n+1} - y_{n+1}, x_{n+1} * y_{n+1}, and x_{n+1} / y_{n+1}, so that I can calculate prediction intervals for the four combinations? Any thoughts would be much appreciated. Thanks.
Well, the general topic you need to look at is called "change of variables" in mathematical statistics.
The density function for a sum of random variables is the convolution of the individual densities (but only if the variables are independent). Likewise for the difference. In special cases, that convolution is easy to find. For example, for Gaussian variables the density of the sum is also a Gaussian.
For product and quotient, there aren't any simple results, except in special cases. For those, you might as well compute the result directly, maybe by sampling or other numerical methods.
If your variables x and y are not independent, that complicates the situation. But even then, I think sampling is straightforward.
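Here is a minimal sketch of the sampling approach in Python. It assumes, purely for illustration, that the two forecasts are independent and Gaussian; the point estimates and standard errors are made-up numbers. For the sum and difference under these assumptions, the sampled standard error should land near sqrt(se_x^2 + se_y^2).

```python
import random
import statistics

random.seed(0)

# Hypothetical forecasts and standard errors, assumed Gaussian and independent.
x_hat, se_x = 10.0, 1.0
y_hat, se_y = 5.0, 0.5

n = 20_000
xs = [random.gauss(x_hat, se_x) for _ in range(n)]
ys = [random.gauss(y_hat, se_y) for _ in range(n)]

combos = {
    "sum": [x + y for x, y in zip(xs, ys)],
    "difference": [x - y for x, y in zip(xs, ys)],
    "product": [x * y for x, y in zip(xs, ys)],
    "quotient": [x / y for x, y in zip(xs, ys)],
}

results = {}
for name, values in combos.items():
    se = statistics.stdev(values)
    # Empirical 95% interval from the 2.5th and 97.5th percentiles.
    q = statistics.quantiles(values, n=40)
    results[name] = (se, q[0], q[-1])
    print(f"{name}: se={se:.3f}, 95% interval=({q[0]:.3f}, {q[-1]:.3f})")
```

If x and y are correlated, the same loop works as long as the pairs are drawn jointly rather than independently.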