Quasi-Monte-Carlo vs. variable dimensionality? - montecarlo

I've been looking through the Matlab documention on using quasi-random sampling of N-dimensional unit cubes. This represents a problem with N stochastic parameters. Based on the fact that it is a unit cube, I presume that I need to use the inverse CDF of each parameter to map from the [0,1] domain to the value range of each parameter.
I would like to try this on a problem for which I now use Monte Carlo. Unfortunately, the problem I'm analyzing does not have a fixed number of dimensions. For each instantiation of the problem, I generate a variable number of widgets (say) using a Poisson distribution. Only after that do I randomly generate the parameters for each widget. That whole process yields one instance of the problem to be analyzed, so the number of parameters varies from one instance to the next.
Is this kind of problem still amenable to Quasi-Monte-Carlo?

What I used once was to get highest possible dimension of the problem d, generate Sobol sequence in d and use whatever number of points necessary for a particular sampling. I would say it helped somewhat...

From talking to a much smarter colleague, we need to consider the various combinations of widget counts for each widget type. For example, if we have 2 of widget type#1, 4 of widget type #2, 1 of widget type #3, etc., that constitutes one combination. QMC can be applied to that one combination. We are assuming that number of widget#i is independent of the number of widget#j for i<>j, so the probability of each combination is just the product of p(2 widgets of type#1), p(4 widgets of type#2), p(1 widget of type#3), etc. The individual probabilities are easy to get from their Poisson distributions (or their flat distributions, or whatever distribution is being used). If there are N widget types, this is just a joint PMF in N-space. This probability is then used to weight the QMC result for that particular combination. Note that even when the exactly combination is nailed down, QMC is still needed because there each widget is associated with 3 stochastic parameters.

Related

How to interpret Random Effects Plot from mgcv

I have a few questions regarding using a random effect in a GAM. First, how do you interpret and communicate the output graph?
I have fire modeled as a random effect in this GAM because it is largely a random occurrence at my different field sites and I only noted it as a binary. It wouldn't work as a normal variable since it has too few levels and there is also relatively few sites with fire. However, it greatly improved model variance capture when included so I don't want to simply exclude it. I don't know how to interpret the output and I am also not entirely confident that there wouldn't be another way to include it in the model other than as a random effect. Any help would be greatly appreciated!
The effect has been modelled as a random slope if you didn't code it as a factor in the data. The value on the y axis is the estimated slope; it will be a little smaller in absolute value than if you use Fire as a linear fixed effect in the model formula because it is being penalised (shrunk) towards zero.
This likely should have been fitted as a binary fixed effect; code Fire as a factor with two levels (Yes/No, or Burned / Unburned say). Just because a variable represents something that is random over the data doesn't mean it is a suitable random effect; fire here has some average effect and the fixed effect describes that well. There's nothing stopping you from using Fire coded as a factor as a random effect via the smooth, but with only two levels it's not going the two intercepts aren't going to be estimate that precisely.
Now, if you had repeated observations on n sites and you thought the Fire effect varied across the n sites then you could do s(Site, Fire, bs = 're') where both Site and Fire are factors and you'll get different Fire effects for each Site. Then the plot you show would have many points on it as it is a QQ-plot of the estimated values for the effect of Fire in each Site, hence 1 point per Site. Given the way this model is estimated, these are somewhat assumed to be distributed Gaussian with some variance that is inversely proportional to the smoothness parameter selected by gam() when fitting this random effect smoother. That's why the default plot is as it is; it's a QQ-plot comparing the observed distribution of estimate values of the random effects against the theoretical expectation.

Random Index from a Tensor (Sampling with Replacement from a Tensor)

I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and use numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to invert the flattened tensor back into its original shape. Further, flattening millions of weights would be a somewhat slow implementation.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about resevoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this pseudocode (in Python-like syntax):
def sample_tensor(tensor, chosen_index: int) -> Tuple[int]:
"""Maps a chosen random number to its index in the given tensor.
Args:
tensor: A ragged-array n-tensor.
chosen_index: An integer in [0, num_scalar_elements_in_tensor).
Returns:
The index that accesses this element in the tensor.
NOTE: Entirely untested, expect it to be fundamentally flawed.
"""
remaining = chosen_index
for (i, sub_list) in enumerate(tensor):
if type(sub_list) is an iterable:
if |sub_list| > remaining:
remaining -= |sub_list|
else:
return i joined with sample_tensor(sub_list, remaining)
else:
if len(sub_list) <= remaining:
return tuple(remaining)
First of all, I'm aware this isn't a sound algorithm. The idea is to count down until you reach your element, with bookkeeping for indices.
We need to make crucial assumptions here. 1) All lists will eventually contain only scalars. 2) By direct consequence, if a list contains lists, assume that it also doesn't contain scalars at the same level. (Stop and convince yourself for (2).)
We also need to make a critical note here too: We are unable to measure the number of scalars in any given list, unless the list is homogeneously consisting of scalars. In order to avoid measuring this magnitude at every point, my algorithm above should be refactored to descend first, and subtract later.
This algorithm has some consequences:
It's the fastest in its entire style of approaching the problem. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bound at Theta(l) where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's slow. It's too slow compared to sampling nicer tensors that have a defined shape to them.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write pseudocode for this, but lay out some objectives:
This can be called only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
This is also faster at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
I hope this helps.

What N ((1,0)T , I) mean related to Gaussian Distribution

Hi everyone I am reading a book "Element of Statistical Learning) and came across the below paragraph which i dont I understand. (explains how the training data was generated)
We generated 10 means mk from a bivariate Gaussian distribution N((0,1)T,I) and labeled this class as blue. Similraly, 10 more were drawn from from N((0,1)T,I) and labeled class Orange. Then for each class we generated 100 observations as follows: for each observation, we picked an mk at random with probability 1/10, and then generated a N(mk, I/5), thus leading to a mixture of Gaussian cluster for each class.
I would appreciate if you could explain the above paragraph and especially N((0,1)T,I)
by the way- (0,1) to the power of T for Transpose.
Is this notation mathmatically common or related to a specific computer language.
In the paragraph N stands for the Normal distribution; more specifically, in this case it stands for the Multivariate normal distribution. It is not specific to any programming languages. It comes from statistics and probability theory, but due to numerous appealing properties and important applications of this probability distribution it is also widely used in programming, so you should be able to perform the described procedure in any language.
The part (0,1)^T is a vector of means. That is, we have in mind a random vector of length two, where the first element on average is 0, and the second one on average is 1.
"I" stands for the 2x2 identity matrix whose role is the variance-covariance matrix. That is, the variance of both random vector components is 1 (i.e., the diagonal terms), while off-diagonal points are 0 and correspond to the covariance between the two random variables.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and discrete data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case in would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have i used RapidMiner at all.)
The other set of techniques i would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are two perhaps the most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigen-vector decomposition of the covariance matrix (derived from to your data matrix).
One direct result from this eigen-vector decomp is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course let's you access R inside RapidMiner.R has plenty of PCA libraries (Packages). The ones i mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, i can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but i mentioned it here because it's an excellent tutorial and the two techniques are used for the similar purposes.

Create CDF for Anderson Darling test for Octave forge Statistics package function

I am using Octave and I would like to use the anderson_darling_test from the Octave forge Statistics package to test if two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be "normal". This reference distribution will be the known distribution and taken from the help for the above function " 'If you are selecting from a known distribution, convert your values into CDF values for the distribution and use "uniform'. "
My question therefore is: how would I convert my data values into CDF values for the reference distribution?
Some background information for the problem: I have a vector of raw data values from which I extract the cyclic component (this will be the reference distribution); I then wish to compare this cyclic component with the raw data itself to see if the raw data is essentially cyclic in nature. If the the null hypothesis that the two are the same can be rejected I will then know that most of the movement in the raw data is not due to cyclic influences but is due to either trend or just noise.
If your data has a specific distribution, for instance beta(3,3) then
p = betacdf(x, 3, 3)
will be uniform by the definition of a CDF. If you want to transform it to a normal, you can just call the inverse CDF function
x=norminv(p,0,1)
on the uniform p. Once transformed, use your favorite test. I'm not sure I understand your data, but you might consider using a Kolmogorov-Smirnov test instead, which is a nonparametric test of distributional equality.
Your approach is misguided in multiple ways. Several points:
The Anderson-Darling test implemented in Octave forge is a one-sample test: it requires one vector of data and a reference distribution. The distribution should be known - not come from data. While you quote the help-file correctly about using a CDF and the "uniform" option for a distribution that is not built in, you are ignoring the next sentence of the same help file:
Do not use "uniform" if the distribution parameters are estimated from the data itself, as this sharply biases the A^2 statistic toward smaller values.
So, don't do it.
Even if you found or wrote a function implementing a proper two-sample Anderson-Darling or Kolmogorov-Smirnov test, you would still be left with a couple of problems:
Your samples (the data and the cyclic part estimated from the data) are not independent, and these tests assume independence.
Given your description, I assume there is some sort of time predictor involved. So even if the distributions would coincide, that does not mean they coincide at the same time-points, because comparing distributions collapses over the time.
The distribution of cyclic trend + error would not expected to be the same as the distribution of the cyclic trend alone. Suppose the trend is sin(t). Then it never will go above 1. Now add a normally distributed random error term with standard deviation 0.1 (small, so that the trend is dominant). Obviously you could get values well above 1.
We do not have enough information to figure out the proper thing to do, and it is not really a programming question anyway. Look up time series theory - separating cyclic components is a major topic there. But many reasonable analyses will probably be based on the residuals: (observed value - predicted from cyclic component). You will still have to be careful about auto-correlation and other complexities, but at least it will be a move in the right direction.

Resources