The Readme of HLearn states that the Monoid typeclass is used for parallel batch training. I've seen trainMonoid mentioned in several files, but I'm having a difficulty to dissect this huge codebase. Could someone explain in beginner-friendly terms how does it work? I guess it's somehow related to the associativity property.
It's explained in this article which is linked in the page you linked in the question. Since you want a beginner friendly description I'll give you a very high level description of what I understood after reading the article. Consider this as a rough overview of the idea, to understand exactly everything you have to study the articles.
The basic idea is to use algebraic properties to avoid re-doing the same work over and over. They do it by using the associativity of the monoidal operation and homomorphisms.
Given two sets A and B with two binary operations + and * an homomorphism is a function f: A -> B such that f(x + y) = f(x) * f(y), i.e. it's a function that preserves the structure between the two sets.
In the case of that article the function f is basically the function that maps the sets of inputs to the trained models.
So the idea is that you can divide the input data into different portions x and y, and instead of having to compute the model of the whole thing as in T(x + y) you can do the training on just x and y and then combine the results: T(x) * T(y).
Now this as it is doesn't really help that much but, in training you often repeat work. For example in cross validation you, for k times, sample the data into a set of inputs for the trainer and a set of data used to test the trainer. But this means that in these k iterations you are executing T over the same portions of inputs multiple times.
Here monoids come into play: you can first split the domain into subsets, and compute T over these subsets, and then to compute the result of cross validation you can just put together the results from the corresponding subsets.
To give an idea: if the data is {1,2,3,4} and k = 3 instead of doing:
T({1,2}) plus test on {3, 4}
T({1,3}) plus test on {2, 4}
T({1,4}) plus test on {2, 3}
Here you can see that we trained over 1 for three times. Using the homomorphism we can compute T({1}) once and then combine the result with the other partial result to obtain the final trained model.
The correctness of the final result is assured by the associativity of the operations and the homomorphism.
The same idea can be applied when parallelizing: you divide the inputs into k groups, perform the training in parallel and then compound the results: T(x_1 + x_2 + ... + x_k) = T(x_1) * T(x_2) * ... * T(x_k) where the T(x_i) calls are performed completely in parallel and only at the end you have to compound the results.
Regarding online training algorithms, the idea is that given a "batch" training algorithm T you can make it into an online one by doing:
T_O(m, d) = m * T(d)
where m is an already trained model (which generally will be the trained model until that point) and d is the new data point you add for training.
Again the exactness of the result is due to the homorphism that tells you that if m = T(x) then m * T(d) = T(x+d), i.e. the online algorithm gives the same result of the batch algorithm with all those data points.
The more interesting (and complex) part of all of this is how can you see a training task as such a homomorphism etc. I'll leave that to your personal study.
Related
I am new to using linear mixed-effects models. I have a dataset where participants (ID, N = 7973) completed two experimental conditions (A and B). A subset of participants are siblings and thus nested in families (famID, N = 6908).
omnibus_model <- lmer(Outcome ~ Var1*Var2*Cond + (Cond|ID) + (1|famID), data=df)
The omnibus model converges and indicates a significant three way interaction between Var1, Var2 and Cond. As a post-hoc, to better understand what is driving the omnibus model effect, I subsetted the data so that there is only one observation per ID.
condA <- df[which(df$condition=='A'),]
condA_model <- lmer(Outcome ~ Var1*Var2 + (1|famID), data=condA)
condB <- df[which(df$condition=='B'),]
condB_model <- lmer(Outcome ~ Var1*Var2 + (1|famID), data=condB)
condA_model converges; condB_model does not. In condB_model "famID (Intercept)" variance is estimated at 0. In the condA_model, I get a small, but non-zero estimate (variance=0.001479). I know I could get an estimate of the fixed effect of interest in condition A versus B by a different method(such as randomly selecting one sibling per family for the analysis and not using random effects), but I am concerned that this differential convergence pattern may indicate differences between the conditions that would influence the interpretation of the omnibus model effect.
What difference in the two conditions could causing the model in one subset not to converge? How would I test for the possible differences in my data? Shouldn't the random effect of famID be identical in both subsets and thus equally able to be estimated in both post-hoc models?
As a post-hoc, to better understand what is driving the omnibus model effect, I subsetted the data so that there is only one observation per ID.
This procedure does not make sense.
What difference in the two conditions could causing the model in one subset not to converge?
There are many reasons. For one thing, these reduced datasets are, well, reduced, ie smaller, so there is far less statistical power to detect the "effects" that you are interested in, such as a variance of a random effect. In such cases, it may be estimated as zero and result in a singular fit.
Shouldn't the random effect of famID be identical in both subsets and thus equally able to be estimated in both post-hoc models?
No, these are completely different models, since the underlying data are different. There is no reason to expect the same estimates from both models.
I want to have a measure of similarity between two points in a cluster.
Would the similarity calculated this way be an acceptable measure of similarity between the two datapoint?
Say I have to vectors: vector A and vector B that are in the same cluster. I have trained a cluster which is denoted by model and then model.computeCost() computes thesquared distance between the input point and the corresponding cluster center.
(I am using Apache Spark MLlib)
val costA = model.computeCost(A)
val costB = model.computeCost(B)
val dissimilarity = |cost(A)-cost(B)|
Dissimilarity i.e. the higher the value, the more unlike each other they are.
If you are just asking is this a valid metric then the answer is almost, it is a valid pseudometric if only .computeCost is deterministic.
For simplicity i denote f(A) := model.computeCost(A) and d(A, B) := |f(A)-f(B)|
Short proof: d is a L1 applied to an image of some function, thus is a pseudometric itself, and a metric if f is injective (in general, yours is not).
Long(er) proof:
d(A,B) >= 0 yes, since |f(A) - f(B)| >= 0
d(A,B) = d(B,A) yes, since |f(A) - f(B)| = |f(B) - f(A)|
d(A,B) = 0 iff A=B, no, this is why it is pseudometric, since you can have many A != B such that f(A) = f(B)
d(A,B) + d(B,C) <= d(A,C), yes, directly from the same inequality for absolute values.
If you are asking will it work for your problem, then the answer is it might, depends on the problem. There is no way to answer this without analysis of your problem and data. As shown above this is a valid pseudometric, thus it will measure something decently behaving from mathematical perspective. Will it work for your particular case is completely different story. The good thing is most of the algorithms which work for metrics will work with pseudometrics as well. The only difference is that you simply "glue together" points which have the same image (f(A)=f(B)), if this is not the issue for your problem - then you can apply this kind of pseudometric in any metric-based reasoning without any problems. In practise, that means that if your f is
computes the sum of squared distances between the input point and the corresponding cluster center
this means that this is actually a distance to closest center (there is no summation involved when you consider a single point). This would mean, that 2 points in two separate clusters are considered identical when they are equally far away from their own clusters centers. Consequently your measure captures "how different are relations of points and their respective clusters". This is a well defined, indirect dissimilarity computation, however you have to be fully aware what is happening before applying it (since it will have specific consequences).
Your "cost" is actually the distance to the center.
Points that have the same distance to the center are considered to be identical (distance 0), which creates a really odd pseudonetric, because it ignores where on the circle of that distance points are.
It's not very likely this will work on your problem.
I have a set of data I have acquired from simulations. There are 3 parameters that go into my simulations and I get one result out.
I can graph the data from the small subset i have and see the trends for each input, but I need to be able to extrapolate this and get some form of a regression equation seeing as the simulation takes a long time.
In matlab or excel, is it possible to list the inputs and outputs to obtain a 4 parameter regression line for a given set of information?
Before this gets flagged as a duplicate, i understand polyfit will give me an equation of best fit and will be as accurate as i want it, but i need the equation to correspond to the inputs, not just a regression line.
In other words if i 20 simulations of inputs a, b, c and output y, is there a way to obtain a "best fit":
y=B0+B1*a+B2*b+B3*c
using the data?
My usual recommendation for higher-dimensional curve fitting is to pose the problem as a minimization problem (that may be unneeded here with the nice linear model you've proposed, but I'm a hammer-nail guy sometimes).
It starts by creating a correlation function (the functional form you think maps your inputs to the output) given a vector of fit parameters p and input data xData:
correl = #(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:2) + p(4)*xData(:,3)
Then you need to define a function to minimize given the parameter vector, which I call the objective; this is typically your correlation minus you output data.
The details of this function are determined from the solver you'll use (see below).
All of the method need a starting vector pGuess, which is dependent on the trends you see.
For nonlinear correlation function, finding a good pGuess can be a trial but necessary for a good solution.
fminsearch
To use fminsearch, the data must be collapsed to a scalar value using some norm (2 here):
x = [a,b,c]; % your input data as columns of x
objective = #(p) norm(correl(p,x) - y,2);
p = fminsearch(objective,pGuess); % you need to define a good pGuess
lsqnonlin
To use lsqnonlin (which solves the same problem as above in different ways), the norm-ing of the objective is not needed:
objective = #(p) correl(p,x) - y ;
p = lsqnonlin(objective,pGuess); % you need to define a good pGuess
(You can also specify lower and upper bounds on the parameter solution, which is nice.)
lsqcurvefit
To use lsqcurvefit (which is simply a wrapper for lsqnonlin), only the correlation function is needed along with the data:
p = lsqcurvefit(correl,pGuess,x,y); % you need to define a good pGuess
I've been trying to find an answer to this for months (to be used in a machine learning application), it doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (ie. doesn't involve very very large or very very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.
Unfortunately you can't do machine learning without knowing some basic math---it's like asking somebody for help in programming but not wanting to know about "variables" , "subroutines" and all that if-then stuff.
The better way to do this is called a Bayesian integration, but there is a simpler approximation called "maximum a postieri" (MAP). It's pretty much like the usual thinking except you can put in the prior distribution.
Fancy words, but you may ask, well where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it is answer that you get when you have "no prior". And the method below is the next level of sophistication up when you add a prior. Going to Bayesian integration would be the next one but that's harder and perhaps unnecessary.
As I understand it the problem is two fold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it gives a head theta fraction of the flips. But the theta for this coin comes from the master distribution which I guess I assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = hlog(theta) + tlog(1-theta) + -(theta-P)^2 / (2*S^2)) - 1/2 * log(2*pi*S^2)
By calculus to look for extrema you find the value of theta such that dLL/dtheta = 0.
Since the last term with the log has no theta in it you can ignore it.
dLL/dtheta = 0 = (h/theta) + (P-theta)/S^2 - (t/(1-theta)) = 0.
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, where you start with your proposed theta at the obvious (called maximum likelihood) estimate of theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the stuff above but don't put in the Gaussian prior: h/theta - t/(1-theta) = 0 you'll come up with theta = h/(h+t).
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it predicts some weight with negative probabilities, clearly wrong. More appropriate is a Gaussian prior on log theta ('lognormal distribution'). Plug it in the same way and work through the calculus.
You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, then this just becomes h / n. When h and n are small, this is just c * p / c = p. The choice of c is up to you. You can base it on s but in the end you have to decide how small is too small.
You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example one coin is always heads, the other always tails) knowing p and s would be useful. If it's more than a few, and especially if only some of the coins are only slightly weighted then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tail? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of 2 possible weightings? Do they follow a bell curve? etc.
It boils down to this: the differences between a weighted/unweighted coin, the distribution of weighted coins, and the number coins in your box will all decide what n has to be for you to solve this with a high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5 and all standard deviation will tell you is that at least some of the coins have a minor bias.
How many tosses, again, will be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get a strong confidence (about 2/3) detecting a .51/.49 split.
In general, what you are looking for is Maximum Likelihood Estimation. Wolfram Demonstration Project has an illustration of estimating the probability of a coin landing head, given a sample of tosses.
Well I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to put a little though into it. Others above have already suggested this, but perhaps if your like me you would prefer more verbosity.
In this lingo, you have a set of mutually-exclusive hypotheses, H, and some data D, and you want to find the (posterior) probabilities that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis that had the largest posterior probability (the MAP as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from only maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR info regarding which hypotheses are most likely, and you want to incorporate these priors.
So you have from basic probability Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(H|D) numerically by creating a series of discrete probabilities Hi for each hypothesis you wish to test, eg [0.0,0.05, 0.1 ... 0.95, 1.0], and then determining your prior Pr(H) for each Hi -- above it is assumed you have a normal distribution of priors, and if that is acceptable you could use the mean and stdev to get each Pr(Hi) -- or use another distribution if you prefer. With coin tosses the Pr(D|H) is of course determined by the binomial using the observed number of successes with n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting but we assume that we have covered all the bases with our hypotheses, so that Pr(D) is the summation of Pr(D|Hi)Pr(H) over all H.
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.
Overview
I have a multivariate timeseries of "inputs" of dimension N that I want to map to an output timeseries of dimension M, where M < N. The inputs are bounded in [0,k] and the outputs are in [0,1]. Let's call the input vector for some time slice in the series "I[t]" and the output vector "O[t]".
Now if I knew the optimal mapping of pairs <I[t], O[t]>, I could use one of the standard multivariate regression / training techniques (such as NN, SVM, etc) to discover a mapping function.
Problem
I do not know the relationship between specific <I[t], O[t]> pairs, rather have a view on the overall fitness of the output timeseries, i.e. the fitness is governed by a penalty function on the complete output series.
I want to determine the mapping / regressing function "f", where:
O[t] = f (theta, I[t])
Such that penalty function P(O) is minimized:
minarg P( f(theta, I) )
theta
[Note that the penalty function P is being applied the resultant series generated from multiple applications of f to the I[t]'s across time. That is f is a function of I[t] and not the whole timeseries]
The mapping between I and O is complex enough that I do not know what functions should form its basis. Therefore expect to have to experiment with a number of basis functions.
Have a view on one way to approach this, but do not want to bias the proposals.
Ideas?
... depends on your definition of optimal mapping and penalty function. I'm not sure if this is the direction you're taking, but here's a couple of suggestions:
For example you can find a mapping of the data from the higher dimensional space to a lower dimension space that tries to preserve the original similarity between data points (something like Multidimensional Scaling [MDS]).
Or you can prefer to map the data to a lower dimension that accounts for as much of the variability in the data as possible (Principal Component Analysis [PCA]).