How to get more performance out of automatic differentiation? - haskell

I am having a hard time optimizing a program that is relying on ads conjugateGradientDescent function for most of it's work.
Basically my code is a translation of an old papers code that is written in Matlab and C. I have not measured it, but that code is running at several iterations per second. Mine is in the order of minutes per iteration ...
The code is available in this repositories:
https://github.com/fhaust/aer
https://github.com/fhaust/aer-utils
The code in question can be run by following these commands:
$ cd aer-utils
$ cabal sandbox init
$ cabal sandbox add-source ../aer
$ cabal run learngabors
Using GHCs profiling facilities I have confirmed that the descent is in fact the part that is taking most of the time:
(interactive version here: https://dl.dropboxusercontent.com/u/2359191/learngabors.svg)
-s is telling me that productivity is quite low:
Productivity 33.6% of total user, 33.6% of total elapsed
From what I have gathered there are two things that might lead to higher performance:
Unboxing: currently I use a custom matrix implementation (in src/Data/SimpleMat.hs). This was the only way I could get ad to work with matrices (see: How to do automatic differentiation on hmatrix?). My guess is that by using a matrix type like newtype Mat w h a = Mat (Unboxed.Vector a) would achieve better performance due to unboxing and fusion. I found some code that has ad instances for unboxed vectors, but up to now I haven't been able to use these with the conjugateGradientFunction.
Matrix derivatives: In an email I just can't find at the moment Edward mentions that it would be better to use Forward instances for matrix types instead of having matrices filled with Forward instances. I have a faint idea how to achieve that, but have yet to figure out how I'd implement it in terms of ads type classes.
This is probably a question that is too wide to be answered on SO, so if you are willing to help me out here, feel free to contact me on Github.

You are running into pretty much the worst-case scenario for the current ad library here.
FWIW- You won't be able to use the existing ad classes/types with "matrix/vector ad". It'd be a fairly large engineering effort, see https://github.com/ekmett/ad/issues/2
As for why you can't unbox: conjugateGradient requires the ability to use Kahn mode or two levels of forward mode on your functions. The former precludes it from working with unboxed vectors, as the data types carry syntax trees, and can't be unboxed. For various technical reasons I haven't figured out how to make it work with a fixed sized 'tape' like the standard Reverse mode.
I think the "right" answer here is for us to sit down and figure out how to get matrix/vector AD right and integrated into the package, but I confess I'm timesliced a bit too thinly right now to give it the attention it deserves.
If you get a chance to swing by #haskell-lens on irc.freenode.net I'd happy to talk about designs in this space and offer advice. Alex Lang has also been working on ad a lot and is often present there and may have ideas.

Related

Timetabling/Scheduling Library

I'm looking for a library to help me solve a constraint based logic problem where I need to schedule a number of different events of varying duration. The events have different attributes associated with them and my main issue is that I need to encode "preferences" based on these attributes. These preferences aren't hard constraints, but I would like to maximise how well they are satisfied in the solution. There are also different preferences of competing priorities.
I've taken a look at a few constraint solvers (Sat4j, clasp, Glucose, GlueMiniSat, etc.) but from what I've seen they all seem to only deal with fixed constraints, and setting up preferences would be non-trivial.
I don't care too much about what technology/language it's in - I'm happy to write a wrapper around it.
Absolutely, Choco Solver is a powerful Java constraint solver that is often used for scheduling and planning.
Let's take the following example:
"it would be nice if x = 10"
You can encode preferences in different ways.
1) through variables and constraints.
1.1) reify the constraint with a binary variables
ICF.arithm(x,"=",10).reifyWith(b);
it basically means b = 1 <=> x = 10 (so the constraint may or may not be satisfied), then you can maximise b (possibly with a weight)
1.2) through gap variables
solver.post(ICF.arithm(x,'-',gap,"=",10);
then you can minimise the absolute value of gap (possibly with a weight)
to the constraint.
2) through search : when solving the problem ask the search strategy to try x=10 before trying another value. There is not optimality proof but it works quite well in practice.
Hope this help. Please feel free to contact us for more support on Choco Solver www.cosling.com
best,
I think OptaPlanner is a tool that can help you to solve this problem, check this:
OptaPlanner is a constraint satisfaction solver. It optimizes business
resource planning. Every organization faces scheduling puzzles: assign
a limited set of constrained resources (employees, assets, time and
money) to provide products or services to customers. OptaPlanner
optimizes such planning problems to do more business with less
resources. Use cases include Vehicle Routing, Employee Rostering, Job
Scheduling, Bin Packing and many more.
OptaPlanner is a lightweight, embeddable planning engine. It enables
normal Java™ programmers to solve optimization problems efficiently.
Constraints apply on plain domain objects and can reuse existing code.
There’s no need to input difficult mathematical equations. Under the
hood, OptaPlanner combines sophisticated optimization heuristics and
metaheuristics (such as Tabu Search, Simulated Annealing and Late
Acceptance) with very efficient score calculation.
OptaPlanner is open source software, released under the Apache
Software License. It is written in 100% pure Java™, runs on any JVM
and is available in the Maven Central repository too.
Source:
http://www.optaplanner.org/
It's part of Drools, which has another interesting tools:
http://www.drools.org/
Another actively-maintained library is "choco-solver".
website
github
Another alternative is the Gecode Toolkit. It is an open-source and modern Constraint Programming Solver.

Is there any working implementation of reverse mode automatic differentiation for Haskell?

The closest-related implementation in Haskell I have seen is the forward mode at http://hackage.haskell.org/packages/archive/fad/1.0/doc/html/Numeric-FAD.html.
The closest related related research appears to be reverse mode for another functional language related to Scheme at http://www.bcl.hamilton.ie/~qobi/stalingrad/.
I see reverse mode in Haskell as kind of a holy grail for a lot of tasks, with the hopes that it could use Haskell's nested data parallelism to gain a nice speedup in heavy numerical optimization.
In response to this question, I've uploaded a package named ad to Hackage for handling reverse-mode automatic differentiation in Haskell.
Internally, it leverages a trick from Andy Gill's Kansas Lava to observe sharing in the tape it records for back propagation purposes, and uses type level branding to avoid confusing sensitivities.
I've tried to keep the API relatively close to that of Barak Pearlmutter and Jeffrey Mark Siskind's fad package, but I couldn't resist making a couple of minor tweaks here and there for generality.
I still need to go through and finish up the remaining unimplemented fad combinators, figure out a nice way to build a reverse-mode AD tower, validate that I didn't screw up my recollection of basic calculus, and provide a nice API for using this approach to get local reverse mode checkpoints in an otherwise forward mode AD program, but I am quite happy with how things have progressed thus far.
We have a bunch of forward mode AD implementations (I even have one in my monoids library!), but reverse mode AD for all of Haskell seems to be intractable.
Sadly while Pearlmutter and Siskind give a translation for a lambda calculus, it doesn't map into something you can do for arbitrary Haskell lambdas, you don't get the right introspection properties and given the way the shape of the types change in the translation you don't get something that is amenable to being packed into a monad, arrow, or other control structure.
I had a go at it via a series of email exchanges with Pearlmutter, but ultimately the best I was able to obtain was a reverse mode AD solution for a small EDSL in Haskell, and not a solution for Haskell itself.
Not that I'm aware of. I do know that some Haskell folks are interested in automatic differentiation, but some quick digging found little more than short asides mentioning the reverse mode; I expect you already found the same material I did.
I also note that the fad package and Stalingrad project you found are in fact the work of the same two people, and that at least Prof. Pearlmutter has posted to the haskell-cafe mailing list. You may want to consider contacting him directly about his work--it's possible he has something in progress, or hit serious obstacles while attempting to implement reverse-mode AD.
Sorry I couldn't turn up anything more useful; if someone else wants to dig further, at least the links above are a starting point.
I think forward is the way to go in Haskell. You shouldn't be able to do reverse mode on arbitrary functions, as Edward pointed out. But you responded that you should be able to do it on certain constrained functions. And said constraints can lead readily to forward mode. Eg. if you have a function:
foo :: Num a => a -> a -> a
Then you can instantiate a with a differentiable type, and thus differentiate foo in forward mode.
See the vector-space library on Hackage for very elegant forward mode automatic differentiation. It might not be entirely clear how to use it at first. Read the paper about it, Beautiful Differentiation by Conal Elliott.

Software to Tune/Calibrate Properties for Heuristic Algorithms

Today I read that there is a software called WinCalibra (scroll a bit down) which can take a text file with properties as input.
This program can then optimize the input properties based on the output values of your algorithm. See this paper or the user documentation for more information (see link above; sadly doc is a zipped exe).
Do you know other software which can do the same which runs under Linux? (preferable Open Source)
EDIT: Since I need this for a java application: should I invest my research in java libraries like gaul or watchmaker? The problem is that I don't want to roll out my own solution nor I have time to do so. Do you have pointers to an out-of-the-box applications like Calibra? (internet searches weren't successfull; I only found libraries)
I decided to give away the bounty (otherwise no one would have a benefit) although I didn't found a satisfactory solution :-( (out-of-the-box application)
Some kind of (Metropolis algorithm-like) probability selected random walk is a possibility in this instance. Perhaps with simulated annealing to improve the final selection. Though the timing parameters you've supplied are not optimal for getting a really great result this way.
It works like this:
You start at some point. Use your existing data to pick one that look promising (like the highest value you've got). Set o to the output value at this point.
You propose a randomly selected step in the input space, assign the output value there to n.
Accept the step (that is update the working position) if 1) n>o or 2) the new value is lower, but a random number on [0,1) is less than f(n/o) for some monotonically increasing f() with range and domain on [0,1).
Repeat steps 2 and 3 as long as you can afford, collecting statistics at each step.
Finally compute the result. In your case an average of all points is probably sufficient.
Important frill: This approach has trouble if the space has many local maxima with deep dips between them unless the step size is big enough to get past the dips; but big steps makes the whole thing slow to converge. To fix this you do two things:
Do simulated annealing (start with a large step size and gradually reduce it, thus allowing the walker to move between local maxima early on, but trapping it in one region later to accumulate precision results.
Use several (many if you can afford it) independent walkers so that they can get trapped in different local maxima. The more you use, and the bigger the difference in output values, the more likely you are to get the best maxima.
This is not necessary if you know that you only have one, big, broad, nicely behaved local extreme.
Finally, the selection of f(). You can just use f(x) = x, but you'll get optimal convergence if you use f(x) = exp(-(1/x)).
Again, you don't have enough time for a great many steps (though if you have multiple computers, you can run separate instances to get the multiple walkers effect, which will help), so you might be better off with some kind of deterministic approach. But that is not a subject I know enough about to offer any advice.
There are a lot of genetic algorithm based software that can do exactly that. Wrote a PHD about it a decade or two ago.
A google for Genetic Algorithms Linux shows a load of starting points.
Intrigued by the question, I did a bit of poking around, trying to get a better understanding of the nature of CALIBRA, its standing in academic circles and the existence of similar software of projects, in the Open Source and Linux world.
Please be kind (and, please, edit directly, or suggest editing) for the likely instances where my assertions are incomplete, inexact and even flat-out incorrect. While working in related fields, I'm by no mean an Operational Research (OR) authority!
[Algorithm] Parameter tuning problem is a relatively well defined problem, typically framed as one of a solution search problem whereby, the combination of all possible parameter values constitute a solution space and the parameter tuning logic's aim is to "navigate" [portions of] this space in search of an optimal (or locally optimal) set of parameters.
The optimality of a given solution is measured in various ways and such metrics help direct the search. In the case of the Parameter Tuning problem, the validity of a given solution is measured, directly or through a function, from the output of the algorithm [i.e. the algorithm being tuned not the algorithm of the tuning logic!].
Framed as a search problem, the discipline of Algorithm Parameter Tuning doesn't differ significantly from other other Solution Search problems where the solution space is defined by something else than the parameters to a given algorithm. But because it works on algorithms which are in themselves solutions of sorts, this discipline is sometimes referred as Metaheuristics or Metasearch. (A metaheuristics approach can be applied to various algorihms)
Certainly there are many specific features of the parameter tuning problem as compared to the other optimization applications but with regard to the solution searching per-se, the approaches and problems are generally the same.
Indeed, while well defined, the search problem is generally still broadly unsolved, and is the object of active research in very many different directions, for many different domains. Various approaches offer mixed success depending on the specific conditions and requirements of the domain, and this vibrant and diverse mix of academic research and practical applications is a common trait to Metaheuristics and to Optimization at large.
So... back to CALIBRA...
From its own authors' admission, Calibra has several limitations
Limit of 5 parameters, maximum
Requirement of a range of values for [some of ?] the parameters
Works better when the parameters are relatively independent (but... wait, when that is the case, isn't the whole search problem much easier ;-) )
CALIBRA is based on a combination of approaches, which are repeated in a sequence. A mix of guided search and local optimization.
The paper where CALIBRA was presented is dated 2006. Since then, there's been relatively few references to this paper and to CALIBRA at large. Its two authors have since published several other papers in various disciplines related to Operational Research (OR).
This may be indicative that CALIBRA hasn't been perceived as a breakthrough.
State of the art in that area ("parameter tuning", "algorithm configuration") is the SPOT package in R. You can connect external fitness functions using a language of your choice. It is really powerful.
I am working on adapters for e.g. C++ and Java that simplify the experimental setup, which requires some getting used to in SPOT. The project goes under name InPUT, and a first version of the tuning part will be up soon.

Using Polymorphic Code for Legitimate Purposes?

I recently came across the term Polymorphic Code, and was wondering if anyone could suggest a legitimate (i.e. in legal and business appropriate software) reason to use it in a computer program? Links to real world examples would be appreciated!
Before someone answers, telling us all about the benefits of polymorphism in object oriented programming, please read the following definition for polymorphic code (taken from Wikipedia):
"Polymorphic code is code that uses a polymorphic engine to mutate while keeping the original algorithm intact. That is, the code changes itself each time it runs, but the function of the code in whole will not change at all."
Thanks, MagicAndi.
Update
Summary of answers so far:
Runtime optimization of the original code
Assigning a "DNA fingerprint" to each individual copy of an application
Obfuscate a program to prevent reverse-engineering
I was also introduced to the term 'metamorphic code'.
Runtime optimization of the original code, based on actual performance statistics gathered when running the application in its real environment and real inputs.
Digitally watermarking music is something often done to determine who was responsible for leaking a track, for example. It makes each copy of the music unique so that copies can be traced back to the original owner, but doesn't affect the audible qualities of the track.
Something similar could be done for compiled software by running each individual copy through a polymorphic engine before distributing it. Then if a cracked version of this software is released onto the Internet, the developer might be able to tell who cracked it by looking for specific variations produced the polymorphic engine (a sort of DNA test). As far as I know, this technique has never been used in practice.
It's not exactly what you were looking for I guess, since the polymorphic engine is not distributed with the code, but I think it's the closest to a legitimate business use you will find for this kind of technique.
Polymorphic code is a nice thing, but metamorphic is even nicer. To the legitimate uses: well, I can't think of anything other than anti-cracking and copy protection. Look at vx.org.ua if you wan't real world uses (not that legitimate though)
As Sami notes, on-the-fly optimisation is an excellent application of polymorphic code. A great example of this is the Fastest Fourier Transform in the West. It has a number of solvers at its disposal, which it combines with self-profiling to adjust the code path and solver parameters on subsequent executions. The result is the program optimises itself for your computing environment, getting faster with subsequent runs!
A related idea that may possibly be of interest is computational steering. This is the practice of altering the execution path of large simulations as the run proceeds, to focus on areas of interest to the researcher. The overall purpose of the simulation is not changed, but the feedback cycle acts to optimise the calculation. In this case the executable code is not being explicitly rewritten, but the effect from a user perspective is similar.
Polymorph code can be used to obfuscate weak or proprietary algorithms - that may use encryption e. g.. There're many "legitimate" uses for that. The term legitimate these days is kind of narrow-minded when it comes to IT. The core-paradigms of IT contain security. Whether you use polymorph shellcode in exploits or detect such code with an AV scanner. You have to know about it.
Obfuscate a program i.e. prevent reverse-engineering: goal being to protect IP (Intellectual Property).

How do I calculate these statistics?

I'm writing an app to help facilitate some research, and part of this involves doing some statistical calculations. Right now, the researchers are using a program called SPSS. Part of the output that they care about looks like this:
They're really only concerned about the F and Sig. values. My problem is that I have no background in statistics, and I can't figure out what the tests are called, or how to calculate them.
I thought the F value might be the result of the F-test, but after following the steps given on Wikipedia, I got a result that was different from what SPSS gives.
This website might help you out a bit more. Also this one.
I'm working from a fairly rusty memory of a statistics course, but here goes nothing:
When you're doing analysis of variance (ANOVA), you actually calculate the F statistic as the ratio from the mean-square variances "between the groups" and the mean-square variances "within the groups". The second link above seems pretty good for this calculation.
This makes the F statistic measure exactly how powerful your model is, because the "between the groups" variance is explanatory power, and "within the groups" variance is random error. High F implies a highly significant model.
As in many statistical operations, you back-determine Sig. using the F statistic. Here's where your Wikipedia information comes in slightly handy. What you want to do is - using the degrees of freedom given to you by SPSS - find the proper P value at which an F table will give you the F statistic you calculated. The P value where this happens [F(table) = F(calculated)] is the significance.
Conceptually, a lower significance value shows a very strong ability to reject the null hypothesis (which for these purposes means to determine your model has explanatory power).
Sorry to any math folks if any of this is wrong. I'll be checking back to make edits!!!
Good luck to you. Stats is fun, just maybe not this part. =)
I assume from your question that your research colleagues want to automate the process by which certain statistical analyses are performed (i.e., they want to batch process data sets). You have two options:
1) SPSS is now scriptable through python (as of version 15) - go to spss.com and search for python. You can write python scripts to automate data analyses and extract key values from pivot tables, and then process the answers any way you like. This has the virtue of allowing an exact comparison between the results from your python script and the hand-calculated efforts in SPSS of your collaborators. Thus you won't have to really know any statistics to do this work (which is a key advantage)
2) You could do this in R, a free statistics environment, which could probably be scripted. This has the disadvantage that you will have to learn statistics to ensure that you are doing it correctly.
Statistics is hard :-). After a year of reading and re-reading books and papers and can only say with confidence that I understand the very basics of it.
You might wish to investigate ready-made libraries for whichever programming language you are using, because they are many gotcha's in math in general and statistics in particular (rounding errors being an obvious example).
As an example you could take a look at the R project, which is both an interactive environment and a library which you can use from your C++ code, distributed under the GPL (ie if you are using it only internally and publishing only the results, you don't need to open your code).
In short: don't do this by hand, link/use existing software. And sain_grocen's answer is incorrect. :(
These are all tests for significance of parameter estimates that are typically used in Multivariate response Multiple Regressions. These would not be simple things to do outside of a statistical programming environment. I would suggest either getting the output from a pre-existing statistical program, or using one that you can link to and use that code.
I'm afraid that the first answer (sain_grocen's) will lead you down the wrong path. His explanation is likely of a special case of what you are actually dealing with. The anova explained in his links is for a single variate response, in a balanced design. These aren't the F statistics you are seeing. The names in your output (Pillai's Trace, Hotelling's Trace,...) are some of the available multivariate versions. They have F distributions under certain assumptions. I can't explain a text books worth of material here, I would advise you to start by looking at
"Applied Multivariate Statistical Analysis" by Johnson and Wichern
Can you explain more why SPSS itself isn't a fine solution to the problem? Is it that it generates pivot tables as output that are hard to manipulate? Is it the cost of the program?
F-statistics can arise from any number of particular tests. The F is just a distribution (loosely: a description of the "frequencies" of groups of values), like a Normal (Gaussian), or Uniform. In general they arise from ratios of variances. Opinion: many statisticians (myself included), find F-based tests to be unstable (jargon: non-robust).
The particular output statistics (Pillai's trace, etc.) suggest that the original analysis is a MANOVA example, which as other posters describe is a complicated, and hard to get right procedure.
I'm guess also that, based on the MANOVA, and the use of SPSS, this is a psychology or sociology project... if not please enlighten. It might be that other, simpler models might actually be easier to understand and more repeatable. Consult your local university statistical consulting group, if you have one.
Good luck!
Here's an explanation of MANOVA ouptput, from a very good site on statistics and on SPSS:
Output with explanation:
http://faculty.chass.ncsu.edu/garson/PA765/manospss.htm
How and why to do MANOVA or multivariate GLM:
(same path as above, but terminating in '/manova.htm')
Writing software from scratch to calculate these outputs would be both lengthy and difficult;
there's lots of numerical problems and matrix inversions to do.
As Henry said, use Python scripts, or R. I'd suggest working with somebody who knows SPSS if scripting.
In addition, SPSS itself is capable of exporting the output tables to files using something called OMS.
A script within SPSS can do this.
Find out who in your research group knows SPSS and work with them.

Resources