Software to Tune/Calibrate Properties for Heuristic Algorithms - Linux

Today I read that there is a piece of software called WinCalibra (scroll a bit down) which can take a text file with properties as input.
This program can then optimize the input properties based on the output values of your algorithm. See this paper or the user documentation for more information (see link above; sadly the documentation is a zipped exe).
Do you know of other software which can do the same and runs under Linux? (Preferably Open Source.)
EDIT: Since I need this for a Java application: should I invest my research in Java libraries like gaul or watchmaker? The problem is that I don't want to roll my own solution, nor do I have time to do so. Do you have pointers to out-of-the-box applications like Calibra? (Internet searches weren't successful; I only found libraries.)
I decided to give away the bounty (otherwise no one would benefit), although I didn't find a satisfactory solution :-( (an out-of-the-box application).

Some kind of (Metropolis algorithm-like) probability selected random walk is a possibility in this instance. Perhaps with simulated annealing to improve the final selection. Though the timing parameters you've supplied are not optimal for getting a really great result this way.
It works like this:
1. You start at some point. Use your existing data to pick one that looks promising (like the highest value you've got). Set o to the output value at this point.
2. You propose a randomly selected step in the input space, and assign the output value there to n.
3. Accept the step (that is, update the working position) if (a) n > o, or (b) the new value is lower but a random number on [0,1) is less than f(n/o) for some monotonically increasing f() with range and domain on [0,1).
4. Repeat steps 2 and 3 as long as you can afford, collecting statistics at each step.
5. Finally compute the result. In your case an average of all points is probably sufficient.
Important frill: This approach has trouble if the space has many local maxima with deep dips between them, unless the step size is big enough to get past the dips; but big steps make the whole thing slow to converge. To fix this you do two things:
Do simulated annealing (start with a large step size and gradually reduce it, thus allowing the walker to move between local maxima early on, but trapping it in one region later on to accumulate precision results).
Use several (many, if you can afford it) independent walkers so that they can get trapped in different local maxima. The more you use, and the bigger the difference in output values, the more likely you are to find the best maximum.
This is not necessary if you know that you only have one, big, broad, nicely behaved local extreme.
Finally, the selection of f(). You can just use f(x) = x, but you'll get optimal convergence if you use f(x) = exp(-(1/x)).
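A minimal sketch of such a walker in Java, assuming a stand-in evaluate() for running the algorithm being tuned and reading back its output (a made-up two-parameter function here), a Gaussian step proposal, and a simple geometric cooling schedule. It tracks the best point seen, which is a variation on the statistics-collecting step above:

```java
import java.util.Random;

public class MetropolisWalker {
    // Hypothetical stand-in for running the algorithm being tuned and reading
    // back its output value for a given parameter vector (made-up function).
    static double evaluate(double[] params) {
        double x = params[0], y = params[1];
        return Math.exp(-(x * x + y * y)) + 0.5 * Math.exp(-((x - 2) * (x - 2) + y * y));
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] current = {3.0, 3.0};   // start from a promising point if you have one
        double o = evaluate(current);    // output value at the current point
        double step = 1.0;               // initial (large) step size
        double[] best = current.clone();
        double bestValue = o;

        for (int i = 0; i < 10_000; i++) {
            // Propose a randomly selected step in the input space.
            double[] proposal = current.clone();
            for (int d = 0; d < proposal.length; d++) {
                proposal[d] += step * rng.nextGaussian();
            }
            double n = evaluate(proposal);

            // Accept if better, or, when worse, with probability f(n/o) = exp(-(o/n)).
            if (n > o || rng.nextDouble() < Math.exp(-(o / n))) {
                current = proposal;
                o = n;
            }
            if (o > bestValue) {          // track the best point seen so far
                bestValue = o;
                best = current.clone();
            }
            step *= 0.9995;               // simulated annealing: slowly shrink the step
        }
        System.out.printf("best value %.4f at (%.3f, %.3f)%n", bestValue, best[0], best[1]);
    }
}
```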
Again, you don't have enough time for a great many steps (though if you have multiple computers, you can run separate instances to get the multiple walkers effect, which will help), so you might be better off with some kind of deterministic approach. But that is not a subject I know enough about to offer any advice.

There is a lot of genetic-algorithm-based software that can do exactly that. I wrote a PhD about it a decade or two ago.
A Google search for "genetic algorithms Linux" shows a load of starting points.

Intrigued by the question, I did a bit of poking around, trying to get a better understanding of the nature of CALIBRA, its standing in academic circles, and the existence of similar software or projects in the Open Source and Linux world.
Please be kind (and, please, edit directly or suggest edits) about the likely instances where my assertions are incomplete, inexact or even flat-out incorrect. While I work in related fields, I'm by no means an Operational Research (OR) authority!
The [algorithm] parameter tuning problem is relatively well defined. It is typically framed as a solution-search problem: the combination of all possible parameter values constitutes the solution space, and the aim of the tuning logic is to "navigate" [portions of] this space in search of an optimal (or locally optimal) set of parameters.
The optimality of a given solution is measured in various ways, and such metrics help direct the search. In the case of the parameter tuning problem, the quality of a given solution is measured, directly or through a function, from the output of the algorithm [i.e. the algorithm being tuned, not the algorithm of the tuning logic!].
Framed as a search problem, the discipline of algorithm parameter tuning doesn't differ significantly from other solution-search problems where the solution space is defined by something other than the parameters to a given algorithm. But because it works on algorithms which are themselves solutions of sorts, this discipline is sometimes referred to as metaheuristics or metasearch. (A metaheuristic approach can be applied to various algorithms.)
Certainly there are many features specific to the parameter tuning problem as compared to other optimization applications, but with regard to the solution searching per se, the approaches and problems are generally the same.
Indeed, while well defined, the search problem is generally still broadly unsolved, and is the object of active research in very many different directions, for many different domains. Various approaches offer mixed success depending on the specific conditions and requirements of the domain, and this vibrant and diverse mix of academic research and practical applications is a common trait to Metaheuristics and to Optimization at large.
So... back to CALIBRA...
By its own authors' admission, CALIBRA has several limitations:
Limit of 5 parameters, maximum
Requirement of a range of values for [some of ?] the parameters
Works better when the parameters are relatively independent (but... wait, when that is the case, isn't the whole search problem much easier ;-) )
CALIBRA is based on a combination of approaches, which are repeated in a sequence. A mix of guided search and local optimization.
The paper where CALIBRA was presented is dated 2006. Since then, there have been relatively few references to this paper and to CALIBRA at large. Its two authors have since published several other papers in various disciplines related to Operational Research (OR).
This may be indicative that CALIBRA hasn't been perceived as a breakthrough.

The state of the art in that area ("parameter tuning", "algorithm configuration") is the SPOT package in R. You can connect external fitness functions using a language of your choice. It is really powerful.
I am working on adapters for e.g. C++ and Java that simplify the experimental setup, which takes some getting used to in SPOT. The project goes under the name InPUT, and a first version of the tuning part will be up soon.

Related

Calculations of factorials

Working on a problem connected with analytic number theory, I want to make some simple computer experiments in order to examine some theoretical conjectures. The algorithms are very simple: they contain standard arithmetic operations and factorials, but I would like to find values depending on a parameter. For instance, if I understand correctly, the problem with such calculations in the WolframAlpha service is that I cannot type an expression depending on a parameter just once and then change the value of the parameter. But that is what I need. I am new to programming; long ago I used some old languages like Algol, but I am not aware of the modern situation with simple computer experiments. So, my goal is to calculate some simple expressions for multiple values of a parameter, preferably by installing some simple software or by using some online machinery. How could this be done?
If my question is perceived as off topic, I would much appreciate any further recommendations before it is closed.
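As a rough sketch of the kind of thing the question asks for: a short Java program that loops over parameter values and evaluates a factorial-based expression exactly with BigInteger. The expression (the central binomial coefficient) and the parameter range here are made up purely for illustration:

```java
import java.math.BigInteger;

public class FactorialExperiment {
    // Exact factorial using BigInteger, so large values do not overflow.
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Example expression depending on a parameter n: (2n)! / (n! * n!),
        // i.e. the central binomial coefficient (chosen only for illustration).
        for (int n = 1; n <= 20; n++) {
            BigInteger value = factorial(2 * n)
                    .divide(factorial(n).multiply(factorial(n)));
            System.out.println("n = " + n + " -> " + value);
        }
    }
}
```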

What statistical distribution is used to benchmark an algorithm?

I have benchmarked my algorithm; it ran 1000 times. Now I have all the time values, and at this point it would be interesting to know the mean, standard deviation, and median. The problem is that I don't know what the correct statistics are to use to estimate these parameters. I'm not sure about using the Normal distribution.
Learn about statistics. There are lots of books, guides, papers and introductions out there (1, 2, 3, 4).
There are also lots of libraries which implement standard statistical methods:
Java Commons Math,
C++ Libs,
and there are certainly lots of others for the language you use...
And also one last hint: for a quick (initial) result I often use Excel and its chart functions. It supports some statistical methods with which you can play around a bit to see in which direction you might continue...
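For example, with the Java Commons Math library mentioned above, a minimal sketch of computing mean, standard deviation and median could look like this (the run times below are made-up placeholder numbers):

```java
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class BenchmarkStats {
    public static void main(String[] args) {
        // Made-up example data: replace with your 1000 measured run times.
        double[] runTimesMs = {12.3, 11.8, 13.1, 12.7, 12.0, 14.2, 11.9, 12.5};

        DescriptiveStatistics stats = new DescriptiveStatistics();
        for (double t : runTimesMs) {
            stats.addValue(t);
        }

        System.out.println("mean   = " + stats.getMean());
        System.out.println("stddev = " + stats.getStandardDeviation());
        System.out.println("median = " + stats.getPercentile(50));
    }
}
```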
That really depends on what distribution your workload experiences, so this cannot be answered generically.
But there is a trick: go one step further and do several iterations, each consisting of N calls, and compute, say, the average time/throughput for each entire iteration. Then, for a large N and consistent workload behavior across the calls, the iteration scores are subject to the Central Limit Theorem, which can make them approximately normally distributed.
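A small sketch of that trick, with a placeholder runOnce() standing in for one call of the code being benchmarked: the calls are grouped into iterations of N, and only the per-iteration averages are kept as the scores to analyze:

```java
import java.util.ArrayList;
import java.util.List;

public class IterationBenchmark {
    // Hypothetical stand-in for one call of the algorithm being benchmarked.
    static void runOnce() {
        double x = 0;
        for (int i = 0; i < 100_000; i++) x += Math.sqrt(i);
    }

    public static void main(String[] args) {
        int iterations = 30;            // number of iteration scores to collect
        int callsPerIteration = 1_000;  // N: calls averaged into one score

        List<Double> iterationScoresMs = new ArrayList<>();
        for (int it = 0; it < iterations; it++) {
            long start = System.nanoTime();
            for (int c = 0; c < callsPerIteration; c++) {
                runOnce();
            }
            double elapsedMs = (System.nanoTime() - start) / 1e6;
            // Average time per call for this iteration; these per-iteration
            // averages are the values the Central Limit Theorem applies to.
            iterationScoresMs.add(elapsedMs / callsPerIteration);
        }
        System.out.println("iteration scores (ms/call): " + iterationScoresMs);
    }
}
```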

Data Parallelism in Ant Colony Optimization

I have been trying to understand how ACO optimization can be implemented with data parallelism. I have read some content after searching in Google. I only need the basic idea in a simple way. Most of the papers talk about everything else instead of the main thing in simple words.
What I have understood so far is that we make it run in parallel by using multi-tasking (threading). But I am not sure what each thread would do, or how we could separate the work into threads without causing trouble.
Does it mean that we should create a separate thread for each ant? But that would cause lots of threads to be created! So if there are 200 ants, then 200 threads?
I am still confused about this data parallelism topic in ACO. I would really love to hear, in simple words, how we would implement it in parallel.
A few simple ideas to run ACO in parallel
Since you have already read up on ACO, here are a few simple ideas on ways to run ACO in parallel. Rather than getting caught up in multiple threads and multi-tasking, it might be helpful to think in terms of 'parallel compute resources' at your disposal.
ACO is one case of Agent-Based Simulation (ABS), and ABS lends itself particularly well to parallelization.
Simple Options
Option 1. Run a full version of ACO in each of the parallel resources.
Code your ACO algorithm, run it in parallel fashion. (Since there is a stochastic element to the algorithm, you can then look for the 'best' solution for your problem.)
Option 2. To explore effects of varying ACO parameters
Like any simulation approach, any ACO implementation has a large number of runtime parameters: number of vertices, time to run, number of ants, pheromone evaporation rates, probability functions to choose path options, and many more. When you multiply these options, they add up to some large number of cases to be run. Divide up the work among your parallel compute resources.
The two options mentioned above are sometimes referred to as 'embarrassingly' parallel. Very easy to implement (think of it as a Design of Experiments) and you get back a whole matrix of results, and you can make conclusions by studying what effect the changes in the parameters had on the solution.
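A minimal sketch of option 1, assuming a placeholder runAco(seed) that stands in for one complete ACO run and returns the best tour length it found: independent runs are submitted to a thread pool and the best result is kept. (Option 2 is the same pattern, with different parameter sets instead of different seeds.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.*;

public class ParallelAcoRuns {
    // Hypothetical stand-in for one complete ACO run with its own random seed;
    // returns the best tour length it found (fake value for illustration).
    static double runAco(long seed) {
        Random rng = new Random(seed);
        return 100 + rng.nextDouble() * 10;
    }

    public static void main(String[] args) throws Exception {
        int parallelRuns = 8;
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        List<Future<Double>> results = new ArrayList<>();
        for (int i = 0; i < parallelRuns; i++) {
            final long seed = i;
            Callable<Double> fullRun = () -> runAco(seed);
            results.add(pool.submit(fullRun));
        }

        double best = Double.MAX_VALUE;
        for (Future<Double> f : results) {
            best = Math.min(best, f.get());  // keep the best of the independent runs
        }
        pool.shutdown();
        System.out.println("best tour length over " + parallelRuns + " runs: " + best);
    }
}
```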
Option with solution sharing
Option 3: Master-Slave approach, with Partial Solution sharing
Going up one more level in complexity, we can use each node to contribute its 'knowledge/findings' to the overall problem solution. This is sometimes called a master-slave approach. The master is trying to solve the overall problem (Could be TSP, or some similar complex problem) and each 'slave' is solving some aspect of it, but with some fairly simple algorithm. The idea is that when combined they produce powerful results.
After a certain number of iterations, the solutions are passed back and forth, with 'bad' solutions thrown out. Some variant of the Map-Shuffle-Reduce paradigm would do that. The master evaluates the current best solution, and that is transferred back to each 'slave' node (Example: the latest overall pheromone levels are given to all the slave nodes). The next round of solving resumes.
Option 3 has tons of nuanced variations, and some people spend their entire lives improving various aspects of it.
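As a very rough sketch of the master-slave idea, assuming a placeholder slaveRound() that stands in for one slave's batch of ACO iterations and returns its pheromone deposits: the master waits for all slaves each round, evaporates and merges the shared pheromone matrix, and the updated matrix is what the slaves read in the next round:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.*;

public class MasterSlaveAcoSketch {
    static final int CITIES = 20;

    // Hypothetical stand-in for one slave's batch of iterations: it reads the
    // current global pheromone levels and returns a local pheromone deposit.
    static double[][] slaveRound(double[][] globalPheromone, long seed) {
        Random rng = new Random(seed);
        double[][] delta = new double[CITIES][CITIES];
        delta[rng.nextInt(CITIES)][rng.nextInt(CITIES)] += 1.0;  // fake deposit
        return delta;
    }

    public static void main(String[] args) throws Exception {
        int slaves = 4, rounds = 10;
        double evaporation = 0.1;
        double[][] pheromone = new double[CITIES][CITIES];
        ExecutorService pool = Executors.newFixedThreadPool(slaves);

        for (int round = 0; round < rounds; round++) {
            List<Future<double[][]>> futures = new ArrayList<>();
            for (int s = 0; s < slaves; s++) {
                final long seed = round * 100L + s;
                Callable<double[][]> task = () -> slaveRound(pheromone, seed);
                futures.add(pool.submit(task));
            }
            // Wait for every slave's deposit before touching the shared matrix.
            List<double[][]> deltas = new ArrayList<>();
            for (Future<double[][]> f : futures) deltas.add(f.get());

            // Master step: evaporate, then merge the slaves' deposits; the
            // updated matrix is what the slaves read next round ("broadcast").
            for (int i = 0; i < CITIES; i++)
                for (int j = 0; j < CITIES; j++)
                    pheromone[i][j] *= (1 - evaporation);
            for (double[][] delta : deltas)
                for (int i = 0; i < CITIES; i++)
                    for (int j = 0; j < CITIES; j++)
                        pheromone[i][j] += delta[i][j];
        }
        pool.shutdown();
        System.out.println("done; pheromone[0][0] = " + pheromone[0][0]);
    }
}
```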
Hope some of these ideas help.

How to choose a Feature Selection Algorithm? - advice

Is there a research paper/book that I can read which can tell me, for the problem at hand, what sort of feature selection algorithm would work best?
I am trying to simply identify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book) but soon realised that for a similar problem various individuals have chosen different algorithms.
Although I can try frequency-based selection, mutual information, information gain and various other algorithms, the list seems endless, and I was wondering if there is a more efficient way than trial and error.
Any advice?
Have you tried the book I recommended upon your last question? It's freely available online and entirely about the task you are dealing with: Sentiment Analysis and Opinion Mining by Pang and Lee. Chapter 4 ("Extraction and Classification") is just what you need!
I did an NLP course last term, and it became pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.
There's quite a lot of research going on regarding this, some of it commercial and thus not open to the public. I can't point you to any research papers but the book we used for the course was this (google books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.
The only other thing I can point you towards is to try googling around, maybe in scholar.google.com for "sentiment analysis" or "opinion mining".
Have a look at the NLTK movie_reviews corpus. The reviews are already pos/neg categorized and might help you with training your classifier, although the language you find on Twitter is probably very different from that of movie reviews.
As a last note, please post any successes (or failures for that matter) here. This issue will come up later for sure at some point.
Unfortunately, there is no silver bullet for anything when dealing with machine learning. This is usually referred to as the "No Free Lunch" theorem. Basically a number of algorithms work for a problem, and some do better on some problems and worse on others. Overall, they all perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse for a given data set. For a different data set, the situation could be completely reversed.
Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.

How do I calculate these statistics?

I'm writing an app to help facilitate some research, and part of this involves doing some statistical calculations. Right now, the researchers are using a program called SPSS. Part of the output that they care about looks like this:
They're really only concerned about the F and Sig. values. My problem is that I have no background in statistics, and I can't figure out what the tests are called, or how to calculate them.
I thought the F value might be the result of the F-test, but after following the steps given on Wikipedia, I got a result that was different from what SPSS gives.
This website might help you out a bit more. Also this one.
I'm working from a fairly rusty memory of a statistics course, but here goes nothing:
When you're doing analysis of variance (ANOVA), you calculate the F statistic as the ratio of the mean-square variance "between the groups" to the mean-square variance "within the groups". The second link above seems pretty good for this calculation.
This makes the F statistic measure exactly how powerful your model is, because the "between the groups" variance is explanatory power, and "within the groups" variance is random error. High F implies a highly significant model.
As in many statistical operations, you back-determine Sig. using the F statistic. Here's where your Wikipedia information comes in slightly handy. What you want to do is - using the degrees of freedom given to you by SPSS - find the proper P value at which an F table will give you the F statistic you calculated. The P value where this happens [F(table) = F(calculated)] is the significance.
Conceptually, a lower significance value shows a very strong ability to reject the null hypothesis (which for these purposes means to determine your model has explanatory power).
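To illustrate the calculation this answer describes (whether or not it is the right test for your particular SPSS output, which later answers suggest is multivariate), here is a small Java sketch with made-up group data: it computes the between/within mean squares, forms F as their ratio, and gets Sig. from the upper tail of the F distribution using Apache Commons Math:

```java
import org.apache.commons.math3.distribution.FDistribution;

public class AnovaSketch {
    public static void main(String[] args) {
        // Made-up example data: three groups of measurements.
        double[][] groups = {
            {23, 25, 21, 24},
            {30, 28, 31, 29},
            {22, 26, 24, 23}
        };

        int k = groups.length;   // number of groups
        int n = 0;               // total number of observations
        double grandSum = 0;
        for (double[] g : groups) { n += g.length; for (double v : g) grandSum += v; }
        double grandMean = grandSum / n;

        // Sums of squares between groups and within groups.
        double ssBetween = 0, ssWithin = 0;
        for (double[] g : groups) {
            double mean = 0;
            for (double v : g) mean += v;
            mean /= g.length;
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            for (double v : g) ssWithin += (v - mean) * (v - mean);
        }

        int dfBetween = k - 1, dfWithin = n - k;
        double f = (ssBetween / dfBetween) / (ssWithin / dfWithin);

        // Sig. is the upper-tail probability of the F distribution at the computed F.
        double sig = 1.0 - new FDistribution(dfBetween, dfWithin).cumulativeProbability(f);
        System.out.printf("F(%d, %d) = %.3f, Sig. = %.4f%n", dfBetween, dfWithin, f, sig);
    }
}
```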
Sorry to any math folks if any of this is wrong. I'll be checking back to make edits!!!
Good luck to you. Stats is fun, just maybe not this part. =)
I assume from your question that your research colleagues want to automate the process by which certain statistical analyses are performed (i.e., they want to batch process data sets). You have two options:
1) SPSS is now scriptable through Python (as of version 15) - go to spss.com and search for Python. You can write Python scripts to automate data analyses and extract key values from pivot tables, and then process the answers any way you like. This has the virtue of allowing an exact comparison between the results from your Python script and the hand-calculated efforts in SPSS of your collaborators. Thus you won't have to really know any statistics to do this work (which is a key advantage).
2) You could do this in R, a free statistics environment, which could probably be scripted. This has the disadvantage that you will have to learn statistics to ensure that you are doing it correctly.
Statistics is hard :-). After a year of reading and re-reading books and papers I can only say with confidence that I understand the very basics of it.
You might wish to investigate ready-made libraries for whichever programming language you are using, because there are many gotchas in math in general and statistics in particular (rounding errors being an obvious example).
As an example you could take a look at the R project, which is both an interactive environment and a library which you can use from your C++ code, distributed under the GPL (i.e. if you are using it only internally and publishing only the results, you don't need to open your code).
In short: don't do this by hand, link/use existing software. And sain_grocen's answer is incorrect. :(
These are all tests for significance of parameter estimates that are typically used in Multivariate response Multiple Regressions. These would not be simple things to do outside of a statistical programming environment. I would suggest either getting the output from a pre-existing statistical program, or using one that you can link to and use that code.
I'm afraid that the first answer (sain_grocen's) will lead you down the wrong path. His explanation is likely of a special case of what you are actually dealing with. The ANOVA explained in his links is for a single-variate response in a balanced design. These aren't the F statistics you are seeing. The names in your output (Pillai's Trace, Hotelling's Trace, ...) are some of the available multivariate versions. They have F distributions under certain assumptions. I can't explain a textbook's worth of material here; I would advise you to start by looking at
"Applied Multivariate Statistical Analysis" by Johnson and Wichern
Can you explain more why SPSS itself isn't a fine solution to the problem? Is it that it generates pivot tables as output that are hard to manipulate? Is it the cost of the program?
F-statistics can arise from any number of particular tests. The F is just a distribution (loosely: a description of the "frequencies" of groups of values), like a Normal (Gaussian), or Uniform. In general they arise from ratios of variances. Opinion: many statisticians (myself included), find F-based tests to be unstable (jargon: non-robust).
The particular output statistics (Pillai's trace, etc.) suggest that the original analysis is a MANOVA example, which as other posters describe is a complicated, and hard to get right procedure.
I'm guessing also that, based on the MANOVA and the use of SPSS, this is a psychology or sociology project... if not, please enlighten me. It might be that other, simpler models would actually be easier to understand and more repeatable. Consult your local university statistical consulting group, if you have one.
Good luck!
Here's an explanation of MANOVA output, from a very good site on statistics and on SPSS:
Output with explanation:
http://faculty.chass.ncsu.edu/garson/PA765/manospss.htm
How and why to do MANOVA or multivariate GLM:
(same path as above, but terminating in '/manova.htm')
Writing software from scratch to calculate these outputs would be both lengthy and difficult;
there are lots of numerical problems and matrix inversions to do.
As Henry said, use Python scripts, or R. I'd suggest working with somebody who knows SPSS if scripting.
In addition, SPSS itself is capable of exporting the output tables to files using something called OMS.
A script within SPSS can do this.
Find out who in your research group knows SPSS and work with them.

Resources