Most suitable search algorithm?

I'm now facing a problem and I'm not sure what the right solution is. I'll try to explain it, and I hope someone has some good solutions for me:
I have two big data arrays. One that I'm browsing, with something between 50^3 and 150^3 data samples (the base is usually between 50 and 100; 150 is a rare worst case).
For every sample, I want to make a query on another structure that is around the same size (so there is a huge number of total combinations; I can't explore them all).
The query can't be predicted exactly, but usually it is something like this:
the structure has fields A, B, C, D, E, F, G (EDIT: in total, it's something like 10 to 20 int fields),
and a query is something like:
10 < A < 20 and B > 100 and D > 200.
Yes, it's really close to SQL.
I thought about putting this in a database, but it would actually be a standalone database, and I can work in RAM to make it even faster (speed is an essential criterion).
I thought about trying GPGPU, but it seems like a terrible idea: even though the search itself can be done in parallel, retrieving an unpredictable number of results doesn't seem to be a good application for it (if someone can tell me whether my understanding is right, it would help me confirm that I should forgo this option).
EDIT: the number of results is unpredictable because of the nature of the queries, but it is quite low, since the purpose is to find a small number of well-suited combinations.
Then, since I could use a DB, why not build an in-RAM B-tree? It seems close to the solution, but is it? If it is, how should I build my indexes? Can I really do multidimensional indexes, since the searches will always be multidimensional? A UB-tree or R-tree could probably do the job (but my second data set can contain duplicates, so doesn't that make the R-tree inapplicable?).
The thing is, I'm not sure I properly understand all of these right now, so if one of you knows these trees (or GPGPU, or even solutions I didn't think of), perhaps you could let me know which one I should explore, learn, and implement?

GPGPU is not a suitable choice, because you are limited by the cards' memory capacity, and since you are not telling us the data size of these samples, I am assuming that a Titan X-tier card will not suffice. If you could go really wild with a Tesla or FirePro, it would actually be worth it, since you mention that speed really matters. But I am going to speculate that these things are out of your budget, and considering that you would have to learn CUDA or OpenCL to build something that will generally be a pain to port here and there, my take is "no".
You mention that you have an unpredictable number of results, and this is a bad thing. You should develop a formula that will roughly estimate the amount of space needed; otherwise it will be disappointing to have your program work on something for quite some time only to hit a capacity error or crash. On the other hand, if the RAM capacity is not sufficient, you could work "database style", fetching data from storage when needed (though this is quite bothersome to implement because of the scheduling involved).
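For a rough sense of scale, here is a back-of-the-envelope estimate in Python, using the worst case from the question (150^3 samples, 20 int fields) and assuming 4 bytes per field:
samples = 150 ** 3              # 3,375,000 samples in the worst case
bytes_per_sample = 20 * 4       # 20 int fields, assuming 4 bytes each
print(samples * bytes_per_sample / 2 ** 20, "MiB")   # roughly 257 MiB
So even the worst case should fit comfortably in RAM on commodity hardware.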
If you do have the time to go bespoke, here is a helpful link. Remember, you are going to stumble a lot, but when you make it you will have learnt a ton of stuff:
https://www.quora.com/What-are-some-fast-similarity-search-algorithms-and-data-structures-for-high-dimensional-vectors
In my opinion, an in-memory database is the easiest and at the same time most reliable thing to do without compromising on speed. Which one to use is up to you; I think MemSQL is a good one.
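If you want to prototype the in-memory-database idea before committing to a product like MemSQL, Python's built-in sqlite3 module can run a database entirely in RAM. A minimal sketch (the four-column schema and the sample rows are made up for illustration):
import sqlite3
conn = sqlite3.connect(":memory:")                     # database lives purely in RAM
conn.execute("CREATE TABLE samples (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
conn.execute("CREATE INDEX idx_a ON samples (a)")      # index the most selective field
rows = [(15, 150, 7, 250), (5, 300, 2, 100), (12, 120, 9, 400)]   # placeholder data
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
cur = conn.execute("SELECT * FROM samples WHERE a > 10 AND a < 20 AND b > 100 AND d > 200")
print(cur.fetchall())                                  # the two rows that satisfy the query
Whether this is fast enough for millions of rows per query is something you would have to benchmark against a dedicated in-memory database.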

Related

Testing for heteroskedasticity and autocorrelation in large unbalanced panel data

I want to test for heteroskedasticity and autocorrelation in a large unbalanced panel dataset.
I do so using the following code:
* Heteroskedasticity test
// iterated GLS with only heteroskedasticity produces
// maximum-likelihood parameter estimates
xtgls adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT, igls panels(heteroskedastic)
estimates store hetero
* Autocorrelation
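// find and install the community-contributed xtserial command (Stata Journal package st0039)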
findit xtserial
net sj 3-2 st0039
net install st0039
xtserial adjusted_volume ibn.rounded_time i.id i.TRD_EVENT_DT
Even though I use the computing power of a high-performance processing centre, this procedure takes more than 15 hours because of the iterative method.
What is the most efficient program to perform these tests using Stata?
This question is borderline off-topic and quite broad, but I suspect it is still of considerable interest to new users. As such, I will try here to consolidate our conversation in the comments into an answer.
In the future, I strongly advise refraining from highly subjective words such as 'best', which can mean different things to different people, or terms like 'efficient', which can have a different meaning in a different context.
It is also difficult to provide specific advice regarding the use of commands
when we know nothing about what you are trying to do.
In my view, the 'best' choice is the choice that gets the job done as accurately
as possible given the available data. Speed is an important consideration nowadays, but accuracy is still the most fundamental one. As you continue to use Stata, you will see that it has a considerable number of commands, often with overlapping functionality. Depending on the use case, sometimes opting for one implementation over another can be 'better', in the sense that it may be more practical or faster in achieving the desired end result.
Case in point: your comment in your previous post that the noconstant option is unavailable in rreg. In that particular context you can get a reasonably good alternative using regress with the vce(robust) option. In fact, this alternative may often be adequate for several use cases.
In this particular example, xtgls will be considerably faster if the igls option is not used. This will be especially true with larger and more 'difficult' datasets. In cases where MLE is necessary, the iterate option will allow you to specify a fixed number of iterations, which could speed things up, but this can be a recipe for disaster if you don't know what you are doing and is thus not recommended; the option is usually used for other purposes. However, is xtgls the only command you could use? Read here why this may in fact not necessarily be the case.
Regarding speed, Stata in general is slow, at least when the ado language is used, because it is an interpreted language. The only realistic option for speed gains here is parallelisation, if you have Stata MP. Even then, whether any gains are achieved will depend on a number of factors, including which command you use.
Finally, xtserial is a community-contributed command, something which you
fail to make clear in your question. It is customary and useful to provide this
information right from the start, so others know that you do not refer to an
official, built-in command.

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day, the GA only buys one from the list, and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA, since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters that will produce an optimal output for two different datasets. If we take this further, does the no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
-> The blue line is the GA output.
-> The red line is the training data (slightly different because of the aforementioned randomness).
-> The yellow line is the stubborn test data, which shows no generalization. In fact, this is the most flattering graph I could produce.
The y-axis is profit; the x-axis is the trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set, it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent: better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all: a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.
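To make the "stop early and vary the test cases" advice concrete, here is a generic GA skeleton in Python. It is only an illustrative sketch with a toy fitness and mutation, not the poster's trading GA; the two knobs that matter for over-fitting are generations and rotate_every.
import random
def fitness(individual, window):
    # toy stand-in: how closely the parameters track the current data slice
    return -sum((p - x) ** 2 for p, x in zip(individual, window))
def mutate(individual, rate=0.1):
    return [g + random.gauss(0, 1) if random.random() < rate else g for g in individual]
def evolve(windows, pop_size=50, genome_len=5, generations=200, rotate_every=10):
    population = [[random.gauss(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    window = random.choice(windows)
    for gen in range(generations):                     # stop at 200 generations, not 1,000
        if gen % rotate_every == 0:                    # periodically swap the training cases
            window = random.choice(windows)
        population.sort(key=lambda ind: fitness(ind, window), reverse=True)
        parents = population[: pop_size // 2]          # simple truncation selection
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return population[0]
windows = [[random.gauss(0, 1) for _ in range(5)] for _ in range(12)]   # stand-ins for 12 stocks
print(evolve(windows))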

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", and the same for "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, so you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem, then record all the typos the users make (typo = no direct match) and build a correction database offline that contains all the typo->fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from this.
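For reference, a minimal plain-Python Levenshtein implementation (libraries such as python-Levenshtein do the same thing much faster in C); the business names are taken from the question:
def levenshtein(a, b):
    prev = list(range(len(b) + 1))                    # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
names = ["best buy", "mcdonalds", "sony", "apple"]
print(min(names, key=lambda n: levenshtein("appel", n)))   # -> apple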
Soundex or Metaphone might work.
I think what you are looking for falls into the huge field of data quality and data cleansing. I'm not sure you will find a ready-made Python implementation for this, as it has to be something that cleanses a considerable amount of data in the database, which could be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use half matches as well.
One would be to use a subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you only relax one of the conditions, you get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
The other idea would be to use a trie for speed-up, which can run DTW/Levenshtein on multiple candidates simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so that you get partial matches. However, since you step down in the trie, you just need to check when you have fully consumed the input and then return all leaves.
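The relaxed end condition can be shown without the trie: take the standard Levenshtein recurrence but return the minimum of the last row, which gives the distance from the query to the best-matching prefix of each candidate. This is a sketch of the idea only, not the trie algorithm from the paper:
def prefix_edit_distance(query, candidate):
    prev = list(range(len(candidate) + 1))
    for i, cq in enumerate(query, 1):
        curr = [i]
        for j, cc in enumerate(candidate, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cq != cc)))
        prev = curr
    return min(prev)                                  # the match may end anywhere in candidate
names = ["best buy", "mcdonalds", "sony", "apple"]
print(min(names, key=lambda n: prefix_edit_distance("app", n)))   # -> apple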
Check out difflib: http://docs.python.org/library/difflib.html
It should help you.
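A quick example of the difflib suggestion, using the names from the question:
import difflib
names = ["best buy", "mcdonalds", "sony", "apple"]
print(difflib.get_close_matches("appel", names))      # -> ['apple']
print(difflib.get_close_matches("bst b", names))      # -> ['best buy']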

Generic graphing and charting solutions

I'm looking for a generic charting solution (ideally not a hosted one) that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd also like to extend the concept of a client identifier further. Taking the above example, I'd like to store statistics for each core separately, so another identifier would be Core 1/Core 2, etc.
Now, to make sure I'm clearly stating my problem: I don't want a utility that collects these statistics. I'd like something that stores them, but this is not mandatory either; I can always store them in MySQL or the like.
What I'm looking for is something that takes values such as these and charts them nicely, in a multitude of ways (timelines, motion, and the usual ones [pie, bar..]). Essentially, a nice visualization package that allows me to make use of all this data. I'd be collecting data from multiple services and multiple applications, and the datapoints will be of varying resolution. Some of the data will include multiple layers of nesting, some none. (For example, CPU would go down to server IP and CPU#, whereas memory would only be server IP, but would include a different identifier, i.e. free/used/cached, as the "secondary" identifier. Something like average request latency might not have a secondary identifier at all, as in the case of ping.) What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be great: adding an extra identifier on top of IP/CPU#, namely the process name. I think the advantages of that are obvious.
For some applications, we might collect data at a very narrow scope, focusing on every aspect; in other cases, it might be a more general statistic. When stuff goes wrong, both come in useful: the first to quickly say "something just went wrong", and the second to say "why?".
Further, it would be nice if the charting application threw out "bad" values. That is, if for some reason our monitoring program started to report values of 300% CPU used on a single core for 10 seconds, it would be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data layer though, so it's not a requirement at all.
Finally, comparing two points in time, or comparing two different client identifiers of the same service etc without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in (one of the following) PHP, Python, C/C++, C#, as these are languages I'm familiar with. It doesn't have to be open source, it doesn't have to be a library, I'm open to using whatever fits my purpose the best.
More of a P.S than a requirement: I'd like to have pretty charts that are easy for non-technical people to understand, and act upon too (and like looking at!).
I'm open to clarifying, and, in advance, thanks for your time!
I am pretty sure that Protovis meets all your requirements, but it has a bit of a learning curve. You are meant to learn by example, and there are plenty of examples to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like get rid of your "bad" values.
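On the "throw out bad values" point from the question: that kind of filtering is also easy to do at the data layer before anything reaches the charting library. A small illustrative sketch in Python (the thresholds and the median-window check are made-up examples, nothing Protovis-specific):
def clean_series(samples, low=0.0, high=100.0, window=5, tolerance=3.0):
    # samples: list of (timestamp, value) pairs; returns the filtered list
    kept = []
    for i, (ts, value) in enumerate(samples):
        if not (low <= value <= high):                # e.g. 300% usage on a single core
            continue
        neighbours = sorted(v for _, v in samples[max(0, i - window): i + window + 1])
        median = neighbours[len(neighbours) // 2]
        if abs(value - median) > tolerance * max(abs(median), 1.0):
            continue                                  # isolated spike, drop it
        kept.append((ts, value))
    return kept
print(clean_series([(0, 40.0), (1, 42.0), (2, 300.0), (3, 41.0)]))   # drops the 300% sample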

What is the best way to search multiple sources simultaneously?

I'm writing a phonebook search that will query multiple remote sources, but I'm wondering how best to approach this task.
The easiest way to do this is to take the query, start a thread per remote source (limiting max results to, say, 10), wait for the results from all threads, aggregate the list into a total of 10 entries, and return them.
BUT... which of the remote sources is more important if all sources return at least 10 results? I would then have to do a search on the search results. While this would yield accurate information, it seems inefficient and unlikely to scale up well.
Is there a solution commercial or open source that I could use and extend, or is there a clever algorithm I can use that I've missed?
Thanks
John, I believe what you want is federated search. I suggest you check out Solr as a framework for this. I agree with Nick that you will have to evaluate the relative quality of the different sources yourself, and build a merge function. Solr has some infrastructure for this, as this email thread shows.
To be honest, I haven't seen a ready solution, but this is why we programmers exist: to create a solution if one is not readily available :-)
The way I would do it is similar to what you describe: using threads. If this is a web application, then AJAX is your friend for speed and usability; for a desktop app, GUI representation is not even an issue.
It sounds like you can't determine or guess upfront which source is the best in terms of reliability, speed and number of results. So you need to set up your program so that it determines the best results on the fly. Let's say you have 10 data sources, and therefore 10 threads. When you fire up your threads, wait for the first one to return with results > 0. This is going to be your "master" result. As other threads return, you can compare them to your "master" result and add new results. There is really no way to avoid this if you want to provide unique results. You can start displaying results as soon as you have your first thread. You don't have to update your screen right away with all the new results as they come in, but if it takes some time the user may become agitated. You can just have some sort of indicator that shows that more results are available, if you have more than 10 for instance.
If you only have a few sources, like 10, and you limit the number of results per source you are waiting for to around 10, it really shouldn't take that much time to sort through them in any programming language. Also make sure you can recover if your remote sources are not available. If, let's say, you are waiting for all 10 sources to come back before displaying data, you may be in for a long wait if one of the sources is down.
The other approach is to "fool" the user, sort of like airfare search sites do, where they make you wait a few seconds while they collect and sort results. I really like Kayak.com's implementation, as it makes me feel like it's doing something, unlike some other sites.
Hope that helps.
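For what it's worth, the thread-per-source approach described above is only a few lines with Python's concurrent.futures. SOURCES and search_source are hypothetical stand-ins for the real remote back-ends, so treat this as a sketch rather than a drop-in solution:
from concurrent.futures import ThreadPoolExecutor, as_completed
SOURCES = ["ldap", "crm", "legacy_db"]                # hypothetical remote sources
def search_source(source, query, limit=10):
    # placeholder: call the real remote source here and return up to `limit` names
    return ["%s result %d from %s" % (query, i, source) for i in range(3)]
def federated_search(query, limit=10):
    results = []
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(search_source, s, query, limit) for s in SOURCES]
        for future in as_completed(futures):          # results arrive as each source finishes
            try:
                results.extend(future.result())
            except Exception:
                pass                                  # a broken source should not sink the search
    seen, merged = set(), []
    for name in results:                              # naive merge: dedupe, keep the first `limit`
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged[:limit]
print(federated_search("john"))
A production version would also want per-source timeouts and some per-source quality weighting in the merge step, as discussed above.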

Resources