Fast, scalable hash lookup database? (Berkeley-ish)

I use and love Berkeley DB, but it seems to bog down once you get near a million or so entries, especially on inserts. I've tried memcachedb, which works, but it's not being maintained, so I'm worried about using it in production. Does anyone have any other similar solutions? Basically, I want to be able to do key lookups on a large (possibly distributed) dataset (40+ million entries).
Note: Anything NOT in Java is a bonus. :-) It seems most things today are going the Java route.

Have you tried Project Voldemort?

I would suggest you have a look at:
Metabrew key-value store blog post
There is a big list of key-value stores there, with a little discussion of each. If you still have doubts, you could join the so-called NoSQL Google group and ask for help there.

Redis is insanely fast and actively developed. It is written in C (no Java) and compiles out of the box on POSIX systems (no dependencies).
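For a sense of the API, here is a minimal sketch using the redis-py client (the server address and key names are just placeholders):

    # Minimal sketch of key lookups with redis-py (pip install redis).
    # Assumes a Redis server on localhost:6379; the keys are made up.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("user:42", "alice")    # O(1) insert
    value = r.get("user:42")     # O(1) lookup; returns bytes or None
    print(value)                 # b'alice'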

Did you try the hash backend? That should be faster for inserts and key lookups.
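If you want to try it from Python, a rough sketch with the bsddb3 bindings (the file name is hypothetical):

    # Sketch: opening Berkeley DB with the hash backend via bsddb3
    # (pip install bsddb3). "c" creates the file if it is missing.
    import bsddb3

    db = bsddb3.hashopen("index.db", "c")
    db[b"some-key"] = b"some-value"
    print(db[b"some-key"])
    db.close()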


Is there a good auto-complete search library that I can use?

I already tried Cleo; I'm not convinced, mainly because of this bug that I recently encountered and have so far been unable to solve.
I also tried Elasticsearch, but it was too complex to run even a single query, and indexing and other features were pretty slow too.
So, does anyone know a better library, or am I missing something in these two? Thanks.
Have you tried the Completion Suggester API? It is not complex and certainly not slow.
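For reference, a rough sketch of the suggester over the REST API (the index name "products" and field name "suggest" are made up, and the exact request shape varies a bit between Elasticsearch versions):

    # Sketch: Elasticsearch completion suggester over HTTP.
    # Index/field names are hypothetical; shape matches recent ES versions.
    import requests

    es = "http://localhost:9200"

    # 1. Create an index with a field mapped as type "completion".
    requests.put(f"{es}/products", json={
        "mappings": {"properties": {"suggest": {"type": "completion"}}}})

    # 2. Index a document with suggestion inputs.
    requests.post(f"{es}/products/_doc/1", json={
        "suggest": {"input": ["trial", "trie", "tried"]}})

    # 3. Ask for completions by prefix.
    resp = requests.post(f"{es}/products/_search", json={
        "suggest": {"my-suggest": {
            "prefix": "tri", "completion": {"field": "suggest"}}}})
    print(resp.json()["suggest"]["my-suggest"][0]["options"])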

How "fast" are lists?

While developing my application I noticed that I was putting more and more complexity into lists: "joining" related docs, or manipulating the output based on query parameters. As we know, there is a lot of stuff that can be put into lists, stuff that could also be handled by the middleware (if you're not developing a couchapp).
Just to state the question clearly: how far can/should one go with lists?
You should only go to about 12 on the Jason Scale ;)
Very hard to quantify the answer. JS within Couch is as fast as JS outside of Couch, which is slower than native code, faster than some other interpreters, and slower than others. The short answer is that if you like writing code in lists, and it works with your development environment, then relax; don't stop until/unless it becomes a problem.
The problem with lists is that they are executed on each request. It may not be a problem for you, but I prefer to avoid using lists, and design the documents and the application to not need lists. That said, nothing stops you from putting some caching mechanism in front of your couch to reduce server load.

What is the fastest word search on an index?

I'm coding a query engine to search through a very large sorted index file. Here is my plan: use a binary search scan together with Levenshtein-distance word comparison to find a match. Is there a better or faster way than this? Thanks.
You may want to look into tries; in many cases they are faster than binary search.
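A minimal trie sketch, just to illustrate why: lookup cost is O(length of the word), independent of how many words are stored:

    # Minimal trie: membership test costs O(len(word)),
    # no matter how many words the index holds.
    class Trie:
        def __init__(self):
            self.children = {}   # char -> Trie
            self.is_word = False

        def insert(self, word):
            node = self
            for ch in word:
                node = node.children.setdefault(ch, Trie())
            node.is_word = True

        def contains(self, word):
            node = self
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.is_word

    t = Trie()
    t.insert("trie")
    print(t.contains("trie"), t.contains("tried"))   # True False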
If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.
Since you're looking at similar words, maybe you can group the words into many files by something like their soundex, giving you much shorter lists of words to compute the distances to. http://en.wikipedia.org/wiki/Soundex
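A sketch of that grouping idea, with a simplified Soundex (good enough to illustrate the bucketing, not a reference implementation):

    # Bucket words by Soundex code so the expensive distance computation
    # only runs against a short list of similar-sounding candidates.
    from collections import defaultdict

    CODES = {ch: str(d) for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for ch in letters}

    def soundex(word):
        word = word.lower()
        code, prev = word[0].upper(), CODES.get(word[0])
        for ch in word[1:]:
            cur = CODES.get(ch)
            if cur and cur != prev:
                code += cur
            if ch not in "hw":       # h/w do not break a run of equal codes
                prev = cur
        return (code + "000")[:4]

    buckets = defaultdict(list)
    for word in ["robert", "rupert", "rubin", "ashcraft"]:
        buckets[soundex(word)].append(word)
    print(dict(buckets))   # robert/rupert share R163; rubin is R150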
In your shoes, I would not reinvent the wheel - rather I'd reach for the appropriate version of the Berkeley DB (now owned by Oracle, but still open-source just as it was back when it was owned and developed by the UC at Berkeley, and later when it was owned and developed by Sleepycat;-).
The native interfaces are C and Java (haven't tried the latter actually), but the Python interface is also pretty good (actually better now that it's not in Python's standard library any more, as it can better keep pace with upstream development;-), C++ is of course not a problem, etc etc -- I'm pretty sure you can use it from most any language.
And, you get your choice of "BTree" (actually more like a B*Tree) and hash (as well as other approaches that don't help in your case) -- benchmark both with realistic data, btw, you might be surprised (one way or another) at performance and storage costs.
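A quick-and-dirty benchmark sketch along those lines, again via the Python bindings (paths and record count are arbitrary):

    # Time N inserts into the B-tree backend vs. the hash backend.
    import time
    import bsddb3

    def bench(open_fn, path, n=1_000_000):
        db = open_fn(path, "n")          # "n": always start from a fresh file
        start = time.time()
        for i in range(n):
            key = str(i).encode()
            db[key] = key
        db.close()
        return time.time() - start

    print("btree:", bench(bsddb3.btopen, "bench_bt.db"))
    print("hash: ", bench(bsddb3.hashopen, "bench_hash.db"))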
If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea -- the original one was Chord but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones so I can't really advise you here).
After your comment on David's answer, I'd say that you need two different indexes:
the 'inverted index', where you keep all the words, each with a list of places found
an index into that file, to quickly find any word. Should easily fit in RAM, so it can be a very efficient structure, like a Hash table or a Red/Black tree. I guess the first index isn't updated frequently, so maybe it's possible to get a perfect hash.
Or just use Xapian, Lucene, or any other such library; several are widely used and optimized.
Edit: I don't know much about word-comparison algorithms but I guess most aren't compatible with hashing. In that case, R/B Trees or Tries might be the best way.
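A toy sketch of that two-level layout: a plain dict stands in for the in-RAM structure, mapping each word to the byte offset of its postings line in the on-disk file (all names made up):

    # Two-level lookup: one dict probe in RAM, one seek + readline on disk.
    postings = {"hello": [(0, 5), (2, 1)],   # word -> [(doc_id, position)]
                "world": [(0, 6)]}

    offsets = {}
    with open("inverted.idx", "w") as f:
        for word, plist in postings.items():
            offsets[word] = f.tell()
            f.write(" ".join(f"{d}:{p}" for d, p in plist) + "\n")

    def lookup(word):
        off = offsets.get(word)
        if off is None:
            return []
        with open("inverted.idx") as f:
            f.seek(off)
            return [tuple(map(int, pair.split(":")))
                    for pair in f.readline().split()]

    print(lookup("hello"))   # [(0, 5), (2, 1)]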

If you had one wish for SubSonic what would it be?

I know this question seems subjective but it's really pretty simple. As a long term user, and part time contributor to SubSonic I'm interested in what the community thinks would be the single best way to improve it.
So what's your opinion, how would you make SubSonic even better? What one thing would make you more likely to use/recommend/evangelise/stop complaining about it?
As I said I know this is a bit subjective and may get closed but as SO is the main support forum for SubSonic I think this could be a useful way to solicit opinion and/or contributions.
To keep this from turning into a general discussion, here are the rules:
No omnibus wishes
No duplicate wishes
Up-vote those you agree with rather than re-posting them
Ability to run in MediumTrust out of the box
In all honesty, the biggest thing that's lacking is solid documentation and how-tos.
It's got better, but I think it needs a lot more.
Ability to automatically map collections of other objects, like Fluent NHibernate does.
When SubSonic throws an exception that isn't clear, I'd like to be able to use Google or some other mechanism to discover more information about how to keep my development effort moving forward. Right now it's too easy to get into a situation where you have to go spelunking into the SubSonic source code since SubSonic doesn't seem to be very proactive when the user goes off the "happy path".
This critique is hardly specific to SubSonic. Many (most?) software products suffer from this same problem. I have not really had this problem with NHibernate though, which is SubSonic's most clear competitor.
Faster and higher quality releases
Binary types for SimpleRepository (Images)
Left Outer Joins
Support more database-independent code generation...
What I mean by this is that it is a real pain if your application wants to talk to different databases (e.g. SQL Server and Oracle) and you want to have only one set of generated DAL objects. I would love it if you had the option of specifying that any SQL sent to the DB should be as compatible with as many engines as possible, since right now, if you generated your objects targeting SQL Server, all queries will be of the form:
SELECT [schema].[table_name] FROM ....
Sadly, this does not work in Oracle (which expects double-quoted or unquoted identifiers rather than brackets), so basically you're out of luck there.
Perhaps this isn't a huge concern for most of you, but I'm currently writing a commercial app that touts one of its main features as being able to run on various database engines just by changing its configuration and I chose SubSonic because I thought it could handle the job pretty easily, but I'm honestly having second thoughts now because of all the hoops I may have to jump through just to get this to work correctly under different environments.
Support MS Access, Postgres, and Firebird databases :)

How to manage application resources?

We are developing a web application which is available in 3 languages.
We keep key-value pairs to translate everything. At the moment we use Excel (key, German, French, English) for this, but that does not work well: if more than one person edits the file, there is no way to automatically merge the different copies.
Is there a good (and free) tool which can handle this job?
--- additional information ---
(This is a Struts application.) But the question is how to manage this kind of information in general, or at least in a convenient way that supports multiple users editing a single file ("mergeable" file types).
Why not use gettext and manage separate .po files? See that blog entry.
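In a Python project, for instance, the standard-library gettext module loads the compiled catalogs; translators each work on their own plain-text .po file, which merges cleanly in version control. The conventional layout locale/&lt;lang&gt;/LC_MESSAGES/messages.mo is assumed here:

    # Sketch: loading translations compiled from .po files.
    import gettext

    t = gettext.translation("messages", localedir="locale",
                            languages=["de"], fallback=True)
    _ = t.gettext
    print(_("Save"))   # German translation if present, else "Save"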
If you can store this information in plain text then you will be able to use a version control system like subversion to help you with merging changes. Subversion is free.
The free guide (the "Red Book") to subversion gives a fairly good explanation of how this kind of merging works.
http://svnbook.red-bean.com/en/1.5/svn.basic.vsn-models.html#svn.basic.vsn-models.copy-merge
EDIT: Another thought - if you really want to stay using a spreadsheet - Google Docs supports simultaneous editing of a spreadsheet. You could import your existing spreadsheet and get your multi-user merging wishes for free with very little change to how you work.
Good Question.
There are some best practices, depending on what you actually code in (Java, Windows C#).
I solved this (but I think there must be a better way) by using a SQL DB instead of an Excel file, and I wrote a plugin for the editor (VS, VB6, ..., Emacs) that could insert new keys into the DB without a round trip through version control. The keys are the developers' best guess at a name for a label (key => save, sv => "spara", no => "", en => "save").
This DB can then be generated into a module, class, object, or text file, whatever suits the platform,
and accessed in a way that depends on the IDE; in C#, for example: btn.Label = CoreLang.Save;
Someone else can then do all the language work, and we just update the DB and rerun the generation for the platform resources.
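A rough sketch of that generation step, targeting Struts-style .properties files (the SQLite file, table, and column names are all hypothetical):

    # Generate one .properties file per language from a translations DB.
    import sqlite3

    conn = sqlite3.connect("translations.db")
    for lang in ["de", "fr", "en"]:
        rows = conn.execute(
            f"SELECT key, {lang} FROM translations ORDER BY key")
        # Classic Java .properties files expect ISO-8859-1.
        with open(f"ApplicationResources_{lang}.properties", "w",
                  encoding="latin-1") as f:
            for key, text in rows:
                f.write(f"{key}={text or ''}\n")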
After years of seeing localization done, including at large companies like Sony, I can only say the "standard" is Excel :)
There are tons of good ideas around, and probably many better ways to do it, but in real life Excel seems to be the most cost-effective solution that doesn't require training or building complex new tools to get the job done.
I found out that IntelliJ IDEA (at least in versions 7 and 8) has an editor for application resources. But it is not free at all, and it does not scale to bigger resource files with more than 1,000 keys.
Another good choice would be Google's spreadsheets ... for those who don't know it, it is like an online Excel web application. It can handle concurrent access from multiple users. Yay! But sadly, it comes from Google, which makes it impossible for us to use in commercial projects.
So,
still searching...
cheers,
mana
