Space-efficient embedded Haskell persistence solution

I'm looking for a persistence solution (maybe a NoSQL db? or something else...) that has the following criteria:
1) Has a Haskell API
2) Is disk space efficient--the db could easily get to many gigabytes of data but I need it to run well on a typical desktop. I need something that stores the data as efficiently as possible. So, for example, storing field names in a record would be bad.
3) High performance for reading sequential records. The typical use case is start somewhere and then read forward straight through the data--reading through possibly millions of records as quickly as possible.
4) Data is basically never changed (would only be changed if it was discovered data was incorrect somehow), just logged
5) It should act directly on file(s) that can be easily moved/copied around. It should not be calling a separate running server.

If you drop the requirement for a single file with no other running process, everything else can be fulfilled by just about every standard RDBMS and, depending on the type of data, sometimes especially well by columnar stores.
The only single-file solution I know of is SQLite. SQLite mainly founders when a single db needs to be accessed by multiple concurrent processes; if that isn't the case, then I wouldn't be surprised if you could scale it up significantly.
Additionally, if you're only looking for sequential scans and key-value access, you could just go with Berkeley DB, which is known to be high-performance for very large data sets.
There are high-quality Haskell bindings for talking to both SQLite and Berkeley DB.
Edit: For sequential access only, it's also blindingly straightforward to roll your own layer with the binary or cereal packages -- you basically need to write a helper function that wraps reading records from a file sequentially rather than all at once. An abstraction for folding over them is nice as well. Then you can decide to append to a single file, or spread your writes across files as you go. Either way, that's the most lightweight and straightforward option of all. The only drawback is having to worry about durability -- safe writes in the presence of interrupts, and all the other stuff that a good DB solution should take care of for you.
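To make that concrete, here is a rough sketch of the roll-your-own approach using the binary package. The Record type, its fields, and the "log.bin" path are made up for illustration; the point is the append helper and the lazy, record-at-a-time fold.

    {-# LANGUAGE BangPatterns #-}
    -- A minimal sketch of an append-only log of binary-encoded records,
    -- read back sequentially with a strict left fold. The Record type and
    -- file layout are illustrative, not anything standard.
    import Data.Binary (Binary (..), encode)
    import Data.Binary.Get (runGetOrFail)
    import qualified Data.ByteString.Lazy as BL
    import Data.Word (Word64)

    data Record = Record { timestamp :: Word64, value :: Double }

    instance Binary Record where
      put (Record t v) = put t >> put v
      get = Record <$> get <*> get

    -- Append one record to the end of the log file.
    appendRecord :: FilePath -> Record -> IO ()
    appendRecord path r = BL.appendFile path (encode r)

    -- Strict left fold over every record in the file. The file is read
    -- lazily, so millions of records never have to be in memory at once.
    foldRecords :: (a -> Record -> a) -> a -> FilePath -> IO a
    foldRecords f z path = go z <$> BL.readFile path
      where
        go !acc bs
          | BL.null bs = acc
          | otherwise  = case runGetOrFail get bs of
              Left (_, _, err)   -> error ("corrupt record: " ++ err)
              Right (rest, _, r) -> go (f acc r) rest

    main :: IO ()
    main = foldRecords (\n _ -> n + 1) (0 :: Int) "log.bin" >>= print

Spreading writes across multiple files is just a matter of choosing the path per batch; the fold stays the same per file.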

CouchDB ticks most of your boxes:
1) http://hackage.haskell.org/package/CouchDB
2) Depends on how you use it. You can store any binary data in it, but it's up to you to know what it means. Or you can store XML or JSON, which is less space-efficient but easier to migrate as your schema evolves (which it will).
3) Don't know, but it's used for big web sites.
4) CouchDB uses a CM-like concept of updates and baselines, so old data stays around. It can be purged later as obsolete, but I think that's optional.
5) No. It's written in Erlang and runs (I believe) as a separate process. But why is that a problem?

Related

Atomic probabilistic counting and set membership in Cassandra

I am looking to do probabilistic counting and set membership using structures such as bloom filters and hyperloglog.
Is there any support for using such data structures and performing operations on them atomically on the server-side, through user-defined functions or similar? Or any way for me to add extensions with such functionality?
(I could ingest the data through another system and batch the updates to reduce the contention, but it would be far simpler if all this could be handled in the database server.)
You have to implement them client-side. A common approach is to serialize and insert the HLL you keep in memory every X minutes, and then merge the stored sketches on reads across the range you're interested in (maybe using an RRD-type approach for different periods beyond X minutes). This is not very durable, so depending on the use case it might mean something more complex.
Although it seems a close fit to C*, I think one of the big issues is deletes, but you can probably work around them. There's a proof of concept for a C*-side implementation here:
http://vilkeliskis.com/blog/2013/12/28/hacking_cassandra.html
that you can likely get working "well enough". https://issues.apache.org/jira/browse/CASSANDRA-8861 may be something to watch.
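For what it's worth, the merge-on-read step itself is cheap once the serialized sketches are fetched: a HyperLogLog is just a fixed-size array of registers, and the union of two sketches is an element-wise max. A rough Haskell sketch of that merge (the register count and Word8 registers are illustrative assumptions, nothing Cassandra-specific):

    -- Merging HyperLogLog sketches read back from storage: a sketch is a
    -- fixed-length vector of registers, and the union of two sketches is
    -- the element-wise maximum. The register count (2^14) and Word8
    -- registers are illustrative assumptions.
    import qualified Data.Vector.Unboxed as V
    import Data.Word (Word8)

    type HLL = V.Vector Word8

    emptyHLL :: HLL
    emptyHLL = V.replicate 16384 0

    -- Union of two sketches: take the max of each register pair.
    mergeHLL :: HLL -> HLL -> HLL
    mergeHLL = V.zipWith max

    -- Merge all the per-interval sketches fetched for the time range of
    -- interest; cardinality estimation then runs on the merged registers.
    mergeRange :: [HLL] -> HLL
    mergeRange = foldr mergeHLL emptyHLL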

Options for developing algorithm that manipulates large set of string artifacts

I am currently developing algorithms that work with hundreds of thousands of strings (~4000 chars each) and perform simple operations based on the results of functions applied to these strings. Currently I use Java and a MySQL database with one table:
ID | String | attribute a | attribute b | ....
| | | | ....
Basically, the algorithm gets one ID to start with, reads the string stored there, and performs functions on it (attributes are set and read for the currently active row). For example, one function extracts an ID from the string (simple string parsing) and stores this ID in the "attribute a" column. Once the entry is parsed, the algorithm reads "attribute a", jumps to the row with that ID, and the process starts all over again.
Maybe I am over-thinking this a little bit, but the current setup has so much overhead that it is nearly impossible to make quick changes or to quickly test queries. Is there a better tool or programming language that has been designed for directly operating on large data sets like this and that provides efficient functions for string manipulation?
I definitely wouldn't mind spending time on learning a completely new language as I believe that using the right tool for the job saves time and prevents frustration in the long term.
I have a pet project that I've been working on, on and off, for years. It stores a large number of strings (although not text). In the past I have implemented it in Java in-memory, Scala with a database, MySQL, C in-memory, Python + Redis... and finally, Go.
Go has done the best job. I have ~300,000 strings (although shorter than yours) stored in a data structure in memory. They form a searchable, analyzable data structure. I'm sure the use case is similar enough to yours for my experience to be relevant.
Go has similar efficiency to C for data processing. It has nice syntax like Python for quick coding. It has type safety for ... type safety. It has garbage collection.
My suggestion is: learn Go and do it all in-memory. Rely on virtual memory to accommodate a large data set. Mine is about 500 MB in RAM once loaded, but I have no doubt it would function just fine at twice that.
I do not persist to disk because I don't need to: I can re-create the data structure in 15 minutes from input files. The application is a continually running server. If you're running large batch operations to do analysis, that can be suitable too. Otherwise I am sure you can easily persist to disk.
(FWIW I'm talking about www.folktunefinder.com melody search index)
It doesn't look like you need a relational database. Maybe try something like MongoDB.
I don't think this is really a language-choice problem: you can definitely do big-data string handling in Java just fine. You can probably solve most of your problems by:
Creating decent JUnit tests with controlled subsets of the data
Doing some profiling to find performance hotspots and tuning them
Intelligent caching of rows/Strings in memory (rather than doing round-trips to the database all the time)
Having said that, I'd almost certainly pick Clojure as a language/environment for this kind of task:
Interactive development at the REPL for testing queries etc.
Much more concise than Java
Lazy functional programming is great for big data sets (even ones that are larger than memory)
You can still access all the Java libraries
Some very neat database tools, e.g. Korma (a DSL for SQL queries) and Datomic (a revolutionary new kind of database)
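Whichever language you pick, the in-memory version of the follow-the-ID-chain loop stays small. Here is a rough sketch of the idea in Haskell (the Row type and parseNextId are hypothetical stand-ins for your real schema and parsing logic):

    -- Load every row into a map keyed by ID, then repeatedly parse the
    -- next ID out of the current row's string and jump to it. Row and
    -- parseNextId are hypothetical stand-ins; cycles are not handled.
    import qualified Data.IntMap.Strict as IM

    data Row = Row { rowString :: String, attributeA :: Maybe Int }

    -- Stand-in: extract the next ID from a row's string, if any.
    parseNextId :: String -> Maybe Int
    parseNextId s = case words s of
      (w:_) | all (`elem` ['0'..'9']) w -> Just (read w)
      _                                 -> Nothing

    -- Follow the chain of IDs starting from a given row, collecting the
    -- IDs visited, until a row is missing or has no successor.
    followChain :: IM.IntMap Row -> Int -> [Int]
    followChain rows = go
      where
        go i = case IM.lookup i rows of
          Nothing  -> []
          Just row -> i : maybe [] go (parseNextId (rowString row))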

Does there exist a language with the characteristic of storing variables in persistent storage?

I had this idea this morning, and was thinking about how to implement it when it occurred to me that somebody has probably already done this. I searched but found nothing; here's my idea:
In short, all variable storage is stored in persistent storage. I don't mean battery backed up RAM. I mean more like a database.
To use common technologies to explain what I mean: let's say you were to use an SQL database for this persistent storage. An array/list would be stored as a table with one column. An ordered list would be stored as two columns, with the first being a sequence number. A hash would be a table with two columns, the first being the key, the second being the value. All simple stuff. But what I'm getting at is that you could do large data moving/calculating/reporting operations with native language constructs without all that mucking about in hyper... I mean without all that SQL and loading data from the database.
I was thinking sort of like the way you can do matrix math in APL. It would be native to the language and all the underpinning storage would just work. And in reality it would use a record manager more than a SQL database. That was just to explain.
Of course this would be horribly slow, but solid state disk is getting bigger faster and cheaper, so this might not be as unwieldy as it might first seem.
Anyway, is this a novel idea or has somebody done this before?
MUMPS has something like that.
Database interaction is transparently built into the language. The MUMPS language provides a hierarchical database made up of persistent sparse arrays, which is implicitly “opened” for every MUMPS application. All variable names prefixed with the caret character (“^”) use permanent (instead of RAM) storage, will maintain their values after the application exits, and will be visible to (and modifiable by) other running applications.
Of course, it’s explicit—thus not applied to all variables—but still automatic.
How persistent are you talking? The localStorage API works well (persists across browser tabs and sessions) so long as you know users can choose to clear it out. Your question sounds eerily like WebKit client-side database storage though.
Well, to point out the obvious, there is SQL.

Transaction with Cassandra data model

According to the CAP theorem, Cassandra can only offer eventual consistency. To make things worse, if we have multiple reads and writes during one request without proper handling, we may even lose logical consistency. In other words, if we do things fast, we may do them wrong.
Meanwhile, the best practice for designing a Cassandra data model is to think about the queries we are going to have and then add a CF for each of them. This way, adding or updating one entity often means updating many views/CFs. Without an atomic transaction feature, it's hard to do this right. But with it, we lose the A and P parts again.
I don't see this concerning many people, hence I wonder why.
Is this because we can always find a way to design our data model to avoid doing multiple reads and writes in one session?
Is this because we can just ignore the 'right' part?
In real practice, do we always have an ACID feature somewhere in the middle? I mean, maybe implement it in the application layer or add a middleware to handle it?
It does concern people, but presumably you are using cassandra because a single database server is unable to meet your needs due to scaling or reliability concerns. Because of this, you are forced to work around the limitations of a distributed system.
In real practice, do we always have an ACID feature somewhere in the middle? I mean, maybe implement it in the application layer or add a middleware to handle it?
No, you don't usually have ACID somewhere else, as presumably that somewhere else would have to be distributed over multiple machines as well. Instead, you design your application around the limitations of a distributed system.
If you are updating multiple columns to satisfy queries, you can look at the eventually atomic section in this presentation for ideas on how to do that. Basically, you write enough info about your update to Cassandra before you do your writes; that way, if a write fails, you can retry it later.
If you can structure your application in such a way, using a co-ordination service like ZooKeeper or Cages may be useful.
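A rough sketch of that "record the intent first, then apply and retry" shape, with the actual Cassandra calls abstracted as plain IO actions (the UpdateIntent type and the storage hooks are hypothetical, not any particular driver's API):

    -- Record what we intend to write durably first, then apply the
    -- individual per-CF writes, and only clear the intent once they all
    -- succeed. A recovery pass can replay intent records left behind by
    -- crashes. UpdateIntent and the two hooks are hypothetical stand-ins.
    import Control.Exception (SomeException, try)
    import Data.Either (isRight)

    data UpdateIntent = UpdateIntent
      { intentId :: String
      , writes   :: [IO ()]       -- the per-CF writes this update requires
      }

    eventuallyAtomic
      :: (UpdateIntent -> IO ())  -- persist the intent (e.g. to a log CF)
      -> (String -> IO ())        -- delete the intent record once done
      -> UpdateIntent
      -> IO Bool                  -- True if everything was applied now
    eventuallyAtomic logIntent clearIntent intent = do
      logIntent intent            -- durable record of what we meant to do
      results <- mapM (try :: IO () -> IO (Either SomeException ())) (writes intent)
      if all isRight results
        then clearIntent (intentId intent) >> return True
        else return False         -- leave the intent behind; retry or replay later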

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
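A tiny sketch of that narrowing approach: get candidates from the fastest system, then keep only those the slower systems confirm still match the remaining criteria. The per-system check actions here are hypothetical predicates, not real back-end calls.

    -- Successively prune the candidate list against each slower system.
    import Control.Monad (filterM, foldM)

    refineCandidates :: [person] -> [person -> IO Bool] -> IO [person]
    refineCandidates candidates checks =
      foldM (\remaining check -> filterM check remaining) candidates checks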
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer: lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming the results into a normalized bit of XML, and returning them to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel; the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts it in a Lucene index. Your search could work against this index, and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd apply the caching and parallelization optimizations mentioned above.
However, with all of that you will need to weigh up the development time/deployment time/long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching your data if you don't need it to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth), then you should cache it. If you employ caching, you could make your system configurable as to what tables/columns to include or exclude from the cache, and you could give each table/column its own cache timeout with an overall default.
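As a concrete example of the parallelization point, here is a small Haskell sketch using the async package; the three query actions and the attribute map are hypothetical placeholders for the real SQL Server, legacy database and web-service calls:

    -- Query the different back-end systems concurrently and merge the
    -- attribute maps they return, so total latency is roughly that of
    -- the slowest system rather than the sum of all of them. The three
    -- query functions are placeholders for the real back-end calls.
    import Control.Concurrent.Async (mapConcurrently)
    import qualified Data.Map.Strict as M

    type PersonId   = String
    type Attributes = M.Map String String

    queryAbcDatabase, queryLegacyDb, queryXmlService :: PersonId -> IO Attributes
    queryAbcDatabase _pid = return (M.fromList [("dateOfBirth", "1980-01-01")])
    queryLegacyDb    _pid = return (M.fromList [("salesRegion", "EMEA")])
    queryXmlService  _pid = return (M.fromList [("title", "Engineer")])

    fetchAttributes :: PersonId -> IO Attributes
    fetchAttributes pid =
      M.unions <$> mapConcurrently ($ pid) [queryAbcDatabase, queryLegacyDb, queryXmlService]

    main :: IO ()
    main = fetchAttributes "person-42" >>= print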
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
