Looking for a network-accessible hash table - linux

I have a data acquisition application broken into a client and a server.
The server is responsible for grabbing data from the hardware, running some realtime analysis, and recording the data to disk when it's asked.
The client is a GUI that the operator can use to look at some pretty graphs (generated by the server), set some parameters, and turn recording on and off. It's usually run on the same machine as the server, but can be run from any other machine on the network.
Both are written in Qt (C++). Both are used on Linux.
The communication between the two is currently done with a homegrown library (in C++, but not Qt) that is essentially a hash table. The server has a list of parameters, like analysis.graph.width, and those parameters can be set and read by both the server and client(s).
The system is being redesigned to support new hardware, and now is a good time to replace this library if something better exists. Here are some requirements:
Ideally would play well with Qt (using QVariant to store values, using signals/slots)
Must allow values to be many different types (integers, strings, doubles, bools, lists of those)
Keys will be strings
Must be fast, allowing set/get operations up to 30 times per second
Must allow multiple clients to set/get parameters simultaneously
I found this list: http://en.wikipedia.org/wiki/Structured_storage, but the libraries listed there seem too complex (distributed, mirrored) or not capable enough (values can only be strings).
Is anyone aware of libraries that would fit some or all of the requirements?

Well Dave, I have used Redis for the same problem. It doesn't meet all your requirements, but it meets these:
Must allow values to be many different types (integers, strings, doubles, bools, lists of those)
Keys will be strings
Must be fast, allowing set/get operations up to 30 times per second
Must allow multiple clients to set/get parameters simultaneously
You can use the C/C++ API to communicate with Redis (see "How to use Redis within a C++ program?"). Yes, you will have to convert data types from one to another, say char* to QString.
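The data type conversion mentioned above can be sketched briefly. This is only an illustration, written in Python for brevity rather than in the asker's C++/Qt: a plain dict stands in for the Redis connection (an assumption, so that the sketch runs without a server), and the ParameterStore class and the parameter names other than analysis.graph.width are hypothetical. Values are encoded as JSON text so that integers, doubles, bools, strings and lists survive a round trip through a store that only holds strings.

```python
import json


class ParameterStore:
    """Typed key/value store layered on a string-only backend.

    `backend` is any mapping from str to str. A plain dict stands in
    here for a Redis connection (an assumption made so the sketch runs
    without a server).
    """

    def __init__(self, backend=None):
        self._backend = {} if backend is None else backend

    def set(self, key, value):
        # JSON keeps ints, floats, bools, strings and lists distinct.
        self._backend[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._backend.get(key)
        return default if raw is None else json.loads(raw)


store = ParameterStore()
store.set("analysis.graph.width", 640)
store.set("recording.enabled", True)      # hypothetical parameter name
store.set("analysis.channels", [1, 2, 5])  # hypothetical parameter name
```

The same pattern would apply on the Qt side, with QVariant taking the place of the decoded Python value.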


Does there exist a language with the characteristic of storing variables in persistent storage?

I had this idea this morning, and was thinking about how to implement it when it occurred to me somebody has probably already done this. I searched but found nothing. Here's my idea:
In short, all variable storage is stored in persistent storage. I don't mean battery backed up RAM. I mean more like a database.
To use common technologies to explain what I mean: Let's say you were to use an SQL database for this persistent storage. An array/list would be stored as a table with one column. An ordered list would be stored as two columns with the first being a sequence number. A hash would be a table with two columns, the first being the key, the second being the value. All simple stuff. But what I'm getting at is that you could do large data moving/calculating/reporting operations with native language constructs without all that mucking about in hyper... I mean without all that SQL and loading data from the database.
I was thinking sort of like the way you can do matrix math in APL. It would be native to the language and all the underpinning storage would just work. And in reality it would use a record manager more than a SQL database. That was just to explain.
Of course this would be horribly slow, but solid state disk is getting bigger, faster, and cheaper, so this might not be as unwieldy as it might first seem.
Anyway, is this a novel idea or has somebody done this before?
MUMPS has something like that.
Database interaction is transparently built into the language. The MUMPS language provides a hierarchical database made up of persistent sparse arrays, which is implicitly “opened” for every MUMPS application. All variable names prefixed with the caret character (“^”) use permanent (instead of RAM) storage, will maintain their values after the application exits, and will be visible to (and modifiable by) other running applications.
Of course, it’s explicit—thus not applied to all variables—but still automatic.
How persistent are you talking? The localStorage API works well (persists across browser tabs and sessions) so long as you know users can choose to clear it out. Your question sounds eerily like WebKit client-side database storage though.
Well, to point out the obvious, there is SQL.
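The question's idea of a hash stored as a two-column table can be approximated as a library rather than a language feature. The following is a minimal sketch, not a description of any existing language: the PersistentDict class, the kv table and the file name are all hypothetical, values are stored as JSON text, and keys are assumed to be strings. It uses Python's standard sqlite3 module.

```python
import json
import os
import sqlite3
import tempfile
from collections.abc import MutableMapping


class PersistentDict(MutableMapping):
    """A dict-like object whose contents live in a two-column table."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
        )
        self._db.commit()

    def __getitem__(self, key):
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def __setitem__(self, key, value):
        self._db.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._db.commit()

    def __delitem__(self, key):
        cur = self._db.execute("DELETE FROM kv WHERE k = ?", (key,))
        self._db.commit()
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (k,) in self._db.execute("SELECT k FROM kv ORDER BY k"):
            yield k

    def __len__(self):
        return self._db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def close(self):
        self._db.close()


path = os.path.join(tempfile.mkdtemp(), "vars.db")
first = PersistentDict(path)
first["counter"] = 41
first["counter"] += 1
first.close()

second = PersistentDict(path)   # simulates a later run of the program
restored = second["counter"]
second.close()
```

Every assignment is committed immediately, which is the "horribly slow" trade-off the question anticipates; batching commits would be the obvious next step.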

Space efficient embedded Haskell persistence solution

I'm looking for a persistence solution (maybe a NoSQL db? or something else...) that has the following criteria:
1) Has a Haskell API
2) Is disk space efficient--the db could easily get to many gigabytes of data but I need it to run well on a typical desktop. I need something that stores the data as efficiently as possible. So, for example, storing field names in a record would be bad.
3) High performance for reading sequential records. The typical use case is start somewhere and then read forward straight through the data--reading through possibly millions of records as quickly as possible.
4) Data is basically never changed (would only be changed if it was discovered data was incorrect somehow), just logged
5) It should act directly on file(s) that can be easily moved/copied around. It should not be calling a separate running server.
If you remove the "single file" requirement with no other running process, everything else can be fulfilled by every standard RDBMS, and depending on the type of data, sometimes especially well by columnar stores in particular.
The only single-file solution I know of is sqlite. Mainly, sqlite founders when a single db needs to be accessed by multiple concurrent processes. If that isn't the case, then I wouldn't be surprised if you could scale it up significantly.
Additionally, if you're only looking for sequential scans and key-value stores, you could just go with berkeleydb, which is known to be high-performance for very large data sets.
There are high quality Haskell bindings for talking to both sqlite and berkeleydb.
Edit: For sequential access only, it's also blindingly straightforward to roll your own layer with the binary or cereal packages -- you basically need to write a helper function to wrap reading records from a file sequentially rather than all at once. An abstraction for folding over them is nice as well. Then you can decide to append to a single file, or spread your writes across files as you go. Either way, that's the most lightweight and straightforward option of all. The only drawback is having to worry about durability -- safe writes in the presence of interrupts, and all that other stuff that a good DB solution should take care of for you.
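The shape of such a hand-rolled layer can be sketched as follows. This is an illustration in Python using the standard struct module, not the Haskell binary or cereal packages the answer refers to; the record layout (an 8-byte timestamp and an 8-byte float), the function names and the file name are all assumptions. It shows append-only writes, a lazy sequential reader that can start at any record, and a fold over the records.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width record: 8-byte timestamp, 8-byte float value.
# No field names are stored, so each record costs exactly 16 bytes.
RECORD = struct.Struct("<qd")


def append_records(path, records):
    """Append (timestamp, value) pairs to the end of the log file."""
    with open(path, "ab") as f:
        for ts, value in records:
            f.write(RECORD.pack(ts, value))


def read_records(path, start=0):
    """Yield records one at a time, beginning at record index `start`."""
    with open(path, "rb") as f:
        f.seek(start * RECORD.size)
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file, or a torn final record
            yield RECORD.unpack(chunk)


def fold_records(path, step, initial, start=0):
    """Left fold over the records without loading the whole file."""
    acc = initial
    for record in read_records(path, start):
        acc = step(acc, record)
    return acc


log_path = os.path.join(tempfile.mkdtemp(), "data.log")
append_records(log_path, [(1, 0.5), (2, 1.5), (3, 2.0)])
total = fold_records(log_path, lambda acc, rec: acc + rec[1], 0.0, start=1)
```

Fixed-width records make "start somewhere and read forward" a single seek. The durability concerns raised above (interrupted writes, fsync) are deliberately not handled here beyond skipping a torn final record.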
CouchDB ticks most of your boxes:
1) http://hackage.haskell.org/package/CouchDB
2) Depends on how you use it. You can store any binary data in it, but it's up to you to know what it means. Or you can store XML or JSON, which is less space efficient but easier to migrate as your schema evolves (which it will).
3) Don't know, but it's used for big web sites.
4) CouchDB uses a CM-like concept of updates and baselines, so old data stays around. It can be purged later as obsolete, but I think that's optional.
5) No. It's written in Erlang and runs (I believe) as a separate process. But why is that a problem?

What code could be used as a string aggregator for Sybase? (Like Oracle's stragg)

In my travels in Oracle, the 'stragg' function, or 'String Aggregator' was life-saving when I had to create dynamic SQL queries on the fly.
You can read up about it here: http://www.oratechinfo.co.uk/delimited_lists_to_collections.html
The basic use of it was:
select stragg(fruit) from food;
1 row(s) returned
So simple to use, concatenating chr(13) turned it into a long list, and selecting information from system tables gave a 5 minute solution to dynamically generated SQL, e.g. auditing triggers.
Now I've been charged with transferring oracle functionality related to auditing into Sybase, and a function similar to Stragg would be ideal for this purpose.
select #mytable = 'table_of_fruit'
select 'insert into '+#mytable+'_copy (' +char(10)
+ stragg(c.name) +char(10)
+ 'select '
+ stragg('inserted.'+c.name) + char(10)
+ 'from '+#mytable
from syscolumns c
where objectid(#mytable) = c.id
insert into table_of_fruit_copy
(fruit, sweetness, price)
select fruit, sweetness,price
from inserted
Done. Simple.
Except I don't know how to get a string-aggregation function working in Sybase.
Does anyone know of an attempt to do this kind of thing, or code that could work the same as stragg that could be used in this way?
The alternative at the moment is printing code based on complex cursors and such (sample LOC: 500), or select statements combining static strings and columns from user tables (sample LOC: 200). Stragg would severely reduce the complexity of this code, and would be a great deal of help in the future (sample LOC: who knows, maybe 50?)
p.s. I'm calling these selects through a shell script then piping them to file, then running the file through iSQL. Not the nicest solution, but it's better than the alternatives.
There are three separate answers
You have made comments about simplicity, which need to be addressed before we get to the solution.
It is a common requirement to be able to take a delimited list of values, say A,B,C,D, and treat this data like it was a set of rows in a table, or vice versa
This is one of the Top Ten Worst Programming Practices I read about recently.
In general, Sybase types tend to be somewhat more academically and Relationally qualified than Oracle types, so we simply do not do that sort of thing in SybaseLand or DB2Land.
In 20 years of working with Sybase, I have had to code that as part of my project just once, and that was for non-technical Auditor who loaded the result set into MS Access.
On the other hand, I have had to code that at least 12 times, when producing text files for importation into Oracle databases (fulfilling external requirements is outside my project, but I satisfy any such requirement free). Obviously the target databases were sub-standard and non-relational (loading a column with more than one datum breaks 1NF, and creates Update Anomalies), which is typical of what Oracle types have to do to get some speed.
Therefore, no, it is not simplicity, at least in the sense of that principle. It is by definition, complexity.
Your reference to "arrays" is incorrect. All commercial dbms handle arrays, according to the ISO/IEC/ANSI SQL (STRAGGR and LIST operators are non-standard SQL, therefore not SQL). Sybase is very strong in processing arrays. If it was an array, you would not need special hand coding to handle it (and you do, as per your question). This is not an array, there is no definition to the cells. This is a single concatenated scalar string.
Pivoting is an entirely different process, which uses set-processing; it does not require row-processing. (I understand on good authority, that Oracle is hopeless at scalar subqueries, and thus Oracle people are used to writing them as [very inefficient] joins or inline views, and then filtering: all that can be elevated to set-processing via scalar subqueries, and it will perform much faster. Particularly your Pivots.)
Even the author in your link posts as follows. Please familiarise yourself with the caveats:
It's as simple as this: If you want to have a system with no logical limitation in the number of data elements passed to a given process, then forget the following mechanisms! They are simply the wrong way to approach the problem.
Therefore, know whatever you are doing is sub-standard, non-relational, and limited; and go ahead with your eyes open. No use pretending that: it will not break; it is not limited; it is an "array"; or that Sybase doesn't have a neat little function that Oracle has. Any professional will see through all that. And if the string length is exceeded, for God's sake send some indicator back to the caller ("!Exceeded" in the string) identifying that condition.
Essentially you are turning the set-processing engine on its head, and forcing it into row-processing mode, so it will be very slow. A WHILE loop is distinctly faster than a cursor, but both are in the same class, row-processors.
The alternative at the moment is printing code based on complex cursors and such
What 200 or 500 LoC? It is possible I am missing something, but my code is the same few lines of code identified under "Using a Table Function" in your link. Maximum 20, if you count nice formatting; the loop; initialisation; error handling. There is nothing "complex" about it. Do the exact reverse to concatenate a single string from multiple rows. We use stored procedures for this (which Oracle does not have, really; PL/SQL is a different animal). If you have ASE 15.0.2 or greater, you can use a User Defined Function, which you can then use in place of a column. Stored procs are better for true arrays.
The concatenation operator in Sybase is the plus sign. For reversal (decomposing the CSV string) you need the CHARINDEX and SUBSTRING functions.
You may need the Function Reference Manual, if for nothing else, to avoid writing code where we have functions.
Likewise, we do not have a RANK() function. We are quite happy with the 4 lines of code required for the subquery. It is only required for Oracle because subqueries are crippled.
OK, I have answered your question. Now to address the approaches.
You will be aware that code using Oracle Extensions to the SQL standard will need to be changed.
Sybase is way more automated than Oracle; if you familiarise yourself with its feature set, in many instances, you can get the same result (as you did in Oracle) without writing any code. Writing code-for-code blocks is the chain gang, rock-breaking method of building roads, in the context of bulldozers. Even if your company had good reason to use that method, you need to be aware that features work quite differently, e.g. triggers, which is why I am posting so much detail.
Another issue that will annoy you is that Oracle isn't really ANSI SQL compliant (stretches the definitions in many places, in order to appear to be compliant), and Sybase, given its customer base, is rigidly SQL compliant. So in addition to the same function working differently, or in a different deployment, you need to be aware that code changes may be required to elevate Oracle code to ANSI compliance levels, just to execute on an ANSI SQL compliant platform.
I am not sure if you are trying to write code for the content of a trigger, or if you are trying to capture the changes to a database. I will provide both answers.
Capture Changes to Database
We have a very robust, fast and configurable Auditing subsystem, fit for high volumes and banking level auditing requirements. Get your DBA to set up the sybaudit (separate) database, and to configure exactly what changes need to be captured. This facility will perform much faster than any code you or I can write in a trigger (as much as 100 times faster than the row-by-row processing required for the above, as it is executed within the engine, within your executing thread). And of course the setup time is a fraction of your coding time.
Again, I am not sure exactly what you are trying to achieve, but assuming you want to copy every insert to some table to a COPY of that table (inside the Trigger), that example code you have provided will not work (and I am not counting syntax issues).
Speaking to your example, you need to do way more work, to deal with the different datatypes; column sizes; precisions; scale; etc. And perhaps the UPDATE() function to identify which columns have changed (for an UPDATE trigger of course). If all you are trying to do is convert the various datatypes to strings, check the CONVERT() function.
Triggers are transactional.
Never place row-processing code in a Trigger (it will strangle the table)
You can't place Dynamic SQL in a Trigger.
But in Sybase even that is not necessary. Refer to the User Guide, chapter 19 is devoted to Triggers, with several variations, and examples. Inside the trigger, you should be able to simply:
INSERT table_copy
SELECT column_list -- never use * unless you want the db fixed in cement
FROM inserted
If you are trying to copy the inserts to all tables into one Audit table, then beware. Then I understand your example a little bit more. You will be forcing a highly Symmetric Multi-Threading server (Oracle is not a server in the architecture sense) into single-threading through your table. Auditing is multi-threaded.
Last, the use of manual methods of any kind is not required, so if you could expand a bit more on your PS, what the requirement you are trying to fulfil is, I can identify the programmatic method for you. It appears you are trying to use the PL/SQL approach (which is very limited).
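Just use the LIST() function. It's a direct replacement for the stragg() function. (LIST() is an aggregate in SQL Anywhere and Sybase IQ, so check that your Sybase product provides it.) Example: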
SELECT LIST(state, ', ') FROM cities
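Where no such aggregate is available, the aggregation can also be done in the calling script, since the question mentions that the selects are already run from a shell script. The following is a minimal sketch in Python, assuming the column names have already been fetched from syscolumns by the caller; the build_audit_insert function is hypothetical, and the generated statement follows the shape of the example output in the question.

```python
def build_audit_insert(table, columns):
    """Return an INSERT ... SELECT copying `inserted` rows into <table>_copy."""
    column_list = ", ".join(columns)   # the part stragg(c.name) would produce
    return (
        f"insert into {table}_copy\n"
        f"({column_list})\n"
        f"select {column_list}\n"
        f"from inserted"
    )


statement = build_audit_insert("table_of_fruit", ["fruit", "sweetness", "price"])
```

The table and column names here come from the database's own catalogue; names from any less trusted source would need validating before being pasted into SQL text.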

How to implement/use a secure 'read-once' local file access system?

Does anybody know of a secure 'read-once' local file access system? Or how one might create one? I realise that if data is to be used on a system, then it must be capable of being read, but I think it may be possible to severely limit how data is made available and reduce the possibility of it being copied and used elsewhere.
These are my requirements:
I want to store a 'secure/encrypted' data-file on a USB stick (could be read-only CD/DVD, but better if read/write USB or even a floppy) and have this file capable of being read once (and mainly only once), on a decoded block-by-block basis, once a password has been entered. The file content is probably basic text/xml (or text-encoded data) and is to be read mainly as a sequential stream. The data (ideally) can be read by normal Windows file-access methods, ie: a std file, FSO objects (stream and text file), all BASIC PC (VB6/VB.NET) file handling methods, even Excel text (import). Yes, I know this probably defeats the object (as such a file can then be opened/saved), but I would still want this possibility. Finally, once the 'access' criteria had been met, the device would prevent further access.
Access to the data would be on a local PC system only. No LAN, no device sharing supported. Data on the device should not be copyable by normal means. Data would be written to the device using normal methods if possible or a special application if necessary.
To keep things simple, just one password, one file, one use, and one user would be great, but other possible enhancements include: (as icing on the cake)...
allowing 'n' opens
having multiple passwords 2 or more users, acting individually
silo-passwords, having 2 more users sign together to get access (or even having at least n from m more users sign together to get access)
Password prompt should be given on first block-access, independent of application calling the first block
Password could be embedded/automatic
tie the access to a nominated machine/mac/ip/disk serial number (or other machine-code)
tie the access to a nominated program/application
if possible, delete and securely overwrite the data file
My first guess at doing this suggests that it would need a 'pseudo-device' driver that would appear as an extension to (or replacement of) the std removable-device driver. The driver would handle each file block, sector by sector, and refuse to serve further decoded blocks if not authorised. The device should not give normal directory listings, but some form of content summary may be given to a user (optional).
Unlike a DRM system, I don't want any form of on-line access/authentication (but would consider it), I would prefer a self-contained system.
I have looked long and hard for such a device/system, and haven't found one yet. Most devices and system tools (eg: Iomega/ironkey) appear to unlock access to files, but without limit, ie: read-many, once unlocked.
Performance is not an issue. Slow floppy read-rate would be okay. Encryption method is agnostic, anything reasonably strong 40bit+ (128bit) would be fine. I can't tell you what the data is or what it's for, I just need a way to give data to somebody and limit its use as far as possible and what they can do with it. It's a real requirement to protect confidential data and not meant for DRM or MP3s/Videos or similar.
I am an 'office' developer and not really familiar with device-drivers or DRM - Now where would I start with such a project? Is there anything out-there available to joe-public already?
Thanks - Tim.
PS: Update
I should point out that I just wish to pass data between ourselves and a single specific nominated service-provider. I don't want them to copy the data we provide. It will be used once to support a 'singular' one-off process and then be done-with. As the data is 'streamed/read' it should be 'consumed'. If the process fails, we will re-issue the data to the service-provider. The data remains our property, it is not being sold/licensed.
I do realise that no solution will be foolproof, but the risk/reward ratio should dissuade casual attempts to break the system. The data has no explicit commercial value.
PPS: It's a real requirement... What would you do?
Judging by the upvotes on #eriksons thoughtful answer, you guys are saying 'not possible / don't bother' - but apart from personally supervising that the data is used according to our wishes, what would you do?
Executive summary: this isn't a realistic solution. Re-think the process so that "read-once" isn't necessary.
A few companies (Disappearing Inc. comes to mind, and they had at least one competitor) tried to make "self-destructing" email on general-purpose hardware in the late 90s. They spent millions of dot.com dollars to develop systems that didn't really work.
The only potential solution I know of is the use of a Trusted Platform Module. These are fairly common, as they are required in all computers bought by the US government. However, their capabilities vary. You'd need one that supported something called remote attestation, which allows software to perform integrity checks on itself. With this capability, you could write software that would enforce your data destruction policy. However, I don't think this feature is widely used. My laptop has a TPM, but it doesn't support this.
You should also be aware that there is a lot of hostility against "trusted computing," because it can be used to limit the functionality of a machine. This violates the right to do as you please with your property. TPMs might make sense for corporate or government machines, but not for personal computers.
Other aspects of your problem, such as granting multiple users access to the data or requiring several users to act together to gain access, are easier.
Encrypting data for multiple users is typically achieved by generating a key, encrypting the data with that "content encryption key", then encrypting the key (which is relatively small) with a "key encryption key" (which could be a password) belonging to each intended recipient.
Requiring some number of users to enter a password can be done securely with Shamir Secret Sharing, as I learned here on SO.
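The "at least n from m users" idea can be sketched with Shamir's scheme over a prime field. This is a minimal illustration using only Python's standard library (3.8 or later, for the modular inverse via pow), not a vetted implementation: the function names and the choice of prime are assumptions, and the secret is treated as an integer such as a 128-bit content encryption key.

```python
import secrets

# A Mersenne prime comfortably larger than a 128-bit secret.
PRIME = 2**521 - 1


def split_secret(secret, n, k):
    """Split integer `secret` into n shares, any k of which recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def evaluate(x):
        result = 0
        for c in reversed(coeffs):   # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = secrets.randbits(128)            # stands in for the content encryption key
shares = split_secret(key, n=5, k=3)   # 5 users, any 3 must cooperate
```

Each user would hold one share (ideally itself protected by that user's password); any three shares passed to recover_secret yield the key, while fewer than three reveal nothing about it.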
Based on the comments on the question, especially the "mailing label printing service" analogy, I'm afraid my initial answer isn't really relevant.
In a case like that, I can only see a legal solution. Disallow storage of your data in the contract. If it's worth suing them for violating the contract, do so.
Cryptographically speaking, the best thing I could think of would be to "watermark" such a "mailing list" with information that would help me prove that a copy of the list was disclosed by a particular vendor. Knowing that a watermark exists might deter any deliberate disclosures, and could help leverage a fast settlement in the case of accidental disclosure. This could use steganographic techniques within records as well as fake records in the collection.
Algorithms for doing this might already exist, but I'm not familiar with the field. Researching "digital watermarks" might be useful. Even if it only turns up algorithms for protected video and audio, perhaps these could be adapted to work with other media.
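The "fake records in the collection" part of the watermarking idea can be sketched without any specialised algorithm. This is only an illustration under stated assumptions: the record fields, the vendor names, the key and both function names are hypothetical, and an HMAC is used simply to derive fake entries that are repeatable for the issuer and different for every vendor.

```python
import hashlib
import hmac


def canary_records(master_key, vendor, count=3):
    """Derive fake records that are unique to one vendor."""
    records = []
    for i in range(count):
        digest = hmac.new(
            master_key, f"{vendor}:{i}".encode(), hashlib.sha256
        ).hexdigest()
        records.append({"name": f"Resident {digest[:8]}",
                        "postcode": digest[8:14].upper()})
    return records


def identify_leak(master_key, vendors, leaked_records):
    """Return the vendors whose canaries appear in a leaked list."""
    leaked = {(r["name"], r["postcode"]) for r in leaked_records}
    return [
        v for v in vendors
        if any((c["name"], c["postcode"]) in leaked
               for c in canary_records(master_key, v))
    ]


key = b"keep-this-secret"   # hypothetical issuer-side secret
real = [{"name": "A. Customer", "postcode": "AB1 2CD"}]
issued_to_acme = real + canary_records(key, "acme")
suspects = identify_leak(key, ["acme", "globex"], issued_to_acme)
```

In practice the fake entries would have to look plausible rather than hex-like, and they only help prove where a copy came from; they do nothing to prevent the copying.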
There are several problems with your approach.
If you can read the data from any application, you can save the data anywhere. I would think this would defeat the purpose of any 'only-one-access' policy.
To get a device driver to handle your scenario, you would need deep knowledge of file-system-programming, which at least under windows is no easy undertaking. Even then, it would be hard to enforce the one time access prerequisite.
Programs have different file-access strategies, which might break your assumptions. E.g. an application may open a file once to get its size, then close and reopen it, to load its data. How should this be enforced? Do you want to limit 'OpenFile' calls? do you want to limit 'read byte' calls? Do you want to limit ... jumping around in the file?
When your medium gets copied, by whatever means, you have no way of knowing that. The games industry tried for years to bind games to the original CD, and failed miserably.
I think what would be feasible is a container format with an encoder/decoder, or something like that. (See BitLocker in Windows 7.) That would guarantee that you can only decode the data once to a local disc, and it would then delete the container on your medium (beware: check first if the medium is writable, and bind the container to a serial number or name of the medium so that the container cannot be copied).
Another possibility would be a separate USB device, which you can only use once to extract the data from it. Then you would only need to write a driver once in user mode with WinUSB. Encrypted USB-Sticks use this approach.
But I really think this is a bad idea, because you can very easily get around any countermeasure when the receiving person can read all data from the medium and save it anywhere else.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
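The restrictive approach can be sketched as follows. This is a minimal illustration with in-memory dicts standing in for the real systems; the function names, the sample data and the salesRegion attribute name are hypothetical, and only AND logic is supported, as noted above.

```python
def restrictive_search(criteria, fast_search, slow_lookups):
    """Return ids matching every criterion (AND logic only).

    fast_search(criteria) -> candidate ids from the fastest source.
    slow_lookups maps an attribute name to a function id -> value,
    one per attribute held by a slower system.
    """
    candidates = fast_search(criteria)
    matches = []
    for person_id in candidates:
        # Only the surviving candidates are checked against slow systems.
        if all(
            slow_lookups[attr](person_id) == wanted
            for attr, wanted in criteria.items()
            if attr in slow_lookups
        ):
            matches.append(person_id)
    return matches


# Hypothetical in-memory stand-ins for the real systems.
sql_server = {1: "1980-02-01", 2: "1980-02-01", 3: "1975-06-30"}
legacy_regions = {1: "north", 2: "south", 3: "north"}


def by_birth_date(criteria):
    return [pid for pid, dob in sorted(sql_server.items())
            if dob == criteria["dateOfBirth"]]


found = restrictive_search(
    {"dateOfBirth": "1980-02-01", "salesRegion": "north"},
    by_birth_date,
    {"salesRegion": legacy_regions.get},
)
```

The slow systems are only asked about the two candidates returned by the fast source, which is where the saving comes from.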
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
(There are implementations in other languages as well)
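To make the inverted-index idea concrete, here is a toy sketch in plain Python. It is not Lucene and uses none of its API; the PeopleIndex class is hypothetical, and it assumes the systems share a common person identifier. Each (field, token) pair maps to the set of person ids containing it, and each hit reports which systems contributed data.

```python
from collections import defaultdict


class PeopleIndex:
    def __init__(self):
        self._postings = defaultdict(set)   # (field, token) -> person ids
        self._origins = defaultdict(set)    # person id -> systems seen

    def add(self, system, person_id, fields):
        """Index one record pulled from one source system."""
        self._origins[person_id].add(system)
        for field, value in fields.items():
            for token in str(value).lower().split():
                self._postings[(field, token)].add(person_id)

    def search(self, **terms):
        """Return (person id, systems) for ids matching every term."""
        ids = None
        for field, term in terms.items():
            hits = self._postings.get((field, term.lower()), set())
            ids = set(hits) if ids is None else ids & hits
        return [(pid, sorted(self._origins[pid])) for pid in sorted(ids or ())]


index = PeopleIndex()
index.add("ABC", 17, {"dateOfBirth": "1980-02-01"})
index.add("legacy", 17, {"salesRegion": "North"})
index.add("legacy", 23, {"salesRegion": "North"})
hits = index.search(salesRegion="north", dateOfBirth="1980-02-01")
```

The search touches only the local index, so the slow legacy database and web service are consulted at indexing time rather than at query time.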
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallelize optimizations.
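Parallelizing the queries inside such an aggregation layer can be sketched with a thread pool. This is a minimal illustration: the two back ends are simulated with sleep, and the function and source names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_all(sources, criteria):
    """Run every source query concurrently; return {name: result}."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, criteria) for name, fn in sources.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical slow back ends, simulated with sleep.
def slow_legacy(criteria):
    time.sleep(0.2)
    return ["legacy-hit"]


def slow_web_service(criteria):
    time.sleep(0.2)
    return ["ws-hit"]


started = time.monotonic()
results = query_all({"legacy": slow_legacy, "ws": slow_web_service}, {"name": "x"})
elapsed = time.monotonic() - started   # roughly the slowest call, not the sum
```

Threads suit this case because the work is waiting on I/O; the total latency is then bounded by the slowest system rather than by the sum of all of them.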
However, with all of that you will need to weigh up the development time, deployment time, and long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.
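The per-column timeout with an overall default can be sketched as a small cache. This is an illustration under stated assumptions: the TtlCache class, the column names and the timeout values are hypothetical, and the clock is injectable so that expiry can be demonstrated without waiting.

```python
import time


class TtlCache:
    def __init__(self, default_ttl, ttl_by_column=None, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._ttl_by_column = dict(ttl_by_column or {})
        self._clock = clock
        self._entries = {}   # (column, key) -> (expires_at, value)

    def get(self, column, key, load):
        """Return the cached value, calling load(key) when missing or stale."""
        now = self._clock()
        entry = self._entries.get((column, key))
        if entry is not None and entry[0] > now:
            return entry[1]
        value = load(key)
        ttl = self._ttl_by_column.get(column, self._default_ttl)
        self._entries[(column, key)] = (now + ttl, value)
        return value


calls = []


def load_region(person_id):
    calls.append(person_id)   # records each trip to the slow system
    return "north"


fake_now = [0.0]
cache = TtlCache(default_ttl=60, ttl_by_column={"dateOfBirth": 86400},
                 clock=lambda: fake_now[0])
cache.get("salesRegion", 17, load_region)
cache.get("salesRegion", 17, load_region)   # served from cache
fake_now[0] = 61.0
cache.get("salesRegion", 17, load_region)   # expired, reloaded
```

Rarely changing columns such as dateOfBirth get a long timeout, while everything else falls back to the default.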
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database.
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
