Real time multi threaded max-heap for top-N geohash - multithreading

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) datastore in memory that would
- keep count of incoming demand by locality (geo hash)
- be read by hundreds of our suppliers every minute (the AJAX refresh is every minute)
I was thinking of a multi threaded synchronized max-heap. This would be a complex solution as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicatable master slave) data structure that can be read and updated in multi threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a requirement, so no MySQL-based solutions. If you recommend a Redis or MongoDB solution, please keep in mind that the queries are not point queries by key but top-N queries.
Thanks in advance.

If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
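To make the "most updates don't change the shape" point concrete, here is a minimal sketch. It is not a real concurrent heap: Python has no compare-and-set integer, so a single lock stands in for the atomic increment, and a small dict stands in for the heap. The class name, the geohash strings and the `structural_changes` counter are all made up for illustration. In a language with atomic integers, the first branch below (bumping a key that is already in the top N) is the part that could run without the structural lock.

```python
import threading

class TopNTracker:
    """Tracks the N most-hit localities; counts only ever increase."""

    def __init__(self, n=10):
        self.n = n
        self._counts = {}            # locality -> total hits
        self._top = {}               # current top-N: locality -> hits
        self._lock = threading.Lock()
        self.structural_changes = 0  # how often membership of the top N changed

    def increment(self, locality):
        with self._lock:
            count = self._counts.get(locality, 0) + 1
            self._counts[locality] = count
            if locality in self._top:
                # Common case: key bump only, membership is unchanged.
                self._top[locality] = count
            elif len(self._top) < self.n:
                self._top[locality] = count
                self.structural_changes += 1
            else:
                weakest = min(self._top, key=self._top.get)
                if count > self._top[weakest]:
                    # Rare case: a new locality displaces the weakest entry.
                    del self._top[weakest]
                    self._top[locality] = count
                    self.structural_changes += 1

    def top(self):
        with self._lock:
            return sorted(self._top.items(), key=lambda kv: kv[1], reverse=True)

tracker = TopNTracker(n=2)
for locality, hits in [("tdr1w", 5), ("tdr1x", 3), ("tdr1y", 4)]:
    for _ in range(hits):
        tracker.increment(locality)
```

In this toy run there are 12 updates but only 3 of them change which localities are in the top 2, which is the property the locking scheme relies on.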
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
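For reference, a count-min sketch is small enough to sketch in a few lines. This is an illustrative, single-threaded version and not any particular library's implementation: the width and depth defaults, the class name and the geohash strings are assumptions, and salted BLAKE2 from the standard library stands in for the pairwise-independent hash functions the data structure is usually described with. Estimates can overcount because of collisions but never undercount, and two sketches of the same shape can be merged by adding their tables, which is what makes it easy to distribute.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One salted hash per row picks the column to touch in that row.
        data = key.encode("utf-8")
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, amount=1):
        for row, col in self._columns(key):
            self.rows[row][col] += amount

    def estimate(self, key):
        # The minimum over rows is the least collision-inflated counter.
        return min(self.rows[row][col] for row, col in self._columns(key))

    def merge(self, other):
        # Sketches built on different nodes combine by element-wise addition.
        for mine, theirs in zip(self.rows, other.rows):
            for i, value in enumerate(theirs):
                mine[i] += value

node_a = CountMinSketch()
node_b = CountMinSketch()
for _ in range(3):
    node_a.add("tdr1w")
for _ in range(2):
    node_b.add("tdr1w")
node_a.merge(node_b)
```

To get a top-10 out of this, you would feed `estimate(key)` for each updated key into a small top-N structure rather than keeping an exact counter per locality.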
Hope this helps!

Related

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?
I had the exact same requirements for a "project" (rather a tool) a year ago. I used Cassandra and didn't regret it. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than listing the benefits of using it, I'll concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key, in your case it will be the timestamp. However, you cannot order data between different rows. They might be ordered after the query, but it is not guaranteed in any way so don't think about it. There was some kind of way to write a query before 2.1 I believe (using order by and disabling paging and allowing filtering) but that introduced bad performance and I don't think it is even possible now. So you should order data between rows on your querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to get those rows with different variable types, then do your resorting on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset is an issue to solve.
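The resorting on the querying side can be cheap if each row already comes back ordered by its clustering key, because then it is a merge rather than a full sort. A minimal sketch, with no Cassandra driver involved: the two lists stand in for query results from two rows, and the tuple layout `(timestamp, variable, value)` is made up for illustration.

```python
import heapq

# Each list is assumed to be already ordered by timestamp, as a single row would be.
temperature = [(1, "temperature", 20.5), (4, "temperature", 20.7)]
pressure = [(2, "pressure", 1012), (3, "pressure", 1011)]

def replay(*rows):
    """Lazily merge several timestamp-ordered rows into one ordered stream."""
    return heapq.merge(*rows, key=lambda event: event[0])

ordered = list(replay(temperature, pressure))
```

`heapq.merge` only ever holds one pending event per row in memory, so this also works when the rows are iterators over paged results.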
Row length is limited to 2 billion elements, and although that seems like a lot, it really is not unreachable with time series data. You don't want to get anywhere near those two billion anyway; keep rows in the hundreds of millions at most. If you split the rows on some parameter (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well.
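As an example of what "implement that in your query logic" means, here is a sketch that splits rows by day. The key format `variable:YYYY-MM-DD` and both function names are assumptions for illustration; the point is only that the reader has to enumerate every bucket that overlaps the requested range.

```python
from datetime import date, timedelta

def partition_key(variable, day):
    """Row (partition) key for one variable on one day."""
    return f"{variable}:{day.isoformat()}"

def partitions_between(variable, start, end):
    """All row keys that must be queried to cover start..end inclusive."""
    days = (end - start).days
    return [partition_key(variable, start + timedelta(days=i)) for i in range(days + 1)]

keys = partitions_between("temperature", date(2015, 6, 29), date(2015, 7, 1))
```

Each key in `keys` is then queried separately, and the per-row results are concatenated in key order.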
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. CQL has specific rules about what you can filter on in the WHERE clause.
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on Datastax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a BLOB, then parse it internally from your application (i.e. store it as JSON and read it in your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
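The BLOB approach amounts to something like the following on the application side. This is only a sketch of the serialisation step, with made-up function names and event fields; the bytes would be written to and read from a blob column by whatever driver you use.

```python
import json

def to_blob(event):
    """Serialise an event of any shape to bytes for a blob column."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def from_blob(blob):
    """Parse the stored bytes back into a dict in the application."""
    return json.loads(blob.decode("utf-8"))

login = {"type": "login", "user": "alice"}
reading = {"type": "reading", "sensor": 7, "values": [20.5, 20.7]}
blobs = [to_blob(login), to_blob(reading)]
restored = [from_blob(b) for b in blobs]
```

The trade-off is that Cassandra cannot filter on anything inside the blob, so whatever you need in a WHERE clause still has to be a real column.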
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

Which is better - auto-generated id or manual id assignment in couchdb documents?

Should I be generating the id of the documents in a CouchDB or should I depend on CouchDB to generate it? What are the advantages or disadvantages in these approaches? Is there any performance implications on any of these options?
There is no difference as far as CouchDB is concerned. Frederick is right that sequential ids are slightly faster. If you query /_uuids?count=10 you will notice that the UUIDs are sequential (by default).
However, even with random IDs, once you run compaction, they will all be in the "right" order internally in the .couch file and at that point there is no difference. So in the long run, I don't usually worry about it.
The main thing is that you should use mostly sequential ids. As this article and this bit of the couchdb book explain, using random ids results in a much less efficient structure internally, both speed wise and in terms of space used on disc.
Self-generated ids are almost impossible to deal with if you have two or more separate instances of your app, because the synchronisation between the different instances is not instantaneous. A solution for this can be to have one server dedicated to generating (or checking the availability of) the ids, for example using a SQL database, and acting as a gate for document creation.
On the other hand, if you have only one server and will never need more, there is one advantage I find interesting to self generated uids: since they have to be unique, you can use them in urls. For instance take the slug of the title of a blog post as the _id.
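Generating such a slug is straightforward. A minimal sketch, assuming ASCII titles; the function name is made up, and you still need to handle two posts that produce the same slug, since CouchDB will reject the second document with a conflict.

```python
import re

def slugify(title):
    """Lowercase the title and collapse anything non-alphanumeric into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

doc = {"_id": slugify("Hello, World! My First Post"), "title": "Hello, World! My First Post"}
```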
Performance-wise, CouchDB's generated ids are pretty long, so if your own ids are shorter, you will save significant disk space (assuming you have a looot of documents).
Both answers above talk about the pros of sequential IDs.
Here is a major problem that arises with sequential IDs:
Given a single ID, the IDs of other documents become predictable.
Because of this, we can't use sequential IDs as identifiers in application URLs, and using them for URL-based authentication (as file sharing services do) is not possible either.

Azure Table Storage Design for Web Application

I am evaluating the use of Azure Table Storage for an application I am building, and I would like to get some advice on...
whether or not this is a good idea for the application, or
if I should stick with SQL, and
if I do go with ATS, what would be a good approach to the design of the storage.
The application is a task-management web application, targeted to individual users. It is really a very simple application. It has the following entities...
Account (each user has an account.)
Task (users create tasks, obviously.)
TaskList (users can organize their tasks into lists.)
Folder (users can organize their lists into folders.)
Tag (users can assign tags to tasks.)
There are a few features / requirements that we will also be building which I need to account for...
We eventually will provide features for different accounts to share lists with each other.
Users need to be able to filter their tasks in a variety of ways. For example...
Tasks for a specific list
Tasks for a specific list which are tagged with "A" and "B"
Tasks that are due tomorrow.
Tasks that are tagged "A" across all lists.
Tasks that I have shared.
Tasks that contain "hello" in the note for the task.
Etc.
Our application is AJAX-heavy with updates occurring for very small changes to a task. So, there are a lot of small requests and updates going on. For example...
Inline editing
Click to complete
Change due date
Etc...
Because of the heavy CRUD work, and the fact that we really have a list of simple entities, it would be feasible to go with ATS. But, I am concerned about the transaction cost for updates, and also whether or not the querying / filtering I described could be supported effectively.
We imagine numbers starting small (~hundreds of accounts, ~hundreds or thousands of tasks per account), but we obviously hope to grow our accounts.
If we do go with ATS, would it be better to have...
One table per entity (Accounts, Tasks, TaskLists, etc.)
Sets of tables per customer (JohnDoe_Tasks, JohnDoe_TaskLists, etc.)
Other thoughts?
I know this is a long post, but if anyone has any thoughts or ideas on the direction, I would greatly appreciate it!
Azure Table Storage is well suited to a task application. As long as you set up your partition keys and row keys well, you can expect fast and consistent performance with a huge number of simultaneous users.
For task sharing, ATS provides optimistic concurrency to support multiple users accessing the same data in parallel. You can use optimistic concurrency to warn users when more than one account is editing the same data at the same time, and prevent them from accidentally overwriting each other's changes.
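In case optimistic concurrency is unfamiliar: every entity carries a version tag (an ETag in ATS), and an update only succeeds if the tag you read is still the current one. The sketch below imitates that with an in-memory dict; `EntityStore`, `ConflictError` and the integer tags are hypothetical stand-ins and not the Azure SDK.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale version of the entity."""

class EntityStore:
    def __init__(self):
        self._rows = {}  # key -> (entity dict, version tag)

    def insert(self, key, entity):
        self._rows[key] = (dict(entity), 1)

    def read(self, key):
        entity, etag = self._rows[key]
        return dict(entity), etag

    def update(self, key, entity, etag):
        _, current = self._rows[key]
        if etag != current:
            raise ConflictError(f"{key} was changed by someone else")
        self._rows[key] = (dict(entity), current + 1)

store = EntityStore()
store.insert("task-1", {"title": "Buy milk"})

# Two accounts read the same shared task at the same version.
alice_copy, alice_tag = store.read("task-1")
bob_copy, bob_tag = store.read("task-1")

alice_copy["title"] = "Buy milk and eggs"
store.update("task-1", alice_copy, alice_tag)  # succeeds, version moves on

bob_copy["title"] = "Buy bread"
try:
    store.update("task-1", bob_copy, bob_tag)  # stale tag, rejected
    bob_conflicted = False
except ConflictError:
    bob_conflicted = True
```

The conflict branch is where the application would warn the second user and offer to reload.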
As to the costs, you can estimate your transaction costs based on the number of accounts, and how active you expect those accounts to be. So, if you expect 300 accounts, and each account makes 100 edits a day, you'll have 30K transactions a day, which (at $.01 per 10K transactions) will cost about $.03 a day, or a little less than $1 a month. Even if this estimate is off by 10X, the transaction cost per month is still less than a hamburger at a decent restaurant.
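Spelled out, the estimate is just the following arithmetic. The account and edit figures and the $0.01 per 10K price are the ones quoted above; the 30-day month is an assumption, and current pricing should be checked before relying on the result.

```python
accounts = 300
edits_per_account_per_day = 100
price_per_10k_transactions = 0.01  # dollars, as quoted above
days_per_month = 30                # assumption

transactions_per_day = accounts * edits_per_account_per_day
cost_per_day = transactions_per_day / 10_000 * price_per_10k_transactions
cost_per_month = cost_per_day * days_per_month
cost_per_month_if_10x = cost_per_month * 10

print(f"{transactions_per_day} transactions/day")
print(f"${cost_per_day:.2f}/day, ${cost_per_month:.2f}/month")
print(f"${cost_per_month_if_10x:.2f}/month if the estimate is off by 10X")
```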
For the design, the main aspect to think about is how to key your tables. Before designing your application for ATS, I'd recommend reading the ATS white paper, particularly the section on partitioning. One reasonable design for the application would be to use one table per entity type (Accounts, Tasks, etc), then partition by the account name, and use some unique feature of the tasks for the row key. For both key types, be sure to consider the implications on future queries. For example, by grouping entities that are likely to be updated together into the same partition, you can use Entity Group Transactions to update up to 100 entities in a single transaction -- this not only increases speed, but saves on transaction costs as well. For another implication of your keys, if users will tend to be looking at a single folder at a time, you could use the row key to store the folder (e.g. rowkey="folder;unique task id"), and have very efficient queries on a folder at a time.
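To show why the "folder;unique task id" row key makes folder queries efficient: row keys within a partition are ordered, so all tasks in one folder form a contiguous key range. The sketch below simulates that with a sorted list of strings; the function names are made up, and it assumes folder names never contain ';'.

```python
def row_key(folder, task_id):
    return f"{folder};{task_id}"

def folder_range(folder):
    """Half-open key range [lower, upper) covering every task in the folder."""
    lower = f"{folder};"
    upper = f"{folder}<"  # '<' is the character immediately after ';'
    return lower, upper

def tasks_in_folder(sorted_keys, folder):
    lower, upper = folder_range(folder)
    return [key for key in sorted_keys if lower <= key < upper]

keys = sorted([
    row_key("home", "t1"),
    row_key("home", "t7"),
    row_key("homework", "t2"),
    row_key("work", "t3"),
])
home_tasks = tasks_in_folder(keys, "home")
```

Against the real service the same bounds would go into a range filter on the row key, combined with an equality filter on the partition key (the account).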
Overall, ATS will support your task application well, and allow it to scale to a huge number of users. I think the main question is, do you need cloud magnitude of scaling? If you do, ATS is a great solution; if you don't, you may find that adjusting to a new paradigm costs more time in design and implementation than the benefits you receive.
What you are asking is a rather big question, so forgive me if I don't give you an exact answer. The short answer would be: Sure, go ahead with ATS :)
Your biggest concern in this scenario would be about speed. As you've pointed out, you are expecting a lot of CRUD operations. Out of the box, ATS doesn't support transactions, but you can architect yourself out of such a challenge by using the CQRS structure.
The big difference between SQL and ATS is the lack of relations and general query possibilities, since ATS is a "NoSQL" approach. This means you have to structure your tables in a way that supports your query operations, which is not a simple task.
If you are aware of this, I don't see any trouble doing what you're describing.
Would love to see the end result!

When designing a hash_table, how many aspects should be paid attention to?

I have some candidate aspects:
The hash function is important, the hashcode should be unique as far as possible.
The backend data structure is important, the search, insert and delete operations should all have time complexity O(1).
The memory management is important, the memory overhead of every hash_table entry should be as small as possible. When the hash_table is expanding, the memory should increase efficiently, and when the hash_table is shrinking, the memory should be garbage collected efficiently. And with these memory operations, aspect 2 should also be fulfilled.
If the hash_table will be used in multi_threads, it should be thread safe and also be efficient.
My questions are:
Are there any more aspects worth attention?
How to design the hash_table to fulfill these aspects?
Are there any resources I can refer to?
Many thanks!
After reading some material, I've updated my questions. :)
In a book explaining the source code of SGI STL, I found some useful information:
The backend data structure is an array of buckets, each holding a linked list. When searching for, inserting or deleting an element in the hash_table:
Use a hash function to calculate the corresponding position in the bucket array; the elements are stored in the linked list at this position.
When the number of elements is larger than the number of buckets, the buckets need resizing: expand to about 2 times the old size. The number of buckets should be prime. Then copy the old buckets and elements to the new one.
I didn't find the logic for garbage collection when the number of elements is much smaller than the number of buckets. But I think this logic should be considered when there are many inserts at first and then many deletes later.
Other data structures such as arrays with linear probing or quadratic probing are not as good as linked lists.
A good hash function can avoid clusters, and double hashing can help to resolve clusters.
The question about multi_threads is still open. :D
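The scheme described above can be sketched roughly as follows. This is an illustration of the idea and not the SGI STL code: Python lists stand in for the linked lists, the growth rule "next prime at or above twice the old bucket count" is an approximation of what the book describes, and shrinking is left out.

```python
def next_prime(n):
    """Smallest prime >= n, by trial division (fine for a sketch)."""
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True
    while not is_prime(n):
        n += 1
    return n

class ChainedHashTable:
    def __init__(self, buckets=7):
        self.buckets = [[] for _ in range(next_prime(buckets))]
        self.size = 0

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)  # overwrite, size unchanged
                return
        chain.append((key, value))
        self.size += 1
        if self.size > len(self.buckets):
            self._resize(next_prime(2 * len(self.buckets)))

    def get(self, key, default=None):
        for existing, value in self._chain(key):
            if existing == key:
                return value
        return default

    def delete(self, key):
        chain = self._chain(key)
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                del chain[i]
                self.size -= 1
                return True
        return False

    def _resize(self, new_count):
        old = self.buckets
        self.buckets = [[] for _ in range(new_count)]
        for chain in old:
            for key, value in chain:
                self._chain(key).append((key, value))

table = ChainedHashTable()
for n in range(20):
    table.put(n, f"v{n}")
table.delete(3)
```

Starting from 7 buckets, the 20 inserts trigger two resizes (to 17 and then 37 buckets).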
There are two (slightly) orthogonal concerns.
While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:
the hash function depends on the data to be stored
the backend depends on the requirements of the storage
For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).
For the back-end, there are various concerns, as you noted. Some remarks:
Are we talking average or worst case complexity ? Without perfect hashing, achieving O(1) is nigh-impossible as far as I know, though the worst case frequency and complexity can be considerably dampened.
Are we talking amortized complexity ? Amortized complexity in general offers better throughput at the cost of "spikes". Linear rehashing, at the cost of a slightly lower throughput, will give you a smoother curve.
With regard to multi-threading, note that the Read/Write pattern may impact the solution, considering extreme cases, 1 producer and 99 readers is very different from 99 producers and 1 reader. In general writes are harder to parallelize, because they may require modifying the structure. At worst, they might require serialization.
Garbage Collection is pretty trivial in the amortized case, with linear-rehashing it's a bit more complicated, but probably the least challenging portion.
You never talked about the amount of data you're about to use. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread them around to avoid contention.
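One common way to exploit the fact that writers on different buckets don't interfere is lock striping: split the table into shards, each with its own lock, so two writers only contend when their keys land in the same shard. A minimal sketch, where the class name, the stripe count of 16 and the counter-style API are all assumptions for illustration:

```python
import threading

class StripedCounterTable:
    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._shards = [{} for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def increment(self, key, amount=1):
        i = self._stripe(key)
        with self._locks[i]:  # only writers on the same stripe wait here
            shard = self._shards[i]
            shard[key] = shard.get(key, 0) + amount

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._shards[i].get(key, 0)

table = StripedCounterTable()

def worker(name):
    for _ in range(1000):
        table.increment(name)      # mostly uncontended: one key per writer
        table.increment("shared")  # contended: every writer hits this key

threads = [threading.Thread(target=worker, args=(f"writer-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Resizing is the hard part this sketch skips, since it touches every stripe at once.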
References:
The article on Wikipedia exposes lots of various implementations, always good to peek at the variety
This Google talk from Dr. Cliff Click (Azul Systems) shows a hash table designed for heavily multi-threaded systems, in Java.
I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable
The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are:
To go from hash to index, use binary AND instead of modulo a prime. This is many times faster. Your table size must be a power of two.
For hash collisions don't use a linked list, store the values in the table to improve cache performance.
By using a state machine you can get a very fast multi-thread implementation. In his blog entry he lists the states in the state machine, but due to license problems he does not provide source code.
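The first two conclusions can be illustrated with a small single-threaded sketch. To be clear, this is not Cliff Click's implementation: it has no state machine, no resizing and no deletion, and the class name is made up. It only shows the power-of-two mask replacing the modulo, and entries living directly in the array with linear probing instead of in linked lists.

```python
class OpenAddressingTable:
    def __init__(self, capacity=8):
        if capacity <= 0 or capacity & (capacity - 1) != 0:
            raise ValueError("capacity must be a power of two")
        self.mask = capacity - 1  # e.g. 8 -> 0b111
        self.slots = [None] * capacity

    def _probe(self, key):
        # hash & mask equals hash % capacity when capacity is a power of two.
        index = hash(key) & self.mask
        for _ in range(len(self.slots)):
            yield index
            index = (index + 1) & self.mask  # linear probing, wraps around

    def put(self, key, value):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None or slot[0] == key:
                self.slots[index] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for index in self._probe(key):
            slot = self.slots[index]
            if slot is None:
                return default
            if slot[0] == key:
                return slot[1]
        return default

table = OpenAddressingTable(capacity=8)
table.put(1, "a")
table.put(9, "b")  # 9 & 7 == 1, so this collides with key 1 and moves one slot on
```

Because the colliding entry sits in the neighbouring array slot, a lookup that misses the first slot usually finds its key in the same cache line rather than chasing a pointer.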

Storing News in a Distributed DB vs RDBMS

Hi all: If I am storing News articles in a DB with different categories such as "Tech", "Finance", and "Health", would a distributed database work well in this system vs a RDBMS? Each of the news items would have the news articles attached as well as a few other items. I am wondering if querying would be faster, though.
Let's say I never have more than a million rows, and I want to grab the latest (within 5 hours) tech articles. I imagine that would be a map-reduce of "Give me all tech articles" (possibly 10000), then weed out only the ones that have the latest timestamp.
Am I thinking about tackling the problem in the right way, and would a DDB even be the best solution? In a few years there might be 5 million items, but even then....
Whether to use a distributed database or key-value store depends more on your operational requirements than your domain problem.
When people ask how to do time-ordered queries in Riak, we usually suggest several strategies (although none of them are a silver-bullet as Riak lacks ordered range queries):
1) If you are frequently accessing a specifically-sized chunk of time, break your data into buckets that reflect that period. For example, all data for the day, hour or minute specified would be either stored or linked to from a bucket that contains the appropriate timestamp. If I wanted all the tech news from today, the bucket name might be "tech-20100616". As your data comes in, add appropriate links from the time-boxed bucket to the actual item.
2) If the data is more sequence-oriented and not related to a specific point in time, use links to create a chain of data, linking backward in time, forward, or both. (This works well for versioned data too, like wiki pages.) You might also have to keep an object that just points at the head of the list.
Those strategies aside, Riak is probably not the 100% solution for up-to-the-minute information, but might be better for the longer-term storage. You could combine it with something like Redis, memcached, or even MongoDB (which has great performance if your data is mildly transient and can fit in memory) to hold a rolling index of the latest stuff.
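For strategy 1, the application side is mostly a matter of computing bucket names. A sketch, with no Riak client involved: the day-level name follows the "tech-20100616" example above, while the hour-level naming and both function names are assumptions for illustration.

```python
from datetime import datetime, timedelta

def day_bucket(category, when):
    """Bucket name for one category on one day, e.g. tech-20100616."""
    return f"{category}-{when:%Y%m%d}"

def hour_buckets(category, now, hours):
    """Hourly bucket names covering the last `hours` hours, newest first."""
    names = []
    for i in range(hours):
        moment = now - timedelta(hours=i)
        names.append(f"{category}-{moment:%Y%m%d%H}")
    return names

today = day_bucket("tech", datetime(2010, 6, 16, 14, 30))
recent = hour_buckets("tech", datetime(2010, 6, 16, 2, 30), 5)
```

"The latest tech articles within 5 hours" then becomes five bucket reads instead of a scan over every tech article.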

Resources