Using Cassandra to store immutable data? - cassandra

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per seconds (but the rate will increase). Each event is small, about 1 Kb.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?

I've had the exact same requirements for a "project" (rather a tool) a year ago, and I used Cassandra and I didn't regret. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster and the performance is impressive (although you might need tweaking) and the natural ordering is a nice thing to have.
Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key, in your case it will be the timestamp. However, you cannot order data between different rows. They might be ordered after the query, but it is not guaranteed in any way so don't think about it. There was some kind of way to write a query before 2.1 I believe (using order by and disabling paging and allowing filtering) but that introduced bad performance and I don't think it is even possible now. So you should order data between rows on your querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to get those rows with different variable types, then do your resorting on the querying side. Another way to do it is to put all variable types in one row, but than filtering for only a subset is an issue to solve.
Rowlength is limited to 2 billion elements, and although that seems a lot, it really is not unreachable with time series data. Especially because you don't want to get near those two billions, keep it lower in hundreds of millions maximum. If you put some parameter on which you will split the rows (some increasing index or rounding by day/month/year) you will have to implement that in your query logic as well.
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. There are specific rules in SQL with filtering, or using the WHERE clause..
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )

Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Timeseries are quite a common pattern and you can define a clustering order, for example, on the timestamp of the events in order to retrieve all the events in time order. I've found this article on Datastax Academy very useful when wanted to learn about time series.
Variable data structure it's not a problem: you can store the data in a BLOB, then parse it internally from your application (i.e. store it as JSON and read it in your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

Related

How do I find out right data design and right tools/database/query for below requirement

I have a kind of requirement but not able to figure out how can I solve it. I have datasets in below format
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparison, aggregation and queries over multiple row like
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123&GradeA
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
Also need to further do compute. What format should I chose for dataset
and what database/tools should I use?
I don't Relational Database is useful here. I am not able to solve it with Solr/Elastic if you have any ideas, please give a brief.Or any other tool Spark, hadoop, cassandra any heads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on things related to your SLA. things like how much can your query have latency? what are your query types? is your data categorized as big data or not? Is data updateable? Do we expect late events? Do we need historical data in the future or we can use techniques like rollup? and things like that. To clarify my answer, probably by using window functions you can solve your problems. For example, you can store your data on any of the tools you mentioned and by using the Presto SQL engine you can query and get your desired result. But not all of them are optimal. Furthermore, usually, these kinds of problems can not be solved with a single tool. A set of tools can cover all requirements.
tl;dr. In the below text we don't find a solution. It introduces a way to think about data modeling and choosing tools.
Let me take try to model the problem to choose a single tool. I assume your data is not updatable, you need a low latency response time, we don't expect any late event and we face a large volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you wanna query on a particular ID), so solutions like parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. Both the first and second requirements and the last requirement, count on ID as an identifier part and it seems there is nothing like join and global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the cluster part; For each ID, events are ordered based on the time.
The third requirement is a bit vague. You wanna result on all data? or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the mentioned notes, it seems we should choose a tool that has good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are things that currently I can remember them. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kind of filter aggregations, we can have a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we can answer the first and second questions with ES, but for now, I can't make a query in my mind. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there is a lot of 'if's on the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

What is the ideal structure of tables for Spark to work with (Tall vs Wide)?

I've been trying to think about what the ideal table structure would be for the fastest Spark queries.
I'll try and provide a use case: Let's say your gathering stats for every car in the world and you want to use calculate various metrics with basic math (i.e. add, sub, mult, div).
Would be better to structure the data in a tall table with minimal fields like: day, metric, type, value?
Or would it be better to build a wide tables, that may store metrics independently. With more fields like: day, emmision_value, tire_pressure_value, speed_value, weight_value, heat_value, radio_value, etc .
Is it right to say that tall tables are better for spark? I assume it would be less memory intensive with a taller table.
As mentioned in the comments, this is a subjective question not exactly related to spark, but I'll try and answer none the less.
I assume it would be less memory intensive with a taller table.
Not really, the amount of storage required should be the same in either case based on the use case you have mentioned so let's get this out of the way. In case of taller tables there more rows and lesser columns and in case of wide tables the opposite. So on a cell level it should roughly be the same. I'm considering un compressed data independent of storage format.
Now lets talk about the mentioned use case. Simply put, it's aggregations. This may be fed downstream or may be used for reporting. Generally keeping this is mind, wider tables/views are better simply because - Lesser rows per day = less I/O as less shuffle.
Having said that, look through the cons below as well,
Schema evolution problems due to fixed schema
more suited for batch processing
Taller tables will be more streaming friendly, easier to extend for additional metrics and if its used with a source that supports push down, can result in quick partial scans.
in short, it very much depends on your operations.

Searching through polymorphic data with Elasticsearch

I am stumped at what seems to be a fundamental problem with Elasticsearch and polymorphic data. I would like to be able to find multiple types of results (e.g. users and videos and playlists) with just one Elasticsearch query. It has to be just one query, since that way Elasticsearch can do all the scoring and I won't have to do any magic to combine multiple query results of different types.
I know that Elasticsearch uses a flat document structure, bringing me to the following problem. If I index polymorphic data, I will have to specify a 'missing' value for each unique attribute that I care about in scoring subtypes of the polymorphic data.
I've looked for examples of other dealing with this problem and couldn't find any. There doesn't seem to be anything in the documentation on this either. Am I overlooking something obvious or was Elasticsearch just not designed to do something like this?
Kind regards,
Steffan
Thats not the issue of Elasticsearch itself, its the problem (or limitation) of underlying lucene indexes. So, any db/engine based on lucene will have the same problems (if not worse :), ES does a hell ton of job for you). Probably, ES will ease the pain in further releases, but not dramatically. And IMO, there's hardly any hi-perf search engine that can bear with true polymorphic data.
The answer depends on your data structure, thats for sure. Basically, you have two options:
Put all your data in single index, and split it by types. And you already know the overhead - lucent indexes works poorly with sparse data. More similar your data is, less problem you have. Anyway, ES will do all the underlying job for "missing" values, you only have to cope with memory/disk overhead for storing sparse data.
If your data is organised with parent-child relation (i.e. video -> playlist), you definitely need single index for such data. Which is leaving you with this approach only.
Divide your data into multiple indexes. This way you have slightly higher disk overhead for lucene index + possibly higher CPU usage when aggregation data from multiple shards (so, you should tune your sharding respectively).
You still can query ES for all your documents in single request, as ES supports multi-index queries.
So, this looks like question purely of your data structure. I'd recommend to simply fire up small cluster to measure memory/disk/cpu usage for expected data. More details on "index vs shard" – great article by Adrien.
Slightly off-topic, if ES doesn't seem to feet your needs, I suggest you to
still consider merging data on application side. ES works great with multiple light request (instead of few heavier), and as your results from ES is sorted already, you need to merge sorted streams having sorted input. Not so much magic there, tbh.

Real time multi threaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to make a near real time (lag no more than 5 minutes) datastore in memory that would
- keep count of incoming demand by locality (geo hash)
- reads by hundreds of our suppliers every minute (the ajax refresh is every minute)
I was thinking of a multi threaded synchronized max-heap. This would be a complex solution as tree locking is by itself a complex implementation.
Any recommendations for the best in-memory (replicatable master slave) data structure that can be read and updated in multi threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need so no mySQL based solutions. If you recommend redis or mongo DB solution, please realize that the queries are not pointed-queries by key but a top-N query instead.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
Hope this helps!

Is Cassandra a good candidate database for that must sustain over 100 read/write operations per second?

Currently our system uses PostgreSQL, however we seem to have pushed the limit of its capabilities. Some of our tables need to handle over 100 read/write operations per second so it is probably time to scale horizontally across multiple machines.
Have a lot of experience using GAE's Big Table. Big Table had rich options for querying. For example, queries were possible against list data fields. Cassandra is supposed to be based off of Big Table, but if I understand correctly, for Cassandra, we will actually have to custom-code a layer on top of Cassandra that uses and maintains index tables.
Would be great if there was an open source database available for which we did not have to build our own custom logic for maintaining index tables, zig-zag merge joins, etc...
Is Cassandra a good candidate here? Or are there ones that might be considered better?
Unless the operations are huge joins or return hundreds of thousands of rows, any database you choose will be able to sustain 100 ops/s. Cassandra will have no problems serving thousands if not tens of thousands of reads and writes per node.
Without knowing more about your particular use case it's impossible to give you meaningful advice. Cassandra is a great database, but if it's right for you I don't know. I'd suggest looking through the cassandra tag here on Stack Overflow and look at what people ask about and if it looks at all like what you're trying to do, and if the answers say that it's possible with Cassandra (I know I've answered quite a few questions where the answer was that Cassandra wasn't the best choice for that particular case).
Cassandra and GAE Big Table have big similarities, but also big differences. One thing that trips up new Cassandra users is that there really isn't any way of doing things like "add this thing only unless that other thing was there" or "add an item and remove all but the last N items".

Resources