Cassandra: store and query dynamic (user-defined) data

We've been looking into using Cassandra to store some of the larger data in a multi-tenant system we are building. The decision to use Cassandra is mostly to do with scaling capabilities and performance when working with large data sets, but I am not sure whether what we're looking for is possible in Cassandra, so I'm hoping someone has some clues as to whether (and how) this could be done:
We are looking for a way to allow our users to first define their own entity types and then define the fields in those entities (and the field types). Once they've defined this, their data (matching the definitions they just created) could be imported, stored and, most importantly, queried by pretty much any field they defined.
So for instance, we may have one user who defines an Airplane, which has the manufacturer name, model, tail number, year of production, etc...
Their data will then contain those fields and be searchable and sortable by them.
Another user may decide to define a Boat, which can then have different fields, which should also be sortable and searchable by content.
Because of the potential number of entries, the typical relational approach is unlikely to yield adequate performance, so we're looking at a NoSQL approach.
Is this something that could be done in C*? Or are there any other suggestions for a storage engine that would offer the best flexibility?

I can see two important points in your requirements:
Dynamic typing/schemaless data: Cassandra defines how data is structured up front, like a relational database. You can, however, use columns of complex types such as map to hold arbitrary key/value pairs.
Query by any field: Cassandra requires each query to provide the partition key. The Cassandra data model is driven by the queries: if you don't know your queries in advance, you won't be able to design the appropriate model, and you won't be able to query the data efficiently.
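As a minimal sketch of the first point (table and column names here are hypothetical, not from your system), user-defined fields can be held in a map column; the second point shows up as soon as you try to query by one of those fields:

CREATE TABLE entities (
    tenant_id   text,
    entity_type text,                -- e.g. 'airplane' or 'boat', defined by the user
    entity_id   uuid,
    fields      map<text, text>,     -- user-defined field name -> value
    PRIMARY KEY ((tenant_id, entity_type), entity_id)
);

-- Listing entities of a known type is efficient, because the full partition key is given:
SELECT entity_id, fields FROM entities
WHERE tenant_id = 'acme' AND entity_type = 'airplane';

-- But "all airplanes where manufacturer = 'Boeing', sorted by year" has no efficient
-- CQL form, because map values are not part of the primary key. A secondary index on
-- map entries (CREATE INDEX ... ON entities (ENTRIES(fields))) allows equality lookups,
-- but it does not give you sorting or scalable ad hoc querying.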
I advise you to have a look at Elasticsearch.
If you have to use Cassandra for some other reason, then I advise you to look at the DataStax Enterprise edition of Cassandra, which integrates with Solr and Spark: both will give you extra querying capabilities.

Related

How do I find the right data design and the right tools/database/query for the requirement below

I have a requirement but I am not able to figure out how to solve it. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or, if I put it in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, like:
time difference between the last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th and latest rows
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elasticsearch; if you have any ideas, please give a brief explanation. Or any other tool: Spark, Hadoop, Cassandra, any pointers?
I am trying out things but any help is appreciated.
Choosing the right technology depends heavily on your SLA: how much latency can your queries tolerate? What are your query types? Is your data big data or not? Is the data updatable? Do we expect late events? Do we need historical data in the future, or can we use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems using window functions. For example, you can store your data in any of the tools you mentioned and, by using the Presto SQL engine, query it and get your desired result. But not all of them are optimal. Furthermore, these kinds of problems usually cannot be solved with a single tool; a set of tools may be needed to cover all requirements.
tl;dr: the text below does not arrive at a single solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem in order to choose a single tool. I assume your data is not updatable, you need low-latency response times, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular id), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by id. The first, second and last requirements all rely on id as the identifying part, and there seems to be no need for joins or for global ordering on other fields like time. So we can choose id as the partitioner (physical or logical) and atime as the clustering part; for each id, events are ordered by time.
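A rough CQL sketch of that model under those assumptions (column names follow the sample data; the types are guesses):

CREATE TABLE events (
    id    int,
    atime timestamp,
    grade text,
    PRIMARY KEY (id, atime)
) WITH CLUSTERING ORDER BY (atime DESC);

-- Last 2 rows for id = 123; the time difference itself has to be computed client-side
-- or in an engine like Spark/Presto, since CQL has no window functions:
SELECT atime, grade FROM events WHERE id = 123 LIMIT 2;

-- Last 10 records for a particular id:
SELECT * FROM events WHERE id = 123 LIMIT 10;

-- Last 2 rows for id = 123 and grade = 'A'; filtering on a non-key column is only
-- tolerable here because the query is confined to a single partition:
SELECT atime FROM events WHERE id = 123 AND grade = 'A' LIMIT 2 ALLOW FILTERING;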
The third requirement is a bit vague: do you want the result over all data, or per id?
For computing the first three conditions, we need a tool that supports window functions.
Based on the notes above, it seems we should choose a tool with good support for random-access queries. Cassandra, Postgres, Druid, MongoDB and Elasticsearch are the ones that currently come to mind. Let's check them:
Cassandra: It's great for response time on random-access queries, can handle a huge amount of data easily, and has no single point of failure. But sadly it does not support window functions. You also have to design your data model carefully, and it seems it's not a good tool to choose here (because of the future need for the raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding a new tool to our stack.
Postgres: It's great at random-access queries on indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing id as the shard key, we can have data locality for computations). But there is a problem: id is not unique, so we cannot choose id as the primary key, and we face some problems with random access (we can choose the id and atime columns (as a timestamp column) as a compound primary key, but that does not save us).
Druid: It's a great OLAP tool. Thanks to the way Druid stores data (segment files), with the right data model you can run analytic queries over a huge volume of data in sub-second time. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. However, by using rollup we lose the raw data, and we need it.
MongoDB: It supports random-access queries and sharding. We can also get a kind of window function through its aggregation framework by defining pipelines for aggregations. It supports capped collections, which we could use to store the last 10 events for each id if the cardinality of the id column is not high. It seems this tool can cover all of our requirements.
Elasticsearch: It's great at random access, maybe the greatest. With some kinds of filter aggregations we can approximate window functions. It can handle a large amount of data with sharding. But its query language is hard. I can imagine answering the first and second questions with ES, but for now I can't form the query in my mind; it would take time to find the right solution with it.
So it seems MongoDB and Elasticsearch can meet our requirements, but there are a lot of ifs along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Is Cassandra just a storage engine?

I've been evaluating Cassandra to replace MySQL in our microservices environment, due to MySQL being the only portion of the infrastructure that is not distributed. Our needs are both write and read intensive as it's a platform for exchanging raw data. A type of "bus" for lack of better description. Our selects are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of select queries.
For example, if I need to filter data, the filter fields have to be in the key. At that point I can't change the data in those fields because they're part of the key. I can use a SASI index, but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this, but in another post I was told to avoid them due to some instability and problematic behavior.
It would seem that Cassandra is good at storage but, realistically, not good as a standalone database platform for non-trivial applications beyond very basic filtering (i.e. a single field). I'm guessing I'll have to accept the use of another front end like Elastic, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is doable as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. It is designed as a distributed database oriented towards high availability and partition tolerance, which can limit query capability if you want good and reliable performance.
It has a query language, CQL, which is quite powerful, but it is deliberately limited in ways that guide users towards effective queries. To use it effectively you need to model your tables around your queries.
More often than not, you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that experience better, but they have had their share of bugs and limitations, as you indicated. If you consider using them you should be aware of those limitations, although that is generally a good idea when evaluating anything.
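For illustration only (the tables below are hypothetical, not from your platform), that denormalization usually means writing the same data to two tables, each keyed for one query, with a materialized view as the automated variant:

CREATE TABLE readings_by_device (
    device_id text,
    ts        timestamp,
    value     double,
    PRIMARY KEY (device_id, ts)
);

-- Same data, keyed for a different query (the application writes to both tables):
CREATE TABLE readings_by_day (
    day       date,
    ts        timestamp,
    device_id text,
    value     double,
    PRIMARY KEY (day, ts, device_id)
);

-- A materialized view keeps a second layout in sync automatically, with the maturity
-- caveats mentioned above:
CREATE MATERIALIZED VIEW readings_by_value AS
    SELECT * FROM readings_by_device
    WHERE device_id IS NOT NULL AND ts IS NOT NULL AND value IS NOT NULL
    PRIMARY KEY ((value), device_id, ts);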
If you need advanced querying capabilities, or do not know ahead of time what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (as DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.

Do we need to denormalize the model in Cassandra?

We usually store a graph of objects in databases. In an RDBMS, we make joins to retrieve the relationships between objects. In Cassandra, it is recommended to denormalize the model to fit the queries, but in doing so we make updating the model more complex and more query-specific.
Cassandra has complex data types like set, map, list and tuple. These types make it possible to store the relationships between objects in a straightforward manner (association, aggregation, composition of objects) by storing, for instance, the ids of the connected objects inside a list.
The only drawback is then having to split a complex SQL join into several requests.
I've not seen articles on Cassandra dealing with this kind of solution. Does anyone know why this solution is not promoted?
Cassandra is a highly write-optimized database, so writes are cheap: an extra three or four writes will hardly matter, considering the difficulties the alternative would create.
Regarding graphs of objects, the answer is: No. Cassandra isn't meant to store graphs of objects. Cassandra is meant to store data for queries. The RDBMS equivalent would be views in PostgreSQL. Data has to be stored in a way that a query can be easily serviced. The main reason being that reads are slow. The goal of data modeling in Cassandra is to make sure a read is almost always from a single partition.
If it were normalized data, a query would need to hit a minimum of two partitions and worst case scenarios would create latencies that would render the application unusable for any practical purpose.
Hence data modeling in Cassandra is always centered on queries and not the relationship between objects.
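As a hypothetical sketch of the difference (the names below are invented for illustration, not taken from the question): the collection-based model stores relationships, while the promoted model stores the answer to a query.

-- Relationship stored in a collection; reading a user's orders needs a second round of
-- queries against the orders table, one id at a time:
CREATE TABLE users (
    user_id   uuid PRIMARY KEY,
    name      text,
    order_ids list<uuid>
);

-- Query-driven, denormalized model; one query, one partition:
CREATE TABLE orders_by_user (
    user_id  uuid,
    order_id uuid,
    placed   timestamp,
    total    decimal,
    PRIMARY KEY (user_id, order_id)
);
SELECT * FROM orders_by_user WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;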
More on these basic rules can be found on the DataStax blog:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

Cassandra version differences

I started reading Cassandra: The Definitive Guide, which is based on Cassandra 0.7. Now I'm trying to experiment with Cassandra 2.1.5, and it seems there are a lot of differences, which is really confusing.
For example, I see that CQL did not exist in version 0.7. The data model also seems quite different: you can now define a schema with CQL, while in version 0.7 there was no schema.
Can anyone shortly explain the differences, especially about the data model?
I understand that in version 0.7 the idea was rows of different lengths, that is, rows that have different numbers of columns. But now I understand that each column is actually a field that contains a number of parameters, so you can have as many fields as you want within the same row (same key).
Can someone summarize the differences? Maybe I did not understand correctly.
An important point to consider is that the underlying storage model remains the same. CQL is simply an abstraction layer on top of that model, to make it easier to work with and model your data. DataStax MVP John Berryman has a great article on this: Understanding How CQL3 Maps to Cassandra's Internal Data Structure.
In this article, Berryman observes that:
The value of the CQL primary key is used internally as the row key (which in the new CQL paradigm is being called a “partition key”).
The names of the non-primary key CQL fields are used internally as columns names. The values of the non-primary key CQL fields are then internally stored as the corresponding column values.
Additionally, he outlines the benefits of using the CQL-based approach:
It provides fast look-up by partition key and efficient scans and slices by cluster key.
It groups together related data as CQL rows. This means that you can do in one query what would otherwise take multiple queries into different column families.
It allows for individual fields to be added, modified, and deleted independently.
It is strictly better than the old Cassandra paradigm. Proof: you can coerce CQL Tables to behave exactly like old-style Cassandra ColumnFamilies. (See the examples here.)
It extends easily to implementation of sets lists and maps (which are super ugly if you’re working directly in old cassandra) — but that’s for another blog post.
The CQL protocol allows for asynchronous communication as compared with the synchronous, call-response communication required by Thrift. As a result, CQL is capable of being much faster and less resource intensive than Thrift – especially when using single threaded clients.
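To make the mapping described above concrete, here is a small hypothetical table (not taken from Berryman's article) with the internal equivalents noted in comments:

CREATE TABLE playlists (
    user_id text,      -- CQL partition key   -> the internal row (partition) key
    song_no int,       -- clustering column   -> prefixes every internal column name
    title   text,      -- non-key CQL fields  -> stored as internal column name/value pairs
    artist  text,
    PRIMARY KEY (user_id, song_no)
);
-- Internally, the partition for one user_id holds cells named roughly (song_no, 'title')
-- and (song_no, 'artist'), so one CQL row maps to several storage cells inside a single
-- wide row, which is what makes slices by clustering key efficient.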
"...you can have as many fields as you want within the same row (same key)."
Actually, there is a hard limit of about 2 billion columns per partition (rowkey).

PouchDB structure

I am new to the NoSQL concept, so when I started to learn PouchDB I found this conversion chart. My confusion is: how does PouchDB handle it if, let's say, I have multiple tables? Does that mean I need to create multiple databases? Because from my understanding, in PouchDB a database can store a lot of documents, but does a document correspond to a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While #llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists, and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' & _id: 'album_title', would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. The SQL table defines the column headers (name and type); those are the JSON property names of the doc.
So all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemata in one database as you want (visit json-schema.org for some inspiration).
How do you request them separately? Create CouchDB views! You can get all or some "rows" of your tabular data (docs with the same schema) with one request, just as you know it from SQL.
To write such views easily, a type property is very common in CouchDB docs. The table name you know from SQL can become your type, e.g. doc.type: "animal".
Your view names might then be animalByName or animalByWeight, depending on your needs.
Sometimes a multiple-databases plan is a good option, like a database per user or even a database per user feature. Take a look at this conversation on the CouchDB mailing list.
