Do we need to denormalize the model in Cassandra?

We usually store a graph of objects in databases. In an RDBMS, we need to make joins to retrieve the relationships between objects. In Cassandra, the promoted approach is to denormalize the model to fit the queries. But by doing this, we make updates to the model more complex and more tied to specific queries.
In Cassandra, there are complex data types like set, map, list or tuple. These types make it possible to store the relationships between objects in a straightforward manner (association, aggregation, composition of objects), for instance by storing the ids of the connected objects in a list.
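For instance, a minimal sketch of what I have in mind (the table and column names are only illustrative):

    CREATE TABLE books (
        book_id uuid PRIMARY KEY,
        title text,
        author_ids list<uuid>   -- ids of the related authors, instead of a join table
    );

    CREATE TABLE authors (
        author_id uuid PRIMARY KEY,
        name text
    );

    -- reading a book with its authors then takes two requests instead of one join:
    -- SELECT author_ids FROM books WHERE book_id = ?;
    -- SELECT name FROM authors WHERE author_id IN ?;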
The only drawback is then having to split a complex SQL join into several requests.
I haven't seen papers on Cassandra dealing with this kind of solution. Does anyone know the reason why this solution is not promoted?

Cassandra is a highly write-optimized database, so writes are cheap: an extra three or four writes will hardly matter, considering the difficulties the alternative would create.
Regarding graphs of objects, the answer is: no. Cassandra isn't meant to store graphs of objects; Cassandra is meant to store data for queries. The RDBMS equivalent would be views in PostgreSQL. Data has to be stored in a way that a query can be easily serviced, the main reason being that reads are slow. The goal of data modeling in Cassandra is to make sure a read is almost always served from a single partition.
With normalized data, a query would need to hit a minimum of two partitions, and worst-case scenarios would create latencies that would render the application unusable for any practical purpose.
Hence data modeling in Cassandra is always centered on queries and not the relationship between objects.
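For illustration only (the table and column names here are made up, not from the question), this leads to one table per query, so the read is served from a single partition:

    -- serves exactly one query: "the latest comments of a given post"
    CREATE TABLE comments_by_post (
        post_id uuid,
        created_at timeuuid,
        author text,
        body text,
        PRIMARY KEY ((post_id), created_at)
    ) WITH CLUSTERING ORDER BY (created_at DESC);

    -- the whole read hits a single partition
    SELECT author, body FROM comments_by_post WHERE post_id = ? LIMIT 20;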
More on these basic rules can be found in DataStax's documentation:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

Related

RDBMS tables to CouchDB database tables

We are creating a new web application and we intend to use CouchDB. The old web application is being rewritten and we are migrating from an RDBMS to CouchDB. I have an RDBMS schema with 10+ tables and I want to recreate the same in CouchDB. Which is the better approach to do this in CouchDB?
Options
Create a new database in CouchDB for every table in my RDBMS schema
Create only one database in CouchDB and store all RDBMS tables in this one database, with an explicit field called doc_type/table_type to indicate which RDBMS table/row type each document represents.
What are the pros and cons of these approaches? What is the recommended approach?
It all depends.
In general, be wary of trying to "translate" an RDBMS schema naively to CouchDB; it rarely ends up in a happy place. A relational schema, if designed well, will be normalized and reliant on multi-table joins to retrieve data. In CouchDB, your data model will (probably) not be normalized nearly as much, with the document unit instead representing either a row from a table or a row returned from a join in the relational DB.
In CouchDB there are no joins, and no atomic transactions above the document unit. When designing a data model for CouchDB, consider how data is accessed and changed. Things that need to be accessed or changed atomically belong in the same document.
As to many databases vs a single database of documents carrying a "type" field: the single-database option allows you to easily perform map-reduce queries across your whole data set. This isn't possible if you use multiple databases, as a map-reduce view is strictly per-database. The number of databases should be dictated by the access pattern: if you have data that is only ever accessed by a subset of your application's queries, and never needs slicing and dicing together with other data, it can be housed in a separate database.
I'd also recommend checking out the new partitioned database facility in CouchDB and Cloudant.

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern, and since Cassandra is a schema db I don't suppose that's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so, is there something one should be aware of?
I've had the exact same requirements for a "project" (rather a tool) a year ago, and I used Cassandra and I didn't regret. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster and the performance is impressive (although you might need tweaking) and the natural ordering is a nice thing to have.
Rather than expressing the benefits of using it, I'll rather concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key; in your case that will be the timestamp. However, you cannot order data between different rows. They might come back ordered after the query, but it is not guaranteed in any way, so don't rely on it. There was some way to write such a query before 2.1, I believe (using ORDER BY, disabling paging and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data between rows on the querying side.
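For example (a sketch only, the table and column names are assumptions), the ordering comes from the clustering column and only holds inside one partition:

    CREATE TABLE events_by_source (
        source_id text,        -- partition key: one wide "row" per source
        event_time timestamp,  -- clustering column: defines the order inside the partition
        payload blob,
        PRIMARY KEY ((source_id), event_time)
    );

    -- results come back ordered by event_time for 'sensor-1' only;
    -- across different source_id values there is no global order
    SELECT * FROM events_by_source WHERE source_id = 'sensor-1';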
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time and you put them in different rows. You have to get those rows with the different variable types and then do your re-sorting on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems like a lot, it really is not unreachable with time-series data. And you don't want to get anywhere near those two billion; keep it in the hundreds of millions at most. If you introduce some parameter on which to split the rows (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well.
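One common way to do that split, sketched under the same assumptions as above, is to add a time bucket to the partition key, at the cost of the application having to iterate over the buckets:

    CREATE TABLE events_by_day (
        source_id text,
        day text,              -- e.g. '2015-06-01': caps how large a single partition can grow
        event_time timestamp,
        payload blob,
        PRIMARY KEY ((source_id, day), event_time)
    );

    -- the query logic has to know which day buckets to read and stitch them together
    SELECT * FROM events_by_day WHERE source_id = 'sensor-1' AND day = '2015-06-01';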
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries; there are specific rules in CQL about filtering and the WHERE clause.
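A rough illustration of those rules, using the events_by_day sketch above:

    -- allowed: partition key columns restricted by =, a range on the clustering column
    SELECT * FROM events_by_day
     WHERE source_id = 'sensor-1' AND day = '2015-06-01'
       AND event_time >= '2015-06-01 08:00:00'
       AND event_time <  '2015-06-01 09:00:00';

    -- rejected (unless you add ALLOW FILTERING): no partition key restriction at all
    -- SELECT * FROM events_by_day WHERE event_time > '2015-06-01 08:00:00';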
Don't expect too much from the collections within the columns; their length is limited to ~65,000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) ).
All in all, these things might seem important, but they are really not too much of a hassle once you get to know Cassandra a bit. I'm underlining them just to give you a heads-up. If something does not seem logical at first, go back to understanding why it is like that, and to the whole theory about data distribution and the ring topology.
Based on the requirements you expressed, Cassandra could be a good fit, as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on DataStax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a blob and then parse it in your application (i.e. store it as JSON and read it into your model), or you could even store the data in a map, although collections in Cassandra have some caveats that are good to be aware of. Here you can find the docs about collections in Cassandra 2.0/2.1.
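A minimal sketch of that idea (the names are assumptions, not from the question): keep the varying part of the event opaque to Cassandra and parse it in the application:

    CREATE TABLE events (
        source_id text,
        event_time timestamp,
        event_type text,
        payload text,          -- e.g. a JSON string, parsed by the application
        PRIMARY KEY ((source_id), event_time)
    );

    INSERT INTO events (source_id, event_time, event_type, payload)
    VALUES ('sensor-1', '2015-06-01 08:00:00', 'temperature', '{"value": 21.4, "unit": "C"}');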
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

Cassandra: Minimizing metadata overhead with UDT

I have a 40 column RDBMS table which I am porting to Cassandra.
Using the estimator at http://docs.datastax.com/en/cassandra/2.1/cassandra/planning/architecturePlanningUserData_t.html
I created an Excel sheet with column names, data types, the size of each column, etc.
The Cassandra specific overhead for each RDBMS row is a whopping 1KB when the actual data is only 192 bytes.
Since the overheads are proportional to the number of columns, I thought it would be much better if I just created a UDT for the fields that are not part of the primary key. That way, I would incur the column overhead only once.
Also, I don't intend to run queries on the inner fields of the UDT. Even if I did want that, Cassandra has very limited querying features that work on non-PK fields.
Is this a good strategy to adopt? Are there any pitfalls? Are all these overheads easily eliminated by compression or some other internal operation?
On the surface, this isn't a bad idea at all. You are essentially abstracting your data by another level, but in a way that is still manageable and meets your needs. It's actually good thinking.
I have a 40 column RDBMS table
This part slightly worries me. Essentially, you'd be creating a UDT with 40 properties. Not a huge deal in and of itself. Cassandra should handle that just fine.
But while you may not be querying on the inner fields of the UDT, you need to ask yourself how often you plan to update them. Cassandra stores UDTs as "frozen" types in a single column. This is important to understand for two reasons:
You cannot read a single property of a UDT without reading all properties of the UDT.
Likewise, you cannot update a single property in a UDT without rewriting all of them, either.
So you should keep that in mind while designing your application. As long as you won't be writing frequent updates to individual properties of the UDT, this should be a good solution for you.
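A minimal sketch of the idea with made-up names (in CQL the UDT has to be marked frozen when used this way):

    CREATE TYPE order_details (
        customer_name text,
        shipping_address text,
        item_count int
        -- ... and the rest of the ~40 non-key fields
    );

    CREATE TABLE orders (
        order_id uuid PRIMARY KEY,
        details frozen<order_details>   -- stored, read and rewritten as a single value
    );

    -- updating one property means rewriting the whole UDT value
    UPDATE orders
       SET details = {customer_name: 'Ada', shipping_address: '1 Main St', item_count: 3}
     WHERE order_id = 123e4567-e89b-12d3-a456-426655440000;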

Cassandra store and query dynamic (user defined) data

We've been looking into using Cassandra to store some of the larger data in a multi-tenant system we are building. The decision to use Cassandra is mostly to do with scaling capabilities and performance when working with large data sets, but I am not sure whether what we're looking for is possible in Cassandra, so I'm hoping someone has some clues as to whether (and how) this could be done:
We are looking for a way to let our users first define their own entity types and then define fields in those entities (and field types). Once they've defined this, their data (matching the definitions they just created) could be imported, stored and, most importantly, queried by pretty much any field they defined.
So for instance, we may have one user who defines an Airplane, which has the manufacturer name, model, tail number, year of production, etc...
Their data will then contain those fields and be searchable and sortable by them, etc.
Another user may decide to define a Boat, which can then have different fields, which should be also sortable and searchable by content.
Because of the potential number of entries, the typical relational approach is unlikely to yield adequate performance, so we're looking at a NoSQL approach.
Is this something that could be done in C*? Or are there any other suggestions in terms of a storage engine that would offer best flexibility?
I can see two important points in your requirements:
Dynamic typing/schemaless data: Cassandra defines how data is structured, like a relational database does. You can, however, use columns of complex type: map...
Query by any field: Cassandra requires each query to provide the partition key. The Cassandra data model is driven by the queries: if you don't know your queries in advance, you won't be able to design the appropriate model, and you won't be able to query it (see the sketch below).
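To make the first point concrete, the closest thing in plain Cassandra would be something like a generic entity table with a map column (a sketch, all names made up). Note that it does nothing for the second point: rows can still only be looked up by their partition key:

    CREATE TABLE entities (
        tenant_id uuid,
        entity_type text,                -- 'airplane', 'boat', ... as defined by the user
        entity_id uuid,
        attributes map<text, text>,      -- user-defined field name -> value, stored as text
        PRIMARY KEY ((tenant_id, entity_type), entity_id)
    );

    -- fine: fetch one entity, or all entities of one type for a tenant
    SELECT * FROM entities WHERE tenant_id = ? AND entity_type = 'airplane';

    -- not fine: "all airplanes whose manufacturer is 'Boeing'" would need ALLOW FILTERING
    -- or an external index (Solr, Elasticsearch, Spark), which is where the advice below comes in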
I advise you to have a look at Elasticsearch.
Then, if you have to use Cassandra for some other reason, I advise you to look at the DataStax Enterprise edition of Cassandra, which integrates with Solr and Spark: both will give you extra querying capabilities.

Why are document DBs (like MongoDB and CouchDB) better for large amounts of data?

I am a complete newbie to this world of document DBs.
So... why are these DBs better than an RDBMS (like MySQL or PostgreSQL) for very large amounts of data?
A document database implements good indexing to handle these types of files; that is what it is designed for. This kind of problem is better suited to a document database because that is its purpose. A normal database is not designed for saving "documents"; with that option you have to work hard to search through your document data, because each document can be in a different format, and that is a lot of work. If you choose a document DB solution, you get all of this built in, because this kind of database is made only for "documents" and already implements the functions needed for them.
You want to distribute your data over multiple machines when you have a lot of data. That means that joins become really slow because joining between data on different machines means a lot of data communication between those machines.
You can store data in a mongodb/couchdb document in a hierarchical way so there is less need for joins.
But it depends on your use case(s). I think relational databases do a better job when it comes to reporting.
MongoDB and CouchDB don't support transactions. Do you or your customers need transactions?
What do you want to do? Analyzing a lot of data (business intelligence/reporting) or a lot of small modifications per second "HVSP (High Volume Simple Processing)"?
