Columnar storage: Cassandra vs Redshift

How is columnar storage in the context of a NoSQL database like Cassandra different from columnar storage in Redshift? And if Cassandra is also a columnar store, why isn't it used for OLAP applications the way Redshift is?

The storage engines of Cassandra and Redshift are very different and were created for different use cases.
Cassandra's storage is not really "columnar" in the widely known sense of the term used for databases like Redshift, Vertica, etc.; it is much closer to the key-value family of the NoSQL world. The SQL dialect used in Cassandra is not ANSI SQL, and only a very limited set of queries can be run against it. Cassandra's engine is built for fast writing and reading of records by key, while Redshift's engine is built for fast aggregations (MPP), has wide support for analytical queries, and stores, encodes and compresses data at the column level.
It can be easily understood with the following example:
Suppose we have a table with user id and many metrics (for example weight, height, blood pressure etc...).
If we run an aggregate query in Redshift, such as the average weight, it will do the following (in the best case):
The master sends the query to the nodes.
Only the data for that specific column is fetched from storage.
The query is executed in parallel on all nodes.
The final result is returned to the master.
Running the same query in Cassandra results in a scan of all "rows", where each "row" can have several versions and only the latest should be used in the aggregation. If you are familiar with any key-value store (Redis, Riak, DynamoDB, etc.), it is even less efficient than scanning all keys there.
Cassandra is often used for analytical workloads together with Spark, acting as a storage layer while Spark acts as the actual query engine, and it basically shouldn't be used for analytical queries on its own. With each released version more and more aggregation capabilities are added, but it is very far from being a real analytical database.
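To make that Spark-on-top-of-Cassandra pattern concrete, here is a minimal PySpark sketch; the health.user_metrics keyspace/table is a hypothetical stand-in for the example above, and the spark-cassandra-connector is assumed to be on the classpath. Note that Spark, not Cassandra, performs the scan and the aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the spark-cassandra-connector package is on the Spark classpath
spark = (SparkSession.builder
         .appName("cassandra-aggregation-sketch")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Hypothetical keyspace/table holding the user metrics from the example above
metrics = (spark.read
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="health", table="user_metrics")
           .load())

# Spark, not Cassandra, performs the full scan and the aggregation
metrics.agg(F.avg("weight").alias("avg_weight")).show()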

I encountered the same question today and found this resource on AWS: https://aws.amazon.com/nosql/columnar/

Related

Presto's support for approx_distinct

I am evaluating distributed query engines for analytical queries (both interactive as well as batch) on large scale data (~100GB). One of the requirements is low latency (<= 1s) for count-distinct queries, where approximate results (with up to 5% error) are acceptable.
Presto seems to support this with its approx_distinct(). As far as my understanding goes, it uses HyperLogLog for that. However, unless the data is persisted in rolled-up form, along with the HyperLogLog values, it would have to be computed on the fly. I do not think my queries would finish within a second for large datasets.
Does it support rollup with HyperLogLog computation at ingestion time (similar to Druid)? Given that unlike Druid, Presto queries the data from external stores (Hive/Cassandra/RDBMS etc.), I am not sure that ingestion time rollups are supported, unless Presto's native store supports them. Can someone please confirm?
There is no such thing as "Presto's native store". Presto is a query execution engine with a connector architecture that allows plugging in multiple storage layers.
If you want an approximate count-distinct for a whole data set, you can compute table stats (When using Presto with Hive, this currently needs to be done in Hive).
If you want an approximate count-distinct for a dynamic selection of data, you still need to read that data, and then you won't get sub-second latency with such a big data set. However, you can combine approx_distinct (or plain count(distinct ..)) with TABLESAMPLE to limit the amount of data read.
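As a rough illustration of that combination, here is a hedged sketch using the presto-python-client; the coordinator host, catalog, and the events table with a user_id column are assumptions for illustration only.
import prestodb  # pip install presto-python-client

# Hypothetical coordinator, catalog and table names
conn = prestodb.dbapi.connect(host="presto-coordinator", port=8080,
                              user="analyst", catalog="hive", schema="default")
cur = conn.cursor()

# TABLESAMPLE limits how much data is scanned; approx_distinct trades a small,
# bounded error for a much cheaper aggregation.
cur.execute("""
    SELECT approx_distinct(user_id)
    FROM events TABLESAMPLE BERNOULLI (1)
""")
print(cur.fetchone()[0])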
You can try Verdict, which can significantly reduce query processing cost by applying statistics and approximate query processing, yielding 99.9% accuracy. It runs on all SQL-based engines, including Apache Hive, Apache Impala, Apache Spark, Amazon Redshift, etc.
You can download the source code from here. After downloading it and doing some simple setup, you can issue queries as you normally would and get results in a much shorter time.

How to use join in cassandra database tables like mysql

I am new to Node.js. I use the Express framework and Cassandra as a database.
My question: is it possible to use a join across multiple related tables, like in MySQL?
For example, in MySQL:
select * from `table1`
left join `table2` on table1.id=table2.id
Short answer: No.
There are no joins in Cassandra (or in other NoSQL databases), which is a difference from relational databases.
In Cassandra the standard way is to denormalize the data and maybe store multiple copies if necessary. As a rule of thumb, think query first, and store your data in the way you will need to query it later; a sketch of what that can look like follows below.
Cassandra will perform very, very well if your data is evenly spread across your cluster and the everyday queries hit only one or a few primary keys (to be exact, partition keys).
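For example, a minimal sketch with the Python driver of a denormalized table that answers the MySQL join above with a single partition-key lookup; the keyspace, table and column names are made up for illustration.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# Instead of joining table1 and table2 at read time, store the already-joined
# shape up front, keyed by the value you will query with.
session.execute("""
    CREATE TABLE IF NOT EXISTS rows_by_id (
        id       int,
        t1_value text,
        t2_value text,
        PRIMARY KEY (id)
    )
""")

# The "join" then becomes a single-partition read
rows = session.execute("SELECT * FROM rows_by_id WHERE id = %s", [42])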
Have a look at: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
And there are trainings from DataStax: https://academy.datastax.com/resources/ds220-data-modeling (and others too).
No. Joins are not possible in Cassandra (or in any other NoSQL database).
You should design your Cassandra tables based on your query requirements.
And while using a NoSQL system it's recommended to denormalize your data.
Basic Rules of Cassandra Data Modeling
Basic googling returns this blog post from the company behind Cassandra: https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise
In short, it says that joins are possible via Spark or via ODBC connections.
It comes down to the performance characteristics of such joins, especially compared to making a "join" by hand, i.e. running a lookup query on one table for every (relevant) row of the other. Any ideas?
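For reference, a join "via Spark" as described in that post could look roughly like the following PySpark sketch; the keyspace and table names are placeholders and the spark-cassandra-connector is assumed to be available.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

def cassandra_table(keyspace, table):
    # Load a Cassandra table as a Spark DataFrame
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load())

# Placeholder tables standing in for table1 and table2 from the question;
# the join itself is executed by Spark, not by Cassandra.
t1 = cassandra_table("my_keyspace", "table1")
t2 = cassandra_table("my_keyspace", "table2")
joined = t1.join(t2, on="id", how="left")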

PySpark Cassandra Connector efficiently querying across partition keys

I'm faced with the following problem using PySpark and DataFrames with the cassandra-connector. My Cassandra data lake consists of metric measurements across (network) devices, and the entries are of the form (device, interface, metric, time, value).
My cassandra table for the raw data has:
PRIMARY KEY ((device,interface,metric),time)
for supposedly efficient fetching of time ranges for a given measurement.
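(For context, the full table definition might look roughly like the sketch below, issued via the Python driver; only the PRIMARY KEY layout comes from the question, while the keyspace name and column types are illustrative assumptions.)
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("metrics")  # hypothetical keyspace

# Only the PRIMARY KEY layout comes from the question; the types are guesses.
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics_raw (
        device    text,
        interface text,
        metric    text,
        time      timestamp,
        value     double,
        PRIMARY KEY ((device, interface, metric), time)
    )
""")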
Now for reporting purposes, users can query any set of device/interface/metric combinations (i.e. give me a specific metric for all interfaces of a device). I know the list of each, so I'm not looking to do wildcard searches, but rather IN queries.
I'm using Spark 1.4, so I'm adding filters like the following to obtain DataFrames on which to calculate min/max/percentiles/etc. on the recorded metric values.
metrics_raw_sub = metrics_raw\
    .filter(metrics_raw.device.inSet(device_list))\
    .filter(metrics_raw.interface.inSet(interface_list))\
    .filter(metrics_raw.metric.inSet(metric_list))
This isn't very efficient, as these predicates do not get pushed down to CQL (only the last predicate can be an IN query), so I'm pulling in tons of data and filtering on the client side (not good).
Why doesn't the cassandra-connector allow multiple IN predicates across partition columns? Doing this in a native CQL shell appears to work.
Another approach to my problem would be the following (and this yields efficient individual queries, as the predicates are pushed down to Cassandra):
for device in device_list:
    for interface in interface_list:
        metrics_raw_sub = metrics_raw\
            .filter(metrics_raw.device == device)\
            .filter(metrics_raw.interface == interface)\
            .filter(metrics_raw.metric.inSet(metric_list))
And then run the aggregation logic for each subquery, but I feel like this largely serialises what should be a parallel computation across all requested device/interface/metric values... Can I batch the Cassandra queries so I can run my analytics on one large distributed DataFrame?
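Something like the following union-based sketch is what I have in mind, just to illustrate the idea (metrics_raw and the lists are as above; unionAll is the Spark 1.4 name for union, and I have not verified the performance of this):
from functools import reduce

# One pushed-down DataFrame per (device, interface) pair, unioned back into a
# single distributed DataFrame before aggregating.
frames = []
for device in device_list:
    for interface in interface_list:
        frames.append(
            metrics_raw
            .filter(metrics_raw.device == device)
            .filter(metrics_raw.interface == interface)
            .filter(metrics_raw.metric.inSet(metric_list)))

metrics_all = reduce(lambda a, b: a.unionAll(b), frames)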
Bottom line, I'm looking to do this very efficiently. If the turn-around times are short enough, we'll run these on demand. If not, we'll need to look into pre-computing the results and storing them in tables (which sacrifices the flexibility of custom time-range reporting).
Any insights would be much appreciated!!
Nik.

How to transfer mysql data to cassandra database?

I am running a MySQL server in production which currently holds 200 GB of data. It is becoming very difficult to manage the MySQL server because it is growing exponentially. I have heard a lot about Cassandra and I did a POC on it. Cassandra provides high availability and eventually consistent data. Cassandra is perfect for our requirements. Now the problem is how to transfer all the MySQL data to a Cassandra database.
Since MySQL is a relational database and Cassandra is NoSQL, how do I map MySQL tables and their related tables to Cassandra tables?
I believe you are asking the wrong question. There is no fixed rule for transitioning from a relational model to Cassandra.
The first questions are the following: What are your requirements in terms of performance, availability, data volume & growth, and, most important of all, query capabilities? Do you need ACID? Can you change the application code accessing the database to fit a more denormalized Cassandra model?
The answer to these questions will tell you whether Cassandra is compatible with your use case or not.
As a rule of thumb:
If you use MySQL with a lot of indices and usually perform joins in your queries, then the Cassandra data model and the application code using the database will require a lot of work, or maybe Cassandra will not even be the right choice. Likewise, if you really need ACID you may have a problem with Cassandra's consistency model.
If your SQL data model is fully denormalized and you perform queries without joins, then you can just replicate your DB table schemas as Cassandra column families and you're done, even if this may not be optimal.
Your use case is probably somewhere in between, and you really need to understand how you can model your data in Cassandra. You have to gain this understanding and perform this analysis by yourself, because you know your domain and we don't. However, don't hesitate to give clues about your model and how you need to query your data so you can be advised.
200 GB is low for Cassandra, and you may discover that your data takes much less space in Cassandra than in MySQL, even when widely denormalized, because Cassandra is pretty efficient.
You can migrate data from MySQL to Cassandra using Spark.
Spark has connectivity with MySQL as well as Cassandra. First create the model in Cassandra according to your requirements; after that, pull all the data from MySQL, apply your transformations, and push the data directly into Cassandra.
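A minimal sketch of that path (the JDBC URL, credentials, keyspace and table names below are placeholders, and the target Cassandra table is assumed to exist already with matching columns):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Pull a table out of MySQL over JDBC (the MySQL JDBC driver must be on the classpath)
mysql_df = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://mysql-host:3306/mydb")
            .option("dbtable", "users")
            .option("user", "app")
            .option("password", "secret")
            .load())

# ... apply whatever denormalizing transformations the Cassandra model needs ...

# Push the result into a Cassandra table created beforehand with matching columns
(mysql_df.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="users")
 .mode("append")
 .save())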
Transferring relational data directly to Cassandra isn't possible. You have to denormalize it. However, be warned that some queries, and some ways of denormalizing for them, are anti-patterns. Go through these free courses first:
http://academy.datastax.com/courses/ds201-cassandra-core-concepts
https://academy.datastax.com/courses/ds220-data-modeling
If you fail at designing a Cassandra data model for your relational data, you won't get the nice features provided by Cassandra. For example, you won't get horizontal scalability (you might have hot spots in your cluster) or high availability (it might happen that for some queries all of the nodes are needed to build the response).

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and that works well with the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra, so I might be able to help with a few pros and cons.
Requirements
Cassandra can easily handle those three requirements. It was designed to have fast reads and writes; in fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency level for each query you make, so you can have consistent data if you want to. Normally you use:
ONE - Only one node has to get or accept the change. This means fast reads/writes but low consistency (another machine may deliver older information while consistency has not yet been achieved).
QUORUM - 51% of your nodes must get or accept the change. This means slower reads and writes, but you get full consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have your data after you insert/update/delete, then, when reading from more than half of your nodes, at least one node will have the most recent information, which is the one that will be delivered.
Both of these options are the ones recommended because they avoid single points of failure. If all machines had to accept the change, then whenever one node was down or busy you wouldn't be able to query.
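For instance, with the Python driver the consistency level can be set per statement; the keyspace, table and values below are placeholders.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# QUORUM on both the write and the read gives the behaviour described above;
# ONE would be faster but weaker.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, [1, "alice"])

read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(read, [1]).one())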
Pros
Cassandra is the solution for performance, linear scalability and avoiding single points of failure (you can have machines down; the others will take over the work). And it does most of its management work automatically: you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about which queries will be made, and you work to normalize the data.
With Cassandra the strategy is different. You model the tables to serve the queries, because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores, and you want one query that returns all products of a certain store (e.g. New York City) and another query that returns all products of a certain department (e.g. Computers), you would have two tables, "ProductsByStore" and "ProductsByDepartment", with the same data but organized differently to serve each query.
Materialized views can help with this by avoiding the need to change multiple tables, but the example shows how things work differently with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
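A sketch of what those two query-specific tables might look like via the Python driver (the keyspace, column names and types are invented for illustration):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("grocery")  # hypothetical keyspace

# Same product data, partitioned two different ways to serve two different queries
session.execute("""
    CREATE TABLE IF NOT EXISTS products_by_store (
        store      text,
        product_id uuid,
        name       text,
        department text,
        PRIMARY KEY (store, product_id)
    )
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS products_by_department (
        department text,
        product_id uuid,
        name       text,
        store      text,
        PRIMARY KEY (department, product_id)
    )
""")

# Each query hits exactly one table via its partition key
ny_products = session.execute(
    "SELECT * FROM products_by_store WHERE store = %s", ["New York City"])
computer_products = session.execute(
    "SELECT * FROM products_by_department WHERE department = %s", ["Computers"])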
