How can I rely on Cassandra by itself, since it doesn't meet my needs?

After reading about Cassandra, we decided to choose it as our main database. The most important feature that led us to choose it was its scalability, which helps us handle large volumes of data.
But it seems that Cassandra cannot meet our requirements by itself. I asked some questions on Stack Overflow about our requirements and how to satisfy them with Cassandra, and the answers all suggested additional tools on top of Cassandra, such as Spark, Solr, and the DSE Search tools.
Our case really is big data, but we will also have a large variety of queries.
Given all this, is it wise to stay with Cassandra, or would it be better to switch to another database?

Cassandra is not adequate for ad-hoc queries, so I would recommend running Hive on Cassandra, mapping your Cassandra tables to Hive tables using the cassandra_handler_for_hive connector, and then using Hive to do joins and to filter on non-partition-key columns.
I should mention that query performance with Hive on Cassandra is not great (I have had a case where count(*) on a table with 500M records took an hour on 4 nodes). As a workaround, I used to copy the tables to HDFS and then run the computation on the HDFS data, but this is not a good solution if you need fresh data.
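For what it's worth, here is a minimal sketch of that copy-to-HDFS workaround using the spark-cassandra-connector; the keyspace, table, and paths are hypothetical:

import org.apache.spark.sql.SparkSession

object SnapshotToHdfs extends App {
  val spark = SparkSession.builder()
    .appName("cassandra-to-hdfs")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()

  // Read the whole Cassandra table through the connector...
  val events = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_ks", "table" -> "events"))
    .load()

  // ...and snapshot it to HDFS so heavy analytics hit Parquet, not Cassandra.
  // The snapshot goes stale over time, which is the "fresh data" caveat above.
  events.write.mode("overwrite").parquet("hdfs:///snapshots/events")
}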
Now for your question of whether to use Cassandra or not: it really depends on your needs. Cassandra has good per-record read/write performance.
If Hive on Cassandra covers the queries you need, why not stay with Cassandra?

Related

Is there a way to efficiently get the top n smallest data points over the clustering key in Cassandra?

I understand that in Cassandra, data is sorted by the clustering key only within each partition.
I am wondering whether Cassandra has optimizations for global scans. Say the clustering key is an integer and I want to search all data in a Cassandra cluster for rows with values < 3. Within a partition, the query engine would not need to keep looking once it encounters a value >= 3. Are there APIs (such as CDK) offered by Cassandra that exercise these optimizations?
There isn't a native CQL optimisation available for full table scans -- they will always be bad since Cassandra is optimised for OLTP workloads.
There are however optimisations done by the spark-cassandra-connector for analytics (OLAP) workloads with Spark.
OLTP and OLAP are worlds apart, so you have to use the right tool for the job. Cheers!
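For illustration, here is a minimal sketch of an OLAP-style scan through the spark-cassandra-connector; the keyspace, table, and column names are hypothetical:

import org.apache.spark.sql.SparkSession

object GlobalScan extends App {
  val spark = SparkSession.builder()
    .appName("full-table-scan")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
  import spark.implicits._

  val rows = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_ks", "table" -> "readings"))
    .load()

  // The connector splits the token range across executors; the filter is
  // applied per partition, so the whole cluster is still scanned once.
  rows.filter($"value" < 3).show()
}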
Querying by partition key is the best way to query in Cassandra. If you want to query by a clustering key alone, you can use the ALLOW FILTERING option, but it is recommended not to use ALLOW FILTERING in production.
For scanning a complete table and filtering the data, you can use Spark. Why burden C* with work it was not designed for, when you can get help from its friends (Spark, in this case)?
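To make the contrast concrete, here is a minimal sketch using the DataStax Java driver; the keyspace, table, and columns are hypothetical:

import com.datastax.driver.core.Cluster

object CqlFiltering extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("my_ks")

  // Fast: the partition key pins the query to one partition, where the
  // clustering column 'value' is already sorted.
  session.execute(
    "SELECT * FROM readings WHERE sensor_id = 42 AND value < 3")

  // Works, but scans every partition in the cluster; avoid in production.
  session.execute(
    "SELECT * FROM readings WHERE value < 3 ALLOW FILTERING")

  cluster.close()
}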

Parquet vs Cassandra using Spark and DataFrames

I have reached a dilemma and cannot decide which solution will work better for me. I have one very large table (a few hundred GB) and a couple of smaller ones (a few GB each). To build my data pipeline in Spark and use Spark ML, I need to join these tables and perform a couple of GroupBy (aggregate) operations. Those operations were really slow for me, so I plan to do one of these two things:
Use Cassandra with indexing to speed up the GroupBy operations.
Use Parquet with partitioning based on the layout of the data.
So far, Parquet partitioning has been faster and more scalable, with less memory overhead than Cassandra. So the question is this:
If the developer understands the data layout and the way it is going to be used, wouldn't it be better to just use Parquet, since you have more control over it? Why should I pay the price of Cassandra's overhead?
Cassandra is also a good solution for analytics use cases, but in a different way. Before you model your keyspaces, you have to know how you will need to read the data. You can use WHERE clauses and range queries, but only in a tightly restricted way. Sometimes you will hate these restrictions, but there are reasons for them. Cassandra is not like MySQL. In MySQL, performance is not the key feature; it is more about flexibility and consistency. Cassandra is a high-performance read/write database, better at writing than reading, and it scales linearly.
Okay, a bit about your use case: Parquet is the better option for you. Here is why:
You aggregate raw data over really large, unsplit datasets.
Your Spark ML job sounds like a scheduled job rather than a long-running one (once a week or once a day?).
This fits Parquet's use cases better. Parquet is a solution for ad-hoc analysis and filter-style analysis. Parquet is really nice if you only need to run a query once or twice a month. Parquet is also a nice solution if a marketing person wants to know one thing and the response time is not so important. Simply and short:
Use Cassandra if you know the queries in advance (see the sketch after this list).
Use Cassandra if a query will be used in daily business.
Use Cassandra if real time matters (I mean a maximum of about 30 seconds of latency between a customer taking an action and the result showing up in my dashboard).
Use Parquet if real time doesn't matter.
Use Parquet if the query won't run 100 times a day.
Use Parquet if you want to do batch processing.
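To make "use Cassandra if you know the queries" concrete, here is a minimal sketch of query-first modeling with the DataStax Java driver; the keyspace, table, and columns are hypothetical:

import com.datastax.driver.core.Cluster

object QueryFirstModel extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  // One table per known query: "latest orders of one customer, newest first".
  // The partition key pins the query to one partition; the clustering order
  // means rows come back already sorted. Assumes keyspace 'shop' exists.
  session.execute(
    """CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
      |  customer_id uuid,
      |  order_time  timestamp,
      |  total       decimal,
      |  PRIMARY KEY (customer_id, order_time)
      |) WITH CLUSTERING ORDER BY (order_time DESC)""".stripMargin)

  // The known dashboard query: one partition, pre-sorted, cheap.
  session.execute(
    "SELECT order_time, total FROM shop.orders_by_customer " +
      "WHERE customer_id = ? LIMIT 10",
    java.util.UUID.randomUUID())

  cluster.close()
}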
It depends on your use case. Cassandra makes it much easier (also outside of Spark) to access your data with (limited) pseudo-SQL. That makes it a perfect fit for building online applications on top of it (e.g. to display the data in a UI).
Cassandra also makes it easier to deal with updates: not only new data being ingested into your data pipeline (e.g. logs), but also corrections to existing data that the system has to handle.
When your use case is analytics with Spark (and you don't care about the topics mentioned above), it should be feasible and considerably cheaper to use Parquet/HDFS, as you've stated. With HDFS you also get data locality for Spark, and your analytic Spark applications may be even faster when reading large blocks of data.
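Here is a minimal sketch of that Parquet/HDFS route, with hypothetical paths and columns; partitioning on a commonly filtered column lets Spark read only the matching directories:

import org.apache.spark.sql.SparkSession

object ParquetPipeline extends App {
  val spark = SparkSession.builder().appName("parquet-pipeline").getOrCreate()
  import spark.implicits._

  val big   = spark.read.parquet("hdfs:///raw/big_table")
  val small = spark.read.parquet("hdfs:///raw/small_table")

  // Lay the big table out once, partitioned by a column queries filter on.
  big.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///curated/big_by_date")

  // Later reads that filter on event_date touch only those directories,
  // so the join and GroupBy work on far less data.
  spark.read.parquet("hdfs:///curated/big_by_date")
    .filter($"event_date" === "2016-01-01")
    .join(small, "customer_id")
    .groupBy("customer_id")
    .count()
    .show()
}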

What additional benefits does Spark give over CQL?

We are exploring Spark for Cassandra in order to overcome the limitations of CQL.
We initially restricted ourselves to CQL but hit a few roadblocks compared to an RDBMS. To name a few:
To compare a column with > (greater than) or < (less than), the column has to be part of the clustering key. Even then, I still have to provide the partition key in order to use < or > on a clustering column.
We can't check any column value for NULL.
To query on any column other than the partition key, we have to create an index on that column.
We can't ORDER BY a column that isn't a clustering key.
GROUP BY is limited.
We can't join tables.
I am a newbie with Cassandra and end up revisiting my schema often because of these limitations.
Hence, similar to Hive/Pig for HDFS, what additional benefits does Spark give over CQL?
CQL is not a replacement for SQL. It is really designed for pulling values out of a few partition keys, usually one, and as you pointed out, it does not do aggregation or grouping and supports only very limited sorting (though Cassandra 3.0 will have UDFs and UDAs).
Here is what Spark offers over CQL:
General aggregation and querying via DataFrames and SQL, including JOINs, GROUP BY, ORDER BY, and UDFs
Significantly faster queries -- orders of magnitude faster -- if you cache the Cassandra data in memory using sqlContext.cacheTable (see the sketch after this list)
Integrated machine learning, statistics, graph processing, and virtually any kind of distributed computation you can imagine, using Scala, Java, Python, and R APIs
Ability to ETL in and out of Cassandra tables from and to many other data sources - including various HDFS formats, Amazon S3, DBMSes, Mongo, and most other databases today
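As a sketch of the first two points, under hypothetical keyspace/table names (spark.catalog.cacheTable is the Spark 2.x spelling of sqlContext.cacheTable):

import org.apache.spark.sql.SparkSession

object SparkOverCql extends App {
  val spark = SparkSession.builder()
    .appName("spark-over-cql")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()

  def cassandraTable(ks: String, tbl: String) =
    spark.read.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> ks, "table" -> tbl)).load()

  cassandraTable("shop", "orders").createOrReplaceTempView("orders")
  cassandraTable("shop", "customers").createOrReplaceTempView("customers")

  // Keep the table in memory so repeated queries skip Cassandra entirely.
  spark.catalog.cacheTable("orders")

  // JOIN + GROUP BY + ORDER BY: none of this is possible in plain CQL.
  spark.sql("""
    SELECT c.name, count(*) AS n, sum(o.total) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY revenue DESC
  """).show()
}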
Spark is really a completely different beast from CQL. It offers complex analytics over vast quantities of data; CQL doesn't. However, there are some limitations as well:
Spark is not good at highly concurrent queries. For that, you want to keep queries simple and use CQL to pull out a very small amount of data.
Data cached in Spark is not HA and does not update as you write new data into C*
If you want very fast analytical queries over Cassandra with support for updates and no need to cache, then check out my project http://github.com/tuplejump/FiloDB.

Indexed Apache Ignite cache vs. optimized, in-memory Cassandra

For a complex real-time Apache Storm topology, I need aggregates of my data (stored in Cassandra) for some computation steps. So far the data is queried when needed with CQL (the Cassandra Query Language) and aggregated in a Storm bolt. That is a bit slow, so we want to cache the data needed for the aggregation. Two options are on the table:
Put the needed data in an indexed Ignite cache and run sliding-window queries on it from Storm. In this case we would need only one cache, using different queries depending on the aggregation.
Put the data in Cassandra's in-memory, off-heap cache.
Argument for Ignite: we would need only one indexed cache, whereas for fast access we would need one Cassandra table per aggregation. (Also ACID, but since we obviously already live with CAP, that is not a strong argument for our architects.)
Argument for Cassandra: we don't need to introduce a new technology.
But what about speed? How fast would an indexed Ignite cache be compared to an optimized (one table per query) in-memory Cassandra?
I believe that in-memory indexed SQL in Ignite would be faster than Cassandra CQL queries. Apache Ignite is ANSI SQL-99 compatible, so you should be able to do all sorts of aggregations, joins, ORDER BY, GROUP BY, etc.
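As an illustration only, here is a minimal Ignite sketch under hypothetical names (it needs the ignite-core and ignite-indexing modules on the classpath); indexing the primitive value type exposes the cache to SQL as a table named Double with _key and _val columns:

import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.configuration.CacheConfiguration

object IgniteAggregates extends App {
  val ignite = Ignition.start()

  // Hypothetical cache of sensor readings keyed by event id.
  val cfg = new CacheConfiguration[java.lang.Long, java.lang.Double]("readings")
  cfg.setIndexedTypes(classOf[java.lang.Long], classOf[java.lang.Double])
  val cache = ignite.getOrCreateCache(cfg)

  (1L to 1000L).foreach(i => cache.put(Long.box(i), Double.box(i * 0.1)))

  // ANSI-style SQL aggregation straight against the indexed cache.
  val avg = cache.query(
    new SqlFieldsQuery("SELECT avg(_val) FROM Double WHERE _val < ?")
      .setArgs(Double.box(50.0)))
  println(avg.getAll)

  Ignition.stop(true)
}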
I will raise this within the Ignite community to see whether Cassandra CQL can be benchmarked against Ignite SQL. When that is done, I will post the results here.

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts on and experiences with the use of CQL and the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. Also, DataStax has the DSE version of Cassandra, which allows you to deploy Hadoop/Hive. The question is: in which use cases would we pick one solution over the others?
I will share a few thoughts based on my experience. But if possible, please tell us about your use case; it will help us answer your questions better.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. That said, if you are coming from a SQL background and planning to use Cassandra, you will definitely find CQL very helpful. But if you need operations like JOIN and GROUP BY, CQL is not the answer, even though it covers primitive GROUP BY use cases through write-time and compaction-time sorts and can model one-to-many relationships.
2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and pipeline planning. In-memory processing makes it ~100x faster than Hive. Like Hive, Spark SQL handles datasets larger than memory very well, and it is up to 10x faster thanks to planned pipelines. The balance shifts further in Spark SQL's favor when multiple pipeline stages such as filter and groupBy are present. Go for it when you need ad-hoc, real-time querying. It is less suitable for long-running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides a SQL-like interface to your data. Hive is not suitable for real-time needs; it is best suited for offline batch processing. It needs no additional infrastructure, since it uses the underlying HDFS for storage. Go for it when you have to perform operations like JOIN and GROUP BY on large datasets, and for OLAP.
Note: Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features, potentially faster. It supports the existing Hive query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
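For instance, here is a minimal sketch (the database and table names are hypothetical) of running a HiveQL aggregation through Spark while reusing the existing Hive metastore:

import org.apache.spark.sql.SparkSession

object HiveOnSpark extends App {
  // enableHiveSupport wires Spark SQL to the Hive metastore,
  // so existing Hive tables, SerDes and UDFs keep working.
  val spark = SparkSession.builder()
    .appName("hive-on-spark")
    .enableHiveSupport()
    .getOrCreate()

  spark.sql("""
    SELECT category, count(*) AS n
    FROM warehouse.events
    GROUP BY category
    ORDER BY n DESC
  """).show()
}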
But I think you will only be able to properly evaluate the pros and cons of all these tools after getting your hands dirty. I can only make suggestions based on your questions.
Hope this answers some of your queries.
P.S.: The above answer is based solely on my experience. Comments/corrections are welcome.
There is a very good benchmarking effort documented here: https://amplab.cs.berkeley.edu/benchmark/
