Parquet vs Cassandra using Spark and DataFrames - apache-spark

I have come to this dilemma that I cannot choose what solution is going to be better for me. I have a very large table (couple of 100GBs) and couple of smaller (couple of GBs). In order to create my data pipeline in Spark and use spark ML I need to join these tables and do couple of GroupBy (aggregate) operations. Those operations were really slow for me so I chose to do one of these two:
Use Cassandra and use indexing to speed the GoupBy operations.
Use Parquet and Partitioning based on the layout of the data.
I can say that Parquet partitioning works faster and more scalable with less memory overhead that Cassandra uses. So the question is this:
If developer infers and understands the data layout and the way it is going to be used, wouldn't it better for just use Parquet since you will have more control over it? Why should I pay the price for the overhead that Cassandra causes?

Cassandra is also a good solution for analytics use cases, but in another way. Before you model your keyspaces, you have to know how you need to read the data. You can also use where and range queries, but in a hard restricted way. Sometimes you will hate this restriction, but there are reasons for these restrictions. Cassandra is not like Mysql. In MySQL the performance is not a key feature. It's more about flexibility and consistency. Cassandra is a high performance write/read database. Better in write than in read. Cassandra has also a linear scalability.
Okay, a bit about your use case: Parquet is the better option for you. This is why:
You aggregate raw data on really large and not splitted datasets
Your Spark ML Job sounds like a scheduled, not long-running job. (onces a week, day?)
This fits more in the use cases of Parquet. Parquet is a solution for ad-hoc analysis, filter analysis stuff. Parquet is really nice if you need to run a query 1 or 2 times a month. Parquet is also a nice solution if a marketing guy wants to know one thing and the response time is not so important. Simply and short:
Use Cassandra if you know the queries.
Use Cassandra if a query will be used in a daily business
Use Cassandra if Realtime matters (I talk about a maximum of 30 seconds latency, from, customer makes an action and I can see the result in my dashboard)
Use Parquet if Realtime doesn't matter
Use Parquet if the query will not perform 100x a day.
Use Parquet if you want to do batch processing stuff

It depends on your usecase. Cassandra makes it much easier (also outside of Spark) to access your data with (limited) pseudo-SQL. That makes it a perfect fit for building online-applications on top (e.g. to display the data in an UI) of it.
Also Cassandra makes it easier if you have to deal with updates, that is not only the new data going to be ingested in your data pipeline(e.g. logs) but you also have to take care about updates (e.g. system has to handle corrections of data)
When your usecase is to do analytics with Spark (and you don't care about the topics mentioned above), it should be feasible and considerable cheaper to use Parquet/HDFS - as you've stated. With HDFS you also achieve data locality with Spark and you might have the advantage that your analytic Spark applications are even faster if you are reading large blocks of data.

Related

Impala vs Spark performance for ad hoc queries

I'm interested only in query performance reasons and architectural differences behind them. All answers I've seen before were outdated or hadn't provide me with enough context of WHY Impala is better for ad hoc queries.
From 3 considerations below only the 2nd point explain why Impala is faster on bigger datasets. Could you please contribute to the following statements?
Impala doesn't miss time for query pre-initialization, means impalad daemons are always running & ready. In other hand, Spark Job Server provide persistent context for the same purposes.
Impala is in-memory and can spill data on disk, with performance penalty, when data doesn't have enough RAM. The same is true for Spark. The main difference is that Spark is written on Scala and have JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented on C++, and has high hardware requirements: 128-256+ GBs of RAM recommended. This is very significant, but should benefit Impala only on datasets that requires 32-64+ GBs of RAM.
Impala is integrated with Hadoop infrastructure. AFAIK the main reason to use Impala over another in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. Means Impala usually use the same storage/data/partitioning/bucketing as Spark can use, and do not achieve any extra benefit from data structure comparing to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why Impala recommends 128+ GBs RAM? What is an implementation language of each Impala's component? Docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine.". If impalad is Java, than what parts are written on C++? Is there smth between impalad & columnar data? Are 256 GBs RAM required for impalad or some other component?
II. Impala loose all in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanics to boost JOIN performance compared to Spark?
III. Impala use Multi-Level Service Tree (smth like Dremel Engine see "Execution model" here) vs Spark's Directed Acyclic Graph. What does actually MLST vs DAG mean in terms of ad hoc query performance? Or it's a better fit for multi-user environment?
First off, I don't think comparison of a general purpose distributed computing framework and distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from Hive MetaStore + block locations from NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
Second biggie would probably be shuffle implementation, with Spark writing temp files to disk at stage boundaries against Impala trying to keep everything in-memory. Leading to a radical difference in resilience - while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is work distribution mechanism -- compiled whole stage codegens sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As far as specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today or will be in the near future.

Cassandra vs HDFS Compression ratio

I do an evaluation on HDFS and Cassandra's storage amount using the same input data in a single machine. Both HDFS and Cassandra has only 1 replica.
My input data are binary bytes, in total 31M. It turned out to be HDFS has less data than Cassandra.
HDFS : 16.4 M. (use COMPRESS.BLOCK strategy)(
Cassandra: 50M. (use CQL interface, with default setting(e.g. compression))
How could that be possible, since Cassandra use columnar storage ?
Is there anyone could help me figure it out? Thanks very much.
My Cassandra version is 2.1.9.
You will see better C* disk usage if using 3.+. its a 2.1 thing that requires the column name along with each field, so if you have 10 fields it will be a lot worse. 3.x is a lot better as it doesnt store redundant data.
HDFS and C* are two completely different things for solving different kinds of problems. If your looking just for most efficient use in disk space then hdfs is probably what you want, as it can store bulk binary data much more efficiently. If your looking for faster reads/writes, C* may be a better choice. C* adds to your data to organize and make queries more efficient and to provide guarantees about the data (for consistency). Compression will earn some of that back but in a lot of cases its gonna take up more space than just your raw data would.

best failsafe strategy to store result of spark sql for structured streaming and OLAP queries

I would like to store result of continuous queries running against streaming data in such a manner so that results are persisted into distributed nodes to ensure failover and scalability.
Can Spark SQL experts please shed some light on
- (1) which storage option I should choose so that OLAP queries are faster
- (2) how to ensure data available for query even if one node is down
- (3) internally how does Spark SQL store the resultset ?
Thanks
Kaniska
It depends what kind of latency you can afford.
One way is to persist the result into HDFS/Cassandra using Persist() API. If your data is small then cache() of each RDD should give you a good result.
Store where your spark executors are co-located. For example:
It is also possible to use Memory based storage like tachyon to persist your stream (i.e. each RDD of your stream) and query against it.
If latency is not an issue then persist(MEMORY_OR_DISK_2) should give you what you need. Mind you performance is a hit or miss in that scenario. Also this stores the data in two executors.
In other cases if your clients are more comfortable in OLTP like database where they just need to query the constantly updating result you can use conventional database like postgres or mysql. This is a preferred method among many as query time is consistent and predictable. If the result is not update heavy but partitioned (say by time) then Greenplum like systems are also a choice.

How can I Rely on Cassandra, since it doesn't meet my needs, By Itself?

As we read about Cassandra, before, we decided to choose it as our main database.The most important, useful and special feature which encourage us to choose this db, was Scalability, which helps us using Large volumes of data.
But, It seems that, it can not meet our requirements by itself. I asked some questions about our requirement in Stackoverfolw and how we can response them using Cassandra, and the answer was using alternative tools on top of Cassandra as Spark, Solr, DSE Search Tools and etc.
Our case is BIG Data Really, but we will have a large variety of Queries, too.
With these explanations, is it wise to stay on Cassandra?... Or It's better to switch to another db?
Cassandra is not adequate for ad-hoc queries, so I would recommend that you use Hive on Cassandra, mapping your Cassandra tables to Hive tables, usig the connector: cassandra_handler_for_hive, ( and then use hive to do joins and conditions on non partition keys)
I should mention that the performance of queries using Hive with Cassandra is not reasonable, (I have had a case where count(*) on a table with 500M records took 1 hour on 4 nodes). As a work around I used to copy the tables in HDFS after that do the computation using data on HDFS, but this is not good solution if you are seeking the fresh data.
Now for your question: To use Cassandra or not, it depends realy on your needs, Cassandra have a good performance in read/write per second record.
If your needs are met with using Hive/Cassandra to do the queries you need, so why not stay on Cassandra?

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts and experiences on the usage of CQL and in-memory query engine Spark/Shark. From what I know, CQL processor is running inside Cassandra JVM on each node. Shark/Spark query processor attached with a Cassandra cluster is running outside in a separated cluster. Also, Datastax has DSE version of Cassandra which allows to deploy Hadoop/Hive. The question is in which use case we would pick a specific solution instead of the other.
I will share a few thoughts based on my experience. But, if possible for you, please let us know about your use-case. It'll help us in answering your queries in a better manner.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from SQL background and planning to use Cassandra then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, even though CQL solves primitive GROUP BY use cases through write time and compact time sorts and implements one-to-many relationships, CQL is not the answer.
2- Spark SQL (Formerly Shark) is very fast for the two reasons, in-memory processing and planning data pipelines. In-memory processing makes it ~100x faster than Hive. Like Hive, Spark SQL handles larger than memory data types very well and up to 10x faster thanks to planned pipelines. Situation shifts to Spark SQL benefit when multiple data pipelines like filter and groupBy are present. Go for it when you need ad-hoc real time querying. Not suitable when you need long running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides you SQL like interface to handle your data. But Hive is not suitable for real-time needs. It is best suited for offline batch processing. Doesn't need any additional infra as it uses underlying HDFS for data storage. Go for it when you have to perform operations like JOIN, GROUP BY etc on large dataset and for OLAP.
Note : Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features but potentially faster. It supports the existing Hive Query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S. : The above answer is based on solely my experience. Comments/corrections are welcome.
There is a very good effort for benchmark documented here - https://amplab.cs.berkeley.edu/benchmark/

Resources