Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?
When doing joins, Hive assumes the last table is the largest one. As a join optimization, it will attempt to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */ hint which tells it the table that should be streamed. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
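For reference, a rough sketch of what the hint looks like in HiveQL (orders and customers are made-up table names; as noted, Spark SQL 1.4.1 simply ignores this hint). It is written here as a Scala query string purely for illustration:

// Hypothetical tables: orders (large) and customers (small).
// The hint tells Hive to stream orders even though it is not the last table in the join list.
val hiveStreamTableQuery =
  """SELECT /*+ STREAMTABLE(o) */ o.order_id, c.name
    |FROM orders o
    |JOIN customers c ON o.customer_id = c.id""".stripMargin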
This question has been asked before for normal RDD processing, outside of Spark SQL, here. That answer does not apply to Spark SQL, where the developer has no control over explicit cache operations.
I looked for an answer to this question some time ago, and all I could come up with was setting the spark.sql.autoBroadcastJoinThreshold parameter, which defaults to 10 MB. Spark will then attempt to automatically broadcast every table whose size is smaller than the limit you set. Join order plays no role for this setting.
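A minimal sketch of what this looks like in practice on recent Spark versions (the table names and the 50 MB value are just for illustration; the per-join broadcast() hint is an alternative to the global threshold):

import org.apache.spark.sql.functions.broadcast

// Raise the auto-broadcast limit from the default 10 MB to roughly 50 MB (value in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

// Any table whose estimated size falls below the threshold is broadcast automatically,
// regardless of where it appears in the join.
val big   = spark.table("big_table")     // hypothetical tables
val small = spark.table("small_table")
val auto  = big.join(small, Seq("id"))

// Or hint a specific side explicitly, independent of the threshold:
val hinted = big.join(broadcast(small), Seq("id"))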
If you are interested in further improving join performance, I highly recommend this presentation.
This answer is about the upcoming Spark 2.3 (RC2 is currently being voted on as the next release).
As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
It is not supported in the latest Spark 2.3 (currently being voted on for release) either.
There is no support for the STREAMTABLE hint, but given the recent change (SPARK-20857 Generic resolved hint node) that introduced a hint framework, it should be fairly easy to add.
You'd have to write some Spark optimizations and possibly physical plan(s) that would support STREAMTABLE (which seems like a lot of work) but it's possible. The tools are there.
Regarding join optimizations, in the upcoming Spark 2.3 there are two main logical optimizations (a configuration sketch follows this list):
ReorderJoin
CostBasedJoinReorder (exclusively for cost-based optimization)
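Both only help if the optimizer knows the sizes involved. A hedged sketch of the knobs that drive CostBasedJoinReorder (the spark.sql.cbo.* keys exist since Spark 2.2; the sales table and its columns are hypothetical, and statistics must be collected first or there is nothing to base costs on):

// Enable cost-based optimization and cost-based join reordering.
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)

// Collect table and column statistics the optimizer will use for its cost estimates.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")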
Related
I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines to Spark 3.x and take advantage of its performance improvements. My company is using Spark 2.4.0, but we are targeting 3.1.1 as the official version for all Spark SQL data pipelines, without AQE enabled yet. The primary goal is to keep everything the same but use the newest version. Later on, we can easily enable AQE for all data pipelines.
For a specific case, right after the Spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and, looking at the Spark UI, noticed a change in the query plan, as follows:
Spark 2.4.0:
Spark 2.4.0 uses the default SortMergeJoin to join tbl_a and tbl_b, but when we look at the query plan from the new Spark 3.1.1:
We can see that instead of a SortMergeJoin it uses a BroadcastHashJoin to join tbl_a and tbl_b. Not only that, but if I'm not wrong, the BroadcastExchange operation is happening on the big table's side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10Mb
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
Do you guys have any clue on why this is happening? My questions are:
Why has Spark 3 changed the join approach in this situation, given that AQE is disabled and spark.sql.autoBroadcastJoinThreshold is much smaller than the dataset size?
Is this the expected behavior, or could this represent a potential bug in Spark 3.x?
Please, let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into the Spark code for some days, and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, then it looks for the totalSize table property, as we can see in the following source code:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L69
During my tests, this property held a very small number (far below the autoBroadcastJoinThreshold), making the Spark optimizer think it was safe to broadcast the right relation. But when the actual broadcast happened, the relation turned out to be much bigger, approximately the size shown in the picture for the right relation, causing the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE') COMPUTE STATISTICS;
The rawDataSize is now zero, so Spark 3 uses totalSize (which has a reasonable value) as the relation size and, consequently, does not use a BHJ in this situation.
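A quick way to check which size the optimizer actually sees (a sketch; table_b is the partitioned Hive table from the example above):

// The size the planner compares against spark.sql.autoBroadcastJoinThreshold.
val optimized = spark.table("table_b").queryExecution.optimizedPlan
println(optimized.stats.sizeInBytes)

// The Hive-side numbers (totalSize / rawDataSize, when present) can be inspected with:
spark.sql("DESCRIBE FORMATTED table_b").show(200, truncate = false)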
Now the issue is figuring out why rawDataSize is so small, or even zero, in the first place, given that the Hive property hive.stats.autogather is true by default (it auto-computes statistics for every DML command), but that seems to be a separate problem.
Spark has made many improvements around joins.
One of them is:
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true)
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#converting-sort-merge-join-to-broadcast-join
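For completeness, the AQE behavior described above is driven by a couple of settings (a sketch; both keys exist in Spark 3.1.1, and the local shuffle reader is already on by default):

// Let Spark re-plan joins at runtime using the actual shuffle statistics,
// e.g. demote a sort-merge join to a broadcast hash join when one side turns out small.
spark.conf.set("spark.sql.adaptive.enabled", true)

// After such a conversion, read shuffle files locally instead of over the network.
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", true)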
I am trying to use the Spark Cassandra Connector for analytics on top of data in Cassandra and found two types of implementations. Can anyone throw some light on the difference between the two and their advantages/disadvantages? I am trying to see which one to use for querying large datasets. Thanks
Option 1 - Using Spark Session SQL
sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
Option 2 - Using SCC API
CassandraJavaUtil.javaFunctions(sc)
.cassandraTable("my_keyspace", "my_table", .mapColumnTo(Integer.class))
.select("column1");
The difference is that the first uses the DataFrame API, while the second uses the RDD API. I wouldn’t expect much performance difference between them. From a practical point of view, I would recommend using the DataFrame API as much as possible, as it can be better optimized when performing operations on data. There are still operations that are available only in the RDD API, such as deletion of data, but those are also easy to achieve on top of DataFrames…
If you worry about performance, then I recommend using at least connector 2.5.0, which has a lot of optimizations that were previously available only in the commercial version, like direct join, etc. (more in this blog post).
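If it helps, turning on those DataFrame-level optimizations (direct join, etc.) in connector 2.5.0+ is mostly a matter of registering the connector's Catalyst extensions when building the session; a rough sketch (the host, keyspace, and table names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-demo")
  .config("spark.cassandra.connection.host", "127.0.0.1")  // adjust to your cluster
  // Registers the connector's Catalyst rules (direct join, etc.), available in SCC 2.5.0+.
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .load()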
All: I am looking for someone with more knowledge to check my understanding of Hive and Spark
I have been researching different large scale database solutions and I am trying to understand the difference in execution between Hive and Spark. I attempted to install Hadoop, Hive, and Spark to see how they perform. I was able to get Hadoop and Spark to work. I was unable to get Hive to work.
When I ran queries in Spark and they passed through the optimizer, it seemed the biggest advantage is that only the relevant table data is selected from the source as early as possible. So if I only need Table1 columns (A, B, C) in the final answer, but tell the system to JOIN Table1 & Table2 on (Table1.A = Table2.B), it immediately reduces the carried table to only the relevant items... I do not think Hive performs that way. I believe it will do the full join and perform the reduction later.
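A quick way to see this for yourself in Spark (a sketch with made-up table names; explain prints the optimized and physical plans, where the pruned column list and pushed-down join keys are visible):

// Hypothetical registered tables table1 and table2.
val t1 = spark.table("table1")
val t2 = spark.table("table2")

t1.join(t2, t1("A") === t2("B"))
  .select(t1("A"), t1("B"), t1("C"))
  .explain(true)   // only A, B, C from table1 and B from table2 should be read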
There are also differences in memory usage (Hive going back to HDFS frequently, vs. Spark keeping things in RAM). This has both advantages and disadvantages depending on the data set/query.
Unfortunately because I cannot get Hive to run, my theory is based off of reading outputs of other people running things in Hive.
I think Hive and Spark originally had different goals, and their execution styles are based on those goals.
Apache Spark is a framework that allows you to do calculations on big datasets stored on HDFS.
Hive is an SQL interface to retrieve data stored in HDFS, and in other clustered and object-store filesystems (S3 is an example), in a structured way.
Spark keeps things in RAM because it's more focused on making calculations with the datasets. Hive is more focused on retrieving data in a structured way, so it does not focus on speed that much (that being said, there have been improvements in Hive, like LLAP, that are meant to improve performance).
I like to use analogies with traditional software tools. On one side you have a relational database, and on the other side a programming language. They overlap in some functionality (you can write and read to disk with the programming language, and you can do some calculations with the SQL engine). However, if the task at hand requires intensive and complex calculations, you would probably use the programming language. If you are looking for a system that lets you store data in a structured way, you would go for the SQL engine.
Hive on Tez and Spark both use RAM (memory) for operating on data. The number of partitions computed, which will be treated as individual tasks, is quite different between Hive on Tez and Spark. Hive on Tez by default tries to use a combiner to merge certain splits into a single partition. Hive on Tez seems to handle autoscaling of clusters better than Spark and works most of the time. Spark doesn't work well with autoscaling; it hits a lot of shuffle errors and will fail when there are multiple stages. But given a fixed-size cluster, Spark seems to perform better than Hive on Tez, which could be attributed to some of the optimizations done and also to how the shuffle, serialization, etc. are implemented.
I'm interested only in query performance reasons and the architectural differences behind them. All answers I've seen before were outdated or didn't provide me with enough context on WHY Impala is better for ad hoc queries.
Of the 3 considerations below, only the 2nd point explains why Impala is faster on bigger datasets. Could you please contribute to the following statements?
Impala doesn't lose time on query pre-initialization, meaning impalad daemons are always running and ready. On the other hand, Spark Job Server provides a persistent context for the same purpose.
Impala is in-memory and can spill data to disk, with a performance penalty, when the data doesn't fit in RAM. The same is true for Spark. The main difference is that Spark is written in Scala and has JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented in C++ and has high hardware requirements: 128-256+ GB of RAM recommended. This is very significant, but should benefit Impala only on datasets that require 32-64+ GB of RAM.
Impala is integrated with the Hadoop infrastructure. AFAIK the main reason to use Impala over other in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. This means Impala usually uses the same storage/data/partitioning/bucketing as Spark can use, and does not gain any extra benefit from the data structure compared to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why does Impala recommend 128+ GB of RAM? What is the implementation language of each of Impala's components? The docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine." If impalad is Java, then what parts are written in C++? Is there something between impalad & the columnar data? Is 256 GB of RAM required for impalad or for some other component?
II. Impala loses all in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanics to boost JOIN performance compared to Spark?
III. Impala uses a Multi-Level Service Tree (something like the Dremel engine; see "Execution model" here) vs. Spark's Directed Acyclic Graph. What does MLST vs. DAG actually mean in terms of ad hoc query performance? Or is it a better fit for a multi-user environment?
First off, I don't think comparing a general-purpose distributed computing framework and a distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from the Hive MetaStore + block locations from the NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
Second biggie would probably be shuffle implementation, with Spark writing temp files to disk at stage boundaries against Impala trying to keep everything in-memory. Leading to a radical difference in resilience - while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is work distribution mechanism -- compiled whole stage codegens sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As far as specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today or will be in the near future.
I would like to hear your thoughts and experiences on the usage of CQL and the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while a Shark/Spark query processor attached to a Cassandra cluster runs outside, in a separate cluster. Also, Datastax has the DSE version of Cassandra, which allows deploying Hadoop/Hive. The question is: in which use cases would we pick one solution over the other?
I will share a few thoughts based on my experience. But, if possible for you, please let us know about your use-case. It'll help us in answering your queries in a better manner.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from SQL background and planning to use Cassandra then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, even though CQL solves primitive GROUP BY use cases through write time and compact time sorts and implements one-to-many relationships, CQL is not the answer.
2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and planned data pipelines. In-memory processing makes it up to ~100x faster than Hive. Like Hive, Spark SQL handles larger-than-memory datasets very well, and it is up to 10x faster thanks to planned pipelines. The situation shifts in Spark SQL's favor when multiple pipeline steps like filter and groupBy are present (see the sketch just after this list). Go for it when you need ad-hoc, near-real-time querying. It is not suitable when you need long-running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides you an SQL-like interface to handle your data. But Hive is not suitable for real-time needs; it is best suited for offline batch processing. It doesn't need any additional infrastructure as it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN, GROUP BY, etc. on large datasets, and for OLAP.
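To illustrate point 2, a small hedged sketch written against the modern DataFrame API (the table and column names are hypothetical), where the filter and the aggregation are planned as one pipeline, so the data is scanned once and filtered before the shuffle:

import spark.implicits._

// Hypothetical table of raw events (could live in HDFS, or in Cassandra via a connector).
val events = spark.table("events")

val topUsers = events
  .filter($"event_type" === "click")   // applied before the shuffle
  .groupBy($"user_id")
  .count()
  .orderBy($"count".desc)
  .limit(10)

topUsers.show()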
Note: Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features, but is potentially faster. It supports the existing Hive Query Language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S. : The above answer is based on solely my experience. Comments/corrections are welcome.
There is a very good benchmarking effort documented here - https://amplab.cs.berkeley.edu/benchmark/