Spark UI, what is the meaning of the SQL tab? - apache-spark

If my understanding is correct, Spark applications may contain one or more jobs. A job may be split into stages, and stages can be split into tasks. I can more or less follow this in the Spark user interface (or at least I think so). But I am confused about the meaning of the SQL tab.
In particular:
what is the relation of a SQL query to jobs and stages?
in the SQL tab, would we also see DataFrame operations or only queries submitted e.g. via spark.sql()?
I have been running some examples in order to understand but it is still not very clear. Could you please help me?

The SQL tab shows what you'd consider the execution plan. It shows the stages, run times, memory usage and operations (Exchanges, Projections, etc.). Catalyst builds the job plan based on your query, regardless of whether the query was written with spark.sql or with Dataset/DataFrame operations.
You can find more information here:
If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, jobs, and physical and logical plans for the queries.
https://spark.apache.org/docs/latest/web-ui.html
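As a concrete illustration, here is a minimal sketch (assuming a spark-shell session; the data and view name are made up) of the same query expressed both through spark.sql and through the DataFrame API. Both produce essentially the same physical plan, and each appears as an entry in the SQL tab once an action runs:

import spark.implicits._   // already in scope in spark-shell

// small DataFrame registered as a temp view
val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")
people.createOrReplaceTempView("people")

// the same logical query expressed two ways
val viaSql = spark.sql("SELECT name FROM people WHERE age > 40")
val viaApi = people.filter($"age" > 40).select("name")

// both print essentially the same physical plan; running an action
// such as .collect() makes each one show up in the SQL tab
viaSql.explain()
viaApi.explain()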

Related

Does Spark SQL cache the result for the same query execution

When I run the same query twice in Spark SQL in local mode, the second run always runs faster (I assume cache locality may cause this).
But when I look at the Spark UI, I see that the two identical queries have a different number of jobs, and this is the part that confuses me; for example, see below.
As you can see, the second one only requires one job (20). Does this imply that Spark SQL caches the query result explicitly?
Or it caches some intermediate result of some jobs of the previous run?
Thank you for the explanation.
collect at <console>:26   2019/10/09 08:28:34   2 s       Jobs: [20]
collect at <console>:26   2019/10/09 08:26:01   2.3 min   Jobs: [16][17][18][19]
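For reference, a minimal sketch of the kind of repeated query described above (spark-shell, local mode; the aggregation is made up purely for illustration):

import org.apache.spark.sql.functions.col

val df  = spark.range(0L, 10000000L).withColumn("bucket", col("id") % 100)
val agg = df.groupBy("bucket").count()

agg.collect()   // first run: shows up as several jobs in the UI
agg.collect()   // second run: typically fewer jobs, since shuffle output from the first run can be reused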

Spark UI - Spark SQL Query Execution

I am using the Spark SQL API. When I look at the SQL section of the Spark UI, which details the query execution plan, it shows the parquet scan stage multiple times even though I am reading the parquet only once.
Is there any logical explanation?
I would also like to understand the different operations like HashAggregate, SortMergeJoin, etc., and understand the Spark UI better as a whole.
If you are doing unions or joins, they may force your plan to be "duplicated" from the beginning.
Since Spark doesn't automatically keep intermediate state (unless you cache it), it will have to read the source multiple times.
Something like:

val df = spark.read.parquet("ParquetFile1")
val dfFiltered = df.filter("active = 1")
dfFiltered.union(df)

The plan will probably look like: readParquetFile1 --> union <-- filter <-- readParquetFile1
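If the double scan is expensive, a hedged sketch of the caching workaround mentioned above (the file path is just an example):

val df = spark.read.parquet("ParquetFile1").cache()   // mark the source for caching
val dfFiltered = df.filter("active = 1")
dfFiltered.union(df).count()   // once the cache is populated, both branches read it instead of re-scanning the file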

Converting hive query to Spark-SQL

I have a Hive query which runs in MapReduce mode daily. The total processing time is 20 minutes, which is fine for our daily process.
I am looking to execute it in the Spark framework.
To begin with, I have set the execution engine = spark in the Hive shell and executed the same query.
The process had transformations and actions, and the whole query completed in around 8 minutes. This query has multiple subqueries, IN clauses and WHERE conditions.
The question is: how does the Spark environment create RDDs for complex queries (assume I have just run the same query as in Hive)? Does it create an RDD for each subquery?
Now I would want to leverage Spark SQL in place of the Hive query. How should we approach these kinds of complex queries, where we have a lot of subqueries and aggregations involved?
I understand that for relational data computations, we need to leverage DataFrames.
Would rewriting them in Spark SQL be the right approach, or should we stick with setting the execution engine = spark and running the Hive query?
If there are advantages to writing the queries in Spark SQL and running them on Spark, what would those advantages be?
For all the subqueries and the various filter and aggregation logic, what would the performance of the DataFrame APIs be?
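For context, both options can be tried side by side; below is a minimal sketch (table and column names are made up) of running the existing HiveQL through spark.sql versus expressing the same logic with the DataFrame API:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("hive-to-spark")
  .enableHiveSupport()   // lets spark.sql() and spark.table() see the existing Hive metastore
  .getOrCreate()

// option 1: run the existing HiveQL more or less unchanged
val viaSql = spark.sql(
  "SELECT dept, count(*) AS cnt FROM sales WHERE region IN ('EU', 'US') GROUP BY dept")

// option 2: the same logic with the DataFrame API
val viaApi = spark.table("sales")
  .filter(col("region").isin("EU", "US"))
  .groupBy("dept")
  .count()

In both cases Catalyst produces the plan, so for straightforward queries the performance of the two forms is usually comparable.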

Spark-java multithreading vs running individual spark jobs

I am new to Spark and trying to understand the performance difference between the approaches below (Spark on Hadoop).
Scenario: as part of batch processing I have 50 Hive queries to run. Some can run in parallel and some sequentially.
- First approach
All of the queries can be stored in a Hive table, and I can write a Spark driver to read all the queries at once and run them in parallel (with HiveContext) using Java multithreading.
Pros: easy to maintain.
Cons: all resources may get occupied, and performance tuning can be tough for each query.
- Second approach
Using Oozie Spark actions, run each query individually.
Pros: optimization can be done at the query level.
Cons: tough to maintain.
I couldn't find any documentation about how Spark will process the queries internally in the first approach. From a performance point of view, which approach is better?
The only thing on Spark multithreading I could find is:
"within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
Thanks in advance
Since your requirement is to run Hive queries in parallel with the condition
Some can run parallel and some sequential
this kind of workflow is best handled by a DAG processor, which Apache Oozie is. This approach will be cleaner than managing your queries in code, i.e. building your own DAG processor instead of using the one provided by Oozie.
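For completeness, the first approach (submitting several queries from different threads inside one application, as the quoted documentation sentence describes) could look roughly like the sketch below; the query strings and table names are hypothetical, and spark is assumed to be an existing SparkSession with Hive support:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// hypothetical HiveQL strings; in the first approach they would be read from a Hive table
val queries = Seq(
  "SELECT count(*) FROM db.table_a",
  "SELECT dept, sum(amount) FROM db.table_b GROUP BY dept")

// each action submitted from its own thread can run as a concurrent Spark job
val futures = queries.map(q => Future { spark.sql(q).collect() })
Await.result(Future.sequence(futures), Duration.Inf)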

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts and experiences on the usage of CQL and the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while the Shark/Spark query processor attached to a Cassandra cluster runs outside, in a separate cluster. Also, Datastax has the DSE version of Cassandra which allows you to deploy Hadoop/Hive. The question is in which use cases we would pick one solution over the others.
I will share a few thoughts based on my experience. But, if possible, please let us know about your use case; it'll help us answer your questions better.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from a SQL background and planning to use Cassandra, then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, CQL is not the answer, even though it covers primitive GROUP BY use cases through write-time and compaction-time sorts and can model one-to-many relationships.
2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and pipelined planning of data operations. In-memory processing makes it ~100x faster than Hive. Like Hive, Spark SQL handles larger-than-memory data very well, and is up to 10x faster thanks to planned pipelines. The balance shifts in Spark SQL's favor when multiple pipelined operations like filter and groupBy are present. Go for it when you need ad-hoc real-time querying; it is less suitable for long-running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides a SQL-like interface to handle your data. But Hive is not suitable for real-time needs; it is best suited for offline batch processing. It doesn't need any additional infrastructure as it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN, GROUP BY etc. on large datasets and for OLAP.
Note : Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features but potentially faster. It supports the existing Hive Query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
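As a small illustration of that Hive compatibility in current Spark versions (assuming a Spark build with Hive support and an existing Hive table; the names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-compat")
  .enableHiveSupport()   // use the Hive metastore, SerDes and UDFs
  .getOrCreate()

// plain HiveQL runs unchanged through Spark SQL
spark.sql("SELECT category, count(*) FROM warehouse.events GROUP BY category").show()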
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S.: The above answer is based solely on my experience. Comments/corrections are welcome.
There is a very good benchmarking effort documented here - https://amplab.cs.berkeley.edu/benchmark/
