Converting hive query to Spark-SQL - apache-spark

I have a Hive query that runs in map-reduce mode once per day. The total processing time is about 20 minutes, which is fine for our daily process.
I am looking to execute it in the Spark framework.
To begin with, I set the execution engine to Spark in the Hive shell and executed the same query.
The process had transformations and actions, and the whole query completed in around 8 minutes. The query has multiple subqueries, IN clauses, and WHERE conditions.
The question is: how does the Spark environment create RDDs for complex queries (assume I have just run the same query as in Hive)? Does it create an RDD for each subquery?
Now I would like to use Spark SQL in place of the Hive query. How should we approach these kinds of complex queries, where a lot of subqueries and aggregations are involved?
I understand that for relational data computations, we need to leverage DataFrames.
Would rewriting the queries in Spark SQL be the right approach, or should we stick with setting the execution engine to Spark and running the Hive query?
If there are advantages to writing the queries in Spark SQL and running them on Spark, what would those advantages be?
For all the subqueries and the various filter and aggregation logic, what would the performance of the DataFrame APIs be?
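
To make the question concrete, here is a minimal sketch (Spark 2.x API; the table names and the query itself are placeholders, not the original job) of running such a query through Spark SQL and inspecting the plan it builds:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveQueryOnSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-query-on-spark")
                .enableHiveSupport()               // reuse the same Hive metastore tables
                .getOrCreate();

        // The whole statement, subqueries and IN clauses included, is parsed into a
        // single logical plan; an IN subquery typically becomes a semi-join inside
        // that plan rather than a separate RDD you manage yourself.
        Dataset<Row> result = spark.sql(
                "SELECT t.id, count(*) AS cnt "
                + "FROM my_db.my_table t "
                + "WHERE t.id IN (SELECT id FROM my_db.other_table) "
                + "GROUP BY t.id");

        // explain(true) prints the parsed, optimized, and physical plans Catalyst builds.
        result.explain(true);
        result.write().mode("overwrite").saveAsTable("my_db.daily_output");
    }
}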

Related

The mechanism behind the join operation between one local table and one DB table

When I register one table from a local RDD and one table from a DB, I find that the join between the two tables is really slow.
The table from the DB is actually a SQL query with multiple joins, and the local RDD has only 20 records.
I am curious about the mechanism behind this.
Do we pull the data from the remote DB and execute all tasks in the local Spark cluster?
Or do we have an 'interesting' SQL engine that sends an optimized query to the DB and waits for the result to come back? In my opinion, this does not make sense, because the query executes really fast in the DB.
Spark SQL runs the query on its side. When you define tables to join in a query, it first fetches the tables into the cluster and holds them in memory as RDDs or DataFrames, and finally runs tasks to perform the query operations.
In your example, suppose the first RDD is already in memory but the second needs to be fetched. The data source will ask your SQL engine for the table. But because it is a query with multiple joins, your SQL engine will first run the query on its side (separate from the Spark cluster) and deliver the results once the table is ready. Suppose the SQL engine takes TA seconds to run the query (for the complete result, not the top rows you see in an SQL client), and moving the data to the Spark cluster (possibly over the network) takes TB seconds. After TA+TB seconds the data is ready for Spark to work on.
If TC is the time for the join operation, the total time is total = TA + TB + TC. You need to check where your bottleneck is; I think TB may be the critical one for large data.
However, when using a cluster with two or more workers, make sure that all nodes are involved in the operation. Sometimes, because of incorrect coding, Spark will use just one node for the operations. Make sure your data is spread out across the cluster to benefit from data locality.
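
For illustration, a hedged sketch of the two pieces described above (the connection details, the multi-join query, and the table names are all assumptions): pushing the heavy query down to the database so only its result crosses the network, then broadcasting the 20-row local side so the join itself needs no shuffle:

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("jdbc-join").getOrCreate();

        // Pushing the expensive multi-join down to the database as a subquery:
        // the DB runs it (TA), and only the result set crosses the network (TB).
        Dataset<Row> dbSide = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/mydb")          // hypothetical connection
                .option("dbtable", "(SELECT o.id, o.total FROM orders o "
                        + "JOIN customers c ON o.cust_id = c.id) AS q")        // hypothetical query
                .option("user", "spark")
                .option("password", "secret")
                .load();

        // The small side registered from a local RDD/collection (assumed name).
        Dataset<Row> localSide = spark.table("local_20_rows");

        // Broadcasting the 20-row side avoids shuffling the large DB result (TC).
        Dataset<Row> joined = dbSide.join(broadcast(localSide), "id");
        joined.show();
    }
}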

spark-sql force parallel union

I'm using Spark SQL to query Cassandra tables.
In Cassandra I've partitioned my data by a time bucket and an id, so depending on the query I need to union multiple partitions with Spark SQL and do the aggregations/group-by on the union result, something like this:
Dataset<Row> unionResult = null;
for (String partition : allCassandraPartitions) {   // iterate the partition keys
    // build the DataFrame for one time-bucket/id partition (original query elided)
    Dataset<Row> currentPartition = sqlContext.sql(/* ... per-partition query ... */);
    unionResult = (unionResult == null) ? currentPartition : unionResult.union(currentPartition);
}
Increasing the input (the number of loaded partitions) increases the response time more than linearly, because the unions are done sequentially.
Because there is no harm in doing the unions in parallel, and I don't know how to force Spark to do them in parallel, right now I'm using a thread pool
to asynchronously load all the partitions in my application (which may cause OOM), and somehow do the sort or a simple group-by in Java (which makes me wonder why I'm using Spark at all).
The short question is:
How do I force Spark SQL to load Cassandra partitions in parallel while doing a union on them?
Also, I don't want too many tasks in Spark; with my home-made async solution I use coalesce(1), so each task is very fast (it only waits on Cassandra).
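
For reference, one hedged sketch of keeping everything inside Spark (the bucket values and table name are assumptions): build each per-partition DataFrame first, reduce them with a single chain of unions, and let the final action schedule the per-partition scans as tasks; union itself is lazy, so no data moves until the aggregation runs:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParallelUnionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("parallel-union").getOrCreate();

        List<Dataset<Row>> parts = new ArrayList<>();
        for (String bucket : new String[]{"2019-01", "2019-02", "2019-03"}) {   // assumed buckets
            parts.add(spark.sql(
                    "SELECT * FROM cassandra_table WHERE time_bucket = '" + bucket + "'"));
        }

        // A single lazy union over all partitions; nothing is read yet.
        Dataset<Row> all = parts.stream().reduce(Dataset::union).get();

        // show() triggers the job; its tasks scan the per-partition inputs in parallel
        // across the executors, and repartition() bounds the task count for the group-by.
        Dataset<Row> grouped = all.repartition(32).groupBy("id").count();
        grouped.show();
    }
}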

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw, and application layers in HDFS and does CDC (change data capture). It is currently written as Hive queries and executed via Oozie, and it needs to be migrated to a Spark application (current version 1.6). The other sections of code will be migrated later.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (e.g. sqlContext.sql("my hive hql")). The other way would be to use the DataFrame APIs and rewrite the HQL that way.
What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Some people have suggested there is an extra layer of SQL that the Spark core engine has to go through when using SQL queries directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame APIs, but when I have all my HQL queries handy, is it really worth rewriting everything with the DataFrame API?
Thank You.
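
To make the two options concrete, here is a small sketch (Spark 1.6-era API; the table, columns, and query are assumptions, not the actual ingestion logic) of the same aggregation written once as HQL and once with the DataFrame API:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.hive.HiveContext;

public class SqlVsDataFrame {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("sql-vs-df"));
        HiveContext sqlContext = new HiveContext(jsc.sc());

        // Option 1: run the existing HQL as-is.
        DataFrame viaSql = sqlContext.sql(
                "SELECT customer_id, SUM(amount) AS total "
                + "FROM raw.orders WHERE load_date = '2016-01-01' "
                + "GROUP BY customer_id");

        // Option 2: the same logic rewritten with the DataFrame API.
        DataFrame viaApi = sqlContext.table("raw.orders")
                .filter("load_date = '2016-01-01'")
                .groupBy("customer_id")
                .agg(functions.sum("amount").alias("total"));

        // Both go through the same Catalyst optimizer, so the plans should match.
        viaSql.explain(true);
        viaApi.explain(true);
    }
}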
Question : What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Answer :
There is a comparative study done by Hortonworks. source...
The gist is that each approach is right depending on the situation/scenario; there is no
hard and fast rule to decide this. Please go through the points below.
RDDs, DataFrames, and SparkSQL (in fact 3 approaches, not just 2):
At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDD’s:
Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
Dataset - initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
DataFrames API is a data abstraction framework that organizes your data into named columns:
Create a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL like data manipulations and aggregations
Under the hood, it is an RDD of Rows
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:
SQL
DataFrames API
Datasets API
Test results:
RDD’s outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed almost about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDD’s
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
Random lookup against 1 order ID from 9 Million unique order ID's
GROUP all the different products with their total COUNTS and SORT DESCENDING by product name
In your Spark SQL string queries, you won't know a syntax error until runtime (which could be costly), whereas in DataFrames syntax errors can be caught at compile time.
A couple more additions: DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API you have more control over the actual execution plan than with SparkSQL.
If a query is lengthy, writing and running it efficiently as a plain SQL string is not really possible.
On the other hand, DataFrames, along with the Column API, help the developer write compact code, which is ideal for ETL applications.
Also, all operations (e.g. greater than, less than, select, where, etc.) run through a DataFrame build an Abstract Syntax Tree (AST), which is then passed to Catalyst for further optimization. (Source: Spark SQL whitepaper, Section 3.3)
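
A short illustration of the compile-time point and the Column API style (the table and columns are hypothetical):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class ColumnApiSketch {
    public static DataFrame demo(HiveContext sqlContext) {
        DataFrame orders = sqlContext.table("raw.orders");   // hypothetical table

        // A typo inside an SQL string, e.g. sqlContext.sql("SELCT customer_id ..."),
        // only fails at runtime when Catalyst parses it. A typo in a method name,
        // e.g. orders.selct(...), is rejected by the Java compiler.
        // (A misspelled *column name* is still a runtime error in both styles.)
        return orders
                .where(col("amount").gt(100))
                .select("customer_id", "amount");
        // Each where/select/gt call adds a node to the logical plan (the AST)
        // that Catalyst optimizes before anything executes.
    }
}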

Spark-java multithreading vs running individual spark jobs

I am new to Spark and trying to understand the performance difference between the approaches below (Spark on Hadoop).
Scenario: as part of batch processing I have 50 Hive queries to run. Some can run in parallel and some sequentially.
- First approach
All the queries can be stored in a Hive table, and I can write a Spark driver to read all the queries at once and run them in parallel (with HiveContext) using Java multithreading.
Pros: easy to maintain.
Cons: all resources may get occupied, and performance tuning can be tough for each query.
- Second approach
Using Oozie Spark actions, run each query individually.
Pros: optimization can be done at the query level.
Cons: tough to maintain.
I couldn't find any documentation on how Spark processes the queries internally in the first approach. From a performance point of view, which approach is better?
The only thing on Spark multithreading I could find is:
"within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
Thanks in advance
Since your requirement is to run Hive queries in parallel, with the condition that
some can run in parallel and some sequentially,
this kind of workflow is best handled by a DAG processor, which Apache Oozie is. This approach will be cleaner than managing your queries in code, where you would essentially be building your own DAG processor instead of using the one provided by Oozie.
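
For reference, a minimal sketch of what the first approach (one driver, queries submitted from multiple threads) could look like; the query strings and thread-pool size are assumptions:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.spark.sql.hive.HiveContext;

public class ParallelHiveQueries {
    public static void run(HiveContext sqlContext) throws InterruptedException {
        // Hypothetical queries; in the first approach they would be read from a Hive table.
        List<String> queries = Arrays.asList(
                "SELECT count(*) FROM raw.table_a",
                "SELECT dept, sum(salary) FROM raw.table_b GROUP BY dept");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String hql : queries) {
            // Each thread triggers its own Spark jobs inside the same application,
            // so they share the application's executors and can run concurrently.
            // Setting spark.scheduler.mode=FAIR helps the jobs share executor slots.
            pool.submit(() -> sqlContext.sql(hql).count());
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);
    }
}

Whether this beats per-query Oozie actions still depends on how much the queries contend for the single application's resources, which is exactly the con noted above.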

Which query to use for better performance, join in SQL or using Dataset API?

While fetching and manipulating data from HBase using Spark, which is faster: a *Spark SQL join* or a *Spark DataFrame join*?
RDDs always outperform DataFrames and SparkSQL, but in my experience DataFrames perform well compared to SparkSQL. The link below gives some insight into this:
Spark RDDs vs DataFrames vs SparkSQL
As far as I can tell, they should behave the same with regard to performance; SQL internally works as a DataFrame.
I don't have access to a cluster to properly test but I imagine that the Spark SQL just compiles down to the native data frame code.
The rule of thumb I've heard is that the SQL code should be used for exploration and dataframe operations for production code.
Spark SQL brings a powerful new optimization framework called Catalyst. Using Catalyst, Spark can automatically transform SQL queries so that they execute more efficiently.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations: it provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine.
The execution speed will be the same, because they use same optimization algorithms.
If the join might be shared across queries, a carefully implemented join with RDDs might be a good option. However, if this is not the case, let Spark/Catalyst do its job and join within Spark SQL; it will do all the optimization, so you won't have to maintain your join logic yourself.
A Spark SQL join and a Spark DataFrame join are almost the same thing. The join is actually delegated to RDD operations under the hood; on top of the RDD operations we have convenience layers such as Spark SQL, DataFrames, and Datasets. In the case of Spark SQL, it needs to spend a tiny amount of extra time parsing the SQL.
It should be evaluated more in terms of good programming practice. I like Datasets because you can catch syntax errors while compiling, and the encoders behind the scenes take care of compacting the data and executing the query.
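
A quick way to check the "same under the hood" point for a concrete join (the tables and key columns are assumptions, e.g. HBase-backed DataFrames registered as temp views): write it once as SQL and once with the DataFrame API and compare the physical plans that explain() prints:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JoinPlanComparison {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("join-plans")
                .getOrCreate();

        // Assume both sources are already registered as tables/temp views.
        Dataset<Row> orders = spark.table("orders");
        Dataset<Row> customers = spark.table("customers");

        // The same join written as SQL and with the DataFrame API.
        Dataset<Row> sqlJoin = spark.sql(
                "SELECT o.id, c.name FROM orders o JOIN customers c ON o.cust_id = c.id");

        Dataset<Row> dfJoin = orders
                .join(customers, orders.col("cust_id").equalTo(customers.col("id")))
                .select(orders.col("id"), customers.col("name"));

        // Both should print the same physical plan, i.e. the same join strategy.
        sqlJoin.explain();
        dfJoin.explain();
    }
}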
I did some performance analysis of SQL vs DataFrames on Cassandra using Spark; I think it will be the same for HBase as well.
In my experience, SQL works faster than the DataFrame approach. The reason might be that in the DataFrame approach a lot of Java objects are involved, whereas in the SQL approach everything is done in memory.
