Spark SQL performance difference in spark-sql vs spark-shell REPL - apache-spark

Spark newb question: I'm making exactly the same Spark SQL query in spark-sql and in spark-shell. The spark-shell version takes about 10 seconds, while the spark-sql version takes about 20.
The spark-sql REPL gets the query directly:
spark-sql> SELECT .... FROM .... LIMIT 20
The spark-shell REPL commands are like this:
scala> val df = sqlContext.sql("SELECT ... FROM ... LIMIT 20 ")
scala> df.show()
In both cases, it's exactly the same query. Also, the query returns only a few rows because of the explicit LIMIT 20.
What's different about how the same query is executed from the different CLIs?
I'm running on Hortonworks sandbox VM (Linux CentOS) if that helps.

I think it comes down to two things.
First, it could be related to the order in which you run them. If you run the query in spark-sql first, Spark has to build the query plan from scratch; if you then run the same query again, from either shell, it can take less time because the plan is easier to retrieve.
Second, it could be related to how each CLI acquires resources. I have seen this multiple times: spark-shell gets its resources and starts processing faster than spark-sql. You can check this from the UI, or from top, where you will see that spark-shell actually starts faster than spark-sql.
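If you want to check the first point, one thing you can do is time the same statement twice inside a single session. A minimal PySpark sketch (the question uses spark-shell/Scala, but the idea is identical; the table and query here are placeholders):

import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timing-check").getOrCreate()

query = "SELECT * FROM some_table LIMIT 20"  # placeholder for the real query

for attempt in (1, 2):
    start = time.time()
    spark.sql(query).show()  # the second run usually benefits from warmed-up metadata and JVM
    print("attempt %d took %.2fs" % (attempt, time.time() - start))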

Related

What is the Spark SQL statement for seeing the version of Spark processing the query?

I have access to a web tool that executes Spark SQL queries, and I need to check the version of Spark processing those queries (e.g., 3.1?). Is there a Spark SQL statement that I can run which will return the version of Spark, and possibly other information?
I looked through the docs, but I didn't see anything.
I don't have another way to determine this information, such as running spark-sql --version. I'm limited to this web interface that accepts valid Spark SQL.
Does Spark offer something like MySQL's SHOW VARIABLES statement?
There is a SQL function, listed in the Misc Functions section of the docs, that returns the Spark version:
version() - Returns the Spark version. The string contains 2 fields, the first being a release version and the second being a git revision.
Usage is:
spark-sql> SELECT version();
3.1.2 de351e30a90dd988b133b3d00fa6218bfcaba8b8
Time taken: 0.087 seconds, Fetched 1 row(s)
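The asker is limited to plain SQL through the web tool, so SELECT version() is the answer there, but for anyone who also has programmatic access, the same information is exposed on the session object. A minimal PySpark sketch (session creation assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same SQL function as above, issued through spark.sql
spark.sql("SELECT version()").show(truncate=False)

# The session also exposes the release version directly
print(spark.version)  # e.g. '3.1.2'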
There is no built-in config entry that reports the Spark version, but if you set one yourself when building the Spark session, you can read it back with a query. Keep in mind it is user-editable, so it is only as trustworthy as whoever set it.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.version', '3.3.1') \
    .getOrCreate()
spark.sql('set spark.version').show()
+-------------+-----+
| key|value|
+-------------+-----+
|spark.version|3.3.1|
+-------------+-----+

No DAG generated for multiple dataframe join on Spark 3

I'm trying to migrate my Spark application from Spark 2.x to 3.x, and something weird is happening.
In my application there is a job with multiple joins (maybe 40-50 dataframes joined with the same base dataframe). Everything is OK on Spark 2.x, while on Spark 3.x no DAG is generated and there are no error logs either; the application seems to be suspended and I have no idea why.
I tried forcing those joins to be split into multiple jobs, and things turn out OK when each job has 5 joins.
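For reference, a minimal sketch of the "split the joins into multiple jobs" workaround described above, assuming a shared join key and illustrative table names (base_table, side_table_N, and the checkpoint directory are placeholders; the batch size of 5 matches what worked for the asker):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batched-joins").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path

base_df = spark.table("base_table")                                   # placeholder
dfs_to_join = [spark.table("side_table_%d" % i) for i in range(40)]   # placeholders

BATCH_SIZE = 5

result = base_df
for i, df in enumerate(dfs_to_join, start=1):
    result = result.join(df, on="id", how="left")
    if i % BATCH_SIZE == 0:
        # Truncate the lineage so each batch is materialized as its own job(s)
        result = result.checkpoint()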

Query performance in spark-submit vs hive shell

I am having a hard time debugging why a simple query against a Hive external table (DynamoDB-backed) takes north of 10 minutes via spark-submit while it takes only 4 seconds in the hive shell.
The Hive external table refers to a DynamoDB table, say Employee[id, name, ssn, dept]. id is the partition key and ssn is the range key.
Using AWS EMR 5.29 with Spark, Hive, Tez, and Hadoop. 1 master, 4 core nodes, m5.l.
In the hive shell: select name, dept, ssn from employee where id='123/ABC/X12I' returns results in 4 seconds.
Now, let's say I have the following code in code.py (ignoring the imports):
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")
# print data or get length
I submit the above on the master node as:
spark-submit --jars /pathto/emr-ddb-hive.jar,/pathto/emr-ddb-hadoop.jar code.py
The above spark-submit takes a long time, 14+ minutes. I am not sure which parameter needs to be tweaked or set to get a better response time.
In the hive shell I ran SET; to view the parameters that the hive shell is using, and there are a gazillion.
I also tried a boto3 DynamoDB way of searching and it is way faster than my simple PySpark SQL via spark-submit.
I am missing fundamentals... Any idea or direction is appreciated.
I was doing an aggregation when I was trying to print by doing a collect(). I had read about it but did not realize that it was that bad (timing-wise). I also ended up doing some more experiments, like take(n) and limit 1.
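To illustrate the collect() point with the same (hypothetical) query from the question: collect() pulls the entire result set back to the driver, while take(n), limit(n), or count() only fetch what is actually needed. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()

data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")

all_rows = data.collect()          # brings every row to the driver; expensive for large results
some_rows = data.take(5)           # fetches at most 5 rows
one_row = data.limit(1).collect()  # pushes the limit into the plan
row_count = data.count()           # counts without materializing rows on the driver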

Does Spark SQL cache the result for the same query execution

When I run the same query twice in Spark SQL in local mode, the second run is always faster (I assume cache locality may cause this).
But when I look into the Spark UI, I find that the two identical queries have a different number of jobs, and this is the part that confuses me; for example, see below.
As you can see, the second run only requires one job (20), so does this imply that Spark SQL caches the query result explicitly?
Or does it cache some intermediate result of some jobs from the previous run?
Thank you for the explanation.
collect at <console>:26    2019/10/09 08:28:34    2 s        [20]
collect at <console>:26    2019/10/09 08:26:01    2.3 min    [16][17][18][19]

How to run hive sql in spark

add file s3://nouveau3/cleanser/cleanser.py
CREATE EXTERNAL TABLE IF NOT EXISTS ext_tbl (
c STRING
) ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION 's3-location'
tblproperties ('skip.header.line.count'='1');
CREATE TABLE main_tbl (schema);
INSERT INTO TABLE main_tbl
SELECT TRANSFORM(c)
USING 'python cleanser.py' as (schema)
FROM ext_tbl;
The insert query runs for more than 15 minutes. To improve that, how can I run that query in Spark? The s3-location has more than 50 objects (gz format).
Approach 1 - If the query doesn't deal with too much data, and depending on the capacity of your edge nodes, you can run it directly on Spark just by logging into the spark-sql> shell.
Approach 2 - The spark-sql shell will not submit the query in cluster mode; it will just run on a single edge node, and this might kill your job if the edge node falls short of resources.
Instead, you can write a Python script which reads your queries and calls spark.sql("your queries"), then launch this job using spark-submit --deploy-mode cluster. The spark-submit command gives you the option to specify the deploy mode, which should be cluster. This will leverage your entire cluster instead of just one node, as in the sketch below.
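A minimal sketch of that suggestion, reusing the statements from the question (the driver file name run_cleanser.py is illustrative, and "schema" is kept as the placeholder the question uses):

# run_cleanser.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cleanser-job")
         .enableHiveSupport()   # needed for the Hive DDL and TRANSFORM ... USING
         .getOrCreate())

# Ship the transform script to the executors
spark.sql("ADD FILE s3://nouveau3/cleanser/cleanser.py")

spark.sql("""
    INSERT INTO TABLE main_tbl
    SELECT TRANSFORM(c)
    USING 'python cleanser.py' AS (schema)
    FROM ext_tbl
""")

Launched with something like: spark-submit --deploy-mode cluster run_cleanser.py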
