How to trace submitted Spark-SQL queries - apache-spark

When people execute Spark SQL with:
./bin/spark-sql to run THE-QUERY on the command line, OR
./bin/spark-sql -f X.sql to run THE-QUERY from a SQL file,
the exact text of THE-QUERY cannot be seen on the Spark Web UI or the history server.
That makes it hard to trace the source SQL when an exception happens.
So I am wondering: does Spark itself have a feature to log the submitted SQL?
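If you control the driver code, one workaround is to route every statement through a small wrapper that logs the SQL text before executing it, so the failing statement sits next to the stack trace in the driver log. A minimal PySpark sketch, assuming your own application (the helper name run_sql and the logger name are illustrative, not a Spark feature):

import logging
from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sql-audit")

spark = SparkSession.builder.appName("traced-sql").getOrCreate()

def run_sql(query):
    # Log the full statement before execution so it appears in the driver log
    log.info("Executing SQL: %s", query)
    try:
        return spark.sql(query)
    except Exception:
        # The failing statement is logged right next to the exception
        log.exception("SQL failed: %s", query)
        raise

run_sql("SELECT 1 AS ok").show()

This does not change what ./bin/spark-sql itself logs; it only helps when the queries pass through your own code.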

Related

How to configure `spark-sql` to connect to local spark application

I'm running a series of unit and integration tests against a complex pyspark ETL. These tests run on a local Spark application on my laptop.
Ideally I'd like to pause execution of the ETL and query the contents of the tables it creates, using either the pyspark or spark-sql REPL tools.
I can set a breakpoint() in my test classes and successfully query the local spark session, like this:
spark_session.sql("select * from global_temp.color;").show()
However, starting a SQL REPL session doesn't grant me access to the global_temp.color table. I've tried the following so far:
spark-sql
spark-sql --master spark://localhost:54321 # `spark.driver.port` from the spark UI
Does anyone know how I might get REPL or REPL-like access to a pyspark job running on my local machine?
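For context, a global temporary view is tied to one Spark application: any session inside that application can read it through the global_temp database, but a separately launched spark-sql or pyspark process is a different application and cannot see it. A small sketch of what does and does not work, reusing the spark_session and view name from the question:

# At the breakpoint: the view is visible from the session that created it
spark_session.sql("select * from global_temp.color").show()

# A new session of the *same* application can also see it
other_session = spark_session.newSession()
other_session.sql("select * from global_temp.color").show()

# A separate spark-sql / pyspark process starts its own application,
# so global_temp.color simply does not exist there.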

Spark Submit job in databricks UI unable to access existing Hive DB

I created a spark-submit job in Databricks to run a .py script, in which I create a Spark session and try to access existing Hive tables. But my script fails with a "Table or view not found" error. Should I add some configuration settings to my spark-submit job to connect to the existing Hive metastore?
Try the following when creating the Spark session in Spark 2.0+:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
This will usually resolve this kind of error.
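For completeness, a minimal sketch of the pattern; the table name my_db.my_table is hypothetical, and on Databricks the cluster's metastore settings are picked up automatically once Hive support is enabled:

from pyspark.sql import SparkSession

# enableHiveSupport() makes the session use the external Hive metastore
# instead of Spark's default in-memory catalog
spark = SparkSession.builder \
    .appName("hive-access") \
    .enableHiveSupport() \
    .getOrCreate()

# Hypothetical existing Hive table
spark.sql("SELECT * FROM my_db.my_table LIMIT 10").show()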

Can Spark SQL work without a Hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes I have seen for this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
Yes, Spark SQL queries can run without installing Hive. By default Hive uses MapReduce as its execution engine, and it can be configured to use Spark or Tez instead to execute queries much faster; Hive-on-Spark still relies on the Hive metastore to run Hive queries. Spark, for its part, can execute SQL queries on its own. If Spark runs simple SQL queries without being connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created in the working directory of the user who executes the query.
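To make the distinction concrete, here is a minimal sketch of running SQL against a DataFrame with no Hive installation at all; setting spark.sql.catalogImplementation to in-memory (its default when Hive support is not enabled) keeps Spark's catalog entirely in memory, so no Hive metastore is contacted. The JSON path is the one from the question:

from pyspark.sql import SparkSession

# No enableHiveSupport(): Spark uses its own catalog
spark = SparkSession.builder \
    .appName("sql-without-hive") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()

df = spark.read.json("/data/flight-data/json/2015-summary.json")
df.createOrReplaceTempView("some_sql_view")

spark.sql("""
    SELECT DEST_COUNTRY_NAME, sum(count) AS total
    FROM some_sql_view
    GROUP BY DEST_COUNTRY_NAME
""").where("total > 10").show()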

Spark throws FileNotFoundException and says 'Refresh Table tablename'

I am reading a file from the Ignite File System (IGFS) using Spark.
The file system URI is igfs://myfs#hostname:4500/path/to/file.
Some Spark tasks are able to read the file, but others fail with FileNotFoundException.
Eventually the whole execution ends with FileNotFoundException.
I tested the same code with small files and it worked fine.
Thanks in advance.
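For reference, the "REFRESH TABLE tablename" hint that Spark puts in this FileNotFoundException refers to invalidating Spark's cached file listing for a table whose underlying files have changed since the metadata was cached. Whether it helps with IGFS specifically is uncertain, but this is how it is invoked (the table name my_table is hypothetical):

# SQL form suggested by the error message
spark.sql("REFRESH TABLE my_table")

# Equivalent catalog API
spark.catalog.refreshTable("my_table")

# For path-based reads, the per-path variant
spark.catalog.refreshByPath("igfs://myfs#hostname:4500/path/to/file")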

Spark SQL query history

Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there any log file where a Spark application stores the queries in string format?
There is the Spark History Server (port 18080 by default). If you have spark.eventLog.enabled and spark.eventLog.dir configured and the History Server is running, you can check which Spark applications have been executed on your cluster. Each application there may have a SQL tab in the UI where you can see its SQL queries. But there is no single place or log file that stores them all.
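A hedged sketch of the settings involved; the log directory is a placeholder and would normally live in spark-defaults.conf rather than in code:

from pyspark.sql import SparkSession

# Equivalent to spark.eventLog.enabled / spark.eventLog.dir in spark-defaults.conf.
# The History Server's spark.history.fs.logDirectory should point at the same location.
spark = SparkSession.builder \
    .appName("with-event-log") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "hdfs:///spark-logs") \
    .getOrCreate()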
