How to configure `spark-sql` to connect to local spark application - apache-spark

I'm running a series of unit and integration tests against a complex pyspark ETL. These tests run on a local spark application running on my laptop.
Ideally I'd like to pause the ETL mid-run and query the contents of its temporary tables using either the pyspark or spark-sql REPL tools.
I can set a breakpoint() in my test classes and successfully query the local spark session, like this:
spark_session.sql("select * from global_temp.color;").show()
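For context, a minimal sketch of that workflow (the session configuration and the color data below are hypothetical placeholders):

from pyspark.sql import SparkSession

# Local session used by the tests (hypothetical configuration).
spark_session = SparkSession.builder.master("local[*]").appName("etl-tests").getOrCreate()

# The ETL under test registers a global temp view...
color_df = spark_session.createDataFrame([("red",), ("blue",)], ["name"])
color_df.createOrReplaceGlobalTempView("color")

# ...and pausing here lets me query it from the same driver process.
breakpoint()
spark_session.sql("select * from global_temp.color").show()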
However, starting a SQL REPL session doesn't grant me access to the global_temp.color table. I've tried the following so far:
spark-sql
spark-sql --master spark://localhost:54321 # `spark.driver.port` from the spark UI
Anyone know or have any ideas how I might get REPL or REPL-like access to a pyspark job running on my local machine?

Related

How to submit spark batch jobs to HDInsight from Jupyter PySpark/Livy?

I'm using an Azure HDInsight cluster, and through the Jupyter interface included with HDI I can run interactive Spark queries, but I was wondering how to run long-running jobs. E.g. if during my interactive querying I realize I want to do some long-running job that will take a few hours, is there a way to launch it from PySpark itself, e.g. read data from path x, do some transformation, and save to path y?
Currently if I just try to do that job inside the PySpark session itself and leave it running, Livy will eventually time out and kill the job. Is there some command to submit the batch job and get an ID I can query later for its status?
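Livy does expose a /batches REST endpoint for this kind of fire-and-forget submission: it returns a batch ID that can be polled later for status. A rough sketch (the Livy URL, script path, and printed output are placeholders; on HDInsight the Livy endpoint sits behind the cluster gateway and needs authentication):

import requests

livy_url = "http://localhost:8998"             # placeholder Livy endpoint
payload = {"file": "wasbs:///scripts/etl.py"}  # placeholder path to the job script

# Submit the batch job; the response includes the batch ID.
batch = requests.post(f"{livy_url}/batches", json=payload).json()
batch_id = batch["id"]

# Later, poll the state of that batch by ID.
state = requests.get(f"{livy_url}/batches/{batch_id}/state").json()
print(state)  # e.g. {"id": 0, "state": "running"}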

Azure Synapse Spark, run a Jupyter notebook from e.g. localhost

I have an Apache Spark pool running on web.azuresynapse.net and can e.g. use it to run Spark jobs from my "Synapse Analytics workspace".
I can develop and run Python notebooks using the Spark pool from there.
From what I see when I run a job, Livy is supported, though the Livy link provided by Azure Synapse is for some reason inaccessible.
How could I connect to that pool, e.g. via Livy, from e.g. my local Jupyter notebook and use it? Or is using a pipeline the only way to run Spark code on the pool?
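One possible route from a local notebook is Livy's /sessions REST API, which the pool appears to support per the observation above. The sketch below uses a placeholder endpoint and bearer token, since the exact Synapse Livy URL and AAD authentication details are workspace-specific assumptions:

import time
import requests

livy_url = "https://<synapse-livy-endpoint>"       # placeholder; depends on workspace and pool
headers = {"Authorization": "Bearer <aad-token>"}  # placeholder credential

# Create an interactive PySpark session on the pool.
session = requests.post(f"{livy_url}/sessions", json={"kind": "pyspark"}, headers=headers).json()
session_id = session["id"]

# Wait until the session is idle before running statements.
while requests.get(f"{livy_url}/sessions/{session_id}", headers=headers).json()["state"] != "idle":
    time.sleep(5)

# Run a statement remotely and print the raw response.
stmt = requests.post(
    f"{livy_url}/sessions/{session_id}/statements",
    json={"code": "spark.range(10).count()"},
    headers=headers,
).json()
print(stmt)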

Why does spark-shell not start SQL context?

I use spark-2.0.2-bin-hadoop2.7 and am setting up a Spark environment. I have completed most of the installation and configuration steps, but at the end I noticed something different from the online tutorials.
The logs are missing the line:
SQL context available as sqlContext.
When I run spark-shell, it just starts the Spark context. Why is the SQL context not started?
Under normal circumstances, should the following two lines both appear?
Spark context available as sc
SQL context available as sqlContext.
From Spark 2.0 onwards, SparkSession is used instead: the SQL context (sqlContext) was effectively "renamed" to SparkSession (spark).
When you run spark-shell, you will get a reference to this spark session as spark. You should see the following:
Spark session available as 'spark'.
If you want to access the underlying SQL context you could do the following:
spark.sqlContext
Please don't, though: it's no longer required, and most operations can be executed without it.
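A minimal PySpark illustration of the same point, assuming a plain local session and a throwaway view name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.range(3).createOrReplaceTempView("numbers")

# Formerly sqlContext.sql(...); now the session itself runs SQL:
spark.sql("select * from numbers").show()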

Registering temp tables in ThriftServer

I am new to Spark and am trying to understand how (if at all) it is possible to register DataFrames as temp tables in the Spark Thrift Server.
To clarify, this is what I am trying to do:
Submit an application that generates a dataframe and registers it as a temporary table
Connect from a JDBC client to the Spark ThriftServer (running on the master) and query the temporary table, even after the application that registered it completed.
So far I've had no success with this: the Spark Thrift Server is running on the Spark master, but I'm unable to actually register any temp table with it.
Is this possible? I know I can use HiveThriftServer2.startWithContext to serve a dataframe via JDBC, but that requires the application to keep running forever, and it requires me to launch additional applications.
The key idea is to register all the temp tables in the Spark job and then start the Spark Thrift Server from that same job. This keeps your job running until you terminate the Thrift Server, and it lets you query all the temp tables through it via JDBC.
It is described with an example here.
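A rough PySpark sketch of that idea (the py4j handles _jvm and _jsparkSession are Spark internals, and the exact startWithContext call may vary between versions, so treat this as an illustration rather than a stable recipe):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("thrift-with-temp-tables")
    .config("spark.sql.hive.thriftServer.singleSession", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# Register the temp tables from this job.
df = spark.createDataFrame([(1, "red"), (2, "blue")], ["id", "color"])
df.createOrReplaceTempView("color")

# Start the Thrift Server against this session's SQLContext via the JVM gateway.
jvm = spark.sparkContext._jvm
jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(
    spark._jsparkSession.sqlContext()
)

# Keep the driver alive so JDBC clients (e.g. beeline) can query `color`.
input("Thrift Server running; press Enter to stop...")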

Query: Beeline interface in Spark SQL

The Beeline script, present in Spark/bin, is one way of connecting to HiveServer2.
I ran a simple query as below.
In the output I can see that MapReduce is being launched.
I am just trying to understand the advantage of the Beeline feature in Spark, since it follows the traditional MapReduce execution framework.
Can we use Spark's RDD features from Beeline?
Thanks in advance.
Beeline is not part of Spark.
It's just a HiveServer2 client.
You can launch the Spark shell and execute queries within it, but that has nothing to do with Beeline, since Beeline has nothing to do with Spark.
That is one way. If you don't want to use MapReduce, you can use Tez as the execution engine, which runs in memory and is faster than MR.
SET hive.execution.engine=tez;
But you cannot run Spark from Beeline; Beeline is a standalone application that connects to HiveServer2.
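Beeline remains just a JDBC client either way, but if the goal is for the SQL to be executed by Spark rather than MapReduce or Tez, one option (assuming a Spark Thrift Server from the same Spark distribution, listening on its default port 10000) is to point Beeline at the Spark Thrift Server's JDBC endpoint instead of HiveServer2:
./sbin/start-thriftserver.sh
./bin/beeline -u jdbc:hive2://localhost:10000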
