Print all config properties of a SparkSession / SparkContext in Jupyter - apache-spark

How do I print all config properties of a SparkSession / SparkContext in Jupyter?

Assume spark is an instance of SparkSession and sc is an instance of SparkContext.
Scala
spark.sparkContext.getConf.getAll.foreach(println)
or
sc.getConf.getAll.foreach(println)
Python
spark.sparkContext.getConf().getAll()
or
sc.getConf().getAll()
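In Jupyter the Python call above simply returns a list of (key, value) tuples. A small sketch (assuming the same spark session object as above) that prints them one per line, sorted by key:
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key}={value}")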

Related

Hide SparkSession builder output in jupyter lab

I start pyspark SparkSessions in Jupyter Lab like this:
import os
import findspark
findspark.init(os.environ['SPARK_HOME'])
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.appName('myapp')
.master('yarn')
.config("spark.port.maxRetries", "1000")
.config('spark.executor.cores', "2")
.config("spark.executor.memory", "10g")
.config("spark.driver.memory", "4g")
#...
.getOrCreate()
)
And then a lot of output appears in the cell...
WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3969.3554875/lib/spark) overrides detected (/opt/cloudera/parcels/CDH/lib/spark).
WARNING: Running spark-class from user-defined location.
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/hadooplog/sparktmp
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/hadooplog/sparktmp
...
I would like to hide this output to clean up notebooks and make them easier to read. I've tried %%capture and spark.sparkContext.setLogLevel("ERROR") (although the latter only affects Spark's own logging, and even then output still appears here and there). Neither works.
Running
pyspark version 2.4.0
jupyterlab version 3.2.1
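One possible workaround, not from the original thread, so treat it as a sketch: the noise comes from the spark-class child process writing to the notebook's process-level stdout/stderr file descriptors, which %%capture does not intercept. Temporarily redirecting those descriptors to /dev/null while the session is built should hide it; suppress_fd_output below is a hypothetical helper, not a pyspark API.
import os
from contextlib import contextmanager
from pyspark.sql import SparkSession

@contextmanager
def suppress_fd_output():
    # Point the process-level stdout/stderr file descriptors at /dev/null so
    # output from child processes (e.g. spark-class) is hidden, then restore them.
    with open(os.devnull, "w") as devnull:
        saved_out, saved_err = os.dup(1), os.dup(2)
        try:
            os.dup2(devnull.fileno(), 1)
            os.dup2(devnull.fileno(), 2)
            yield
        finally:
            os.dup2(saved_out, 1)
            os.dup2(saved_err, 2)
            os.close(saved_out)
            os.close(saved_err)

with suppress_fd_output():
    spark = (SparkSession.builder
             .appName('myapp')
             .master('yarn')
             .getOrCreate())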

How to run python spark script with specific jars

I have to run a Python script on an EMR instance, using pyspark to query DynamoDB. I am able to query DynamoDB from pyspark when it is launched with the following command, which includes the required jars:
pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
I ran the following python3 script to query the data using the pyspark Python module.
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
start_time = time.time()
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")
sparkSession = (SparkSession
.builder
.appName('example-pyspark-read-and-write-from-hive')
.enableHiveSupport()
.getOrCreate())
df_load = sparkSession.sql("SELECT * FROM example")
df_load.show()
print(time.time() - start_time)
This caused the following runtime exception because of the missing jars.
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.dynamodb.DynamoDBSerDe not found
How do I convert the pyspark --jars .. invocation to a pythonic equivalent?
So far I have tried copying the jars from /usr/share/... to $SPARK_HOME/libs/jars and adding that path to the external class path in spark-defaults.conf, but that had no effect.
Use the spark-submit command to execute your Python script. Example:
spark-submit --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar script.py
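A hedged alternative sketch, not from the answer above: if the script has to be launched with plain python3, the same jars can be passed through the spark.jars configuration when the session is built, which is roughly the pythonic equivalent of --jars.
from pyspark.sql import SparkSession

jars = ",".join([
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar",
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar",
])

sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config("spark.jars", jars)   # same jars that --jars would add
    .enableHiveSupport()
    .getOrCreate())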

My SparkSession initialization takes forever to run on my laptop. Does anybody have any idea why?

My SparkSession takes forever to initialize
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.appName('Huy')
.getOrCreate())
sc = spark.sparkContext
I waited for hours without success.
I had the same problem and resolved it by setting the environment variables, which can be done directly in Python code. You need a JDK installed under Program Files.
import os
os.environ["JAVA_HOME"] = "C:\Program Files\Java\jdk-19"
os.environ["SPARK_HOME"] = "C:\Program Files\Spark\spark-3.3.1-bin-hadoop2"

Spark : Error Not found value SC

I have just started with Spark. I have CDH5 installed with Spark. However, when I try to use the SparkContext it gives the error below:
<console>:17: error: not found: value sc
val distdata = sc.parallelize(data)
I have researched this (error: not found: value sc) and tried to start the Spark context with ./spark-shell, but it gives the error "No such file or directory".
You can start spark-shell with ./spark-shell if you're in its directory, or with path/to/spark-shell if you're elsewhere.
Also, if you're running a script with spark-submit, you need to initialize sc as SparkContext first:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
There is another Stack Overflow post that answers this question by getting sc (the SparkContext) from the SparkSession. I do it this way:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("app_name").enableHiveSupport().getOrCreate()
val sc = spark.sparkContext
original answer here:
Retrieve SparkContext from SparkSession
Add the Spark directory to your PATH; then you can use spark-shell from anywhere.
Add import org.apache.spark.SparkContext if you are running a spark-submit job and creating the Spark context yourself with:
val sc = new SparkContext(conf)
where conf is already defined.
Starting a new terminal fixed the problem in my case.
You need to run the Hadoop daemons first (run the command start-all.sh), then try again.
You can run this command in the Spark (Scala) prompt:
conf.set("spark.driver.allowMultipleContexts","true")

Spark SQLContext does not work on CDH5.3

I am running Spark 1.2 on CDH 5.3 and trying some simple code in spark-shell.
It fails on
val sqlContext = new SQLContext(sc)
with the error:
not found: type SQLContext
What is wrong with my environment?
Make sure you import it:
import org.apache.spark.sql.SQLContext
