How to get all pyspark session properties (even the default values)? - apache-spark

My real problem is that I have a SQL query that runs successfully from inside a Spark session in a Jupyter notebook, but fails when I submit it through Livy. I need to compare what's different between the two sessions, but the values returned by spark.sparkContext.getConf().getAll() are the same in both.
In a pyspark shell I can get all the properties that were explicitly set with the command:
spark.sparkContext.getConf().getAll()
I can also get a lot of the cluster configurations with this code:
hadoopConf = {prop.getKey(): prop.getValue()
              for prop
              in spark.sparkContext._jsc.hadoopConfiguration().iterator()}
for i, j in sorted(hadoopConf.items()):
    print(i, '=', j)
But if I try to get the value of a property that wasn't explicitly set:
spark.conf.get("spark.memory.offHeap.size")
I get java.util.NoSuchElementException: spark.memory.offHeap.size, even though it has a default value configured in the Spark environment.
Even weirder, for some variables I can get a value even though they aren't listed above:
In [30]: spark.conf.get('spark.sql.shuffle.partitions')
Out[30]: '200'
There are other questions about this, but the answers there don't list the properties above.
How can I get the default values of these properties from inside a spark shell?
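For what it's worth, the SQL-layer configuration (which is where spark.sql.shuffle.partitions lives) can be dumped with the SET -v command; this is a sketch, assuming the usual key/value column names, and it does not cover core keys like spark.memory.offHeap.size unless they were set somewhere:
# 'SET -v' lists the SQL configuration keys Spark knows about, together with
# their current (or default) values and a short description. This is likely
# why spark.sql.shuffle.partitions resolves to '200' even though it was
# never set explicitly.
spark.sql("SET -v").show(n=1000, truncate=False)

# Or collect them into a plain dict to diff between two sessions:
sql_confs = {row['key']: row['value'] for row in spark.sql("SET -v").collect()}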

Related

Changing the magic tags in same cell - Azure Databricks

I am working on Azure Databricks, have fetched a Spark DataFrame, and need to convert it to an R data.frame. I get a syntax error when I use as.data.frame in the same cell.
When I try it in a different cell, after the %r magic tag, the same command throws a different error saying the object is not found.
You can register the Spark DataFrame as a TempView using createOrReplaceTempView
Registering a Temp View
sparkDF.createOrReplaceTempView('TempView')
Once you have done this, TempView will be accessible throughout your notebook.
Then, using %r, you can create a DataFrame from it:
SparkR
%r
library(SparkR)
sparkR <- sql('select * from TempView')
R DataFrame
%r
library(SparkR)
sparkR <- collect(sql('select * from TempView'))
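For completeness, the Python (default-language) cell that produces sparkDF and registers the view might look like this; the source table name here is just a placeholder:
# Python cell: fetch the Spark DataFrame and expose it to the %r cells
# below through a temp view. 'my_source_table' is illustrative.
sparkDF = spark.sql('select * from my_source_table')
sparkDF.createOrReplaceTempView('TempView')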

Add additional jars using PYSPARK_SUBMIT_ARGS

I have code to start a spark session
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(app_name)
spark_session = spark_session.getOrCreate()
sc = spark_session.sparkContext
Now I want to be able to add jars and packages dynamically using PYSPARK_SUBMIT_ARGS, so I set that environment variable to the following value before running the code above:
--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar --packages com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4
But I get the following error:
Error: Missing application resource.
From looking online I know this is because I am explicitly passing jars and packages, so I also need to provide the path to my main application resource, but I am confused about what that should be. I am just trying to run some code by starting a pyspark shell. I know these options can be passed when starting the shell itself, but my use case requires doing it through the environment variable, and I have not been able to find an answer to this online.
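A commonly suggested approach (a sketch, not verified against this exact setup) is to set PYSPARK_SUBMIT_ARGS from Python before the session is created and to end it with pyspark-shell, which supplies the "application resource" the error is complaining about:
import os
from pyspark.sql import SparkSession

app_name = "my_app"  # placeholder

# Must be set before the first SparkSession/SparkContext is created.
# The trailing 'pyspark-shell' is the application resource that the
# "Missing application resource" error refers to.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /usr/share/aws/redshift/jdbc/RedshiftJDBC4.jar "
    "--packages com.databricks:spark-redshift_2.10:2.0.0,"
    "org.apache.spark:spark-avro_2.11:2.4.0,"
    "com.eclipsesource.minimal-json:minimal-json:0.9.4 "
    "pyspark-shell"
)

spark_session = SparkSession.builder.appName(app_name).getOrCreate()
sc = spark_session.sparkContext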

How can I extract values from Cassandra output using Python?

I'm trying to connect to a Cassandra database from Python using the Cassandra driver, and the connection works without any problem. But when I try to fetch values from Cassandra, the output comes back in a wrapped format like Row(values).
Python version: 3.6
Package: cassandra-driver
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('employee')
k=session.execute("select count(*) from users")
print(k[0])
Output:
Row(count=11)
Expected:
11
From the documentation:
By default, each row in the result set will be a named tuple. Each row will have a matching attribute for each column defined in the schema, such as name, age, and so on. You can also treat them as normal tuples by unpacking them or accessing fields by position.
So you can access your data by name as k[0].count, or by position as k[0][0].
Please read the Getting Started document from the driver's documentation; it will answer most of your questions.
The Cassandra driver returns rows through a row factory, which by default produces named tuples.
In your case, you can access the output as k[0].count.
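Putting that together, a small sketch of the access patterns for the row returned above:
row = k[0]            # Row(count=11), a named tuple

print(row.count)      # 11 -- access by column name
print(row[0])         # 11 -- access by position
(count,) = row        # or unpack it like any ordinary tuple
print(count)          # 11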

Set spark configuration

I am trying to set the configuration of a few spark parameters inside the pyspark shell.
I tried the following
spark.conf.set("spark.executor.memory", "16g")
To check if the executor memory has been set, I did the following
spark.conf.get("spark.executor.memory")
which returned "16g".
I tried to check it through sc using
sc._conf.get("spark.executor.memory")
and that returned "4g".
Why do these two return different values, and what's the correct way to set these configurations?
Also, I am fiddling with a bunch of parameters like
"spark.executor.instances"
"spark.executor.cores"
"spark.executor.memory"
"spark.executor.memoryOverhead"
"spark.driver.memory"
"spark.driver.cores"
"spark.driver.memoryOverhead"
"spark.memory.offHeap.size"
"spark.memory.fraction"
"spark.task.cpus"
"spark.memory.offHeap.enabled "
"spark.rpc.io.serverThreads"
"spark.shuffle.file.buffer"
Is there a way to set the configurations for all of these variables?
EDIT
I need to set the configuration programmatically. How do I change it after I have run spark-submit or started the pyspark shell? I am trying to reduce the runtime of my jobs, so I am going through multiple iterations, changing the Spark configuration and recording the runtimes.
You can set environment variables (e.g. in spark-env.sh; standalone mode only):
SPARK_EXECUTOR_MEMORY=16g
You can also set defaults in spark-defaults.conf:
spark.executor.memory=16g
But these solutions are hardcoded and pretty much static, while you want different parameters for different jobs; still, they are useful for setting up some defaults.
The best approach is to use spark-submit:
spark-submit --executor-memory 16G
The problem with defining variables programmatically is that some of them need to be defined at startup time; otherwise, precedence rules take over and the changes you make after the job has started are ignored.
Edit:
The amount of memory per executor is looked up when SparkContext is created.
And
once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
See: SparkConf Documentation
Have you tried changing the variable before the SparkContext is created, then running your iteration, stopping your SparkContext and changing your variable to iterate again?
import org.apache.spark.{SparkContext, SparkConf}
val conf = new SparkConf().set("spark.executor.memory", "16g")
val sc = new SparkContext(conf)
...
sc.stop()
val conf2 = new SparkConf().set("spark.executor.memory", "24g")
val sc2 = new SparkContext(conf2)
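The same stop-and-recreate loop in PySpark could look roughly like this (the memory values are just examples):
from pyspark import SparkConf, SparkContext

# First iteration: create the context with one setting, run the job,
# record the runtime, then stop the context.
conf = SparkConf().set("spark.executor.memory", "16g")
sc = SparkContext(conf=conf)
# ... run the job and record the runtime ...
sc.stop()

# Next iteration: a fresh context with a different setting.
conf2 = SparkConf().set("spark.executor.memory", "24g")
sc2 = SparkContext(conf=conf2)
print(sc2.getConf().toDebugString())  # check what actually took effect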
You can debug your configuration using: sc.getConf.toDebugString
See: Spark Configuration
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
You'll need to make sure that your variable is not defined with higher precedence.
Precedence order (lowest to highest):
conf/spark-defaults.conf
--conf or -c - the command-line option used by spark-submit
SparkConf
I hope this helps.
In PySpark, suppose I want to increase the driver and executor memory in code. I can do it as below:
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '23g'), ('spark.driver.memory', '10g')])
To view the updated settings:
spark.sparkContext._conf.getAll()

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though this field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From Spark DataFrame documentation
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
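Equivalently, building the predicate with pyspark.sql.functions.col also resolves the column purely by name:
from pyspark.sql.functions import col

# Both records['field_i'] and col('field_i') look the column up by name,
# so they don't depend on attribute access on the DataFrame object.
records.filter(col('field_i') == 3)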
What I did was upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera Quickstart CDH-5.4.0, and the second filtering approach now works. I still can't explain why 1.3.0 has problems with it, though.
