How many Spark Sessions to create? - apache-spark

We are building a data ingestion framework in PySpark.
The first step is to get or create a SparkSession with our app name. The structure of dataLoader.py is outlined below.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('POC') \
    .enableHiveSupport() \
    .getOrCreate()
# create data frame from file
# process file
If I have to execute this dataLoader.py concurrently to load different files, would sharing the same SparkSession cause an issue?
Do I have to create a separate SparkSession for every ingestion?

No, you don't create multiple Spark sessions. A SparkSession should be created only once per Spark application. Spark doesn't support this, and your job might fail if you use multiple SparkSessions in the same Spark job. Here is SPARK-2243, where the ticket was closed as "Won't Fix".
If you want to load different files using dataLoader.py, there are two options:
Load and process the files sequentially: load one file at a time, save it to a DataFrame, and process that DataFrame (see the sketch after these options).
Create a different dataLoader.py script for each file and run each Spark job in parallel. Here each Spark job gets its own SparkSession.
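A minimal sketch of the first option, assuming CSV inputs and a placeholder process() function standing in for the framework's per-file logic:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('POC').enableHiveSupport().getOrCreate()

# hypothetical list of input files; substitute the real paths
files = ['data/file1.csv', 'data/file2.csv', 'data/file3.csv']

def process(df):
    # stand-in for whatever per-file processing the framework performs
    df.show()

for path in files:
    df = spark.read.csv(path, header=True, inferSchema=True)  # load one file at a time
    process(df)                                                # reuse the same SparkSession

spark.stop()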

Yet another option is to create a SparkSession once, share it among several threads, and enable FAIR job scheduling. Each of the threads would execute a separate Spark job, i.e. call collect or another action on a DataFrame. The optimal number of threads depends on the complexity of your job and the size of the cluster. If there are too few jobs, the cluster may be underutilized and waste its resources. If there are too many threads, the cluster will be saturated and some jobs will sit idle, waiting for executors to free up.
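A minimal PySpark sketch of that approach; the file paths and pool names are illustrative:

import threading
from pyspark.sql import SparkSession

# Enable the FAIR scheduler so jobs submitted from different threads
# share the cluster instead of queuing strictly FIFO.
spark = SparkSession.builder \
    .appName('POC') \
    .config('spark.scheduler.mode', 'FAIR') \
    .getOrCreate()

files = ['data/file1.csv', 'data/file2.csv']  # hypothetical input paths

def ingest(path, pool):
    # Assign this thread's jobs to a scheduler pool (illustrative names).
    spark.sparkContext.setLocalProperty('spark.scheduler.pool', pool)
    df = spark.read.csv(path, header=True)
    df.count()  # each action submits its own Spark job

threads = [threading.Thread(target=ingest, args=(f, 'pool_%d' % i))
           for i, f in enumerate(files)]
for t in threads:
    t.start()
for t in threads:
    t.join()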

Each Spark job is independent, and there can be only one active SparkContext per JVM. While you can create additional SparkSession objects with newSession(), they all share that single SparkContext, so you won't get truly independent session instances.

You could create a new Spark application for every file, which is certainly possible since each Spark application has one corresponding SparkSession, but it is usually not the recommended way. Loading multiple files with the same SparkSession object is usually preferred.
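As a small illustration of the preferred approach (the file names are hypothetical), PySpark's DataFrameReader accepts several paths at once:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('POC').getOrCreate()

# Read several files into one DataFrame with the same session;
# a glob pattern such as 'data/*.csv' would also work.
df = spark.read.csv(['data/file1.csv', 'data/file2.csv'], header=True)
df.show()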

Related

Do Spark SQL executions use a thread-local job group?

From my findings, running multiple Spark SQL queries with different job groups does not put them in the specified groups.
https://issues.apache.org/jira/browse/SPARK-29340
Creating a new thread-local job group works for Spark DataFrame jobs but not for Spark SQL. Is there a way to put all thread-local Spark SQL executions in a separate job group?
val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()
sparkThreadLocal.sparkContext.setJobGroup("<id>", "<description>")
OR
sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", "<id>")
sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", "<description>")
Solved! It was an issue with using Scala parallel iteration, which uses thread pools, so the job group set on the calling thread was not picked up by the pool threads that actually submitted the jobs.

How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from a Kafka topic that has 20 partitions. The application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from Kafka with a batch interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only; no reduce transformations
Write to HBase using check-and-put
I wonder: if executors and partitions are mapped 1:1, will every executor independently perform the above steps and write to HBase on its own, or will the data be shuffled among executors and operations happen between the driver and the executors?
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks are executed; the driver's role is to coordinate the tasks and schedule them accordingly.
With that said, I'd say the following is true:
will every executor independently perform above steps and write to HBase independently
By the way, the answer does not depend on the Spark version in use. It has always been like this (and I don't see any reason why it would or even should change).
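Since the pipeline is map-only, each partition can be validated, transformed, and written out entirely on the executor that holds it. A minimal PySpark sketch under that assumption, where the broker/topic settings and write_partition are placeholders for the real validation, Drools, and HBase code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 2.x streaming-kafka-0-8 API

sc = SparkContext(appName='kafka-map-only')
ssc = StreamingContext(sc, 15)  # 15-second batch interval

stream = KafkaUtils.createDirectStream(
    ssc, topics=['events'], kafkaParams={'metadata.broker.list': 'broker:9092'})

def validate(record):
    # stand-in for the validation / rules-engine logic
    return record

def write_partition(records):
    # Runs on the executor that owns the partition; open an HBase
    # connection here and write the records (client code omitted).
    for r in records:
        pass

(stream
    .map(lambda kv: kv[1])  # value only
    .map(validate)          # map-only transformations, so no shuffle
    .foreachRDD(lambda rdd: rdd.foreachPartition(write_partition)))

ssc.start()
ssc.awaitTermination()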

Creating many, short-living SparkSessions

I've got an application that orchestrates batch job executions and I want to create a SparkSession per job execution - especially in order to get a clean separation of registered temp views, functions etc.
So this would lead to thousands of SparkSessions per day that only live for the duration of a job (from a few minutes up to several hours). Is there any argument against doing this?
I am aware that there is only one SparkContext per JVM. I also know that a SparkContext performs some JVM-global caching, but what exactly does this mean for this scenario? What, for example, is cached in a SparkContext, and what would happen if many Spark jobs are executed using those sessions?
This shows how multiple sessions can be built with different configurations. Use
spark1.clearActiveSession();
spark1.clearDefaultSession();
to clear the sessions.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark1 = SparkSession.builder()
        .master("local[*]")
        .appName("app1")
        .getOrCreate();
Dataset<Row> df = spark1.read().format("csv").load("data/file1.csv");
df.show();

spark1.clearActiveSession();
spark1.clearDefaultSession();

SparkSession spark2 = SparkSession.builder()
        .master("local[*]")
        .appName("app2")
        .getOrCreate();
Dataset<Row> df2 = spark2.read().format("csv").load("data/file2.csv");
df2.show();
For your questions:
The Spark context keeps cached RDDs in memory for quicker processing.
If there is a lot of data, the cached tables or RDDs spill to disk.
A session can access a table if it was saved as a view at any point.
It is better to do multiple spark-submits for your jobs, each with a unique id, instead of having different configs.
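For the temp-view separation specifically, here is a minimal PySpark sketch (the session and view names are illustrative): sessions created with newSession() share the single SparkContext and its JVM-global caches, but each keeps its own catalog of temporary views and registered functions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('batch-orchestrator').getOrCreate()

# A per-job session: shares the SparkContext and cached data with `spark`,
# but has its own registry of temp views and functions.
job_session = spark.newSession()

job_session.range(10).createOrReplaceTempView('job_input')

print([t.name for t in job_session.catalog.listTables()])  # includes 'job_input'
print([t.name for t in spark.catalog.listTables()])        # does not: the view is session-scoped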

How does MLlib code run on Spark?

I am new to distributed computing, and I'm trying to run KMeans on EC2 using Spark's MLlib KMeans. As I was reading through the tutorial, I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do nodes communicate the partial result of each iteration? Is this handled inside the KMeans.train code, or does Spark core take care of it automatically?
Spark divides data into many partitions. For example, if you read a file from HDFS, the partitions should match the partitioning of the data in HDFS. You can manually specify the number of partitions with repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by e.g. a HashPartitioner, which looks at the hash of the data.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. How the partitions of the data are created is hidden in the RDD.getPartitions methods.
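A minimal PySpark sketch of the partitioning side, assuming an input file of whitespace-separated numeric features like the one used in the MLlib k-means example (the path and parameters are placeholders):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName='kmeans-partitions')

# Each HDFS block of the input typically becomes one partition.
data = sc.textFile('hdfs:///path/to/kmeans_data.txt')
print(data.getNumPartitions())

# Parse the lines and optionally repartition, e.g. to match the number of cores.
parsed = data.map(lambda line: [float(x) for x in line.split()]).repartition(16)

# Training runs as a series of distributed jobs on the executors; only the
# resulting cluster centers come back to the driver.
model = KMeans.train(parsed, k=3, maxIterations=10)
print(model.clusterCenters)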
Resource scheduling depends on the cluster manager. We could write a very long post about them ;) I think that in this question the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is turned into tasks sent to the executors, and the executors run them on the data - launching the closures on each partition.
Currently, after each iteration the result is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is creating the possibility to schedule multiple jobs at one time, so data won't be sent back to the driver between them. It would create the DAG once and schedule, e.g., a job with 10 iterations; the shuffle between them would be limited or would not exist at all. Currently the DAG is created in each iteration and a job is scheduled to the executors.
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.

How does Apache Spark assign partition-ids to its executors

I have a long-running Spark Streaming job which uses 16 executors with only one core each.
I use the default partitioner (HashPartitioner) to distribute data equally across 16 partitions. Inside the updateStateByKey function, I checked the partition id from TaskContext.getPartitionId() over multiple batches and found that the partition-id handled by an executor is quite consistent, but it still changes to another id after a long run.
I'm planning to do some optimization on Spark's updateStateByKey API, but it can't be achieved if the partition-id keeps changing between batches.
So when does Spark change the partition-id handled by an executor?
Most probably, a task failed and was restarted on another executor, so the TaskContext changed, and with it the partition id seen by that executor.
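A small PySpark sketch of how the partition id can be inspected per batch (the socket source and batch interval are illustrative; the same TaskContext lookup works inside the updateStateByKey function):

from pyspark import SparkContext, TaskContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='partition-id-check')
ssc = StreamingContext(sc, 10)  # 10-second batches, illustrative

# Illustrative source; in the real job this would be the keyed DStream
# feeding updateStateByKey.
stream = ssc.socketTextStream('localhost', 9999)

def tag_with_partition(records):
    pid = TaskContext.get().partitionId()  # partition handled by this task
    return [(pid, r) for r in records]

# Print which partition each record was processed in, batch by batch.
stream.mapPartitions(tag_with_partition).pprint()

ssc.start()
ssc.awaitTermination()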
