Get number of partitions or RDDs in an Apache Spark job - apache-spark

I want to create an autoscaling service for Apache Spark. I would like to add executors according to the number of partitions or RDDs Spark creates, so that a job is processed efficiently. I am using Spark standalone in cluster mode.
Things I would like to know/achieve:
Number of partitions in a job (a metric or an endpoint)
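A minimal sketch of how such a service could read that number, assuming the standard Spark monitoring REST API on the driver; the input path, host, and port below are placeholders:

    import scala.io.Source
    import org.apache.spark.sql.SparkSession

    object PartitionCountProbe {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("partition-count-probe").getOrCreate()

        // Inside the job, the partition count of any Dataset/RDD is available directly.
        val df = spark.read.parquet("hdfs:///data/events")   // placeholder input
        println(s"partitions: ${df.rdd.getNumPartitions}")

        // From the outside, an autoscaler can poll the driver's monitoring REST API;
        // each stage entry carries a numTasks field (one task per partition of the stage).
        val appId = spark.sparkContext.applicationId
        val stagesJson =
          Source.fromURL(s"http://driver-host:4040/api/v1/applications/$appId/stages").mkString // placeholder host/port
        println(stagesJson)

        spark.stop()
      }
    }

An external autoscaler would parse that JSON and compare the task counts of pending stages against the executors currently registered.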

Related

From where does an RDD load data in Spark?

From where does Spark load data for an RDD? Is the data already present on the executor nodes, or does Spark first shuffle data from the driver node?
The name itself, RDD (Resilient Distributed Dataset), indicates that the data resides across executors whenever you create it.
Let's say you run parallelize() on 100 entries: Spark will distribute those 100 entries across your executors so that each executor has its own chunk of data for distributed processing.
Shuffling happens when you run operations like repartition() (or coalesce() with shuffle enabled).
Also, if you run actions like collect(), Spark will pull all the data from the executors and bring it to the driver (and you lose the benefit of distributed processing).
This reference has more details on the internals of Spark: Apache Spark architecture
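A small spark-shell sketch of the points above (sc is the SparkContext; the partition counts you see depend on your default parallelism):

    val rdd = sc.parallelize(1 to 100)     // 100 entries split across the executors
    println(rdd.getNumPartitions)          // typically spark.default.parallelism

    // map() is a narrow transformation: each chunk stays on its executor, no shuffle.
    val doubled = rdd.map(_ * 2)

    // repartition() forces a shuffle, moving data between executors.
    val reshaped = doubled.repartition(10)

    // collect() pulls everything back to the driver -- fine for 100 entries,
    // but it gives up distributed processing for large datasets.
    val all = reshaped.collect()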

How Spark performs write operation to Hive

I am working in Spark and still new to it. I am working on a job that reads data from some source, does some transformations, and writes to Hive.
For writing to Hive, I am doing dataframe.write.insertInto(hive_table)
My question is: how does Spark write the entire dataframe to Hive? Will it write in parallel, with different partitions written by different executors at the same time, or will it collect all the data from the various partitions to the driver and then try to insert it in one go?
Spark and Hive partitions are different. Spark executors will be writing in parallel to various Hive partitions.
Spark partitions will be processed in parallel by executors, and when an executor encounters a Hive partition key, it will write to a new file in the Hive location for that key.
So if you have 5 Spark partitions being processed in parallel and the data is to be written into 3 Hive partitions, then whenever an executor encounters a key for a Hive partition, it will write to a file for that partition.
You will see up to 5 files in each of the 3 Hive partition locations, one written by each of the executors.
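A hedged sketch of that write path; the table, path, and partition column here are made up, and Hive support is assumed to be enabled on the session:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-insert-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source reshaped into 5 Spark partitions.
    val df = spark.read.parquet("hdfs:///staging/events").repartition(5)

    // Each executor writes its own Spark partitions directly; nothing is collected
    // to the driver. A task that sees rows for several values of the Hive partition
    // column (say, dt) opens one file per value it encounters.
    df.write.insertInto("db.events_by_dt")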

How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from Kafka that has 20 partitions. This Spark application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from Kafka with a batch interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only (no reduce transformations)
Write to HBase using check-and-put
I wonder: if executors and partitions are mapped 1:1, will every executor independently perform the above steps and write to HBase independently, or will data be shuffled between multiple executors, with operations happening between the driver and the executors?
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks are executed. The driver's role is to coordinate the tasks and schedule them accordingly.
With that said, I'd say the following is true:
will every executor independently perform above steps and write to HBase independently
By the way, the answer does not depend on which Spark version is in use. It has always been like this (and I don't see any reason why it would, or even should, change).
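A rough sketch of such a pipeline, assuming the spark-streaming-kafka-0-8 direct stream (compatible with Kafka 0.9 brokers); the broker, topic, and the validation/Drools/HBase steps are stand-ins, not the asker's actual code:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-map-only")
    val ssc  = new StreamingContext(conf, Seconds(15))                 // 15-second batches

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")    // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                                 // placeholder topic

    // Stand-ins for the real validation and Drools steps.
    def validate(v: String): String = v.trim
    def applyRules(v: String): Iterator[String] = Iterator(v)

    // map/flatMap are narrow: each Kafka partition is processed end-to-end by the
    // task that read it -- no shuffle, and no data passes through the driver.
    val transformed = stream.map { case (_, value) => validate(value) }.flatMap(applyRules)

    transformed.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // The real job would open one HBase connection here (on the executor that
        // owns the partition) and issue Table.checkAndPut per record.
        records.foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()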

Partitioning in spark while reading from RDBMS via JDBC

I am running spark in cluster mode and reading data from RDBMS via JDBC.
As per Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:
partitionColumn
lowerBound
upperBound
numPartitions
These are optional parameters.
What would happen if I don't specify these:
Will only one worker read the whole data?
If it still reads in parallel, how does it partition the data?
If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates}, Spark will use a single executor and create a single non-empty partition. All data will be processed in a single transaction, and reads will be neither distributed nor parallelized.
See also:
How to optimize partitioning when migrating data from JDBC source?
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
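A sketch of both variants, with made-up connection details and an assumed numeric order_id column:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

    val url   = "jdbc:postgresql://db-host:5432/sales"   // placeholder connection string
    val props = new java.util.Properties()
    props.setProperty("user", "reader")
    props.setProperty("password", "secret")

    // Without the partitioning options: one partition, one executor, one big query.
    val single = spark.read.jdbc(url, "orders", props)

    // With the partitioning options: numPartitions queries run in parallel, each
    // covering a range of partitionColumn; lowerBound/upperBound only set the
    // stride of the ranges, they do not filter rows.
    val parallel = spark.read.jdbc(
      url,
      "orders",
      columnName = "order_id",
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 8,
      connectionProperties = props)

    println(single.rdd.getNumPartitions)    // 1
    println(parallel.rdd.getNumPartitions)  // 8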

Does Spark ensure data locality?

When I submit my Spark job into a YARN cluster with --num-executors=4, I can see in the Spark UI that 4 executors are allocated on 4 nodes in the cluster. In my Spark application I take input from various HDFS locations in various steps, but the allocated executors remain the same throughout the execution.
My doubt is whether Spark does anything for data locality, since it selects the nodes at the very beginning irrespective of where the input data is situated (at least in the case of HDFS).
I know MapReduce does this to some extent.
Yes, it does. Spark still uses the Hadoop InputFormat and RecordReader interfaces and appropriate implementations like TextInputFormat, so Spark's behaviour in this case is very similar to plain MapReduce. The Spark driver retrieves the block locations of the file and assigns tasks to executors with regard to data locality.
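A small sketch for checking what the driver computed, using SparkContext.getPreferredLocs (a developer API); the path is a placeholder:

    val rdd = sc.textFile("hdfs:///data/large-file.txt")   // placeholder HDFS path

    // For each partition the driver knows the HDFS block locations and prefers
    // scheduling the corresponding task on an executor running on one of those hosts.
    rdd.partitions.indices.foreach { i =>
      println(s"partition $i -> ${sc.getPreferredLocs(rdd, i).mkString(", ")}")
    }

Whether a task then actually runs NODE_LOCAL, RACK_LOCAL, or ANY depends on where the already-allocated executors sit, which you can see per task in the Spark UI.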
