How does Spark perform a write operation to Hive? - apache-spark

I am working with Spark and am still new to it. I am working on a job that reads data from a source, does some transformations, and writes to Hive.
For writing to Hive, I am doing dataframe.write.insertInto(hive_table).
My question is: how does Spark write the entire dataframe to Hive? Will it write in parallel, with different partitions on different executors being written in parallel, or will it collect all the data from the various partitions to the driver and then try to insert it in one go?

Spark and Hive partitions are different things. Spark executors write to the various Hive partitions in parallel.
Spark partitions are processed in parallel by the executors, and whenever an executor encounters a Hive partition key, it writes to a new file under the Hive location for that key.
So if you have 5 Spark partitions being processed in parallel and the data needs to land in 3 Hive partitions, then whenever an executor encounters a key for a Hive partition, it writes to a file for that partition.
You will see up to 5 files in each of the 3 Hive partition locations, one from each Spark partition that contains data for that key.
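To make that concrete, here is a minimal Scala sketch of a parallel, partitioned write via insertInto, assuming a Hive-enabled SparkSession and a hypothetical Hive table sales partitioned by country (paths and names are illustrative).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HivePartitionedWrite")
  .enableHiveSupport()
  .getOrCreate()

// Dynamic partition inserts may require these Hive settings, depending on your setup.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

// Suppose the source DataFrame ends up in 5 Spark partitions and contains 3 distinct
// values of the target table's partition column (country).
val df = spark.read.parquet("/data/source")   // hypothetical source path
  .repartition(5)

// insertInto resolves columns by position, and the partition column(s) must come last.
// Each executor task writes its own file under every Hive partition directory whose
// key it encounters, so the write is parallel and nothing is collected to the driver.
df.write.insertInto("sales")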

Related

From where does an RDD load data in Spark?

From where does Spark load data for an RDD? Is the data already present on the executor nodes, or does Spark shuffle the data from the driver node first?
From the name itself - RDD (Resilient Distributed Dataset) - it indicates that the data resides across the executors whenever you create it.
Let's say you run parallelize() on 100 entries: Spark will distribute those 100 entries across your executors so that each executor has its own chunk of data for distributed processing.
Shuffling happens when you run operations such as repartition() (coalesce() can avoid a full shuffle when it only reduces the partition count).
Also, if you run functions like collect(), Spark will pull all the data from the executors and bring it to the driver (and you lose the ability to do distributed processing).
This reference has more details on the internals of Spark - Apache Spark architecture
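A small spark-shell style Scala sketch of those points (the numbers are illustrative).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddDistribution").getOrCreate()
val sc = spark.sparkContext

// parallelize() splits the 100 entries into partitions that live on the executors.
val rdd = sc.parallelize(1 to 100)
println(rdd.getNumPartitions)          // defaults to spark.default.parallelism

// repartition() always shuffles; coalesce() can avoid a full shuffle when only
// reducing the number of partitions.
val reshuffled = rdd.repartition(10)

// collect() pulls every partition back to the driver, giving up distributed processing.
val allOnDriver: Array[Int] = reshuffled.collect()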

Get number of partitions or RDDs in an Apache Spark job

I want to create an autoscaling service for Apache Spark. I would like to add executors to Spark according to the number of partitions or RDDs Spark creates, so that a job is processed efficiently. I am using Spark standalone in cluster mode.
Things I would like to know/achieve:
Number of partitions in a job (metric or endpoint)
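For reference, a Scala sketch of two ways one might read off the partition count, with a hypothetical input path; the REST path below is the standard Spark monitoring API.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionMetrics").getOrCreate()

// 1) Inside the job: ask the RDD/DataFrame directly.
val df = spark.read.parquet("/data/input")        // hypothetical input path
println(s"partitions = ${df.rdd.getNumPartitions}")

// 2) Outside the job: the monitoring REST API exposes per-stage task counts, e.g.
//    GET http://<driver>:4040/api/v1/applications/<app-id>/stages
//    where numTasks for a stage equals the number of partitions that stage processes.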

Hive partitions to Spark partitions

We need to work on a big dataset with partitioned data, for efficiency reasons. The data source resides in Hive, but with different partition criteria. In other words, we need to retrieve the data from Hive into Spark and re-partition it in Spark.
But there is an issue in Spark that causes the partitioning to be reordered/redistributed when the data is persisted (to either Parquet or ORC). Therefore, our new partitioning in Spark is lost.
As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (for reads)?
Partition Discovery might be what you are looking for:
"By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths."

How to process Kafka partitions separately and in parallel with Spark executors?

I use Spark 2.1.1.
I read messages from 2 Kafka partitions using Structured Streaming. I am submitting my application to a Spark Standalone cluster with one worker and 2 executors (2 cores each).
./bin/spark-submit \
--class MyClass \
--master spark://HOST:IP \
--deploy-mode cluster \
/home/ApplicationSpark.jar
I want the messages from each Kafka partition to be processed independently by a separate executor. But what is happening now is that the executors read and .map the partition data separately, yet after mapping, the unbounded table that is formed is shared and contains data from both partitions.
So when I run a structured query on the table, the query has to deal with data from both partitions (a larger amount of data).
select product_id, max(order_time), max(product_price), min(product_price)
from OrderRecords
group by WINDOW(order_time, "120 seconds"), product_id
where the Kafka topic is partitioned on product_id.
Is there any way to run the same structured query in parallel, but separately on the data from the Kafka partition to which each executor is mapped?
But what is happening now is that the executors read and .map the partition data separately, yet after mapping, the unbounded table that is formed is shared and contains data from both partitions. Hence, when I run the structured query on the table, the query has to deal with data from both partitions (a larger amount of data).
That's the key to understanding what can be executed, and how, without causing a shuffle and sending data across partitions (possibly even over the wire).
The definitive answer depends on what your queries are. If they work on groups of records where the groups are spread across multiple topic partitions, and hence across two different Spark executors, you'd have to be extra careful with your algorithm/transformation to do the processing on the separate partitions (using only what's available within a partition) and aggregate the results only afterwards.
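For illustration, here is roughly how that query could look as a Structured Streaming job in Scala, assuming a hypothetical topic orders and an assumed (product_id, order_time, product_price) message schema; the groupBy below introduces a shuffle regardless of how executors map to Kafka partitions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("OrderAggregates").getOrCreate()
import spark.implicits._

// Assumed message schema (illustrative).
val orderSchema = new StructType()
  .add("product_id", StringType)
  .add("order_time", TimestampType)
  .add("product_price", DoubleType)

// Each executor core reads its assigned Kafka partition(s) in parallel.
val orders = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "HOST:9092")   // illustrative address
  .option("subscribe", "orders")                    // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", orderSchema).as("o"))
  .select("o.*")

// groupBy(window(...), product_id) shuffles rows by the grouping key, so data
// can move between the two executors even though each one reads only a single
// Kafka partition.
val aggregates = orders
  .groupBy(window($"order_time", "120 seconds"), $"product_id")
  .agg(max("order_time"), max("product_price"), min("product_price"))

aggregates.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()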

RDD and partition in Apache Spark

So, in Spark, when an application is started, an RDD containing the dataset for the application (e.g. the words dataset for WordCount) is created.
So far, my understanding is that an RDD is a collection of those words in WordCount plus the operations that have been applied to that dataset (e.g. map, reduceByKey, etc.).
However, as far as I know, Spark also has HadoopPartition (or, in general, partitions) which every executor reads from HDFS. And I believe that the RDD on the driver also contains all of these partitions.
So, what is getting divided among the executors in Spark? Does every executor get its sub-dataset as a single RDD containing less data than the RDD on the driver, or does every executor only deal with these partitions and read them directly from HDFS? Also, when are the partitions created? On RDD creation?
Custom partitioning is configurable only when the RDD is key-value based.
There are three main properties of partitions:
Tuples in the same partition are guaranteed to be on the same machine.
Each node in a cluster can contain more than one partition.
The total number of partitions is configurable; by default it is set to the total number of cores on all the executor nodes.
Spark supports two types of partitioning:
Hash Partitioning
Range Partitioning
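A small Scala sketch of the two partitioner types applied to a key-value RDD (the data is purely illustrative).
import org.apache.spark.{HashPartitioner, RangePartitioner}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Partitioners").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

// Hash partitioning: key.hashCode modulo the partition count picks the partition.
val hashed = pairs.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)            // Some(HashPartitioner@...)

// Range partitioning: keys are sampled and split into sorted, contiguous ranges.
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))
println(ranged.getNumPartitions)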
When Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop InputFormat used to read this file.
When you call rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in rdd into the x partitions you want to have; the repartitioning is done on a round-robin basis.
Please see more details here and here
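A minimal Scala sketch of the input-split and repartition behaviour described above (the HDFS path and sizes are illustrative).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HdfsPartitions").getOrCreate()
val sc = spark.sparkContext

// One partition per input split: a 1 GB text file with a 128 MB HDFS block size
// typically yields 8 partitions.
val lines = sc.textFile("hdfs:///data/words.txt")
println(lines.getNumPartitions)

// repartition(x) shuffles the existing N partitions into x new ones.
val evened = lines.repartition(16)
println(evened.getNumPartitions)       // 16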
Your RDD has rows in it. If it is a text file, it has lines separated by \n.
Those rows are divided into partitions across the different nodes in the Spark cluster.

Resources