Partitioning in spark while reading from RDBMS via JDBC - apache-spark

I am running spark in cluster mode and reading data from RDBMS via JDBC.
As per Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:
partitionColumn
lowerBound
upperBound
numPartitions
These are optional parameters.
What would happen if I don't specify these:
Only 1 worker read the whole data?
If it still reads parallelly, how does it partition data?

If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates} Spark will use a single executor and create a single non-empty partition. All data will be processed using a single transaction and reads will be neither distributed nor parallelized.
See also:
How to optimize partitioning when migrating data from JDBC source?
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

Related

From where does RDD loads data in Spark?

From where does Spark load data for RDD? Is data already present in Executor nodes or spark shuffles data from Driver node first?
From the name itself - RDD (Resilient Distributed Dataset) - it indicates that the data resides across executors when ever you create it.
Lets say when you run parallelize() for 100 entries, it will distribute that 100 entries across your executors so that each executor has its own chunk of data to do distributed processing.
Shuffling happens - when you do any operations like repartition() or coalesce().
Also if you run functions like collect() spark will try to pull all data from executor and bring it to driver(And you loose the ability of distributed processing)
This reference has more details around internals of spark - Apache Spark architecture

Can Apache Spark repartition the data received from single Kafka partition

We have a Kafka topic with multiple partitions and the load is uneven/skewed in each partition. We can't change the partitioning strategy for Kafka.
I am looking for some ways to repartition the data so that the load is equally processed by all the nodes present within the Apache Spark cluster.
Is it possible to repartition the data across all the nodes received from a single Kafka partition?
If yes, is there a way we can do it efficiently as we have to maintain the state stores also while aggregating the data

How Spark performs write operation to Hive

I am working in Spark and still new to it. I am working on a job that reads data from some source, do some transformations and write to Hive.
For writing to Hive, I am doing dataframe.write.insertInto(hive_table)
My question is how does Spark write the entire dataframe to Hive? Will it write in parallel like different partitions on different executors will be written in parallel or will it collect all the data from various partitions to driver and then try to insert in one go?
Spark and Hive partitions are different. Spark executors will be writing in parallel to various Hive partitions.
Spark partitions will be processed in parallel by executors and when each executors encounters a Hive partition key, it will write to a new file in Hive location for that key.
So if you have 5 Spark partitions being processed in parallel and data in Hive is to be written in 3 partitions, whenever each executor will encounter a key for Hive partition, it will write to a file for that partitions.
You will see 5 files in each of the 3 Hive partition location written by each of the executors

What is best approach to join data in spark streaming application?

Question : Essentially it means , rather than running a join of C* table for each streaming records , is there anyway to run a join for each micro-batch ( micro-batching ) of records in spark streaming ?
We are almost finalized to use spark-sql 2.4.x version , datastax-spark-cassandra-connector for Cassandra-3.x version.
But have one fundamental question regarding the efficiency in the below scenario.
For the streaming data records(i.e. streamingDataSet ) , I need to look up for existing records( i.e. cassandraDataset) from Cassandra(C*) table.
i.e.
Dataset<Row> streamingDataSet = //kafka read dataset
Dataset<Row> cassandraDataset= //loaded from C* table those records loaded earlier from above.
To look up data i need to join above datasets
i.e.
Dataset<Row> joinDataSet = cassandraDataset.join(cassandraDataset).where(//somelogic)
process further the joinDataSet to implement the business logic ...
In the above scenario, my understanding is ,for each record received
from kafka stream it would query the C* table i.e. data base call.
Does not it take huge time and network bandwidth if C* table consists
billions of records? What should be the approach/procedure to be
followed to improve look up C* table ?
What is the best solution in this scenario ? I CAN NOT load once from
C* table and look up as the data keep on adding to C* table ... i.e.
new look ups might need newly persisted data.
How to handle this kind of scenario? any advices plzz..
If you're using Apache Cassandra, then you have only one possibility for effective join with data in Cassandra - via RDD API's joinWithCassandraTable. The open source version of the Spark Cassandra Connector (SCC) supports only it, while in version for DSE, there is a code that allows to perform effective join against Cassandra also for Spark SQL - so-called DSE Direct Join. If you'll use join in Spark SQL against Cassandra table, Spark will need to read all data from Cassandra, and then perform join - that's very slow.
I don't have an example for OSS SCC for doing the join for Spark Structured Streaming, but I have some examples for "normal" join, like this:
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
trdd.joinWithCassandraTable("test", "jtest",
someColumns("id", "v"), someColumns("id"),
mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));

Hive partitions to Spark partitions

We need to work on a big dataset with partitioned data, for efficiency reasons. Data source resides in Hive, but with a different partition criteria. In other words, we need to retrieve data from Hive to Spark, and re-partition in Spark.
But there is an issue in Spark that causes reordering/redistributing partitioning when data is persisted (either to parquet or ORC). Therefore, our new partitioning in Spark is lost.
As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (for read)?
Partition Discovery --> might be what you are looking for:
" Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. "

Resources