Spark 2.4.6 + JDBC Reader: When predicate pushdown is set to false, is data read in parallel by Spark from the engine? - apache-spark

I am trying to extract data from a big table in SAP HANA, around 1.5 TB in size, and the best way is to run the extraction in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but to actually extract in parallel it requires the partition column, lower/upper bound, and number of partitions options to be set. To make the extraction easier to operate, I considered adding a derived partition column computed with the row_number() function and using its MIN() and MAX() as the lower and upper bounds respectively. The operations team would then only need to provide the number of partitions.
The problem is that HANA runs out of memory, and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the WHERE filters and retrieve the corresponding chunk.
So my question is: if I disable the predicate pushdown option, how does Spark behave? Is the table read by only one executor, with the filters then applied on the Spark side? Or does it do some magic to split the fetching part from the DB?
What would you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.

Before executing your primary query from Spark, run a pre-ingestion query to fetch the bounds of the dataset being loaded, i.e. the MIN() and MAX() of the key column, as you mentioned.
Assuming the data is uniformly distributed between the min and max keys, you can partition the read across executors in Spark by providing the min, the max, and the number of partitions.
You don't need (or want) to change your primary data source by adding extra columns to support data ingestion in this case.
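To make the min/max/partition-count idea concrete, here is a minimal sketch in plain Python of how a numeric range can be cut into one WHERE clause per partition. It only approximates what Spark's JDBC source does with partitionColumn, lowerBound, upperBound and numPartitions; the function name partition_predicates and the column name ID are hypothetical, and the stride arithmetic is simplified.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE-clause string per partition for a numeric column.

    Simplified approximation: the bounds only decide the stride. The first
    and last partitions are left open-ended, so rows outside [lower, upper)
    are still read rather than filtered out.
    """
    if num_partitions < 2:
        raise ValueError("this sketch needs at least two partitions")
    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    bound = lower
    for i in range(num_partitions):
        low_clause = f"{column} >= {bound}" if i > 0 else None
        bound += stride
        high_clause = f"{column} < {bound}" if i < num_partitions - 1 else None
        if low_clause and high_clause:
            predicates.append(f"{low_clause} AND {high_clause}")
        elif high_clause:
            # First partition also collects NULL keys.
            predicates.append(f"{high_clause} OR {column} IS NULL")
        else:
            predicates.append(low_clause)
    return predicates


# Each predicate would become the filter of a separate query against the database.
for predicate in partition_predicates("ID", 0, 100, 4):
    print(predicate)
```

Because every partition turns into its own query, an existing indexed key column is likely to be much cheaper for the database than a column that has to be computed for the whole table on each query.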

Related

How does merge-sort join work in Spark and why can it throw OOM?

I want to understand the concept of merge-sort join in Spark in depth.
I understand the overall idea: it is the same approach as in the merge sort algorithm. Take two sorted datasets, compare the first rows, write the smallest one, repeat.
I also understand how I can implement a distributed merge sort.
But I cannot see how it is implemented in Spark with respect to the concepts of partitions and executors.
Here is my take.
Suppose I need to join two tables, A and B. The tables are read from Hive via Spark SQL, if this matters.
By default Spark uses 200 partitions.
Spark will then calculate the join key range (from minKey(A,B) to maxKey(A,B)) and split it into 200 parts. Both datasets are split by key range into 200 parts: A-partitions and B-partitions.
Each A-partition and each B-partition that relate to the same key range are sent to the same executor and are sorted there separately from each other.
Now 200 executors can join 200 A-partitions with 200 B-partitions, with the guarantee that they share the same key range.
The join happens via the merge-sort algorithm: take the smallest key from the A-partition, compare it with the smallest key from the B-partition, write a match, or iterate.
Finally, I have 200 partitions of my data which are joined.
Does it make sense?
Issues:
Skewed keys. If some key range comprises 50% of the dataset keys, some executor would suffer, because too many rows would go to the same partition.
It can even fail with OOM while trying to sort a too-big A-partition or B-partition in memory (I cannot see why Spark cannot sort with disk spill, as Hadoop does). Or maybe it fails because it tries to read both partitions into memory for joining?
So, this was my guess. Could you please correct me and help me understand the way Spark works?
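The merge step described in the question can be sketched in plain Python for a single pair of partitions. This is an illustration of the general sort-merge join technique, not Spark's implementation; the function name sort_merge_join and the (key, value) row layout are assumptions made for the example.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows using sort-merge."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # no partner for this left row, advance left
        elif left_key > right_key:
            j += 1  # no partner for this right row, advance right
        else:
            # Find the run of right rows that share this key.
            run_end = j
            while run_end < len(right) and right[run_end][0] == left_key:
                run_end += 1
            # Pair every left row carrying the key with the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:run_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = run_end
    return joined


a_partition = [(3, "a3"), (1, "a1"), (2, "a2"), (2, "a2b")]
b_partition = [(2, "b2"), (4, "b4"), (2, "b2b"), (1, "b1")]
print(sort_merge_join(a_partition, b_partition))
```

Note that the run of equal-key rows on one side has to be held while it is paired with the other side, so a key with a very large number of duplicates costs memory even though the rest of the merge is streaming.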
This is a common problem with joins on MPP databases, and Spark is no different. As you say, to perform a join, all the data for the same join key value must be colocated, so if you have a skewed distribution on the join key, you have a skewed distribution of data and one node gets overloaded.
If one side of the join is small, you could use a map-side join. The Spark query planner really ought to do this for you, but it is tunable. I am not sure how current this is, but it looks useful.
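As a rough illustration of the map-side join idea, the sketch below builds a lookup table from the small side and probes it from each partition of the large side, so the large side is never shuffled. It is plain Python with a hypothetical function name (map_side_join) and made-up rows; in Spark the equivalent is a broadcast hash join, and the size limit for picking it automatically is controlled by spark.sql.autoBroadcastJoinThreshold.

```python
def map_side_join(large_partitions, small_table):
    """Join each partition of the large side against a copy of the small side."""
    # Build the lookup once; in Spark this copy would be sent to every executor.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined_partitions = []
    for partition in large_partitions:
        joined = []
        for key, value in partition:
            # Rows stay in their original partition; only the lookup is probed.
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
        joined_partitions.append(joined)
    return joined_partitions


large = [[(1, "x"), (2, "y")], [(1, "z"), (3, "w")]]
small = [(1, "one"), (2, "two")]
print(map_side_join(large, small))
```

Since no rows of the large table move, a skewed join key on the large side does not pile up on a single task the way it does in a shuffle join.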
Did you run ANALYZE TABLE on both tables?
If you have a key on both sides that won't break the join semantics you could include that in the join.
why Spark cannot sort with disk spill, as Hadoop does?
Spark's merge-sort join does spill to disk. Looking at Spark's SortMergeJoinExec class, it uses ExternalAppendOnlyUnsafeRowArray, which is described as:
An append-only array for UnsafeRows that strictly keeps content in an in-memory array until numRowsInMemoryBufferThreshold is reached post which it will switch to a mode which would flush to disk after numRowsSpillThreshold is met (or before if there is excessive memory consumption)
This is consistent with the experience of seeing tasks spilling to disk during a join operation from the Web UI.
why [merge-sort join] can throw OOM?
From the Spark Memory Management overview:
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
i.e. in the case of a join, increase spark.sql.shuffle.partitions to reduce the size of the partitions and of the resulting hash table, and correspondingly reduce the risk of OOM.
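A small plain-Python experiment shows what raising the partition count does and does not fix. It is a sketch under stated assumptions: key % num_partitions stands in for Spark's real hash partitioner, and the key sets are made up.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition under modulo partitioning."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


uniform_keys = list(range(100_000))
# Half of the rows share the single hot key 7.
skewed_keys = [7] * 50_000 + list(range(50_000))

for n in (20, 200):
    print(n, max(partition_sizes(uniform_keys, n)), max(partition_sizes(skewed_keys, n)))
```

With evenly spread keys the largest partition shrinks in proportion to the partition count. With a hot key it barely changes, because all rows for one key value must still land in the same partition, so more partitions alone will not cure skew.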

Mysql or Spark Processing of 400gb data

If I use Spark in my case, will it be useful, given how it works with blocks and cores?
I have 400 GB of data in a single MySQL table, User_events, with multiple columns. This table stores all user events from the application. Indexes exist on the required columns. I have a user interface where the user can try different permutations and combinations of fields from user_events.
Currently I am facing performance issues where a query either takes 15 to 20 seconds, or even longer, or times out.
I have gone through a couple of Spark tutorials, but I am not sure if it can help here. As per my understanding of Spark:
First, Spark has to bring all the data into memory. Bringing 100 M records over the network will be a costly operation, and I will need a lot of memory for it. Isn't that so?
Once the data is in memory, Spark can distribute it among partitions based on cores and input data size. Then it can filter the data on each partition in parallel. Here Spark can be beneficial, as it can do the operation in parallel while MySQL will be sequential. Is that correct?
Is my understanding correct?

How to optimize spark sql operations on large data frame?

I have a large Hive table (~9 billion records and ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but it takes too much time to do any operation on it. Just a count on the input data frame takes ~11 minutes to complete, and min, max and avg on any single column take more than one and a half hours to complete.
I am working on a limited-resource cluster (as it is the only available one): a total of 9 executors, each with 2 cores and 5 GB of memory, spread over 3 physical nodes.
Is there any way to optimise this, say bring the time for all the aggregate functions on each column down to less than 30 minutes on the same cluster, or is bumping up my resources the only way? I am personally not very keen to do that.
One solution I came across to speed up data frame operations is to cache them, but I don't think it's a feasible option in my case.
All the real-world scenarios I came across use huge clusters for this kind of load.
Any help is appreciated.
I use Spark 1.6.0 in standalone mode with the Kryo serializer.
There are some cool features in Spark SQL, like:
Cluster by / Distribute by / Sort by
Spark allows you to write queries in a SQL-like language, HiveQL. HiveQL lets you control the partitioning of data, and we can use this in Spark SQL queries in the same way.
Distribute By
In Spark, the DataFrame is partitioned by some expression; all the rows for which this expression is equal are on the same partition.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key
Here is how it works. Starting from:
par1: [(1,c), (3,b)]
par2: [(3,c), (1,b), (3,d)]
par3: [(3,a),(2,a)]
This will transform into:
par1: [(1,c), (3,b), (3,c), (1,b), (3,d), (3,a)]
par2: [(2,a)]
Sort By
SELECT * FROM df SORT BY key
For this case it will look like:
par1: [(1,c), (1,b), (3,b), (3,c), (3,d), (3,a)]
par2: [(2,a)]
Cluster By
This is a shortcut for using distribute by and sort by together on the same set of expressions.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
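The three clauses can be imitated on the sample partitions above with a few lines of plain Python. This is only a sketch of the semantics: the helper names are hypothetical, and key % num_partitions is an assumption standing in for Spark's hash function, so which partition index a key ends up in may differ from a real run.

```python
def distribute_by(partitions, num_partitions, key=lambda row: row[0]):
    """Move every row to the partition chosen by its key (a shuffle)."""
    out = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for row in partition:
            out[key(row) % num_partitions].append(row)
    return out


def sort_by(partitions, key=lambda row: row[0]):
    """Sort rows inside each partition; there is no ordering across partitions."""
    return [sorted(partition, key=key) for partition in partitions]


def cluster_by(partitions, num_partitions, key=lambda row: row[0]):
    """Distribute by the key, then sort each resulting partition by the same key."""
    return sort_by(distribute_by(partitions, num_partitions, key), key)


data = [
    [(1, "c"), (3, "b")],
    [(3, "c"), (1, "b"), (3, "d")],
    [(3, "a"), (2, "a")],
]
print(distribute_by(data, 2))
print(cluster_by(data, 2))
```

In this sketch keys 1 and 3 share one partition and key 2 gets the other, which matches the grouping in the example above; the sort is stable, so rows with equal keys keep their arrival order.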
Note: This is basic information. Let me know if this helps; otherwise we can use various other methods to optimize your Spark jobs and queries, according to the situation and settings.

Efficient Filtering on a huge data frame in Spark

I have a Cassandra table with 500 million rows. I would like to filter it using Spark, based on a field which is a partition key in Cassandra.
Can you suggest the best/most efficient approach to filter in Spark/Spark SQL based on a list of keys, which is also pretty large?
Basically I need only those rows from the Cassandra table whose keys are present in the list.
We are using DSE and its features.
The approach I am using takes a lot of time, roughly an hour.
Have you checked repartitionByCassandraReplica and joinWithCassandraTable?
https://github.com/datastax/spark-cassandra-connector/blob/75719dfe0e175b3e0bb1c06127ad4e6930c73ece/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the java driver to execute a single
query for every partition required by the source RDD so no un-needed
data will be requested or serialized. This means a join between any
RDD and a Cassandra Table can be performed without doing a full table
scan. When performed between two Cassandra Tables which share the same
partition key this will not require movement of data between machines.
In all cases this method will use the source RDD's partitioning and
placement for data locality.
The method repartitionByCassandraReplica can be used to relocate data
in an RDD to match the replication strategy of a given table and
keyspace. The method will look for partition key information in the
given RDD and then use those values to determine which nodes in the
Cluster would be responsible for that data.

Spark Broadcasting Alternatives

Our application uses a long-running Spark context (just like the Spark REPL) to enable users to perform tasks online. We use Spark broadcasts heavily to process dimensional data. As is common practice, we broadcast the dimension tables and use the DataFrame APIs to join the fact table with the dimension tables. One of the dimension tables is quite big: it has about 100k records and takes 15 MB in memory (Kryo-serialized it is just a few MB less).
We see that every Spark job on the de-normalized DataFrame causes all the dimensions to be broadcast over and over again. The bigger table takes ~7 seconds every time it is broadcast. We are trying to find a way to have the dimension tables broadcast only once per context life span. We tried both SQLContext and SparkContext broadcasting.
Are there any other alternatives to Spark broadcasting? Or is there a way to reduce the memory footprint of the DataFrame (compression/serialization etc. - post-Kryo it is still 15 MB :( )?
Possible Alternative
We use the Ignite Spark integration to load a large amount of data at the start of the job and keep mutating it as needed.
In embedded mode you can start Ignite when the Spark context boots and kill it at the end.
You can read more about it here.
https://ignite.apache.org/features/igniterdd.html
Finally we were able to find a stopgap solution until Spark supports pinning of RDDs or preferably RDDs in a later version. This is apparently not addressed even in v2.1.0.
The solution relies on RDD mapPartitions. Below is a brief summary of the approach:
Collect the dimension table records as a map of key-value pairs and broadcast it using the Spark context. You can possibly use RDD.keyBy.
Map the fact rows using the RDD mapPartitions method.
For each fact row, mapPartitions:
collects the dimension ids in the fact row and looks up the dimension records
yields a new fact row by denormalizing the dimension ids in the fact table
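The per-partition lookup described in these steps can be sketched in plain Python. The row layout, the dim_id field and the function name denormalize_partition are assumptions made for illustration; in Spark the lookup map would be wrapped in a broadcast variable created once with the Spark context, and the function would be passed to mapPartitions on the fact RDD.

```python
def denormalize_partition(fact_rows, dimension_lookup):
    """Yield fact rows enriched with the attributes of their dimension record."""
    for fact in fact_rows:
        # A plain dict lookup replaces the join; unknown ids add nothing.
        dimension = dimension_lookup.get(fact["dim_id"], {})
        enriched = dict(fact)
        for name, value in dimension.items():
            enriched[f"dim_{name}"] = value
        yield enriched


# Dimension table collected as a key-value map (built once, reused by every job).
dimension_lookup = {10: {"name": "EU"}, 20: {"name": "US"}}

# Fact rows grouped the way they would be partitioned.
fact_partitions = [
    [{"sale": 5, "dim_id": 10}],
    [{"sale": 7, "dim_id": 20}, {"sale": 1, "dim_id": 99}],
]

result = [list(denormalize_partition(p, dimension_lookup)) for p in fact_partitions]
print(result)
```

Because the lookup map is created once and only referenced inside the partition function, it does not have to be rebuilt for each job, which is the point of the workaround.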
