Spark Hive table join strategy - apache-spark

I have a Hive table of 14 billion records (around 1 TB in size) and another Hive table of 800 million records (about 2 GB). I want to join them; what should my strategy be?
I have a 36-node cluster and I am using 50 executors with 30 GB each.
From what I see, my options are:
Broadcasting the 2 GB table
Just joining the two tables blindly (I have done this; it takes almost 4 hours to complete)
If I repartition both tables and join them, will that improve performance?
I observed that with the second approach the last 20 tasks are extremely slow; I suspect they are processing partitions with more data (skewed data).

The smaller table can fit in memory if you give each worker enough RAM. In that case a map-side join / side-data approach may well be useful.
Take a look at using the MapJoin hint:
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
The essential point:
If all but one of the tables being joined are small, the join can be
performed as a map only job.
More details on its usage may be seen here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-MapJoinRestrictions
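The Spark-side equivalent is a broadcast join. A minimal sketch, assuming the two tables are available through the metastore under the hypothetical names big_table and small_table and are joined on a column called key:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("broadcast-join-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical table and column names; substitute your own.
val big   = spark.table("big_table")    // ~14 billion rows, ~1 TB
val small = spark.table("small_table")  // ~800 million rows, ~2 GB on disk

// Mark the smaller side for broadcast so every executor receives a full copy
// and the large table is joined in place, without being shuffled.
val joined = big.join(broadcast(small), Seq("key"), "inner")
Whether 800 million rows really fit in each executor's memory once deserialized is the deciding factor; 2 GB on disk can expand considerably, and if it does not fit, a plain shuffle join on an evenly distributed key (with some skew handling) is the fallback.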

Related

What are the main differences, in the compute resources, between left join and cross join?

I am very new to Spark resource configuration and I would like to understand the main differences between using a left join versus a cross join in Spark in terms of resource/compute behaviour.
Assuming the supplied record volume and Spark configuration (cores and memory) are the same, I guess the major gain for non-Cartesian joins comes from the underlying filtering of rows (the join condition), which uses relatively fewer cores and less memory.
When both of your tables have a similar size/record count:
Cartesian or Cross joins will be extremely expensive as they can easily explode the number of output rows.
Imagine 10,000 X 10,000 = 100 million
All rows from both datasets will be read, sorted and written (using n cores) and must fit in memory for the join, so it has a much larger footprint.
Inner/outer joins work on the principles of map/reduce and co-locality:
rows matching the join condition are filtered from the tables (the map stage) using n cores, followed by a shuffle and sort on the local executors, and the result is emitted (the reduce stage).
But, when one of your tables has a smaller size/record count:
the smaller table will be read, built into a hash table and written using (possibly) a single partition, i.e. broadcast to each executor that reads X partitions of the larger table.
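A small sketch of the contrast (the dataframes and column names here are made up for illustration, not taken from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("join-contrast").getOrCreate()
import spark.implicits._

val a = (1 to 10000).map(i => (i, s"a$i")).toDF("key", "a_val")
val b = (1 to 10000).map(i => (i, s"b$i")).toDF("key", "b_val")

// Cross join: every row pairs with every row, 10,000 x 10,000 = 100 million output rows.
val cartesian = a.crossJoin(b)

// Inner equi-join: only matching keys survive; Spark can shuffle both sides by
// "key" (sort-merge join) or broadcast the smaller side to avoid a big shuffle.
val inner = a.join(broadcast(b), Seq("key"), "inner")

// cartesian.count() would materialise 100,000,000 rows; inner.count() returns 10,000.
println(inner.count())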

Apache Spark Join Performance

I have 2 tables, T1 and T2.
T1 is read from Postgres and is smaller in size, but it gradually increases in volume (from 0 to hiveTableSize).
T2 is read from Hive and bigger in size (more than 100k rows).
I am doing LEFT_ANTI join as
T1.join(T2, column_name, "LEFT_ANTI").
The goal is to get all rows from T1 which are not in T2. After all transformations, the data would be written to Postgres and the whole data would be read again when the job runs the next day.
What I am observing is that smallTable.join(largeTable) appears to have a performance impact. My job runs anywhere from 30 to 90 minutes with the above join, but if I comment the join out, it runs in less than 5 minutes.
Does Spark optimize large table joined against small table?
If the larger table is in fact only 100K rows, this join should run in seconds. There is something other than the join performance causing the bottleneck. One potential issue is that the number of partitions you have is too large. This leads to a lot of overhead when processing small data sets.
Try something like the following
T1.coalesce(n).join(T2.coalesce(n), column_name, "LEFT_ANTI")
Where n is some small integer, ideally 2 * the number of executor cores available. The coalesce function reduces the number of partitions in the data set. Honestly, at this scale, you may even want to coalesce to 1 partition.
Note also that the tables are likely being read completely into Spark before the join. Because you are joining across two federated sources, the only way to do the anti-join is to pull both tables into Spark and scan both of them. This could be contributing to the poor performance. It may even be worthwhile to copy the PG table into Spark before the join, depending on where else it is being used.
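Putting that together, a rough sketch of what the whole job might look like (the JDBC connection details, table names and the join column column_name are placeholders, not the asker's actual setup):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("anti-join-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical JDBC details and table names for illustration only.
val t1 = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "schema.t1")
  .option("user", "user")
  .option("password", "password")
  .load()

val t2 = spark.table("hive_db.t2")

// Both sides are small, so collapse them to a handful of partitions
// (n roughly 2 x the executor cores available) before the anti-join.
val n = 4
val result = t1.coalesce(n)
  .join(t2.coalesce(n), Seq("column_name"), "left_anti")

// Rows of T1 whose key never appears in T2; write back to Postgres as needed.
result.show()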

How to optimize spark sql operations on large data frame?

I have a large Hive table (~9 billion records and ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but it takes too much time to do any operation on it. Just a count on the input data frame itself takes ~11 minutes to complete, and min, max and avg on any single column takes more than an hour and a half.
I am working on a limited-resource cluster (as it is the only one available): a total of 9 executors, each with 2 cores and 5 GB of memory, spread over 3 physical nodes.
Is there any way to optimise this, say to bring the time for all the aggregate functions on each column down to less than 30 minutes on the same cluster, or is bumping up my resources the only way? I am personally not very keen on doing that.
One solution I came across to speed up data frame operations is to cache them, but I don't think it's a feasible option in my case.
All the real world scenarios I came across use huge clusters for this kind of load.
Any help is appreciated.
I use spark 1.6.0 in standalone mode with kryo serializer.
There are some cool features in sparkSQL like:
Cluster by/ Distribute by/ Sort by
Spark allows you to write queries in a SQL-like language, HiveQL. HiveQL lets you control the partitioning of data, and the same constructs can be used in Spark SQL queries as well.
Distribute By
In Spark, a DataFrame can be partitioned by some expression; all the rows for which the expression evaluates to the same value end up in the same partition.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY KEY
So, look how it works:
par1: [(1,c), (3,b)]
par2: [(3,c), (1,b), (3,d)]
par3: [(3,a),(2,a)]
This will transform into:
par1: [(1,c), (3,b), (3,c), (1,b), (3,d), (3,a)]
par2: [(2,a)]
Sort By
SELECT * FROM df SORT BY key
for this case it will look like:
par1: [(1,c), (1,b), (3,b), (3,c), (3,d), (3,a)]
par2: [(2,a)]
Cluster By
This is a shortcut for using distribute by and sort by together on the same set of expressions.
SET spark.sql.shuffle.partitions =2
SELECT * FROM df CLUSTER BY key
Note: this is basic information. Let me know if it helps; otherwise we can use various other methods to optimize your Spark jobs and queries, depending on the situation and settings.
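For reference, the same layout can be produced through the DataFrame API (a rough sketch; mytable and key are stand-in names, and the methods used here have been available since Spark 1.6):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("distribute-sort-cluster-sketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("mytable")   // hypothetical table name

// DISTRIBUTE BY key  ~  hash-repartition by the key expression.
val distributed = df.repartition(2, col("key"))

// SORT BY key  ~  sort rows within each partition (no global ordering).
val sorted = df.sortWithinPartitions(col("key"))

// CLUSTER BY key  ~  distribute by the key, then sort within each partition.
val clustered = df.repartition(2, col("key")).sortWithinPartitions(col("key"))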

Why join in spark in local mode is so slow?

I am using Spark in local mode and a simple join is taking too long. I have fetched two dataframes, A (8 columns and 2.3 million rows) and B (8 columns and 1.2 million rows), join them using A.join(B, condition, 'left') and call an action at the end. This creates a single job with three stages: one for the extraction of each dataframe and one for the join. Surprisingly, the stage extracting dataframe A takes around 8 minutes and the one for dataframe B takes 1 minute, while the join itself happens within seconds. My important configuration settings are:
spark.master local[*]
spark.driver.cores 8
spark.executor.memory 30g
spark.driver.memory 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 16
The only executor is the driver itself. While extracting the dataframes, I partitioned them into 32 parts (I also tried 16, 64, 50, 100 and 200). I have seen the shuffle write reach 100 MB for the stage that extracts dataframe A. So, to avoid the shuffle, I created 16 initial partitions for both dataframes and broadcast dataframe B (the smaller one), but it is not helping; there is still shuffle write. I used the broadcast(B) syntax for this. Am I doing something wrong? Why is there still shuffling? Also, when I look at the event timeline, it shows only four cores processing at any point in time, although I have a 2-core x 4 processor machine. Why is that?
In short, "Join"<=>Shuffling, the big question here is how uniformly are your data distributed over partitions (see for example https://0x0fff.com/spark-architecture-shuffle/ , https://www.slideshare.net/SparkSummit/handling-data-skew-adaptively-in-spark-using-dynamic-repartitioning and just Google the problem).
A few possibilities to improve efficiency:
think more about your data (A and B) and partition them wisely;
analyse whether your data are skewed;
go into the UI and look at the task timings;
choose partitioning keys such that during the join only a few partitions of dataset A are shuffled against a few partitions of B (see the sketch below for one way to verify what the join actually does);
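As a starting point for that kind of investigation, here is a rough sketch of how to check whether a broadcast hint actually removed the shuffle (A and B are replaced by toy dataframes, not the asker's data):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-join-check")
  .getOrCreate()

// Toy stand-ins for the two extracted dataframes.
val A = spark.range(0, 2300000).toDF("id")
val B = spark.range(0, 1200000).toDF("id")

val joined = A.join(broadcast(B), Seq("id"), "left")

// If the broadcast took effect, the physical plan contains a BroadcastHashJoin;
// a SortMergeJoin means both sides are still being shuffled by the join key.
joined.explain()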

increasing number of partitions in spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables with this configuration, using a certain date column as the partition column.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds, but Spark takes 21 minutes for the same query. Looking at the job, I found the issue is that Spark creates 2 stages, and the first stage has only 7 tasks, one for each of the 7 blocks of data within the given partition of the ORC file. The blocks are of different sizes, one is 5 MB while another is 45 MB, and because of this the stragglers take more time, making the whole job slow.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and hence the number of tasks in stage 1, even though there are only 7 physical blocks for the given query range?
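One commonly suggested direction, sketched very roughly below: shrink the maximum input split size so each block yields more scan tasks, or repartition after reading so that at least the later stages run with more parallelism. Which knob applies depends on the Spark version and whether the table goes through the native ORC file source or the Hive reader; the values and table name here are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-split-sketch")
  .enableHiveSupport()
  .getOrCreate()

// On Spark 2.x+ with the native ORC file source, smaller file splits
// translate into more scan tasks (the default is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)

// If the table is read through the Hive ORC input format instead,
// the Hadoop split size setting plays the same role.
spark.sparkContext.hadoopConfiguration
  .setLong("mapreduce.input.fileinputformat.split.maxsize", 16L * 1024 * 1024)

// Independently of the scan, an explicit repartition raises the parallelism
// of everything downstream, at the cost of one extra shuffle.
val df = spark.table("mytable").repartition(64)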
