I have two tables to be cross joined:
table 1: queries, 300M rows
table 2: product descriptions, 3,000 rows
The following query does a cross join, calculates a score for each (query, product) pair, and picks the top 3 matches per query:
query_df.repartition(10000).registerTempTable('queries')
product_df.coalesce(1).registerTempTable('products')
CREATE TABLE matches AS
SELECT *
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY a.query_id
                            ORDER BY 0.40 + 0.15*score_a + 0.20*score_b + 0.5*score_c DESC) AS rn
  FROM (
    SELECT /*+ MAPJOIN(b) */ a.query_id,
           b.product_id,
           func_a(a.qvec, b.pvec) AS score_a,
           func_b(a.qvec, b.pvec) AS score_b,
           func_c(a.qvec, b.pvec) AS score_c
    FROM queries a
    CROSS JOIN products b
  ) a
) a
WHERE rn <= 3
My Spark cluster configuration looks like the following:
MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 22 --executor-memory 30g --executor-cores 7 --driver-memory 10g --conf spark.yarn.executor.memoryOverhead=10000 --conf spark.akka.frameSize=2047
Now the issue is that, as expected, the job fails after a couple of stages because it runs out of memory under the extremely large intermediate data produced. I'm looking for help/suggestions on optimizing the above operation so that the job runs the match and filter steps for one query_id before picking up the next query_id, in a parallel fashion - similar to a sort within a for loop over the queries table. If the job is slow but successful, I'm fine with that, since I can request a bigger cluster.
The above query works fine for a smaller query table, say one with 10000 records.
In the scenario where you want to join table A (big) with table B (small), the best practice is to leverage a broadcast join.
A clear overview is given in https://stackoverflow.com/a/39404486/1203837.
Hope this helps.
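For illustration, a minimal PySpark sketch of the broadcast approach (assuming Spark 2.x, where DataFrame.crossJoin is available; query_df and product_df are the dataframes from the question):

from pyspark.sql.functions import broadcast

# Broadcast the small products table (3,000 rows) to every executor so the
# cross join is executed map-side, without shuffling the 300M-row queries table.
pairs = query_df.crossJoin(broadcast(product_df))
pairs.createOrReplaceTempView('pairs')

# The scoring UDFs and the row_number() ranking from the question can then be
# run against `pairs` exactly as before.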
Cartesian (cross) joins in Spark are extremely expensive. I would suggest joining the tables with an inner join and saving the output data first, then using that dataframe for the further aggregation.
One small suggestion: a map join or broadcast join can sometimes fail if the smaller table is not small enough. Unless you are sure about the size of the small table, refrain from using a broadcast join.
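A rough sketch of that pattern, assuming hypothetical dataframes table_a and table_b, a hypothetical join key join_key, and a placeholder output path:

# Join once, persist the result, and run the later aggregation/ranking on the
# saved data instead of recomputing the expensive join downstream.
joined = table_a.join(table_b, "join_key", "inner")           # hypothetical tables and key
joined.write.mode("overwrite").parquet("/tmp/joined_output")  # placeholder path

joined_saved = spark.read.parquet("/tmp/joined_output")
agg = joined_saved.groupBy("join_key").count()                # placeholder downstream aggregation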
I have a table named "table1" and I'm splitting it based on a criterion, then joining the split parts one by one in a for loop. The following is a representation of what I am trying to do.
As I join them, the join time increases exponentially:
0.7423694133758545
join
0.4046192169189453
join
0.5775985717773438
join
5.664674758911133
join
1.0985417366027832
join
2.2664384841918945
join
3.833379030227661
join
12.762675762176514
join
44.14520192146301
join
124.86295890808105
join
389.46189188957214
Following are my parameters:
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.appName("xyz").getOrCreate()
sqlContext = HiveContext(spark.sparkContext)  # HiveContext expects a SparkContext
sqlContext.setConf("spark.sql.join.preferSortMergeJoin", "true")
sqlContext.setConf("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sqlContext.setConf("spark.sql.shuffle.partitions", "48")
and
--executor-memory 16G --num-executors 8 --executor-cores 8 --driver-memory 32G
(The source table and desired output table were shown as images in the original post.)
In the join iteration, I also tried increasing the partitions to 2000 and decreasing them to 4, and caching the dataframe with df.cache(), but nothing worked. I know I am doing something terribly wrong, but I don't know what. Can you please guide me on how to correct this?
I would really appreciate any help :)
code:
from pyspark.sql.functions import col, broadcast

df = spark.createDataFrame([], schema=SCHEMA)
for i, column in enumerate(columns):
    df.cache()
    df_part = df_to_transpose.where(col('key') == column)
    df_part = df_part.withColumnRenamed("value", column)
    if df_part.count() != 0 and df.count() != 0:
        df = df_part.join(broadcast(df), 'tuple')
I had the same problem a while ago. If you open the Spark web UI, go to the Stages section and check the DAG visualization of your task, you can see that the DAG grows exponentially, and the waiting time you observe is spent building this DAG rather than actually doing the work. I don't know why, but it seems that when you repeatedly join a dataframe with something derived from itself, PySpark can't handle the partitions and the plan gets a lot bigger. The solution I found at the time was to save each join result to a separate file, and at the end, after restarting the kernel, load and join all the files again. It seems that if the dataframes you want to join are not derived from each other, you don't see this problem.
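A rough sketch of that workaround, reusing the columns, df_to_transpose, and 'tuple' names from the question; the parquet paths are placeholders:

from pyspark.sql.functions import col

# Step 1: write each split to its own file so the later joins start from
# independent sources rather than from dataframes derived from one another.
for column in columns:
    part = (df_to_transpose.where(col('key') == column)
                           .withColumnRenamed("value", column))
    part.write.mode("overwrite").parquet("/tmp/part_{}".format(column))   # placeholder path

# Step 2 (after restarting the session, as described above): reload and join the files.
df = spark.read.parquet("/tmp/part_{}".format(columns[0]))
for column in columns[1:]:
    df = df.join(spark.read.parquet("/tmp/part_{}".format(column)), 'tuple')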
Add a checkpoint every loop, or every so many loops, so as to break lineage.
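A minimal sketch of that idea applied to the loop from the question (the checkpoint directory and the every-5-iterations frequency are arbitrary choices):

from pyspark.sql.functions import col

# Checkpointing materializes the dataframe and truncates its lineage, so the
# query plan does not keep growing across iterations.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder path

df = None
for i, column in enumerate(columns):
    df_part = (df_to_transpose.where(col('key') == column)
                              .withColumnRenamed("value", column))
    df = df_part if df is None else df_part.join(df, 'tuple')
    if i % 5 == 0:              # checkpoint every few loops; tune to taste
        df = df.checkpoint()    # cuts the lineage accumulated so far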
I am using the DirectJoin of the Spark-Cassandra-Connector (SCC) in order to join a dataframe with a Cassandra table and then perform a count. When I join on all the data from the table, the join is faster (5 minutes) than when I join on e.g. 3/4 of it (13 minutes). Can SCC somehow tell whether I have chosen all the partition keys in order to perform the join?
My guess is that, because I am not using RepartitionByCassandraReplica, sometimes some partition keys are sent to the right nodes and other times not. So maybe the 5 minutes is just "luck"?
Edit
DirectJoin is always "on" in both of the above cases!
Direct join issues a query for each join key. That's why a full join of the two tables is faster without direct join.
By default direct join is disabled if the size ratio exceeds 90% (directJoinSetting=auto, directJoinSizeRatio=0.9).
You can also force direct join by setting directJoinSetting=on, disable with directJoinSetting=off, or tune the threshold with directJoinSizeRatio=x. See https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-datasource-table-options for details.
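For illustration, a sketch of how those options might be passed when reading the Cassandra table; the option names come from the reference linked above, while the keyspace, table, and key column are placeholders, and the exact way of supplying the options should be verified against your SCC version:

# Force the direct join on (or off) for this read via the Cassandra DataSource
# table options. Keyspace, table, and key names are placeholders.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_ks", table="my_table")   # placeholders
      .option("directJoinSetting", "on")             # or "off" / "auto"
      .option("directJoinSizeRatio", "0.9")          # only relevant for "auto"
      .load())

result = keys_df.join(df, ["partition_key"])         # hypothetical keys dataframe and key column
result.count()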
I am trying multiple ways to optimize executions on large datasets using partitioning. In particular, I'm using a function commonly found in traditional SQL databases called nTile.
The objective is to place a certain number of rows into a bucket using a combination of bucketing and repartitioning. This should allow Apache Spark to process the data more efficiently when working with partitioned, or rather bucketed, datasets.
Below are two examples. The first shows how I've used nTile to split a dataset into two buckets, followed by repartitioning the data into 2 partitions on the bucketed nTile column called skew_data.
I then follow with the same query, but without any bucketing or repartitioning.
The problem is that the query without bucketing is faster than the query with bucketing, even though the query without bucketing places all the data into one partition whereas the query with bucketing splits it into 2 partitions.
Can someone let me know why that is?
FYI: I'm running the query on an Apache Spark cluster from Databricks.
The cluster has a single node with 2 cores and 15 GB of memory.
First example: with nTile/bucketing and repartitioning
from pyspark.sql.functions import col, rand

allin = spark.sql("""
    SELECT
        t1.make
        , t2.model
        , NTILE(2) OVER (ORDER BY t2.sale_price) AS skew_data
    FROM
        t1 INNER JOIN t2
            ON t1.engine_size = t2.engine_size2
    """) \
    .repartition(2, col("skew_data"), rand()) \
    .drop('skew_data')
The above code splits the data into partitions as follows, with the corresponding partition distribution
Number of partitions: 2
Partitioning distribution: [5556767, 5556797]
The second example: with no nTile/bucketing or repartitioning
allin_NO_nTile = spark.sql("""
SELECT
t1.make
,t2.model
FROM
t1 INNER JOIN t2
ON t1.engine_size = t2.engine_size2
""")
The above code puts all the data into a single partition as shown below:
Number of partitions: 1
Partitioning distribution: [11113564]
My question is: why is the second query (without nTile or repartitioning) faster than the query with nTile and repartitioning?
I have gone to great lengths to write this question out as fully as possible, but if you need further explanation please don't hesitate to ask. I really want to get to the bottom of this.
I abandoned my original approach and used the PySpark DataFrameWriter method bucketBy() instead. If you want to know how to apply bucketBy() to bucket data, see
https://www.youtube.com/watch?v=dv7IIYuQOXI&list=PLOmMQN2IKdjvowfXo_7hnFJHjcE3JOKwu&index=39
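For reference, a minimal sketch of that approach, assuming t1_df and t2_df are dataframes backing the t1 and t2 tables from the question; the bucket count of 8 is arbitrary:

t1_df.write.bucketBy(8, "engine_size").sortBy("engine_size") \
    .mode("overwrite").saveAsTable("t1_bucketed")
t2_df.write.bucketBy(8, "engine_size2").sortBy("engine_size2") \
    .mode("overwrite").saveAsTable("t2_bucketed")

# Both tables are bucketed on their join keys with the same number of buckets,
# so Spark can join them without a full shuffle.
joined = spark.sql("""
    SELECT a.make, b.model
    FROM t1_bucketed a JOIN t2_bucketed b
      ON a.engine_size = b.engine_size2
""")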
I have 2 tables, T1 and T2.
T1 is read from Postgres and is smaller in size, but gradually increases in volume (from 0 to hiveTableSize).
T2 is read from Hive and bigger in size (more than 100k rows).
I am doing LEFT_ANTI join as
T1.join(T2, column_name, "LEFT_ANTI").
The goal is to get all rows from T1 which are not in T2. After all transformations, the data would be written to Postgres and the whole data would be read again when the job runs the next day.
What I am observing is that smallTable.join(largeTable) has a performance impact. My job runs anywhere from 30 min to 90 min with the above join, but if I comment this join out, it runs in less than 5 min.
Does Spark optimize large table joined against small table?
If the larger table is in fact only 100K rows, this join should run in seconds. There is something other than the join performance causing the bottleneck. One potential issue is that the number of partitions you have is too large. This leads to a lot of overhead when processing small data sets.
Try something like the following
T1.coalesce(n).join(T2.coalesce(n), column_name, "LEFT_ANTI")
Where n is some small integer, ideally 2 * the number of executor cores available. The coalesce function reduces the number of partitions in the data set. Honestly, at this scale, you may even want to coalesce to 1 partition.
Note also that the tables are likely being read completely into Spark before the join. Because you are joining across two federated sources, the only way to do the anti-join is to pull both tables into Spark and scan both of them. This could be contributing to the poor performance. It may even be worthwhile to copy the PG table into Spark before the join, depending on where else it is being used.
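A rough sketch of that idea, with the JDBC connection details and table names as placeholders:

# Pull the (small) Postgres table into Spark once, reduce its partition count,
# and cache it so the anti-join scans an in-memory copy instead of re-reading over JDBC.
t1 = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   # placeholder
      .option("dbtable", "schema.t1")                    # placeholder
      .option("user", "user")
      .option("password", "pass")                        # placeholders
      .load()
      .coalesce(4)
      .cache())

result = t1.join(T2, column_name, "LEFT_ANTI")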
I have a Hive table of 14 billion records (around 1 TB in size) and another Hive table of 800 million records (about 2 GB). I want to join them; what should my strategy be?
I have a 36 node cluster. I am using 50 executors, 30 GB to each executor.
From what I see, my options are:
Broadcasting the 2 GB table
Just joining 2 tables blindly (I have done this, it's taking almost 4 hrs to complete)
If I repartition both the tables and join them, will it increase the performance?
I observed that in the 2nd approach the last 20 tasks are extremely slow; I suspect they are processing partitions with more data (skewed data).
The smaller table can fit into memory if you give each worker enough RAM. In that case a map side join / side data approach may well be useful.
Take a look at using the MapJoin hint:
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
The essential point:
If all but one of the tables being joined are small, the join can be performed as a map-only job.
More details on its usage may be seen here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-MapJoinRestrictions