Spark Datasets - Full Outer Join Issue

I am using Spark 2.0.0 with the DataFrames API (Dataset[Row]) to join some dataframes as follows. I want to get all the rows from both usages and activations (hence the full outer join), and then do an inner join on the resulting dataset so that I only keep the rows that also have a corresponding row in the appDetails dataset (hence the inner join).
val result = usages
.join(activations, Seq("DATE", "APP_ID"), "outer")
.join(appDetails, Seq("APP_ID"), "inner")
This query does not return any results. However, if I change the full outer join to a left outer join, like this, it returns results.
val result = usages
.join(activations, Seq("DATE", "APP_ID"), "left")
.join(appDetails, Seq("APP_ID"), "inner")
However, this doesn't work for me, because I want all the rows from the activations table as well. What's happening here? Why doesn't the full outer join work as expected in this case?
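For reference, here is a minimal self-contained sketch of the setup; the schemas and toy values below are assumptions for illustration, not the original data:
import spark.implicits._

// Toy stand-ins for the three datasets; only DATE and APP_ID matter here.
val usages      = Seq(("2016-01-01", 1L, 100)).toDF("DATE", "APP_ID", "USAGE_COUNT")
val activations = Seq(("2016-01-02", 2L, 1)).toDF("DATE", "APP_ID", "ACTIVATIONS")
val appDetails  = Seq((1L, "app-one"), (2L, "app-two")).toDF("APP_ID", "APP_NAME")

val result = usages
  .join(activations, Seq("DATE", "APP_ID"), "outer")
  .join(appDetails, Seq("APP_ID"), "inner")
// Expected: both input rows survive the outer join, and both have a matching
// APP_ID in appDetails, so the inner join should keep them.
result.show()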

Related

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 when trying to efficiently join 2 Spark DataFrames where:
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join; it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in an efficient plan: it recognizes that the 2 dataframes are already partitioned by the join key and avoids re-shuffling them. The 2 dataframes are not only partitioned the same way, but also colocated.
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it repartitions the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before; there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order == df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This, however, results in the same inefficient plan, since Spark's query optimizer simply merges the filter back into the join.
So, I finally thought that maybe I could force Spark to process the join the way I want by adding a dummy cache step, trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything; it's just to tell Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually reuse the existing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Left Join errors out: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans

Edit:
df_joint = df_raw.join(df_items, on='x', how='left')
The exception in the title occurred in Apache Spark 2.4.5.
df_raw has two columns, "x" and "y", while df_items is an empty dataframe with a schema containing some other columns.
The left join matches values against null, so it should return all the data from the first dataframe with null values for the second dataframe's columns.
It works completely fine when "x" is a float; however, when I cast "x" to string, it throws the implicit cartesian product error.
Why is this happening, and how can I resolve it without enabling Spark cross joins via
spark.conf.set("spark.sql.crossJoin.enabled", "true")
Might be a bug in Spark, but if you just want to add columns, you can do the following:
import pyspark.sql.functions as F
df_joint = df_raw.select(
'*',
*[F.lit(None).alias(c) for c in df_items.columns if c not in df_raw.columns]
)

spark join raises “Detected implicit cartesian product for INNER join between logical plans”

I have a situation where I have a dataframe df
and let's say I do the following steps:
df1 = df
df2 = df
and then write a query which joins them, e.g.
df3 = df1.join(df2, df1["column"] == df2["column"])
This is nothing but a self join, which is widely needed in ETL.
Why does Spark not handle it correctly?
I have seen many posts, but none of them provide a workaround.
Update:
If I load the dataframes df1 and df2 from the same S3 location and then perform the join, the issue goes away. But when doing ETL it may not always be the case that we can persist the data and then reload it to avoid this scenario.
Any thoughts?
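One commonly suggested workaround is to rename the join column on one side, so the join condition compares two distinct attributes instead of collapsing into a trivially true predicate that leaves the join without a condition. A sketch in Scala ("column" is the hypothetical column name from the example above):
// Renaming gives the right side a fresh attribute, so the equality condition
// survives the optimizer instead of degenerating into a cartesian product.
val right = df.withColumnRenamed("column", "column_r")
val joined = df.join(right, df("column") === right("column_r"))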

Why does a distinct count after a SortMerge join on a Spark dataframe of Vectors give non-deterministic results

A Spark inner SortMerge join using a column containing an ml Vector appears to give non-deterministic and inaccurate results on larger datasets.
I was using the approxNearestNeighbors method of BucketedRandomProjectionLSH in Spark v2.4.3, and discovered it gives different numbers of results for large data sets.
This problem only appears when executing a SortMerge join; a broadcast join gives the same result every time.
I tracked the problem down to the join on the LSH hash keys. A reproducible example is below...
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
//create a large dataframe containing an Array (length two) of ml Vectors of length one.
val df = Seq.fill(30000)(
(java.util.UUID.randomUUID.toString,
Seq(
Vectors.dense(Random.nextInt(10).toDouble),
Vectors.dense(Random.nextInt(10).toDouble))
)
).toDF
//ensure it's cached
df.cache.count()
//positional explode the vector column
val dfExploded = df.select(col("*"), posexplode(col("_2")))
// now self join on the exploded 'col' and 'pos' fields
dfExploded.join(dfExploded, Seq("pos","col")).drop("pos","col").distinct.count
Different result each time...
scala>dfExploded.join(dfExploded,Seq("pos","col")).drop("pos","col").distinct.count
res139: Long = 139663581
scala>dfExploded.join(dfExploded,Seq("pos","col")).drop("pos","col").distinct.count
res140: Long = 156349630
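One workaround sketch is to keep the Vector UDT out of the join key entirely and join on a primitive representation instead. Note the caveat: vector_to_array only exists from Spark 3.0 onwards, so on 2.4.x an equivalent UDF would be needed:
import org.apache.spark.ml.functions.vector_to_array

// Join on a plain array<double> instead of the Vector UDT; arrays of atomic
// types have well-defined equality and ordering for a sort-merge join.
val dfArr = dfExploded
  .withColumn("col_arr", vector_to_array(col("col")))
  .drop("col")
dfArr.join(dfArr, Seq("pos", "col_arr")).drop("pos", "col_arr").distinct.count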

Spark scala partition dataframe for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However, because of their size, a simple cross join fails. I want to partition the data and perform the cross join, and am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read into dataframes: df1A, df1B, df1C. Manually split file f2 into four and read into dataframes: df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B, ..., df1A X df2D, ..., df1C X df2D. Save each cross join to a file and manually put all the files together. This way Spark can perform each cross join in parallel and things should complete fairly quickly.
Question
Is there a more efficient way of accomplishing this by reading both files into two dataframes, then partitioning each dataframe into 3 and 4 "pieces", and for each partition of one dataframe cross joining it with every partition of the other dataframe?
A dataframe can be partitioned either by range or by hash.
val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result = partitionedByRange1.crossJoin(partitionedByRange2)
NOTE: set the property spark.sql.crossJoin.enabled=true
Alternatively, you can convert the dataframes to RDDs and then use the cartesian operation on those RDDs. You should then be able to save the resulting RDD to a file. Hope that helps.
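A minimal sketch of that RDD-based alternative, reusing df1 and df2 from above (the output path is a placeholder):
// Cartesian product via the RDD API: pairs every Row of df1 with every Row
// of df2, so each element of the result is a (Row, Row) tuple.
val cartesianRdd = df1.rdd.cartesian(df2.rdd)
// Persist the pairs to disk; replace the path with a real location.
cartesianRdd.saveAsTextFile("/tmp/crossjoin_output")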
