Spark How to Join Only Within Partitions - apache-spark

I have 2 large dataframes. Each row has lat/lon data. My goal is to join the 2 dataframes and find all pairs of points that are within a given distance of each other, e.g. 100m.
df1: (id, lat, lon, geohash7)
df2: (id, lat, lon, geohash7)
I want to partition df1 and df2 on geohash7, and then only join within the partitions. I want to avoid joining between partitions to reduce computation.
df1 = df1.repartition(200, "geohash7")
df2 = df2.repartition(200, "geohash7")
df_merged = df1.join(df2, (df1["geohash7"] == df2["geohash7"]) & (dist(df1["lat"], df1["lon"], df2["lat"], df2["lon"]) < 100))
So basically join on geohash7 and then make sure distance between points is less than 100.
The problem is that Spark will actually cross join all the data. How can I make it join only within partitions, not across partitions?

After much experimenting with the data, it seems that Spark is smart enough to apply the equality condition on "geohash7" first. If there is no match there, it does not evaluate the "dist" function.
It also appears that with an equality condition in the join, Spark no longer does a cross join. So I didn't have to do anything else; the join above works fine.
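To make the behaviour described above concrete, here is a minimal plain-Python sketch (not Spark code) of an equi-join on the geohash followed by a distance filter. The question does not show how "dist" is defined, so haversine_m below is an assumed stand-in, and join_within_key, the sample coordinates and the "gh_A"/"gh_B" keys are all made up for illustration.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def join_within_key(rows1, rows2, max_dist_m=100.0):
    """Pair rows that share a geohash and lie within max_dist_m of each other.

    Rows are (id, lat, lon, geohash7) tuples. The distance is only computed
    for rows whose geohash already matches, never for the full cross product.
    """
    buckets = defaultdict(list)
    for row in rows2:
        buckets[row[3]].append(row)
    pairs = []
    for id1, lat1, lon1, gh in rows1:
        for id2, lat2, lon2, _ in buckets.get(gh, []):
            if haversine_m(lat1, lon1, lat2, lon2) < max_dist_m:
                pairs.append((id1, id2))
    return pairs

# Hypothetical sample data; the geohash strings are placeholders, not real geohashes.
rows1 = [("a", 52.0, 13.0, "gh_A")]
rows2 = [
    ("x", 52.0005, 13.0, "gh_A"),  # same cell, roughly 56 m away
    ("y", 52.002, 13.0, "gh_A"),   # same cell, roughly 222 m away
    ("z", 52.0001, 13.0, "gh_B"),  # close by, but in a different cell
]
result = join_within_key(rows1, rows2)
```

Note that "z" is never compared with "a" because the keys differ, which is exactly the saving the question asks for. It also means nearby points that fall into different geohash cells are not paired.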

Related

Find difference between two dataframes of large data

I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1 GB of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9397995 * 30
df2 size = 1500000 * 30
All 30 columns are of dtype string.
Both dataframes are being loaded from a database through a JDBC connection.
Both dataframes have same column names and in same order.
You could use a "WHERE" clause on each dataset to filter out rows you don't need before comparing. For example, filter on the primary key: if you know that df1 has PKs ranging from 1 to 100, filter df2 down to just those PKs. Obviously after a union.

drop_duplicates after unionByName

I am trying to stack two dataframes (with unionByName()) and then drop duplicate entries (with drop_duplicates()).
Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a dataframe whose first N rows are df1's? Because, if so, when applying drop_duplicates(), df1's row would always be preserved, which is the behaviour I want.
unionByName() does not guarantee that the records from df1 come first, followed by those from df2. These are distributed, parallel tasks, so you definitely can't rely on that ordering.
One solution is to add a technical priority column to each DataFrame, call unionByName(), and then use the row_number() analytical function to rank rows by priority within each ID and keep the one with the highest priority (in the code below, 1 means higher priority than 2).
Take a look at the Scala code below:
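import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}
// Tag each source so that df1's rows win when IDs collide
val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))
df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  // Keep only the highest-priority row for each ID
  .where(col("row_num") === lit(1))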
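The same idea can be shown without Spark. This is a small plain-Python sketch of the priority logic only. The function name pick_by_priority and the sample rows are made up for illustration, and "ID" is assumed to be the deduplication key as in the Scala code.

```python
def pick_by_priority(tagged_rows, key="ID"):
    """Keep one row per key: the one with the lowest priority number.

    tagged_rows is an iterable of (priority, row_dict) pairs in any order,
    which mirrors the fact that row order after a union is not guaranteed.
    """
    best = {}
    for priority, row in tagged_rows:
        k = row[key]
        if k not in best or priority < best[k][0]:
            best[k] = (priority, row)
    return [row for _, row in best.values()]

# Hypothetical sample data
rows1 = [{"ID": 1, "v": "from df1"}]
rows2 = [{"ID": 1, "v": "from df2"}, {"ID": 2, "v": "from df2"}]

# Deliberately put the df2 rows first to show that arrival order does not matter.
tagged = [(2, r) for r in rows2] + [(1, r) for r in rows1]
deduped = sorted(pick_by_priority(tagged), key=lambda r: r["ID"])
```

Because the choice depends only on the priority tag, the df1 version of ID 1 survives even though the df2 rows were seen first.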

Should I drop unneeded columns from DFs before joining them in Spark?

Does it make sense to drop columns that are not required before joining Spark dataframes?
For example:
DF1 has 10 columns, DF2 has 15 columns, DF3 has 25 columns.
I want to join them, select the 10 columns I need, and save the result as .parquet.
Does it make sense to select only the needed columns from each DF before the join, or will the Spark engine optimize the join by itself and not carry all 50 columns through the join operation?
Yes, it makes perfect sense, because it reduces the amount of data shuffled between executors. It's also better to select only the necessary columns as early as possible: in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read only the necessary columns rather than all of them. For example:
df1 = spark.read.parquet("file1") \
    .select("col1", "col2", "col3")
df2 = spark.read.parquet("file2") \
    .select("col1", "col5", "col6")
joined = df1.join(df2, "col1")

Pyspark Operations are Slower than Hive

I have 3 dataframes df1, df2 and df3.
Each dataframe has approximately 3 million rows. df1 and df3 have approx. 8 columns each. df2 has only 3 columns.
(source text file of df1 is approx 600MB size)
These are the operations performed:
df_new=df1 left join df2 ->group by df1 columns->select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
This entire process takes 15 minutes in PySpark 1.3.1, with default partition value = 10, executor memory = 30G, driver memory = 10G, and I have used cache() wherever necessary.
But when I use Hive queries, it takes barely 5 minutes. Is there any particular reason why my dataframe operations are slow, and is there any way I can improve the performance?
You should be careful with the use of JOIN.
A JOIN in Spark can be really expensive, especially if the join is between two dataframes. You can avoid expensive operations by repartitioning the two dataframes on the same column or by using the same partitioner.
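Why the same partitioner helps can be shown with a plain-Python sketch (not Spark code): if both sides are split with the same hash function and partition count, rows with equal keys always land in the same partition index, so each pair of partitions can be joined on its own. All names here (partition_index, hash_partition, copartitioned_join) and the CRC32-based hash are assumptions for illustration, not how Spark is implemented.

```python
import zlib
from collections import defaultdict

def partition_index(key, num_partitions):
    """Deterministic hash partitioner: equal keys always map to the same index."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def hash_partition(rows, num_partitions):
    """Split (key, value) rows into num_partitions lists by key hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[partition_index(key, num_partitions)].append((key, value))
    return parts

def join_partition(left_part, right_part):
    """Inner hash join of two partitions on the key."""
    index = defaultdict(list)
    for key, value in right_part:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left_part for rv in index.get(key, [])]

def copartitioned_join(left, right, num_partitions=4):
    """Join partition i of the left side only with partition i of the right side."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(join_partition(lp, rp))
    return out

# Hypothetical sample data
left = [("k1", 1), ("k2", 2), ("k3", 3)]
right = [("k1", "a"), ("k3", "c"), ("k3", "d"), ("k9", "z")]
joined = sorted(copartitioned_join(left, right))
```

The partition-wise result is the same as a join over the full data, but no row ever has to be compared with a row from a different partition.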

Filter a large RDD by iterating over another large RDD - pySpark

I have a large RDD, call it RDD1, that is approximately 300 million rows after an initial filter. What I would like to do is take the ids from RDD1 and find all other instances of them in another large dataset, call it RDD2, that is approximately 3 billion rows. RDD2 is created by querying a parquet table that is stored in Hive, as is RDD1. The number of unique ids in RDD1 is approximately 10 million.
My current approach is to collect the ids, broadcast them, and then filter RDD2.
My question is - is there a more efficient way to do this? Or is this best practice?
I have the following code -
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
RDD1 = hiveContext.sql("select * from table_1")
RDD2 = hiveContext.sql("select * from table_2")
ids = RDD1.rdd.map(lambda x: x[0]).distinct() # This is approximately 10 million ids
ids = sc.broadcast(set(ids.collect()))
RDD2_filter = RDD2.rdd.filter(lambda x: x[0] in ids.value)
I think it would be better to just use a single SQL statement to do the join:
RDD2_filter = hiveContext.sql("""select distinct t2.*
from table_1 t1
join table_2 t2 on t1.id = t2.id""")
What I would do is take the 300 million ids from RDD1, construct a Bloom filter from them, and use that as a broadcast variable to filter RDD2. You will get RDD2Partial, which contains all key-value pairs for keys that are in RDD1, plus some false positives. If you expect the result to be on the order of millions, then you will be able to use normal operations like join, cogroup, etc. on RDD1 and RDD2Partial to obtain the exact result without any problem.
This way you greatly reduce the time of the join operation if you expect the result to be of reasonable size, even though the complexity remains the same. You might get reasonable speedups (e.g. 2-10x) even if the result is on the order of hundreds of millions.
EDIT
The bloom filter can be collected efficiently since you can combine the bits set by one element with the bits set by another element with OR, which is associative and commutative.
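As a rough illustration of the OR-merge described above, here is a minimal plain-Python Bloom filter sketch. It is not tuned for 300 million ids: the class, its default sizes (num_bits, num_hashes) and the SHA-256-based hashing are all assumptions chosen for readability, and build_filter is a hypothetical helper standing in for the per-partition step.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter that uses a Python int as the bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed byte.
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True for every added item; may also be True for others (false positives).
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        """Combine two filters built with the same parameters using bitwise OR."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged

def build_filter(ids):
    """Build a filter for one chunk of ids (one partition, in Spark terms)."""
    bf = BloomFilter()
    for i in ids:
        bf.add(i)
    return bf

# Two hypothetical partitions of ids, filtered separately and then merged.
part_a = build_filter(range(0, 500))
part_b = build_filter(range(500, 1000))
merged = part_a.merge(part_b)
```

Because OR is associative and commutative, the per-partition filters can be combined in any order and give the same bits as a filter built from all ids at once. Anything the merged filter rejects is definitely not in RDD1; anything it accepts still needs the exact join to weed out false positives.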
