Spark condition on partition column from another table (Performance) - apache-spark

I have a huge parquet table partitioned on registration_ts column - named stored.
I'd like to filter this table based on data obtained from small table - stream
In sql world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In Dataframe world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but the performance is suffering, because the partition pruning is not applied. Spark full-scans stored table, which is too expensive.
For example this runs 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the 2-nd example the partition filter is propagated to joined stream table.
I'd like to achieve partition filtering on dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
Which collects the partitions to driver and makes a 2-nd query. This works fine only for small number of partitions. When I've tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...

Here's one way that you can do it in PySpark and I've verified in Zeppelin that it is using the set of values to prune the partitions
# the collect_set function returns a distinct list of values and collect function returns a list of rows. Getting the [0] element in the list of rows gets you the first row and the [0] element in the row gets you the value from the first column which is the list of distinct values
from pyspark.sql.functions import collect_set
filter_list = spark.read.orc(HDFS_PATH)
.agg(collect_set(COLUMN_WITH_FILTER_VALUES))
.collect()[0][0]
# you can use the filter_list with the isin function to prune the partitions
df = spark.read.orc(HDFS_PATH)
.filter(col(PARTITION_COLUMN)
.isin(filter_list))
.show(5)
# you may want to do some checks on your filter_list value to ensure that your first spark.read actually returned you a valid list of values before trying to do the next spark.read and prune your partitions

Related

Find difference between two dataframes of large data

I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1gb of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9397995 * 30
df2 size = 1500000 * 30
All 30 columns are of dtype string.
Both dataframes are being loaded from a database through jdbc
connection.
Both dataframes have same column names and in same order.
You could use the "WHERE" statement on a dataset to filter out rows you don't need. For example through PK, if you know that df1 has pk ranging from 1 to 100, you just filter in df2 for all those pk's. Obviously after a union

Spark SQL: Empty table join with large table slow

I have 2 table
Transaction table (partition by year)
metadata table (All key is unique no partition, 30M records)
My Spark SQL is
SELECT * FROM Transaction WHERE PARTITION_YEAR = 2022; (result: 0 record)
Very fast to get result
SELECT * FROM Metadata WHERE KEY = "A" (result: 1 record)
2-3 seconds to get result
and finally
SELECT * FROM Transaction t LEFT JOIN Metadata m ON t.key = m.key WHERE t.PARTITION_YEAR = 2022;
Pretty slow (3 mins.)
Although
SELECT * FROM (SELECT * FROM Transaction WHERE PARTITION_YEAR = 2022) t LEFT JOIN Metadata m ON t.key = m.key;
Still have to wait (3 mins)
From your Query behaviour , I am guessing your file format is parquet.
SELECT * FROM Metadata WHERE KEY = "A"
This is like a filter operation which supports PUSHDOWN filter and it won't scan the whole table , rather quickly look at the parquet metadata (RANGE) of your interested column (KEY) and figure out .
But when you join in Spark :
It has to bring in the Whole table of METADATA into memory, and it has to scan and shuffle the data based on your join condition. i.e. PUSHDOWN filter is not supported for joins.
Your last 2 queries are essentially same. Spark will only bring 2022 data into memory. Even though your partition might be empty, 30M records of METADATA will be loaded.
Incase you want to avoid this situation for empty partitions, you should check if the partition is empty then only fire the
The cheapest/efficient way to check :
val dfPartition = spark.sql("SELECT * FROM Transaction WHERE PARTITION_YEAR = 2022;")
if(!dfPartition.isEmpty()) // fastest.
{
//Fire your join query
}

Pyspark: Why show() or count() of a joined spark dataframe is so slow?

I have two large spark dataframe. I joined them by one common column as:
df_joined = df1.join(df2.select("id",'label'), "id")
I got the result, but when I want to work with df_joined, it's too slow. As I know, we need to repartition df1 and df2 to prevent large number of partition for df_joined. so, even, I changed the number of partitions,
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")
still NOT working.
any IDEA?
Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).
If you’re cluster has 20 cores, you should have at least 20 partitions (in practice 2–3x times more). From the other hand a single partition typically shouldn’t contain more than 128MB.
so, instead of below two lines, which repartition your dataframe into 1 paritition:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
Repartition your data based on 'id' column, joining key, into n partitions. ( n depends on data size and number of cores in cluster).
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")

Difference (if there is one) between spark.sql.shuffle.partitions and df.repartition

I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).
The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions that are used when shuffling data, while df.repartition seems to return a new DataFrame partitioned by the number of key specified.
To make this question clearer, here is a toy example of how I believe df.reparition and spark.sql.shuffle.partitions to work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
Scenario 1: 3 Shuffle Partitions, Reparition DF by ID:
If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then did df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
Scenario 2: 5 shuffle partitions, Reparititon DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
Is my understanding off base here? In general, my questions are:
I am trying to optimize my partitioning of a dataframe as to avoid
skew, but to have each partition hold as much of the same key
information as possible. How do I achieve that with set
spark.sql.shuffle.partitions and df.repartiton?
Is there a link
between set spark.sql.shuffle.partitions and df.repartition? If
so, what is that link?
Thanks!
I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
No
5 shuffle partitions, Reparititon DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
and no.
This is not how partitioning works. Partitioners map values to partitions, but mapping in general case is not unique (you can check How does HashPartitioner work? for a detailed explanation).
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Indeed there is. If you df.repartition, but don't provide number of partitions then spark.sql.shuffle.partitions is used.

Filter a large RDD by iterating over another large RDD - pySpark

I have a large RDD, call it RDD1, that is approximately 300 million rows after an initial filter. What I would like to do is take the ids from RDD1 and find all other instances of it in another large dataset, call it RDD2, that is approximately 3 billion rows. RDD2 is created by querying a parquet table that is stored in Hive as well as RDD1. The number of unique ids from RDD1 is approximately 10 million elements.
My approach is to currently collect the ids and broadcast them and then filter RDD2.
My question is - is there a more efficient way to do this? Or is this best practice?
I have the following code -
hiveContext = HiveContext(sc)
RDD1 = hiveContext("select * from table_1")
RDD2 = hiveContext.sql("select * from table_2")
ids = RDD1.map(lambda x: x[0]).distinct() # This is approximately 10 million ids
ids = sc.broadcast(set(ids.collect()))
RDD2_filter = RDD2.rdd.filter(lambda x: x[0] in ids.value))
I think it would be better to just use a single SQL statement to do the join:
RDD2_filter = hiveContext.sql("""select distinct t2.*
from table_1 t1
join table_2 t2 on t1.id = t2.id""")
What I would do is take the 300 milion ids from RDD1, construct a bloom filter (Bloom filter), use that as broadcast variable to filter RDD2 and you will get RDD2Partial that contains all key-value parits for key that are in RDD1, plus some false positives. If you expect the result to be within order of millions, than you will then be able to use normal operations like join,cogroup, etc. on RDD1 and RDD2Partial to obtain exact result without any problem.
This way you greatly reduce the time of the join operation if you expect the result to be of reasonable size, since the complexity remains the same. You might get some reasonable speedups (e.g. 2-10x) even if the result is within the order of hundreds of millions.
EDIT
The bloom filter can be collected efficiently since you can combine the bits set by one element with the bits set by another element with OR, which is associative and commutative.

Resources