How do I express efficient JOINs by fusing SHUFFLE and COMBINE? - apache-spark

I apologize in advance for the long read. Your patience is appreciated.
I have two rdds A and B, and a reducer class.
A contains two columns (key, a_value). A is keyed - no key in the key column occurs more than once.
B has two columns (key, b_value) - but it is not keyed: the key field can be repeated multiple times, and a few keys could be heavily skewed - let's call them hot keys.
The reducer is constructed using the a_value and can consume b_values of the corresponding key in any order. Finally, the reducer can produce a c_value that represents the reduction.
Example usage of the reducer looks like this (pseudo-code):
reducer = construct_reducer(a_value)
for b_value in b_values:
    reducer.consume(b_value)
c_value = reducer.result()
I want to use the tables A and B to produce a table C that contains two columns (key, c_value).
This can be done trivially by:
1. calling reduceByKey on B to produce an rdd with two columns (key, list[b_values])
2. joining with A on key
3. executing the code block above to produce the table with columns (key, c_value) - roughly as sketched below
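Spelled out, that trivial pipeline looks roughly like this (a sketch only - construct_reducer, consume and result are the reducer interface from above, and I use groupByKey here since that is what actually materialises the per-key list of b_values):

# Sketch of the trivial approach: materialise all b_values per key, then join and reduce.
grouped = B.groupByKey()        # (key, iterable of b_values) - hot keys blow up here
joined = A.join(grouped)        # (key, (a_value, b_values))

def reduce_group(pair):
    a_value, b_values = pair
    reducer = construct_reducer(a_value)
    for b_value in b_values:
        reducer.consume(b_value)
    return reducer.result()

C = joined.mapValues(reduce_group)   # (key, c_value)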
The problem with that approach is that the hot keys in B cause reduceByKey to OOM even at very high executor memory.
Ideally, what I would want to do is join and immediately reducer.consume the b_values from B into A - the reducer being constructed from the a_values. Another way to think of this is to use aggregateByKey, but with a different zeroValue per key, obtained from another rdd.
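Conceptually, something like the following (a rough sketch of what I mean, assuming A is small enough to broadcast and that two partially filled reducers could be merged via a hypothetical merge() method - neither of which may hold in practice):

# Rough sketch of the fused shuffle+combine I have in mind.
# ASSUMPTIONS: A fits in a broadcast variable, and the reducer exposes a
# hypothetical merge() so map-side combiners can be merged.
a_values = sc.broadcast(dict(A.collect()))   # key -> a_value

def create_combiner(kv):
    key, b_value = kv
    reducer = construct_reducer(a_values.value[key])
    reducer.consume(b_value)
    return reducer

def merge_value(reducer, kv):
    reducer.consume(kv[1])
    return reducer

def merge_combiners(r1, r2):
    r1.merge(r2)        # hypothetical: requires partial reductions to be mergeable
    return r1

C = (B.map(lambda kv: (kv[0], kv))   # keep the key inside the value for the broadcast lookup
      .combineByKey(create_combiner, merge_value, merge_combiners)
      .mapValues(lambda r: r.result()))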
How do I express that? (I looked at cogroup and aggregateByKey without any luck)
Thanks for reading!

Related

drop_duplicates after unionByName

I am trying to stack two dataframes (with unionByName()) and, then, dropping duplicate entries (with drop_duplicates()).
Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a dataframe whose first N rows are df1's? Because, if so, when applying drop_duplicates(), df1's rows would always be preserved, which is the behaviour I want.
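For reference, the pattern I have in mind is roughly this (a sketch with a hypothetical ID column):

# Hypothetical sketch of the intended pattern: stack, then deduplicate on ID,
# hoping that df1's version of each duplicate survives.
stacked = df1.unionByName(df2)
deduped = stacked.drop_duplicates(["ID"])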
unionByName() does not guarantee that your records from df1 will come before those from df2. These are distributed, parallel tasks, so you definitely can't build on row order.
A solution is to add a technical priority column to each DataFrame, unionByName() them, and use the row_number() window function partitioned by the ID and ordered by priority, keeping only the row with the highest priority (in the example below, priority 1 outranks priority 2).
Take a look at the Scala code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

df1WithPriority
  .unionByName(df2WithPriority)
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  .where(col("row_num") === lit(1))

Perform multiple aggregations on a spark dataframe in one pass instead of multiple slow joins

At the moment I have 9 functions which do specific calculations on a data frame - average balance per month, rolling P&L, period start balances, ratio calculations, and so on.
Each of those functions produces the following: the first columns are the group-by columns which the function accepts, and the final column is the calculated statistic.
In other words, each function produces a Spark data frame that has the same group-by variables as its first columns (1 column if there is a single group-by variable, 2 columns if there are two, etc.) plus 1 column whose values are the specific calculation - examples of which I listed at the beginning.
Because each of those functions does a different calculation, I need to produce a data frame for each one and then join them all to produce a report.
I join them on the group-by variables because those are common to all of them (each individual statistic report).
But doing 7-8 or even more joins is very slow.
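Schematically, the current flow looks like this (hypothetical function and column names):

# Hypothetical names: one DataFrame per statistic, then a chain of joins
# on the shared group-by columns.
group_cols = ["account_id", "month"]

avg_balance_df = avg_balance_per_month(df, group_cols)
rolling_pnl_df = rolling_pnl(df, group_cols)
ratio_df = ratio_calc(df, group_cols)
# ... 6 more statistic DataFrames ...

report = (avg_balance_df
          .join(rolling_pnl_df, on=group_cols)
          .join(ratio_df, on=group_cols))
# ... 6 more joins, each one slow ...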
Is there a way to add those columns together without using join?
Thank you.
I can think of multiple approaches, but this looks like a good use case for the new pandas UDF Spark API.
You can define one grouped-map UDF. The UDF will receive each group as a pandas DataFrame. You apply the 9 aggregate functions on the group and return a pandas DataFrame with 9 additional aggregated columns. Spark will combine the returned pandas DataFrames back into one large Spark DataFrame.
e.g.
from pyspark.sql.functions import pandas_udf, PandasUDFType

# given you want to aggregate average and ratio
@pandas_udf("month long, avg double, ratio double", PandasUDFType.GROUPED_MAP)
def compute(pdf):
    # pdf is a pandas.DataFrame holding all rows of one group
    # compute_avg / compute_ratio are your existing aggregation helpers
    pdf['avg'] = compute_avg(pdf)
    pdf['ratio'] = compute_ratio(pdf)
    # return only the columns declared in the schema above
    return pdf[['month', 'avg', 'ratio']]

df.groupby("month").apply(compute).show()
See Pandas-UDF#Grouped-Map
If your cluster is on a lower Spark version, you have 2 options:
Stick to the DataFrame API and write custom aggregate functions. See this answer. They have a horrible API, but usage would look like this:
df.groupBy(df.month).agg(
    my_avg_func(df.amount).alias('avg'),
    my_ratio_func(df.amount).alias('ratio'),
)
Fall back to the good ol' RDD map-reduce API:
# pseudocode: the accumulator type differs from the record type,
# so aggregateByKey fits better than reduceByKey here
def _create_tuple_key(record):
    return (record.month, record)

def _compute_stats(acc, record):
    acc['count'] += 1
    acc['avg'] = _accumulate_avg(acc['count'], record)
    acc['ratio'] = _accumulate_ratio(acc['count'], record)
    return acc

def _merge_stats(acc1, acc2):
    # combine two partial accumulators coming from different partitions
    ...

zero = {'count': 0, 'avg': 0.0, 'ratio': 0.0}
df.rdd.map(_create_tuple_key).aggregateByKey(zero, _compute_stats, _merge_stats)

Spark groupBy vs repartition plus mapPartitions

My dataset is ~20 million rows and takes ~8 GB of RAM. I'm running my job with 2 executors, 10 GB RAM per executor, 2 cores per executor. Due to further transformations, the data should be cached all at once.
I need to reduce duplicates based on 4 fields (keeping any one of the duplicates). Two options: using groupBy, or using repartition and mapPartitions. The second approach allows you to specify the number of partitions, so it could perform faster in some cases, right?
Could you please explain which option has better performance? Do both options have the same RAM consumption?
Using groupBy
dataSet
    .groupBy(col1, col2, col3, col4)
    .agg(
        last(col5),
        ...
        last(col17)
    );
Using repartition and mapPartitions
dataSet.sqlContext().createDataFrame(
    dataSet
        .repartition(parallelism, seq(asList(col1, col2, col3, col4)))
        .toJavaRDD()
        .mapPartitions(DatasetOps::reduce),
    SCHEMA
);

private static Iterator<Row> reduce(Iterator<Row> itr) {
    Comparator<Row> comparator = (row1, row2) -> Comparator
        .comparing((Row r) -> r.getAs(name(col1)))
        .thenComparing((Row r) -> r.getAs(name(col2)))
        .thenComparingInt((Row r) -> r.getAs(name(col3)))
        .thenComparingInt((Row r) -> r.getAs(name(col4)))
        .compare(row1, row2);

    List<Row> list = StreamSupport
        .stream(Spliterators.spliteratorUnknownSize(itr, Spliterator.ORDERED), false)
        .collect(collectingAndThen(toCollection(() -> new TreeSet<>(comparator)), ArrayList::new));

    return list.iterator();
}
The second approach allows you to specify num of partitions, and could perform faster because of this in some cases, right?
Not really. Both approaches allow you to specify the number of partitions - in the first case through spark.sql.shuffle.partitions
spark.conf.set("spark.sql.shuffle.partitions", parallelism)
However the second approach is inherently less efficient if duplicates are common, as it shuffles first, and reduces later, skipping map-side reduction (in other words it is yet another group-by-key). If duplicates are rare, this won't make much difference though.
On a side note, Dataset already provides dropDuplicates variants which take a set of columns, and first / last are not particularly meaningful here (see the discussion in How to select the first row of each group?).
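For instance, in PySpark syntax (a minimal sketch assuming the same four key columns):

# keep an arbitrary row per (col1, col2, col3, col4) combination
deduplicated = dataSet.dropDuplicates(["col1", "col2", "col3", "col4"])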

Spark Dataset: how to change alias of the columns after a flatmap?

I have two Spark datasets that I'm trying to join. The join keys are nested in dataset A, so I must flatMap them out first before joining with dataset B. The problem is that as soon as I flatMap that field, the column names become the defaults "_1", "_2", etc. Is it possible to change the alias somehow?
A.flatMap(a => a.keys).join(B).where(...)
After applying a transformation like flatMap or map, you lose the column names, which is logical: such transformations do not guarantee that the number of columns or the datatype of each column stays the same. That's why the column names are lost there.
What you can do is fetch the previous column names and reapply them to the dataset, like this:
val columns = A.columns
A.flatMap(a => a.keys).toDF(columns: _*).join(B).where(...)
This will only work if the number of columns is the same after applying the flatMap.
Hope this clears your issue
Thanks

Filter a large RDD by iterating over another large RDD - pySpark

I have a large RDD, call it RDD1, that is approximately 300 million rows after an initial filter. What I would like to do is take the ids from RDD1 and find all other instances of them in another large dataset, call it RDD2, which is approximately 3 billion rows. Both RDD1 and RDD2 are created by querying parquet tables stored in Hive. The number of unique ids in RDD1 is approximately 10 million.
My approach is to currently collect the ids and broadcast them and then filter RDD2.
My question is - is there a more efficient way to do this? Or is this best practice?
I have the following code -
hiveContext = HiveContext(sc)
RDD1 = hiveContext.sql("select * from table_1")
RDD2 = hiveContext.sql("select * from table_2")
ids = RDD1.rdd.map(lambda x: x[0]).distinct()  # This is approximately 10 million ids
ids = sc.broadcast(set(ids.collect()))
RDD2_filter = RDD2.rdd.filter(lambda x: x[0] in ids.value)
I think it would be better to just use a single SQL statement to do the join:
RDD2_filter = hiveContext.sql("""select distinct t2.*
from table_1 t1
join table_2 t2 on t1.id = t2.id""")
What I would do is take the 300 million ids from RDD1, construct a Bloom filter from them, and use it as a broadcast variable to filter RDD2. You will get RDD2Partial, which contains all key-value pairs for keys that are in RDD1, plus some false positives. If you expect the result to be within the order of millions, you will then be able to use normal operations like join, cogroup, etc. on RDD1 and RDD2Partial to obtain the exact result without any problem.
This way you greatly reduce the time of the join operation if you expect the result to be of reasonable size, since the complexity remains the same. You might get some reasonable speedups (e.g. 2-10x) even if the result is within the order of hundreds of millions.
EDIT
The bloom filter can be collected efficiently since you can combine the bits set by one element with the bits set by another element with OR, which is associative and commutative.
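A sketch of that approach, assuming the pybloom package (any Bloom filter implementation with add, membership test, and union would do) and the ~10 million distinct ids mentioned in the question:

from pybloom import BloomFilter

def add_key(bf, row):
    bf.add(row[0])
    return bf

# Build one filter per partition, then OR them together (union is associative
# and commutative, so the order in which partial filters are merged doesn't matter).
ids_filter = RDD1.rdd.map(lambda x: x[0]).aggregate(
    BloomFilter(capacity=15000000, error_rate=0.001),  # ~10M distinct ids plus headroom
    add_key,
    lambda bf1, bf2: bf1.union(bf2),
)

bf_broadcast = sc.broadcast(ids_filter)

# Keep only rows whose key is (probably) in RDD1; expect a few false positives.
RDD2_partial = RDD2.rdd.filter(lambda x: x[0] in bf_broadcast.value)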
