My dataset is ~20 million rows and takes ~8 GB of RAM. I'm running my job with 2 executors, 10 GB RAM per executor, 2 cores per executor. Due to further transformations, the data should be cached all at once.
I need to reduce duplicates based on 4 fields (keeping any one of the duplicates). Two options: using groupBy, and using repartition and mapPartitions. The second approach allows you to specify the number of partitions, and could perform faster because of this in some cases, right?
Could you please explain which option has better performance? Do both options have the same RAM consumption?
Using groupBy
dataSet
    .groupBy(col1, col2, col3, col4)
    .agg(
        last(col5),
        ...
        last(col17)
    );
Using repartition and mapPartitions
dataSet.sqlContext().createDataFrame(
    dataSet
        .repartition(parallelism, seq(asList(col1, col2, col3, col4)))
        .toJavaRDD()
        .mapPartitions(DatasetOps::reduce),
    SCHEMA
);
private static Iterator<Row> reduce(Iterator<Row> itr) {
    Comparator<Row> comparator = (row1, row2) -> Comparator
        .comparing((Row r) -> r.getAs(name(col1)))
        .thenComparing((Row r) -> r.getAs(name(col2)))
        .thenComparingInt((Row r) -> r.getAs(name(col3)))
        .thenComparingInt((Row r) -> r.getAs(name(col4)))
        .compare(row1, row2);

    // The TreeSet keeps a single Row per distinct (col1, col2, col3, col4) key.
    List<Row> list = StreamSupport
        .stream(Spliterators.spliteratorUnknownSize(itr, Spliterator.ORDERED), false)
        .collect(collectingAndThen(toCollection(() -> new TreeSet<>(comparator)), ArrayList::new));

    return list.iterator();
}
The second approach allows you to specify the number of partitions, and could perform faster because of this in some cases, right?
Not really. Both approaches allow you to specify the number of partitions - in the first case through spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.shuffle.partitions", parallelism)
However, the second approach is inherently less efficient when duplicates are common, because it shuffles first and reduces later, skipping map-side reduction (in other words, it is yet another group-by-key). If duplicates are rare, this won't make much difference.
On a side note, Dataset already provides dropDuplicates variants that take a set of columns, and first / last is not particularly meaningful here (see the discussion in How to select the first row of each group?).
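For example, a minimal sketch of that dropDuplicates route, shown here in PySpark syntax with the column names from the question (the Java/Scala Dataset API has equivalent overloads taking column names):
# Keeps an arbitrary row per distinct (col1, col2, col3, col4) combination;
# Spark plans this as an aggregation, so map-side partial reduction still applies.
deduplicated = dataSet.dropDuplicates(["col1", "col2", "col3", "col4"])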
Related
I apologize in advance for the long read. Your patience is appreciated.
I have two rdds A and B, and a reducer class.
A contains two columns (key, a_value). A is keyed - no key in the key column occurs more than once.
B has two columns (key, b_value), but is not keyed: the key field can be repeated multiple times, and a few keys can be heavily skewed - let's call them hot keys.
The reducer is constructed using the a_value and can consume b_values of the corresponding key in any order. Finally, the reducer can produce a c_value that represents the reduction.
Example usage of the reducer looks like this (pseudo-code):
reducer = construct_reducer(a_value)
for b_value in b_values:
    reducer.consume(b_value)
c_value = reducer.result()
I want to use the tables A and B to produce a table C that contains two columns (key, c_value).
This can be done trivially by:
1. calling reduceByKey on B to produce an RDD with two columns (key, list[b_values])
2. joining with A on key
3. executing the code block above to produce the table with columns (key, c_value)
The problem with that approach is that the hot keys in B cause reduceByKey to OOM even at very high executor memory.
Ideally what I would want to do, is to join and immediately reducer.consume b_values from B into A - the reducer is constructed from a_values. Another way to think of this is to use aggregateByKey but with different zeroValues obtained from another rdd.
How do I express that? (I looked at cogroup and aggregateByKey without any luck)
Thanks for reading!
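One way to express this, sketched below in PySpark under a few assumptions, is the classic secondary-sort pattern: tag each record, partition by the key only, sort within each partition by (key, tag) so that the single a_value record of a key precedes its b_values, and stream every group through the reducer. construct_reducer / consume / result are assumed to behave exactly as in the pseudo-code above, every key in B is assumed to also appear in A, and num_partitions is a placeholder to tune. Hot keys still land in a single partition, but their b_values are consumed one at a time instead of being collected into list[b_values], which is what blows up reduceByKey.
from itertools import groupby
from pyspark.rdd import portable_hash

num_partitions = 400                      # placeholder, tune for your data

def reduce_partition(records):
    # records arrive as ((key, tag), value), sorted by (key, tag), so all rows of
    # one key are contiguous and its single "a" record comes first ("a" < "b")
    for key, group in groupby(records, key=lambda kv: kv[0][0]):
        reducer = None
        for (_, tag), value in group:
            if tag == "a":
                reducer = construct_reducer(value)   # per-key "zero value" from A
            else:
                reducer.consume(value)               # streamed, never collected
        yield key, reducer.result()

tagged = (A.map(lambda kv: ((kv[0], "a"), kv[1]))
           .union(B.map(lambda kv: ((kv[0], "b"), kv[1]))))

C = (tagged
     .repartitionAndSortWithinPartitions(
         numPartitions=num_partitions,
         partitionFunc=lambda composite_key: portable_hash(composite_key[0]))
     .mapPartitions(reduce_partition))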
I have two large Spark DataFrames. I joined them on one common column as:
df_joined = df1.join(df2.select("id",'label'), "id")
I got the result, but when I want to work with df_joined, it's too slow. As far as I know, we need to repartition df1 and df2 to prevent a large number of partitions in df_joined. So I even changed the number of partitions,
df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")
still NOT working.
Any idea?
Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).
If your cluster has 20 cores, you should have at least 20 partitions (in practice 2-3x more). On the other hand, a single partition typically shouldn't contain more than 128 MB.
So, instead of the two lines below, which repartition your DataFrames into a single partition:
df1r = df1.repartition(1)
df2r = df2.repartition(1)
repartition your data on the 'id' column (the joining key) into n partitions, where n depends on the data size and the number of cores in the cluster:
df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
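Putting it together, a small sketch (the sizing values are assumptions, not from the answer) that derives n from the cluster and then joins the co-partitioned DataFrames:
cores = sc.defaultParallelism          # roughly the total cores available to the job, e.g. 20
n = cores * 3                          # 2-3x the core count, per the rule of thumb above

df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
df_joined = df1r.join(df2r.select("id", "label"), "id")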
I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).
The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions that are used when shuffling data, while df.repartition returns a new DataFrame partitioned by the number of partitions or the column expressions specified.
To make this question clearer, here is a toy example of how I believe df.repartition and spark.sql.shuffle.partitions work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
Scenario 1: 3 Shuffle Partitions, Repartition DF by ID:
If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then did df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
Scenario 2: 5 Shuffle Partitions, Repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
Is my understanding off base here? In general, my questions are:
I am trying to optimize the partitioning of a dataframe so as to avoid skew, but to have each partition hold as much of the same key information as possible. How do I achieve that with set spark.sql.shuffle.partitions and df.repartition?
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Thanks!
I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
No
5 Shuffle Partitions, Repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
and no.
This is not how partitioning works. Partitioners map values to partitions, but the mapping is, in the general case, not unique - several keys can be hashed into the same partition, and some partitions can stay empty (you can check How does HashPartitioner work? for a detailed explanation).
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Indeed there is. If you call df.repartition with columns but don't provide a number of partitions, then spark.sql.shuffle.partitions is used.
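To make this concrete, here is a small illustrative sketch (using the toy data above; the output shown is hypothetical) demonstrating that repartitioning by ID hashes keys into spark.sql.shuffle.partitions buckets, so different IDs can share a partition and some partitions can stay empty:
spark.conf.set("spark.sql.shuffle.partitions", "3")

df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 5), ("A", 7), ("B", 9), ("B", 3), ("C", 2)],
    ["ID", "Val"],
)

(df.repartition("ID")
   .rdd
   .glom()                                          # one list of rows per partition
   .map(lambda rows: sorted({r["ID"] for r in rows}))
   .collect())
# e.g. [[], ['A', 'C'], ['B']] - the layout depends on the hash,
# not on a one-key-per-partition rule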
In a parquet data lake partitioned by year and month, with spark.default.parallelism set to, say, 4, let's say I want to create a DataFrame comprised of months 11-12 of 2017 and months 1-3 of 2018, from two sources A and B.
df = spark.read.parquet(
    "A.parquet/_YEAR={2017}/_MONTH={11,12}",
    "A.parquet/_YEAR={2018}/_MONTH={1,2,3}",
    "B.parquet/_YEAR={2017}/_MONTH={11,12}",
    "B.parquet/_YEAR={2018}/_MONTH={1,2,3}",
)
If I get the number of partitions, Spark used spark.default.parallelism as default:
df.rdd.getNumPartitions()
Out[4]: 4
Taking into account that after creating df I need to perform join and groupBy operations over each period, and that data is more or less evenly distributed over each one (around 10 million rows per period):
Question
Will a repartition improve the performance of my subsequent operations?
If so, if I have 10 different periods (5 per year in both A and B), should I repartition by the number of periods and explicitly reference the columns to repartition (df.repartition(10,'_MONTH','_YEAR'))?
Will a repartition improve the performance of my subsequent operations?
Typically it won't. The only reason to preemptively repartition data is to avoid further shuffling when the same Dataset is used for multiple joins based on the same condition.
If so, if I have 10 different periods (5 per year in both A and B), should I repartition by the number of periods and explicitly reference the columns to repartition (df.repartition(10,'_MONTH','_YEAR'))?
Let's go step-by-step:
should I repartition by the number of periods
Partitioners don't guarantee a 1:1 relationship between levels (distinct key values) and partitions, so the only thing to remember is that you cannot have more non-empty partitions than unique keys; using a significantly larger value doesn't make sense.
and explicitly reference the columns to repartition
If you repartition and subsequently join or groupBy, using the same set of columns for both parts is the only sensible solution.
Summary
Repartitioning before a join makes sense in two scenarios:
In case of multiple subsequent joins
df_ = df.repartition(10, "foo", "bar")
df_.join(df1, ["foo", "bar"])
...
df_.join(df2, ["foo", "bar"])
With a single join, when the desired number of output partitions is different from spark.sql.shuffle.partitions (and there is no broadcast join):
spark.conf.get("spark.sql.shuffle.partitions")
# 200
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
df1_ = df1.repartition(11, "foo", "bar")
df2_ = df2.repartition(11, "foo", "bar")
df1_.join(df2_, ["foo", "bar"]).rdd.getNumPartitions()
# 11
df1.join(df2, ["foo", "bar"]).rdd.getNumPartitions()
# 200
which might be preferable over:
spark.conf.set("spark.sql.shuffle.partitions", 11)
df1.join(df2, ["foo", "bar"]).rdd.getNumPartitions()
spark.conf.set("spark.sql.shuffle.partitions", 200)
I have a large RDD, call it RDD1, that is approximately 300 million rows after an initial filter. What I would like to do is take the ids from RDD1 and find all other instances of them in another large dataset, call it RDD2, that is approximately 3 billion rows. RDD2 is created by querying a parquet table stored in Hive, as is RDD1. The number of unique ids from RDD1 is approximately 10 million.
My current approach is to collect the ids, broadcast them, and then filter RDD2.
My question is - is there a more efficient way to do this? Or is this best practice?
I have the following code -
hiveContext = HiveContext(sc)
RDD1 = hiveContext.sql("select * from table_1")
RDD2 = hiveContext.sql("select * from table_2")
ids = RDD1.map(lambda x: x[0]).distinct() # This is approximately 10 million ids
ids = sc.broadcast(set(ids.collect()))
RDD2_filter = RDD2.rdd.filter(lambda x: x[0] in ids.value)
I think it would be better to just use a single SQL statement to do the join:
RDD2_filter = hiveContext.sql("""select distinct t2.*
from table_1 t1
join table_2 t2 on t1.id = t2.id""")
What I would do is take the 300 million ids from RDD1, construct a Bloom filter (Bloom filter), use that as a broadcast variable to filter RDD2, and you will get RDD2Partial, which contains all key-value pairs for keys that are in RDD1, plus some false positives. If you expect the result to be within the order of millions, then you will be able to use normal operations like join, cogroup, etc. on RDD1 and RDD2Partial to obtain the exact result without any problem.
This way you greatly reduce the time of the join operation if you expect the result to be of reasonable size, since the complexity remains the same. You might get some reasonable speedups (e.g. 2-10x) even if the result is within the order of hundreds of millions.
EDIT
The bloom filter can be collected efficiently since you can combine the bits set by one element with the bits set by another element with OR, which is associative and commutative.
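For illustration, here is a rough sketch (not part of the original answer) of building the filter with a distributed aggregate and then using it to pre-filter RDD2. It assumes a pybloom-style BloomFilter exposing add(), union() and the in operator, sized for the ~10 million distinct ids mentioned in the question; adapt it to whichever implementation you actually use.
from pybloom import BloomFilter

def add_id(bf, row_id):
    bf.add(row_id)          # adding is idempotent, so duplicate ids are harmless
    return bf

def merge_filters(bf1, bf2):
    # OR of the underlying bit arrays: associative and commutative, so Spark can
    # merge the per-partition filters in any order.
    return bf1.union(bf2)

bloom = (RDD1.rdd
             .map(lambda x: x[0])
             .aggregate(BloomFilter(capacity=10 * 1000 * 1000, error_rate=0.01),
                        add_id,
                        merge_filters))

bloom_b = sc.broadcast(bloom)
RDD2_partial = RDD2.rdd.filter(lambda x: x[0] in bloom_b.value)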