Internals of Spark GroupBy and then Count - apache-spark

I read that in Apache Spark, groupBy is a wide transformation, meaning it requires a data shuffle. My question is: if I run df.groupBy(column).count(), will each partition first group and count the values within itself and then share the result with the other partitions, or will the rows with the same key first be transferred to a common partition, with the count then taking place on each partition?
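For what it is worth, calling explain() on such a query usually shows a partial aggregate before the exchange and a final aggregate after it, which corresponds to the first of the two options: each partition pre-counts its own rows and only the per-key partial counts are shuffled. Below is a minimal plain-Scala sketch of that two-phase idea. It does not use Spark; the partition layout, the object name and the helper names are made up for illustration, and String.hashCode stands in for Spark's own hash function.

```scala
object TwoPhaseCount {
  // Phase 1: each input partition counts its own keys locally (partial aggregation).
  def partialCounts(partition: Seq[String]): Map[String, Long] =
    partition.groupBy(identity).map { case (key, rows) => key -> rows.size.toLong }

  // Shuffle: route every (key, partialCount) pair to a target partition chosen by the key's hash.
  def shuffle(partials: Seq[Map[String, Long]], numPartitions: Int): Seq[Seq[(String, Long)]] = {
    val all = partials.flatMap(_.toSeq)
    (0 until numPartitions).map { p =>
      all.filter { case (key, _) => Math.floorMod(key.hashCode, numPartitions) == p }
    }
  }

  // Phase 2: each target partition sums the partial counts it received.
  def finalCounts(shuffled: Seq[Seq[(String, Long)]]): Map[String, Long] =
    shuffled.flatMap { part =>
      part.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2).sum }
    }.toMap

  def main(args: Array[String]): Unit = {
    // Three hypothetical input partitions holding the grouping column only.
    val partitions = Seq(Seq("a", "b", "a"), Seq("b", "b", "c"), Seq("a", "c"))
    val result = finalCounts(shuffle(partitions.map(partialCounts), numPartitions = 2))
    println(result) // counts per key: a = 3, b = 3, c = 2 (map ordering may vary)
  }
}
```

Only seven (key, count) pairs cross the simulated shuffle here instead of eight raw rows; with many repeated keys per partition the saving is much larger.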

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results into S3 using dataframe write with the query below. input_table_1 is 13 GB in size and input_table_2 is 1 MB.
input_table_1 has columns account, membership
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
Here the membership array column contains a list of member_ids.
Writing this dataset with a Spark dataframe generates around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count dropped to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused about how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to S3, but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation, conceptually:
df.groupBy(df.columns.map(col): _*).agg(Map.empty[String, String]).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
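To make the "more or less random" placement concrete, here is a small plain-Scala sketch (not Spark code). The Row type, the column values and the helper names are hypothetical, and hashCode modulo the partition count only stands in for Spark's actual hash partitioning. It counts how many target partitions the rows of one account touch when the grouping key is the account alone versus the whole row.

```scala
object HashShuffleSketch {
  // Hypothetical row type standing in for the query output.
  final case class Row(account: String, role: String, id: Int)

  // Target partition for a grouping key, in the spirit of a hash partitioner:
  // non-negative hash modulo the number of shuffle partitions.
  def partitionFor(key: Any, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Number of distinct target partitions the rows end up in for a given grouping key.
  def partitionsTouched[K](rows: Seq[Row], numPartitions: Int)(key: Row => K): Int =
    rows.map(r => partitionFor(key(r), numPartitions)).distinct.size

  def main(args: Array[String]): Unit = {
    // 1000 rows that all belong to the same account.
    val rows = (1 to 1000).map(i => Row("acct-1", "role-" + (i % 7), i))
    println(partitionsTouched(rows, 16)(_.account)) // same key everywhere, so one partition
    println(partitionsTouched(rows, 16)(r => r))    // whole row as key: typically many partitions
  }
}
```

Raising the number of shuffle partitions changes the modulus but not the fact that rows of one account are scattered, which is in line with the remark above.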
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, that is, your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run-length encoding). Parquet also uses various other compression strategies.
Unfortunately, in your case the data ends up pretty randomly distributed after the shuffle that removes duplicates. The original partitioning of input_table_1 was a much better fit.
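The effect of row order on run-length encoding can be illustrated with a few lines of plain Scala. This is only a sketch of the idea: Parquet's real encoders are more involved, and the object name and sample values are made up.

```scala
object RunLengthSketch {
  // Run-length encode a column: consecutive equal values collapse into (value, runLength).
  def encode[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((value, count) :: rest, next) if value == next => (value, count + 1) :: rest
      case (runs, next)                                    => (next, 1) :: runs
    }.reverse

  def main(args: Array[String]): Unit = {
    // Same nine values, once clustered and once interleaved as after a random shuffle.
    val clustered   = Seq("a", "a", "a", "b", "b", "b", "c", "c", "c")
    val interleaved = Seq("a", "b", "c", "a", "b", "c", "a", "b", "c")
    println(encode(clustered).size)   // 3 runs
    println(encode(interleaved).size) // 9 runs, no repetition to exploit
  }
}
```

The clustered column needs three (value, count) pairs, while the interleaved one needs as many pairs as it has rows, so the same data can take very different amounts of space depending on its order.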
Solutions
There's no single answer on how to solve this, but here are a few pointers on what I'd suggest doing next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As @matt-andruff pointed out below, you can also achieve this in SQL using CLUSTER BY. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group / shuffle the data just once in a meaningful way.

Spark collect_set vs distinct

If my goal is to collect distinct values in a column as a list, is there a performance difference or pros/cons using either of these?
df.select(column).distinct().collect()...
vs
df.select(collect_set(column)).first()...
collect_set is an aggregate function and normally follows a groupBy. When no grouping is provided, it treats the entire data as one big group.
1. collect_set
df.select(collect_set(column)).first()...
This will send all the data of column column to a single node, which will perform the collect_set operation (removing duplicates). If your data is big, it will swamp the single executor where all the data goes.
2. distinct
df.select(column).distinct().collect()...
This will partition all the data of column column based on its value (called the partition key); the number of partitions will be the value of spark.sql.shuffle.partitions (say 200). So 200 tasks will execute to remove duplicates, one for each partition. Then only the deduplicated data will be sent to the driver for the .collect() operation. This will fail if your data is still huge after removing duplicates, as it will cause the driver to run out of memory.
TLDR:
.distinct is better than .collect_set for your specific need
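The difference in data flow described above can be mimicked in plain Scala. This is a simplified model, not Spark code: the object and function names are hypothetical, and hashCode modulo the partition count stands in for the real shuffle.

```scala
object DistinctVsCollectSet {
  // collect_set without grouping: every value funnels into one set on a single task.
  def collectSetStyle(partitions: Seq[Seq[Int]]): Set[Int] =
    partitions.flatten.toSet

  // distinct: values are first hash-partitioned, so each task only deduplicates its own share.
  def distinctStyle(partitions: Seq[Seq[Int]], shufflePartitions: Int): Seq[Set[Int]] = {
    val all = partitions.flatten
    (0 until shufflePartitions).map { p =>
      all.filter(v => Math.floorMod(v.hashCode, shufflePartitions) == p).toSet
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 2, 3), Seq(3, 4, 4, 5), Seq(5, 6, 1))
    println(collectSetStyle(partitions).size)               // one task holds all 6 distinct values
    println(distinctStyle(partitions, 3).map(_.size).toList) // each of 3 tasks holds 2 values
  }
}
```

Both routes yield the same set of values; the difference is how much of it any single task has to hold before the driver collects the result.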

Spark DataFrame RangePartitioner

[New to Spark] Language - Scala
As per the docs, RangePartitioner sorts and divides the elements into chunks and distributes the chunks to different machines. How would it work for the example below?
Let's say we have a dataframe with 2 columns, and one column (say 'A') has continuous values from 1 to 1000. There is another dataframe with the same schema, but the corresponding column has only 4 values: 30, 250, 500, 900. (These could be any values, randomly selected from 1 to 1000.)
If I partition both using RangePartitioner,
df_a.repartitionByRange($"A")
df_b.repartitionByRange($"A")
how will the data from both dataframes be distributed across nodes?
Assume that the number of partitions is 5.
Also, if I know that the second DataFrame has fewer values, will reducing the number of partitions for it make any difference?
What I am struggling to understand is how Spark maps one partition of df_a to a partition of df_b and how it sends (if it does) both those partitions to the same machine for processing.
A very detailed explanation of how RangePartitioner works internally is described here
Specific to your question, RangePartitioner samples the RDD at runtime, collects the statistics, and only then are the ranges (limits) evaluated. Note that there are 2 parameters here: ranges (logical) and partitions (physical). The number of partitions can be affected by many factors: the number of input files, the number inherited from the parent RDD, 'spark.sql.shuffle.partitions' in case of shuffling, etc. The ranges are evaluated according to the sampling. In any case, RangePartitioner ensures every range is contained in a single partition.
how will the data from both the dataframes be distributed across nodes ? how Spark maps one partition of df_a to a partition of df_b
I assume you implicitly mean joining 'A' and 'B', otherwise the question does not make any sense. In that case, Spark would make sure to match partitions with ranges on both DataFrames, according to their statistics.
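As a rough illustration of the sampling step, the plain-Scala sketch below derives range bounds for the two columns from the question and looks up where a key would go. It is a simplification under stated assumptions: the whole column is used as the "sample", the bound selection is a naive equal-chunk split rather than Spark's actual algorithm, and all names are hypothetical.

```scala
object RangePartitionSketch {
  // Pick up to (numPartitions - 1) upper bounds from a sorted sample so that the
  // sample splits into roughly equal chunks. Duplicate bounds are dropped.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty) Vector.empty
    else (1 until numPartitions)
      .map(i => sorted(math.min(sorted.size - 1, i * sorted.size / numPartitions)))
      .distinct
      .toVector
  }

  // A key goes to the first range whose upper bound is >= the key,
  // or to the last partition if it exceeds every bound.
  def partitionOf(key: Int, bounds: Vector[Int]): Int = {
    val idx = bounds.indexWhere(key <= _)
    if (idx >= 0) idx else bounds.size
  }

  def main(args: Array[String]): Unit = {
    val boundsA = rangeBounds(1 to 1000, 5)               // Vector(201, 401, 601, 801)
    val boundsB = rangeBounds(Seq(30, 250, 500, 900), 5)  // Vector(30, 250, 500, 900)
    println(partitionOf(900, boundsA)) // 4
    println(partitionOf(900, boundsB)) // 3
  }
}
```

In this simplified model each DataFrame derives its bounds from its own data, so the same key (900 here) can receive a different partition index on each side, and with only four distinct values one of the five partitions of df_b stays empty.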

Spark DataFrame Repartition and Parquet Partition

1) I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files are not same with the no. of Rdd partitions. Is there no correlation between rdd partitions and parquet partitions?
2) When I write the data to parquet partition and I use Rdd repartition and then I read the data from parquet partition , is there any condition when the rdd partition numbers will be same during read / write?
3) How is bucketing a dataframe using a column id and repartitioning a dataframe via the same column id different?
4) While considering the performance of joins in Spark should we be looking at bucketing or repartitioning (or maybe both)
There are a couple of things you're asking about here: partitioning, bucketing and balancing of data.
Partitioning:
Partitioning data is often used for distributing load horizontally; this has a performance benefit and helps in organizing data in a logical fashion.
Partitioning tables changes how persisted data is structured: subdirectories are created that reflect the partitioning structure.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.
In Spark, this is done by df.write.partitionBy(column*), which groups data by the partitioning columns into the same subdirectory.
Bucketing:
Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data is hashed into a user-defined number of buckets (files).
Synonymous with Hive's Distribute By.
In Spark, this is done by df.write.bucketBy(n, column*), which groups data by the bucketing columns into the same file. The number of files generated is controlled by n.
Repartition:
It returns a new DataFrame partitioned by the given partitioning expressions into the given number of in-memory partitions. The resulting DataFrame is hash partitioned.
Spark manages data on these partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.
In Spark, this is done by df.repartition(n, column*), which groups data by the partitioning columns into the same in-memory partition. Note that no data is persisted to storage; this is just internal balancing of data based on constraints similar to bucketBy.
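One way to see why the number of files written need not equal the number of in-memory partitions is the following plain-Scala model. It is a sketch under the assumption that, with a directory-per-value layout, every write task emits one file for each distinct partition-column value it holds; the object name, function names and country codes are made up, and hashCode modulo n stands in for Spark's hash partitioning.

```scala
object WriteLayoutSketch {
  // Files written with a directory-per-value layout: every write task emits
  // one file for each distinct partition-column value it holds.
  def filesWithPartitionBy(tasks: Seq[Seq[String]]): Int =
    tasks.map(_.distinct.size).sum

  // In-memory hash repartitioning on the column: rows with the same value share a task.
  def repartitionByColumn(tasks: Seq[Seq[String]], n: Int): Seq[Seq[String]] = {
    val all = tasks.flatten
    (0 until n).map(p => all.filter(v => Math.floorMod(v.hashCode, n) == p))
  }

  def main(args: Array[String]): Unit = {
    // Three tasks, each holding a mix of values of the partition column.
    val tasks = Seq(Seq("US", "DE", "FR"), Seq("US", "DE"), Seq("FR", "US"))
    println(filesWithPartitionBy(tasks))                         // 7 files from 3 tasks
    println(filesWithPartitionBy(repartitionByColumn(tasks, 4))) // 3 files, one per value
  }
}
```

In this model, repartitioning on the same column before the write brings the file count down to one per value, whatever number of in-memory partitions was requested.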
Tl;dr
1) I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files are not same with the no. of Rdd partitions. Is there no correlation between rdd partitions and parquet partitions?
repartition correlates with bucketBy, not partitionBy. The number of partitioned files is governed by other configs like spark.sql.shuffle.partitions and spark.default.parallelism.
2) When I write the data to parquet partition and I use Rdd repartition and then I read the data from parquet partition , is there any condition when the rdd partition numbers will be same during read / write?
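At read time, the number of partitions does not depend on how the data was repartitioned before the write; it is derived from the number and size of the files together with settings such as spark.sql.files.maxPartitionBytes and spark.default.parallelism, so the two counts only match by coincidence.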
3) How is bucketing a dataframe using a column id and repartitioning a dataframe via the same column id different?
They work similarly, except that bucketing is a write operation and is used for persistence.
4) While considering the performance of joins in Spark should we be looking at bucketing or repartitioning (or maybe both)
Repartitioning both datasets happens in memory; if one or both of the datasets are persisted, then look into bucketBy as well.

Apache Spark RDD sortByKey algorithm and time complexity

What is the Big-O time complexity for Apache Spark RDD sortByKey?
I am trying to assign row numbers to an RDD based on a particular order.
Say I have a {K,V} pair RDD and I wish to perform an order by key using
myRDD.sortByKey(true).zipWithIndex
What is the time complexity for this operation, in big-O form?
And what is happening under-the-covers? Bubble sort? I hope not! My dataset is very large and runs across partitions, so I'm curious whether the sortByKey function is optimal, or does some kind of intermediate data structure within a partition and then something else across partitions to optimize message passing, or what.
A quick look at the code shows that a RangePartitioner is being used under the covers. The docs say:
partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in
So in essence your data is sampled (O(n)), then only the unique sample keys (m) are sorted (O(m log m)) and the ranges of keys determined, then the entire data is shuffled around (O(n), but costly), then the data is sorted internally for the range of keys received on a given partition (O(p log p)).
zipWithIndex probably uses local sizes to assign numbers, using the partition number, so it is likely that partition metadata is stored for this effect:
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
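The steps above (range shuffle, local sort, then numbering by partition offsets) can be sketched in plain Scala as follows. This is an illustration only, not Spark's implementation: the range bounds are assumed to be already known from sampling, and the object and function names are hypothetical.

```scala
object SortAndIndexSketch {
  // Shuffle keys into ranges given precomputed upper bounds, then sort each range locally.
  def sortByRange(keys: Seq[Int], bounds: Vector[Int]): Vector[Vector[Int]] = {
    val buckets = keys.groupBy { k =>
      val idx = bounds.indexWhere(k <= _)
      if (idx >= 0) idx else bounds.size
    }
    (0 to bounds.size).map(p => buckets.getOrElse(p, Seq.empty[Int]).sorted.toVector).toVector
  }

  // zipWithIndex-style numbering: the offset of a partition is the total size
  // of all partitions before it; items are then numbered locally from that offset.
  def zipWithGlobalIndex(partitions: Vector[Vector[Int]]): Vector[(Int, Long)] = {
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).flatMap { case (part, offset) =>
      part.zipWithIndex.map { case (k, i) => (k, offset + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(42, 7, 19, 3, 88, 61, 25)
    val parts = sortByRange(keys, bounds = Vector(20, 60))
    println(parts)                     // three locally sorted ranges
    println(zipWithGlobalIndex(parts)) // keys in global order with indices 0 to 6
  }
}
```

Because the ranges do not overlap and are ordered, concatenating the locally sorted partitions already gives the global order, so only the partition sizes are needed to assign row numbers.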
