Spark's dataframe count() function taking very long - apache-spark

In my code, I have a sequence of dataframes where I want to filter out the dataframes which are empty. I'm doing something like:
Seq(df1, df2).map(df => df.count() > 0)
However, this is taking extremely long, around 7 minutes for just 2 dataframes of approximately 100k rows each.
My question: why is Spark's implementation of count() so slow? Is there a workaround?

count() itself is an action, but everything that builds your dataframe is lazy. So it does not matter how big your dataframe is; what matters is how many costly operations are needed to produce it, because once count() is called Spark actually performs all of those operations.
Some of those costly operations may be operations which need shuffling of data, like groupBy, reduce, etc.
So my guess is you have some complex processing to get these dataframes, or the initial data you used to build them is very large.
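If all you really need is an emptiness check rather than an exact row count, a cheaper pattern is to ask for a single row instead of counting everything; a minimal sketch reusing the question's df1 and df2 (any upstream shuffles still execute, but Spark only needs to find one row):
// isEmpty exists on Dataset since Spark 2.4; on older versions head(1).isEmpty does the same
val nonEmpty = Seq(df1, df2).filter(df => !df.head(1).isEmpty)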

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results to S3 using a dataframe write with the query below, where input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has columns account, membership and
input_table_2 has columns role, id, membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership array contains a list of member_ids.
This dataset write using Spark dataframe is generating around 1.1 TiB of data in S3, with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count was reduced to almost 1/3rd of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused as to how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to s3 but it did not work.
Please suggest whether this is expected and what can be done?
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (note: in the logical plan, distinct shows up as Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns.map(col): _*).agg(Map.empty[String, String]) (with col from org.apache.spark.sql.functions).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
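You can see this translation for yourself by printing the plans of a distinct; a minimal sketch, assuming df is any dataframe:
// the analyzed logical plan shows Deduplicate, the optimized plan the aggregate,
// and the physical plan a HashAggregate plus an Exchange hashpartitioning over all columns (the shuffle)
df.distinct().explain(true)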
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, meaning your data is organised in columns rather than row by row. This allows for powerful compression if data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and note the number of repetitions (a strategy called run-length encoding). But Parquet also uses various other compression strategies.
Unfortunately, in your case the data ends up distributed pretty randomly after shuffling to remove duplicates. The original partitioning of input_table_1 was a much better fit.
Solutions
There's no single answer for how to solve this, but here are a few pointers I'd suggest trying next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle! (See the sketch after this list.)
As #matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as a Spark Aggregator and group / shuffle the data just once, in a meaningful way.
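A minimal sketch of the second pointer, assuming result is the joined dataframe from the query above; the output path is a placeholder:
import org.apache.spark.sql.functions.col

result
  .distinct()
  .repartition(col("account"))            // bring related rows back onto the same partitions
  .sortWithinPartitions("account", "id")  // secondary ordering helps run-length/dictionary encoding
  .write
  .parquet("s3://bucket/deduped/")        // hypothetical output path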

Most optimal way to remove duplicates in pySpark

I am trying to remove duplicates in Spark dataframes by using dropDuplicates() on a couple of columns. But the job is getting hung due to the lots of shuffling involved and data skew. I have used 5 cores and 30 GB of memory to do this. The data on which I am performing dropDuplicates() is about 12 million rows.
Please suggest the most optimal way to remove duplicates in Spark, considering the data skew and shuffling involved.
Removing duplicates is an expensive operation, as it compares values across all partitions and tries to consolidate the results. Considering the size of your data, this can be time consuming.
I would recommend a groupBy transformation on the columns of your dataframe, followed by an action. This way only the consolidated results from each partition are compared with the others, and that happens lazily; you then request the result through any action like count / show, etc.
transactions.groupBy("col1", "col2").count.sort($"count".desc).show()
distinct():
df.select(['id', 'name']).distinct().show()
dropDuplicates()
df.dropDuplicates(['id', 'name']).show()
dropDuplicates() is the way to go if you want to drop duplicates over a subset of columns, but at the same time you want to keep all the columns of the original structure.

Divide operation in spark using RDD or dataframe

Suppose there is a dataset with some number of rows.
I need to find out the heterogeneity, i.e.
the distinct number of rows divided by the total number of rows.
Please help me with spark query to execute the same.
Dataset and DataFrame support the distinct function, which finds the distinct rows in the dataset.
So essentially you need to do
val heterogeneity = dataset.distinct.count.toDouble / dataset.count
The only thing is that if the dataset is big, the distinct can be expensive and you might need to set the Spark shuffle partitions correctly. (Note the toDouble above: both counts are Longs, so without the cast the division would truncate to 0 or 1.)
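A minimal sketch of that advice, assuming the dataset and spark session from above; the partition count is just an example value:
spark.conf.set("spark.sql.shuffle.partitions", "2000")  // raise parallelism for the distinct's shuffle
val heterogeneity = dataset.distinct().count().toDouble / dataset.count()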

pyspark: isin vs join

What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically:
Depending on the size of the given list of values, when is it best with respect to runtime to use isin vs an inner join vs
a broadcast?
This question is the spark analogue of the following question in Pig:
Pig: efficient filtering by loaded list
Additional context:
Pyspark isin function
Considering
import pyspark.sql.functions as psf
There are two types of broadcasting:
sc.broadcast() to copy python objects to every node for a more efficient use of psf.isin
psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in pig).
In the context question, the filtering was done using the column of another dataframe, hence the possible solution with a join.
Keep in mind that if your filtering list is relatively big, the operation of searching through it will take a while, and since it has to be done for each row it can quickly get costly.
Joins on the other hand involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to have to sort a huge dataframe just for a filter.
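To make the trade-off concrete, here is a minimal sketch of the two approaches using the Scala API (the PySpark calls are analogous); df, filterDf, the id column and the literal values are placeholders:
import org.apache.spark.sql.functions.{broadcast, col}

// small, fixed list of values: isin stays a plain filter, no shuffle of df
val ids = Seq("a1", "b2", "c3")  // hypothetical filter values
val viaIsin = df.filter(col("id").isin(ids: _*))

// filter values held in another dataframe: a broadcast left-semi join behaves like a
// filter (no columns from filterDf are added) without shuffling the large side
val viaJoin = df.join(broadcast(filterDf), Seq("id"), "left_semi")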
Both join and isin work well for all my daily use cases.
isin works well for both small and moderately large (~1M) lists of values.
Note: if you have a large dataset (say ~500 GB) and you want to filter it and then process the filtered dataset, then
with isin the amount of data read/processed is significantly lower and faster; the whole 500 GB will not be loaded, because you have already filtered the smaller dataset with the .isin method.
But in the join case, the whole 500 GB is loaded and processed, so the processing time will be much higher.
In my case, after filtering with
isin, then processing and converting to a Pandas DF, it took < 60 secs;
with a JOIN, then processing and converting to a Pandas DF, it took > 1 hour.

Reading Parquet columns as RDD rows

Is there a way to read columns from a Parquet file as rows in a Spark RDD, materializing the full contents of each column as a list within an RDD tuple?
The idea is that for cases where I need to run a non-distributable, in-memory-only algorithm (processing a full column of data) on a set of executors, I would like to be able to parallelize the processing by shipping the full contents of each column to the executors. My initial implementation, which involved reading the Parquet file as a DataFrame, then converting it to RDD and transposing the rows via aggregateByKey, has turned out to be too expensive in terms of time (probably due to the extensive shuffling required).
If possible, I would prefer to use an existing implementation, rather than rolling my own implementations of ParquetInputFormat, ReadSupport, and/or RecordMaterializer.
Suggestions for alternative approaches are welcome as well.
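For reference, a minimal sketch of the kind of column transpose described above (read the Parquet file as a DataFrame, then gather each column's values into a single RDD record); the path and names are placeholders, and as noted above it shuffles every value once:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("columns-as-rows").getOrCreate()
val df = spark.read.parquet("/path/to/input.parquet")  // hypothetical input path
val names = df.columns

// emit (columnName, value) pairs, then collect each column's values into one list per key
val columnsAsRows = df.rdd
  .flatMap(row => names.indices.map(i => (names(i), row.get(i))))
  .aggregateByKey(List.empty[Any])((acc, v) => v :: acc, _ ::: _)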
