Spark Generate A Lot Of Tasks Although Partition Number is 1 Pyspark - apache-spark

My code:
df = self.sql_context.sql(f"select max(id) as id from {table}")
return df.collect()[0][0]
My table is partitioned by id - it has 100M records but only 3 distinct id's.
I expected this query to work with 1 task and scan just the partition column (id).
I don't understand how I have 691 tasks for the collect line with just 3 partitions
I guess the query is executing full scan on the table but I can't figure why it doesn't scan just the metadata

Your df contains the result of an aggregation on the entire table, it contains only one row (with only one field being the max(id)), this is why it has only 1 partition.
But the original table DataFrame may have many partitions (or only 1 partition but its computation needs ~600 stages, triggering 1 task per stage, which is not that common)
Without details on your parallelism configurations and input source type and transformations, it is not easy to help more !

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where i am trying to write some results using dataframe write into S3 using the below query with input_table_1 size is 13 Gb and input_table_2 as 1 Mb
input_table_1 has columns account, membership and
input_table_2 has columns role, id , membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where membership array contains list of member_ids
This dataset write using Spark dataframe is generating around 1.1TiB of data in S3 with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove the duplicates . The record count is reduced to almost 1/3rd of the previous total count with around 200 billion rows but we observed that the output size in S3 is now 17.2 TiB .
I am very confused how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to s3 but it did not work.
Please suggest if this is expected and when can be done ?
There's two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns:_*).agg(Map.empty).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, will say your data is organised in columns rather than row by row. This allows for powerful compression if data is adequately laid-out & ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run length encoding). But Parquet also uses various other compression strategies.
Unfortunately, data ends up pretty randomly in your case after shuffling to remove duplicates. The original partitioning of input_table_1 was much better fitted.
Solutions
There's no single answer how to solve this, but here's a few pointers I'd suggest doing next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sorting (sortWithinPartition) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as Spark Aggregator and group / shuffle the data just once in a meaningful way.

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why, and I don't why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case details are important
This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitons
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet
Except all other factors which you did not mention: storage type, configuration, cluster capacity, the number of files in each case, your partitioning schema does not correspond to the use-case.
Partitioning schema should be chosen based on how the data will be selected or how the data will be written or both. In your case partitioning by year, month, day separately is over-partitioning. Partitions in Hive are hierarchical folders and all of them should be traversed (even if using metadata only) to determine the data path, in case of single date partition, only one directory level is being read. Two additional folders: year+month+day instead of date do not help with partition pruning because all columns are related and used together always in the where.
Also, partition pruning probably does not work at all with 3 partition columns and predicate like this: where date = concat(year, month, day)
Use EXPLAIN and check it and compare with predicate like this where year='some year' and month='some month' and day='some day'
If you have one more column in the WHERE clause in the most of your queries, say category, which does not correlate with date and the data is big, then additional partition by it makes sense, you will benefit from partition pruning then.

Spark collect_set vs distinct

If my goal is to collect distinct values in a column as a list, is there a performance difference or pros/cons using either of these?
df.select(column).distinct().collect()...
vs
df.select(collect_set(column)).first()...
collect_set is an aggregator function and requires a groupBy in the beginning. When there is no grouping provided it will take entire data as 1 big group.
1. collect_set
df.select(collect_set(column)).first()...
This will send all data of column column to a single node which will perform collect_set operation (removing duplicates). If your data size is big then it will swamp the single executor where all data goes.
2. distinct
df.select(column).distinct().collect()...
This will partition all data of column column based on its value (called partition key), no. of partitions will be the value of spark.sql.shuffle.partitions (say 200). So 200 tasks will execute to remove duplicates, 1 for each partition key. Then only dedup data will be sent to the driver for .collect() operation. This will fail if your data after removing duplicates is huge, will cause driver to go out of memory.
TLDR:
.distinct is better than .collect_set for your specific need

Apache Spark page results or view results on large datasets

I am using Hive with Spark 1.6.3
I have a large dataset (40000 rows, 20 columns or so and each column contains maybe 500 Bytes - 3KB of data)
The query is a join to 3 datasets
I wish to be able to page the final join dataset, and i have found that i can use row_number() OVER (ORDER BY 1) to generate a unique row number for each row in the dataset.
After this I can do
SELECT * FROM dataset WHERE row between 1 AND 100
However, there are resources which advise not to use ORDER BY as it puts all data into 1 partition (I can see this is the case in the logs where the shuffle allocation is moving the data to one partition), when this happens I get out of memory exceptions.
How would i go about paging through the dataset in a more efficient way?
I have enabled persist - MEMORY_AND_DISK so that if a partition is too large it will spill to disk (and for some of the transformation I can see that at least some of the data is spilling to disk when I am not using row_number() )
One strategy could be select only the unique_key of the dataset first and apply row_number function on that dataset only. Since you are selecting a single column from a large dataset chances are higher that it will fit in a single partition.
val dfKey = df.select("uniqueKey")
dfKey.createOrUpdateTempTable("dfKey")
val dfWithRowNum = spark.sql(select dfKey*, row_number() as row_number OVER (ORDER BY 1))
// save dfWithRowNum
After to complete the row_number operation on the uniqueKey; save that dataframe. Now in the next stage join this dataframe with the bigger dataframe and append the row_number column to that.
dfOriginal.createOrUpdateTempTable("dfOriginal")
dfWithRowNum.createOrUpdateTempTable("dfWithRowNum")
val joined = spark.sql("select dfOriginal.* from dfOriginal join dfWithRowNum on dfOriginal.uniqueKey = dfWithRowNum.uniqueKey")
// save joined
Now you can query
SELECT * FROM joineddataset WHERE row between 1 AND 100
For the persist with MEMORY_DISK, I found that occasionally fail with insufficient memory. I would rather use DISK_ONLY where performance is penalized although the execution is guaranteed.
Well, you can apply this method on your final join dataframe.
You should also persist the dataframe as a file to guarantee the ordering, as reevaluation could creates a different order.

Running partition specific query in Spark Dataframe

I am working on spark streaming application, where I partition the data as per a certain ID in the data.
For eg: partition 0-> contains all data with id 100
partition 1 -> contains all data with id 102
Next I want to execute query on whole dataframe for final result. But my query is specific to each partition.
For eg: I need to run
select(col1 * 4) in case of partiton 0
while
select(col1 * 10) in case of parition 1.
I have looked into documentation but didnt find any clue. One solution i have is to create different RDDs/ Dataframe for different id in data. But that is not scalable in my case.
Any suggestion how to run query on dataframe where query can be specific to each partition.
Thanks
I think you should not couple your business logic with Spark's way of partitioning your data (you won't be able to repartition your data if required). I would suggest to add an artificial column in your DataFrame that equals with the partitionId value.
In any case, you can always do
df.rdd.mapPartitionsWithIndex{ case (partId, iter: Iterable[Row]) => ...}
See also the docs.

Resources