Hudi with Spark performs very slowly when trying to write data to the filesystem - apache-spark

I'm trying out Apache Hudi with Spark using a very simple demo:
from pyspark.sql import SparkSession

with SparkSession.builder.appName("Hudi Test").getOrCreate() as spark:
    df = spark.read.option('mergeSchema', 'true').parquet('s3://an/existing/directory/')
    hudi_options = {
        'hoodie.table.name': 'users_activity',
        'hoodie.datasource.write.recordkey.field': 'users_activity_id',
        'hoodie.datasource.write.partitionpath.field': 'users_activity_id',
        'hoodie.datasource.write.table.name': 'users_activity_result',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'users_activity_create_date',
    }
    df.write.format('hudi').options(**hudi_options).mode('append').save('s3://htm-hawk-data-lake-test/flink_test/copy/users_activity/')
There are about 10 parquet files in the directory, with a total size of 1 GB and about 6 million records. But Hudi takes a very long time to write, and after 2 hours it failed with org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1409413 tasks (1024.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB).
I have checked the Spark History Server. It seems Spark is collecting all the records from the parquet files to the driver and serializing them. Is this expected behavior? How can I improve the write performance?

Hudi seems to write the data without any problem, but it fails at the indexing step, which tries to collect a list of (partition path, file id) pairs.
You are using the field users_activity_id as both the partition path and the Hudi record key. If the cardinality of this field is high, you will have a lot of partitions and therefore a very long list of (partition, file_id) pairs, especially since this field is the Hudi record key, which is supposed to be unique (6M records = 6M partitions).
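As a sketch of what that implies, assuming the dataset has (or can derive) a coarse column to partition by, you could keep users_activity_id as the record key only and point hoodie.datasource.write.partitionpath.field at a low-cardinality column. The day-level activity_date column below is an assumption for illustration:

from pyspark.sql import functions as F

# assumed: derive a day-level column so the number of Hudi partitions stays small
df = df.withColumn('activity_date', F.to_date('users_activity_create_date'))

hudi_options = {
    'hoodie.table.name': 'users_activity',
    'hoodie.datasource.write.recordkey.field': 'users_activity_id',    # unique key stays the record key
    'hoodie.datasource.write.partitionpath.field': 'activity_date',    # low-cardinality partition path
    'hoodie.datasource.write.table.name': 'users_activity_result',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'users_activity_create_date',
}
df.write.format('hudi').options(**hudi_options).mode('append').save('s3://htm-hawk-data-lake-test/flink_test/copy/users_activity/')

With a day-level partition path, the (partition, file id) list collected during indexing grows with the number of days rather than the number of records.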

Related

Spark creating a huge number of tasks when reading from parquet files

I'm seeing a very high task count on Spark queries that read from small partitioned parquet data.
I'm trying to query a table that is stored in an S3 bucket in parquet (snappy) file format. The table is partitioned by date/hour (one partition example: '2021/01/01 10:00:00'). Each partition contains 15 to 18 files, each between 30 and 70 kB.
A simple count by partition over 1 year of data is calculated using almost 20,000 tasks. My concern is why Spark creates so many tasks to read such a small amount of data. Is there any mechanism to make a single task read all the content of a single partition? I think it would be more efficient than 15 tasks reading 30 kB of data each.
spark.sql("select count(1), date_hour from forecast.hourly_data where date_hour between '2021_01_01-00' and '2022_01_01-00' group by date_hour")
[Stage 0:> (214 + 20) / 19123]
My Spark version is 2.4.7 and the configuration is the default.
The number of tasks is based on the number of files you are reading in. You can repartition after reading in the data.
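A minimal PySpark sketch of that suggestion; the table path and the target partition count are assumptions for illustration, not tuned values:

# each tiny file otherwise becomes (at least) one task; coalesce merges them
df = spark.read.parquet("s3://bucket/forecast/hourly_data/")   # assumed table location
df = df.coalesce(200)                                          # assumed target; tune to your data volume
df.groupBy("date_hour").count().show()

Because coalesce is a narrow transformation, the scan stage itself runs with the reduced number of tasks, each reading several of the small files.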

Spark Performance Issue vs Hive

I am working on a pipeline that will run daily. It involves joining 2 tables, say x and y (approx. 18 MB and 1.5 GB in size respectively), and loading the output of the join into a final table.
Here are the facts about the environment:
For table x:
Data size: 18 MB
Number of files in a partition: ~191
File type: parquet
For table y:
Data size: 1.5 GB
Number of files in a partition: ~3200
File type: parquet
Now the problem is:
Hive and Spark are giving the same performance (the time taken is the same).
I tried different combinations of resources for the Spark job.
e.g.:
executors:50 memory:20GB cores:5
executors:70 memory:20GB cores:5
executors:1 memory:20GB cores:5
All three combinations give the same performance. I am not sure what I am missing here.
I also tried broadcasting the small table 'x' to avoid a shuffle during the join, but there was not much improvement in performance.
One key observation is:
70% of the execution time is spent reading the big table 'y', and I guess this is due to the large number of files per partition.
I am not sure how Hive is achieving the same performance.
Kindly suggest.
I assume you are comparing Hive on MapReduce vs Spark. Please let me know if that is not the case, because Hive (on Tez or Spark) vs Spark SQL will not differ vastly in terms of performance.
I think the main issue is that there are too many small files.
A lot of CPU and time is consumed by the I/O itself, so you can't experience the processing power of Spark.
My advice is to coalesce the Spark DataFrames immediately after reading the parquet files. Coalesce the 'x' DataFrame into a single partition and the 'y' DataFrame into 6-7 partitions.
After doing the above, perform the join (a broadcast hash join).
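A rough PySpark sketch of that advice (the Scala API is analogous); the paths, the join key, and the exact partition counts are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("x_join_y").getOrCreate()

x = spark.read.parquet("/warehouse/x/").coalesce(1)    # ~18 MB -> a single partition
y = spark.read.parquet("/warehouse/y/").coalesce(7)    # ~1.5 GB -> 6-7 partitions

# broadcast the small side so the big side is not shuffled for the join
result = y.join(broadcast(x), on="join_key", how="inner")   # 'join_key' is a placeholder
result.write.mode("overwrite").parquet("/warehouse/final_table/")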

What to do with "WARN TaskSetManager: Stage contains a task of very large size"?

I use Spark 1.6.1.
My Spark application reads more than 10,000 parquet files stored in S3.
val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)
myPaths is an Array[String] that contains the paths of the 10,000 parquet files. Each path looks like s3n://bucketname/blahblah.parquet.
Spark prints a warning message like the one below.
WARN TaskSetManager: Stage 4 contains a task of very large size
(108KB). The maximum recommended task size is 100KB.
Spark has managed to run and finish the job anyway, but I guess this can slow down Spark's processing of the job.
Does anybody have a good suggestion about this problem?
The issue is that your dataset is not evenly distributed across partitions and hence some partitions have more data than others (and so some tasks compute larger results).
By default Spark SQL uses 200 partitions, controlled by the spark.sql.shuffle.partitions property (see Other Configuration Options):
spark.sql.shuffle.partitions (default: 200) Configures the number of partitions to use when shuffling data for joins or aggregations.
A solution is to coalesce or repartition your Dataset after you've read parquet files (and before executing an action).
Use explain or web UI to review execution plans.
The warning gives you a hint to optimize your query so that the more efficient result fetch is used (see TaskSetManager).
With the warning, the TaskScheduler (which runs on the driver) will fetch the result values using the less efficient IndirectTaskResult approach (as you can see in the code).
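A minimal sketch of the coalesce/repartition suggestion, written in PySpark for brevity (the Scala API is analogous); the target of 400 partitions is an assumption, not a recommendation:

df = sqlContext.read.option("mergeSchema", "true").parquet(*my_paths)   # my_paths: the 10,000 S3 paths

print(df.rdd.getNumPartitions())   # often roughly one partition per small file here

df = df.repartition(400)           # rebalance before any action; assumed target count
df.explain()                       # inspect the plan / web UI, as suggested above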

increasing number of partitions in spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables using this configuration, with a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds, but Spark takes 21 minutes for the same query. Looking at the job, I found the issue was that Spark creates 2 stages, and the first stage has only 7 tasks, one for each of the 7 blocks of data within the given partition of the ORC file. The blocks are of different sizes, one is 5 MB while another is 45 MB, so the stragglers take more time and the whole job takes too long.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?

How to control partition size in Spark SQL

I have a requirement to load data from a Hive table using Spark SQL HiveContext and load it into HDFS. By default, the DataFrame from the SQL output has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes a number-of-partitions parameter.
Repartitioning of the RDD causes shuffling and results in more processing time.
val result = sqlContext.sql("select * from bt_st_ent")
has the following log output:
Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)
I would like to know if there is any way to increase the number of partitions of the SQL output.
Spark < 2.0:
You can use Hadoop configuration options:
mapred.min.split.size
mapred.max.split.size
as well as HDFS block size to control partition size for filesystem based formats*.
val minSplit: Int = ???
val maxSplit: Int = ???
sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
Spark 2.0+:
You can use spark.sql.files.maxPartitionBytes configuration:
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
In both cases these values may not be honored by a specific data source API, so you should always check the documentation / implementation details of the format you use.
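For example, a hedged PySpark usage of the Spark 2.0+ option, assuming you want roughly 32 MB of input per partition instead of the 128 MB default (the value is purely illustrative):

# smaller max bytes per input partition -> more read partitions for file-based sources
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)
df = spark.sql("select * from bt_st_ent")
print(df.rdd.getNumPartitions())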
* Other input formats can use different settings. See for example
Partitioning in spark while reading from RDBMS via JDBC
Difference between mapreduce split and spark partition
Furthermore, Datasets created from RDDs will inherit the partition layout of their parents.
Similarly, bucketed tables will use the bucket layout defined in the metastore, with a 1:1 relationship between bucket and Dataset partition.
A very common and painful problem. You should look for a key that distributes the data into uniform partitions. Then you can use the DISTRIBUTE BY and CLUSTER BY operators to tell Spark to group rows into partitions. This will incur some overhead on the query itself, but will result in evenly sized partitions. Deepsense has a very good tutorial on this.
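For illustration, a sketch of the DISTRIBUTE BY approach on the table from the question; some_uniform_key is an assumed column that spreads rows evenly:

# hash-partition the output by an evenly distributed key;
# the number of resulting partitions is spark.sql.shuffle.partitions
result = sqlContext.sql("""
    select * from bt_st_ent
    distribute by some_uniform_key
""")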
If your SQL performs a shuffle (for example it has a join, or some sort of group by), you can set the number of partitions via the 'spark.sql.shuffle.partitions' property:
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
Following up on what Fokko suggests, you could use a random variable to cluster by.
val result = sqlContext.sql("""
  select * from (
    select *, rand(64) as rand_part from bt_st_ent
  ) t cluster by rand_part
""")
