Spark DataFrame write to parquet table - slow at updating partition stats - apache-spark

When I write data from a DataFrame into a partitioned parquet table, the process gets stuck at updating partition stats after all the tasks have completed successfully:
16/10/05 03:46:13 WARN log: Updating partition stats fast for:
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for:
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for:
df.write.format("parquet").mode("overwrite").partitionBy("part1").insertInto("db.tbl")
My table has > 400 columns and > 1000 partitions.
Please let me know if we can optimize and speed up updating the partition stats.

I think the problem here is that there are too many partitions for a table with > 400 columns. Every time you overwrite a table in Hive, the statistics are updated. In your case it will try to update statistics for 1000 partitions, and each of those partitions holds data with > 400 columns.
Try reducing the number of partitions (use another partition column, or if it is a date column consider partitioning by month) and you should see a significant improvement in performance.
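For illustration, here is a minimal PySpark sketch of the month-level partitioning idea; the date column event_date, the derived column part_month, and the target table name are assumptions, not from the original job:

from pyspark.sql import functions as F

# Derive a coarser partition column (month) from a hypothetical date column,
# so the table ends up with roughly 12 partitions per year instead of hundreds.
df_by_month = df.withColumn("part_month", F.date_format("event_date", "yyyy-MM"))

(df_by_month.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("part_month")        # far fewer partitions whose stats need updating
    .saveAsTable("db.tbl_by_month"))  # hypothetical target table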

Related

Hudi with Spark performs very slowly when trying to write data to the filesystem

I'm trying Apache Hudi with Spark in a very simple demo:
from pyspark.sql import SparkSession

with SparkSession.builder.appName("Hudi Test").getOrCreate() as spark:
    df = spark.read.option('mergeSchema', 'true').parquet('s3://an/existing/directory/')
    hudi_options = {
        'hoodie.table.name': 'users_activity',
        'hoodie.datasource.write.recordkey.field': 'users_activity_id',
        'hoodie.datasource.write.partitionpath.field': 'users_activity_id',
        'hoodie.datasource.write.table.name': 'users_activity_result',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'users_activity_create_date',
    }
    df.write.format('hudi').options(**hudi_options).mode('append').save('s3://htm-hawk-data-lake-test/flink_test/copy/users_activity/')
There are about 10 parquet files in the directory; their total size is 1GB, about 6 million records. But Hudi takes a very long time to write, and it failed with org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1409413 tasks (1024.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB) after 2 hours.
I have checked the Spark History Server, and it seems Spark is collecting all the records from the parquet files to the driver and serializing them. Is that expected behaviour? How can I improve the write performance?
Hudi seems to write the data without any problem, but it fails at the indexing step, which tries to collect a list of (partition path, file id) pairs.
You are using the field users_activity_id as both the partition path and the Hudi record key. If the cardinality of this field is high you will have a lot of partitions, and therefore a very long list of (partition, file_id) pairs; this is especially bad when the field is the Hudi record key, which is supposed to be unique (6M records = 6M partitions).
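For illustration, here is a sketch of the write options with a low-cardinality partition path instead of the unique record key; the derived column activity_date is an assumption, while the remaining option keys and paths come from the question:

from pyspark.sql import functions as F

# Partition on a date derived from the precombine field, not the unique record key.
df = df.withColumn('activity_date', F.to_date('users_activity_create_date'))

hudi_options = {
    'hoodie.table.name': 'users_activity',
    'hoodie.datasource.write.recordkey.field': 'users_activity_id',        # stays unique per record
    'hoodie.datasource.write.partitionpath.field': 'activity_date',        # low cardinality
    'hoodie.datasource.write.table.name': 'users_activity_result',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'users_activity_create_date',
}
df.write.format('hudi').options(**hudi_options).mode('append').save('s3://htm-hawk-data-lake-test/flink_test/copy/users_activity/')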

How do I find the partition key of Cassandra partitions with size greater than 100MB?

I want to get a list of partitions with a size greater than 100 MB for analysis. How do I achieve this?
Cassandra logs a WARN with details of partitions getting compacted when the partition size is larger than compaction_large_partition_warning_threshold. The default in cassandra.yaml is 100MiB:
# Log a warning when compacting partitions larger than this value
compaction_large_partition_warning_threshold: 100MiB
You can parse the system.log on the Cassandra nodes and look for log entries which contain the string Writing large partition. It looks something like:
WARN [CompactionExecutor:#] BigTableWriter.java:258 maybeLogLargePartitionWarning \
Writing large partition ks_name/tbl_name:pk (###.###MiB) to sstable /path/to/.../...-big-Data.db
It should be easy enough to write a shell script that would extract the table name and partition key from the logs. Cheers!
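For illustration, here is a minimal Python sketch of such a script; the log path and the regular expression are assumptions based on the message format shown above, and a grep/awk one-liner would work just as well:

import re

# Assumed default log location; adjust for your installation.
LOG_PATH = "/var/log/cassandra/system.log"

# Matches the "Writing large partition ks/tbl:key (size MiB)" message shown above.
pattern = re.compile(r"Writing large partition (\S+)/(\S+):(\S+) \(([\d.]+)MiB\)")

with open(LOG_PATH) as log:
    for line in log:
        match = pattern.search(line)
        if match:
            keyspace, table, partition_key, size_mib = match.groups()
            print(f"{keyspace}.{table}  key={partition_key}  size={size_mib} MiB")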

Would repartitioning in spark change number of spark partitions whilst reading data from cassandra source?

I am reading a Cassandra table in Spark. I have big partitions in Cassandra, and when a Cassandra partition exceeds 64 MB it ends up mapped to a single Spark partition. Because of these large partitions I am getting memory issues in Spark.
My question is: if I repartition right after reading the data from Cassandra, would the number of Spark partitions change, and would that avoid the Spark memory issues?
My assumption is that Spark first reads the data from Cassandra, so at that stage a large Cassandra partition won't be split by the repartition; the repartition only works on the data already loaded from Cassandra.
I am just wondering whether repartitioning can change the data distribution while the data is being read from Cassandra, rather than partitioning it again afterwards.
If you repartition your data using some arbitrary key then yes, it will be redistributed among the Spark partitions.
Technically, Cassandra partitions do not get split into Spark partitions when you retrieve the data but once you're done reading, you can repartition on a different key to break up the rows of a large Cassandra partition.
For the record, it doesn't avoid the memory issues of reading large Cassandra partitions in the first place, because the default input split size of 64MB is just a notional target that Spark uses to calculate how many Spark partitions are required, based on the estimated Cassandra table size and C* partition sizes. Since the calculation is based on estimates, the Spark partitions don't actually end up being 64MB in size.
If you are interested, I've explained in detail how Spark partitions are calculated in this post -- https://community.datastax.com/questions/11500/.
To illustrate with an example, let's say that based on the estimated table size and estimated number of C* partitions, each Spark partition is mapped to 200 token ranges in Cassandra.
For the first Spark partition, the token range might only contain 2 Cassandra partitions of size 3MB and 15MB, so the actual size of the data in the Spark partition is just 18MB.
But in the next Spark partition, the token range contains 28 Cassandra partitions that are mostly 1 to 4MB but there is one partition that is 56MB. The total size of this Spark partition ends up being a lot more than 64MB.
In these 2 cases, one Spark partition was just 18MB in size while the other is bigger than the 64MB target size. I've explained this issue in a bit more detail in this post -- https://community.datastax.com/questions/11565/. Cheers!
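To illustrate the read-then-repartition pattern described above, here is a minimal PySpark sketch; it assumes the spark-cassandra-connector is on the classpath, and the host, keyspace, table, column name, and partition count are all hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("cassandra-repartition-sketch")
    .config("spark.cassandra.connection.host", "127.0.0.1")   # hypothetical contact point
    .getOrCreate())

# The read maps token ranges to Spark partitions, so one huge Cassandra partition
# still lands entirely inside a single Spark partition at this stage.
df = (spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_ks", table="my_table")              # hypothetical keyspace/table
    .load())

# After the read, redistribute the rows on a different, higher-cardinality column
# so downstream stages no longer process a whole Cassandra partition at once.
df = df.repartition(200, "clustering_col")                    # hypothetical column and count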

Alter table to add partition taking long time on Hive external table

I'm trying to execute a Spark job on an EMR cluster with 6 nodes (8 cores and 56GB of memory on each node). The Spark job does an incremental load of partitions on a Hive table, and at the end it does a refresh table in order to update the metadata.
The refresh command takes as long as 3-6 hours to complete, which is too long.
Nature of data in Hive:
27GB of data located on S3.
Stored in parquet.
Partitioned on 2 columns (ex: s3a://bucket-name/table/partCol1=1/partCol2=2020-10-12).
Note - it is a date-wise partition and cannot be changed.
Spark config used:
num-executors = 15
executor-memory = 16GB
executor-cores = 2
driver-memory = 49GB
spark.sql.shuffle.partitions = 48
hive.exec.dynamic.partition.mode = nonstrict
spark.sql.sources.partitionOverwriteMode = dynamic
Things tried:
Tuning the spark cores/memory/executors but no luck.
Refresh table command.
Alter table add partition command.
MSCK REPAIR TABLE tablename via the Hive CLI (takes 3-4 hours to complete).
All the above had no effect on reducing the time to refresh the partition on Hive.
Some assumptions:
Am I missing any tuning parameter, given that the data is stored in Amazon S3?
The number of partitions on the table is currently close to 10k; is this an issue?
Any help will be much appreciated.
If possible, reduce the partitioning to a single column; multi-level (multi-column) partitioning kills performance.
Use R-type instances; they provide more memory than M-type instances at the same price.
Use coalesce to merge the files at the source if there are many small files (see the sketch at the end of this answer).
Check the number of mapper tasks; the more tasks there are, the lower the performance.
Use EMRFS rather than plain S3 to keep the metadata info.
Use the following EMR configuration classification:
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "true"
  }
}
Follow some of the instructions from the link below.
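A minimal sketch of the coalesce suggestion from the list above, assuming an existing SparkSession named spark; the paths and the target count of 48 output files are hypothetical:

# Read the many small source files and rewrite them as fewer, larger files.
src = spark.read.parquet("s3a://bucket-name/table/raw/")
(src.coalesce(48)                 # merge into ~48 larger files without a full shuffle
    .write
    .mode("overwrite")
    .parquet("s3a://bucket-name/table/compacted/"))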

Size of cassandra partitions

What is the best tool to find the number of rows in each Cassandra partition? I have a big partition and I want to know how many records there are in that partition.
nodetool tablehistograms <keyspace> <table> will give you the distribution of cell counts and partition sizes for the table, but it won't tell you about one specific partition. To get that you must run count(*) in a select query that specifies the partition key in the where clause; on a very large partition that query can fail, though.
sstablemetadata from 4.0 onwards is based off the describe command in sstable-tools. If you provide the -s flag to scan the sstable, it will give you the partitions largest in size, the partitions largest in number of rows, and the partitions with the most tombstones. It can be used against 3.0 and 3.11 sstables; I think 2.1 sstables cannot be processed, though.
...
Partitions: 22515
Rows: 13579337
Tombstones: 0
Cells: 13579337
Widest Partitions:
[12345] 999999
[99049] 62664
[99007] 60437
[99017] 59728
[99010] 59555
Largest Partitions:
[12345] 189888705
[99049] 2965017
[99007] 2860391
[99017] 2826094
[99010] 2818038
...
The above example has an int partition key; with a text partition key it will print the key out like:
Widest Partitions:
[frodo] 1
Largest Partitions:
[frodo] 104
You can find the total number of partitions available for a table with the nodetool command ./nodetool cfstats <keyspace>.<table>.
If you know the partition key, you can fire a select count(*) for the partition to get the number of records in it. That query can time out on big partitions, so set the cqlsh request timeout before executing it (see the sketch below for a driver-based alternative).
To understand how to calculate the physical partition size, go through the DataStax DS220: Data Modeling material on partition size.
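As an illustration of the count(*) approach, here is a minimal sketch using the Python cassandra-driver instead of cqlsh; the contact point, keyspace, table, and partition key value are assumptions, and the generous timeout stands in for raising the cqlsh request timeout:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])           # hypothetical contact point
session = cluster.connect('my_keyspace')   # hypothetical keyspace

# Count the rows in a single partition; this can still be slow on very large partitions.
row = session.execute(
    "SELECT count(*) FROM my_table WHERE pk = %s",   # hypothetical table and key column
    ('some_partition_key',),
    timeout=120,
).one()
print(row.count)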
Instaclustr has a tool to find the partition size. However, this does not show the number of records in each partition:
https://github.com/instaclustr/cassandra-sstable-tools
As mentioned above, you can also use the built-in nodetool, found in the bin directory of the Cassandra installation, and run it from a terminal:
nodetool toppartitions
Additionally, you can use an online tool such as https://www.cqlguru.io/, but this needs some prior information, such as the average number of rows per partition and the average length of the text in varchar columns. It is good for a rough estimation, though.
