My DataFrame has over 10 billion records, with time stored as a bigint in milliseconds since 1/1/1970. I need it as a date, and I am applying the transformation shown below. Is there a faster way of doing this?
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
spark.createDataFrame([[1365742800000], [1366866000000]], schema=["origdt"]) \
    .withColumn("newdt", F.to_date((F.col("origdt") / 1000).cast(TimestampType()))).show()
The transformation you have done is fine. To increase speed, you can increase the cluster size, or increase the number of partitions using repartition so that all the cores are utilised at once.
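A minimal sketch of that suggestion, assuming a hypothetical source path and a partition count that you would tune to your cluster's core count:

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

# Hypothetical input; the partition count is an assumption (roughly 2-3x total cores).
df = spark.read.parquet("/path/to/source")
result = (df.repartition(1200)
            .withColumn("newdt", F.to_date((F.col("origdt") / 1000).cast(TimestampType()))))
result.write.parquet("/path/to/output")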
Related
I have a Delta table which is partitioned by multiple keys, one of which is a date truncated to the hour, with no minute detail (example - Fri, 15 Jul 2022 07).
Now, with data continuously being ingested via batch and streaming ingestion workflows, what would be the best strategy to evaluate the number of executors needed to read all the data from the Delta table?
A very naive way would be to just let Spark autoscale, but we may still need to play with shuffle partitions etc. I'm looking for hints or best practices around this. Thanks!
If you want to "read all the data from the delta table", it does not really matter whether the table is partitioned or not, since the query reads all the data and hence loads the whole table.
This is the worst possible query - the dreaded full scan. If it's inevitable, just know that this is the kind of query where Spark SQL shines, utilising the full power of a Spark cluster. You've been warned :)
Executors are simply JVM processes with CPU cores and memory. You're probably more interested in the total number of CPU cores needed for all the tasks that load the delta table.
I'd start this calculation with the number of files for a given version of the delta table. Files are of different sizes and (I might be wrong here) they are usually chunked (I don't want to use the overloaded term partitioned here, but that's what springs to my mind) into 512 MB splits.
The number of splits (512MB blocks) for all the files of a given version of the delta table would be the number of tasks. That would give you the number of CPU cores and hence their "containers", i.e. Spark executors (to evenly saturate available physical resources for the best performance).
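A rough sketch of that calculation, assuming Delta Lake is available in the session; the table path is hypothetical, the split size mirrors the 512 MB assumption above (Spark's spark.sql.files.maxPartitionBytes actually defaults to 128 MB), and the cores per executor is a placeholder for your own configuration:

import math

# DESCRIBE DETAIL reports numFiles and sizeInBytes for the current version of the table.
detail = spark.sql("DESCRIBE DETAIL delta.`/path/to/delta/table`").first()
split_bytes = 512 * 1024 * 1024        # split size assumed above; adjust to your setting
est_tasks = math.ceil(detail["sizeInBytes"] / split_bytes)
cores_per_executor = 4                 # assumption: match your executor configuration
est_executors = math.ceil(est_tasks / cores_per_executor)
print(detail["numFiles"], est_tasks, est_executors)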
I'm using PySpark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.
I know using the repartition(500) function will split my parquet into 500 files with almost equal sizes.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB to 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same file size per file per day irrespective of the number of files.
This will help me when running my job on this large dataset later on to avoid skewed executor times and shuffle times etc.
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?
You could consider writing your result with the parameter maxRecordsPerFile.
storage_location = "..."  # output path
estimated_records_with_desired_size = 2000

result_df.write \
    .option("maxRecordsPerFile", estimated_records_with_desired_size) \
    .parquet(storage_location, compression="snappy")
In Spark, I have a few jobs chained together (i.e. the output of one is the input to the next). The issue I am facing is this: say my input dataset to the first job is 10 GB today, and I repartition it statically (hardcoded number, coalesce or repartition) before writing the output so that each partition is around 128 MB. As the data grows, that hardcoded number obviously makes the partitions bigger, and within a few months the downstream jobs start to become slower due to the larger partition sizes.
One way I tried to ensure partitions stay around 128 MB is to divide the total dataset size (row count) by a static number. That is, if a dataset with 1M rows is 500 MB, I estimate that approximately 250k rows would be around 128 MB, so the number of partitions I need is df.count()/250,000.
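Roughly, that estimate looks like the sketch below (the total size is illustrative; in practice it would come from the input's actual footprint, and df is the dataset being written):

import math

target_bytes = 128 * 1024 * 1024            # desired partition size
total_bytes = 500 * 1024 * 1024             # estimated dataset size (illustrative)
total_rows = df.count()
rows_per_partition = max(1, int(total_rows * target_bytes / total_bytes))   # ~250k in the example
num_partitions = max(1, math.ceil(total_rows / rows_per_partition))
df.repartition(num_partitions).write.parquet("/some/output/path")           # hypothetical path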
That sort of works but is there a better/more straightforward way to accomplish this without affecting the performance of the job much?
I am running a Spark streaming application where data comes in every minute. The number of repartitions I am doing is 48. It is running on 12 executors with 4 GB executor memory and executor-cores=4.
Below are the streaming batch processing times (Spark UI screenshots not included here). Some batches take around 20 seconds, while others take around 45 seconds.
I drilled down into one of the batches that takes less time and into one that takes more time (Spark UI screenshots not included here). The slower batch spends most of its time in the repartitioning task, whereas the faster one does not. This happens every 3-4 batches. The data comes from a Kafka stream and has only a value, no key.
Is there any reason related to spark configuration?
Try reducing spark.sql.shuffle.partitions; the default value is 200, which is overkill here. Reduce the value and analyse the performance.
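For example (48 here is only an assumption to match the repartition count and total cores from the question; tune it based on what you observe):

spark.conf.set("spark.sql.shuffle.partitions", "48")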
I'm writing a parquet file from DataFrame to S3.
When I look at the Spark UI, I can see that all tasks of the writing stage but one complete swiftly (e.g. 199/200). The last task appears to take forever to complete, and very often it fails due to exceeding the executor memory limit.
I'd like to know what is happening in this last task. How to optimize it?
Thanks.
I have tried Glemmie Helles Sindholt's solution and it works very well.
Here is the code:
path = 's3://...'
n = 2 # number of repartitions, try 2 to test
spark_df = spark_df.repartition(n)
spark_df.write.mode("overwrite").parquet(path)
It sounds like you have a data skew. You can fix this by calling repartition on your DataFrame before writing to S3.
As others have noted, data skew is likely at play.
Besides that, I noticed that your task count is 200.
The configuration parameter spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.
200 is the default for this setting, but generally it is far from an optimal value.
For small data, 200 could be overkill and you would waste time in the overhead of multiple partitions.
For large data, 200 can result in large partitions, which should be broken down into more, smaller partitions.
The really rough rules of thumb (see the sketch after this list) are:
- have 2-3x as many partitions as CPU cores, or
- aim for partitions of roughly 128 MB each.
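A quick sketch of the cores-based rule, using defaultParallelism as an approximation of the total cores available to the application:

# Aim for ~3x as many shuffle partitions as cores.
cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", str(cores * 3))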
2 GB is the maximum partition size. Also, if you are hovering just below 2000 partitions, be aware that Spark uses a different data structure for shuffle bookkeeping when the number of partitions is greater than 2000 [1]:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
You can try playing with this parameter at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "300")
[1] What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
This article - The Bleeding Edge: Spark, Parquet and S3 - has a lot of useful information about Spark, S3 and Parquet. In particular, it talks about how the driver ends up writing out the _common_metadata_ files, which can take quite a bit of time. There is a way to turn it off.
Unfortunately, they say that they go on to generate the common metadata themselves, but don't really talk about how they did so.
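For what it's worth, the switch the article refers to is presumably the Parquet summary-metadata setting; a minimal sketch, assuming the property name used by older Parquet releases (newer releases use parquet.summary.metadata.level instead):

from pyspark.sql import SparkSession

# Disable the _metadata / _common_metadata summary files on Parquet writes.
spark = (SparkSession.builder
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")
         .getOrCreate())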