Spark saving Parquet files to S3 takes too long (does it really?) - apache-spark

I'm trying to figure out whether the time it takes my job to read and write files from and to S3 makes sense.
I run a job that reads about 2 TB of data from Delta Lake (partitioned by date and hour, roughly 90 days with 24 hours per day) and saves it to S3.
The code that saves it to S3 is:
rawData.write
.partitionBy("id", "dt")
.mode(SaveMode.Overwrite)
.parquet(outputPath)
The process takes 90 minutes to finish when I use 350 AWS spot instances (m5.2xlarge) in the Hadoop cluster.
I'm trying to find some benchmark to understand whether this makes sense or whether something is going wrong.
Can someone help? If any other details would help, please let me know and I'll add them.

Are you doing any kind of Spark executor tuning? If not, try it out. Here is a good example of executor tuning. Also check the Spark UI and see whether the jobs are distributed properly across executors or whether something is getting stalled. I can help more if you give more details.
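For what it's worth, a minimal sketch of explicit executor sizing (shown in PySpark for illustration, although the job above is Scala; every value here is an assumption to tune against the actual workload, not a recommendation):
from pyspark.sql import SparkSession

# Hypothetical sizing for m5.2xlarge nodes (8 vCPU, 32 GB RAM): leave a core and
# some memory per node for the OS/YARN, giving roughly 5 cores and ~18 GB heap per executor.
spark = (SparkSession.builder
    .appName("delta-to-s3-export")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "18g")
    .config("spark.executor.memoryOverhead", "3g")
    .config("spark.sql.shuffle.partitions", "2000")  # rough guess, scaled to the ~2 TB input
    .getOrCreate())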

Related

Spark limit + write is too slow

I have a dataset of 8 billion records stored in Parquet files in Azure Data Lake Gen 2.
I wanted to separate out a sample dataset of 2 billion records in a different location for some benchmarking needs, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 nodes of 8-core, 28 GB RAM machines [8 worker nodes + 1 master node]. It has been running for over an hour without a single file written yet. The load did finish within 2 s, so I know the limit + write is what's causing the bottleneck [although load only infers the schema and builds a list of files, without actually reading the data].
So I started inspecting the Spark UI for some clues and here are my observations
2 Jobs have been created by Spark
The first job took 35 mins. Here's the DAG
The second job has been running for about an hour now with no progress at all. The second job has two stages in it.
If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of the task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DF. Does limit really need a shuffle? Even if it's shuffling, it seems like 1hr is awfully long to shuffle data around.
Also if this is what's really performing the limit, what did the first job really do? Just read the data? 35mins for that also seems too long, but for now I'd just settle on the job being completed.
Stage 4 is just stuck; I believe it is the actual writing stage and is waiting for this shuffle to finish.
I am new to Spark and fairly clueless about what's happening here. Any insight into what I'm doing wrong would be very useful.
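For comparison, and only if an exact 2-billion-row cut is not strictly required, one hedged alternative is to sample a fraction instead of taking a global limit(), since a global limit tends to funnel rows through a single-partition exchange while sample() stays fully parallel (read_path and write_path as in the question):
# Roughly 2B out of 8B rows -> ~25% sample; this gives an approximate row count, not an exact one.
df = spark.read.format('parquet').load(read_path)
sample_df = df.sample(withReplacement=False, fraction=0.25, seed=42)
sample_df.write.format('parquet').save(f'{write_path}/advertiser/2B_parquet_sample')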

Spark Databricks ultra slow read of parquet files

Configuration:
Spark 3.0.1
Cluster: Databricks (driver c5x.2xlarge, 2 workers of the same type as the driver)
Source: S3
Format: Parquet
Size: 50 MB
File count: 2000 (too many small files, as they are getting dumped from a Kinesis stream with 1-minute batches because we cannot tolerate more latency)
Problem statement: I have 10 jobs with a similar configuration, each processing a similar volume of data to the above. When I run them individually, they take 5-6 mins each, including cluster spin-up time.
But when I run them together, they all seem to get stuck at the same point in the code and take 40-50 mins to complete.
When I check the Spark UI, I see that all the jobs spend 90% of their time taking the source count:
df = spark.read.parquet('s3a://....')
df.cache()
df.count()  # ----- problematic step
# ....more code logic
Now, I know that taking the count without caching first should be faster for Parquet files, but the jobs were taking even more time when I didn't cache the dataframe before taking the count, probably because of the huge number of small files.
But what I fail to understand is why the jobs run so much faster when run one at a time.
Is S3 my bottleneck? They are all reading from the same bucket but different paths.
** I'm using Privacera tokens for authentication.
They'll all be using the same s3a filesystem class instances on the worker nodes. There are some options there about the number of HTTP connections to keep open, e.g. fs.s3a.connection.maximum, default 48. If all work is against the same bucket, set it to at least 2x the number of worker threads. Do the same for fs.s3a.max.total.tasks.
If you are using Hadoop 2.8+ binaries, switch the s3a client into random IO mode, which delivers the best performance when seeking around Parquet files: fs.s3a.experimental.input.fadvise = random.
Change #2 should deliver a speedup even on single workloads, so do it anyway.
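For reference, a hedged sketch of wiring those options into a Spark session via the spark.hadoop. prefix; the numbers are illustrative, not recommendations:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("parquet-count")
    # At least 2x the number of worker threads hitting the bucket, per the advice above.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.max.total.tasks", "200")
    # Random IO suits Parquet's seek-heavy reads (Hadoop 2.8+ binaries).
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    .getOrCreate())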
Throttling would surface as 503 responses, which are handled inside the AWS SDK and don't get collected/reported. I'd recommend that, at least for debugging this, you turn on S3 bucket logging and scan the logs for 503 responses, which indicate throttling is taking place. It's what I do. Tip: set up a rule to delete old logs to keep costs down; 1-2 weeks of logs is generally enough for me.
Finally, lots of small files are bad on HDFS and awful with object stores, as the time to list/open them is so high. Try to make coalescing files step #1 in processing the data.
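A minimal sketch of such a compaction step, with placeholder paths (at ~50 MB total a single output file is fine; for larger batches, aim for roughly 128 MB or more per output file):
# Read the ~2000 small files once, then rewrite them as one larger file.
small = spark.read.parquet("s3a://bucket/kinesis-landing/2021/01/01/")
small.coalesce(1).write.mode("overwrite").parquet("s3a://bucket/compacted/2021/01/01/")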

S3-based streaming solution using Apache Spark or Flink

We have batch pipelines writing files (mostly CSV) into an S3 bucket. Some of these pipelines write every minute and some every 5 mins. Currently, we have a batch application which runs every hour processing these files.
Business wants the data to be available every 5 mins. Instead of running batch jobs every 5 mins, we decided to use Apache Spark Structured Streaming and process the data in real time. My question is: how easy/difficult is it to productionise this solution?
My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will re-process data from as far back as the last year. Has anyone productionised a solution on S3 using Spark Structured Streaming, or do you think Flink is better for this use case?
If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.
PS: We already considered putting these files into Kafka and ruled it out due to bandwidth availability and the large size of the files.
I found a way to do this, though not the most effective one. Since we have already productionised a Kafka-based solution before, we could push an event into Kafka using S3 streams and a Lambda. The event would contain only metadata like the file location and size.
This makes the Spark program a bit more challenging, as the file will be read and processed inside the executor, which effectively doesn't utilise distributed processing. Alternatively, read it in the executor and bring the data back to the driver to utilise Spark's distributed processing. Either way, the Spark app needs to be planned much more carefully in terms of memory, because the input file sizes vary a lot.
https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html
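A very rough PySpark sketch of that metadata-event pattern; the broker address, topic name, event schema and paths are all made-up placeholders, and this variant hands the file paths back to spark.read inside foreachBatch so the actual file read stays distributed rather than happening inside a single executor task (it also assumes the spark-sql-kafka connector is on the classpath):
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("s3-event-stream").getOrCreate()

# Hypothetical schema of the S3 notification events the Lambda pushes to Kafka.
event_schema = StructType([
    StructField("bucket", StringType()),
    StructField("key", StringType()),
    StructField("size", LongType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "s3-file-events")             # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

def process_batch(batch_df, batch_id):
    # The metadata batch is tiny, so collecting it to the driver is cheap; Spark
    # then reads the referenced files in parallel across the executors.
    paths = ["s3a://{}/{}".format(r.bucket, r.key) for r in batch_df.collect()]
    if paths:
        (spark.read.csv(paths, header=True)
            .write.mode("append").parquet("s3a://output-bucket/ingested/"))  # placeholder output

(events.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "s3a://output-bucket/checkpoints/ingest/")  # placeholder
    .start())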

How can you quickly save a dataframe/RDD from PySpark to disk as a CSV/Parquet file?

I have a Google Dataproc cluster running and am submitting a PySpark job to it that reads a file from Google Cloud Storage (a 945 MB CSV file with 4 million rows --> takes 48 seconds in total to read) into a PySpark DataFrame and applies a function to that dataframe (parsed_dataframe = raw_dataframe.rdd.map(parse_user_agents).toDF() --> takes about 4 or 5 seconds).
I then have to save those modified results back to Google Cloud Storage as a GZIP'd CSV or Parquet file. I can also save those modified results locally, and then copy them to a GCS bucket.
I repartition the dataframe via parsed_dataframe = parsed_dataframe.repartition(15) and then try saving that new dataframe via
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")
parsed_dataframe.write.format("com.databricks.spark.csv").save("gs://somefolder/", header="true")
parsed_dataframe.write.format("com.databricks.spark.csv").options(codec="org.apache.hadoop.io.compress.GzipCodec").save("gs://nyt_regi_usage/2017/max_0722/regi_usage/", header="true")
Each of those methods (and their different variants with lower/higher partitions and saving locally vs. on GCS) takes over 60 minutes for the 4 million rows (945 MB) which is a pretty long time.
How can I optimize this/make saving the data faster?
It's worth noting that both the Dataproc cluster and the GCS bucket are in the same region/zone, and that the cluster has an n1-highmem-8 (8 CPU, 52 GB memory) master node with 15+ worker nodes (just variables I'm still testing out).
Some red flags here.
1) Reading as a DataFrame, converting to an RDD for processing, and then going back to a DataFrame is by itself very inefficient. You lose the Catalyst and Tungsten optimizations by reverting to RDDs. Try changing your function to work within the DataFrame API (see the sketch after this answer).
2) Repartitioning forces a shuffle, but more importantly means that the computation will now be limited to those executors controlling the 15 partitions. If your executors are large (7 cores, 40-ish GB RAM), this likely isn't a problem.
What happens if you write the output before repartitioning?
Please provide more code and ideally Spark UI output to show how long each step in the job takes.
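As a hedged illustration of point 1, assuming the parsing can be expressed per row on a single (hypothetical) user_agent column, the RDD round-trip could be replaced with a UDF that stays in the DataFrame API:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def parse_user_agent(ua):
    # Placeholder for the real logic inside parse_user_agents.
    return ua.split("/")[0] if ua else None

parse_ua_udf = udf(parse_user_agent, StringType())

# raw_dataframe as in the question; "user_agent" and "parsed_ua" are made-up column names.
parsed_dataframe = raw_dataframe.withColumn("parsed_ua", parse_ua_udf("user_agent"))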
Try this; it should take a few minutes:
your_dataframe.write.csv("export_location", mode="overwrite", header=True, sep="|")
Make sure you add mode="overwrite" if you want to overwrite an old version.
Are you calling an action on parsed_dataframe?
As you wrote it above, Spark will not compute your function until you call write. If you're not calling an action, see how long parsed_dataframe.cache().count() takes. I suspect it will take an hour and that then running parsed_dataframe.write(...) will be much faster.
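In other words, a hedged way to separate the compute time from the write time, reusing the names and path from the question:
parsed_dataframe.cache()
parsed_dataframe.count()  # forces the parsing work and materialises the cache
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")  # now this times only the write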

Spark write to CSV fails even after 8 hours

I have a dataframe with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the spark shell (Scala) on an Elastic MapReduce cluster.
Here's how I'm writing to CSV:
result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the CSV data it is based on in roughly 22 minutes. In the same program, I'm also able to write another (smaller) dataframe to CSV in 8 minutes. However, for this result dataframe it takes 8+ hours and still fails ... saying one of the connections was closed.
I'm also running this job on 13 x c4.8xlarge instances on EC2, with 36 cores each and 60 GB of RAM, so I thought I'd have the capacity to write to CSV, especially after 8 hours.
Many stages required retries or had failed tasks, and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
Update:
I've ran the following command to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with getNumPartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")
However, even after several hours, the jobs are still failing. What am I doing wrong still?
Note: I'm running the Spark shell via spark-shell --master yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
r2.persist(StorageLevel.MEMORY_AND_DISK)
But many stages failed, returning Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3, or saying Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed.
[Screenshots: Executors page; Spark web UI for a node returning a shuffle error; Spark web UI for a node returning an EC2 connection-closed error; overall job summary page]
"I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas?"
This is a FetchFailedException, i.e. a failure to fetch a shuffle block.
Since you are able to deal with small files and it only fails with huge data...
I strongly suspect there are not enough partitions.
First, verify/print sources.rdd.getNumPartitions, destinations.rdd.getNumPartitions and result.rdd.getNumPartitions.
You need to repartition after the data is loaded in order to partition the data (via a shuffle) across the other nodes in the cluster. This will give you the parallelism you need for faster processing without failures.
Furthermore, to verify the other configuration that is applied, print the full config like this and adjust the values as needed:
sc.getConf.getAll
Also have a look at SPARK-5928 and Spark-TaskRunner-FetchFailedException. Possible reasons: OOM or container memory limits.
Repartition both sources and destinations before joining, with a number of partitions such that each partition would be 10 MB-128 MB (try to tune it); there is no need to make it 20000 (IMHO too many).
Then join on those two columns and write, without repartitioning (i.e. the output partitions should be the same as the repartitioning before the join); see the sketch after this answer.
If you still have trouble, try the same thing after converting both dataframes to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).
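A rough sketch of that recipe (in PySpark, although the thread uses Scala; sources, destinations, the column names and the S3 path come from the question, and the partition count is a guess to tune so each partition lands around 10-128 MB):
# Guess: tune num_parts so each shuffle partition holds roughly 10-128 MB of data.
num_parts = 2000

sources_r = sources.repartition(num_parts, "source_d", "destination_d")
destinations_r = destinations.repartition(num_parts, "source_d", "destination_d")

result = (sources_r.join(destinations_r, ["source_d", "destination_d"])
          .select("source_i", "destination_i"))

# Write without a further repartition/coalesce; the output partitioning follows the join.
result.write.option("delimiter", ",").csv("s3://bucket-name/results")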
