Improving performance for Spark with a large number of small files? - apache-spark

I have millions of Gzipped files to process and convert to Parquet. I'm running a simple Spark batch job on EMR to do the conversion, and giving it a couple million files at a time to convert.
However, I've noticed that there is a big delay between when the job starts and when the files are listed and split up into a batch for the executors to do the conversion. From what I have read and understood, the scheduler has to get the metadata for those files and schedule the tasks. However, I've noticed that this step takes 15-20 minutes for a million files to be split up into tasks for a batch. Even though the actual work of listing the files and doing the conversion only takes 15 minutes with my cluster of instances, the overall job takes over 30 minutes. It appears that it takes a lot of time for the driver to index all the files to split up into tasks. Is there any way to increase parallelism for this initial stage of indexing files and splitting up tasks for a batch?
I've tried tinkering with and increasing spark.driver.cores thinking that it would increase parallelism, but it doesn't seem to have an effect.

You can try setting the config below:
spark.conf.set("spark.default.parallelism", x)
where x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5
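As a hedged illustration of that formula in PySpark (the node and core counts below are placeholders, and note that spark.default.parallelism generally has to be in place before the SparkContext is created, e.g. on the SparkSession builder or via spark-submit --conf, rather than set at runtime):

from pyspark.sql import SparkSession

# Placeholder cluster shape: 10 worker nodes with 16 cores each
total_nodes_in_cluster = 10
total_cores_in_node = 16
x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5

spark = (SparkSession.builder
         .appName("gzip-to-parquet")              # placeholder app name
         .config("spark.default.parallelism", x)  # applied at context creation
         .getOrCreate())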

This is a common problem with Spark (and other big data tools), as it uses only the driver to list all the files from the source (S3) and their paths.
Some more info here
I have found this article really helpful for solving this issue.
Instead of using Spark to list and get the metadata of the files, we can use PureTools to create a parallelized RDD of the files and pass that to Spark for processing.
S3 Specific Solution
If you do not want to install and set up tools as in the guide above, you can also use an S3 manifest file to list all the files present in a bucket and iterate over the files using RDDs in parallel.
Steps for S3 Manifest Solution
# Create an RDD from the list of file paths
pathRdd = sc.parallelize([file1, file2, file3, ......., file100])

# Create a function which reads the data of a file
def s3_path_to_data(path):
    # Get the data from S3
    # Return the data in whichever format you like, i.e. String, array of Strings, etc.
    ...

# Call flatMap on the pathRdd
dataRdd = pathRdd.flatMap(s3_path_to_data)
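For illustration, here is a hedged boto3-based sketch of what s3_path_to_data could look like (the s3:// path format, the use of boto3, and returning one record per line are all assumptions, not part of the original answer):

import boto3

def s3_path_to_data(path):
    # Assumes paths look like "s3://bucket/key" (an assumption for this sketch)
    bucket, key = path.replace("s3://", "").split("/", 1)
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    # Return one record per line so that flatMap yields individual rows
    return obj["Body"].read().decode("utf-8").splitlines()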
Details
Spark will create pathRdd with the default number of partitions, then call the s3_path_to_data function on each partition's rows in parallel.
Partitions play an important role in Spark parallelism. For example, if you have 4 executors and 2 partitions, then only 2 executors will do the work.
You can play around with the number of partitions and the number of executors to achieve the best performance for your use case.
Following are some useful attributes you can use to get insight into your DataFrame or RDD and fine-tune Spark parameters:
rdd.getNumPartitions()   (PySpark)
rdd.partitions.length    (Scala)
rdd.partitions.size      (Scala)
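As a hedged PySpark sketch of that tuning step (the executor and core counts are placeholders):

num_executors = 4          # placeholder
cores_per_executor = 4     # placeholder

print(pathRdd.getNumPartitions())   # inspect the current partition count
# Repartition so every executor core has at least one task to work on
pathRdd = pathRdd.repartition(num_executors * cores_per_executor)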

Related

Large number of stages in my spark program

When my Spark program executes, it creates 1000 stages. However, the recommendation I have seen is only 200. I have two actions at the end to write data to S3, and after that I unpersist the dataframes. Now, when my Spark program writes the data into S3, it still runs for almost 30 more minutes. Why is that? Is it due to the large number of dataframes I have persisted?
P.S. -> I am running the program for 5 input records only.
Probably the cluster takes a long time to append data to an existing dataset. In particular, if all of the Spark jobs have finished but your command has not, it is because the driver node is moving the output files of the tasks from the job's temporary directory to the final destination one-by-one, which is slow with cloud storage. Try setting the configuration mapreduce.fileoutputcommitter.algorithm.version to 2.
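A minimal sketch of setting that option when building the SparkSession (it can equally go in spark-defaults.conf or be passed to spark-submit; the app name is a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-writer")
         # Use the v2 file output committer, which commits task output at task
         # commit time instead of renaming files one-by-one at job commit
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())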

Spark partition by files

I have several thousand compressed CSV files in an S3 bucket, each approximately 30 MB in size (around 120-160 MB after decompression), which I want to process using Spark.
In my Spark job, I am doing simple filter/select queries on each row.
While partitioning, Spark divides each file into two or more parts and then creates tasks for each partition. Each task takes around 1 minute to complete just to process 125K records. I want to avoid this partitioning of a single file across many tasks.
Is there a way to fetch the files and partition the data such that each task works on one complete file, i.e. number of tasks = number of input files?
As well as playing with Spark options, you can tell the s3a filesystem client to report to Spark that the "block size" of a file in S3 is 128 MB. The default is 32 MB, which is close enough to your "approximately 30 MB" number that Spark could be splitting the files in two:
spark.hadoop.fs.s3a.block.size 134217728
Using the wholeTextFiles() operation is safer, though.
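A hedged PySpark sketch of both suggestions (the bucket path and the line-oriented parsing are placeholders):

from pyspark.sql import SparkSession

# Option 1: raise the block size the s3a client reports, so ~30 MB files are not split
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.block.size", "134217728")
         .getOrCreate())

# Option 2: read whole files, one (path, content) pair per file, so one task = one file
pairs = spark.sparkContext.wholeTextFiles("s3a://my-bucket/csv-prefix/")
rows = pairs.flatMap(lambda kv: kv[1].splitlines())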

How can you quickly save a dataframe/RDD from PySpark to disk as a CSV/Parquet file?

I have a Google Dataproc Cluster running and am submitting a PySpark job to it that reads in a file from Google Cloud Storage (945MB CSV file with 4 million rows --> takes 48 seconds in total to read in) to a PySpark Dataframe and applies a function to that dataframe (parsed_dataframe = raw_dataframe.rdd.map(parse_user_agents).toDF() --> takes about 4 or 5 seconds).
I then have to save those modified results back to Google Cloud Storage as a GZIP'd CSV or Parquet file. I can also save those modified results locally, and then copy them to a GCS bucket.
I repartition the dataframe via parsed_dataframe = parsed_dataframe.repartition(15) and then try saving that new dataframe via
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")
parsed_dataframe.write.format("com.databricks.spark.csv").save("gs://somefolder/", header="true")
parsed_dataframe.write.format("com.databricks.spark.csv").options(codec="org.apache.hadoop.io.compress.GzipCodec").save("gs://nyt_regi_usage/2017/max_0722/regi_usage/", header="true")
Each of those methods (and their different variants with lower/higher partitions and saving locally vs. on GCS) takes over 60 minutes for the 4 million rows (945 MB) which is a pretty long time.
How can I optimize this/make saving the data faster?
It is worth noting that both the Dataproc cluster and the GCS bucket are in the same region/zone, and that the cluster has an n1-highmem-8 (8 CPU, 52 GB memory) master node with 15+ worker nodes (just variables I'm still testing out).
Some red flags here.
1) Reading as a DF, then converting to an RDD to process, and then back to a DF is very inefficient on its own. You lose the Catalyst and Tungsten optimizations by reverting to RDDs. Try changing your function to work within a DF (see the sketch below).
2) Repartitioning forces a shuffle, but more importantly it means that the computation will now be limited to the executors controlling the 15 partitions. If your executors are large (7 cores, 40-ish GB RAM), this likely isn't a problem.
What happens if you write the output before being repartitioned?
Please provide more code and ideally spark UI output to show how long each step in the job takes.
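A hedged sketch of point 1), keeping the parsing inside the DataFrame API with a UDF instead of dropping to RDDs. It assumes parse_user_agents (from the question) can be applied to a single column value, and the "user_agent" column name is a placeholder:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap the existing parsing function as a column-level UDF
parse_ua = udf(parse_user_agents, StringType())
parsed_dataframe = raw_dataframe.withColumn("parsed_ua", parse_ua("user_agent"))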
Try this; it should take a few minutes:
your_dataframe.write.csv("export_location", mode="overwrite", header=True, sep="|")
Make sure you add mode="overwrite" if you want to overwrite an old version.
Are you calling an action on parsed_dataframe?
As you wrote it above, Spark will not compute your function until you call write. If you're not calling an action, see how long parsed_dataframe.cache().count() takes. I suspect it will take an hour, and that running parsed_dataframe.write(...) afterwards will be much faster.
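A quick sketch of that check, separating the compute time from the write time (the output path is the one from the question):

parsed_dataframe.cache()
print(parsed_dataframe.count())   # forces the parsing work to run now
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")   # now measures the write alone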

Spark Stand Alone - Last Stage saveAsTextFile takes many hours using very little resources to write CSV part files

We run Spark in standalone mode with 3 nodes on a 240 GB "large" EC2 box to merge three CSV files, read into DataFrames and converted to JavaRDDs, into output CSV part files on S3 using s3a.
We can see from the Spark UI that the first stages, reading and merging to produce the final JavaRDD, run at 100% CPU as expected, but the final stage, writing out the CSV files using saveAsTextFile at package.scala:179, gets "stuck" for many hours on 2 of the 3 nodes, with 2 of the 32 tasks taking hours (the box is at 6% CPU, memory 86%, network IO 15 kb/s, disk IO 0 for the entire period).
We are reading and writing uncompressed CSV (we found uncompressed was much faster than gzipped CSV), with repartition(16) on each of the three input DataFrames and without coalescing the write.
Would appreciate any hints on what we can investigate as to why the final stage takes so many hours doing very little on 2 of the 3 nodes in our standalone local cluster.
Many thanks
--- UPDATE ---
I tried writing to local disk rather than s3a and the symptoms are the same - 2 of the 32 tasks in the final stage saveAsTextFile get "stuck" for hours.
If you are writing to S3, via s3n, s3a or otherwise, do not set spark.speculation = true unless you want to run the risk of corrupted output.
What I suspect is happening is that the final stage of the process is renaming the output file, which on an object store involves copying lots (many GB?) of data. The rename takes place on the server, with the client just keeping an HTTPS connection open until it finishes. I'd estimate S3A rename time as about 6-8 Megabytes/second...would that number tie in with your results?
Write to local HDFS then, afterwards, upload the output.
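A minimal sketch of that pattern (the RDD name and paths are placeholders; the copy step shown in the comment uses hadoop distcp, which is one common choice):

# Write the part files to HDFS first, where the commit/rename is cheap
mergedRdd.saveAsTextFile("hdfs:///tmp/merged-csv")

# Then copy to S3 outside of Spark, for example:
#   hadoop distcp hdfs:///tmp/merged-csv s3a://my-bucket/merged-csv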
gzip compression can't be split, so Spark will not assign parts of processing a file to different executors. One file: one executor.
Try and avoid CSV, it's an ugly format. Embrace: Avro, Parquet or ORC. Avro is great for other apps to stream into, the others better for downstream processing in other queries. Significantly better.
And consider compressing the files with a format such as lzo or snappy, both of which can be split.
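For example, a hedged sketch of writing the merged data as snappy-compressed Parquet instead of CSV (the DataFrame name and output path are placeholders):

mergedDf.write.option("compression", "snappy").parquet("hdfs:///tmp/merged-parquet")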
see also slides 21-22 on: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
I have seen similar behavior. There is a bug fix in HEAD as of October 2016 that may be relevant. But for now you might enable
spark.speculation=true
in the SparkConf or in spark-defaults.conf .
Let us know if that mitigates the issue.

Is reading a CSV file from S3 into a Spark dataframe expected to be so slow?

I am building an application that needs to load data sets from S3. The functionality is working correctly, but the performance is surprisingly slow.
The datasets are in CSV format. There are approximately 7M records (lines) in each file, and each file is 600-700MB.
val spark = SparkSession
  .builder()
  .appName("MyApp")
  .getOrCreate()

val df = spark
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(inFileName:_*)
  // inFileName is a list that currently contains 2 file names
  // e.g. s3://mybucket/myfile1.csv

val r = df.rdd.flatMap{ row =>
  /*
   * Discard poorly formatted input records
   */
  try {
    totalRecords.add(1)

    // this extracts several columns from the dataset
    // each tuple of indexColProc specifies the index of the column to
    // select from the input row, and a function to convert
    // the value to an Int
    val coords = indexColProc.map{ case (idx, func) => func( row.get(idx).toString ) }
    List( (coords(0), coords) )
  }
  catch {
    case e: Exception => {
      badRecords.add(1)
      List()
    }
  }
}

println("Done, row count " + r.count )
I ran this on an AWS cluster of 5 machines, each an m3.xlarge. The maximizeResourceAllocation parameter was set to true, and this was the only application running on the cluster.
I ran the application twice, the first time with 'inFileName' pointing at the files on S3, and the second time pointing at a local copy of the files in the Hadoop file system.
When I look at the Spark history server and drill down to the job that corresponds to the final r.count action, I see that it takes 2.5 minutes when accessing the files on S3, and 18 s when accessing the files locally on HDFS. I've gotten proportionally similar results when I run the same experiment on a smaller cluster or in master=local configuration.
When I copy the s3 files to the cluster using
aws s3 cp <file>
It only takes 6.5s to move one 600-700MB file. So it doesn't seem the raw I/O of the machine instance is contributing that much to the slow down.
Is this kind of slow performance when accessing s3 expected? If not, could someone please point out where I'm going wrong. If it is expected, are other ways to do this that would have better performance? Or do I need to develop something to simply copy the files over from s3 to hdfs before the application runs?
After some more digging I discovered that using S3 native makes a huge difference. I just changed the URI prefix to s3n:// and the performance for the job in question went from 2.5 minutes down to 21s. So only a 3s penalty for accessing s3 vs hdfs, which is pretty reasonable.
When searching for this topic there are many posts that mention s3n has a max file size limit of 5 GB. However, I came across this, which says that the max file size limit was increased to 5 TB in Hadoop 2.4.0:
"Using the S3 block file system is no longer recommended."
We faced the exact same issue about a couple of months ago, except that our data was 1TB so the issue was more pronounced.
We dug into it and finally came to the following conclusion:
Since we had 5 instances with 30 executors each, every time a stage was scheduled the first thing each task would do is fetch data from S3, so the tasks were bottlenecked on network bandwidth; then they would all move on to the compute part of the task and contend for CPU simultaneously.
So basically, because the tasks are all doing the same thing at the same time, they are always contending for the same resources.
We figured out that allowing only k tasks at any point would let them finish their downloads quickly and move on to the compute part, and the next set of k tasks could then come in and start downloading. This way, k (as opposed to all) tasks get full bandwidth, and some tasks are simultaneously doing something useful on CPU or I/O without waiting for each other on some common resource.
Hope this helps.
Did you try the spark-csv package? There is a lot of optimization for reading CSV, and you can use mode=DROPMALFORMED to drop the bad lines you are trying to filter. You can read from S3 directly like this:
csv_rdf<- read.df(sqlContext,"s3n://xxxxx:xxxxx#foldername/file1.csv",source="com.databricks.spark.csv")
More details can be found here https://github.com/databricks/spark-csv
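A hedged PySpark equivalent (the bucket path is a placeholder; on Spark 2.x+ the CSV reader is built in, so the spark-csv package is only needed on 1.x):

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")   # silently drop malformed lines
      .load("s3a://my-bucket/file1.csv"))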
