Configuration:
Spark 3.0.1
Cluster: Databricks (driver c5x.2xlarge, 2 workers of the same type)
Source: S3
Format: Parquet
Size: 50 MB
File count: 2000 (too many small files, as they are dumped from a Kinesis stream with 1-minute batches because we cannot tolerate higher latency)
Problem statement: I have 10 jobs with similar configurations, each processing a similar volume of data to the above. When I run them individually, they take 5-6 minutes each, including cluster spin-up time.
But when I run them together, they all seem to get stuck at the same point in the code and take 40-50 minutes to complete.
When I check the Spark UI, I see that all the jobs spend 90% of their time taking the source count:
df = spark.read.parquet('s3a://....')
df.cache()
df.count()   # ----- problematic step
# ...more code logic
Now, I know that taking the count before caching should be faster for Parquet files, but the jobs were taking even more time when I didn't cache the dataframe before taking the count, probably because of the huge number of small files.
What I fail to understand is why the jobs run so much faster when run one at a time.
Is S3 my bottleneck? They are all reading from the same bucket, but different paths.
** I'm using Privacera tokens for authentication.
They'll all be using the same s3a filesystem class instances on the worker nodes. There are some options there about the number of HTTP connections to keep open; fs.s3a.connection.maximum defaults to 48. If all work is against the same bucket, set it to 2x+ the number of worker threads. Do the same for "fs.s3a.max.total.tasks".
If you are using Hadoop 2.8+ binaries, switch the s3a client into random IO mode, which delivers the best performance when seeking around Parquet files: fs.s3a.experimental.fadvise = random.
Change #2 should deliver a speedup on single workloads too, so do it anyway.
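A minimal sketch of how these s3a options could be set when building the Spark session (the values are illustrative only, and the exact fadvise property name can vary between Hadoop versions):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-tuning-example")  # illustrative name
    # allow more concurrent HTTP connections than worker threads (the 2x+ rule of thumb)
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.max.total.tasks", "200")
    # random IO mode performs best when seeking around Parquet files (Hadoop 2.8+)
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    .getOrCreate()
)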
Throttling would surface as 503 responses, which are handled inside the AWS SDK and don't get collected or reported. I'd recommend that, at least while debugging this, you turn on S3 bucket logging and scan the logs for 503 responses, which indicate throttling is taking place. It's what I do. Tip: set up a rule to delete old logs to keep costs down; 1-2 weeks of logs is generally enough for me.
Finally, lots of small files are bad on HDFS and awful with object stores, as the time to list/open is so high. Try to make coalescing the files step #1 in processing the data.
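A hedged sketch of such a compaction step: read the small Parquet files once, rewrite them as fewer, larger files, and run the rest of the pipeline against the compacted copy (the paths and partition count below are assumptions, not from the original post).
raw = spark.read.parquet("s3a://mybucket/raw/")
raw.repartition(16).write.mode("overwrite").parquet("s3a://mybucket/compacted/")
df = spark.read.parquet("s3a://mybucket/compacted/")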
Related
I have millions of gzipped files to process and convert to Parquet. I'm running a simple Spark batch job on EMR to do the conversion, giving it a couple of million files at a time to convert.
However, I've noticed that there is a big delay from when the job starts to when the files are listed and split up into a batch for the executors to do the conversion. From what I have read and understood, the scheduler has to get the metadata for those files, and schedule those tasks. However, I've noticed that this step is taking 15-20 minutes for a million files to split up into tasks for a batch. Even though the actual task of listing the files and doing the conversion only takes 15 minutes with my cluster of instances, the overall job takes over 30 minutes. It appears that it takes a lot of time for the driver to index all the files to split up into tasks. Is there any way to increase parallelism for this initial stage of indexing files and splitting up tasks for a batch?
I've tried tinkering with and increasing spark.driver.cores, thinking it would increase parallelism, but it doesn't seem to have an effect.
You can try setting the config below:
spark.conf.set("spark.default.parallelism", x)
where x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5
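For example, assuming a hypothetical cluster of 4 nodes with 8 cores per node:
# illustrative numbers only; plug in your own cluster size
total_nodes_in_cluster = 4
total_cores_in_node = 8
x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5   # 4 * 7 * 5 = 140
spark.conf.set("spark.default.parallelism", x)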
This is a common problem with Spark (and other big data tools), as it uses only the driver to list all the files from the source (S3) and their paths.
Some more info here
I have found this article really helpful to solve this issue.
Instead of using Spark to list and get the metadata of the files, we can use PureTools to create a parallelized RDD of the files and pass that to Spark for processing.
S3 Specific Solution
If you do not want to install and set up tools as in the guide above, you can also use an S3 manifest file to list all the files present in a bucket and iterate over the files in parallel using RDDs.
Steps for S3 Manifest Solution
# Create an RDD from the list of file paths (e.g. taken from the S3 manifest)
pathRdd = sc.parallelize([file1, file2, file3, ..., file100])

# A function which reads the data of one file; this sketch uses boto3 and returns
# the object's contents as lines, but you can return the data in whichever
# format you like, i.e. String, array of String, etc.
def s3_path_to_data(path):
    import boto3
    bucket, key = path.replace("s3://", "").split("/", 1)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return body.read().decode("utf-8").splitlines()

# Call flatMap on the pathRdd
dataRdd = pathRdd.flatMap(s3_path_to_data)
Details
Spark will create pathRdd with the default number of partitions and then call the s3_path_to_data function on each partition's rows in parallel.
Partitions play an important role in Spark parallelism, e.g.:
If you have 4 executors and 2 partitions then only 2 executors will do the work.
You can play around num of partitions and num of executors to achieve the best performance according to your use case.
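If needed, you can set the number of partitions explicitly when parallelizing the manifest so that more executors can take part; the count below is illustrative:
# paths = the list of S3 file paths from the manifest; 100 partitions is an arbitrary example
pathRdd = sc.parallelize(paths, numSlices=100)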
The following attributes are useful for getting insight into your DataFrame or RDD specs when fine-tuning Spark parameters:
rdd.getNumPartitions() (PySpark)
rdd.partitions.length (Scala)
rdd.partitions.size (Scala)
Due to some unfortunate sequences of events, we've ended up with a very fragmented dataset stored on s3. The table metadata is stored on Glue, and data is written with "bucketBy", and stored in parquet format. Thus discovery of the files is not an issue, and the number of spark partitions is equal to the number of buckets, which provides a good level of parallelism.
When we load this dataset on Spark/EMR, each Spark partition ends up loading around ~8k files from S3.
As we've stored the data in a columnar format and our use case only needs a couple of fields, we don't really read all the data, just a very small portion of what is stored.
Based on CPU utilization on the worker nodes, I can see that each task (running per partition) is utilizing only around 20% of its CPU, which I suspect is due to a single thread per task reading files from S3 sequentially, so lots of IO wait...
Is there a way to encourage Spark tasks on EMR to read data from S3 multi-threaded, so that we can read multiple files at the same time from S3 within a task? This way we could use the 80% idle CPU to make things a bit faster.
There are two parts to reading S3 data with Spark dataframes:
Discovery (listing the objects on S3)
Reading the S3 objects, including decompressing, etc.
Discovery typically happens on the driver. Some managed Spark environments have optimizations that use cluster resources for faster discovery. This is not typically a problem unless you get beyond 100K objects. Discovery is slower if you have .option("mergeSchema", true), as each file has to be touched to discover its schema.
Reading S3 files is part of executing an action. The parallelism of reading is min(number of partitions, number of available cores). More partitions + more available cores means faster I/O... in theory. In practice, S3 can be quite slow if you haven't accessed these files regularly enough for S3 to scale up their availability. Therefore, in practice, additional Spark parallelism has diminishing returns. Watch the total network read/write bandwidth per active core and tune your execution for the highest value.
You can discover the number of partitions with df.rdd.partitions.length.
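For instance, a rough PySpark check of how much read parallelism is actually available (defaultParallelism is used here as an approximation of the total available cores):
num_partitions = df.rdd.getNumPartitions()
available_cores = spark.sparkContext.defaultParallelism  # rough proxy for total cores
print("effective read parallelism ~", min(num_partitions, available_cores))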
There are additional things you can do if the S3 I/O throughput is low:
Make sure the data on S3 is dispersed when it comes to its prefix (see https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
Open an AWS support request and ask for the prefixes holding your data to be scaled up.
Experiment with different node types. We have found storage-optimized nodes to have better effective I/O.
Hope this helps.
When my Spark program executes, it creates 1000 stages; however, the recommendation I have seen is only 200. I have two actions at the end that write data to S3, and after that I unpersist the dataframes. Yet when my Spark program writes the data to S3, it still runs for almost 30 more minutes. Why is that? Is it due to the large number of dataframes I have persisted?
P.S. I am running the program with only 5 input records.
The cluster probably takes a long time to append data to an existing dataset. In particular, all of the Spark jobs have finished but your command has not, because the driver node is moving the output files of the tasks from the job's temporary directory to the final destination one by one, which is slow with cloud storage. Try setting the configuration mapreduce.fileoutputcommitter.algorithm.version to 2.
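A minimal sketch of setting that committer version when building the session (it can equally be passed with --conf on spark-submit):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # the v2 committer avoids the slow one-by-one rename of task output at job commit
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)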
We run Spark in standalone mode with 3 nodes on a 240GB "large" EC2 box to merge three CSV files, read into DataFrames and converted to JavaRDDs, into output CSV part files on S3 using s3a.
We can see from the Spark UI that the first stages, which read and merge to produce the final JavaRDD, run at 100% CPU as expected, but the final stage, writing out the CSV files via saveAsTextFile at package.scala:179, gets "stuck" for many hours on 2 of the 3 nodes, with 2 of the 32 tasks taking hours (the box is at 6% CPU, 86% memory, 15 kB/s network IO and 0 disk IO for the entire period).
We are reading and writing uncompressed CSV (we found uncompressed was much faster than gzipped CSV) with repartition(16) on each of the three input DataFrames and no coalesce on the write.
Would appreciate any hints on what we can investigate as to why the final stage spends so many hours doing very little on 2 of the 3 nodes in our standalone local cluster.
Many thanks
--- UPDATE ---
I tried writing to local disk rather than s3a and the symptoms are the same - 2 of the 32 tasks in the final stage saveAsTextFile get "stuck" for hours:
If you are writing to S3, via s3n, s3a or otherwise, do not set spark.speculation = true unless you want to run the risk of corrupted output.
What I suspect is happening is that the final stage of the process is renaming the output file, which on an object store involves copying lots (many GB?) of data. The rename takes place on the server, with the client just keeping an HTTPS connection open until it finishes. I'd estimate S3A rename time as about 6-8 Megabytes/second...would that number tie in with your results?
Write to local HDFS then, afterwards, upload the output.
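A hedged sketch of that approach, with illustrative paths:
# write the output to HDFS first (path is an assumption)
rdd.saveAsTextFile("hdfs:///tmp/job-output/")
Then copy the result up to S3 as a separate step, for example:
hadoop distcp hdfs:///tmp/job-output/ s3a://mybucket/job-output/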
gzip compression can't be split, so Spark will not assign parts of processing a file to different executors. One file: one executor.
Try to avoid CSV, it's an ugly format. Embrace Avro, Parquet or ORC. Avro is great for other apps to stream into; the others are better for downstream processing in other queries. Significantly better.
And consider compressing the files with a format such as lzo or snappy, both of which can be split.
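For example, writing the merged result as snappy-compressed Parquet instead of uncompressed CSV (mergedDf and the output path are hypothetical; this assumes the merged data is kept as a DataFrame rather than a JavaRDD):
mergedDf.write.option("compression", "snappy").parquet("s3a://mybucket/merged-parquet/")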
see also slides 21-22 on: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
I have seen similar behavior. There is a bug fix in HEAD as of October 2016 that may be relevant. But for now you might enable
spark.speculation=true
in the SparkConf or in spark-defaults.conf.
Let us know if that mitigates the issue.
I am building an application that needs to load data sets from S3. The functionality is working correctly, but the performance is surprisingly slow.
The datasets are in CSV format. There are approximately 7M records (lines) in each file, and each file is 600-700MB.
val spark = SparkSession
  .builder()
  .appName("MyApp")
  .getOrCreate()

val df = spark
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(inFileName:_*)
// inFileName is a list that currently contains 2 file names
// e.g. s3://mybucket/myfile1.csv

val r = df.rdd.flatMap{ row =>
  /*
   * Discard poorly formatted input records
   */
  try {
    totalRecords.add(1)
    // this extracts several columns from the dataset
    // each tuple of indexColProc specifies the index of the column to
    // select from the input row, and a function to convert
    // the value to an Int
    val coords = indexColProc.map{ case (idx, func) => func(row.get(idx).toString) }
    List((coords(0), coords))
  }
  catch {
    case e: Exception => {
      badRecords.add(1)
      List()
    }
  }
}

println("Done, row count " + r.count)
I ran this on an AWS cluster of 5 machines, each an m3.xlarge. The maximizeResourceAllocation parameter was set to true, and this was the only application running on the cluster.
I ran the application twice, the first time with 'inFileName' pointing at the files on S3, and the second time pointing at a local copy of the files in the Hadoop file system.
When I look at the Spark history server and drill down to the job that corresponds to the final r.count action, I see that it takes 2.5 minutes when accessing the files on S3 and 18s when accessing the files locally on HDFS. I've gotten proportionally similar results when I run the same experiment on a smaller cluster or in master=local configuration.
When I copy the S3 files to the cluster using
aws s3 cp <file>
it only takes 6.5s to move one 600-700MB file, so it doesn't seem that the raw I/O of the machine instance is contributing much to the slowdown.
Is this kind of slow performance when accessing S3 expected? If not, could someone please point out where I'm going wrong? If it is expected, are there other ways to do this that would have better performance? Or do I need to develop something to simply copy the files over from S3 to HDFS before the application runs?
After some more digging I discovered that using S3 native makes a huge difference. I just changed the URI prefix to s3n:// and the performance for the job in question went from 2.5 minutes down to 21s. So only a 3s penalty for accessing s3 vs hdfs, which is pretty reasonable.
When searching for this topic there are many posts that mention s3n has a max file size limit of 5GB. However, I came across this, which says that the max file size limit was increased to 5TB in Hadoop 2.4.0:
"Using the S3 block file system is no longer recommended."
We faced the exact same issue about a couple of months ago, except that our data was 1TB so the issue was more pronounced.
We dug into it and finally came to the following conclusion:
Since we had 5 instances with 30 executors each, every time a stage was scheduled the first thing each task would do was fetch data from S3. So these tasks would be bottlenecked on network bandwidth, and then they would all move to the compute part of the task at the same time and contend for CPU.
So basically, because the tasks are all doing the same thing at the same time, they are always contending for the same resources.
We figured out that allowing only k tasks at any point would let them finish the download quickly and move on to the compute part, and the next set of k tasks could then come in and start downloading. This way, k tasks (as opposed to all of them) get full bandwidth, and some tasks are simultaneously doing something useful on CPU or I/O without waiting for each other on a common resource.
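The answer doesn't say how the cap of k concurrent tasks was enforced; one hedged way to approximate it is to give each executor fewer cores, so fewer tasks hit S3 at the same time (the value below is illustrative):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.cores", "2")  # illustrative; fewer concurrent tasks per executor
    .getOrCreate()
)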
Hope this helps.
Did you try the spark-csv package? There is a lot of optimization for reading CSV, and you can use mode=DROPMALFORMED to drop the bad lines you are trying to filter. You can read from S3 directly like this:
csv_rdf <- read.df(sqlContext, "s3n://xxxxx:xxxxx#foldername/file1.csv", source = "com.databricks.spark.csv")
More details can be found here https://github.com/databricks/spark-csv
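For reference, a rough PySpark equivalent of the above (the path, header setting and mode are illustrative assumptions):
df = (spark.read
      .format("com.databricks.spark.csv")  # on Spark 2+, .csv(...) works directly
      .option("header", "true")
      .option("mode", "DROPMALFORMED")     # drop badly formatted lines
      .load("s3n://mybucket/foldername/file1.csv"))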