Spark: Cut down no. of output files

I wrote a Spark program that mimics functionality of an existing Map Reduce job. The MR job takes about 50 minutes every day, but the Spark job took only 9 minutes! That’s great!
When I looked at the output directory, I noticed that it created 1,020 part files. The MR job uses only 20 reducers, so it creates only 20 files. We need to cut down the number of output files; otherwise our namespace would be full in no time.
I am trying to figure out how I can reduce the number of output files under Spark. It seems like 1,020 tasks are getting triggered and each one creates a part file. Is this correct? Do I have to change the level of parallelism to cut down the number of tasks, thereby reducing the number of output files? If so, how do I set it? I am afraid cutting down the number of tasks will slow down this process – but I can test that!

Cutting down the number of reduce tasks will slow down the process for sure. However, it should still be considerably faster than Hadoop MapReduce for your use case.
In my opinion, the best method to limit the number of output files is to use the coalesce(numPartitions) transformation. Below is an example:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ctx = new JavaSparkContext(/* your configuration */);
JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");
// Suppose the input yields 1,020 partitions and thus 1,020 map tasks
JavaRDD<String> mappedData = myData.map(line -> line /* your map function */);
// Suppose we need 20 output files
JavaRDD<String> newData = mappedData.coalesce(20);
newData.saveAsTextFile("output path");
In this example, the map function is still executed by 1,020 tasks; that parallelism is not altered in any way. After the partitions have been coalesced, however, there are only 20 partitions to work with, so 20 output files are saved at the end of the program.
As mentioned earlier, keep in mind that this method will be slower than writing 1,020 output files, because the data has to be consolidated into far fewer partitions (from 1,020 down to 20).
Note: also take a look at the repartition(numPartitions) transformation, which always performs a full shuffle and can increase the number of partitions as well as decrease it.
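For comparison, here is a minimal sketch (reusing the hypothetical mappedData variable from the example above) of how repartition could be used instead; unlike coalesce, it always performs a full shuffle, which costs more I/O but produces evenly sized partitions:
// repartition(20) shuffles all the data into 20 roughly even partitions;
// coalesce(20) only merges existing partitions and avoids a full shuffle.
JavaRDD<String> reshuffled = mappedData.repartition(20);
reshuffled.saveAsTextFile("another output path");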

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10 MB) and gzipped, which I know is inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing the files (it looks like around 20 minutes before the batch starts). In the UI for the single-directory case, there is 1 task after these 20 minutes, which looks like the conversion itself.
However, with individual filenames, this indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files, there are 2 tasks: (1) the first one is listing the leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory, so when it is given a list of paths it has to do a list call on each one. For S3 that means 8M LIST calls against the S3 servers, which are rate limited to roughly 3K requests/second (ignoring details like client thread count, HTTP connections, etc.). With LIST billed at $0.005 per 1,000 calls, 8M requests come to about $40. And because the LIST of a plain file returns nothing, the client falls back to a HEAD request for each path, which adds another S3 API call per file, roughly doubling execution time and adding its own request charges on top.
In contrast, listing a directory with 8M entries kicks off a single LIST request for the first 1,000 entries, followed by 7,999 follow-up requests for the remaining pages. Recent s3a releases prefetch the next page of results asynchronously (faster still if the incremental list iterators are used): one thread fetches while another processes, and the whole listing costs about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even before EC2 server costs are taken into account.
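To make the contrast concrete, here is a minimal sketch of the directory-based read in Java (the bucket and prefix come from the question; the application name and output prefix are hypothetical):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// One paged LIST over the prefix instead of a LIST (plus HEAD fallback) per file.
SparkSession spark = SparkSession.builder().appName("gz-to-parquet").getOrCreate();
Dataset<Row> df = spark.read().csv("s3://input_bucket_name/data/");
// Hypothetical output prefix; write the converted data as Parquet.
df.write().parquet("s3://input_bucket_name/output-parquet/");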

Where are my extra Spark tasks coming from

I have a Spark program that is training several ML algorithms. The code that generates the final stage of my job looks like this (in Kotlin):
val runConfigs = buildOptionsCrossProduct(opts)
log.info("Will run {} different configurations.", runConfigs.size)
val runConfigsRdd: JavaRDD<RunConfiguration> = sc.parallelize(runConfigs)
// Create an RDD mapping window size to the score for that window size.
val accuracyRdd = runConfigsRdd.mapToPair { runConfig: RunConfiguration ->
    runSingleOptionSet(runConfig, opts, trainingBroadcast, validBroadcast)
}
accuracyRdd.saveAsTextFile(opts.output)
runConfigs is a list of 18 items. The log line right after the configs are generated shows:
17/02/06 19:23:20 INFO SparkJob: Will run 18 different configurations.
So I'd expect at most 18 tasks, as there should be at most one task per stage per partition (at least that's my understanding). However, the History server reports 80 tasks, most of which finish very quickly and, not surprisingly, produce no output.
There are in fact 80 output files generated with all but 18 of them being empty. My question is, what are the other 80 - 18 = 62 tasks in this stage doing and why did they get generated?
You use SparkContext.parallelize without providing the numSlices argument, so Spark uses defaultParallelism, which in your case is probably 80. In general, parallelize tries to spread the data uniformly between partitions, but it doesn't remove empty ones, so if you want to avoid scheduling empty tasks you should set numSlices to a number smaller than or equal to runConfigs.size.
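A minimal sketch of that fix, written in Java rather than the question's Kotlin (RunConfiguration, buildOptionsCrossProduct, opts and sc are the question's own names):
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Passing numSlices explicitly gives exactly one partition (and hence one task and
// one output file) per configuration, instead of defaultParallelism (80) partitions.
List<RunConfiguration> runConfigs = buildOptionsCrossProduct(opts);
JavaRDD<RunConfiguration> runConfigsRdd = sc.parallelize(runConfigs, runConfigs.size());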

Spark shuffle writes growing out of control

I'm using Spark 1.6.1 to process some archives from CommonCrawl. They come as gzipped text files, and I've read that Spark has to load such compressed files into an RDD of a single partition. However, I'm running it on a cluster of ten nodes with 4 CPUs each, so I need to repartition in order to process the data in parallel. These repartition steps are taking what seems like an unacceptably long time, and when I look at the web UI, the shuffle write for any repartition step grows to over 40 GB, even though one .gz archive is only around 100 MB. Here is the relevant portion of the code I'm running:
final WhitelistFilter<String> filter =
    new WhitelistFilter<String>(WHITELIST_THRESHOLD, termWeights, regex);

// Each URL points to a gzip-compressed WET text file in the CommonCrawl S3 bucket,
// with ~40,000 pages per archive.
int counter = 0;
for (String s : ccURL) {
    // Obtain the WET file from the URL, and filter down to pages of English text
    JavaPairRDD<LongWritable, Text> raw =
        sc.newAPIHadoopFile("sample.wet.gz",
            org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class,
            LongWritable.class, Text.class, s3Conf)
        .repartition(sc.defaultParallelism() * 3);
    JavaRDD<String> pages = JavaPairRDD.fromJavaRDD(raw
        .filter(new WETFilter())
        .map(new WETTransformerWithURL())
        .filter(filter)).keys();
    pages.saveAsTextFile("pages-" + (counter++) + ".txt");
}
The various functions called in the filter and map steps are just basic text processing - the most complicated thing is assigning a score based on term frequencies and filtering out anything below a threshold. If I remove the call to repartition(), the entire thing will finish quickly, but without any parallelism. What about the repartitioning could be causing it to be so incredibly slow, and also make the block manager write tens of gigabytes to the disk?
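For background on the premise that a gzipped file is loaded into a single partition: gzip is not a splittable compression format, so however large the archive, the initial read yields one partition, and repartition() then has to shuffle the entire decompressed contents to spread it across the cluster. Below is a minimal sketch to confirm the partition count (the SparkConf setup is hypothetical; the file name is the question's placeholder):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("gz-partitions"));
// A .gz file cannot be split, so this RDD starts with exactly one partition.
JavaRDD<String> lines = jsc.textFile("sample.wet.gz");
System.out.println("partitions = " + lines.getNumPartitions());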

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to different values as required. eg:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the same KafkaDStream, or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is whether you really need the sliding updates implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, dropping it will of course minimize the amount of data in the system. You could have a streamA with a batch interval of 5 minutes and a streamB which is derived from it (with window(Hours(1)), since 60 is a multiple of 5).
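Here is a minimal sketch of approach (i) in Java (the question's pseudo-code is Scala); the one-minute batch interval is an assumption, and createKafkaStream stands in for however you actually build the Kafka input DStream:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("multi-window-stats");
// Batch interval of one minute; both window and slide durations are multiples of it.
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));
JavaDStream<String> kafkaDStream = createKafkaStream(jssc);  // hypothetical helper

// Two windowed views over the same underlying DStream: the per-batch RDDs are shared,
// each window merely groups a different range of them.
JavaDStream<String> streamA = kafkaDStream.window(Durations.minutes(5), Durations.minutes(1));
JavaDStream<String> streamB = kafkaDStream.window(Durations.minutes(60), Durations.minutes(10));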

How does the Apache Spark scheduler split files into tasks?

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; slide 17 shows a stage being split into 4 tasks.
I want to know three things about how a stage gets split into tasks:
In the example above, it seems that the number of tasks is based on the number of files. Am I right?
If I'm right in point 1, and there were just 3 files under the directory, would it create just 3 tasks?
If I'm right in point 2, what if there is just one very large file? Would this stage be split into just 1 task? And what about when the data comes from a streaming data source?
Thanks a lot; I'm confused about how a stage gets split into tasks.
You can configure the number of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into partitions of relatively even size (*). Large files will be broken down accordingly; you can see the actual number of partitions with:
rdd.partitions.size
So no, you will not end up with a single worker chugging away for a long time on a single file.
(*) If you have very small files, that may change this behaviour. In any case, large files will follow this pattern.
The split occurs in two stages:
Firstly, HDFS splits the logical file into 64 MB or 128 MB blocks when the file is loaded.
Secondly, Spark schedules a map task to process each block.
There is a fairly complex internal scheduling process, as there are (by default) three copies of each block stored on three different servers, and for large logical files it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between Hadoop distributions.
When all the map tasks have run, the collectors, shuffle and reduce tasks can then be run.
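A minimal sketch of how to observe this block-to-partition mapping (sc is an existing JavaSparkContext; the HDFS path and the minPartitions value of 100 are hypothetical):
import org.apache.spark.api.java.JavaRDD;

// With the default minPartitions, a large HDFS file yields roughly one partition
// (and hence one map task) per 64 MB / 128 MB block.
JavaRDD<String> lines = sc.textFile("hdfs:///data/big-file.txt");
System.out.println("partitions = " + lines.getNumPartitions());

// Asking for more partitions subdivides the blocks further.
JavaRDD<String> moreSplits = sc.textFile("hdfs:///data/big-file.txt", 100);
System.out.println("partitions = " + moreSplits.getNumPartitions());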
Stage: a new stage is created whenever a wide transformation occurs.
Task: tasks are created based on the partitions of the data on the workers, one task per partition.
See the following link for more explanation: How DAG works under the covers in RDD?
Question 1: In the example above, it seems that the number of tasks is based on the number of files. Am I right?
Answer: It's not based on the number of files; it's based on the HDFS blocks (0.gz, 1.gz, etc. are blocks of data saved or stored in HDFS).
Question 2: If I'm right in point 1, and there were just 3 files under the directory, would it create just 3 tasks?
Answer: By default, the block size in Hadoop is 64 MB, and each block of data is treated as a partition in Spark.
Note: number of partitions = number of tasks; that's why 3 tasks were created here.
Question 3: What if there is just one very large file? Would this stage be split into just 1 task? And what about when the data comes from a streaming data source?
Answer: No, a very large file will be partitioned, and as answered in question 2, the number of tasks is determined by the number of partitions.

Resources