How does the Apache Spark scheduler split files into tasks? - apache-spark

In spark-summit 2014, Aaron gives the speak A Deeper Understanding of Spark Internals , in his slide, page 17 show a stage has been splited into 4 tasks as bellow:
Here I wanna know three things about how does a stage be splited into tasks?
in this example above, it seems that tasks' number are created based on the file number, am I right?
if I'm right in point 1, so if there was just 3 files under directory names, will it just create 3 tasks?
If I'm right in point 2, what if there is just one but very large file? Does it just split this stage into 1 task? And what if when the data is coming from a streaming data source?
thanks a lot, I feel confused in how does the stage been splited into tasks.

You can configure the # of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
Spark will divide the work into relatively even sizes (*) . Large files will be broken down accordingly - you can see the actual size by:
rdd.partitions.size
So no you will not end up with single Worker chugging away for a long time on a single file.
(*) If you have very small files then that may change this processing. But in any case large files will follow this pattern.

The split occurs in two stages:
Firstly HDSF splits the logical file into 64MB or 128MB physical files when the file is loaded.
Secondly SPARK will schedule a MAP task to process each physical file.
There is a fairly complex internal scheduling process as there are three copies of each physical file stored on three different servers, and, for large logical files it may not be possible to run all the tasks at once. The way this is handled is one of the major differences between hadoop distributions.
When all the MAP tasks have run the collectors, shuffle and reduce tasks can then be run.

Stage: New stage will get created when a wide transformation occurs
Task: Will get created based on partitions in a worker
Attaching the link for more explanation: How DAG works under the covers in RDD?

Question 1: in this example above, it seems that tasks' number are created based on the file number, am I right?
Answer : its not based on the filenumber, its based on your hadoop block(0.gz,1.gz is a block of data saved or stored in hdfs. )
Question 2:
if I'm right in point 1, so if there was just 3 files under directory names, will it just create 3 tasks?
Answer : By default block size in hadoop is of 64MB and that block of data will be treated as partition in spark.
Note : no of partitions = no of task, because of these it has created 3tasks.
Question 3 :
what if there is just one but very large file? Does it just split this stage into 1 task? And what if when the data is coming from a streaming data source?
Answer : No, the very large file will be partitioned and as i answered for ur question 2 based on the no of partitions , no of task will be created

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10MB) and GZ, and I know it's inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing files (looks like around 20 minutes before the batch starts). In the UI for 1 directory, there is 1 task after this 20 minutes which looks like the conversion itself.
However, with individual filenames, this time for indexing increases to 2+ hours, and my job to do the conversion in the UI doesn't show up until this time. For the list of files, there are 2 tasks: (1) First one is listing leafs for 8mil files, and then (2) job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
spark assumes every path passed in is a directory
so when given a list of paths, it has to do a list call on each
which for s3 means: 8M LIST calls against the s3 servers
which is rate limited to about 3k/second, ignoring details like thread count on client, http connectons etc
and with LIST build at $0.005 per 1000 calls, so 8M requests comes to $50
oh, and as the LIST returns nothing, the client falls back to a HEAD which adds another S3 API call, doubling execution time and adding another $32 to the query cost
in contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries
and 7999 followups
s3a releases do async prefetch of the next page of results (faster, esp if the incremental list iterators are used). one thread to fetch, one to process and will cost you 4c
The big directory listing is more efficient and cost effective strategy, even ignoring EC2 server costs

How execution will happen in this spark program?

I came across the example below:
lines = sc.textFile("some_file.txt") //line_1
lineswithFriday = lines.filter(lambda line: line.startwith("Friday")) //line_2
lineswithFriday.first(); //line_3
It also says
spark scans the file only until it finds the first line starting with
friday. It does not need to go through entire file.
My question is: does it mean spark will load each line one by one in memory, see if it starts with Friday and if yes stop there?
Say line_1 created three partitions based on cores and input blocks. line_2 will do the computation through separate worker thread on each cores.
On line_3, as soon as any worker finds a line starting with Friday it will stop there?
first() and take(n) can be optimized if used on their own.
There is no mechanism in terms of "inter process communication" within Spark allowing executors to terminate prematurely because the continued processing of results would be considered superfluous. That would lead to all sorts of issues architecturally speaking.

Where are my extra Spark tasks coming from

I have a Spark program that is training several ML algorithms. The code that generates the final stage of my job looks like this (in Kotlin):
val runConfigs = buildOptionsCrossProduct(opts)
log.info("Will run {} different configurations.", runConfigs.size)
val runConfigsRdd: JavaRDD<RunConfiguration> = sc.parallelize(runConfigs)
// Create an RDD mapping window size to the score for that window size.
val accuracyRdd = runConfigsRdd.mapToPair { runConfig: RunConfiguration ->
runSingleOptionSet(runConfig, opts, trainingBroadcast, validBroadcast) }
accuracyRdd.saveAsTextFile(opts.output)
runConfigs is a list of 18 items. The log line right after the configs are generated shows:
17/02/06 19:23:20 INFO SparkJob: Will run 18 different configurations.
So I'd expect at most 18 tasks as there should be at most one task per stage per partition (at least that's my understanding). However, the History server reports 80 tasks most of which finish very quickly and, not surprisingly, produce no output:
There are in fact 80 output files generated with all but 18 of them being empty. My question is, what are the other 80 - 18 = 62 tasks in this stage doing and why did they get generated?
You use SparkContext.parallelize without providing numSlices argument so Spark is using defaultParallelism which is probably 80. In general parallelize tries to spread data uniformly between partitions but it doesn't remove empty ones so if you want to avoid executing empty task you should set numSlices to a number smaller or equal to runConfigs.size.

Spark cores & tasks concurrency

I've a very basic question about spark. I usually run spark jobs using 50 cores. While viewing the job progress, most of the times it shows 50 processes running in parallel (as it is supposed to do), but sometimes it shows only 2 or 4 spark processes running in parallel. Like this:
[Stage 8:================================> (297 + 2) / 500]
The RDD's being processed are repartitioned on more than 100 partitions. So that shouldn't be an issue.
I have an observations though. I've seen the pattern that most of the time it happens, the data locality in SparkUI shows NODE_LOCAL, while other times when all 50 processes are running, some of the processes show RACK_LOCAL.
This makes me doubt that, maybe this happens because the data is cached before processing in the same node to avoid network overhead, and this slows down the further processing.
If this is the case, what's the way to avoid it. And if this isn't the case, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, the good point to start would be to check if the Spark instance is configured fine. There is a great cloudera blog post about it.
However, if the problem isn't with configuration (as was the case with me), then the problem is somewhere within your code. The issue is that sometimes due to different reasons (skewed joins, uneven partitions in data sources etc) the RDD you are working on gets a lot of data on 2-3 partitions and the rest of the partitions have very few data.
In order to reduce the data shuffle across the network, Spark tries that each executor processes the data residing locally on that node. So, 2-3 executors are working for a long time, and the rest of the executors are just done with the data in few milliseconds. That's why I was experiencing the issue I described in the question above.
The way to debug this problem is to first of all check the partition sizes of your RDD. If one or few partitions are very big in comparison to others, then the next step would be to find the records in the large partitions, so that you could know, especially in the case of skewed joins, that what key is getting skewed. I've wrote a small function to debug this:
from itertools import islice
def check_skewness(df):
sampled_rdd = df.sample(False,0.01).rdd.cache() # Taking just 1% sample for fast processing
l = sampled_rdd.mapPartitionsWithIndex(lambda x,it: [(x,sum(1 for _ in it))]).collect()
max_part = max(l,key=lambda item:item[1])
min_part = min(l,key=lambda item:item[1])
if max_part[1]/min_part[1] > 5: #if difference is greater than 5 times
print 'Partitions Skewed: Largest Partition',max_part,'Smallest Partition',min_part,'\nSample Content of the largest Partition: \n'
print (sampled_rdd.mapPartitionsWithIndex(lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
else:
print 'No Skewness: Largest Partition',max_part,'Smallest Partition',min_part
It gives me the smallest and largest partition size, and if the difference between these two is more than 5 times, it prints 5 elements of the largest partition, to should give you a rough idea on what's going on.
Once you have figured out that the problem is skewed partition, you can find a way to get rid of that skewed key, or you can re-partition your dataframe, which will force it to get equally distributed, and you'll see now all the executors will be working for equal time and you'll see far less dreaded OOM errors and processing will be significantly fast too.
These are just my two cents as a Spark novice, I hope Spark experts can add some more to this issue, as I think a lot of newbies in Spark world face similar kind of problems far too often.

Spark: Cut down no. of output files

I wrote a Spark program that mimics functionality of an existing Map Reduce job. The MR job takes about 50 minutes every day, but the Spark job took only 9 minutes! That’s great!
When I looked at the output directory, I noticed that it created 1,020 part files. The MR job uses only 20 reducers so it creates only 20 files. We need to cut down on # of output files; otherwise our Namespace would be full in no time.
I am trying to figure out how I can reduce the number of output files under Spark. Seems like 1,020 tasks are getting triggered and each one creates a part file. Is this correct? Do I have to change the level of parallelism to cut down no. of tasks thereby reducing no. of output files? If so how do I set it? I am afraid cutting down no. of tasks will slow down this process – but I can test that!
Cutting down the number of reduce tasks will slow down the process for sure. However, it still should be considerably faster than Hadoop MapReduce for your use case.
In my opinion, the best method to limit the number of output files is using the coalesce(numPartitions) transformation. Below is an example:
JavaSparkContext ctx = new JavaSparkContext(/*your configuration*/);
JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");
//Consider we have 1020 partitions and thus 1020 map tasks
JavaRDD<String> mappedData = myData.map( your map function );
//Consider we need 20 output files
JavaRDD<String> newData = mappedData.coalesce(20)
newData.saveAsTextFile("output path");
In this example, the map function would be executed by 1020 tasks, which would not be altered in any way. However, after having coalesced the partitions, there should only be 20 partitions to work with. In that case, 20 output files would be saved at the end of the program.
As mentioned earlier, take into account that this method will be slower than having 1020 output files. The data needs to be stored into few partitions (from 1020 to 20).
Note: please take a look to the repartition command on the following link too.

Resources