I've been reading the Spark source code, but I still can't understand how Spark Standalone implements resource isolation and allocation. For example, Mesos uses LXC or Docker containers for resource limiting. How does Spark Standalone implement this? For example, if I run 10 threads in one executor but Spark only gave the executor one core, how does Spark guarantee those 10 threads run on only one CPU core?
After running the following test code, it turns out that Spark Standalone resource allocation is, in a sense, fake. I had just one Worker (executor) and gave the executor only one core (the machine has 6 cores in total), yet while the code was running I saw 5 cores at 100% usage. (My code kicked off 4 threads.)
import org.apache.spark.{SparkConf, SparkContext}

object CoreTest {
  // Each thread spins forever, so it should keep one CPU core at 100%.
  class MyThread extends Thread {
    override def run(): Unit = {
      while (true) {
        val i = 1 + 1
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("core test")
    val sc = new SparkContext(conf)
    val memRDD = sc.parallelize(Seq(1), 1)
    memRDD.foreachPartition { part =>
      part.foreach { x =>
        // kick off 4 busy threads inside the single task
        var hello = new MyThread()
        hello.start()
        hello = new MyThread()
        hello.start()
        hello = new MyThread()
        hello.start()
        hello = new MyThread()
        hello.start()
        // keep the task itself alive
        while (true) {
          val j = 1 + 2
          Thread.sleep(1000)
        }
      }
    }
    sc.stop()
  }
}
Follow-up question: I'm curious what would happen if I ran the above code on Spark + Mesos. Would Mesos limit the 4 threads to run on only one core?
But I still can't understand how Spark Standalone implements resource isolation and allocation.
With Spark, we have the notion of a Master node and Worker nodes. We can think of the latter as a resource pool: each Worker contributes CPU and RAM to the pool, and Spark jobs can use the resources in that pool to do their computation.
Spark Standalone has the notion of an Executor, which is the process that handles the computation and to which we give resources from the resource pool. In any given executor we run different stages of a computation, each composed of different tasks. Now, we can control the amount of compute (cores) a given task uses (via the spark.task.cpus configuration parameter), and we can also control the overall amount of compute a given job may have (via spark.cores.max, which tells the cluster manager how many resources in total we want to give to the particular job we're running). Note that Standalone is greedy by default and will schedule an executor on every Worker node in the cluster. We can get finer-grained control over how many actual Executors we have by using Dynamic Allocation.
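For concreteness, here is a minimal sketch of those two settings; the values are illustrative only, and they affect scheduling bookkeeping, not OS-level isolation:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: cap the application at 4 cores total and
// "reserve" 2 cores per task when the scheduler assigns work.
val conf = new SparkConf()
  .setAppName("allocation demo")
  .set("spark.cores.max", "4")
  .set("spark.task.cpus", "2")
val sc = new SparkContext(conf)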
For example, if I run 10 threads in one executor but Spark only gave the executor one core, how does Spark guarantee those 10 threads run on only one CPU core?
Spark doesn't verify that the execution happens on only a single core; Spark doesn't even know which CPU cycles it will get from the underlying operating system. What Spark Standalone attempts to do is resource management: it tells you, "Look, you have X CPUs and Y amount of RAM. I will not let you schedule jobs if you don't partition your resources properly."
Spark Standalone handles only resource allocation, which is a simple task. All that is required is keeping tabs on:
available resources.
assigned resources.
It doesn't take care of resource isolation. YARN and Mesos, which have broader scope, don't implement resource isolation but depend on Linux Control Groups (cgroups).
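To make the bookkeeping point concrete, here is a toy sketch (not Spark's actual code) of what allocation-without-isolation amounts to: the scheduler only tracks counters and rejects requests that would exceed them; nothing pins threads to physical cores.

// Toy sketch only, not Spark's implementation. Allocation is just arithmetic
// on counters; a launched executor can still spawn more threads than it was "given".
class ToyAllocator(totalCores: Int, totalMemMb: Long) {
  private var usedCores = 0
  private var usedMemMb = 0L

  def tryAllocate(cores: Int, memMb: Long): Boolean = synchronized {
    if (usedCores + cores <= totalCores && usedMemMb + memMb <= totalMemMb) {
      usedCores += cores
      usedMemMb += memMb
      true    // request fits: the executor may be launched
    } else {
      false   // request rejected: not enough "available resources" on the books
    }
  }

  def release(cores: Int, memMb: Long): Unit = synchronized {
    usedCores -= cores
    usedMemMb -= memMb
  }
}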
Related
In local mode, I am submitting 10 concurrent jobs with a ThreadPoolExecutor.
If I only set SparkConf sparkConf = new SparkConf().setAppName("Hello Spark - WordCount").setMaster("local[*]").set("spark.scheduler.mode","FAIR"); then the 10 jobs execute in parallel, but they do not get the same number of cores.
But if I add them to a pool and set the scheduling mode of the pool to FAIR, they get almost the same number of cores. May I know what could be the reason for this?
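For context, assigning jobs to a named pool is typically done by setting a thread-local property before submitting work from that thread. A minimal sketch in Scala follows; the pool name and the allocation-file path are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("fair-pools")
  .set("spark.scheduler.mode", "FAIR")
  // optionally point to a pool definition file (path is illustrative):
  // .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

val worker = new Thread(() => {
  // every job submitted from this thread goes to the named pool
  sc.setLocalProperty("spark.scheduler.pool", "fairPool")
  sc.parallelize(1 to 1000000).count()
})
worker.start()
worker.join()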
I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 cores in total, and the SparkContext is available as
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], the driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration, which of the following is true?
1 worker instance, 1 executor having 16 cores/threads
1 worker instance, 16 executors each having 1 core
For a particular query, sparkMeasure reports the shuffle data as follows:
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
An executor is a JVM process. When you use local[*], you run Spark locally with as many worker threads as there are logical cores on your machine, so: 1 executor and as many worker threads as logical cores. When you configure SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://localhost:7077 to bring up a standalone Spark cluster on your local machine, you have one master and 5 workers; if you want to send your application to this cluster, you must configure it like SparkSession.builder().appName("app").master("spark://localhost:7077"), and in this case you can't specify [*] or [2], for example. But when you specify the master as local[*], a single JVM process is created, the master and all workers live inside that JVM process, and after your application finishes that JVM instance is destroyed. local[*] and spark://localhost:7077 are two separate things.
Workers do their job using tasks, and each task is actually a thread, i.e. task = thread. Workers have memory, and they assign a memory partition to each task so it can do its work, such as reading a part of a dataset into its own memory partition or transforming the data it has read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job runs in a cluster or locally. In a cluster it is possible that the two tasks are on different machines, so network transmission is added on top of the other work, such as writing the result and then having it read by another task. Locally, if task B needs the data in task A's partition, task A still has to write it down and task B then reads it to do its job.
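To make the distinction concrete, a minimal sketch of the two master URLs discussed above; it assumes a standalone master is actually running on the default port if you pick the second one:

import org.apache.spark.sql.SparkSession

// Pick one of the two masters below: they are different deployment modes.
val spark = SparkSession.builder()
  .appName("app")
  .master("local[*]")                   // (a) everything in this one JVM; "local[2]" would use exactly 2 threads
  // .master("spark://localhost:7077")  // (b) standalone cluster started with start-master.sh / start-slave.sh
  .getOrCreate()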
Local mode is the same as a non-distributed, single-JVM deployment mode.
Q1: It is neither. In this mode Spark spawns all execution components, namely the Driver, N threads for data processing, and the Master, in a single JVM. If I had to map it to one of your two options I would say 1 worker instance with 16 executors of 1 core each, but as said, that is not the right way to look at it. (The other option would be N Workers with M Executors of 1 core each, where N x M = 16.) The default parallelism is the number of threads specified in the master URL, local[*].
Q2: The threads service partitions concurrently, one at a time, as many as needed, sequentially within the current Stage, assigned by the Driver when free. A Stage is a boundary that causes shuffling, regardless of whether you run in a YARN cluster or locally.
Shuffling, then, is what? A shuffle occurs when data needs to be re-arranged across the existing partitions, e.g. by a groupBy or orderBy: we may have M partitions before the groupBy and N partitions after it. This is the wide-transformation concept at the core of Spark's parallel processing, so even with local[*] it applies.
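As a small illustration of that last point, a groupBy is a wide transformation and triggers a shuffle even with master = local[*]; a sketch:

import org.apache.spark.sql.SparkSession

// Minimal sketch: a wide transformation shuffles data between partitions
// even though everything runs inside one JVM with local[*].
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("shuffle-demo")
  .getOrCreate()
import spark.implicits._

val df = (1 to 1000000).toDF("n")
val grouped = df.groupBy($"n" % 10).count()   // wide transformation, so a shuffle is required
grouped.explain()                             // the physical plan shows an Exchange (shuffle) node
grouped.show()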
I'm running a Spark batch job on AWS Fargate in standalone mode. The compute environment has 8 vCPUs, and the job definition has 1 vCPU and 2048 MB of memory. In the Spark application I can specify how many cores I want to use, which I do with the code below:
from pyspark.sql import SparkSession

sparkSess = SparkSession.builder.master("local[8]")\
    .appName("test app")\
    .config("spark.debug.maxToStringFields", "1000")\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .getOrCreate()
local[8] specifies 8 cores/threads (that's what I'm assuming).
Initially I was running the Spark app without specifying cores, and I think the job ran in a single thread and took around 10 minutes to complete; with this number the processing time goes down. I started with 2 and it dropped to almost 5 minutes, then I changed it to 4 and then 8, and now it takes almost 4 minutes. But I don't understand the relationship between vCPUs and Spark threads. Whatever number I specify for cores, sparkContext.defaultParallelism shows me that value.
Is this the correct way? Is there any relation between this number and the vCPUs that I specify in the job definition or compute environment?
You are running in Spark Local Mode. Learning Spark has this to say about Local mode:
Spark driver runs on a single JVM, like a laptop or single node
Spark executor runs on the same JVM as the driver
Cluster manager runs on the same host
Damji, Jules S.,Wenig, Brooke,Das, Tathagata,Lee, Denny. Learning Spark (p. 30). O'Reilly Media. Kindle Edition.
local[N] launches with N threads. Given the above definition of Local Mode, those N threads must be shared by the Local Mode Driver, Executor and Cluster Manager.
As such, from the available vCPUs, allotting one vCPU for the Driver thread, one for the Cluster Manager, one for the OS, and the rest for the Executor seems reasonable.
The optimal number of threads/vCPUs for the Executor will depend on the number of partitions your data has.
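As a rough sketch of that rule of thumb (the reservation of three threads is just the heuristic described above, not a Spark requirement):

import org.apache.spark.sql.SparkSession

// Leave a few threads for the driver, the in-process "cluster manager" and the OS,
// and give the rest to executor tasks. The split is an assumption, not a Spark rule.
val totalCores      = Runtime.getRuntime.availableProcessors()
val reserved        = 3
val executorThreads = math.max(1, totalCores - reserved)

val spark = SparkSession.builder()
  .appName("sized local mode")
  .master(s"local[$executorThreads]")
  .getOrCreate()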
I have a Spark application that picks a subset of the data and does some operations on that subset. There is no dependency or interaction between the subsets and their operations, so I tried to use multiple threads to let them run in parallel and improve performance. The code looks like this:
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.google.common.collect.Lists;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.length;

Dataset<Row> fullData = sparkSession.read().json("some_path");

ExecutorService executor = Executors.newFixedThreadPool(10);
List<Runnable> tasks = Lists.newArrayList();
for (int i = 1; i <= 50; i++) {
    final int x = i;
    tasks.add(() -> {
        Dataset<Row> subset_1 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> subset_2 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> result = subset_1.join(subset_2, ...);
        log.info("Res size is " + result.count()); // force Spark to execute the join
    });
}

CompletableFuture<?>[] futures = tasks.stream()
    .map(task -> CompletableFuture.runAsync(task, executor))
    .toArray(CompletableFuture[]::new);
CompletableFuture.allOf(futures).join();
executor.shutdown();
From the Spark job management UI, I noticed those 50 tasks are submitted in parallel, but the processing still happens in a blocking way: one task only starts running once another task completes. How can I make the multiple tasks run in parallel instead of one after another?
This is not how you control parallelism in Spark. It's all controlled declaratively via configuration.
Spark is a distributed computing framework, and it's meant to be used in a distributed environment where each worker runs single-threaded. Usually tasks are scheduled using YARN, which has metadata about the nodes and may start multiple tasks on a single node (depending on memory and CPU constraints), but in separate JVMs.
In local mode you can have multiple workers realized as separate threads, so if you say master("local[8]") you will get 8 workers, each running as a thread in a single JVM.
How are you running your application?
Imagine that we have 3 customers and we want to do the same work for each of them in parallel.
def doSparkJob(customerId: String) = {
  spark
    .read.json(s"$customerId/file.json")
    .map(...)
    .reduceByKey(...)
    .write
    .partitionBy("id")
    .parquet("output/")
}
We do it concurrently like this (from the Spark driver):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val jobs: Future[(Unit, Unit, Unit)] = for {
  f1 <- Future { doSparkJob("customer1") }
  f2 <- Future { doSparkJob("customer2") }
  f3 <- Future { doSparkJob("customer3") }
} yield (f1, f2, f3)

Await.ready(jobs, 5.hours)
Do I understand correctly that this is a bad approach? Many Spark jobs will push each other's data out of the executors, and there will be a lot of spilling to disk. How will Spark manage executing tasks from parallel jobs? How does the shuffle behave when we have 3 concurrent jobs from one driver and only 3 executors with one core each?
I guess a good approach should look like this:
We read all the data together for all customers, groupByKey by customer, and do what we want to do.
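A sketch of what that single-job shape could look like, assuming all customers' files share a schema and the customer id can be derived, here from the input file name; the paths and column names are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().getOrCreate()

val all = spark.read
  .json("customer*/file.json")                    // read every customer's file in one job (glob is an assumption)
  .withColumn("customerId", input_file_name())    // derive the customer key, here from the file name (assumption)

all.groupBy("customerId")                         // per-customer work happens in a single shuffle
  .count()                                        // stand-in for the real per-customer aggregation
  .write
  .partitionBy("customerId")
  .parquet("output/")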
Do I understand correctly that this is a bad approach?
Not necessarily. A lot depends on the context, and Spark implements its own set of AsyncRDDActions to address scenarios like this one (though there is no Dataset equivalent).
In the simplest scenario, with static allocation, it is quite likely that Spark will just schedule all jobs sequentially, due to lack of resources. Unless configured otherwise, this is the most probable outcome with the described configuration. Please keep in mind that Spark can use in-application scheduling with the FAIR scheduler to share limited resources between multiple concurrent jobs; see Scheduling Within an Application.
If the amount of resources is sufficient to start multiple jobs at the same time, there can be competition between the individual jobs, especially for IO- and memory-intensive workloads. If all jobs use the same external resources (especially databases), it is possible that Spark will cause throttling and subsequent failures or timeouts. A less severe effect of running multiple jobs is increased cache eviction.
Overall, there are multiple factors to consider when choosing between sequential and concurrent execution, including but not limited to the available resources (the Spark cluster and external services), the choice of API (RDDs tend to be greedier than SQL and therefore require some low-level management), and the choice of operators. Even if jobs run sequentially, you may still decide to use asynchronous submission to improve driver utilization and reduce latency. This is particularly useful with Spark SQL and complex execution plans (a common bottleneck in Spark SQL): this way Spark can crunch new execution plans while other jobs are executing.
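For reference, a minimal sketch of the two mechanisms mentioned in this answer, in-application FAIR scheduling and asynchronous RDD actions; the values are illustrative:

import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("concurrent jobs")
  .set("spark.scheduler.mode", "FAIR")   // in-application FAIR scheduling
val sc = new SparkContext(conf)

// countAsync (an AsyncRDDAction) submits the job and returns a future immediately,
// so the driver can keep preparing the next job while this one runs.
val f1 = sc.parallelize(1 to 1000000).countAsync()
val f2 = sc.parallelize(1 to 1000000).countAsync()

println(Await.result(f1, Duration.Inf))
println(Await.result(f2, Duration.Inf))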