Run multiple Spark operations in parallel - multithreading

I have a Spark application that picks a subset and does some operations on it. There are no dependencies or interactions between the subsets and their operations, so I tried to use multiple threads to let them run in parallel and improve performance. The code looks like this:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.length;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import com.google.common.collect.Lists;

Dataset<Row> fullData = sparkSession.read().json("some_path");

ExecutorService executor = Executors.newFixedThreadPool(10);
List<Runnable> tasks = Lists.newArrayList();
for (int i = 1; i <= 50; i++) {
    final int x = i;
    tasks.add(() -> {
        Dataset<Row> subset_1 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> subset_2 = fullData.filter(length(col("name")).equalTo(x));
        Dataset<Row> result = subset_1.join(subset_2, ...);
        log.info("Res size is " + result.count()); // force Spark to do the join
    });
}

CompletableFuture<?>[] futures = tasks.stream()
    .map(task -> CompletableFuture.runAsync(task, executor))
    .toArray(CompletableFuture[]::new);
CompletableFuture.allOf(futures).join();
executor.shutdown();
From the Spark job management UI, I noticed those 50 tasks are submitted in parallel, but the processing is still blocking: one task does not start running until another completes. How can I make the multiple tasks run in parallel instead of one after another?

This is not how you control parallelism in Spark. It's all controlled declaratively via configuration.
Spark is a distributed computing framework meant to be used in a distributed environment where each worker runs single-threaded. Tasks are usually scheduled by YARN, which has metadata about the nodes and may start multiple tasks on a single node (depending on memory and CPU constraints), but in separate JVMs.
In local mode you can have multiple workers realized as separate threads, so if you say master("local[8]") you get 8 workers, each running as a thread in a single JVM.
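For example, a minimal Scala sketch of that local-mode setup (the thread count of 8 is just illustrative, and a SparkSession is assumed because the snippet in the question uses one):
import org.apache.spark.sql.SparkSession

// "local[8]" runs the driver plus 8 worker threads inside one JVM; parallelism
// then comes from Spark splitting each job into tasks across those threads,
// not from submitting jobs on multiple application threads.
val spark = SparkSession.builder()
  .appName("local-parallelism-example")
  .master("local[8]")
  .getOrCreate()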
How are you running your application?

Related

SparkSQL Number of Tasks

I have a Spark Standalone cluster (which consists of two Workers with 2 cores each). I run an SQL query which joins 2 DataFrames and shows the result. I have some questions regarding the simple example below.
val df1 = sc.read.text(fn1).toDF()
val df2 = sc.read.text(fn2).toDF()
df1.createOrReplaceTempView("v1")
df2.createOrReplaceTempView("v2")
val df_join = sc.sql("SELECT * FROM v1,v2 WHERE v1.value=v2.value AND v2.value<1500").show()
DAG Scheduler - Number of Tasks
From what I've understood so far, when I spark-submit the application, a SparkContext is spawned to handle the job (where the job is the printing of the result rows). The SparkContext creates a TaskScheduler instance, which then creates a DAGScheduler. Through a simple event mechanism, the DAGScheduler handles the job for execution (the handleJobSubmitted function in the code). The SparkSQL query is transformed into a physical execution plan (through the Catalyst optimizer), and then into an RDD graph (with the toRdd function). The DAGScheduler receives the RDD graph and recursively creates all the stages.
I do not understand how it finds the number of tasks (before the execution of any stage) in the last stage, keeping in mind that the result stage is the one that performs the join (and prints the results). The amount of data (and the RDDs and the number of their partitions, which define the number of tasks) is unknown until the parent stages have finished their execution.
Parallel Execution of Stages
Each of the first two stages is independent of the other, as it loads data from a different file. I have read many posts saying that stages with no dependencies between them MAY be executed in parallel by the cluster. What is the condition under which the tasks of independent stages are executed in parallel?
Task Dependencies
Finally, I've read that the TaskScheduler does not know about stage dependencies. If I keep in mind that each stage in Spark is a TaskSet (i.e. a set of non-dependent tasks, each with the same functionality packed up with a different partition of data), then the TaskScheduler does not know the dependencies between tasks of different stages either. As a result, how and when does a task know the data on which it'll execute its function?
If, for example, the task knows a priori where to look for its input data, then it could be launched as soon as that data becomes available.

Building a thread pool in a Spark Streaming program

To avoid delays and speed up processing, I build a thread pool in Spark Streaming. The main program is listed as follows:
stream.foreachRDD { rdd =>
  rdd.foreachPartition { rddPartition =>
    val client: Client = ESClient.getInstance.getClient
    var num = Random.nextInt()
    val threadPool: ExecutorService = Executors.newFixedThreadPool(5)
    val confs = new Configuration()
    rddPartition.foreach { x =>
      threadPool.execute(new esThread(x._2, num, client, confs))
    }
  }
}
The esThread first queries Elasticsearch, gets the query result, and finally writes the result to HDFS. But we find that the result file in HDFS is missing a lot of data; only a little is left. I wonder whether we can build a thread pool in Spark Streaming like this. Does the thread pool in Spark Streaming cause data to go missing?
Thanks for your help.
Partitions are processed by separate threads already, and the stream won't proceed to the next batch until the previous one has finished. So this is not likely to buy you anything, and it makes resource usage tracking less transparent.
At the same time, as your code is implemented at the moment, you're likely to lose data. Since threadPool doesn't awaitTermination, the parent thread might exit before all data has been processed.
Overall it is not a useful approach. If you want to increase throughput, you should tune the number of partitions and the amount of computing resources.
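That said, if you do keep a per-partition pool, a minimal sketch of the missing shutdown step (reusing the esThread, Client, num and confs from the question; the 10-minute timeout is arbitrary) would look roughly like this:
import java.util.concurrent.{Executors, TimeUnit}

rdd.foreachPartition { rddPartition =>
  val client = ESClient.getInstance.getClient
  val num = Random.nextInt()
  val confs = new Configuration()
  val threadPool = Executors.newFixedThreadPool(5)
  rddPartition.foreach { x =>
    threadPool.execute(new esThread(x._2, num, client, confs))
  }
  // Wait for the queued Elasticsearch/HDFS work before the partition closure returns;
  // without this the batch can complete while writes are still pending, which is one
  // way records end up missing.
  threadPool.shutdown()
  threadPool.awaitTermination(10, TimeUnit.MINUTES)
}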

How does Spark Standalone implement resource allocation?

I've been reading the Spark source code, but I still can't understand how Spark Standalone implements resource isolation and allocation. For example, Mesos uses LXC or Docker containers for resource limitation. So how does Spark Standalone implement this? For example, if I run 10 threads in one executor but Spark only gives the executor one core, how does Spark guarantee that these 10 threads run on only one CPU core?
After running the following test code, it appears that Spark Standalone resource allocation is somewhat fake. I had just one Worker (executor) and gave the executor only one core (the machine has 6 cores in total); while the following code was running I found 5 cores at 100% usage. (My code kicks off 4 extra threads.)
object CoreTest {
  class MyThread extends Thread {
    override def run() {
      while (true) {
        val i = 1 + 1
      }
    }
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("core test")
    val sc = new SparkContext(conf)
    val memRDD = sc.parallelize(Seq(1), 1)
    memRDD.foreachPartition { part =>
      part.foreach { x =>
        var hello = new MyThread()
        hello.start
        hello = new MyThread()
        hello.start
        hello = new MyThread()
        hello.start
        hello = new MyThread()
        hello.start
        while (true) {
          val j = 1 + 2
          Thread.sleep(1000)
        }
      }
    }
    sc.stop()
  }
}
Follow-up question: I'm curious what would happen if I ran the above code on Spark + Mesos. Would Mesos limit the 4 threads to run on only one core?
but I still can't understand how Spark Standalone implements resource isolation and allocation.
With Spark, we have the notion of a Master node and Worker nodes. We can think of the latter as a resource pool. Each worker brings CPU and RAM to the pool, and Spark jobs can utilize the resources in that pool to do their computation.
Spark Standalone has the notion of an Executor, which is the process that handles the computation and to which we give resources from the resource pool. In any given executor, we run different stages of a computation, which is composed of different tasks. Now, we can control the amount of computing power (cores) a given task uses (via the spark.task.cpus configuration parameter), and we also control the total amount of computing power a given job may have (via spark.cores.max, which tells the cluster manager how many resources in total we want to give to the particular job we're running). Note that Standalone is greedy by default and will schedule an executor on every Worker node in the cluster. We can get finer-grained control over how many actual Executors we have by using Dynamic Allocation.
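As a rough sketch of those two knobs (the values are arbitrary; both are scheduling bookkeeping, not CPU isolation):
import org.apache.spark.{SparkConf, SparkContext}

// spark.cores.max - total cores this application may claim from the standalone pool
// spark.task.cpus - cores the scheduler reserves for each task slot
val conf = new SparkConf()
  .setAppName("resource-allocation-example")
  .set("spark.cores.max", "8")
  .set("spark.task.cpus", "2")
val sc = new SparkContext(conf)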
For example, if I run 10 threads in one executor but Spark only gives the executor one core, how does Spark guarantee that these 10 threads run on only one CPU core?
Spark doesn't verify that the execution only happens on a single core; Spark doesn't know which CPU cycles it'll get from the underlying operating system. What Spark Standalone does attempt is resource management: it tells you, "Look, you have X CPUs and Y amount of RAM; I will not let you schedule jobs if you don't partition your resources properly."
Spark Standalone handles only resource allocation, which is a simple task. All that is required is keeping tabs on:
available resources.
assigned resources.
It doesn't take care of resource isolation. YARN and Mesos, which have a broader scope, don't implement resource isolation themselves either, but depend on Linux Control Groups (cgroups).

How to create an RDD from within a Task?

Normally when creating an RDD from a List you can just use the SparkContext.parallelize method, but you cannot use the SparkContext from within a Task as it isn't serializable. I need to create an RDD from a list of Strings from within a task. Is there a way to do this?
I've tried creating a new SparkContext in the task, but it gives me an error about not supporting multiple Spark contexts in the same JVM and says I need to set spark.driver.allowMultipleContexts = true. According to the Apache user group, however, that setting does not yet seem to be supported.
As far as I am concerned it is not possible, and it is hardly a matter of serialization or support for multiple Spark contexts. The fundamental limitation is the core Spark architecture. Since the SparkContext is maintained by the driver and tasks are executed on the workers, creating an RDD from inside a task would require pushing changes from the workers to the driver. I am not saying it is technically impossible, but the whole idea seems rather cumbersome.
Creating a SparkContext from inside tasks looks even worse. First of all, it would mean that the context is created on the workers, which for all practical purposes don't communicate with each other. Each worker would get its own context, which could operate only on data accessible on that worker. Finally, preserving worker state is definitely not part of the contract, so any context created inside a task should simply be garbage collected after the task finishes.
If handling the problem using multiple jobs is not an option, you can try to use mapPartitions like this:
val rdd = sc.parallelize(1 to 100)

val tmp = rdd.mapPartitions(iter => {
  val results = Map(
    "odd" -> scala.collection.mutable.ArrayBuffer.empty[Int],
    "even" -> scala.collection.mutable.ArrayBuffer.empty[Int]
  )

  for (i <- iter) {
    if (i % 2 != 0) results("odd") += i
    else results("even") += i
  }

  Iterator(results)
})

val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))
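One caveat with this sketch: tmp is evaluated lazily, so each of the two derived RDDs (odd and even) will recompute the mapPartitions pass when it is evaluated, unless the intermediate RDD is cached first, e.g.:
tmp.cache() // materialize the per-partition maps once and reuse them for both flatMaps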

What is a task in Spark? How does the Spark worker execute the jar file?

After reading the documentation at http://spark.apache.org/docs/0.8.0/cluster-overview.html, I have some questions that I want to clarify.
Take this example from Spark:
JavaSparkContext spark = new JavaSparkContext(
    new SparkConf().setJars("...").setSparkHome....);
JavaRDD<String> file = spark.textFile("hdfs://...");

// step1
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});

// step2
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

// step3
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) {
        return a + b;
    }
});

counts.saveAsTextFile("hdfs://...");
So let's say I have a 3-node cluster, with node 1 running as the master, and the above driver program has been properly jarred (say application-test.jar). So now I'm running this code on the master node, and I believe that right after the SparkContext is created, the application-test.jar file will be copied to the worker nodes (and each worker will create a directory for that application).
So now my question:
Are step1, step2 and step3 in the example tasks that get sent over to the workers? If yes, then how does the worker execute them? Like java -cp "application-test.jar" step1 and so on?
When you create the SparkContext, each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down.
RDDs are sort of like big arrays that are split into partitions, and each executor can hold some of these partitions.
A task is a command sent from the driver to an executor by serializing your Function object. The executor deserializes the command (this is possible because it has loaded your jar), and executes it on a partition.
(This is a conceptual overview. I am glossing over some details, but I hope it is helpful.)
To answer your specific question: No, a new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.
To get a clear insight into how tasks are created and scheduled, we must understand how the execution model works in Spark. In short, an application in Spark is executed in three steps:
Create the RDD graph
Create an execution plan according to the RDD graph; stages are created in this step
Generate tasks based on the plan and schedule them across workers
In your word-count example, the RDD graph is rather simple; it's something like this:
file -> lines -> words -> per-word count -> global word count -> output
Based on this graph, two stages are created. The stage creation rule is based on the idea of pipelining as many narrow transformations as possible. In your example, the narrow transformations finish at the per-word count. Therefore, you get two stages:
file -> lines -> words -> per-word count
global word count -> output
Once the stages are figured out, Spark will generate tasks from the stages. The first stage will create ShuffleMapTasks and the last stage will create ResultTasks, because the last stage includes an action operation that produces results.
The number of tasks to be generated depends on how your files are distributed. Suppose you have three different files on three different nodes; then the first stage will generate 3 tasks: one task per partition.
Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is related to a partition.
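To make the stage split concrete, here is a rough Scala sketch of the same pipeline (paths elided as in the question), annotated with where the shuffle, and therefore the second stage, begins:
val file = sc.textFile("hdfs://...")       // file -> lines
val words = file.flatMap(_.split(" "))     // lines -> words            | stage 1:
val pairs = words.map(word => (word, 1))   // words -> per-word count   | ShuffleMapTasks
val counts = pairs.reduceByKey(_ + _)      // global word count         | stage 2:
counts.saveAsTextFile("hdfs://...")        // output                    | ResultTasks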
Usually, the number of tasks run for a stage is exactly the number of partitions of the final RDD, but since RDDs can be shared (and hence ShuffleMapStages can be shared too), their number varies depending on the RDD/stage sharing. Please refer to How DAG works under the covers in RDD?
