How to make two Spark RDDs run in parallel - apache-spark

For example, I created two RDDs in my code as follows:
val rdd1 = sc.esRDD("userIndex1/type1")
val rdd2 = sc.esRDD("userIndex2/type2")
val rdd3 = rdd1.join(rdd2)
rdd3.foreachPartition { .... }
I found that they were executed serially. Why doesn't Spark run them in parallel?
The reason for my question is that the network is very slow: generating rdd1 takes 1 hour and generating rdd2 takes 1 hour as well. So I asked why Spark doesn't generate the two RDDs at the same time.

Spark provides asynchronous actions to run jobs asynchronously, so they may help in your use case to run the computations in parallel and concurrently. At a time only one RDD will be computed in the Spark cluster, but you can make the jobs asynchronous. You can check the Java docs for this API here: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/rdd/AsyncRDDActions.html
There is also a blog post about it; check it out here: https://blog.knoldus.com/2015/10/21/demystifying-asynchronous-actions-in-spark/
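For instance, here is a minimal sketch of that idea applied to the question's code. The cache() calls and the use of countAsync as a materializing action are my additions, and the esRDD calls assume the elasticsearch-hadoop import from the question:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

val rdd1 = sc.esRDD("userIndex1/type1").cache()   // cache so the work done by the async jobs is reused by the join
val rdd2 = sc.esRDD("userIndex2/type2").cache()
val job1 = rdd1.countAsync()                      // submits its own Spark job right away
val job2 = rdd2.countAsync()                      // second job is submitted without waiting for the first
Await.result(job1, Duration.Inf)                  // the two jobs can overlap if the cluster has free slots
Await.result(job2, Duration.Inf)
val rdd3 = rdd1.join(rdd2)                        // the join now runs over the cached data; then foreachPartition as in the question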

I have found similar behavior. Running the RDDs serially or in parallel doesn't make any difference by itself; what matters is the number of executors and executor cores you set in your spark-submit.
Let's say we have 2 RDDs as you mentioned above, and each RDD takes 1 hour with 1 executor and 1 core. With 1 executor and 1 core (the Spark config), we cannot increase performance even if Spark runs both RDDs in parallel, unless we increase the executors and cores.
So, running two RDDs in parallel is not, by itself, going to increase performance.
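As a hedged illustration of that point: what actually creates room for overlap is the number of task slots. The property names below are standard Spark configuration keys (executor instances apply on YARN; the equivalent spark-submit flags are --num-executors and --executor-cores), and the values are only examples:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallel-es-rdds")          // app name is illustrative
  .set("spark.executor.instances", "4")    // 4 executors...
  .set("spark.executor.cores", "4")        // ...with 4 cores each = 16 task slots
val sc = new SparkContext(conf)            // two jobs of ~8 tasks each could now genuinely overlap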

Related

How does Spark Streaming schedule map tasks between driver and executor?

I use Apache Spark 2.1 and Apache Kafka 0.9.
I have a Spark Streaming application that runs with 20 executors and reads from a Kafka topic that has 20 partitions. This Spark application does map and flatMap operations only.
Here is what the Spark application does:
Create a direct stream from kafka with interval of 15 seconds
Perform data validations
Execute transformations using Drools, which are map-only; no reduce transformations
Write to HBase using check-and-put
I wonder: if executors and partitions are mapped 1-1, will every executor independently perform the above steps and write to HBase independently, or will data be shuffled between multiple executors, with operations happening between the driver and executors?
Spark jobs submit tasks that can only be executed on executors. In other words, executors are the only place where tasks can be executed. The driver's role is to coordinate the tasks and schedule them accordingly.
With that said, I'd say the following is true:
will every executor independently perform above steps and write to HBase independently
By the way, the answer does not depend on which Spark version is in use. It has always been like this (and I don't see any reason why it would or even should change).
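To make that concrete, here is a minimal sketch of the question's pipeline, assuming a DStream[String] named stream has already been created from Kafka; validate, applyRules, rowKey, family, qualifier, expectedValue and toPut are hypothetical stand-ins for the validation, Drools and HBase details, and the table name is illustrative. Because filter/map/flatMap are narrow, every step runs on the executor that owns the Kafka partition, with no shuffle and no data flowing back through the driver:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory

stream
  .filter(r => validate(r))                 // data validation (hypothetical helper)
  .flatMap(r => applyRules(r))              // map-only Drools transformations (hypothetical helper)
  .foreachRDD { rdd =>
    rdd.foreachPartition { records =>       // runs on an executor, one task per Kafka partition
      val conn = ConnectionFactory.createConnection()           // one HBase connection per partition
      val table = conn.getTable(TableName.valueOf("my_table"))  // table name is illustrative
      records.foreach { r =>
        // check-and-put: write only if the expected cell value matches (HBase 1.x Table API)
        table.checkAndPut(rowKey(r), family, qualifier, expectedValue(r), toPut(r))
      }
      table.close()
      conn.close()
    }
  }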

Why does a single core of a Spark worker complete each task faster than the rest of the cores in the other workers?

I have three nodes in a cluster, each with a single active core. I mean, I have 3 cores in the cluster.
Assuming that all partitions have almost the same number of records, why does a single core of a worker complete each task faster than the rest of the cores in the other workers?
Please observe this screenshot. The timeline shows that the latency of the worker core (x.x.x.230) is notably shorter than the latencies of the other two worker cores (x.x.x.210 and x.x.x.220).
This means that the workers x.x.x.210 and x.x.x.220 are doing the same job in a longer time compared to the worker x.x.x.230. This also happens when all the available cores in the cluster are used, but the delay is not so critical then.
I submitted this application again. Look at this new screenshot. Now the fastest worker is the x.x.x.210. Observe that tasks 0, 1 and 2 process partitions with almost the same number of records. This execution time discrepancy is not good, is it?
I don't understand!!!
What I'm really doing is creating a DataFrame and doing a mapping operation to get a new DataFrame, saving the result in a Parquet file.
val input: DataFrame = spark.read.parquet(...)
val result = input.map(row => /* ...operations... */)   // map on a DataFrame returns a Dataset
result.write.parquet(...)
Any idea why this happens? Is that how Spark operates normally?
Thanks in advance.

What is the relationship between tasks and partitions?

Can I say the following?
1. Is the number of Spark tasks equal to the number of Spark partitions?
2. Is one run of the executor (a batch inside the executor) equal to one task?
3. Does every task produce only one partition?
4. Is the number of Spark tasks equal to the number of Spark partitions? (duplicate of 1.)
The degree of parallelism, or the number of tasks that can be run concurrently, is set by:
The number of Executor Instances (configuration)
The Number of Cores per Executor (configuration)
The Number of Partitions being used (coded)
Actual parallelism is the smaller of
executors * cores - which gives the amount of slots available to run tasks
partitions - each partition will translate to a task whenever a slot opens up.
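A small sketch of that min(), assuming the two configuration keys were set explicitly at submit time and rdd is whatever RDD the stage operates on (the fallback values are only placeholders):
val executors = sc.getConf.getInt("spark.executor.instances", 1)   // fallback is a placeholder
val coresEach = sc.getConf.getInt("spark.executor.cores", 1)
val slots = executors * coresEach                                   // how many tasks can run at once
val actualParallelism = math.min(slots, rdd.getNumPartitions)       // tasks in the stage = partitions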
Tasks that run on the same executor will share the same JVM. This is used by the Broadcast feature as you only need one copy of the Broadcast data per Executor for all tasks to be able to access it through shared memory.
You can have multiple executors running, on the same machine, or on different machines. Executors are the true means of scalability.
Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².
So:
1. Is the number of Spark tasks equal to the number of Spark partitions?
No (see above).
2. Is one run of the executor (a batch inside the executor) equal to one task?
An Executor is started as an environment for tasks to run in. Multiple tasks will run concurrently within that Executor (multiple threads).
3. Does every task produce only one partition?
For a task, it is one partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks.
4. Is the number of Spark tasks equal to the number of Spark partitions? (duplicate of 1.)
Same as (1.).
(¹) The assumption is that within your tasks you are not multithreading yourself (never do that, otherwise the core estimate will be off).
(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.
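To see the one-partition-per-task relationship directly, a small sketch (the data and the 8 partitions are illustrative):
val data = sc.parallelize(1 to 1000, numSlices = 8)      // 8 partitions -> 8 tasks in this stage
val perTask = data.mapPartitionsWithIndex { (partitionId, rows) =>
  Iterator((partitionId, rows.size))                     // each task sees exactly one partition
}.collect()                                              // e.g. Array((0,125), (1,125), ...)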
Partitions are a feature of RDD and are only available at design time (before an action is called).
Tasks are part of TaskSet per Stage per ActiveJob in a Spark application.
1. Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
2. Is one run of the executor (a batch inside the executor) equal to one task?
That recursively uses "executor" and does not make much sense to me.
3. Does every task produce only one partition?
Almost.
Every task produces the output of executing the code (that it was created for) over the data in one partition.
4. Is the number of Spark tasks equal to the number of Spark partitions? (duplicate of 1.)
Almost.
The number of Spark tasks in a single stage equals the number of RDD partitions.
1. Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster. The default partition size is 128 MB. Partitions allow every executor to perform work in parallel. One partition will have a parallelism of only one, even if you have many executors.
Many partitions with only one executor will still give you a parallelism of only one. You need to balance the number of executors and partitions to get the desired parallelism. This means that each partition will be processed by only one executor (1 executor for 1 partition for 1 task at a time).
A good rule is that the number of partitions should be larger than the number of executors on your cluster.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle Edition.
2. Is one run of the executor (a batch inside the executor) equal to one task?
Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.
3. Does every task produce only one partition?
It depends on the transformation.
Spark has wide transformations and narrow transformations.
Wide transformation: input partitions contribute to many output partitions (shuffles: aggregations, sorts, joins). This is often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When we perform a shuffle, Spark writes the results to disk.
Narrow transformation: each input partition contributes to only one output partition.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle Edition.
Note: Reading a file is a narrow transformation because it does not require a shuffle, but when you read a file that is splittable, like Parquet, the file will be split into many partitions.
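A small sketch of the narrow vs. wide distinction in RDD terms (the input path and the partition count of 12 are purely illustrative):
val lines = sc.textFile("hdfs:///data/input")                     // partition count follows the input splits
val narrow = lines.map(_.toUpperCase)                             // narrow: no shuffle,
println(narrow.getNumPartitions == lines.getNumPartitions)        // the partition count is preserved
val wide = lines.map(l => (l.length, 1)).reduceByKey(_ + _, 12)   // wide: shuffles into 12 partitions
println(wide.getNumPartitions)                                    // 12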

What factors affect how many Spark jobs run concurrently?

We recently set up the Spark Job Server, to which our Spark jobs are submitted. But we found that our 20-node (8 cores / 128 GB memory per node) Spark cluster can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors actually affect how many Spark jobs can run concurrently? How can we tune the configuration so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems like Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8 * 20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, then 10 such jobs occupy all 160 slots and Spark will queue the next incoming job, waiting for CPUs to become available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other actions that operate on more than one RDD (e.g. union) may also change the number of partitions.
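For example, a hedged sketch (the path and partition counts are illustrative):
val df = spark.read.parquet("hdfs:///data/events")
println(df.rdd.getNumPartitions)       // decided by the layout of the input on disk
val moreTasks = df.repartition(160)    // full shuffle: one task per partition, enough to fill 160 slots
val fewerTasks = df.coalesce(10)       // narrow: merges partitions without a full shuffle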
Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data, it will only spawn 10 tasks (unless the data is splittable, in something like Parquet, LZO indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.

How does Spark parallelize slices to tasks/executors/workers?

I have a 2-node Spark cluster with 4 cores per node.
MASTER
(Worker-on-master) (Worker-on-node1)
Spark config:
slaves: master, node1
SPARK_WORKER_INSTANCES=1
I am trying to understand Spark's parallelize behaviour. The SparkPi example has this code:
import scala.math.random   // `random` below comes from scala.math

val slices = 8 // my test value for slices
val n = 100000 * slices
val count = spark.parallelize(1 to n, slices).map { i =>   // `spark` is the SparkContext in this example
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
As per documentation:
Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster.
I set slices to 8, which means the working set will be divided among 8 tasks on the cluster; in turn, each worker node gets 4 tasks (1:1 per core).
Questions:
Where can I see task level details? Inside executors I don't see task breakdown so I can see the effect of slices on the UI.
How to programmatically find the working set size for the map function above? I assume it is n/slices (100000 above)
Are the multiple tasks run by an executor executed sequentially or in parallel in multiple threads?
Reasoning behind 2-4 slices per CPU.
I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to the number of cores in each node (in a homogeneous cluster) so that each core gets its own executor and task (1:1:1)
I will try to answer your question as best I can:
1.- Where can I see task level details?
When submitting a job, Spark stores information about the task breakdown on each worker node, apart from the master. This data is stored, I believe (I have only tested Spark on EC2), in the work folder under the Spark directory.
2.- How to programmatically find the working set size for the map function?
Although I am not sure if it stores the in-memory size of the slices, the logs mentioned in the first answer provide information about the number of lines each RDD partition contains.
3.- Are the multiple tasks run by an executor executed sequentially or in parallel in multiple threads?
I believe different tasks inside a node run sequentially. This is shown in the logs indicated above, which indicate the start and end time of every task.
4.- Reasoning behind 2-4 slices per CPU
Some nodes finish their tasks faster than others. Having more slices than available cores distributes the tasks in a balanced way avoiding long processing time due to slower nodes.
Taking a stab at #4:
For #4 it's worth noting that "slices" and "partitions" are the same thing, there is a bug filed and efforts to clean up the docs: https://issues.apache.org/jira/browse/SPARK-1701
Here's a link that expands the reasoning in #4: http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
Specifically look at the line:
In general, we recommend 2-3 tasks per CPU core in your cluster.
An important consideration is to avoid shuffling, and setting the number of slices is part of that. It's a more complicated subject than I can explain fully here; the basic idea is to partition your data into enough partitions/slices up front to avoid Spark having to re-shuffle later to get more partitions.
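A small sketch of that advice for the 2-node x 4-core cluster in the question (the multiplier 3 just sits inside the recommended 2-4 range):
val totalCores = 2 * 4                                    // 2 workers x 4 cores
val slices = totalCores * 3                               // 2-4 slices per core -> 24 here
val data = spark.parallelize(1 to 1000000, slices)        // `spark` is the SparkContext, as in the SparkPi example
println(data.getNumPartitions)                            // 24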
1) Where can I see task level details? Inside executors I don't see task breakdown so I can see the effect of slices on the UI.
I do not understand your question, as from the UI we can definitely see the effect of partitioning (or slices, if you prefer).
2) How to programmatically find the working set size for the map function above? I assume it is n/slices (100000 above)
Please give more details about which size you are interested in. If you mean the amount of memory consumed by each worker... a partition read from HDFS defaults to the HDFS block size (historically 64 MB), so... From the official Spark documentation:
Spark prints the serialized size of each task on the master, so you can look at that to decide whether your tasks are too large; in general tasks larger than about 20 KB are probably worth optimizing.
3) Are the multiple tasks run by an executor executed sequentially or in parallel in multiple threads?
A good source for this is this question:
Spark executor & tasks concurrency
4) Reasoning behind 2-4 slices per CPU.
I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to the number of cores in each node (in a homogeneous cluster) so that each core gets its own executor and task (1:1:1)
The major goal is to not have idle workers... once a worker finishes one task, it will always have something to work on while waiting for other nodes to complete longer tasks. With a (1:1:1) setup, workers would sit idle.
