Where does Spark schedule the .textFile() task?

Say I want to read data from an external HDFS cluster, and I have 3 workers in my cluster (one may be a bit closer to external_host, but none is on the same host).
sc.textFile("hdfs://external_host/file.txt")
I understand that Spark schedules tasks based on the locality of the underlying RDD. But on which worker (i.e. executor) is .textFile(..) scheduled, since we do not have an executor running on external_host?
I imagine it loads the HDFS blocks as partitions into worker memory, but how does Spark decide which worker is best? (I would imagine it chooses the closest one based on latency or something else; is this correct?)

Related

How is data accessed by worker nodes in a Spark Cluster?

I'm trying to understand how Spark works. I know the cluster manager allocates the resources (workers) for the driver program.
I want to know how the cluster manager sends the tasks (which transformations) to worker nodes, and how the worker nodes access the data (assume my data is in S3).
Do worker nodes read only a part of the data, apply all transformations to it, and return the result of the action to the driver program? Or does each worker node read the entire file, apply only specific transformations, and return the result to the driver program?
Follow-up questions:
How, and by whom, is it decided how much data needs to be sent to each worker node, given that we have established that only partial data is present on each worker node? E.g.: I have two worker nodes with 4 cores each, and one 1 TB CSV file to read, a few transformations to perform, and an action. Assume the CSV is on S3 and on the master node's local storage.
It's going to be a long answer, but I will try to simplify it as best I can:
Typically a Spark cluster contains multiple nodes; each node has multiple CPUs, a bunch of memory, and storage. Each node holds some chunks of data, which is why they are sometimes also referred to as data nodes.
When Spark applications are started, they create multiple workers, or executors. Those workers/executors take resources (CPU, RAM) from the cluster's nodes above. In other words, the nodes in a Spark cluster play both roles: data storage and computation.
But as you might have guessed, the data on a node is (sometimes) incomplete; therefore, workers may have to "pull" data across the network to do a partial computation. Then the results are sent back to the driver. The driver just does the "collection work", combining them all to get the final result.
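For example (a minimal spark-shell sketch, where spark is already defined and the S3 path is hypothetical), each task reads and filters only its own slice of the data, and only small partial results come back to the driver, not the data itself:

val errors = spark.read.textFile("s3a://my-bucket/logs/")  // hypothetical path; each task reads one split
  .filter(line => line.contains("ERROR"))                  // applied per partition, on the executors
println(errors.count())                                     // partial counts are combined into the final result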
Edit #1:
How, and by whom, is it decided how much data needs to be sent to worker nodes?
Each task decides which data is "local" and which is not. This article explains pretty well how data locality works.
I have two worker nodes with 4 cores each, and one 1 TB CSV file to read, a few transformations to perform, and an action
This situation is different from the one above: here you have only one file, and most likely your worker would be the same machine as your data node. The executor(s) sitting on that worker, however, would read the file piece by piece (as tasks), in parallel, in order to increase parallelism.
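A rough spark-shell sketch of that piece-by-piece reading (the bucket and file name are hypothetical): Spark splits the single large file into many input partitions, and each task reads just one slice of it:

val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/big-file.csv")   // one large file, read in splits rather than as a whole
println(df.rdd.getNumPartitions)          // number of slices = number of read tasks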

Spark dataframe.load(): when I execute a load command, where does my data actually get stored?

If I am loading one table from Cassandra using spark dataframe.load(), where will my data get loaded? Is it in Spark memory, or in data node blocks, if I am using the YARN resource manager?
It will try to store the data in memory, per number of partitions, on the Worker Nodes, which in this context is a slightly better term than Data Nodes.
It will spill to disk if there is not enough memory on the Worker Nodes.
Processing will occur per number of Cores / Executors. E.g. if you have, say, 20 Executors with 1 Core each, your processing concurrency is 20 and spilling will occur via eviction. If you run out of disk, an error will result.
Worker Nodes is a better term here compared to Data Nodes, unless you have HDFS and local processing, in which case a Worker Node is equal to a Data Node. Although you could argue: what's in a name?
Of course, an Action will need to have been initiated.
And repartition, join, or union later in the data pipeline affect things, but that goes without saying.
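As a rough illustration (the keyspace and table names are hypothetical, and this assumes the spark-cassandra-connector package is on the classpath), the table is only pulled into executor memory, partition by partition, when an action runs, and MEMORY_AND_DISK lets partitions spill to local disk instead of failing:

import org.apache.spark.storage.StorageLevel

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // hypothetical names
  .load()

df.persist(StorageLevel.MEMORY_AND_DISK)  // keep partitions in executor memory, spill to disk if needed
println(df.count())                        // the action that actually loads the data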

Distributing partitions across a cluster

In Apache Spark one is allowed to load datasets from many different sources. According to my understanding, the computing nodes of a Spark cluster can be different from those used by Hadoop to store data (am I right?). What is more, we can even load a local file into a Spark job. Here is the main question: even if we use the same computers for HDFS and Spark purposes, is it always true that Spark, during the creation of an RDD, will shuffle all the data? Or will Spark just try to load the data in a way that takes advantage of the already existing data locality?
You can use HDFS as the common underlying storage for both MapReduce (Hadoop) and Spark engines, and use a cluster manager like YARN to perform resource management. Spark will try to take advantage of data locality, and execute tasks as close as possible to the data.
This is how it works: if data is available on a node to process but the CPU is not free, Spark will wait for a certain amount of time (determined by the configuration parameter spark.locality.wait, default 3 seconds) for the CPU to become available.
If the CPU is still not free after the configured time has passed, Spark will switch the task to a lower locality level. It will then again wait for spark.locality.wait seconds, and if a timeout occurs again, it will switch to a yet lower locality level.
The locality levels are defined as below, in order from closest to data, to farthest from data (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.TaskLocality$):
PROCESS_LOCAL (data is in the same JVM as the running code)
NODE_LOCAL (data is on the same node)
NO_PREF (data is accessed equally quickly from anywhere and has no locality preference)
RACK_LOCAL (data is on the same rack of servers)
ANY (data is elsewhere on the network and not in the same rack)
The waiting time can also be configured individually per locality level. For longer jobs, the wait time can be increased to a larger value than the default of 3 seconds, since the CPU might be tied up longer.
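For example (illustrative values only, not recommendations), the global wait and the per-level waits can be raised for a long job when building the session:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-wait-example")
  .config("spark.locality.wait", "10s")       // global fallback wait before dropping a locality level
  .config("spark.locality.wait.node", "30s")  // wait longer before giving up on NODE_LOCAL
  .config("spark.locality.wait.rack", "10s")  // wait at rack locality before falling back to ANY
  .getOrCreate()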

What is the relationship between tasks and partitions?

Can I say:
Is the number of Spark tasks equal to the number of Spark partitions?
The executor runs once (batch inside of executor) is equal to one task?
Does every task produce only one partition?
(duplicate of 1.)
The degree of parallelism, or the number of tasks that can be run concurrently, is set by:
The number of Executor Instances (configuration)
The Number of Cores per Executor (configuration)
The Number of Partitions being used (coded)
Actual parallelism is the smaller of:
executors * cores, which gives the number of slots available to run tasks;
partitions: each partition will translate to a task whenever a slot opens up (see the sketch below).
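A small spark-shell sketch of that arithmetic (all numbers are made up): if the job was submitted with 5 executors of 4 cores each there are 20 slots, but an RDD with 8 partitions only ever keeps 8 of them busy in a given stage:

// submitted with e.g. --num-executors 5 --executor-cores 4  =>  20 task slots
val rdd = sc.parallelize(1 to 1000000, 8)  // 8 partitions  =>  8 tasks per stage
println(rdd.getNumPartitions)               // 8: the effective parallelism of this stage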
Tasks that run on the same executor will share the same JVM. This is used by the Broadcast feature as you only need one copy of the Broadcast data per Executor for all tasks to be able to access it through shared memory.
You can have multiple executors running, on the same machine, or on different machines. Executors are the true means of scalability.
Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².
So -
Is the number of the Spark tasks equal to the number of the Spark partitions?
No (see previous).
The executor runs once (batch inside of executor) is equal to one task?
An Executor is started as an environment for the tasks to run. Multiple tasks will run concurrently within that Executor (multiple threads).
Does every task produce only one partition?
For a task, it is one Partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks.
Is the number of Spark tasks equal to the number of Spark partitions?
Same as (1.)
(¹) Assumption is that within your tasks, you are not multithreading yourself (never do that, otherwise core estimate will be off).
(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.
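To illustrate the "one partition in, one partition out" point above (a spark-shell sketch with arbitrary numbers): the map tasks each consume and produce a single partition, while repartition introduces a shuffle that changes the partition count for the next stage's tasks:

val before = sc.parallelize(1 to 100, 4).map(_ * 2)  // 4 partitions  =>  4 map tasks
val after  = before.repartition(10)                   // shuffle boundary between stages
println(after.getNumPartitions)                        // 10 tasks in the stage that follows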
Partitions are a feature of RDDs and are only available at design time (before an action is called).
Tasks are part of a TaskSet, per Stage, per ActiveJob in a Spark application.
Is the number of the Spark tasks equal to the number of the Spark partitions?
Yes.
The executor runs once (batch inside of executor) is equal to one task?
That recursively uses "executor" and does not make much sense to me.
Does every task produce only one partition?
Almost.
Every task produces the output of executing the code (it was created for) on the data in a partition.
Is the number of Spark tasks equal to the number of Spark partitions?
Almost.
The number of Spark tasks in a single stage equals the number of RDD partitions.
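A quick way to see this in spark-shell (the partition count is arbitrary): the count below runs as a single stage, and the Spark UI shows exactly as many tasks in that stage as the RDD has partitions:

val rdd = sc.parallelize(1 to 1000, 6)  // 6 partitions
println(rdd.getNumPartitions)            // 6
rdd.count()                               // one stage with 6 tasks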
1. Is the number of Spark tasks equal to the number of Spark partitions?
Yes.
Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster. The default partition size is 128 MB. Partitions allow every executor to perform work in parallel. A single partition will have a parallelism of only one, even if you have many executors.
Many partitions with only one executor will still give you a parallelism of only one. You need to balance the number of executors and partitions to get the desired parallelism. This means that each partition will be processed by only one executor (1 executor for 1 partition for 1 task at a time).
A good rule is that the number of partitions should be larger than the number of executors on your cluster.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle Edition.
2. The executor runs once (batch inside of executor) is equal to one task?
Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.
3. Does every task produce only one partition?
It depends on the transformation.
Spark has wide transformations and narrow transformations.
Wide Transformation: input partitions contribute to many output partitions (shuffles: aggregations, sorts, joins). Often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When we perform a shuffle, Spark writes the results to disk.
Narrow Transformation: each input partition contributes to only one output partition.
See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle Edition.
Note: Reading a file is a narrow transformation because it does not require a shuffle, but when you read a file in a splittable format like Parquet, the file will be split into many partitions.
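A small spark-shell sketch of the two kinds of transformation (the expressions are made up, and spark.implicits._ is assumed to be imported, as it is in spark-shell): the filter is narrow, so each input partition maps to one output partition, while the groupBy forces a shuffle, i.e. a wide transformation:

val ids    = spark.range(0, 1000000)                      // a Dataset with an "id" column
val narrow = ids.filter($"id" % 2 === 0)                  // narrow: no shuffle, partition in -> partition out
val wide   = narrow.groupBy(($"id" % 100).as("bucket"))   // wide: partitions are exchanged across the cluster
  .count()
wide.show(5)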

What factors affect how many Spark jobs run concurrently?

We have recently set up the Spark Job Server to which Spark jobs are submitted. But we found out that our 20-node (8 cores / 128 GB memory per node) Spark cluster can only run 10 Spark jobs concurrently.
Can someone share some detailed info about what factors actually affect how many Spark jobs can be run concurrently? How can we tune the configuration so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems like Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8 * 20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, then 10 jobs fill all 160 slots, and Spark will queue the next incoming job until CPUs become available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other operations that act on more than one RDD (e.g. union) may also change the number of partitions.
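As a small spark-shell illustration (the input path is hypothetical and the numbers are arbitrary), both of the knobs mentioned above change how many tasks the next stage runs:

val rdd   = sc.textFile("hdfs:///data/input")  // hypothetical path; partitions follow the input splits
val more  = rdd.repartition(200)                // full shuffle up to 200 partitions  =>  200 tasks
val fewer = rdd.coalesce(10)                    // merge down to 10 partitions without a full shuffle
println(more.getNumPartitions + " / " + fewer.getNumPartitions)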
Some things that could limit the parallelism that you're seeing:
If your job consists only of map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data it will only spawn 10 tasks (unless the data is splittable, as with Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take: it starts with one task and then grows until it collects enough data to satisfy the take operation. (Another question similar to this.)
Can you share more about your workflow? That would help us diagnose it.

Resources