I'm investigating improvements in the new GridGain release and wanted to know how GridGain 6 handles tasks with many jobs.
Consider a situation where tasks spawn a large number of jobs (hundreds of thousands). In GridGain 4, we observed that the jobs were queued up in memory on the nodes, potentially causing "out of memory" issues. We worked around the issue by throttling job submission: we created a disk-based queue and submitted the queued jobs as earlier jobs finished.
Can GridGain 6 handle this situation, and if so, how? Are there any specific recommendations? I see that there is a Streaming API available, but can it handle our situation?
Thanks
I think you need to take advantage of the GridComputeTaskContinuousMapper class, which lets you keep a constant number of outstanding jobs within a task and emit new jobs as other jobs complete.
Take a look at ComputeContinuousMapperExample shipped with GridGain (also available on GitHub).
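The idea behind a continuous mapper (keep a fixed number of jobs in flight and emit a new one each time a job completes) can be illustrated outside GridGain. The following is a minimal Python sketch using only the standard library. It is not the GridGain API, and `run_throttled`, `max_outstanding` and `square` are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from itertools import islice


def run_throttled(job_args, job_fn, max_outstanding=4):
    """Run job_fn over job_args with at most max_outstanding jobs in flight.

    job_args may be a lazy iterator, so the pending job list never has to
    sit in memory all at once. Results are collected in completion order.
    """
    args_iter = iter(job_args)
    results = []
    with ThreadPoolExecutor(max_workers=max_outstanding) as pool:
        # Prime the window with the first max_outstanding jobs.
        pending = {pool.submit(job_fn, a)
                   for a in islice(args_iter, max_outstanding)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results.append(fut.result())
                # Each finished job makes room for at most one new job.
                for a in islice(args_iter, 1):
                    pending.add(pool.submit(job_fn, a))
    return results


def square(x):
    """Hypothetical job body used only for the example."""
    return x * x
```

For example, `run_throttled(range(100000), square, max_outstanding=8)` never has more than eight jobs submitted at a time, which is the same throttling effect the disk-based queue in the question was built to achieve.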
I am running a Spark Streaming application and caching the RDDs for history look-back. My batch duration is one minute and the average processing time is 14 seconds, so the executors are not computing for the whole batch duration. Are the executors still held because I am caching the RDDs in memory? And if they are held, should we consider that a waste of resources?
It depends on what you want to achieve.
In Spark 2.0, dynamic allocation can be enabled for Spark Streaming without bugs.
What was the problem? With a huge data workload, if you did not keep at least one executor for receiving data, you could lose data. This is solved in Spark 2.0, and releasing resources now works.
What is the advantage of keeping your data in cache when a huge amount of data comes in? You can process your data without a shuffle, which can improve your response time.
But if your batches run once a minute and take only 14 seconds on average to process, I suggest releasing the cached data and the workers to free up space for other tasks.
If you do not have enough resources for your tasks, they will be queued and handled as soon as resources become available.
What is the risk? If you release the workers, it could be hard to get the resources back if you don't have preemption enabled in YARN. Depending on your cluster, this can be a waste of resources.
What I would do is create queues that can handle your jobs: set up a high-priority queue and put your streaming application there, put the other jobs in other queues, turn on dynamic allocation, and release the cache. If your application needs more resources, YARN will handle it.
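As a rough illustration of the setup described above, the following Python sketch assembles `spark-submit` arguments that turn on dynamic allocation for a job sent to a given YARN queue. The property names are standard Spark configuration keys, but the helper `dynamic_allocation_args`, the queue name and all the numeric values are assumptions made for this example, not recommendations.

```python
def dynamic_allocation_args(queue, min_executors=1, max_executors=10,
                            cached_idle_timeout="120s"):
    """Build spark-submit arguments that enable dynamic allocation on YARN."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service lets executors be removed without
        # losing the shuffle files they wrote.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # By default, executors holding cached blocks are never released;
        # setting a finite timeout allows them to be given back when idle.
        "spark.dynamicAllocation.cachedExecutorIdleTimeout": cached_idle_timeout,
    }
    args = ["--master", "yarn", "--queue", queue]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# "streaming_high" is a hypothetical high-priority queue name.
print(" ".join(["spark-submit"] + dynamic_allocation_args("streaming_high")))
```

The `cachedExecutorIdleTimeout` line is the one most relevant to the question: as long as it is left at its default, executors that hold cached RDD blocks stay allocated even when they are idle between batches.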
I need to understand when to use spark.dynamicAllocation.enabled. What are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: My data is huge (processing will be done on 10 GB of data with transformations).
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled: Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, job A will usually finish before job B is executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle multiple jobs from a Spark application. That means you may submit jobs at the same time or one after another and, in some configurations, expect the two jobs to run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
Wrapping up...
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the two applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
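To make the difference between the FIFO and fair modes quoted above concrete, here is a toy model in plain Python. It is only an illustration of the two orderings, not Spark's scheduler: `schedule` is a hypothetical function, it hands out one task slot per step, and it ignores stages, pools and weights.

```python
from collections import deque


def schedule(jobs, mode="FIFO"):
    """Return the order in which task slots are handed out to jobs.

    jobs maps a job name to its number of tasks, in submission order.
    Exactly one slot is granted per step.
    """
    remaining = deque((name, n) for name, n in jobs.items() if n > 0)
    order = []
    while remaining:
        name, n = remaining.popleft()
        order.append(name)
        if n > 1:
            if mode == "FIFO":
                # The earliest job keeps priority until it has no tasks left.
                remaining.appendleft((name, n - 1))
            else:
                # Round robin: the job goes to the back of the line.
                remaining.append((name, n - 1))
    return order


print(schedule({"A": 3, "B": 2}, "FIFO"))  # ['A', 'A', 'A', 'B', 'B']
print(schedule({"A": 3, "B": 2}, "FAIR"))  # ['A', 'B', 'A', 'B', 'A']
```

In a real application the mode is selected with the `spark.scheduler.mode` property (`FIFO` by default, `FAIR` for fair sharing), and it only affects jobs inside one SparkContext.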
I have multiple spark jobs. Normally I submit my spark jobs to yarn and I have an option that is --yarn_queue which tells it which yarn queue to enter.
But the jobs seem to run in parallel in the same queue. Sometimes the results of one Spark job are the inputs for the next Spark job. How do I run my Spark jobs sequentially rather than in parallel in the same queue?
I have looked at this page for the Capacity Scheduler, but the closest thing I can see is the property yarn.scheduler.capacity.<queue>.maximum-applications. This only sets the number of applications that can be in PENDING and RUNNING combined. I'm interested in setting the number of applications that can be in the RUNNING state, but I don't care about the total number of applications in PENDING (or ACCEPTED, which is the same thing).
How do I limit the number of applications in state=RUNNING to 1 for a single queue?
You can configure the appropriate queue to run one task at a time in the Capacity Scheduler configuration. I suggest using Ambari for that purpose. If that is not an option, follow the instructions in the guide.
From https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html:
The Fair Scheduler lets all apps run by default, but it is also possible to limit the number of running apps per user and per queue through the config file. This can be useful when a user must submit hundreds of apps at once, or in general to improve performance if running too many apps at once would cause too much intermediate data to be created or too much context-switching. Limiting the apps does not cause any subsequently submitted apps to fail, only to wait in the scheduler’s queue until some of the user’s earlier apps finish.
Specifically, you need to configure:
maxRunningApps: limit the number of apps from the queue to run at once
E.g.
<?xml version="1.0"?>
<allocations>
  <queue name="sample_queue">
    <!-- At most one application from this queue runs at a time -->
    <maxRunningApps>1</maxRunningApps>
    <!-- other options -->
  </queue>
</allocations>
Is it possible to do a map from a mapper function (i.e from tasks) in pyspark?
In other words, is it possible to open "sub tasks" from a task?
If so, how do I pass the SparkContext to the tasks? Just as a variable?
I would like to have a job that is composed from many tasks - each of these tasks should create many tasks as well, without going back to the driver.
My use case is like this:
I am porting an application that was written using work queues to pyspark.
In my old application, tasks created other tasks, and we relied on this functionality. I don't want to redesign the whole code because of the move to Spark (especially because I will have to make sure that both platforms work during the transition phase between the systems)...
Is it possible to open "sub tasks" from a task?
No, at least not in a healthy manner*.
A task is a command sent from the driver, and Spark has one driver (a central coordinator) that communicates with many distributed workers (executors).
As a result, what you ask for here implies that every task could play the role of a sub-driver. Not even a worker can do that, and a worker would meet the same fate in my answer as the task.
Relevant resources:
What is a task in Spark? How does the Spark worker execute the jar file?
What are workers, executors, cores in Spark Standalone cluster?
*By that I mean that I am not aware of any hack for this, and if one exists it would be too specific.
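One way to port a "tasks create tasks" design without nesting is to have each task return descriptions of its child tasks, and let the driver submit them as the next round. This does go back to the driver between rounds, so it is a workaround rather than what the question asks for. The sketch below is plain Python with hypothetical names (`expand_in_rounds`, `process`, `halve`); the list comprehension stands in for what would be one distributed map over the current round in PySpark.

```python
def expand_in_rounds(initial_tasks, process):
    """Run tasks level by level.

    process(task) returns (result, child_tasks). Children are not started
    by the task itself; they are returned to the caller, which plays the
    role of the driver and schedules them in the next round.
    """
    results = []
    current = list(initial_tasks)
    while current:
        # In PySpark this step could be a single distributed map over `current`.
        outputs = [process(t) for t in current]
        results.extend(result for result, _ in outputs)
        current = [child for _, children in outputs for child in children]
    return results


def halve(n):
    """Hypothetical task: report n and spawn one child until n reaches 1."""
    return n, ([n // 2] if n > 1 else [])


print(expand_in_rounds([8, 3], halve))  # [8, 3, 4, 1, 2, 1]
```

The task logic itself stays close to the work-queue version; only the act of enqueuing a child changes from a direct submission to a returned value.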
I'm doing some instrumentation in Spark and I've realised that some of my tasks take a really long time to complete because of the Scheduler Delay Time, which can be extracted from the TaskMetrics.
I know there are already some questions on this topic, such as "What is scheduler delay in spark UI's event timeline", but the answers have not been accepted. One of them says that a task waiting for an open slot counts as scheduler delay, which I think is not true (as far as I know, a task that does not have a slot in an executor does not start generating metrics).
I'm a bit confused about where this delay really starts. I was wondering whether it also includes the period between an app being accepted by the YARN client and the submission of the app's first job. In other words, between this moment, where the app is accepted:
and this one, where it is running:
I checked directly by launching an app with few resources available in the cluster. It stayed in the queue until enough executors could be launched for the stage. Then the yarn.Client launched the stage in the cluster. The Spark metrics do not count this time in the queue as delay. It also doesn't matter if you have more tasks than cores, as in the Stack Overflow answer I linked above: the tasks are allocated to executors as they become available.
In short, scheduler delay only covers sending the task to the executor. If there is a delay here, the bottleneck is not YARN but the load on the nodes involved (normally the driver and the worker nodes hosting the app's executors).
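For reference, my understanding is that the Spark UI derives scheduler delay as whatever is left of a task's wall-clock duration after subtracting the phases that are measured directly. The sketch below mirrors that arithmetic in Python. The function and parameter names are my own, all values are in milliseconds, and the exact formula may differ between Spark versions, so treat it as an approximation rather than the definitive definition.

```python
def scheduler_delay_ms(duration, run_time, deserialize_time,
                       result_serialization_time, getting_result_time=0):
    """Approximate scheduler delay as the unexplained remainder of a task.

    duration is measured from task launch to task finish, so time an
    application spends waiting in a YARN queue before launch is not part
    of it.
    """
    remainder = (duration - run_time - deserialize_time
                 - result_serialization_time - getting_result_time)
    # Clock differences between driver and executor can make the
    # remainder negative, so clamp at zero.
    return max(0, remainder)


print(scheduler_delay_ms(1000, 800, 50, 20, 30))  # 100
```

Because every term is taken from the moment the task is launched, this is consistent with the observation above that queueing before launch does not show up as scheduler delay.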