Controlling the number of Spark drivers running on Mesos

We have a cluster with 6 EC2 nodes in AWS (16 CPU, 64 GB per node - 1 node running the Mesos master and 5 nodes running Mesos slaves). The Mesos dispatcher is running on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPU and 10 GB per executor (the same for the driver).
In one of our scheduled jobs, we have a scenario where we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all of these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options. I would like to get some pointers from members in the spark/mesos community.
Some options which I don't want to get into are: increasing the cluster size, asking the analysts to restructure their jobs to combine all spark-submits into a single one, and switching to YARN/EMR (I actually tried this and ran into some messy queue problems there).
Option 1: Using Mesos roles
I found some documentation on the use of quotas and roles to solve this, but I'm not sure about the following:
How do I create a Mesos role and update the resources to be made available to this role?
How do I set up separate roles for Spark drivers and Spark executors?
By default all resources are in the * role. There is a flag called spark.mesos.role that I can set while doing spark-submit, but I'm not sure how to create this role or how to ensure it is used only for executors.
Option 2: Modifying the Mesos cluster scheduler
When a spark-submit reaches the Mesos dispatcher, it adds the driver request to a WaitingQueue. When drivers fail while executing and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry schedule settings. When resources become available from Mesos, drivers from the PendingRetryQueue are scheduled first and drivers from the WaitingQueue are scheduled next.
I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding the extra drivers to the PendingRetryQueue and scheduling them to run later. Currently, as per my understanding, when there are more than 200 drivers in the WaitingQueue the Mesos REST server returns a failure message and doesn't add them to the PendingRetryQueue.
Any pointers on implementing either of the options would be appreciated. Thanks in advance.
Update: I just saw that when I do a spark-submit with a role, it runs only the executors in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post my update here and close this. Thanks.

As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role provided by the 'spark.mesos.role' parameter. To control the resources available for each role, we can use quotas, guarantees or reservations. We went ahead with static reservations since they suited our requirements. Thanks.
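For anyone who finds this later: a static reservation is configured on each Mesos agent at startup. A minimal sketch, assuming a role named spark for executors and the 16 CPU / 64 GB nodes from the question (the exact split is up to you; on older Mesos versions the binary is mesos-slave rather than mesos-agent):
# Reserve 10 CPUs / 20 GB per node for the spark role, leave the rest unreserved
# (the unreserved share is what the drivers in the * role will use).
mesos-agent --master=<master>:5050 \
  --resources="cpus(spark):10;mem(spark):20480;cpus(*):6;mem(*):40960"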

Option 1 is the right one.
First, set the dispatcher quota by creating a file like dispatcher-quota.json:
cat dispatcher-quota.json
{
  "role": "dispatcher",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 5.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 5120.0 }
    }
  ]
}
Then push it to your Mesos master (leader) with:
curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
So now you will have a quota for drivers.
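You can check that the quota was registered with a GET on the same endpoint:
curl http://<master>:5050/quota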
Ensure your dispatcher is running with the right service role set; if needed, adjust it. If on DC/OS, use:
$ cat options.json
{
  "service": {
    "role": "dispatcher"
  }
}
$ dcos package install spark --options=options.json
Otherwise feel free to share how you've deployed your dispatcher and I will provide a how-to guide.
That's it for drivers. Now let's do the same for executors:
$ cat executor-quota.json
{
  "role": "executor",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 100.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 409600.0 }
    }
  ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements. Then ensure executors are launched with the correct role by providing:
--conf spark.mesos.role=executor \
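For context, a complete submit using the question's 5 CPU / 10 GB executor sizing might look like this (the dispatcher URL, class name and jar are placeholders):
spark-submit \
  --master mesos://<dispatcher>:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.role=executor \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memory=10g \
  --class com.example.MyJob my-job.jar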
The source of my explanation is https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md - don't hesitate to ask if that's not enough.
This should do the trick.

Related

How to tell apart Mesos frameworks launched by different commands/parameters

I am building a metrics collector to gather the running status of all the Spark jobs on our cluster. The Mesos API http://masterip/frameworks returns a lot of details about all the frameworks, and then I call http://slaveip/slave(1)/monitor/statistics to get each framework's details from each slave, then correlate them.
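The correlation step I have now looks roughly like this (hitting the agent directly on its default port; jq assumed to be available):
# Framework names keyed by id, from the master.
curl -s http://masterip:5050/frameworks | jq '.frameworks[] | {id, name}'
# Per-executor statistics from one slave; join the two lists on framework_id.
curl -s http://slaveip:5051/monitor/statistics | jq '.[] | {framework_id, executor_name}'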
This works fine for most of the jobs, but I have some jobs which behave differently according to the parameters given at submit time. They show up with the same framework name in the Mesos GUI, and I cannot tell them apart.
Is there a way to get the full command which launched a job? Or any other idea for telling them apart?
You can see there are multiple instances with the same framework name, as they are different Spark job instances.
When I connect to a Mesos slave, monitor/statistics doesn't show the full command with all the parameters, so I cannot tell which framework correlates to which Spark job instance.
{
  "executor_id": "0",
  "executor_name": "Command Executor (Task: 0) (Command: sh -c '\"/usr/local...')",
  "framework_id": "06ba8de8-7fc3-422d-9ee3-17dd9ddcb2ca-3157",
  "source": "0",
  "statistics": {
    "cpus_limit": 2.1,
    "cpus_system_time_secs": 848.689999999,
    "cpus_user_time_secs": 5128.78,
    "mem_limit_bytes": 4757389312,
    "mem_rss_bytes": 2243149824,
    "timestamp": 1522858776.20098
  }
},
Thanks

Spark Standalone mode with master service discovery

We have a Spark standalone cluster with 2 masters. We use Consul to discover all of our services, so that instead of writing into the worker configuration something like:
spark://172.40.101.1:7077,172.40.102.2:7077
we just write
spark://spark-master.service:7077
The problem is that if, for example, 172.40.101.1 is standby and 172.40.102.2 is active, and the worker initially resolves 101.1, it will not try again. It seems to be static.
I can work around this using dig and some Linux parsing (sketched below), but my questions are:
Is the worker config static?
Is there a best practice for this issue?
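For reference, the dig workaround looks roughly like this (assuming Consul's DNS interface on its default port 8600 and the standard .consul domain):
# Resolve the currently healthy master via Consul DNS, then start the worker against it.
MASTER_IP=$(dig +short @127.0.0.1 -p 8600 spark-master.service.consul | head -n 1)
./sbin/start-slave.sh "spark://${MASTER_IP}:7077"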
There are two parts to this problem. The first is: how do you identify the active (or standby) Spark master? The second is: how can you use that information to connect to the proper one?
If you can tell, either by a web URL GET or by process inspection, which one is active and which one(s) are standby, you can create a service / health check based on that. Googling around a bit, I see the Spark Consul service and its health check here:
{
  "service": {
    "name": "spark-master",
    "port": 7077,
    "checks": [
      {
        "script": "ps aux | grep -v grep | grep org.apache.spark.deploy.master.Master",
        "interval": "10s"
      }
    ]
  }
}
This health check finds a Java process via a script. If the process is found, the health check succeeds. This particular health check doesn't care whether the node is active or standby; either matches. You would need a health check, under a service with a different name, that determines whether the Spark node is active. I don't know much about Spark, but looking on the net I found that the master's web UI reports its status; if that works as I imagine, this might do the trick:
{
  "service": {
    "name": "spark-active",
    "port": 7077,
    "checks": [
      {
        "script": "curl --silent http://127.0.0.1:8080/ | grep '<li><strong>Status:</strong> ALIVE</li>' | wc -l | awk '{exit ($0 - 1)}'"
      }
    ]
  }
}
Then you would connect using:
spark://spark-active.service:7077
Your health check can also connect via HTTP. Consul service checks are documented here: https://www.consul.io/docs/agent/checks.html
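For instance, the basic liveness check could be expressed as a native HTTP check instead of a script (a sketch; note that a plain 200 check cannot distinguish active from standby, since the standby master's UI also responds, so the ALIVE grep above is still needed for spark-active):
{
  "service": {
    "name": "spark-master",
    "port": 7077,
    "checks": [
      {
        "http": "http://127.0.0.1:8080/",
        "interval": "10s"
      }
    ]
  }
}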

How to set spark.driver.memory for Spark/Zeppelin on EMR

When using EMR (with Spark and Zeppelin), changing spark.driver.memory in Zeppelin's Spark interpreter settings won't work.
I wonder what the best and quickest way is to set the Spark driver memory when using the EMR web interface (not the AWS CLI) to create clusters?
Could a bootstrap action be a solution?
If yes, can you please provide an example of what the bootstrap action file should look like?
You can always add the following configuration at job flow/cluster creation:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "12G"
    }
  }
]
You can do this for most configurations, whether spark-defaults, hadoop core-site, etc.
I hope this helps!
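If you want to confirm the value took effect after the cluster comes up, you can SSH to the master node and check the generated config (this is the usual EMR location, as far as I know):
grep spark.driver.memory /etc/spark/conf/spark-defaults.conf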

Spark not using all nodes after resizing

I've got an EMR cluster that I'm trying to use to execute large text processing jobs. I had this running on a smaller cluster, but after resizing, the master keeps running the jobs locally and crashing due to memory issues.
This is the current configuration I have for my cluster:
[
  {
    "classification": "capacity-scheduler",
    "properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    },
    "configurations": []
  },
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    },
    "configurations": []
  },
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.executor.instances": "0",
      "spark.dynamicAllocation.enabled": "true"
    },
    "configurations": []
  }
]
This was a potential solution I saw from this question, and it did work before I resized.
Now whenever I attempt to submit a Spark job like this: spark-submit mytask.py
I see tons of log entries where the work doesn't seem to leave the master host, like so:
17/08/14 23:49:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0,localhost, executor driver, partition 0, PROCESS_LOCAL, 405141 bytes)
I've tried different parameters, like setting --deploy-mode cluster and --master yarn (since YARN is running on the master node), but I'm still seeing all the work being done by the master host while the core nodes sit idle.
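For reference, a variant I tried looks like this (same script as above):
spark-submit --master yarn --deploy-mode cluster mytask.py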
Is there another configuration I'm missing, preferably one that doesn't require rebuilding the cluster?

Service Fabric Application Package Deployment Operation Timeout exception

I have a Service Fabric cluster with 3 nodes created on 3 inter-connected systems, and I am able to connect to each of the nodes. The nodes are created on Windows Server, and these Windows Server VMs are on-premises.
I am trying to deploy my package into the cluster manually, and I am getting an Operation Timeout exception. I have used the commands below for deployment.
Service Fabric PowerShell commands:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype'
After executing the above command, it runs for 2-3 minutes and throws an Operation Timeout exception. My package size is almost 250 MB and approximately 15,000 files exist in the package. I then passed an extra parameter, -TimeoutSec 600 (10 minutes), explicitly in the above command; it then executed successfully and the package was copied to the Service Fabric image store.
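For reference, the copy command that finally worked looks like this:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype' -TimeoutSec 600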
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype'
After the Copy-ServiceFabricApplicationPackage command completed, I executed the above Register-ServiceFabricApplicationType command to register my application type in the cluster, but it also throws an Operation Timeout exception. I then passed the extra -TimeoutSec 600 (10 minutes) parameter explicitly in the above command, but no luck; it throws the same Operation Timeout exception.
Just to make sure whether the Operation Timeout issue is caused by the number of files in the package or not, I created a simple, empty Service Fabric ASP.NET Core app, packaged it, and tried to deploy it on the same server using the above commands. It deployed within a fraction of a second and works smoothly.
Does anybody have any idea how to overcome this Service Fabric Operation Timeout issue?
How do I handle the Operation Timeout issue when the package contains a large set of files?
Any help/suggestion would be very appreciated.
Thanks,
If this is taking longer than the 10-minute default max, it's probably one of the following issues:
Large application packages (>100s of MB)
Slow network connections
A large number of files within the application package (>1000s).
The following workarounds should help you.
Add the following settings to your cluster config:
"fabricSettings": [
{
"name": "NamingService",
"parameters": [
{
"name": "MaxOperationTimeout",
"value": "3600"
},
]
}
]
Also add:
"fabricSettings": [
{
"name": "EseStore",
"parameters": [
{
"name": "MaxCursors",
"value": "32768"
},
]
}
]
There are a couple of additional features which are currently rolling out. For these to be present and functional, the client needs to be at least 2.4.28 and the runtime of your cluster at least 5.4.157. If you're staying up to date, these should already be present in your environment.
For registration you can specify the -Async flag, which handles the upload asynchronously, reducing the required timeout to just the time necessary to send the command, not the application package. You can also query the status of the registration with Get-ServiceFabricApplicationType. 5.5 fixes some issues with these commands, so if they aren't working for you, you'll have to wait for that release to hit your environment.
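A sketch of that flow, assuming your SDK version supports -Async:
# Returns as soon as the command is accepted; registration continues in the background.
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype' -Async
# Poll until the application type shows as Available.
Get-ServiceFabricApplicationType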
