Spark Standalone mode with master service discovery - apache-spark

We have a Spark standalone cluster with 2 masters. We use Consul to discover all of our services, so instead of writing the worker configuration as:
spark://172.40.101.1:7077,172.40.102.2:7077
we just write
spark://spark-master.service:7077
The problem is that if, for example, 172.40.101.1 is standby and 172.40.102.2 is active, and the worker resolves 101.1 on its first lookup, it will not try again. It seems the configuration is static.
I can work around this with dig and some Linux parsing, but my questions are:
Is the worker config static?
Is there a best practice for this issue?
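For context, a minimal sketch of the kind of dig-based workaround mentioned above (it assumes spark-master.service resolves to every master's IP and that the standard sbin/start-slave.sh script is used to start the worker):
# build a spark://ip1:7077,ip2:7077 URL from all IPs behind the Consul name
MASTERS=$(dig +short spark-master.service | awk -v ORS=, '{print $1":7077"}' | sed 's/,$//')
./sbin/start-slave.sh "spark://${MASTERS}"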

There are two parts to this problem. The first is: how do you identify an active (or standby) Spark master? The second is: how can you use that information to connect to the proper one?
If you can tell, either via a web URL GET or by inspecting a process, which master is active and which one(s) are standby, you can create a Consul service / health check based on that. Googling around a bit, I see a Spark Consul service and its health check defined like this:
{
  "service": {
    "name": "spark-master",
    "port": 7077,
    "checks": [
      {
        "script": "ps aux | grep -v grep | grep org.apache.spark.deploy.master.Master",
        "interval": "10s"
      }
    ]
  }
}
This health check finds a Java process via a script. If the process is found, the health check succeeds. This particular check doesn't care whether the node is active or standby; either matches. You would need a health check, under a service with a different name, that determines whether the Spark node is active. I don't know much about Spark, but looking around the net I see that the master's web UI reports its status. If that works as I imagine, this might do the trick:
{
  "service": {
    "name": "spark-active",
    "port": 7077,
    "checks": [
      {
        "script": "curl --silent http://127.0.0.1:8080/ | grep '<li><strong>Status:</strong> ALIVE</li>' | wc -l | awk '{exit ($0 - 1) }'",
        "interval": "10s"
      }
    ]
  }
}
Then you would connect using:
spark://spark-active.service:7077
Your health check can also connect via HTTP. Consul service checks are documented here: https://www.consul.io/docs/agent/checks.html
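For example, the basic spark-master liveness check could also be written as a Consul HTTP check rather than a script check (a sketch only; an HTTP check passes on any 2xx response, so by itself it would not distinguish an active master from a standby one):
{
  "service": {
    "name": "spark-master",
    "port": 7077,
    "checks": [
      {
        "http": "http://127.0.0.1:8080/",
        "interval": "10s",
        "timeout": "2s"
      }
    ]
  }
}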
-g

Related

how to update openshift kube-apiserver component with a new container image?

OpenShift provides an update mechanism that upgrades the whole platform live, while I (and perhaps others) need to update only specific components.
It's possible to update components such as console or openshift-apiserver with a new container image by managing the corresponding operator and setting the image accordingly.
For example, to update the openshift-apiserver component, the following steps do work:
Disable the management of the openshift-apiserver operator:
#oc patch openshiftapiservers.operator.openshift.io cluster --patch '{ "spec": { "managementState": "Unmanaged" } }' --type=merge
Set a new container image for the openshift-apiserver deployment:
#oc set image deploy apiserver openshift-apiserver=registry.somecorp.com:5000/ocp4/openshift4:openshfit-apiserver-4.4.4-t1 -n openshift-apiserver
Check and wait for the rollout status:
#oc rollout status -w deploy/apiserver -n openshift-apiserver
While for the base kube-apiserver component, things are different.
Firstly, the same way of disabling the related operator does not work; it seems the kubeapiserver operator does not support the "Unmanaged" state.
#oc patch kubeapiserver.operator.openshift.io cluster --patch '{ "spec": { "managementState": "Unmanaged" } }' --type=merge
The KubeAPIServer "cluster" is invalid: spec.managementState: Invalid
value: "": spec.managementState in body should match
'^(Managed|Force)$'
Secondly, instead of a deployment, it seems plain pods are used for kube-apiserver. While there is a way to set the image for a specific pod/container, I can't figure out how to make the setting take effect.
#oc set image pod kube-apiserver-master-0 kube-apiserver=registry.somecorp.com:5000/ocp4/openshift4:hyperkube-t1 -n openshift-kube-apiserver
pod/kube-apiserver-master-0 image updated
Could someone help me figure out an approach to manually update kube-apiserver in an OpenShift system? Thanks for any information.
Using option A described here (https://github.com/openshift/enhancements/blob/master/enhancements/operator-dev-doc.md), the kube-apiserver component can indeed be updated on a running cluster.
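For reference, a rough sketch of what that approach involves, based on the ClusterVersion overrides mechanism the linked doc describes (the operator deployment name, namespace and group below are illustrative and may differ by OpenShift version; treat this as an outline, not the exact procedure):
# make the Cluster Version Operator stop managing the kube-apiserver operator,
# so its deployment (and the images it rolls out) can be changed manually
oc patch clusterversion version --type=json -p '[
  {"op": "add", "path": "/spec/overrides", "value": [
    {"kind": "Deployment", "group": "apps", "name": "kube-apiserver-operator",
     "namespace": "openshift-kube-apiserver-operator", "unmanaged": true}
  ]}
]'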

Controlling the number of spark drivers running on Mesos

We have a cluster with 6 EC2 nodes in AWS (16 CPUs, 64 GB per node; 1 node running the Mesos master and 5 nodes running Mesos slaves). The Mesos dispatcher is running on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPUs and 10 GB per executor (same for the driver).
In one of our scheduled jobs, we have a scenario where we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options. I would like to get some pointers from members in the spark/mesos community.
Some options which I don't want to get into are: increasing the cluster size, asking the analysts to change their job structure to combine all spark-submits into a single one, and switching to YARN/EMR (I actually tried this and got into some messy queue problems there).
Option 1: Using Mesos roles
I found some documentation on the use of quotas and roles to solve this, but I'm not sure about the following:
How do I create a Mesos role and update the resources made available to this role?
How do I set up separate roles for Spark drivers and Spark executors?
By default all resources are in the * role. There is a setting called spark.mesos.role that I can pass to spark-submit, but I'm not sure how to create this role and ensure it is used only for executors.
Option 2: Modifying the Mesos cluster scheduler
When a spark-submit reaches the Mesos dispatcher, it adds the driver request to a WaitingQueue. When drivers fail while executing and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry schedule settings. When resources become available from Mesos, the drivers from the PendingRetryQueue are scheduled first and the WaitingQueue is scheduled next. I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding those drivers to the PendingRetryQueue and scheduling them to run later. Currently, as per my understanding, when there are more than 200 drivers in the WaitingQueue the Mesos REST server sends a failure message and doesn't add them to the PendingRetryQueue.
Any help on implementing either of the options would be very helpful to me. Thanks in advance.
Update: I just saw that when I do a spark-submit with a role, only the executors run in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post my update here and close this. Thanks
As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role provided by the 'spark.mesos.role' parameter. To control the resources available for each role, we can use quotas, guarantees or reservations. We went ahead with static reservations since they suited our requirements. Thanks.
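For anyone going the static-reservation route, a minimal sketch of what it looks like on a Mesos agent (the role name and amounts are placeholders, not the values from this cluster):
# statically reserve part of an agent's resources for the executor role at agent start-up
mesos-agent --master=<master-ip>:5050 \
  --work_dir=/var/lib/mesos \
  --resources='cpus(executor):11;mem(executor):40960;cpus(*):5;mem(*):20480'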
Option 1 is the way to go.
First, set the dispatcher quota by creating a file like dispatcher-quota.json:
$ cat dispatcher-quota.json
{
  "role": "dispatcher",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 5.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 5120.0 }
    }
  ]
}
Then push it to your Mesos master (leader) with:
curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
So now you will have a quota for drivers.
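You can check that the quota was registered by querying the master (an optional verification step, not part of the original guide):
curl -s http://<master>:5050/quota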
Ensure your dispatcher is running with the right service role set; if needed, adjust it. If on DC/OS, use:
$ cat options.json
{
  "service": {
    "role": "dispatcher"
  }
}
$ dcos package install spark --options=options.json
Otherwise, feel free to share how you've deployed your dispatcher and I will provide a how-to guide.
That's it for drivers. Now let's handle executors the same way:
$ cat executor-quota.json
{
  "role": "executor",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 100.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 409600.0 }
    }
  ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements.
Then make sure executors are launched with the correct role by providing:
--conf spark.mesos.role=executor \
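In context, a full cluster-mode submit through the dispatcher might look like this (a sketch; the dispatcher address, class and jar location are placeholders):
spark-submit \
  --master mesos://<dispatcher-host>:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.role=executor \
  --class com.example.MyJob \
  http://<artifact-host>/my-job.jar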
The source of my explanation is https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md; don't hesitate to ask if it's not enough.
This should do the job.

how to tell apart mesos frameworks launched by different commands/parameters

I am building a metrics collector to gather the running status of all the Spark jobs running on Mesos. The Mesos API http://masterip/frameworks returns a lot of details about all the frameworks; I then call http://slaveip/slave(1)/monitor/statistics to get each framework's details from each slave and correlate them.
This works fine for most of the jobs, but I have some jobs which behave differently depending on the parameters used at submission. They show up with the same framework name in the Mesos GUI and I cannot tell them apart.
Is there a way to get the full command which launched the job? Or any other idea about how to tell them apart?
You can see that there are multiple instances with the same framework name, as they are different Spark job instances.
When I connect to a Mesos slave, monitor/statistics doesn't show the full command with all the parameters, so I cannot tell which framework correlates with which Spark job instance:
{
  "executor_id": "0",
  "executor_name": "Command Executor (Task: 0) (Command: sh -c '\"/usr/local...')",
  "framework_id": "06ba8de8-7fc3-422d-9ee3-17dd9ddcb2ca-3157",
  "source": "0",
  "statistics": {
    "cpus_limit": 2.1,
    "cpus_system_time_secs": 848.689999999,
    "cpus_user_time_secs": 5128.78,
    "mem_limit_bytes": 4757389312,
    "mem_rss_bytes": 2243149824,
    "timestamp": 1522858776.20098
  }
},
Thanks
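For reference, a minimal sketch of the correlation step described above, using the same endpoints with default Mesos ports (jq is assumed to be available):
# framework id -> framework name, from the master
curl -s http://<masterip>:5050/frameworks | jq -r '.frameworks[] | "\(.id) \(.name)"'
# framework id -> executor, from each slave's monitoring endpoint
curl -s http://<slaveip>:5051/monitor/statistics | jq -r '.[] | "\(.framework_id) \(.executor_name)"'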

DC/OS 1.9 VIP load balancing not working for advertised ports

When I publish a service with a VIP, the advertised address does not route properly to the advertised port. For example, for a MariaDB Galera 3-node cluster service with a VIP specified as:
"labels": {
"VIP_0": "/mariadb-galera:3306"
}
On the configuration tab of the service page (and according to the docs), the load balanced address is:
mariadb-galera.marathon.l4lb.thisdcos.directory:3306
I can ping the DNS name just fine, but...
When I try to connect a front-end service (Drupal7, wordpress) to consume this load balanced address:port combination, there will be numerous connection failures and timeouts. It isn't that it never works but that it works quite sporadically, if at all. Drupal7 dies almost immediately and starts kicking up Bad Gateway errors.
What I have found through experimentation is that if I specify a hostPort for the service in question, the load balanced address will work as long as I use the hostPort value, and not the advertised load balanced service port as above. In this specific case I specified a hostPort of 3310.
"network":"USER",
"portMappings": [
{
"containerPort": 3306,
"hostPort": 3310,
"servicePort": 10000,
"name": "mariadb-galera",
"labels": {
"VIP_0": "/mariadb-galera:3306"
}
}
Then if I use the load balanced address (mariadb-galera.marathon.l4lb.thisdcos.directory) with the host port value (3310) in my Drupal7 settings.php, the front end connects and works fine.
I've noticed similar behaviour with custom applications connecting to mongodb backends also in a DC/OS environment... it seems the load balanced address/port combination specified never works reliably... but if you substitute the hostPort value, it does.
The docs clearly state that:
address and port is load balanced as a pair rather than individually.
(from https://docs.mesosphere.com/1.9/networking/dns-overview/)
Yet I am unable to connect reliably when I specify the VIP-designated port, while it DOES WORK when I use the hostPort (and will not work at all unless I designate a specific hostPort in the service definition JSON). Whether or not this approach is actually load balanced remains a question to me, based on the wording in the documentation.
I must be doing something wrong, but I am at a loss... any help is appreciated.
My cluster nodes are VMWare virtual machines.
The VIP label shouldn't start with a slash:
"container": {
"portMappings": [
{
"containerPort": 3306,
"name": "mariadb-galera",
"labels": {
"VIP_0": "mariadb-galera:3306"
}
}
}
The service should then be available as <VIP label>.marathon.l4lb.thisdcos.directory:<VIP port>, in this case:
mariadb-galera.marathon.l4lb.thisdcos.directory:3306
you can test it using nc:
nc -z -w5 mariadb-galera.marathon.l4lb.thisdcos.directory 3306; echo $?
The command should return 0.
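Since the backend here is MariaDB, an end-to-end check with the client also works (a sketch; the credentials are placeholders):
mysql -h mariadb-galera.marathon.l4lb.thisdcos.directory -P 3306 -u <user> -p -e 'SELECT 1'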
When you're not sure about exported DNS names you can list all of them from any DC/OS node:
curl -s http://198.51.100.1:63053/v1/records | grep mariadb-galera

Service Fabric Application Package Deployment Operation Timeout exception

I have a Service Fabric cluster with 3 nodes created on 3 systems, all interconnected, and I am able to connect to each of the nodes. These nodes were created on Windows Server; the Windows Server VMs are on-premises.
I am trying to deploy my package manually into my cluster (one of the nodes), and I am getting an Operation Timeout exception. I used the commands below for deployment.
Service Fabric PowerShell commands:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype'
After executing the above command, it runs for 2-3 minutes and throws an Operation Timeout exception. My package size is almost 250 MB and it contains approximately 15,000 files. I then explicitly passed an extra parameter, -TimeoutSec 600 (10 minutes), to the above command; it executed successfully and the package was copied to the Service Fabric image store.
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype'
After the Copy-ServiceFabricApplicationPackage command, I executed the above Register-ServiceFabricApplicationType command to register my application type in the cluster, but it also throws an Operation Timeout exception. I then explicitly passed -TimeoutSec 600 (10 minutes) to this command as well, but no luck; it throws the same Operation Timeout exception.
Just to check whether the Operation Timeout issue is caused by the number of files in the package, I created a simple, empty Service Fabric ASP.NET Core app, created its package, and tried to deploy it to the same cluster using the above commands; it deployed in a fraction of a second and works smoothly.
Does anybody have any idea how to overcome this Service Fabric operation timeout issue?
How should the operation timeout be handled if the package contains a large set of files?
Any help/suggestion would be much appreciated.
Thanks,
If this is taking longer than the 10-minute default max, it's probably one of the following issues:
Large application packages (>100s of MB)
Slow network connections
A large number of files within the application package (>1000s).
The following workarounds should help you.
Add the following settings to your cluster config:
"fabricSettings": [
{
"name": "NamingService",
"parameters": [
{
"name": "MaxOperationTimeout",
"value": "3600"
},
]
}
]
Also add:
"fabricSettings": [
{
"name": "EseStore",
"parameters": [
{
"name": "MaxCursors",
"value": "32768"
},
]
}
]
There are a couple of additional features which are currently rolling out. For these to be present and functional, you need to be sure that the client is at least version 2.4.28 and the runtime of your cluster is at least 5.4.157. If you're staying up to date, these should already be present in your environment.
For registration you can specify the -Async flag, which handles the operation asynchronously, reducing the required timeout to just the time necessary to send the command, not to process the application package. You can also query the status of the registration with Get-ServiceFabricApplicationType. 5.5 fixes some issues with these commands, so if they aren't working for you, you'll have to wait for that release to hit your environment.
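Putting those two suggestions together, the registration step might look like this (a sketch based on the cmdlets mentioned above; the image store path is the one from the question):
# register asynchronously so the call returns without waiting for provisioning to finish
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype' -Async
# then poll the provisioning status of the application type
Get-ServiceFabricApplicationType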
