How to tell apart Mesos frameworks launched by different commands/parameters - apache-spark

I am building a metrics collector to gather the running status of all the Spark jobs on a Mesos cluster. The Mesos API http://masterip/frameworks returns a lot of detail about all the frameworks; I then query http://slaveip/slave(1)/monitor/statistics on each slave to get per-framework details and correlate the two.
This works fine for most jobs, but some jobs behave differently depending on the parameters passed at submission. They appear under the same framework name in the Mesos GUI, so I cannot tell them apart.
Is there a way to get the full command that launched each job? Or is there any other way to tell them apart?
There are multiple instances with the same framework name because they are different Spark job instances.
When I connect to a Mesos slave, monitor/statistics doesn't show the full command with all its parameters, so I cannot tell which framework corresponds to which Spark job instance.
{
    "executor_id": "0",
    "executor_name": "Command Executor (Task: 0) (Command: sh -c ' \"/usr/local...')",
    "framework_id": "06ba8de8-7fc3-422d-9ee3-17dd9ddcb2ca-3157",
    "source": "0",
    "statistics": {
        "cpus_limit": 2.1,
        "cpus_system_time_secs": 848.689999999,
        "cpus_user_time_secs": 5128.78,
        "mem_limit_bytes": 4757389312,
        "mem_rss_bytes": 2243149824,
        "timestamp": 1522858776.20098
    }
},
Thanks
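The correlation step described in this question could be sketched roughly as follows in Python. This is only an illustration under assumptions: it assumes the master's frameworks response contains a list of entries with "id" and "name" fields, and that each monitor/statistics entry carries a "framework_id" as in the sample above. The function name and the sample data are hypothetical. The sketch does not recover the launch command; it only groups statistics by framework ID and flags the names that are shared by more than one framework, which are the ones that need some extra distinguishing attribute.

```python
from collections import defaultdict


def correlate(frameworks, statistics):
    """Join master framework records with slave statistics entries.

    frameworks: list of dicts with "id" and "name" (assumed shape).
    statistics: list of dicts with "framework_id" and "statistics"
                (shape taken from the sample in the question).
    Returns (stats_by_framework, ambiguous_names).
    """
    names = {fw["id"]: fw["name"] for fw in frameworks}

    # Collect every executor's statistics under its framework ID.
    stats_by_framework = defaultdict(list)
    for entry in statistics:
        stats_by_framework[entry["framework_id"]].append(entry["statistics"])

    # A name is ambiguous when more than one framework ID uses it.
    ids_by_name = defaultdict(list)
    for fw_id, name in names.items():
        ids_by_name[name].append(fw_id)
    ambiguous = {n: ids for n, ids in ids_by_name.items() if len(ids) > 1}

    return dict(stats_by_framework), ambiguous


# Hypothetical sample data; real responses contain many more fields.
frameworks = [
    {"id": "fw-1", "name": "etl-job"},
    {"id": "fw-2", "name": "etl-job"},
    {"id": "fw-3", "name": "report-job"},
]
statistics = [
    {"framework_id": "fw-1", "statistics": {"cpus_limit": 2.1}},
    {"framework_id": "fw-2", "statistics": {"cpus_limit": 4.1}},
]
stats, ambiguous = correlate(frameworks, statistics)
print(ambiguous)
```

Frameworks listed in the ambiguous result are the ones for which the name alone is not enough, so the collector would need another attribute to separate them.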

Related

Controlling the number of spark drivers running on Mesos

We have a cluster of 6 EC2 nodes in AWS (16 CPUs and 64 GB per node: 1 node running the Mesos master and 5 nodes running Mesos slaves). The Mesos dispatcher runs on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPUs and 10 GB per executor (same for the driver).
In one of our scheduled jobs, we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all of these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options. I would like to get some pointers from members in the spark/mesos community.
Some options I don't want to pursue: increasing the cluster size, asking the analysts to restructure their jobs to combine all the spark-submits into a single one, or switching to YARN/EMR (I actually tried this and ran into some messy queue problems there).
Option 1 : Using Mesos roles
I found some documentation on using quotas and roles to solve this, but I'm not sure about the following:
How do I create a Mesos role and update the resources made available to this role?
How do I set up separate roles for Spark drivers and Spark executors?
By default, all resources are in the * role. There is a flag called spark.mesos.role that I can set when running spark-submit, but I'm not sure how to create this role and ensure it is used only for executors.
Option 2 : Modifying the mesos cluster scheduler
When a spark-submit reaches the Mesos dispatcher, it adds the driver request to a WaitingQueue. When drivers fail while executing and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry schedule settings. When resources become available from Mesos, drivers from the PendingRetryQueue are scheduled first and those from the WaitingQueue next.
I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding the extra drivers to the PendingRetryQueue to be scheduled later. Currently, as I understand it, when there are more than 200 drivers in the WaitingQueue, the Mesos REST server returns a failure message and doesn't add the driver to the PendingRetryQueue.
Any help on implementing either of the options would be very helpful to me. Thanks in advance.
Update: I just saw that when I run spark-submit with a role, only the executors run in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post an update here and close this. Thanks.
As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role given by the 'spark.mesos.role' parameter. To control the resources available to each role, we can use quotas, guarantees, or reservations. We went with static reservations since they suited our requirements. Thanks.
Option 1 is the right one.
First set the dispatcher quota by creating a file such as dispatcher-quota.json:
$ cat dispatcher-quota.json
{
    "role": "dispatcher",
    "guarantee": [
        {
            "name": "cpus",
            "type": "SCALAR",
            "scalar": { "value": 5.0 }
        },
        {
            "name": "mem",
            "type": "SCALAR",
            "scalar": { "value": 5120.0 }
        }
    ]
}
Then push it to your Mesos master (leader) with:
$ curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
You now have a quota for the drivers.
Make sure your dispatcher is running with the right service role, and adjust it if needed. On DC/OS, use:
$ cat options.json
{
    "service": {
        "role": "dispatcher"
    }
}
$ dcos package install spark --options=options.json
Otherwise, feel free to share how you've deployed your dispatcher and I will provide a how-to guide.
That takes care of the drivers. Now do the same for the executors:
$ cat executor-quota.json
{
    "role": "executor",
    "guarantee": [
        {
            "name": "cpus",
            "type": "SCALAR",
            "scalar": { "value": 100.0 }
        },
        {
            "name": "mem",
            "type": "SCALAR",
            "scalar": { "value": 409600.0 }
        }
    ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements.
Then make sure the executors are launched with the correct role by passing:
--conf spark.mesos.role=executor \
My explanation is based on https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md. Don't hesitate to ask if this isn't enough.
This should do the job.
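If you would rather generate the two quota files than write them by hand, a small Python helper along these lines could do it. It is only a sketch: the function name is made up for illustration, and the body simply reproduces the JSON shape used in this answer, so check it against the quota format your Mesos version expects.

```python
import json


def build_quota_request(role, cpus, mem_mb):
    """Return a quota request body for one role, in the shape shown in this answer."""
    return {
        "role": role,
        "guarantee": [
            {"name": "cpus", "type": "SCALAR", "scalar": {"value": float(cpus)}},
            {"name": "mem", "type": "SCALAR", "scalar": {"value": float(mem_mb)}},
        ],
    }


# Values taken from the dispatcher and executor examples in this answer.
for role, cpus, mem_mb in [("dispatcher", 5, 5120), ("executor", 100, 409600)]:
    body = json.dumps(build_quota_request(role, cpus, mem_mb), indent=2)
    print("# " + role + "-quota.json")
    print(body)
```

Each printed body can be saved to a file and posted to the master with the curl command shown earlier.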

spark-submit in cluster deploy mode: get the application ID on the console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark's cluster deploy mode, but after being stuck for some days I still have no clear approach.
My use case: I have lots of Spark jobs submitted with the 'spark2-submit' command, and I need the application ID printed on the console once they are submitted. The jobs are submitted in cluster deploy mode. (In normal client mode, it does get printed.)
Constraints on the solution: I am not supposed to change the application code (it would take a long time because there are many applications running); I can only provide log4j properties or some custom code.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener. I can get the Spark application ID after the application finishes, but not on the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : "+t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to write to communicate with the main thread to print the app id to console.
        }
    }
}
3) I have enabled the Spark event log (spark.eventLog) and specified an HDFS directory for the event logs in the spark-submit command.
If anyone could help me in finding an approach to the solution, it would be really helpful. Or if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I finally found a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. This might help someone else trying to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread on the machine from which the user submits. I was passing the log4j configs to the driver and executors, but I had missed that the log4j config for the "Client" itself was missing.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
To clarify:
Client mode means the Spark driver runs on the same machine you ran spark-submit from.
Cluster mode means the Spark driver runs somewhere out on the cluster.
You mentioned that the ID is logged when you run the app in client mode and you can see it in the console. Your output is also logged when you run in cluster mode; you just can't see it because the driver is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
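Whichever route you take (client-side console logging or aggregated logs), the application ID still has to be pulled out of log text. A minimal Python sketch of that parsing step is below. It assumes the jobs run on YARN, where application IDs have the form application_<cluster timestamp>_<sequence number>. The function name and the sample log lines are made up for illustration, and real log lines may be worded differently.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"\bapplication_\d+_\d+\b")


def find_application_ids(log_text):
    """Return the distinct YARN application IDs in log_text, in first-seen order."""
    seen = []
    for match in APP_ID_PATTERN.finditer(log_text):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


# Hypothetical client output; the exact wording of real log lines may differ.
sample_output = """\
INFO Client: Submitting application application_1522858776200_0042 to ResourceManager
INFO Client: Application report for application_1522858776200_0042 (state: ACCEPTED)
"""
print(find_application_ids(sample_output))
```

The same function can be applied to the captured output of spark-submit or to log files collected from the cluster.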

Apache Spark - How to get app-id from submissionId over Rest API?

I want to track the progress of an application running on an Apache Spark standalone cluster. Is there any way to get the app-id for this purpose over the REST API? For example, a job is submitted to:
http://spark-master:6066/v1/submissions/create
The response contains a submission ID, e.g.:
"submissionId" : "driver-20171222112336-0208"
When the status of the driver is requested using that submission ID, e.g.:
http://spark-master:6066/v1/submissions/status/driver-20171222112336-0208
the response (after the driver state has changed to running) contains the worker host and port and a worker ID, e.g.:
"workerHostPort": "192.168.1.20:43817",
"workerId": "worker-20170707120806-192.168.1.20-43817"
After this, when the applications are listed at that worker, e.g.:
http://192.168.1.20:4040/api/v1/applications
the result is a list of applications (a JSON array with one element), e.g.:
[
{
"id": "app-20171222112343-0125",
"name": "test",
"attempts": [
..
At this step I get a list of applications rather than a single app-id, and with no information linking them to the driver.
http://spark-master:8080/json
also gives a list of active applications, again without any information about drivers, e.g.:
{
..
"activeapps": [..],
..
So, for a standalone Apache Spark cluster, how do I get the application ID [app-id] when a job is submitted to Spark via the hidden REST API?
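For reference, the chain of requests described in this question can be sketched in Python as below. This is not an answer to the mapping problem: it only rebuilds the URLs from the sample responses, and it relies on the question's own assumption that the driver UI is reachable on port 4040 of the worker host (a different port may be used if several drivers share a host). The function names are hypothetical.

```python
def status_url(master_host, submission_id, port=6066):
    """URL for querying a submission's status on the standalone REST server."""
    return "http://{}:{}/v1/submissions/status/{}".format(master_host, port, submission_id)


def applications_url(status_response, ui_port=4040):
    """Build the application-list URL from a status response.

    Returns None while the response has no workerHostPort yet.
    The UI port is an assumption taken from the question.
    """
    host_port = status_response.get("workerHostPort")
    if not host_port:
        return None
    host = host_port.rsplit(":", 1)[0]
    return "http://{}:{}/api/v1/applications".format(host, ui_port)


def application_ids(applications):
    """Extract the "id" field from each entry of the applications list."""
    return [app["id"] for app in applications]


# Sample values copied from the question.
status = {
    "workerHostPort": "192.168.1.20:43817",
    "workerId": "worker-20170707120806-192.168.1.20-43817",
}
print(status_url("spark-master", "driver-20171222112336-0208"))
print(applications_url(status))
print(application_ids([{"id": "app-20171222112343-0125", "name": "test"}]))
```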

How to leverage a spark cluster from a web app?

A lot of people have asked this question, but there is no clear answer beyond links and references, and most of those are not recent. The question is this:
I have a web app that needs to leverage a Spark cluster to run a spark-sql query. My understanding is that the submit-job script is asynchronous, so it won't work here. How do I leverage Spark in such a setup? Can I just write code in the web app as I would in a self-contained Spark application, i.e. create a context, set the master URL, and do what I need to do? Will this work in a web app? If yes, then when would I need the job server that provides REST APIs to submit jobs?
Library for launching Spark applications.
This library allows applications to launch Spark programmatically. There's only one entry point to the library - the SparkLauncher class.
To launch a Spark application, just instantiate a SparkLauncher and configure the application to run. For example:
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
References:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/launcher/package-summary.html
I think the options are:
Through a REST API such as Livy (Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere) or a Spark server (REST APIs). See how they connect to Spark interactively using a kernel:
https://www.youtube.com/watch?v=TD1J7MzYcFo&feature=youtu.be&t=33m19s
https://developer.ibm.com/open/apache-toree/
Through JDBC (running via the Thrift JDBC/ODBC server).
Through SSH: SSH to the cluster, do a spark-submit through YARN, and wait for the YARN status. YARN gives you an application ID, and you can track the application's status with the yarn application status command.
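To make the Livy option a little more concrete, here is a rough Python sketch of preparing a batch submission from a web app. It assumes a Livy server is available and that batches are created by POSTing a JSON body with "file" and "className" fields to its /batches endpoint, with 8998 as the port. The host name and helper names are hypothetical, and the jar path and class are reused from the SparkLauncher example in the other answer. The sketch only builds the request and does not send it.

```python
import json


def build_livy_batch(app_resource, main_class, args=(), conf=None):
    """Build the JSON body for a Livy batch submission (assumed field names)."""
    body = {"file": app_resource, "className": main_class}
    if args:
        body["args"] = list(args)
    if conf:
        body["conf"] = dict(conf)
    return body


def batches_url(livy_host, port=8998):
    """URL of the batch endpoint; host and port are assumptions for illustration."""
    return "http://{}:{}/batches".format(livy_host, port)


# Jar path and class reused from the SparkLauncher example; "livy-host" is made up.
payload = build_livy_batch(
    "/my/app.jar",
    "my.spark.app.Main",
    conf={"spark.driver.memory": "2g"},
)
print(batches_url("livy-host"))
print(json.dumps(payload, indent=2))
```

The web app would send this body with any HTTP client and then poll the batch it gets back, which avoids blocking a request thread on a spark-submit process.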

Stopping a Running Spark Application

I'm running a Spark cluster in standalone mode.
I've submitted a Spark application in cluster mode using the options:
--deploy-mode cluster --supervise
so that the job is fault tolerant.
Now I need to keep the cluster running but stop the application from running.
Things I have tried:
Stopping the cluster and restarting it, but the application resumes execution when I do that.
Running kill -9 on a daemon named DriverWrapper, but the job resumes again after that.
Removing temporary files and directories and restarting the cluster, but the job resumes again.
So the running application is really fault tolerant.
Question:
Based on the above scenario, can someone suggest how I can stop the job, or what else I can try, to stop the application while keeping the cluster running?
Something just occurred to me: calling sparkContext.stop() should do it, but that requires a bit of work in the code. That is OK, but can you suggest any other way that needs no code change?
If you wish to kill an application that is failing repeatedly, you may do so through:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
You can find the driver ID through the standalone Master web UI at http://<master url>:8080.
From Spark Doc
Revisiting this because I wasn't able to use the existing answer without debugging a few things.
My goal was to programmatically kill a driver that runs persistently, once a day, deploy any updates to the code, and then restart it, so I won't know ahead of time what my driver ID is. It took me some time to figure out that you can only kill a driver if you submitted it with the --deploy-mode cluster option. It also took me some time to realize that there is a difference between an application ID and a driver ID. While you can easily correlate an application name with an application ID, I have yet to find a way to get the driver ID through the API endpoints and correlate it with either an application name or the class you are running. So while spark-class org.apache.spark.deploy.Client kill <master url> <driver ID> works, you need to make sure you are deploying your driver in cluster mode and are using the driver ID, not the application ID.
Additionally, Spark provides a submission endpoint by default at http://<spark master>:6066/v1/submissions, and you can use http://<spark master>:6066/v1/submissions/kill/<driver ID> to kill your driver.
Since I wasn't able to find the driver ID for a specific job from any API endpoint, I wrote a Python web scraper that gets the info from the basic Spark master web page on port 8080 and then kills the driver using the endpoint on port 6066. I'd prefer to get this data in a supported way, but this is the best solution I could find.
#!/usr/bin/python
# Written against Selenium 3.x: PhantomJS and find_elements_by_xpath were removed in Selenium 4.
import sys, re, requests, json
from selenium import webdriver

# Main classes to kill, passed as command-line arguments (skip the script name).
classes_to_kill = sys.argv[1:]
spark_master = 'masterurl'
driver = webdriver.PhantomJS()
driver.get("http://" + spark_master + ":8080/")
# Walk the "Running Drivers" table and match each row against the requested classes.
for running_driver in driver.find_elements_by_xpath("//*/div/h4[contains(text(), 'Running Drivers')]"):
    for driver_id in running_driver.find_elements_by_xpath("..//table/tbody/tr/td[contains(text(), 'driver-')]"):
        for class_to_kill in classes_to_kill:
            right_class = driver_id.find_elements_by_xpath("../td[text()='" + class_to_kill + "']")
            if len(right_class) > 0:
                driver_to_kill = re.search(r'^driver-\S+', driver_id.text).group(0)
                print("Killing " + driver_to_kill)
                # Kill the driver through the REST submission endpoint on port 6066.
                result = requests.post("http://" + spark_master + ":6066/v1/submissions/kill/" + driver_to_kill)
                print(json.dumps(json.loads(result.text), indent=4))
driver.quit()
According to https://community.cloudera.com/t5/Support-Questions/What-is-the-correct-way-to-start-stop-spark-streaming-jobs/td-p/30183, if your master is YARN you can stop the application with:
yarn application -list
yarn application -kill application_id

Resources