Apache Spark - How to get app-id from submissionId over REST API?

I want to get the progress of an application running on an Apache Spark standalone cluster. Is there any way to get the app-id for this purpose over the REST API? For example, a job is submitted via:
http://spark-master:6066/v1/submissions/create
The response contains a submission ID, e.g.:
"submissionId" : "driver-20171222112336-0208"
When the status of the driver is requested using that submission ID, e.g.:
http://spark-master:6066/v1/submissions/status/driver-20171222112336-0208
the response (after the driver state has changed to RUNNING) contains the worker host and port and a worker ID, e.g.:
"workerHostPort": "192.168.1.20:43817",
"workerId": "worker-20170707120806-192.168.1.20-43817"
After that, the applications running at that worker host are listed, e.g.:
http://192.168.1.20:4040/api/v1/applications
This returns a list of applications (a JSON array with one element), e.g.:
[
  {
    "id": "app-20171222112343-0125",
    "name": "test",
    "attempts": [
      ..
At this step, a list of applications is returned instead of a single app-id, and there is no information relating them to the driver.
http://spark-master:8080/json
also gives a list of active applications, again without any information relating them to drivers, e.g.:
{
  ..
  "activeapps": [..],
  ..
So, for a standalone Apache Spark cluster, how can I get the application ID (app-id) when a job is submitted to Spark via the hidden REST API?
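For reference, here is the flow above as a small script; a minimal sketch with the Python requests library (hosts, ports, and the example IDs are the ones from the URLs above), which ends exactly where the question does: a list of app IDs with nothing tying them back to the driver:
import requests

MASTER = "http://spark-master:6066"
submission_id = "driver-20171222112336-0208"   # returned by POST /v1/submissions/create

# 1) Ask the master for the driver status; once the driver is RUNNING the response
#    includes workerHostPort and workerId.
status = requests.get(MASTER + "/v1/submissions/status/" + submission_id).json()
worker_host = status["workerHostPort"].split(":")[0]

# 2) Query the applications endpoint on that host (port 4040 as in the example above);
#    it returns app IDs such as "app-20171222112343-0125", but nothing in the response
#    links an app ID back to the submission/driver ID.
apps = requests.get("http://" + worker_host + ":4040/api/v1/applications").json()
print([app["id"] for app in apps])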

Related

Controlling the number of spark drivers running on Mesos

We have a cluster with 6 EC2 nodes in AWS (16 CPUs, 64 GB per node; 1 node running the Mesos master and 5 nodes running Mesos slaves). The Mesos dispatcher is running on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPUs and 10 GB per executor (same for the driver).
In one of our scheduled jobs, we have a scenario where we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options and would like to get some pointers from members of the Spark/Mesos community.
Some options I don't want to get into are: increasing the cluster size, asking the analysts to change their job structure to combine all spark-submits into a single one, or switching to YARN/EMR (I actually tried this and ran into some messy queue problems there).
Option 1 : Using Mesos roles
I found some documentation on the use of quotas and roles to solve this, but I'm not sure about the following:
How do I create a Mesos role and update the resources to be made available to this role?
How do I set up separate roles for Spark drivers and Spark executors?
By default all resources are in the * role. There is a flag called spark.mesos.role that I can set while doing spark-submit, but I'm not sure how to create this role and ensure it is used only for executors.
Option 2: Modifying the Mesos cluster scheduler
When a spark-submit reaches the Mesos dispatcher, it adds the driver request to a WaitingQueue. When drivers fail while executing and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry schedule settings. When resources become available from Mesos, the drivers from the PendingRetryQueue are scheduled first and the WaitingQueue is scheduled next.
I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding those drivers to the PendingRetryQueue and scheduling them to run later. Currently, as per my understanding, when there are more than 200 drivers in the WaitingQueue the Mesos REST server sends a failed message and doesn't add them to the PendingRetryQueue.
Any help on implementing either of the options would be very helpful to me. Thanks in advance.
Update: I just saw that when I do a spark-submit with a role, only the executors run in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post my update here and close this. Thanks
As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role provided by the 'spark.mesos.role' parameter. To control the resources available for each role, we can use quotas, guarantees, or reservations. We went ahead with static reservations since they suited our requirements. Thanks.
Option 1 is the right one.
First, set a quota for the dispatcher by creating a file like dispatcher-quota.json:
cat dispatcher-quota.json
{
  "role": "dispatcher",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 5.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 5120.0 }
    }
  ]
}
Then push it to your Mesos master (leader) with:
curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
Now you have a quota for the drivers.
Make sure your dispatcher is running with the right service role set; adjust it if needed. If you are on DC/OS, use:
$ cat options.json
{
  "service": {
    "role": "dispatcher"
  }
}
$ dcos package install spark --options=options.json
Otherwise, feel free to share how you've deployed your dispatcher and I will provide a how-to guide.
That's it for the drivers. Now let's handle the executors the same way:
$ cat executor-quota.json
{
  "role": "executor",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 100.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 409600.0 }
    }
  ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements.
Then make sure to launch the executors with the correct role by providing:
--conf spark.mesos.role=executor \
The source of my explanation is https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md; don't hesitate to ask if it's not enough.
This should do the trick.

How to pull Spark job client logs, for jobs submitted using the Apache Livy batches POST method, using Airflow

I am working on submitting Spark jobs using the Apache Livy batches POST method.
This HTTP request is sent using Airflow. After submitting the job, I track its status using the batch ID.
I want to show the driver (client) logs in the Airflow logs to avoid going to multiple places (Airflow and Apache Livy/Resource Manager).
Is this possible to do using Apache Livy REST API?
Livy has endpoints to get the logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python functions like the ones shown below to get the logs (these are methods on a custom Airflow operator/hook, so self.http is set up in its constructor):
import json

from airflow.hooks.http_hook import HttpHook  # Airflow 1.x import path

# On the custom operator/hook, set up the HTTP hook once; http_conn_id is the Airflow
# connection pointing at the Livy server.
self.http = HttpHook("GET", http_conn_id=http_conn_id)

def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
    if not extra_options:
        extra_options = {}
    self.http.method = method
    response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
    return response

def _get_batch_session_logs(self, batch_id):
    # GET /batches/{batchId}/log returns the driver (client) log lines for the batch
    method = "GET"
    endpoint = "batches/" + str(batch_id) + "/log"
    response = self._http_rest_call(method=method, endpoint=endpoint)
    # return response.json()
    return response
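As a usage sketch (assuming the methods above live on a custom Airflow operator and that the batch has already reached a terminal state), the fetched log lines can simply be logged from execute() so they show up in the Airflow task log; the method placement and batch_id attribute are illustrative, not a fixed API:
def execute(self, context):
    # ... submit the batch via POST /batches and wait for it to finish ...
    response = self._get_batch_session_logs(self.batch_id)
    # Livy wraps the log lines in a "log" array in the JSON response
    for line in response.json().get("log", []):
        self.log.info(line)   # anything logged here is captured in the Airflow task log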
Livy exposes its REST API in two modes: session and batch. In your case, since we assume you are not using sessions, you are submitting using batches. You can post your batch with a curl command along these lines (the payload fields depend on your application):
curl -X POST -H "Content-Type: application/json" -d '{"file": "<path-to-your-app>"}' http://livy-server-IP:8998/batches
Once you have submitted the job, you will get the batch ID in the response. Then you can fetch the logs with:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
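The same flow can also be scripted; a minimal sketch with the Python requests library (the Livy host is the placeholder from the curl commands above, and the application file path, class name, and polling interval are assumptions):
import time
import requests

LIVY = "http://livy-server-IP:8998"  # placeholder Livy host from the examples above

# Submit the batch (POST /batches); "file" must point at an application reachable by Livy.
batch = requests.post(
    LIVY + "/batches",
    json={"file": "hdfs:///apps/my-spark-app.jar", "className": "com.example.Main"},
    headers={"Content-Type": "application/json"},
).json()
batch_id = batch["id"]

# Poll the batch state until it leaves the starting/running states.
while True:
    state = requests.get(LIVY + "/batches/{}/state".format(batch_id)).json()["state"]
    if state not in ("starting", "running"):
        break
    time.sleep(30)

# Fetch the driver/client log lines (GET /batches/{batchId}/log).
log = requests.get(LIVY + "/batches/{}/log".format(batch_id), params={"size": 1000}).json()
for line in log.get("log", []):
    print(line)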
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFlow) from the AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job and tracks its status every 30 seconds (configurable), and it also surfaces the Spark logs at the end of the job in the Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs at one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).

spark-submit in cluster deploy mode: get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but I am still clueless about the approach after being stuck for some days.
My use case: I have lots of Spark jobs submitted using the 'spark2-submit' command, and I need the application ID printed to the console once they are submitted. The jobs are submitted in cluster deploy mode. (In normal client mode, it does get printed.)
Constraints for the solution: I am not supposed to change application code (that would take a long time, since there are many applications running); I can only provide log4j properties or some custom coding.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener; I am able to get the Spark application ID after the application finishes, but not to the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    // Fires once the application has ended; look for the "main" thread among all live threads
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            // getjobUId is obtained elsewhere in the listener
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to the console
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified an HDFS directory for the event logs in the spark-submit command.
If anyone could help me in finding an approach to the solution, it would be really helpful. Or if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I was finally able to get a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread on the machine the user submits from. I was passing the log4j configs to the driver and executors, but missed that the log4j config for the "Client" itself was missing.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
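With the Client logs now reaching the console, a small wrapper can also capture the application ID for downstream use; a minimal sketch in Python (the regex and the assumption that yarn.Client prints a line containing application_<timestamp>_<id> are mine, not part of the original solution):
import re
import subprocess
import sys

# Run spark2-submit exactly as before, but tee its output and pick out the YARN application ID.
cmd = ["spark2-submit"] + sys.argv[1:]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

app_id = None
for line in proc.stdout:
    sys.stdout.write(line)                        # keep the normal spark-submit output visible
    match = re.search(r"application_\d+_\d+", line)
    if match and app_id is None:
        app_id = match.group(0)
        print("APPLICATION_ID=" + app_id)         # value the calling job/script can capture

proc.wait()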
To clarify:
client mode means the Spark driver is running on the same machine you ran spark submit from
cluster mode means the Spark driver is running out on the cluster somewhere
You mentioned that it is getting logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode; you just can't see it because it is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.

How to tell apart Mesos frameworks launched by different commands/parameters

I am building a metrics collector to gather the running status of all the Spark jobs running on the cluster. The Mesos API http://masterip/frameworks returns a lot of details about all the frameworks, and I then call http://slaveip/slave(1)/monitor/statistics to get each framework's details from each slave and correlate them.
This works fine for most of the jobs, but I have some jobs which behave differently depending on the parameters they were submitted with. They show up under the same framework name in the Mesos GUI, and I cannot tell them apart.
Is there a way to get the full command that launched each job? Or any other idea about how to tell them apart?
You can see there are multiple instances with the same framework name, as they are different Spark job instances.
When I connect to the Mesos slave, monitor/statistics doesn't show the full command with all the parameters, so I cannot tell which framework correlates to which Spark job instance.
{
  "executor_id": "0",
  "executor_name": "Command Executor (Task: 0) (Command: sh -c '\"/usr/local...')",
  "framework_id": "06ba8de8-7fc3-422d-9ee3-17dd9ddcb2ca-3157",
  "source": "0",
  "statistics": {
    "cpus_limit": 2.1,
    "cpus_system_time_secs": 848.689999999,
    "cpus_user_time_secs": 5128.78,
    "mem_limit_bytes": 4757389312,
    "mem_rss_bytes": 2243149824,
    "timestamp": 1522858776.20098
  }
},
Thanks
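For reference, a minimal sketch of the correlation described above, using the Python requests library (the ports 5050/5051 and the JSON shapes are assumptions based on Mesos defaults and the excerpt above; it does not yet solve the problem of telling identically named frameworks apart):
import requests

MASTER = "http://masterip:5050"       # assumed Mesos master host:port
SLAVES = ["http://slaveip:5051"]      # assumed Mesos agent hosts:ports

# Framework-level info (id, name, ...) from the master.
frameworks = {f["id"]: f
              for f in requests.get(MASTER + "/frameworks").json().get("frameworks", [])}

# Per-executor resource statistics from each slave, joined back to a framework by framework_id.
for slave in SLAVES:
    for entry in requests.get(slave + "/slave(1)/monitor/statistics").json():
        fw = frameworks.get(entry["framework_id"], {})
        stats = entry.get("statistics", {})
        print(fw.get("name"), entry.get("executor_id"), stats.get("cpus_user_time_secs"))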

Get driver id in Spark Standalone cluster in cluster mode

How can I get the driver ID of an already submitted/running application programmatically, or is there an endpoint I am missing?
I have gone through the Spark hidden REST API, which offers only three endpoints: /create, /status, and /kill. Now, to kill the application, I need the driver ID / submission ID of the application.
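For reference, a minimal sketch of the /status and /kill calls once a submission ID is known, using the endpoint shapes from the first question above (host, port, and the example driver ID are placeholders; this still presupposes the driver ID the question is asking for):
import requests

MASTER = "http://spark-master:6066"          # standalone cluster's submission REST endpoint
driver_id = "driver-20171222112336-0208"     # example submission ID from above

# Check the driver's state: GET /v1/submissions/status/{submissionId}
status = requests.get(MASTER + "/v1/submissions/status/" + driver_id).json()
print(status.get("driverState"))

# Kill the driver: POST /v1/submissions/kill/{submissionId}
kill = requests.post(MASTER + "/v1/submissions/kill/" + driver_id).json()
print(kill.get("success"))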