I am trying to run several EMR steps in parallel.
I have seen other questions about this issue on SO and googled the options.
Things I have tried so far:
Configure CapacityScheduler with set of queues
Configure FairScheduler
Try to use AWS data pipelines with PARALLEL_FAIR_SCHEDULING, PARALLEL_CAPACITY_SCHEDULING
None of this worked for me. YARN created all the queues properly, and the jobs were submitted to different queues, but EMR still ran only a single step at a time (one step was RUNNING and the rest were PENDING).
I also saw in one of the answers that steps are meant to be sequential, but that you can put several jobs inside a single step. I couldn't find a way to do this, and according to the UI there is no option for it.
I haven't tried submitting jobs to the YARN cluster directly (Submit Hadoop Jobs Interactively). I want to submit jobs through the AWS API, and I haven't found a way to do that from the API.
This is my configuration for CapacityScheduler CapacityScheduler
This is steps configuration StepsConfiguration
Might be late, but hope this helps.
Spark provides an option that specifies whether the caller (the step) waits for the Spark application to complete after submission. If you set this value to false, the EMR step will submit the application and return immediately.
spark.yarn.submit.waitAppCompletion: "false"
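As a rough sketch of how that setting could be passed from the AWS API, the snippet below builds step definitions in the shape that boto3's `add_job_flow_steps` accepts. The bucket path, step names and cluster id are placeholders I made up, and I am assuming the jobs run in cluster deploy mode, which is where this setting takes effect.

```python
def build_spark_step(name, app_path, app_args=()):
    """Build an EMR step whose spark-submit returns right after submission."""
    args = [
        "spark-submit",
        "--deploy-mode", "cluster",
        # The setting from the answer: do not wait for the application to finish.
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        app_path,
        *app_args,
    ]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical job location and arguments, for illustration only.
steps = [
    build_spark_step(f"job-{i}", "s3://my-bucket/jobs/job.py", [str(i)])
    for i in range(3)
]

# Submitting needs boto3 and AWS credentials; the cluster id is a placeholder:
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=steps)
```

Keep in mind that with this setting the step finishes as soon as the application is submitted, so the step status no longer tells you whether the Spark application itself succeeded.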
I am designing a Dataproc workflow template with multiple Spark jobs. These Spark jobs run in sequence, one after the other. There could be scenarios where the workflow runs a few jobs successfully and fails on others. Is there a way to rerun just the failed jobs once I have fixed the issues that caused them to fail in the first place? Please note that I am not looking for a job retry mechanism. I want to re-run the workflow without running the jobs that already succeeded.
Dataproc Workflows do not support this use case.
Please take a look at Cloud Composer, an Apache Airflow-based orchestration service that is more flexible and should be able to satisfy your use case.
Three similar questions:
What will happen if one of my executors is lost?
What will happen if my driver is lost?
What will happen in case of a stage failure?
Are all of the above cases recoverable? If yes, how do I recover? Is there any option in "SparkConf" that can be set to prevent these failures?
Thanks.
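Spark does its own job scheduling. The DAGScheduler is part of Spark itself and resubmits failed stages, and failed tasks are retried on other executors. Your cluster manager (Standalone, YARN, Mesos) provides the resources and can re-launch the application if it fails.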
For example, if you use YARN, try tweaking spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts. Also, you can try to manually track jobs using the HTTP API: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
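To make the YARN suggestion a bit more concrete, here is a minimal sketch that assembles a spark-submit command with retry-related settings. The values and the application name `my_job.py` are placeholders, and `spark.task.maxFailures` is a setting I added on top of the ones mentioned above, so treat it as an assumption about what you may want to tune.

```python
import shlex


def spark_submit_command(app, max_app_attempts=2, max_task_failures=4):
    """Assemble a spark-submit command line with retry-related settings."""
    conf = {
        # How many times YARN may re-attempt the whole application.
        "spark.yarn.maxAppAttempts": max_app_attempts,
        # How many times a single task may fail before the job is given up.
        "spark.task.maxFailures": max_task_failures,
    }
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(spark_submit_command("my_job.py")))
```

Note that spark.yarn.maxAppAttempts should not be larger than yarn.resourcemanager.am.max-attempts, which is set on the YARN side rather than in SparkConf.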
If you want to recover from logical errors, you can try checkpointing (saving records to HDFS for later use): https://mallikarjuna_g.gitbooks.io/spark/content/spark-streaming/spark-streaming-checkpointing.html. (For really long and important pipelines I recommend saving your data in normal files instead of checkpoints!).
Configuring highly available clusters is a more complex task than tweaking one setting in SparkConf. You can try to implement different scenarios and come back with more detailed questions. As a first step, you can try to run everything on YARN.
We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to use it to create different workflows. I searched the web but did not find a step-by-step guide for scheduling Spark jobs with Airflow, or an option to run them on a different server that runs the master.
An answer to this would be highly appreciated.
Thanks in advance.
There are 3 ways you can submit Spark jobs using Apache Airflow remotely:
(1) Using SparkSubmitOperator: This operator expects you to have a spark-submit binary and a YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is that it also streams the logs from the spark-submit command's stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath
(2) Using SSHOperator: Use this operator to run bash commands on a remote server (using SSH protocol via paramiko library) like spark-submit. The benefit of this approach is you don't need to copy the hdfs-site.xml or maintain any file.
(3) Using SimpleHttpOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
I personally prefer SSHOperator :)
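For option (3), the REST call itself is small. Below is a minimal sketch of a Livy batch submission using only the Python standard library, outside of Airflow. The Livy host, the job path and the argument are placeholders I invented, and I am assuming Livy's default port 8998.

```python
import json
import urllib.request


def build_batch_request(livy_url, app_file, app_args=()):
    """Prepare the POST /batches request that asks Livy to run an application."""
    payload = {"file": app_file, "args": list(app_args)}
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request):
    """Send the request and return Livy's JSON reply (needs a reachable Livy server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical Livy endpoint and job location, for illustration only.
request = build_batch_request(
    "http://livy.example.com:8998", "hdfs:///jobs/my_job.py", ["2024-01-01"]
)
# batch = submit_batch(request)  # uncomment when a Livy server is available
```

In Airflow the same endpoint and JSON body would go into the HTTP operator's task definition, so the request is built here separately from the call that actually sends it.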
I just created a Google Cloud cluster (1 master and 6 workers) and by default Spark is configured.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my own machine.
How can I make sure that the master is using the workers to reduce the execution time?
You can check the Spark UI. If it is running on top of YARN, open the YARN UI and click on your application ID, which will open the Spark UI. The Executors tab also shows the node IP addresses.
Could you please share your spark-submit config?
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do that, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. In any case, check this link for the other parameters that spark-submit accepts.
For a Dataproc cluster (Google's Hadoop cluster) you have two options to check the job history, including the jobs that are still running:
From the command line on the master: yarn application -list. This option sometimes needs additional configuration. If you run into trouble, this link will be useful.
Through the UI. Dataproc lets you access the Spark Web UI, which makes monitoring easier. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use a SOCKS proxy.
Hope the information above helps you.
I am running a Spark job on an EC2 cluster, and I have a trigger that submits the job periodically. I do not want to submit a job if one is already running on the cluster. Is there any API that can give me this information?
Spark, and by extension Spark Streaming, offers an operational REST API at http://<host>:4040/api/v1
Consulting the status of the current application will give you the information sought.
Check the documentation: https://spark.apache.org/docs/2.1.0/monitoring.html#rest-api
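A minimal sketch of how a trigger could use that API before submitting is shown below. It reads the applications list and looks for an attempt that is not marked as completed. The base URL is a placeholder, and treating an unreachable endpoint as "nothing is running" is my assumption, since the port 4040 UI only exists while a driver is alive.

```python
import json
import urllib.error
import urllib.request


def has_running_attempt(applications):
    """Return True if any application attempt is not marked as completed."""
    return any(
        not attempt.get("completed", True)
        for app in applications
        for attempt in app.get("attempts", [])
    )


def job_is_running(base_url, timeout=5):
    """Ask the Spark REST API whether an application is currently running."""
    try:
        with urllib.request.urlopen(f"{base_url}/applications", timeout=timeout) as resp:
            return has_running_attempt(json.load(resp))
    except (urllib.error.URLError, OSError):
        # Assumption: no reachable driver UI means no job is running.
        return False


# Usage in the trigger (host is a placeholder):
# if not job_is_running("http://driver-host:4040/api/v1"):
#     submit_the_job()  # hypothetical function that launches spark-submit
```

If several applications can run at once, the driver UI ports go up from 4040 (4041, 4042 and so on), so a single-port check like this only covers the first one.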
You can also consult the UI to see the status.
For example, if you run locally, take a look at localhost:4040.