I want to submit a Spark job to an Azure HDInsight cluster from Airflow. I don't want to use Livy because it doesn't accumulate the logs in Airflow. Is it possible to submit the job remotely? SSH is one option, but if the job is long-running the connection might break. Is there any other option?
Note: the Airflow cluster is remote; it is not colocated with the Spark cluster.
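Building on the SSH option mentioned above: if the job is submitted with --deploy-mode cluster, the driver runs on YARN, so the SSH session only has to survive the submission itself rather than the whole job. Below is a minimal sketch of that idea with Airflow's SSHOperator, assuming the apache-airflow-providers-ssh package is installed, an Airflow connection named hdinsight_ssh points at the HDInsight head node, and the job jar already sits on the cluster (the connection name, class, and paths are placeholders).

```python
# A minimal sketch, assuming the apache-airflow-providers-ssh package is installed
# and an Airflow connection "hdinsight_ssh" points at the HDInsight head node.
# With --deploy-mode cluster the driver runs inside YARN, so the SSH channel only
# needs to stay open for the submission, not for the lifetime of the job.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="hdinsight_spark_submit",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = SSHOperator(
        task_id="submit_spark_job",
        ssh_conn_id="hdinsight_ssh",           # assumed connection name
        command=(
            "spark-submit "
            "--master yarn "
            "--deploy-mode cluster "
            "--class com.example.MyJob "       # hypothetical main class
            "/home/sshuser/jobs/my-job.jar"    # hypothetical jar path
        ),
    )
```

Because the driver runs on YARN, a follow-up task can pull the aggregated logs back with yarn logs -applicationId <appId> instead of keeping the SSH connection open for the duration of the job.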
Related
I have deployed Spark on a Kubernetes cluster, following this reference: https://testdriven.io/blog/deploying-spark-on-kubernetes/
But the Spark workers are not able to join the Spark master service; they cannot connect to it. The expected behaviour is that each worker connects to the Spark master service. Please comment below if you have any solutions.
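A first debugging step (not a fix, just a way to narrow the problem down) is to check from inside a worker pod whether the master Service's DNS name resolves and whether its port is reachable. The sketch below assumes the Service is called spark-master in the same namespace and exposes the default standalone port 7077; both names are placeholders.

```python
# Quick connectivity probe, assuming a Service named "spark-master" and the
# default standalone master port 7077. Run it inside a worker pod (kubectl exec)
# to separate DNS failures from port/NetworkPolicy failures.
import socket

MASTER_HOST = "spark-master"   # hypothetical Service name
MASTER_PORT = 7077             # default Spark standalone master port

try:
    ip = socket.gethostbyname(MASTER_HOST)
    print(f"DNS OK: {MASTER_HOST} -> {ip}")
except socket.gaierror as exc:
    raise SystemExit(f"DNS lookup failed: {exc}")

with socket.create_connection((MASTER_HOST, MASTER_PORT), timeout=5):
    print(f"TCP OK: reached {MASTER_HOST}:{MASTER_PORT}")
```

If both checks pass, the next thing to compare is the master URL the workers are started with (it should match spark://<service-name>:7077 exactly) against the host the master actually binds to.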
We are planning to deploy all Spark batch and streaming jobs on Kubernetes (as cluster services). I want to know whether we can deploy all Spark jobs using Spark Operator manifest files on a production Kubernetes cluster.
You can take a look at https://cloud.google.com/dataproc/docs/concepts/jobs/dataproc-gke wherein you can run a Dataproc cluster on GKE. It is a managed service by Google Cloud.
Is it possible to deploy Spark code in the Azure cloud without the YARN component? Thanks in advance.
Yes, you can deploy an Apache Spark cluster in Azure HDInsight without YARN.
Spark clusters in HDInsight include the following components that are available on the clusters by default.
1) Spark Core. Includes Spark Core, Spark SQL, the Spark Streaming APIs, GraphX, and MLlib.
2) Anaconda
3) Livy
4) Jupyter notebook
5) Zeppelin notebook
Spark clusters on HDInsight also provide an ODBC driver for connectivity from BI tools such as Microsoft Power BI and Tableau.
Refer to the following sites for more information:
Create an Apache Spark cluster in Azure HDInsight
Introduction to Spark on HDInsight
I don't think it is possible to deploy an HDInsight cluster without YARN. Refer to the HDInsight documentation:
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-hadoop-introduction
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-component-versioning
YARN is the resource manager for Hadoop. Is there any particular reason you would not want to use YARN while working with an HDInsight Spark cluster?
If you want to use standalone mode, you can change the master URL when submitting the job with the spark-submit command (see the sketch after the links below).
I have some examples in my repo with spark-submit both in local mode and on an HDInsight cluster:
https://github.com/NileshGule/learning-spark
You can refer to
local mode: https://github.com/NileshGule/learning-spark/blob/master/src/main/java/com/nileshgule/movielens/MovieLens.md
HDInsight Spark cluster: https://github.com/NileshGule/learning-spark/blob/master/Azure.md
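As a rough illustration of changing the master URL at submit time, the sketch below wraps spark-submit in Python; the host name, class, and jar path are placeholders, and 7077 is only the default standalone master port.

```python
# Rough sketch of switching the --master URL at submit time. Host, class, and
# jar path are placeholders; 7077 is the default Spark standalone master port.
import subprocess

MASTER_URL = "spark://spark-master-host:7077"  # standalone cluster
# MASTER_URL = "yarn"                          # on an HDInsight/YARN cluster
# MASTER_URL = "local[*]"                      # local run

subprocess.run(
    [
        "spark-submit",
        "--master", MASTER_URL,
        "--class", "com.example.MyJob",   # hypothetical main class
        "my-job.jar",                     # hypothetical jar
    ],
    check=True,
)
```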
I am new to AWS EC2 and need to know how I can submit my Spark job to a Spark cluster on AWS EC2. In Azure, for example, we can submit the job directly from IntelliJ IDEA with the Azure plugin.
You can submit a Spark job easily with the spark-submit command. Refer to http://spark.apache.org/docs/latest/submitting-applications.html
Options:
1) Log in to the master or another driver/gateway node and use spark-submit to submit the job through YARN, Mesos, etc.
2) Use spark-submit in cluster deploy mode from any machine with sufficient port and firewall access (this may require configuration, such as the client config files from Cloudera Manager for a CDH cluster).
3) Use a server setup like Livy (open source through Cloudera; Microsoft Azure HDInsight uses and contributes to it) or perhaps the Thrift server. Livy (Livy.io) is a nice, simple REST service that also has language APIs for Scala/Java to make it extra easy to submit jobs (and run interactive, persisted sessions!).
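To give a feel for option 3, submitting a batch through Livy is a single REST call, and the same API can be polled for state and logs afterwards. A minimal sketch, assuming Livy listens on its default port 8998 and the jar sits at a storage path the cluster can read (the host, jar, class, and arguments are placeholders):

```python
# Minimal sketch of Livy's batch REST API, assuming Livy on its default port 8998
# and a jar at a storage path the cluster can read. Host, jar, class, and args
# are placeholders.
import time
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint

resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "wasbs://container@account.blob.core.windows.net/jobs/my-job.jar",  # placeholder
        "className": "com.example.MyJob",                                           # placeholder
        "args": ["--date", "2023-01-01"],
    },
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch until it reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(10)

# Pull the last log lines back, e.g. to surface them in a scheduler's task log.
log = requests.get(f"{LIVY_URL}/batches/{batch_id}/log", params={"size": 100}).json()
print("\n".join(log.get("log", [])))
print(f"final state: {state}")
```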
I'd like to run spark-submit remotely from a local machine to submit a job to a Spark cluster (cluster mode). What method do I use to authenticate myself to the cluster?
You need to be on a Spark client machine to submit a job. Once you are on the Spark client and submit a job, you are authenticated by the user ID under which you submit it; if that user ID has permissions on the cluster, authentication is taken care of automatically. If Kerberos is enabled, you first need to generate a keytab for that user and session to authenticate with the cluster.
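A minimal sketch of that Kerberos step from a client (edge) node, assuming a keytab has already been generated; the principal, keytab path, class, and jar are placeholders. For long-running YARN jobs, spark-submit can also be given --principal and --keytab so it can renew tickets itself.

```python
# Hedged sketch: obtain a Kerberos ticket from a keytab, then submit to YARN.
# The principal, keytab path, class, and jar are placeholders.
import subprocess

PRINCIPAL = "etluser@EXAMPLE.COM"        # hypothetical Kerberos principal
KEYTAB = "/home/etluser/etluser.keytab"  # hypothetical keytab path

# Get a ticket for this session.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

# Submit as usual; --principal/--keytab let Spark renew the ticket for long jobs.
subprocess.run(
    [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--principal", PRINCIPAL,
        "--keytab", KEYTAB,
        "--class", "com.example.MyJob",  # placeholder main class
        "my-job.jar",                    # placeholder jar
    ],
    check=True,
)
```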