Spark streaming application does not progress - apache-spark

Hi I am working on deploying spark streaming application on spark cluster.
I want to use spark itself as resourcer manager.
Our applications are built in a such way that it takes input from configmap and jar from mounted volume because we do changes in jar regularly.
I have deployed spark cluster using bitnami spark helm chart.
with one master and two worker nodes.
I have created one kubernetes deployment and their I have given the spark master address
as spark-master.
My application will read data from kafka.
here is my spark master
and spark app
The problem that I am facing is - it stuck at first streaming batch
Input command in deployment pod -

Related

Applying rack awareness on Pyspark Structured Streaming running on Kubernetes and reading from AWS MSK

I have a Pyspark Structured Streaming application in the following setup:
Pyspark - version 3.0.1, running on AWS EKS using the Spark operator.
Kafka - running on AWS MSK with a cluster running Apache Kafka version of 2.8.1 and replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector is configured at the cluster configurations (i.e rack awareness is enabled on cluster side).
The flow:
The application reads from Kafka, performs batch processing in 5 minutes intervals, and writes to Kafka again. Both my MSK cluster and the ASG running the instances of my Spark executors are spread on the same AZ's.
I wish to leverage the rack awareness mechanism to allow the Spark executors to read from the closest replica.
I wish to do something like the following:
When spawning new executors on new pods, extract the broker.rack corresponding to the same AZ.
Inject that broker.rack as an environmental variable and initialize the Spark kafka consumer with a client.rack matching that broker.rack parameter.
Is this possible? Or any other solution?

Airflow - How to run a KubernetesPodOperator with a non exiting command

I'm trying to set up a DAG that will create a Spark Cluster in the first task, submit spark applications to the cluster in interim tasks, and have finally teardown the Spark Cluster in the last task.
The approach I'm attempting right now is to use KubernetesPodOperators to create Spark Master and Worker pods. The issue is that they run a spark daemon which never exits. The fact that the command called on the pod never exits means that those tasks gets stuck in airflow in a running phase. So, I'm wondering if there's a way to run the spark daemon and then continue on to the next tasks in the DAG?
The approach I'm attempting right now is to use KubernetesPodOperators to create Spark Master and Worker pods.
Apache Spark provides working support for executing jobs in a Kubernetes cluster. It delivers a driver that is capable of starting executors in pods to run jobs.
You don't need to create Master and Worker pods directly in Airflow.
Rather build a Docker image containing Apache Spark with Kubernetes backend.
An example Dockerfile is provided in the project.
Then submit the given jobs to the cluster in a container based off this image by using KubernetesPodOperator. The following sample job is adapted from documentation provided in Apache Spark to submit spark jobs directly to a Kubernetes cluster.
from airflow.operators.kubernetes_pod_operator import KubernetesPodOperator
kubernetes_full_pod = KubernetesPodOperator(
task_id='spark-job-task-ex',
name='spark-job-task',
namespace='default',
image='<prebuilt-spark-image-name>',
cmds=['bin/spark-submit'],
arguments=[
'--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>',
'--deploy-mode cluster',
'--name spark-pi',
' --class org.apache.spark.examples.SparkPi',
'--conf spark.executor.instances=5',
'--conf spark.kubernetes.container.image=<prebuilt-spark-image-name>',
'local:///path/to/examples.jar'
],
#...
)

How to specify core instance node when submitting Spark step to AWS EMR cluster

I am running multiple instances for my EMR cluster on AWS.
I have 2 instances of CORE nodes and 1 MASTER node
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.add_job_flow_steps
I'm using PySpark to submit the job but don't see anything on specifying the CORE node to run this on.
I thought this is done automatically (like round-robin style?)
Is there way to acheive this?
You always submit your step to the master not the core nodes. The master will then distribute the task to the cluster's workers (Spark executors in the core or task nodes).

Role of master in Spark standalone cluster

In a Spark standalone cluster, what is exactly the role of the master (a node started with start_master.sh script)?
I understand that is the node that receives the jobs from the submit-job.sh script, but what is its role when processing a job?
I'm seeing in the web UI that always delivers the job to a slave (a node started with start_slave.sh) and is not participating from processing, Am I right? In that case, should I also run also the script start_slave.sh in the same machine than master to to take advantage of its resources (cpu and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
Yarn
The above are cluster modes which offer resources to Spark Applications
Spark standalone mode is master slave architecture, we have Spark Master and Spark Workers. Spark Master runs in one of the cluster nodes and Spark Workers run on the Slave nodes of the cluster.
Spark Master (often written standalone Master) is the resource manager
for the Spark Standalone cluster to allocate the resources (CPU, Memory, Disk etc...) among the
Spark applications. The resources are used to run the Spark Driver and Executors.
Spark Workers report to Spark Master about resources information on the Slave nodes.
[apache-spark]
Spark standalone comes with its own resource manager. Think about Spark Master/Worker as YARN ResourceManager/NodeManager.

Connecting to Remote Spark Cluster for TitanDB SparkGraphComputer

I am attempting to leverage a Hadoop Spark Cluster in order to batch load a graph into Titan using the SparkGraphComputer and BulkLoaderVertex program, as specified here. This requires setting the spark configuration in a properties file, telling Titan where Spark is located, where to read the graph input from, where to store its output, etc.
The problem is that all of the examples seem to specify a local spark cluster through the option:
spark.master=local[*]
I, however, want to run this job on a remote Spark cluster which is on the same VNet as the VM where the titan instance is hosted. From what I have read, it seems that this can be accomplished by setting
spark.master=<spark_master_IP>:7077
This is giving me the error that all Spark masters are unresponsive, which disallows me from sending the job to the spark cluster to distribute the batch loading computations.
For reference, I am using Titan 1.0.0 and a Spark 1.6.4 cluster, which are both hosted on the same VNet. Spark is being managed by yarn, which also may be contributing to this difficulty.
Any sort of help/reference would be appreciated. I am sure that I have the correct IP for the spark master, and that I am using the right gremlin commands to accomplish bulk loading through the SparkGraphComputer. What I am not sure about is how to properly configure the Hadoop properties file in order to get Titan to communicate with a remote Spark cluster over a VNet.

Resources