Dataflow SparkPipelineRunner - any available examples? - apache-spark

Does anybody have a working example (or examples) of using the Cloudera SparkPipelineRunner to execute (on a cluster) a pipeline written using the Dataflow SDK?
I can't see any in the Dataflow or Spark-Dataflow github repos.
We're trying to evaluate if running our pipelines on a Spark cluster will give us any performance gains over running them on the GCP Dataflow service.

There are examples for using the Beam Spark Runner at the Beam site: https://beam.apache.org/documentation/runners/spark/.
The dependency you want is:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-spark</artifactId>
<version>0.3.0-incubating</version>
</dependency>
To run against a standalone cluster, simply run:
spark-submit --class com.beam.examples.BeamPipeline --master spark://HOST:PORT target/beam-examples-1.0.0-shaded.jar --runner=SparkRunner
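For reference, here is a minimal sketch of what such a pipeline class might look like. The class name mirrors the com.beam.examples.BeamPipeline referenced in the spark-submit command above, but the transforms and file paths are placeholders, and the code targets the current Beam Java API rather than the old 0.3.0-incubating one:

package com.beam.examples;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamPipeline {
    public static void main(String[] args) {
        // --runner=SparkRunner (and any other pipeline options) are picked up from args.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("hdfs:///tmp/input.txt"))      // placeholder input path
         .apply("CountLines", Count.<String>globally())
         .apply("FormatResult", MapElements.into(TypeDescriptors.strings())
                                           .via((Long c) -> "line count: " + c))
         .apply("WriteResult", TextIO.write().to("hdfs:///tmp/line-count"));   // placeholder output path

        p.run().waitUntilFinish();
    }
}

Build it into a shaded jar (like the beam-examples-1.0.0-shaded.jar above) so the Beam SDK and Spark runner classes are available on the cluster.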

Related

In Apache Beam's SparkRunner, how does the DOCKER environment_type affect an existing Spark cluster?

In Apache Beam's Spark documentation, it says that you can specify --environment_type="DOCKER" to customize the runtime environment:
The Beam SDK runtime environment can be containerized with Docker to
isolate it from other runtime systems. To learn more about the
container environment, read the Beam SDK Harness container contract.
...
You may want to customize container images for many
reasons, including:
Pre-installing additional dependencies
Launching third-party software in the worker environment
Further customizing the execution environment
...
# When running batch jobs locally, we need to reuse the container.
python -m apache_beam.examples.wordcount \
--input=/path/to/inputfile \
--output=path/to/write/counts \
--runner=SparkRunner \
--environment_cache_millis=10000 \
--environment_type="DOCKER" \
--environment_config="${IMAGE_URL}"
If you submit this job to an existing Spark cluster, what does the docker image do to the Spark cluster? Does it run all the Spark executors with that Docker image? What happens to the existing Spark executors if there are any? What about the Spark driver? What is the mechanism used (Spark Driver API?) to distribute the Docker image to the machines?
TL;DR: The answer to this question is in this picture (it's taken from my talk at Beam Summit 2020 about running cross-language pipelines on Beam).
For example, if you run your Beam pipeline with the Beam Portable Runner on a Spark cluster, then the Beam Portable Spark Runner will translate your job into a normal Spark job and then submit/run it on an ordinary Spark cluster. So it will use the driver/executors of your Spark cluster, as usual.
As you can see from this picture, the Docker container is used just as part of the SDK Harness to execute DoFn code independently of the "main" language of your pipeline (for example, to run some Python code as part of a Java pipeline).
The only requirement, iirc, is that your Spark executors should have Docker installed to run the Docker container(s). Also, you can pre-fetch the Beam SDK Docker images onto the Spark executor nodes to avoid pulling them the first time your job runs.
An alternative solution that Beam Portability provides for portable pipelines is to execute the SDK Harness as just a normal system process. In this case, you need to specify environment_type="PROCESS" and provide a path to an executable file (which obviously has to be installed on all executor nodes).
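For a Java pipeline, a hedged sketch of setting this up programmatically could look as follows; the PortablePipelineOptions option names and the /opt/apache/beam/boot path are assumptions based on the Beam portability documentation, so verify them against your Beam version:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class ProcessEnvironmentExample {
    public static void main(String[] args) {
        PortablePipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).withValidation().as(PortablePipelineOptions.class);

        // Run the SDK Harness as a plain system process instead of a Docker container.
        options.setDefaultEnvironmentType("PROCESS");
        // The boot executable must already be present on every executor node (path is an assumption).
        options.setDefaultEnvironmentConfig("{\"command\": \"/opt/apache/beam/boot\"}");

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline as usual ...
        p.run().waitUntilFinish();
    }
}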

Unable to run hop pipelines on Spark running on Kubernetes

I am looking for help with running Hop pipelines on a Spark cluster running on Kubernetes.
I have a Spark master deployed with 3 worker nodes on Kubernetes.
I am using the hop-run.sh command to run a pipeline on Spark running on Kubernetes.
I am facing the below exception:
java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.s3.AmazonS3ClientBuilder
It looks like the fat jar is not getting associated with Spark when running the hop-run.sh command.
I tried running the same with the spark-submit command too, but I am not sure how to pass references to the pipelines and workflows to Spark running on Kubernetes, though I am able to add the fat jar to the classpath (as can be seen in the logs).
Any kind of help is appreciated.
Thanks
Could it be that you are using version 1.0?
We had a missing jar for S3 VFS, which has been resolved in 1.1:
https://issues.apache.org/jira/browse/HOP-3327
For more information on how to use spark-submit you can take a look at the following documentation:
https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-spark-pipeline-engine.html#_running_with_spark_submit
The locations of the fat jar, the pipeline, and the required metadata export can all be VFS locations, so there is no need to place those on the cluster itself.

How to run Cloud Dataflow pipelines using Spark runner?

I have read that Google Cloud Dataflow pipelines, which are based on Apache Beam SDK, can be run with Spark or Flink.
I have some Dataflow pipelines currently running on GCP using the default Cloud Dataflow runner, and I want to run them using the Spark runner, but I don't know how to.
Is there any documentation or guide about how to do this? Any pointers will help.
Thanks.
I'll assume you're using Java but the equivalent process applies with Python.
You need to migrate your pipeline to use the Apache Beam SDK, replacing your Google Dataflow SDK dependency with:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-core</artifactId>
<version>2.4.0</version>
</dependency>
Then add the dependency for the runner you wish to use:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-spark</artifactId>
<version>2.4.0</version>
</dependency>
And add --runner=SparkRunner to specify that this runner should be used when submitting the pipeline.
See https://beam.apache.org/documentation/runners/capability-matrix/ for the full list of runners and comparison of their capabilities.
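If you prefer to pin the runner in code rather than pass the flag on the command line, a minimal sketch (the class name and the omitted transforms are placeholders) could look like this:

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnSpark {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        // Equivalent to passing --runner=SparkRunner on the command line.
        options.setRunner(SparkRunner.class);

        Pipeline p = Pipeline.create(options);
        // ... apply the same transforms you were running on Cloud Dataflow ...
        p.run().waitUntilFinish();
    }
}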
Thanks to multiple tutorials and documentation scattered all over the web, I was finally able to form a coherent idea of how to use the Spark runner with any Beam SDK based pipeline.
I have documented the entire process here for future reference: http://opreview.blogspot.com/2018/07/running-apache-beam-pipeline-using.html.

How to setup YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7 along with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks
YARN is part of Hadoop. So, a Hadoop installation is necessary to run Spark on YARN.
Check out the page on the Hadoop Cluster Setup.
Then you can utilize this documentation to learn about Spark on YARN.
Another method to quickly learn about Hadoop, YARN and Spark is to utilize Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
We are currently using a similar setup in AWS. AWS EMR is costly, hence we set up our own cluster using EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions; however, we chose HDP.
The setup included the following:
Master Setup
Spark (Along with History server)
Yarn Resource Manager
HDFS Name Node
Livy server
Slave Setup
Yarn Node Manager
HDFS Data Node
More information on manual installation can be found in the HDP documentation.
You can see part of that automation here.

Remotely execute a Spark job on an HDInsight cluster

I am trying to automatically launch a Spark job on an HDInsight cluster from Microsoft Azure. I am aware that several methods exist to automate Hadoop job submission (provided by Azure itself), but so far I have not been able to find a way to remotely run a Spark job without setting up RDP access to the master instance.
Is there any way to achieve this?
Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.
https://github.com/spark-jobserver/spark-jobserver
My solution is to use both a scheduler and Spark-jobserver to launch the Spark job periodically.
At the moment of this writing, it seems there is no official way of achieving this. So far, however, I have been able to somehow remotely run Spark jobs using an Oozie shell workflow. It is nothing but a patch, but so far it has been useful for me. These are the steps I have followed:
Prerequisites
Microsoft Powershell
Azure Powershell
Process
Define an Oozie workflow .xml file:
<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
<start to = "myAction"/>
<action name="myAction">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>myScript.cmd</exec>
<file>wasb://myContainer#myAccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>
<file>wasb://myContainer#myAccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
Note that it is not possible to know on which HDInsight node the script is going to be executed, so it is necessary to put it, along with the Spark application .jar, in the wasb repository. It is then copied to the local directory in which the Oozie job is executing.
Define the custom script
C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass ^
--master yarn-cluster ^
--deploy-mode cluster ^
--num-executors 3 ^
--executor-memory 2g ^
--executor-cores 4 ^
mySpark.jar
It is necessary to upload both the .cmd and the Spark .jar to the wasb repository (a process that is not included in this answer), specifically to the location referenced in the workflow:
wasb://myContainer#myAccount.blob.core.windows.net/
Define the PowerShell script
The PowerShell script is largely taken from the official Oozie on HDInsight tutorial. I am not including the script in this answer since it is almost identical to the one I used.
I have made a new suggestion on the Azure feedback portal indicating the need for official support for remote Spark job submission.
Updated on 8/17/2016:
Our Spark cluster offering now includes a Livy server that provides a REST service to submit Spark jobs. You can automate Spark jobs via Azure Data Factory as well.
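As an illustration only, a remote submission through Livy's batch REST API might look roughly like the Java sketch below; the /livy/batches path, the JSON fields, and the credentials are assumptions based on the public Livy API and should be checked against your cluster:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class LivySubmit {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; the /livy/batches path is an assumption.
        String endpoint = "https://CLUSTERNAME.azurehdinsight.net/livy/batches";
        String auth = Base64.getEncoder()
                .encodeToString("admin:PASSWORD".getBytes(StandardCharsets.UTF_8));

        // Batch definition: the jar in wasb storage and the main class to run.
        String payload = "{\"file\": \"wasb:///example/jars/mySpark.jar\", "
                + "\"className\": \"spark.azure.MainClass\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}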
Original post:
1) Remote job submission for spark is currently not supported.
2) If you want to avoid setting a master every time (i.e. adding --master yarn-client every time you execute), you can set the value in the %SPARK_HOME%\conf\spark-defaults.conf file with the following config:
spark.master yarn-client
You can find more info on spark-defaults.conf on the Apache Spark website.
3) Use the cluster customization feature if you want to add this to the spark-defaults.conf file automatically at deployment time.
