How to connect Cloudera Manager to existing Spark cluster - puppet

I have the following requirement: I need to provision both Cloudera Manager and a Spark cluster via Puppet, in a way that needs minimal (or no) configuration through the Cloudera Manager UI afterwards. The ideal scenario I'm looking for is the following:
Topology: 3 nodes (where node1 is spark-master and node2 and node3 are spark-workers)
Provision the Spark cluster (this works as expected; I have a working CDH5.5 Spark cluster, verified by running the Spark Pi example)
Install CM server on spark-master node
Install CM agent on all nodes
Start CM server and agents
I'm using the razorsedge/cloudera Puppet module to provision Cloudera Manager (https://forge.puppetlabs.com/razorsedge/cloudera), and I have a custom-made Spark Puppet module which supports CDH5.5 Spark installation.
When I open the Cloudera Manager UI, I can see all three nodes, but I don't see any Spark-related stats on the CM UI dashboard.
When investigating cm agent and server logs, these are the findings:
cm agent log on spark-master (was not connected to CM server and cannot be seen on CM UI dashboard)
[12/Jan/2016 23:13:11 +0000] 4678 MainThread agent ERROR Heartbeating to EC2_PUBLIC_DNS:7182 failed
cm agent log on spark-workers (connected to CM server successfully and can be seen on CM UI dashboard)
cm server log on spark-master:
org.apache.avro.AvroRuntimeException: Unknown datum type: java.lang.IllegalArgumentException: Hostname invalid EC2_LOCAL_IPV4
Any idea what might be the issue here?
I'm also looking for answers to the following:
Is it possible at all to provision a CDH service (in my case Spark) without using the Cloudera Manager UI and then have it connected to CM?
If yes, which CM configuration(s) need to be changed to point to the existing Spark cluster?
Any help/guidance would be greatly appreciated.

Related

pull out metrics from spark logs

How do I pull out these metrics from the Spark history logs? Is there an API I can pull these from?
I tried downloading the JSON event logs, but I can't grep for the numbers seen in the photo.
The Spark History Server keeps all that information for you. You can access it via a REST API.
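As a rough sketch of what calling that REST API could look like from Python: the History Server exposes JSON under /api/v1, including applications and applications/<app-id>/stages. The host name below is a placeholder, the port assumes the default 18080, and the helper names (api_url, fetch_json, summarize_stages, print_app_summaries) are made up for this illustration. Applications run in YARN cluster mode may need an attempt ID in the path as well, which this sketch does not handle.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: placeholder host, default History Server port 18080.
HISTORY_SERVER = "http://master-public-dns-name:18080"


def api_url(base, *parts):
    """Build a URL under the History Server's /api/v1 REST root."""
    path = "/".join(quote(str(p), safe="") for p in parts)
    return f"{base.rstrip('/')}/api/v1/{path}"


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    with urlopen(url) as resp:
        return json.load(resp)


def summarize_stages(stages):
    """Sum a few per-stage metrics from an .../applications/<app-id>/stages response."""
    keys = ("executorRunTime", "inputBytes", "shuffleReadBytes", "shuffleWriteBytes")
    return {k: sum(stage.get(k, 0) for stage in stages) for k in keys}


def print_app_summaries(base=HISTORY_SERVER):
    """List every application the server knows about with its summed stage metrics."""
    for app in fetch_json(api_url(base, "applications")):
        stages = fetch_json(api_url(base, "applications", app["id"], "stages"))
        print(app["id"], app["name"], summarize_stages(stages))
```

Calling print_app_summaries() against a reachable History Server should print one line per application. Which fields match the numbers you are after depends on which UI page they came from, so treat the chosen keys as a starting point.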
If you are on EMR:
You can view the Spark web UIs by following the procedures to create
an SSH tunnel or create a proxy in the section called Connect to the
cluster in the Amazon EMR Management Guide and then navigating to the
YARN ResourceManager for your cluster. Choose the link under Tracking
UI for your application. If your application is running, you see
ApplicationMaster. This takes you to the application master's web UI
at port 20888 wherever the driver is located. The driver may be
located on the cluster's primary node if you run in YARN client mode.
If you are running an application in YARN cluster mode, the driver is
located in the ApplicationMaster for the application on the cluster.
If your application has finished, you see History, which takes you to
the Spark HistoryServer UI port number at 18080 of the EMR cluster's
primary node. This is for applications that have already completed.
You can also navigate to the Spark HistoryServer UI directly at
http://master-public-dns-name:18080/.
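If you would rather work from the downloaded JSON event logs mentioned in the question, here is a minimal sketch. It assumes an uncompressed event log with one JSON object per line, where task completions appear as SparkListenerTaskEnd events carrying a "Task Metrics" object. The function name sum_task_metrics and the file name in the comment are hypothetical. The UI shows aggregated values, which may be why grepping for the displayed numbers finds nothing: the log only holds the per-task figures.

```python
import json
from collections import Counter


def sum_task_metrics(lines):
    """Add up numeric task metrics across all SparkListenerTaskEnd events."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics") or {}
        for name, value in metrics.items():
            # Only top-level numeric metrics are summed; nested groups
            # (for example shuffle read/write metrics) are skipped here.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[name] += value
    return dict(totals)


# Usage (hypothetical file name):
# with open("application_event_log.json") as f:
#     print(sum_task_metrics(f))
```

Compressed event logs would need to be decompressed first, and the set of metric names can differ between Spark versions.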

Cannot get PySpark working in Kubernetes getting (Initial job has not accepted any resources)

I'm trying to use the following Helm Chart for Spark on Kubernetes
https://github.com/bitnami/charts/tree/main/bitnami/spark
The documentation is of course spotty but I've muddled along. So I have it installed with custom values that assign things like resource limits etc. I'm accessing the master through a NodePort and the WebUI through a port forward. I am NOT using spark-submit, I'm writing Python code to drive the Spark Cluster as follows:
import pyspark
sc = pyspark.SparkContext(appName="Testy", master="spark://<IP>:<PORT>")
This Python code is running locally on my Windows laptop, the Kubernetes cluster is on a separate set of servers. It connects and I can see the app appear in the WebUI but the second it tries to do something I get the following:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The master seems to be in a cycle of removing and launching executors and the 3 workers each just fail to run a launch command. Interestingly the command has the hostname of my laptop in here:
"--driver-url" "spark://CoarseGrainedScheduler@<laptop hostname>:60557"
Got to imagine that's not right. So in this setup where should I be actually running the python code? On the kubernetes cluster? Can I run it locally on my laptop? These details are of course missing from the docs. I'm new to Spark so just looking for the absolute basics. My preferred workflow would be to develop code locally on my laptop then run it on the Kubernetes cluster I have access to.
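For context on that --driver-url value: when the Python code creates the SparkContext itself, the driver runs wherever that code runs, and the executors connect back to it. A laptop hostname the pods cannot resolve or reach would be consistent with the launch failures described. A sketch of pinning the driver's advertised address and ports is below. The configuration keys are standard Spark settings, but the port numbers, the helper names (driver_network_conf, make_context) and the assumption that the laptop is routable from the cluster are illustrative only. If it is not routable, running the driver inside the cluster is the usual alternative.

```python
def driver_network_conf(driver_host, driver_port=7078, block_manager_port=7079):
    """Return Spark settings that fix the address and ports executors dial back to.

    driver_host must be an IP or name that the worker pods can reach.
    The port numbers are arbitrary examples, not Spark defaults.
    """
    return {
        "spark.driver.host": driver_host,
        "spark.driver.bindAddress": "0.0.0.0",
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
    }


def make_context(master_url, driver_host):
    """Create a SparkContext with the settings above (requires pyspark)."""
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("Testy").setMaster(master_url)
    for key, value in driver_network_conf(driver_host).items():
        conf.set(key, value)
    return SparkContext(conf=conf)
```

With fixed ports, any firewall between the cluster and the machine running the driver only needs those two ports opened towards the driver.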

spark-submit error: failed in initializing SparkContext for non-driver-program VMs

Cluster Specifications : Apache Spark on top of Mesos with 5 Vms and HDFS as storage.
spark-env.sh
export SPARK_LOCAL_IP=192.168.xx.xxx #to set the IP address Spark binds to on this node
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so" #to point to your libmesos.so if you use Mesos
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
HADOOP_CONF_DIR="/usr/local/tools/hadoop" #To point Spark towards Hadoop configuration files
spark-defaults.conf
spark.executor.uri hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz
spark.driver.host 192.168.xx.xxx
spark.rpc netty
spark.rpc.numRetries 5
spark.ui.port 48888
spark.driver.port 48889
spark.port.maxRetries 32
I experimented with submitting a word-count Scala application in cluster mode, and I observed that it executes successfully only when the driver program (containing the main method) ends up on the VM it was submitted from. As far as I know, scheduling of resources (VMs) is handled by Mesos. For example, if I submit my application from vm12 and Mesos happens to schedule vm12 to execute the application, it executes successfully. In contrast, it fails if the Mesos scheduler decides to allocate, say, vm15. I checked the stderr logs in the Mesos UI and found this error:
16/09/27 11:15:49 ERROR SparkContext: Error initializing SparkContext.
Besides, I tried looking through the Spark configuration options at the following link:
http://spark.apache.org/docs/latest/configuration.html
I tried setting the rpc options, as it seemed necessary to keep the driver program close to the worker nodes on the LAN, but I couldn't get much insight from that.
I also tried uploading my application to HDFS and submitting the application jar file from HDFS. I observed the same behaviour.
While connecting Apache Spark with Mesos according to the documentation at http://spark.apache.org/docs/latest/running-on-mesos.html, I also tried configuring spark-defaults.conf and spark-env.sh on the other VMs in order to check whether it runs successfully from at least two VMs. That didn't work out either.
Am I missing something conceptually here?
So how can I make my application run successfully regardless of which VM I'm submitting from?

Connecting to Remote Spark Cluster for TitanDB SparkGraphComputer

I am attempting to leverage a Hadoop Spark cluster in order to batch load a graph into Titan using the SparkGraphComputer and BulkLoaderVertexProgram, as specified here. This requires setting the Spark configuration in a properties file, telling Titan where Spark is located, where to read the graph input from, where to store its output, etc.
The problem is that all of the examples seem to specify a local spark cluster through the option:
spark.master=local[*]
I, however, want to run this job on a remote Spark cluster which is on the same VNet as the VM where the titan instance is hosted. From what I have read, it seems that this can be accomplished by setting
spark.master=<spark_master_IP>:7077
This gives me an error saying that all Spark masters are unresponsive, which prevents me from sending the job to the Spark cluster to distribute the batch loading computations.
For reference, I am using Titan 1.0.0 and a Spark 1.6.4 cluster, which are both hosted on the same VNet. Spark is being managed by yarn, which also may be contributing to this difficulty.
Any sort of help/reference would be appreciated. I am sure that I have the correct IP for the spark master, and that I am using the right gremlin commands to accomplish bulk loading through the SparkGraphComputer. What I am not sure about is how to properly configure the Hadoop properties file in order to get Titan to communicate with a remote Spark cluster over a VNet.
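One thing worth checking is the format of the spark.master value itself. Spark expects a scheme for a standalone master (spark://host:port), and a YARN-managed Spark 1.x cluster is normally addressed with yarn-client or yarn-cluster plus the Hadoop configuration on the classpath, rather than a host and port. The small checker below is only an illustration of those documented forms: the function name classify_master and the patterns are my own, and whether Titan's SparkGraphComputer then works against YARN in this setup is not something the sketch can confirm.

```python
import re

# Patterns for the master URL forms described in the Spark documentation.
_MASTER_PATTERNS = [
    (re.compile(r"^local(\[(\*|\d+)(,\s*\d+)?\])?$"), "local mode"),
    (re.compile(r"^spark://[^\s:/,]+:\d+(,[^\s:/,]+:\d+)*$"), "standalone master"),
    (re.compile(r"^mesos://\S+$"), "Mesos"),
    (re.compile(r"^yarn(-client|-cluster)?$"), "YARN"),
]


def classify_master(value):
    """Return a label for a recognised spark.master value, or None if unrecognised."""
    for pattern, label in _MASTER_PATTERNS:
        if pattern.match(value):
            return label
    return None


# A bare host:port, as in the question, matches none of the known forms.
print(classify_master("10.0.0.4:7077"))
print(classify_master("spark://10.0.0.4:7077"))
print(classify_master("yarn-client"))
```

The IP address here is a made-up example. Note that a cluster managed by YARN may not be running a standalone master on port 7077 at all, which would also produce an "unresponsive masters" error.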

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using Cloudera Quickstart VM 5.3.0 (running in VirtualBox 4.3 on Windows 7) and I want to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services; Spark is there, but in standalone mode. So I click on "Add a new service" and select "Spark". Then I have to select the set of dependencies for this service; I have no choice and must pick HDFS/YARN/ZooKeeper.
In the next step I have to choose a History Server and a Gateway. I run the VM in local mode, so I can only choose localhost.
I click on "Continue" and this error occurs (+ 69 traces):
A server error has occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know if an internet connection is needed, but I should mention that the VM can't connect to the internet. (EDIT: Even with an internet connection I get the same error.)
I have no idea how to add this service. I tried with and without a gateway and with many network options, but it never worked. I checked the known issues and found nothing.
Does anyone know how I can solve this error or work around it? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different modes: (1) local, (2) Spark's own stand-alone manager, and (3) other cluster resource managers like Hadoop YARN, Apache Mesos, and Amazon EC2.
Spark works out of the box with CDH 5 for (1) and (2). You can start a local interactive Spark session in Scala using the spark-shell command, or pyspark for Python, without passing any arguments. I find the interactive Scala and Python interpreters helpful for learning to program with Resilient Distributed Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I don't mean to take credit for the bug you discovered, but I have posted it to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, see if all of the Spark daemons are running using the following command (you can do this inside the Cloudera Manager (CM) UI):
[cloudera@quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the stand-alone Spark services so we can try to submit the Spark job within Yarn.
In order to run Spark inside a Yarn container on the quick start cluster, we have to do the following:
Set HADOOP_CONF_DIR to the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CDH5. You can set this variable using the command export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify you are using Hadoop YARN.
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare to the Spark History server. Hue should show the job placed in a generic Yarn container and Spark History should not have a record of the submitted job.
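If you want to script the steps above rather than type them, here is a minimal Python sketch. It only wraps the same spark-submit invocation and sets HADOOP_CONF_DIR for the child process. The helper names (build_submit_command, submit_env, submit) and the example class and jar names are hypothetical, and it assumes spark-submit is on the PATH.

```python
import os
import shlex
import subprocess


def build_submit_command(main_class, jar_path, app_args=()):
    """Assemble: spark-submit --class CLASS --master yarn JAR ARGS..."""
    return ["spark-submit", "--class", main_class, "--master", "yarn", jar_path, *app_args]


def submit_env(hadoop_conf_dir="/etc/hadoop/conf"):
    """Copy the current environment and point HADOOP_CONF_DIR at the YARN config."""
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    return env


def submit(main_class, jar_path, app_args=()):
    """Run spark-submit and raise CalledProcessError if it exits non-zero."""
    cmd = build_submit_command(main_class, jar_path, app_args)
    print("Running:", " ".join(shlex.quote(part) for part in cmd))
    return subprocess.run(cmd, env=submit_env(), check=True)


# Usage (hypothetical class and jar):
# submit("com.example.WordCount", "target/simplesparkapp.jar", ["input.txt"])
```

Passing the command as a list avoids shell quoting problems with arguments that contain spaces.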
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn

Resources