Spark metrics for gmond / ganglia - apache-spark

OS: CentOS 6.4
ISSUE:
Installed gmond, gmetad and gweb on a server, and installed a Spark worker on the same server.
Configured $SPARK_HOME/conf/metrics.properties as below.
CONFIGURATION (metrics.properties in spark):
org.apache.spark.metrics.sink.GangliaSink
host localhost
port 8649
period 10
unit seconds
ttl 1
mode multicast
We are not able to see any metrics in the Ganglia web UI.
Please do the needful.
-pradeep samudrala

First of all, those lines only document the default settings of the Ganglia sink; you should not simply uncomment them. From the metrics section of the Spark documentation:
To install the GangliaSink you’ll need to perform a custom build of Spark. Note that by embedding this library you will include LGPL-licensed code in your Spark package. For sbt users, set the SPARK_GANGLIA_LGPL environment variable before building. For Maven users, enable the -Pspark-ganglia-lgpl profile. In addition to modifying the cluster’s Spark build, user applications will need to link to the spark-ganglia-lgpl artifact.
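Once Spark has been built with Ganglia support, the sink is normally enabled with prefixed key=value entries rather than the bare names from the commented template. The following is a minimal sketch that renders such entries; the helper name render_ganglia_sink is made up for illustration, the host and port are the values from the question, and the `*.sink.ganglia.` key prefix is my understanding of the Spark metrics configuration format, so check it against the metrics.properties.template shipped with your build.

```python
def render_ganglia_sink(host, port, period=10, unit="seconds", ttl=1, mode="multicast"):
    """Return metrics.properties lines that enable the Ganglia sink.

    The leading "*" applies the sink to every Spark instance
    (master, worker, driver, executor). Defaults mirror the
    values listed in the question.
    """
    settings = {
        "class": "org.apache.spark.metrics.sink.GangliaSink",
        "host": host,
        "port": port,
        "period": period,
        "unit": unit,
        "ttl": ttl,
        "mode": mode,
    }
    return "\n".join(f"*.sink.ganglia.{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    # Paste the printed lines into $SPARK_HOME/conf/metrics.properties.
    print(render_ganglia_sink("localhost", 8649))
```

The output can be appended to metrics.properties; the workers need a restart before the sink is picked up.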

Related

how to integrate dropwizard metrics for monitoring cassandra database

I want to monitor the health of my Cassandra cluster. I came across Dropwizard Metrics, but I don't know how to integrate it with my Cassandra cluster to monitor it.
For this I want to use JMX as the metrics reporter, Graphite as the metrics collector and Grafana as the visualization GUI.
Can anyone help me here, please?
Cassandra itself uses dropwizard Metrics and has a pluggable reporting interface since 2.0.2 (announcement post). 'Monitoring Apache Cassandra Metrics With Graphite and Grafana' gives a good overview on how to configure Cassandra to report metrics to graphite:
1). Download Graphite metrics reporter jar file
2). Put the downloaded jar file in the Cassandra library folder, e.g. /usr/share/cassandra/lib/ (the default Cassandra library folder under a packaged installation on Ubuntu 14.04)
3). Create a metrics reporter configuration file (e.g. metrics_reporter_graphite.yaml) and put it in the same folder as the cassandra.yaml file, e.g. /etc/cassandra/ (the default Cassandra configuration folder under a packaged installation on Ubuntu 14.04).
graphite:
  -
    period: 30
    timeunit: 'SECONDS'
    prefix: 'cassandra-clustername-node1'
    hosts:
      - host: 'localhost'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.+'
        - '^jvm.+'
4). Modify cassandra-env.sh file to include the following JVM option:
METRICS_REPORTER_CFG="metrics_reporter_graphite.yaml"
JVM_OPTS="$JVM_OPTS -Dcassandra.metricsReporterConfigFile=$METRICS_REPORTER_CFG"
5). Restart Cassandra service
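Before restarting Cassandra it can help to confirm that Graphite is reachable on the host and port named in the reporter configuration. The sketch below sends one data point using Graphite's plaintext protocol (one "path value timestamp" line per metric). The function names and the test metric path are made up for illustration; the host, port and prefix are the ones from the YAML above.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_test_metric(host="localhost", port=2003,
                     path="cassandra-clustername-node1.test.ping", value=1):
    """Send a single test data point to the Graphite (carbon) listener."""
    line = format_graphite_line(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line


if __name__ == "__main__":
    print("sent:", send_test_metric().strip())
```

If the connection is refused here, the Cassandra reporter will not be able to deliver metrics to that address either.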

Apache Cassandra monitoring

What is the best way to monitor whether Cassandra nodes are up? For security reasons, JMX and nodetool are out of the question. I have cluster metrics monitoring via a REST API, but as I understand it, even if a node goes down the REST API only reports on the cluster as a whole.
Well, I have integrated a system that lets me monitor all the metrics for every node of my cluster. It may seem complicated, but it is pretty simple to integrate. You will need the following components to build a monitoring system for Cassandra:
jolokia jar
telegraf
influxdb
grafana
Here is a short outline of the procedure.
Step 1: Copy the Jolokia JVM agent jar to install_dir/apache-cassandra-version/lib/. The agent jar is easy to find with a web search.
Step 2: Add the following line to install_dir/apache-cassandra-version/conf/cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -javaagent:<here_goes_the_path_of_your_jolokia_jar>"
Step 3: Install Telegraf on each node, configure the metrics you want to monitor, and start the Telegraf service.
Step 4: Install Grafana, configure its IP, port and protocol, and start the Grafana service. Grafana gives you a dashboard for watching your nodes; this is where your metrics become visible.
Step 5: Install InfluxDB on another server, where you want to store the metrics data sent by the Telegraf agents.
Step 6: Open the Grafana address you configured in a browser, add the InfluxDB IP as a data source, then customize your dashboard.
image source: https://blog.pythian.com/monitoring-cassandra-grafana-influx-db/
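To check that the Jolokia agent from Step 2 is answering before wiring up Telegraf, you can read a single JMX attribute over HTTP. This is a minimal sketch, assuming the agent listens at http://localhost:8778/jolokia (the usual default for the JVM agent; adjust it to your -javaagent options). The function names are made up for illustration, and MBean names that contain slashes would need Jolokia's own escaping, which is not handled here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def jolokia_read_url(base_url, mbean, attribute=None):
    """Build a Jolokia GET 'read' URL: <base>/read/<mbean>[/<attribute>]."""
    url = f"{base_url.rstrip('/')}/read/{quote(mbean, safe=':=,')}"
    if attribute:
        url += f"/{quote(attribute)}"
    return url


def read_attribute(base_url, mbean, attribute):
    """Fetch one attribute and return the 'value' field of the JSON reply."""
    with urlopen(jolokia_read_url(base_url, mbean, attribute), timeout=5) as resp:
        return json.load(resp)["value"]


if __name__ == "__main__":
    # Assumed endpoint; java.lang:type=Memory is a standard JVM MBean.
    base = "http://localhost:8778/jolokia"
    print(read_attribute(base, "java.lang:type=Memory", "HeapMemoryUsage"))
```

If this prints heap usage figures, the agent is up and Telegraf can be pointed at the same base URL.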
This is not for monitoring in general, only for node state.
The Cassandra CQL driver reports whether a particular node is UP or DOWN through the Host.StateListener interface. The driver itself uses this information to mark a node up or down, so it can be used to detect node state when JMX is not accessible.
Java Doc API : https://docs.datastax.com/en/drivers/java/3.3/
I came up with a script which listens for DN nodes in the cluster and reports it to our monitoring setup which is integrated with pagerduty.
The script runs on one of our nodes, executes nodetool status every minute and reports all down nodes.
Here is the script https://gist.github.com/johri21/87d4d549d05c3e2162af7929058a00d1
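The idea behind such a script can be sketched as follows. This is not the linked gist, only an illustration under stated assumptions: it assumes nodetool is on the PATH of the node running it, and that node lines in the nodetool status output start with a two-letter status/state code such as UN or DN followed by the address. The sample output is made up for the example.

```python
import subprocess

# Made-up sample in the general shape of `nodetool status` output.
SAMPLE_STATUS = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns   Host ID  Rack
UN  10.0.0.1   1.2 GiB  256     33.3%  aaaa     rack1
DN  10.0.0.2   1.1 GiB  256     33.3%  bbbb     rack1
"""


def down_nodes(status_output):
    """Return the addresses of nodes reported as DN (Down/Normal)."""
    nodes = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "DN":
            nodes.append(fields[1])
    return nodes


def check_cluster():
    """Run nodetool status locally and return the down nodes."""
    result = subprocess.run(["nodetool", "status"],
                            capture_output=True, text=True, check=True)
    return down_nodes(result.stdout)


if __name__ == "__main__":
    print(down_nodes(SAMPLE_STATUS))
```

A cron job could call check_cluster() every minute and forward a non-empty result to the alerting system.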

What version of Apache spark is used in my IBM Analytics for Apache Spark for IBM Cloud service?

I saw an email indicating the sunset of support for 1.6 apache spark within IBM Cloud. I am pretty sure my version is 2.x, but I wanted to confirm. I couldn't find anywhere in the UI that indicated the version, and the bx cli command that I thought would show it didn't.
[chrisr@oc5287453221 ~]$ bx service show "Apache Spark-bc"
Invoking 'cf service Apache Spark-bc'...
Service instance: Apache Spark-bc
Service: spark
Bound apps:
Tags:
Plan: ibm.SparkService.PayGoPersonal
Description: IBM Analytics for Apache Spark for IBM Cloud.
Documentation url: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
Dashboard: https://spark-dashboard.ng.bluemix.net/dashboard
Last Operation
Status: create succeeded
Message:
Started: 2018-01-22T16:08:46Z
Updated: 2018-01-22T16:08:46Z
How do I determine the version of spark that I am using? Also, I tried going to the "Dashboard" URL from above, and I got an "Internal Server Error" message after logging in.
The information found in "How to check the Spark version" doesn't seem to help, because it seems to relate to locally installed Spark instances. I need to find this out from IBM Cloud (i.e. Bluemix) using either the UI or the Bluemix CLI. Another possibility would be running some command from a Jupyter Notebook in IPython running in Data Science Experience (part of IBM Cloud).
The answer was given by ptitzler above, just adding an answer as requested by the email I was sent.
The Spark service itself is not version specific. To find out whether
or not you need to migrate you need to inspect the apps/tools that
utilize the service. For example if you've created notebooks in DSX
you associated them with a kernel that was bound to a specific Spark
version and you'd need to open each notebook to find out which Spark
version they are utilizing. – ptitzler Jan 31 at 16:32
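Inside each notebook, the version can usually be read from the Spark context that the kernel provides. The helper below is a small sketch for deciding whether a given version string falls under the 1.x line that is being sunset; the function names are made up, and it assumes the notebook kernel predefines a SparkContext named sc, as PySpark notebook kernels commonly do.

```python
def spark_major_version(version):
    """Extract the major version number from a string like '2.1.0'."""
    return int(version.split(".")[0])


def needs_migration(version):
    """True for Spark 1.x (for example the 1.6 line), False for 2.x and later."""
    return spark_major_version(version) < 2


if __name__ == "__main__":
    # In a notebook cell you would pass the live value instead:
    #   needs_migration(sc.version)
    for v in ("1.6.3", "2.1.0"):
        print(v, "needs migration:", needs_migration(v))
```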

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using Cloudera Quickstart VM 5.3.0 (running in Virtual Box 4.3 on Windows 7) and I wanted to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services; Spark is there, but in standalone mode. So I click on "Add a new service" and select "Spark". Then I have to select the set of dependencies for this service; I have no choice, I must pick HDFS/YARN/ZooKeeper.
In the next step I have to choose a History Server and a Gateway. I run the VM in local mode, so I can only choose localhost.
I click on "Continue" and this error occurs (+ 69 traces):
A server error as occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know whether an internet connection is needed, but I should mention that the VM can't connect to the internet. (EDIT: even with an internet connection I get the same error.)
I have no idea how to add this service. I tried with and without a gateway and with many network options, but it never worked. I checked the known issues and found nothing.
Does anyone know how I can solve this error or work around it? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different modes: (1) local, (2) Spark's own stand-alone manager, and (3) other cluster resource managers like Hadoop YARN, Apache Mesos, and Amazon EC2.
Spark works out of the box with CDH 5 for (1) and (2). You can start a local interactive Spark session in Scala using the spark-shell command, or pyspark for Python, without passing any arguments. I find the interactive Scala and Python interpreters help with learning to program with Resilient Distributed Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I don't mean to take credit for the bug you discovered, but I posted it to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, see if all of the Spark daemons are running using the following command (you can do this inside the Cloudera Manager (CM) UI):
[cloudera@quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the stand-alone Spark services so we can try to submit the Spark job within Yarn.
In order to run Spark inside a Yarn container on the quick start cluster, we have to do the following:
Set HADOOP_CONF_DIR to the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CDH5. You can set this variable using the command export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify you are using Hadoop YARN.
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare to the Spark History server. Hue should show the job placed in a generic Yarn container and Spark History should not have a record of the submitted job.
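The first two steps above can be sketched as a small wrapper that sets HADOOP_CONF_DIR and calls spark-submit with the same argument order as the command shown. The function names, the class name com.example.SimpleApp and the jar path are placeholders invented for the example, and spark-submit is assumed to be on the PATH.

```python
import os
import subprocess


def build_submit_command(main_class, jar_path, app_args=()):
    """Assemble: spark-submit --class CLASS --master yarn JAR ARGS..."""
    return ["spark-submit", "--class", main_class,
            "--master", "yarn", jar_path, *app_args]


def submit(main_class, jar_path, app_args=(), hadoop_conf_dir="/etc/hadoop/conf"):
    """Run spark-submit with HADOOP_CONF_DIR pointing at the YARN config."""
    env = dict(os.environ, HADOOP_CONF_DIR=hadoop_conf_dir)
    return subprocess.run(build_submit_command(main_class, jar_path, app_args),
                          env=env, check=True)


if __name__ == "__main__":
    # Placeholder class and jar; this only prints the command it would run.
    print(" ".join(build_submit_command("com.example.SimpleApp",
                                        "target/simplesparkapp.jar",
                                        ["input.txt", "2"])))
```

Passing the environment explicitly keeps the setting local to the submitted job instead of changing the shell session.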
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn

Restart tasktracker and job tracker of hadoop CDH4 using Cloudera services

I have made a few changes in mapred-site.xml. To pick up these changes I need to restart the TaskTracker (TT) and JobTracker (JT) running on my cluster nodes.
Is there any way I can restart them using the Cloudera Manager web services from the command line?
That way I can automate those steps, so that any time the Hadoop configuration files are changed, the TT and JT are restarted.
Since version 4.0, Cloudera Manager exposes its functionality through an HTTP API which allows you to do the operations through "curl" from the shell. The API is available in both the Free Edition and the Enterprise Edition.
Their repository hosts a set of client-side utilities for communicating with the Cloudera Manager API. You can find more on the documentation page.
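As a rough illustration of what such an API call looks like, the sketch below posts a restart command for one service using only the Python standard library. Everything specific in it is an assumption to verify against the API documentation for your Cloudera Manager version: the port 7180, the API version v1, the commands/restart path, and the cluster and service names, which are placeholders. The function names are made up for the example.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def restart_url(cm_host, cluster, service, port=7180, api_version="v1"):
    """Build the (assumed) REST path for restarting one service."""
    return (f"http://{cm_host}:{port}/api/{api_version}"
            f"/clusters/{quote(cluster, safe='')}"
            f"/services/{quote(service, safe='')}/commands/restart")


def restart_service(cm_host, cluster, service, user, password):
    """POST the restart command using HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = Request(restart_url(cm_host, cluster, service), data=b"",
                  method="POST", headers={"Authorization": f"Basic {token}"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder host, cluster and service names.
    print(restart_url("cm.example.com", "Cluster 1", "mapreduce1"))
```

Restarting the MapReduce service this way restarts its JobTracker and TaskTracker roles together, which fits the automation described in the question.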
