Bluemix Apache Spark Metrics - apache-spark

I have been looking for a way to monitor performance in Spark on Bluemix. I know in the Apache Spark project, they provide a metrics service based on the Coda Hale Metrics Library. This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files. Details here: http://spark.apache.org/docs/latest/monitoring.html
Does anyone know of any way to do this in the Bluemix Spark service? Ideally, I would like to save the metrics to a csv file in Object Storage.
Appreciate the help.
Thanks
Saul
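For reference, in stock Apache Spark the CSV sink mentioned above is enabled through a metrics.properties file (per the monitoring docs linked in the question); whether such a file can be supplied to the Bluemix service is unclear, so treat this as a sketch of the vanilla setup with illustrative period and directory values:
# metrics.properties (stock Apache Spark; period and directory values are illustrative)
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics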

Currently, I do not see an option for using the Coda Hale Metrics Library, reporting the job history, or accessing this information via a REST API.
However, on the main page of the Spark history server you can see the Event log directory. It points to the following user directory: file:/gpfs/fs01/user/USER_ID/events/
There you will find JSON-formatted event files (one JSON object per line).
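As a rough sketch, assuming the event logs are uncompressed and readable from your notebook (USER_ID below is a placeholder, mirroring the path reported by the history server), each file can be parsed line by line; the exact key names may vary across Spark versions:
import glob
import json
# Path reported by the Spark history server; USER_ID is a placeholder
event_dir = "/gpfs/fs01/user/USER_ID/events/"
for path in glob.glob(event_dir + "*"):
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            # Every event carries an "Event" type, e.g. SparkListenerTaskEnd
            if event.get("Event") == "SparkListenerTaskEnd":
                task_metrics = event.get("Task Metrics", {})
                print(task_metrics.get("Executor Run Time"))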

Related

Getting Splunk search result directly from Apache Spark

Small question regarding an integration between Splunk and Apache Spark.
Currently, I run a search query in Splunk. The result is quite big, and I export it as a CSV to share with several teams for downstream work.
Each downstream team ends up loading the CSV as part of an Apache Spark job, converting it to a Dataset, and running map-reduce operations on it.
The Spark jobs differ from team to team, so simply plugging each team's computation directly into Splunk does not scale.
This leads to my question: instead of each team downloading a copy of the CSV, is there an API, or some other way, to connect to the Splunk search results directly from Apache Spark?
Thank you
Splunk does not have an API specifically for Spark. There is a REST API, a few SDKs, and (perhaps best for you) support for ODBC. With an ODBC/JDBC driver installed on your Spark server and a few saved searches defined on Splunk, you should be able to export results from Splunk to Spark for analysis. See https://www.cdata.com/kb/tech/splunk-jdbc-apache-spark.rst for more information.
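A rough sketch of what the JDBC route could look like from PySpark, assuming the CData Splunk JDBC driver; the driver class name, JDBC URL, jar path, and table/saved-search name below are illustrative and should be taken from the driver's own documentation:
from pyspark.sql import SparkSession
# The driver jar must be on the Spark classpath; the path here is illustrative
spark = (SparkSession.builder
         .appName("splunk-jdbc-sketch")
         .config("spark.jars", "/path/to/cdata.jdbc.splunk.jar")
         .getOrCreate())
# Connection details are illustrative; consult the driver documentation
df = (spark.read.format("jdbc")
      .option("url", "jdbc:splunk:URL=https://splunk-host:8089;User=admin;Password=***;")
      .option("driver", "cdata.jdbc.splunk.SplunkDriver")
      .option("dbtable", "SavedSearchResults")  # e.g. a saved search exposed as a table
      .load())
df.show(5)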

Retrieve graphical information using Spark Structured Streaming

Spark Streaming provided a "Streaming" tab in the Web UI (http://localhost:4040 for running applications or http://localhost:18080 for completed applications, both by default) for each application executed, where graphs of application performance could be viewed. This tab is no longer available when using Spark Structured Streaming. In my case, I am developing a streaming application with Spark Structured Streaming that reads from a Kafka broker, and I would like to obtain a graph of records processed per second, such as the one I could get with Spark Streaming, among other graphical information.
What is the best alternative to achieve this? I am using Spark 3.0.1 (via pyspark library), and deploying my application on a YARN cluster.
I've checked Monitoring Structured Streaming Applications Using Web UI by Jacek Laskowski, but it is still not clear to me how to obtain this kind of information in graphical form.
Thank you in advance!
I managed to get what I wanted. For a reason I still don't know, the Spark History Server UI for completed apps (on http://localhost:18080 by default) did not show the new "Structured Streaming" tab that is available for Structured Streaming applications executed on Spark 3.0.1. However, the web UI I accessed through http://localhost:4040 does show the information I wanted to retrieve. You just need to click on the 'runId' link of the streaming query whose statistics you want to see.
If you can't see this tab, based on my personal experience, I recommend the following:
Upgrade to the latest Spark version (currently 3.0.1).
Consult this information in the UI deployed at port 4040 while the application is running, rather than at port 18080 after the application has finished.
I found the official Web UI documentation for the latest Apache Spark release very useful for this.
Most of the metrics information you see in the Spark UI is exported by Spark itself.
If the Spark UI doesn't fit your requirements, you can retrieve these metrics and process them yourself.
You can use a sink to export the data, for example to CSV or Prometheus, or read it via the REST API.
You should take a look at the Spark monitoring documentation: https://spark.apache.org/docs/latest/monitoring.html
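One programmatic option in Spark 3.0.1 is to poll the streaming query's progress from PySpark, which already contains the records-per-second figures shown in the Structured Streaming UI; a minimal sketch, with placeholder Kafka options (the spark-sql-kafka-0-10 package must be on the classpath):
import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("progress-sketch").getOrCreate()
# Placeholder Kafka source; adjust the bootstrap servers and topic to your setup
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load())
query = df.writeStream.format("console").start()
# lastProgress mirrors the statistics plotted in the Structured Streaming tab
while query.isActive:
    progress = query.lastProgress  # dict parsed from the progress JSON, or None at startup
    if progress:
        print(progress.get("inputRowsPerSecond"), progress.get("processedRowsPerSecond"))
    time.sleep(10)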

Sending Spark streaming metrics to open tsdb

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Could you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from my Spark Streaming job with it? Are there any easy options for doing this?
One way to send the metrics to OpenTSDB is to use its REST API. To do so, simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it is written in Java and can therefore be used from Scala). Example code can be found on github.
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on github and should be usable. Unfortunately, the code targets Spark 1.4.1; at worst, it should still give some indication of what is necessary to add.
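For the REST route, here is the same idea sketched in Python rather than Java/HttpClient; the OpenTSDB host, metric name, value, and tags below are placeholders, and /api/put is OpenTSDB's standard HTTP write endpoint (default port 4242):
import json
import time
from urllib import request
# Placeholder data point; in practice you would build this from your job's metrics
datapoint = {
    "metric": "spark.streaming.records.processed",
    "timestamp": int(time.time()),
    "value": 1234,
    "tags": {"app": "my-streaming-job"},
}
req = request.Request(
    "http://opentsdb-host:4242/api/put",
    data=json.dumps(datapoint).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
request.urlopen(req)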

web URL information to apache spark in web app

We are currently trying to retrieve information from the EPA for our web app, which needs to use IBM Bluemix and Apache Spark. The information we are gathering from the EPA is this:
https://aqs.epa.gov/api and ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
We are not only gathering historical data; we also want to update the web app by inserting new data every hour. Concerning this, I have a few questions:
1) Do we need to set up an HDFS to store all the data, or could we just retrieve the data by its URL and store it in a DataFrame? IBM Bluemix said it would provide 5 GB of storage, so how would one use that to store the historical data as well as the hourly updates?
2) If we are going to update the data every hour by inserting new data into the data store / DataFrame, should we still use Spark Streaming? If yes, how would we use Spark Streaming for URL data? A lot of the resources I see online are only useful if one has an HDFS or a formal database.
What we are doing currently is fetching the URL contents directly in Python:
url = "https://aqs.epa.gov/api/rawData?user=sogun3#gmail.com&pw=baycrane57&format=JSON&param=44201&bdate=20110501&edate=20110501&state=37&county=063"
import urllib2
content = urllib2.urlopen(url).read()
print content
However, if we use this method, it means Spark needs to be running 24/7 to ensure the most up-to-date data is used. How does one configure Spark to run 24/7? Or is there a better way to process all the data and put it neatly in a DataFrame so that it can be accessed easily later?
Also, in a web app, can one still use IPython for data processing, or is IPython just for interacting with the data and exploring it experimentally?
Thanks a lot!
You have options ;-) If you need to read the source EPA data and then process it before you use it in your web app, then you can use the Spark service to ETL (Extract, Transform, Load) the source data from the EPA web site, manipulate or wrangle the data into the shape and size you want, and then save it into a storage service like Bluemix Object Storage. Your web app would then read the data in the format you want directly from Object Storage.
However, if the source EPA data is largely in a format you want to use in the web app, then you most certainly can create RDDs directly from the web site and pull in the data as and when you need it. These datasets look small from my quick peek, so I don't think you need to worry about Spark pulling them directly into memory for you to work on; i.e. there is no need to try to store them locally with Spark in the Bluemix service cluster. Besides, there is currently no HDFS provided by the Spark service, so as mentioned earlier, you would use an external storage service.
Re: "IBM Bluemix said it would provide 5 GB of storage": that space is intended for storing your personal and third-party Spark libraries and such.
re: "spark needs to be running 24-7". The spark service runs 24x7. Your spark code running on the service will run for as long as you program it to run ;-)
IPython (or Jupyter notebooks) is intended as a REPL for the web. So, yes, it is interactive. In your case, you can certainly write your Spark code in an IPython notebook and have it run for as long as necessary, pulling and processing the EPA data for the web app and storing it in, say, Object Storage. The web app can then pull the data it needs from Object Storage. It is said that APIs will be provided for the Spark service in the future, at which point your web app could talk to the Spark service directly; in the meantime, you can certainly make something work with notebooks.
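A minimal sketch of that ETL pattern, assuming the EPA endpoint returns a JSON list of flat records and that Object Storage is reachable from the notebook; the swift:// container path is purely illustrative, and in a Bluemix notebook sc and sqlContext are typically predefined:
import json
import urllib2  # Python 2, as in the question; use urllib.request on Python 3
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(appName="epa-etl-sketch")  # already provided in a notebook
sqlContext = SQLContext(sc)
url = "https://aqs.epa.gov/api/rawData?..."  # same request as in the question
records = json.loads(urllib2.urlopen(url).read())  # adjust parsing to the actual response shape
# Wrangle into a DataFrame and persist to Object Storage for the web app to read
df = sqlContext.createDataFrame(records)
df.write.mode("overwrite").json("swift://notebooks.spark/epa/ozone")  # illustrative container path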

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try my hand at the Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, and then process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported. There are also a few pieces of JNI beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive use?
Can I create something like a pipeline (where the output of one stage goes to another) with Bluemix, or do I need to code that myself?
I would appreciate any and all help with the above questions.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code / batch programs with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta offering was limited to the interactive notebook interface.
Regards
Anup
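As a side note on the "can I call Java code from Python" question: PySpark talks to the JVM through Py4J, so classes on the driver classpath can be reached from a notebook via the SparkContext's (internal) JVM gateway. A rough sketch follows, where MyJavaApi and process() are placeholders for your own Java API, and any JNI pieces still need their native libraries available on the cluster:
# sc is the SparkContext available in the notebook; sc._jvm exposes the JVM via Py4J
# com.example.MyJavaApi and process() are placeholders for your own classes
result = sc._jvm.com.example.MyJavaApi.process("some input")
print(result)
# For true batch jobs, spark-submit with a Java/Scala jar is the more direct route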
