Getting Splunk search results directly from Apache Spark

Small question regarding an integration between Splunk and Apache Spark.
Currently, I run a search query in Splunk. The result is quite large, and I export it as a CSV to share with several teams for downstream work.
Each downstream team ends up loading the CSV as part of an Apache Spark job, converting it to a Dataset, and running map-reduce computations on it.
The Spark jobs differ from team to team, so plugging each team's computation directly into Splunk does not scale.
This leads to my question: instead of each team downloading its own copy of the CSV, is there an API, or some other way, to connect to Splunk search results directly from Apache Spark?
Thank you

Splunk does not have an API specifically for Spark. There is a REST API, a few SDKs, and (perhaps best for you) support for ODBC. With an ODBC/JDBC driver installed on your Spark server and a few saved searches defined on Splunk, you should be able to export results from Splunk to Spark for analysis. See https://www.cdata.com/kb/tech/splunk-jdbc-apache-spark.rst for more information.
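As a rough sketch of the JDBC route, a Spark read through a Splunk JDBC driver (such as the CData one linked above) could look like the following. The driver class name, connection string, and the "SavedSearchResults" table name are illustrative and depend on the driver you install and the saved searches you expose on the Splunk side:

    // Sketch: read a Splunk saved search into a DataFrame over JDBC.
    // All connection details below are placeholders for your environment.
    import org.apache.spark.sql.SparkSession

    object SplunkJdbcRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("splunk-jdbc-read")
          .getOrCreate()

        val results = spark.read
          .format("jdbc")
          .option("driver", "cdata.jdbc.splunk.SplunkDriver") // driver class per your JDBC vendor
          .option("url", "jdbc:splunk:URL=https://splunk-host:8089;user=admin;password=***;")
          .option("dbtable", "SavedSearchResults") // hypothetical table backed by a saved search
          .load()

        results.show(10)
        spark.stop()
      }
    }

Each team could then continue with its own Dataset transformations instead of re-downloading the CSV.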

Related

What exactly is the need for Spark when using Talend?

I am new to both spark and talend.
But I read everywhere that both are ETL tools. I also read another Stack Overflow answer here. What I understood from that answer is that Talend does use Spark for large-scale data processing. But can Talend do all the ETL work that Spark does, and do it efficiently, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually processed by the Spark engine inside Talend?
I am quite confused about this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for native processing on Hadoop, Talend relies on other frameworks such as MapReduce (Hadoop, possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gains from Talend because it is graphical, which is handy when there are many fields and you do not necessarily have the most skilled staff.
For NoSQL stores such as HBase, Talend provides specific connectors, or you can go the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Sending Spark Streaming metrics to OpenTSDB

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from a Spark Streaming job with it? Are there any easy options to do this?
One way to send the metrics to OpenTSDB is to use its REST API. To use it, simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it is written in Java and can therefore be used from Scala). Example code can be found on GitHub.
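For illustration, a minimal sketch of that REST approach, assuming an OpenTSDB instance reachable at opentsdb-host:4242 (the metric name, value, and tags below are made up):

    // Sketch: push one data point to OpenTSDB's /api/put endpoint
    // using Apache HttpClient; callable from Scala or Java code.
    import org.apache.http.client.methods.HttpPost
    import org.apache.http.entity.{ContentType, StringEntity}
    import org.apache.http.impl.client.HttpClients

    object OpenTsdbPut {
      def main(args: Array[String]): Unit = {
        val json =
          s"""{
             |  "metric": "spark.streaming.records.processed",
             |  "timestamp": ${System.currentTimeMillis / 1000},
             |  "value": 1234,
             |  "tags": { "app": "my-streaming-job", "host": "worker-1" }
             |}""".stripMargin

        val client = HttpClients.createDefault()
        try {
          val post = new HttpPost("http://opentsdb-host:4242/api/put")
          post.setEntity(new StringEntity(json, ContentType.APPLICATION_JSON))
          val response = client.execute(post)
          println(s"OpenTSDB responded with ${response.getStatusLine.getStatusCode}")
          response.close()
        } finally {
          client.close()
        }
      }
    }

In a streaming job you would typically batch several data points into one /api/put call per micro-batch rather than posting them one at a time.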
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not merged into Spark itself. The code is available on GitHub and should be usable. Unfortunately, it targets Spark 1.4.1, but in the worst case it should still give you some indication of what needs to be added.

Searching SQL data using Spark

I am creating a Spark job that searches for records (SQL rows) relevant to a keyword using a TF-IDF model. What I am currently doing for testing is to spark-submit the job to get results. Ideally, however, I want to turn this job into a web service so that external users can search records through a REST API. This may generate a number of concurrent requests to run the job when multiple users search for their own keywords through the API.
I wonder whether I should support this scenario with Spark Job Server so that users can submit jobs via its API, or whether you have any suggestions for this particular case based on your past experience. Thanks.
This would be an inappropriate use of Spark. Spark is for analytics jobs. Those take time (maybe less time than old-school MapReduce but time nonetheless), and REST clients demand immediate results.
You are on the right track though. As data comes in, you can use, for example, Spark Streaming and MLlib to process records according to your TF-IDF model and then store the indexed results in your SQL database. Your REST clients then simply query that data, just like the conventional web-with-SQL-backend applications our ancestors once built.
I suppose you could also look into giving admins the ability to start analytics jobs via a REST client too.
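To make the offline-indexing idea concrete, here is a minimal sketch using Spark ML's TF-IDF transformers. The sample documents, column names, and output path are made up, and the final write would be adapted to whatever store your REST layer actually queries:

    // Sketch: build TF-IDF features for documents and persist the index
    // so that a separate web service can serve lookups without Spark.
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    object TfIdfIndexer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("tfidf-indexer").getOrCreate()

        val docs = spark.createDataFrame(Seq(
          (1L, "spark makes distributed computing simple"),
          (2L, "tf idf weighs terms by how rare they are")
        )).toDF("id", "text")

        // Tokenize, hash terms into a fixed feature space, then rescale with IDF.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
        val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

        val featurized = tf.transform(tokenizer.transform(docs))
        val indexed = idf.fit(featurized).transform(featurized)

        // Persisted here as Parquet for simplicity; swap in your SQL sink.
        indexed.select("id", "features").write.mode("overwrite").parquet("/tmp/tfidf-index")
        spark.stop()
      }
    }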

Feed druid with data from graylog2 server

Dear all,
I need to get data from graylog2 server into druid (e.g. CPU, memory, disk utilisation of several machines).
I've searched for plugins in the Graylog marketplace and in the Tranquility documentation, and I did not find any solution for retrieving data from graylog2.
I believe the solution is to use the REST API from graylog2, but how can this be "automated" from the Druid/Tranquility side?
I looked quickly into Graylog2 and couldn't find any documentation about the REST API you mention, nor a documented way to extract data from Graylog2 in general. But if you can come up with something that reads from Graylog2 and dumps to Kafka, you can point the Druid cluster at those Kafka topics and ingest the data.
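As a sketch of that bridge, the Kafka-facing side could look like the following; fetchFromGraylog() is a hypothetical placeholder for however you actually pull records out of Graylog2 (its REST API, GELF forwarding, etc.), and the broker address and topic name are made up. Druid's Kafka ingestion would then be pointed at the same topic:

    // Sketch: forward records pulled from Graylog2 into a Kafka topic
    // that a Druid cluster ingests from.
    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object GraylogToKafkaBridge {
      // Placeholder: replace with a real extraction from Graylog2.
      def fetchFromGraylog(): Seq[String] =
        Seq("""{"host":"web-1","cpu":0.42,"timestamp":"2017-01-01T00:00:00Z"}""")

      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "kafka-broker:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        try {
          fetchFromGraylog().foreach { json =>
            producer.send(new ProducerRecord[String, String]("graylog-metrics", json))
          }
        } finally {
          producer.close()
        }
      }
    }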

Bluemix Apache Spark Metrics

I have been looking for a way to monitor performance in Spark on Bluemix. I know in the Apache Spark project, they provide a metrics service based on the Coda Hale Metrics Library. This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files. Details here: http://spark.apache.org/docs/latest/monitoring.html
Does anyone know of any way to do this in the Bluemix Spark service? Ideally, I would like to save the metrics to a csv file in Object Storage.
Appreciate the help.
Thanks
Saul
Currently, I do not see an option for using the Coda Hale Metrics Library, reporting the job history, or accessing that information via a REST API.
However, on the main page of the Spark history server you can see the event log directory. It refers to the following user directory: file:/gpfs/fs01/user/USER_ID/events/
There I saw JSON-like formatted files.
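If those event log files are all you have to work with, one workaround is to read them back with Spark itself and extract what you need. A minimal sketch, assuming the event log path above (with USER_ID as shown) and that each line of a log file is one JSON event; the filter on SparkListenerTaskEnd is just an example:

    // Sketch: load Spark event logs as JSON and count task-end events per stage.
    import org.apache.spark.sql.SparkSession

    object EventLogReader {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("event-log-reader").getOrCreate()

        // Each line of an event log file is a self-contained JSON event.
        val events = spark.read.json("file:/gpfs/fs01/user/USER_ID/events/*")

        events.filter(events("Event") === "SparkListenerTaskEnd")
          .groupBy("Stage ID")
          .count()
          .show()

        spark.stop()
      }
    }

The resulting DataFrame could then be written out as CSV to Object Storage or wherever you keep your reports.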
