Feed druid with data from graylog2 server - graylog2

Dear all,
I need to get data from graylog2 server into druid (e.g. CPU, memory, disk utilisation of several machines).
I've searched for plugins on the Graylog marketplace and in the Tranquility documentation, and I did not find any solution for retrieving data from Graylog2.
I believe the solution is to use the REST API from graylog2, but how can this be "automated" from the druid/tranquility side?

I looked quickly into Graylog2 and couldn't find any documentation about the REST API you mention, nor a documented way to extract data from Graylog2 in general. But if you can come up with something that reads from Graylog2 and dumps to Kafka, you can point the Druid cluster at those Kafka topics and ingest the data.
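As a rough sketch of that approach, a small poller could pull from Graylog2's search REST endpoint and write the raw JSON to a Kafka topic for Druid/Tranquility to ingest. The endpoint path, topic name, and polling interval below are assumptions, and authentication and JSON parsing are omitted:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

object GraylogToKafka extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  // Hypothetical Graylog2 search URL; the exact path, query and auth depend on your Graylog version
  val searchUrl = "http://graylog-host:9000/api/search/universal/relative?query=*&range=300"

  while (true) {
    // Fetch the latest search result as JSON and push it to the topic Druid ingests from
    val json = Source.fromURL(searchUrl).mkString
    producer.send(new ProducerRecord[String, String]("graylog-metrics", json))
    Thread.sleep(60000) // poll once a minute
  }
}
```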

Related

Getting Splunk search result directly from Apache Spark

Small question regarding an integration between Splunk and Apache Spark.
Currently, I am doing a search query in Splunk. The result is quite big, and I am exporting it as a CSV to share with several teams for downstream work.
Each downstream team ends up loading the CSV as part of an Apache Spark job, converting it to a Dataset, and running map-reduce on it.
The Spark jobs differ from team to team, so simply plugging each team's computation into Splunk directly does not scale well.
This leads to my question: instead of each team having to download a copy of the CSV, is there an API or some other way to connect to the Splunk search result from Apache Spark directly?
Thank you
Splunk does not have an API specifically for Spark. There is a REST API, a few SDKs, and (perhaps best for you) support for ODBC. With an ODBC/JDBC driver installed on your Spark server and a few saved searches defined on Splunk, you should be able to export results from Splunk to Spark for analysis. See https://www.cdata.com/kb/tech/splunk-jdbc-apache-spark.rst for more information.
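A minimal sketch of that route from the Spark side, assuming the CData Splunk JDBC driver is on the classpath (the driver class, connection URL format, and saved-search "table" name below follow CData's conventions and should be verified against the driver you actually install):

```scala
import org.apache.spark.sql.SparkSession

object SplunkToSpark extends App {
  val spark = SparkSession.builder().appName("splunk-jdbc-read").getOrCreate()

  // Driver class, connection URL and table name are placeholders based on the CData docs
  val results = spark.read
    .format("jdbc")
    .option("driver", "cdata.jdbc.splunk.SplunkDriver")
    .option("url", "jdbc:splunk:URL=https://splunk-host:8089;User=admin;Password=changeme;")
    .option("dbtable", "MySavedSearchResults")
    .load()

  // Each team can now work with the shared search result as a DataFrame instead of a CSV copy
  results.show()
}
```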

Log processing using ELK stack

My requirement is:
I have log files that I need to process, and I would also like to enrich the log information with some data that I have in a Postgres DB.
Step 1. I plan to feed data from above two sources (log file and database) to kafka topics, using logstash
Step 2. I plan to use kafka stream to join data on different kafka topics and push them to elastic search via API calls.
My doubt is about step 2,
Is Kafka Streams the way to go, or can I use Apache Spark, which I believe can do the same?
Any help on this is appreciated.
Step 1. I plan to feed data from above two sources (log file and database) to kafka topics, using logstash
If you're already using Apache Kafka, then note that you can use Kafka Connect for integrating systems, including databases, into Kafka. For information on integrating databases, see this article.
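For illustration, here is roughly what a Confluent JDBC source connector configuration could look like when POSTed to the Kafka Connect REST API to stream a Postgres table into a topic; the connection details, table, column, and topic names are placeholders:

```json
{
  "name": "postgres-enrichment-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://postgres-host:5432/mydb",
    "connection.user": "myuser",
    "connection.password": "secret",
    "table.whitelist": "enrichment_data",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "db-"
  }
}
```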
Step 2. I plan to use kafka stream to join data on different kafka topics and push them to elastic search via API calls.
My doubt is about step 2: is Kafka Streams the way to go, or can I use Apache Spark, which I believe can do the same? Any help on this is appreciated.
Yes, Kafka Streams is a good fit for this. It can enrich events as they flow through a topic, using data from other topics. These topics can be sourced from any system, including log files, databases, etc. Here is example code of such a join, and the documentation for it.
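A minimal sketch of such a stream-table join, assuming both topics carry string keys and values and the enrichment topic is keyed the same way as the log events (topic names and the enriched output format are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, KStream, KTable, Produced, ValueJoiner}

object LogEnrichment extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-enrichment")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Stream of log events keyed by the join key (e.g. a host or user id)
  val logs: KStream[String, String] =
    builder.stream("log-events", Consumed.`with`(Serdes.String(), Serdes.String()))

  // Latest state of the Postgres reference data, fed into this topic by Logstash or Kafka Connect
  val reference: KTable[String, String] =
    builder.table("db-reference", Consumed.`with`(Serdes.String(), Serdes.String()))

  // Enrich each log event with the matching reference record
  val enriched: KStream[String, String] = logs.join(
    reference,
    new ValueJoiner[String, String, String] {
      override def apply(log: String, ref: String): String = s"$log | $ref"
    }
  )

  enriched.to("enriched-logs", Produced.`with`(Serdes.String(), Serdes.String()))

  new KafkaStreams(builder.build(), props).start()
}
```

The enriched topic can then be pushed to Elasticsearch with the Kafka Connect Elasticsearch sink rather than hand-written API calls.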
BTW, you might also want to check out KSQL. KSQL is built on Kafka Streams, so you get the same scalability and elasticity, but with a SQL abstraction that you can run directly (no coding needed). For an example of using KSQL to enrich streams of data, see this talk or this article.
(Disclosure: I work for Confluent, who lead the open-source KSQL project)

Twitter data harvesting

For my project, I need to harvest data from Twitter.
I am currently facing two design choices:
What is the best software architecture? I read that Spark has Twitter support, but I am not familiar with Scala. On the other hand, Apache Spark still seems a good option, but then I'm not sure how to save the data to a common sink.
I have some budget constraints. I surely need one server to do the sink and the processing. However, for the data harvesting, I don't know whether several VMs/containers offer a better performance/cost ratio than a bunch of Raspberry Pis running Kafka producers.
Take a look at the Confluent Platform and especially Kafka Connect [1].
There is a Twitter connector out of the box. All the Twitter data will be streamed to Kafka.
[1] https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
Agree with @leshkin that Kafka Connect is the most natural fit. However, the Twitter connector (available on GitHub here) does not require Confluent Platform, just Kafka Connect, which is a standard part of the Apache Kafka distribution: https://kafka.apache.org/documentation/#connect
If you choose, you can run Kafka Connect workers in distributed mode to divide the load across several VMs/containers/boxes, and these don't have to be the same boxes that run your Kafka brokers (they only need the relevant libs from Kafka, the libs for the connector, and Java, of course).
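For reference, each worker is just a standard Kafka Connect process started from a properties file along these lines (broker addresses, topic names and paths are placeholders). Start the same config on every VM/container and the workers form one Connect cluster; the Twitter connector is then created once through the Connect REST API and its tasks are balanced across the workers:

```properties
# connect-distributed.properties (values are placeholders)
bootstrap.servers=kafka-broker-1:9092,kafka-broker-2:9092
group.id=twitter-connect-cluster

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# Internal topics the workers use to share connector configs, offsets and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# Directory containing the Twitter connector jars on each worker
plugin.path=/opt/connectors
```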

Sending Spark streaming metrics to open tsdb

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from the Spark Streaming job to use this? Are there any easy options to do it?
One way to send the metrics to OpenTSDB is to use its REST API. To use it, simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it's in Java and can therefore be used from Scala). Example code can be found on GitHub.
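A minimal sketch of that approach, assuming OpenTSDB's standard /api/put endpoint and Apache HttpClient 4.x (the host, metric name and tags are placeholders):

```scala
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.{ContentType, StringEntity}
import org.apache.http.impl.client.HttpClients

object OpenTsdbClient {
  // Placeholder host/port; point this at your OpenTSDB instance
  private val putUrl = "http://opentsdb-host:4242/api/put"

  def send(metric: String, value: Double, tags: Map[String, String]): Unit = {
    val tagsJson = tags.map { case (k, v) => s""""$k": "$v"""" }.mkString(", ")
    val json =
      s"""{"metric": "$metric", "timestamp": ${System.currentTimeMillis / 1000}, "value": $value, "tags": {$tagsJson}}"""

    val client = HttpClients.createDefault()
    try {
      val post = new HttpPost(putUrl)
      post.setEntity(new StringEntity(json, ContentType.APPLICATION_JSON))
      client.execute(post).close() // fire-and-forget; check the status code in real code
    } finally client.close()
  }
}
```

From a Spark Streaming job you could call something like this inside foreachRDD/foreachPartition, reusing the HTTP client per partition rather than creating one per record.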
A more elegant solution would be to use the Spark metrics library and add a sink to the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should be possible to use. Unfortunately, the code targets Spark 1.4.1; however, in the worst case it should still give some indication of what needs to be added.

Read the messages using pub/sub and load it into Datastax on Google Cloud

I have deployed a 9-node Datastax cluster on Google Cloud.
Now I've been given a requirement to read data from queues and load it into Datastax on Google Cloud (continuous streaming).
I know Pub/Sub can read the data from the queue, but I'm not sure whether it can load it directly into Datastax.
Is it possible for Pub/Sub to load the data into Datastax on Google Cloud?
I am new to Pub/Sub, so I'm not sure where or how to start.
Thanks,
You probably need custom code to do this. You'd have a consumer to get messages from the topic you're interested in, and then you'd have custom code to convert those messages and load them into Datastax. I'm assuming that Datastax has an API for this.
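A rough sketch of that consumer, assuming the Google Cloud Pub/Sub Java client and the DataStax Java driver 4.x (the project, subscription, contact point, keyspace and table names are placeholders, and message parsing is reduced to storing the raw payload):

```scala
import java.net.InetSocketAddress
import com.datastax.oss.driver.api.core.CqlSession
import com.google.cloud.pubsub.v1.{AckReplyConsumer, MessageReceiver, Subscriber}
import com.google.pubsub.v1.{ProjectSubscriptionName, PubsubMessage}

object PubSubToDatastax extends App {
  // Placeholder GCP project and subscription names
  val subscription = ProjectSubscriptionName.of("my-gcp-project", "my-subscription")

  // Connect to the Datastax/Cassandra cluster running on Google Cloud
  val session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("cassandra-host", 9042))
    .withLocalDatacenter("dc1")
    .build()

  val insert = session.prepare("INSERT INTO ks.events (id, payload) VALUES (?, ?)")

  // Convert each Pub/Sub message, write it to Cassandra, then ack it
  val receiver = new MessageReceiver {
    override def receiveMessage(message: PubsubMessage, consumer: AckReplyConsumer): Unit = {
      session.execute(insert.bind(message.getMessageId, message.getData.toStringUtf8))
      consumer.ack()
    }
  }

  val subscriber = Subscriber.newBuilder(subscription, receiver).build()
  subscriber.startAsync().awaitRunning()
  subscriber.awaitTerminated() // keep the process alive while messages stream in
}
```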
