Connecting Apache Spark 3.0 to Kafka using Python3 - python-3.x

I'm developing a small Big Data project and I was wondering whether there is a way to read a stream from a Kafka topic with Spark Streaming 3.0 using Python 3.
I've read in the streaming programming guide (https://spark.apache.org/docs/3.0.0-preview/streaming-programming-guide.html) that it's necessary to link the artifact spark-streaming-kafka-0-10_2.12 to handle this kind of stream, but those dependencies seem incompatible with Python (the integration guide only has examples in Java and Scala, and for a different version of Spark Streaming I've read that there is no support for the Python language: https://spark.apache.org/docs/2.4.6/streaming-kafka-integration.html).
I've also found this link that could answer my question, but...
https://stackoverflow.com/questions/56960981/does-spark-streaming-kafka-0-10-2-10-work-with-python?rq=1
Some more details: currently I have a stream of JSON responses from https://openweathermap.org/api sent every second over a Kafka topic. I want to use this stream to compute trends in the current temperature of a place over the last measurements.
I could switch my current stack choices, so other suggestions are welcome, but I won't change Python as my scripting language.
Thanks in advance.

I've solved it by creating a "connector" script in Python that uses the KafkaConsumer library. It takes data from the stream and publishes them on a TCP socket on localhost. Spark then reads that data using ssc.socketTextStream("127.0.0.1", PORTS).
I used this guide to set up the code: https://www.toptal.com/apache/apache-spark-streaming-twitter.
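For reference, a minimal sketch of that connector approach (the topic name "weather", the broker address localhost:9092 and port 9999 are hypothetical placeholders; KafkaConsumer comes from the kafka-python package):

# connector.py -- bridges a Kafka topic to a local TCP socket for Spark's socketTextStream.
import socket
from kafka import KafkaConsumer

HOST, PORT = "127.0.0.1", 9999                 # hypothetical port; match it in the Spark job

consumer = KafkaConsumer("weather",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: v.decode("utf-8"))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind((HOST, PORT))
server.listen(1)
conn, _ = server.accept()                      # wait for the Spark job to connect

for message in consumer:                       # forward each Kafka record as one line
    conn.sendall((message.value + "\n").encode("utf-8"))

On the Spark side the stream is then consumed with a plain socket text stream, e.g. lines = ssc.socketTextStream("127.0.0.1", 9999), and the usual DStream transformations can be applied to compute the temperature trends.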

Related

Sending Spark streaming metrics to open tsdb

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Could you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from my Spark Streaming job to use this? Are there any easy options to do it?
One way to send the metrics to OpenTSDB is to use its REST API. To use it, simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it's written in Java and can therefore be used from Scala). Example code can be found on GitHub.
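For illustration, a minimal sketch of that REST approach, shown here in Python with the requests library rather than HttpClient (the OpenTSDB host, port and metric/tag names are hypothetical):

# Push a single data point to OpenTSDB's /api/put HTTP endpoint.
import time
import requests

point = {
    "metric": "spark.streaming.processed.records",   # hypothetical metric name
    "timestamp": int(time.time()),
    "value": 1234,
    "tags": {"app": "my-streaming-job", "host": "worker01"},
}

resp = requests.post("http://localhost:4242/api/put", json=point)
resp.raise_for_status()   # OpenTSDB returns 204 No Content on success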
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should be possible to use. Unfortunately the code targets Spark 1.4.1, but in the worst case it should still give some indication of what is necessary to add.

How to visualize log files from Hadoop?

I am just a newbie in the Big Data world, so I do not know how to build a dashboard application for visualizing data from log files in Hadoop. After searching around, I can think of a solution along these lines:
1/ Using Kafka to ingest streaming data
2/ Stream data processing: Streaming Spark or Apache Flink
3/ Front-end --> Visualize data: using d3js
Am I missing something? And between Spark and Flink, which one should I use?
I have a cluster of machines, I've installed Ambari, HDP 2.4.2, HDFS 2.7, YARN 2.7, Spark 1.6, Kafka.
If possible, could you show me some tutorials for building an application like that? Any book or course?
Thanks a lot.
P/s:
I have read the Databricks GitBook, but it only mentions Spark. I also found some tutorials on analyzing logs with Flink, Elasticsearch and Kibana, but they don't mention how to combine that with the Ambari server, which is where I got stuck.
You may take a look at Ambari Log Search feature: https://github.com/abajwa-hw/logsearch-service which visualizes the logs.
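For the Kafka-to-Spark part of the pipeline sketched in the question, a minimal PySpark sketch for the Spark 1.6 era might look like this (it assumes the spark-streaming-kafka package is on the classpath; the broker address, topic name and log line format are hypothetical):

# Count log lines per level in each 10-second batch, read from a Kafka topic.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LogLevelCounts")
ssc = StreamingContext(sc, 10)

stream = KafkaUtils.createDirectStream(
    ssc, ["app-logs"], {"metadata.broker.list": "localhost:9092"})

# Assumption: each log line starts with its level, e.g. "ERROR something failed".
counts = (stream.map(lambda kv: kv[1])                  # keep the message value
                .map(lambda line: (line.split(" ", 1)[0], 1))
                .reduceByKey(lambda a, b: a + b))

counts.pprint()   # replace with a sink that feeds the d3.js front end

ssc.start()
ssc.awaitTermination()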

Port existing php application in spark streaming

We have a huge existing application in php which
Accepts a log file
Initialises all the database and in-memory store resources
Processes every line
Creates a set of output files
The above process happens per input file.
Input files are written by a Kafka consumer. Is it possible to fit this application into Spark Streaming somehow, without porting all the code to Java? For example, in the following manner:
Get a message from a Kafka topic
Pass this message to Spark Streaming
Spark Streaming somehow interacts with the legacy app and generates output
Spark then writes the output back to Kafka
What I have just described is very high level. I just want to know whether it's possible to do this without recoding the existing app in Java, and can anyone please tell me roughly how it can be done?
I think there is no way to use PHP in Spark directly. According to the documentation (http://spark.apache.org/) and my knowledge, it supports only Java, Scala, R and Python.
However, you can change the architecture of your app, create some external services (web services, REST, etc.) and use them from Spark (you can use whichever library you want); not all modules of the old app must be rewritten in Java. I would try to go that way :)
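As an illustration of that approach, here is a hedged PySpark sketch that sends each incoming line to a hypothetical HTTP endpoint wrapping the existing PHP logic and collects the responses (the endpoint URL, payload shape and input source are all assumptions):

# Call an external REST service (e.g. the wrapped PHP app) from a Spark Streaming job.
import requests
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def process_partition(lines):
    # One HTTP session per partition keeps connection overhead low.
    session = requests.Session()
    for line in lines:
        resp = session.post("http://legacy-app:8080/process",   # hypothetical endpoint
                            data=line.encode("utf-8"))
        yield resp.text

sc = SparkContext(appName="LegacyBridge")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("127.0.0.1", 9999)    # or a Kafka DStream
results = lines.mapPartitions(process_partition)
results.pprint()                                   # or write back to Kafka

ssc.start()
ssc.awaitTermination()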
I think Storm is an excellent choice in this case because it offers non-JVM language integration through Thrift. Also, I am sure there is a PHP Thrift client.
So basically what you have to do is find a ShellSpout and ShellBolt written in PHP (this is the integration part needed to interact with Storm from your application) and then write your own spouts and bolts which consume Kafka and process each line.
You can use this library for your need:
https://github.com/Lazyshot/storm-php
Then you will also have to find a PHP Thrift client to interact with the Storm cluster.
The Storm Thrift definition can be found here:
https://github.com/apache/storm/blob/master/storm-core/src/storm.thrift
And a PHP Thrift client example can be found here:
https://thrift.apache.org/tutorial/php
Now putting these things together you can write your own Apache Storm app in PHP.
Information sources:
http://storm.apache.org/about/multi-language.html
http://storm.apache.org/releases/current/Using-non-JVM-languages-with-Storm.html

Connecting Bluemix virtual sensors to an instance of Spark service

I am new to Bluemix and also to Apache Spark. I just wanted to do a small task using IBM Analytics for Apache Spark where I create a virtual sensor using Bluemix's Virtual Sensors (https://virtualsensors.mybluemix.net/) and use the generated data as input to the Spark Streaming service and do some analytics based on it. But I don't know exactly how to connect the instances of those two applications, and I am stuck. It would be great if someone could help me.
Thanks,
From the documentation the Virtual Sensors just emit their sensor data using MQTT, so I imagine this would be as easy as importing an MQTT library in your language of choice and simply connecting that to the Virtual Sensors.
You haven't really specified what language you're working with on the Spark side, but they'll probably all shake out to either:
Paho (Python, Java, Scala)
Scala-MQTT-client (specifically Scala)
For how to use it, the Paho project also includes some basic documentation about how MQTT works.
Some of the other basics are covered in the MQTT FAQ and this youtube video.
If you need to add the JAR to your notebook, you should be able to use the %AddJar command. You can read about that here -- scroll down to the section titled "Deploy your custom library jar to a Jupyter Notebook" for the instructions and example use.
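If you go the Paho route from Python, a minimal subscriber sketch might look like this (the broker host, port and topic filter are placeholders; the Virtual Sensors documentation provides the real connection details):

# Subscribe to the Virtual Sensors' MQTT feed with paho-mqtt.
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    print("Connected with result code", rc)
    client.subscribe("iot-2/type/+/id/+/evt/+/fmt/json")   # hypothetical topic filter

def on_message(client, userdata, msg):
    # Each payload is one sensor reading; hand it off to Spark from here.
    print(msg.topic, msg.payload.decode("utf-8"))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.com", 1883, 60)              # hypothetical broker
client.loop_forever()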
I would suggest you go through this recipe, which shows how to configure Apache Spark Streaming running in IBM Bluemix to get data from actual sensor devices. I believe you can just tweak the topic id to get the data from a virtual sensor as well.
Also, look at the GitHub project that shows how to create the Spark-mqtt-connector DStream so that the Spark service can consume the events in real time.
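On the Spark side, pre-2.0 releases of Spark Streaming shipped an MQTT connector; a minimal PySpark sketch might look like the following (it assumes the spark-streaming-mqtt package is on the classpath, and the broker URL and topic are placeholders):

# Consume MQTT messages as a DStream with the Spark 1.x MQTT connector.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils

sc = SparkContext(appName="VirtualSensorStream")
ssc = StreamingContext(sc, 5)

# Each record in the DStream is the string payload of one MQTT message.
readings = MQTTUtils.createStream(ssc, "tcp://broker.example.com:1883", "sensors/temperature")
readings.pprint()

ssc.start()
ssc.awaitTermination()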

cannot get kafka sample producer / consumer to run

I finally managed to set up ZooKeeper and Kafka 0.7.2 on a CentOS box and tried to run the console-producer.sh sample.
[root@syslogtest bin]# bash kafka-console-producer.sh
Error: could not find or load main class kafka.producer.ConsoleProducer
Apart from the http://kafka.apache.org/quickstart.html guide, are there any other links someone could suggest for learning Kafka? Just to get set up I had to find resources other than that site, which is turning out to be frustrating.
Any suggestions appreciated.
Since you are starting to learn Kafka, you might want to consider learning version 0.8 as there are several important updates like replication. (https://cwiki.apache.org/confluence/display/KAFKA/Changes+in+Kafka+0.8)
You can check https://cwiki.apache.org/confluence/display/KAFKA/Index for concrete examples on setting up consumer and producer.
Finally there's an active community for Kafka at http://grokbase.com/g/kafka/
