I finally managed to set up ZooKeeper and Kafka 0.7.2 on a CentOS box and tried to run the console-producer.sh sample.
[root@syslogtest bin]# bash kafka-console-producer.sh
Error: could not find or load main class kafka.producer.ConsoleProducer
Apart from the http://kafka.apache.org/quickstart.html guide, are there any other links someone could suggest for learning Kafka? Even just for the setup, I had to find resources other than this site, which is turning out to be frustrating.
Any suggestions appreciated.
Since you are starting to learn Kafka, you might want to consider learning version 0.8, as it brings several important updates such as replication (https://cwiki.apache.org/confluence/display/KAFKA/Changes+in+Kafka+0.8).
You can check https://cwiki.apache.org/confluence/display/KAFKA/Index for concrete examples of setting up a consumer and producer.
Finally, there's an active community for Kafka at http://grokbase.com/g/kafka/
Related
I'm fairly new to Kafka and came across a repo on GitHub (https://github.com/aber0016/Real_Time_Big_Data_Streaming_Spark_Kafka) that defines a Kafka consumer in one notebook and a producer in another. I want to know whether it is possible to run both the producer and the consumer in a single Google Colab notebook, because as far as I know we need to run the consumer in one terminal and the producer in another, and that doesn't seem to work on Colab.
Many thanks in advance for any help with this; I've been stuck for two weeks now.
Yes, it's possible. You need to use batch reading for consumption, though, since streaming jobs run indefinitely.
Example - https://github.com/OneCricketeer/docker-stacks/blob/master/hadoop-spark/spark-notebooks/kafka-sql.ipynb
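As a rough illustration of the batch-reading idea, here is a minimal PySpark sketch (assuming pyspark is installed in the notebook environment; the broker address, topic name and connector version are placeholders to adapt):

from pyspark.sql import SparkSession

# Minimal sketch of a batch read from Kafka; unlike a streaming query, this cell terminates.
# Broker address, topic name and connector version are assumptions.
spark = (SparkSession.builder
         .appName("kafka-batch-read")
         .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0")
         .getOrCreate())

df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "demo-topic")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load())

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)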
You could also run both without using Spark, e.g. with plain Kafka clients in Python (see the sketch below).
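A minimal sketch of that, assuming the kafka-python package and a broker reachable from the notebook (the broker address and topic name are placeholders):

from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "demo"              # hypothetical topic name

# Produce a few messages from the same cell.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(5):
    producer.send(TOPIC, f"message {i}".encode("utf-8"))
producer.flush()

# Consume them right after; consumer_timeout_ms makes the loop stop instead of
# blocking forever, which is what you want inside a notebook.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for record in consumer:
    print(record.value.decode("utf-8"))
consumer.close()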
First of all, my goal is to be able to run Spark code inside JupyterHub; in other words, I want to connect a remote Spark cluster to JupyterHub. After searching, I came up with two solutions: 1) Livy and 2) sparkmagic. I have tried Livy, but since it doesn't support Spark version 3 I have put it aside; we also considered sparkmagic, but we couldn't install it or even find good documentation about it.
What came to mind was to somehow merge the Spark and JupyterHub images into one to get what we need.
Does anyone have any idea how this can be done, or a better suggestion?
Even good documentation about sparkmagic that we could use would be wonderful.
Thank you for your help.
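One idea worth sketching (not a full answer): if PySpark is installed in the JupyterHub single-user image, a notebook can talk to a remote cluster directly by pointing its own session at the master URL, without Livy or sparkmagic. The master address, port and memory setting below are assumptions, and the PySpark version in the image has to match the cluster's:

from pyspark.sql import SparkSession

# Sketch only: connect the notebook's PySpark session to a remote standalone cluster.
# "spark-master.example.com:7077" is a hypothetical master address.
spark = (SparkSession.builder
         .master("spark://spark-master.example.com:7077")
         .appName("jupyterhub-remote-spark")
         .config("spark.executor.memory", "2g")
         .getOrCreate())

spark.range(10).show()   # quick smoke test against the remote cluster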
I'm developing a little Big Data project and I was wondering whether there is a way to read a stream from a Kafka topic with Spark Streaming 3.0 using Python 3.
I've read in the https://spark.apache.org/docs/3.0.0-preview/streaming-programming-guide.html guide that it is necessary to link the artifact spark-streaming-kafka-0-10_2.12 to handle this kind of stream, but I've found that those dependencies are incompatible with Python (the integration guide only shows examples in Java or Scala, and for a different version of Spark Streaming I've read that there is no Python support: https://spark.apache.org/docs/2.4.6/streaming-kafka-integration.html).
I've also found this link that could answer my question, but...
https://stackoverflow.com/questions/56960981/does-spark-streaming-kafka-0-10-2-10-work-with-python?rq=1
Some more details: I have a stream of JSON responses from https://openweathermap.org/api sent every second over a Kafka topic. I want to use this stream to calculate trends in the current temperature of a place over the last measurements.
I could change my current stack choices, so other suggestions are welcome, but I won't change Python as my scripting language.
Thanks in advance.
I solved it by creating a "connector" script in Python using the KafkaConsumer class from kafka-python. It takes data from the stream and publishes it on a TCP socket on localhost. Spark then reads that data using ssc.socketTextStream("127.0.0.1", PORT).
I used this guide to set up the code: https://www.toptal.com/apache/apache-spark-streaming-twitter.
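For anyone landing here, a rough sketch of that approach; the topic name, broker address and port are placeholders I'm assuming, not values from the original setup:

# connector.py - forward Kafka messages to a local TCP socket (sketch)
import socket
from kafka import KafkaConsumer  # pip install kafka-python

HOST, PORT = "127.0.0.1", 9999   # assumed port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)
conn, _ = server.accept()        # wait until the Spark job connects

consumer = KafkaConsumer("weather", bootstrap_servers="localhost:9092")
for msg in consumer:
    conn.sendall(msg.value + b"\n")  # socketTextStream splits records on newlines

# streaming_job.py - read the forwarded records with Spark Streaming (sketch)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="weather-trends")
ssc = StreamingContext(sc, 5)                    # 5-second micro-batches
lines = ssc.socketTextStream("127.0.0.1", 9999)  # same assumed port as above
lines.pprint()                                   # replace with the trend computation
ssc.start()
ssc.awaitTermination()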
I just want to learn Kafka and Spark Streaming on my local machine (macOS Sierra).
Maybe Docker is a good idea?
Seems like what you need is described here:
"If you've always wanted to try Spark Streaming, but never found a time to give it a shot, this post provides you with easy steps on how to get development setup with Spark and Kafka using Docker."
Example application here
I am just a newbie in the Big Data world, so I do not know how to build a dashboard application for visualizing data from log files in Hadoop. After searching around, I can think of the following solution:
1/ Using Kafka to ingest the streaming data
2/ Stream data processing: Spark Streaming or Apache Flink (a rough sketch of the Spark option follows below)
3/ Front end --> visualizing the data with d3.js
Am I missing something? And which one should I use, Spark or Flink?
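A minimal sketch of what step 2 could look like with the DStream API that ships with Spark 1.6 (the topic name, broker address and log-level parsing are assumptions for illustration, and the spark-streaming-kafka package must be on the classpath):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka package

sc = SparkContext(appName="log-dashboard-feed")
ssc = StreamingContext(sc, 10)                   # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["logs"], {"metadata.broker.list": "broker1:9092"})

# Each record is a (key, value) pair; assume the value is a raw log line whose first token is the level.
levels = stream.map(lambda kv: kv[1].split(" ")[0])
counts = levels.countByValue()                   # e.g. ("ERROR", 12) per batch
counts.pprint()                                  # a real app would push this to the d3.js front end

ssc.start()
ssc.awaitTermination()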
I have a cluster of machines on which I've installed Ambari, HDP 2.4.2, HDFS 2.7, YARN 2.7, Spark 1.6 and Kafka.
If possible, could you guys point me to some tutorials for building an application like that? Any book or course?
Thanks a lot.
P.S.:
I have read the Databricks GitBook, but it only covers Spark. I also found some tutorials on analyzing logs with Flink, Elasticsearch and Kibana, but they don't mention how to combine this with Ambari Server, which is where I got stuck.
You may take a look at the Ambari Log Search feature (https://github.com/abajwa-hw/logsearch-service), which visualizes the logs.