Download data from http using Python Spark streaming - apache-spark

I am new to PySpark, and I installed a single-node, single-broker Kafka on Ubuntu 14.04.
After installation I tested that Kafka can send and receive data using kafka-console-producer and kafka-console-consumer.
Below are the steps I followed.
Start a consumer to consume messages:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic kafkatopic --from-beginning
Start a producer to send messages in a new terminal window:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafkatopic
[2016-09-25 7:26:58,179] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
Good morning
Future big data
this is test message
In the consumer terminal:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic kafkatopic --from-beginning
Good morning
Future big data
this is test message
The link below, from meetup.com, produces streaming data:
http://stream.meetup.com/2/rsvps
My requirement is to collect the streaming data from the HTTP site into Spark using Kafka. What is the transformation command to download streaming data?
After downloading the data, I want to find the count by city and run other analyses for a particular time interval.

There are different ways to process real-time streaming data. The one I am considering is like the one below.
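One possible shape for such a pipeline, offered as a sketch under assumptions rather than as the approach the answer had in mind: a small bridge reads the HTTP stream line by line and publishes each line to a Kafka topic, and a consumer then counts RSVPs per city. The helper names forward_lines and count_by_city are hypothetical, the sketch assumes each line carries one JSON document, and the group.group_city field is an assumption about the RSVP layout that should be checked against real messages.

```python
import json
from collections import Counter


def forward_lines(lines, send):
    """Pass each non-empty line to send() as UTF-8 bytes; return how many were sent."""
    sent = 0
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip()
        if text:
            send(text.encode("utf-8"))
            sent += 1
    return sent


def count_by_city(messages):
    """Count messages per city, skipping anything that is not a JSON object."""
    counts = Counter()
    for msg in messages:
        try:
            record = json.loads(msg)
        except ValueError:
            continue
        if not isinstance(record, dict):
            continue
        # Assumption: the city is stored under group.group_city.
        group = record.get("group")
        city = group.get("group_city") if isinstance(group, dict) else None
        if city:
            counts[city] += 1
    return counts
```

With the kafka-python package (an assumption; any producer client would do), send could be lambda v: producer.send("kafkatopic", v), and lines could be the response object returned by urllib.request.urlopen("http://stream.meetup.com/2/rsvps"), which can be iterated line by line. In Spark Streaming the same count would typically be a map to (city, 1) pairs followed by reduceByKeyAndWindow.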

Related

Spark Streaming task shutdown gracefully when Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms them, and writes the results to another Kafka topic. I am confused about how to prevent data loss when the application restarts, on both the Kafka read side and the output side. Can setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or elsewhere that offers fast, fault-tolerant lookups).
However, if you process some records and write the results before you are able to commit offsets, you will end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that is only with proper offset management. For example, set enable.auto.commit to false in the Kafka properties, then commit only after you have processed the data and written it to its destination.
If you're just moving data between Kafka topics, Kafka Streams is the library included with Kafka for that, and it doesn't require YARN or a cluster scheduler.
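The commit-after-write ordering described above can be sketched without any Kafka client. This is a minimal illustration, assuming records arrive as (offset, value) pairs in offset order; process_then_commit, transform, write and commit are hypothetical names standing in for whatever the real consumer and sink provide.

```python
def process_then_commit(records, transform, write, commit):
    """Write every transformed record, then commit once at the end.

    If write() raises, commit() is never reached, so the same records are
    read again after a restart (at-least-once delivery, possible duplicates).
    """
    last_offset = None
    for offset, value in records:
        write(transform(value))
        last_offset = offset
    if last_offset is not None:
        # Kafka convention: the committed offset is the next one to read.
        commit(last_offset + 1)
    return last_offset
```

Because a crash between write and commit replays records, the destination has to tolerate duplicates (for example through idempotent writes) to get an exactly-once effect.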

Kafka OffsetOutOfRangeException

I am streaming loads of data through Kafka, and Spark Streaming consumes these messages. Further down the line, Spark Streaming throws this error:
kafka.common.OffsetOutOfRangeException
I am aware of what this error means, so I changed the retention policy to 5 days. However, I still encountered the same issue. I then listed all the messages for a topic using --from-beginning in Kafka. Sure enough, a ton of messages from the beginning of the Kafka stream were no longer present, and since Spark Streaming is a little behind the Kafka producer side, it tries to consume messages that Kafka has already deleted. I thought changing the retention policy would take care of this:
--add-config retention.ms=....
What I suspect is happening is that Kafka is deleting messages from the topic to free up space for new messages (because we are streaming tons of data). Is there a property I can configure that specifies how many bytes of data Kafka can store before deleting older messages?
You can set the maximum size of the topic when you create it, using the topic configuration property retention.bytes from the console, like this:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config retention.bytes=10485760
Alternatively, you can use the global broker configuration property log.retention.bytes to set the maximum size for all topics.
It is important to know that log.retention.bytes doesn't enforce a hard limit on topic size; it only signals to Kafka when to start deleting the oldest messages.
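To see why a size limit can remove messages long before a time-based retention expires, a rough back-of-the-envelope estimate may help. This is only a sketch: retention_window_seconds is a hypothetical helper, the ingest rate is an assumed steady figure per partition, and real deletion happens in whole log segments, so actual behavior is coarser than this.

```python
def retention_window_seconds(retention_bytes, ingest_bytes_per_sec):
    """Rough time a message survives under size-based retention, per partition."""
    if ingest_bytes_per_sec <= 0:
        raise ValueError("ingest rate must be positive")
    return retention_bytes / ingest_bytes_per_sec


# Example: the 10485760-byte limit from the command above, with an assumed
# ingest rate of 1 MiB per second into the partition.
window = retention_window_seconds(10485760, 1024 * 1024)
```

Under those assumed numbers the window is about ten seconds, so a consumer that lags further behind than that would ask for offsets that no longer exist.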
Another way to address this problem is to set the following Spark parameter in the configuration:
spark.streaming.kafka.maxRatePerPartition
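As a small worked example of what this setting bounds: it is expressed in records per second per partition, so the largest batch the direct stream can pull is that rate multiplied by the number of partitions and the batch interval. The function name and the sample numbers below are illustrative assumptions, not values from the question.

```python
def max_records_per_batch(max_rate_per_partition, partitions, batch_seconds):
    """Upper bound on records in one batch when maxRatePerPartition is set."""
    return max_rate_per_partition * partitions * batch_seconds


# Assumed figures: 1000 records/s per partition, 4 partitions, 5-second batches.
cap = max_records_per_batch(1000, 4, 5)
```

Note that the cap limits how fast Spark reads; if producers write faster than the cap for long enough, the consumer still falls further behind.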

Spark Streaming from Kafka Consumer

I might need to work with Kafka and I am absolutely new to it. I understand that there are Kafka producers which publish logs (called events, messages, or records in Kafka) to Kafka topics.
I will need to read from Kafka topics via a consumer. Do I need to set up the consumer API first and then stream using the Spark StreamingContext (PySpark), or can I use the KafkaUtils module directly to read from Kafka topics?
In case I need to set up a Kafka consumer application, how do I do that? Please share links to the right docs.
Thanks in Advance!!
Spark provides a built-in Kafka stream, so you don't need to create a custom consumer. There are two approaches to connect to Kafka: 1. the receiver-based approach, and 2. the direct approach.
For more detail, go through this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
There's no need to set up a Kafka consumer application; Spark itself creates a consumer, using one of two approaches. One is the receiver-based approach, which uses KafkaUtils.createStream, and the other is the direct approach, which uses KafkaUtils.createDirectStream.
In case of a failure in Spark Streaming, there is no loss of data; it resumes from the offset where you left off.
For more details,use this link: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
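A minimal sketch of the direct approach in PySpark follows. It assumes a Spark 1.x or 2.x installation with the spark-streaming-kafka-0-8 integration on the classpath, since the pyspark.streaming.kafka module was removed in Spark 3. The helper names direct_stream_params and build_direct_stream are hypothetical; KafkaUtils.createDirectStream and the metadata.broker.list parameter belong to that integration.

```python
def direct_stream_params(brokers):
    """Kafka parameters for the direct approach: broker addresses, no ZooKeeper."""
    return {"metadata.broker.list": ",".join(brokers)}


def build_direct_stream(ssc, topic, brokers):
    """Create a direct DStream of (key, value) pairs for a single topic."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.streaming.kafka import KafkaUtils

    return KafkaUtils.createDirectStream(ssc, [topic], direct_stream_params(brokers))
```

Given a StreamingContext named ssc, build_direct_stream(ssc, "kafkatopic", ["localhost:9092"]).map(lambda kv: kv[1]) would yield the message values; the topic and broker address here are placeholders.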

What happens to data sent to NodeRed output node which is currently down?

I am currently implementing a flow on Node-RED where an MQTT subscriber node sends data to a Kafka producer node, i.e. an output node on Node-RED.
If the Kafka producer node is not able to send data because the remote Kafka is down, what happens to the data pushed to the Kafka producer node from the MQTT subscriber node?
I cannot afford to lose a single data set.
That will depend on how the Kafka producer node has been written, but from a quick look at the source it seems to just log an error and throw the message away if there is a problem delivering it to Kafka.
There is no retry or queuing built into Node-RED; it would have to be added to a given output node. The problem comes with working out what should be kept and for how long, and whether it should be stored on disk or in memory...
I solved the issue by adding a "Catch" node, which catches the error thrown by the Kafka producer node (it also throws the data along with the error when the remote cluster is unavailable). The data can then be extracted and sent again to a new cluster.
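The catch-and-resend idea is not specific to Node-RED. As a language-neutral illustration in Python (not Node-RED code), the sketch below keeps failed messages in a bounded in-memory queue and retries them in order. RetryBuffer and max_pending are hypothetical names, and keeping the queue in memory is an assumption: anything held this way is lost if the process itself dies, which is the disk-versus-memory question raised above.

```python
from collections import deque


class RetryBuffer:
    """Hold messages that could not be sent and retry them in arrival order."""

    def __init__(self, send, max_pending=1000):
        self._send = send
        self._pending = deque()
        self._max_pending = max_pending

    def submit(self, message):
        """Queue a message, then try to flush; refuse it if the buffer is full."""
        if len(self._pending) >= self._max_pending:
            raise OverflowError("retry buffer full; message not accepted")
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Send queued messages in order, stopping at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except Exception:
                break  # keep the message queued for a later flush
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

Refusing new messages when the buffer is full, rather than silently dropping old ones, is one possible policy; it pushes the decision back to the caller.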

After Linux restart, Kafka throwing "no brokers found when trying to rebalance"

I followed an excellent step-by-step tutorial for installing Kafka on Linux. Everything was working fine for me until I restarted Linux. After the restart, I get the following error when I try to consume a queue with kafka-console-consumer.sh.
$ ~/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic TutorialTopic --from-beginning
[2016-02-04 03:16:54,944] WARN [console-consumer-6966_bob-kafka-storm-1454577414492-8728ae43], no brokers found when trying to rebalance. (kafka.consumer.ZookeeperConsumerConnector)
Before I ran the kafka-console-consumer.sh script, I did push data to the Kafka queue using the kafka-console-producer.sh script. These steps worked without issue before restarting Linux.
I can fix the error by manually starting Kafka; but, I would rather that Kafka start automatically.
What might cause Kafka to not start correctly after restart?
I had a similar issue in our Hortonworks cluster earlier today. It seems that ZooKeeper was not starting correctly. I first tried kafka-console-producer and got the exception below:
kafka-console-producer.sh --broker-list=localhost:9093 --topic=some_topic < /tmp/sometextfile.txt
kafka.common.KafkaException: fetching topic metadata for topics
The solution for me was to restart the server that had stopped responding. Yours may be different, but play around with the console producer and see what errors you're getting.
I had this same issue today when running a consumer:
WARN [console-consumer-74006_my-machine.com-1519117041191-ea485191], no brokers found when trying to rebalance. (kafka.consumer.ZookeeperConsumerConnector)
It turned out to be a disk full issue on the machine. Once space was freed up, the issue was resolved.
Deleting everything related to ZooKeeper and Kafka from the /tmp folder worked for me. But my system is not production, so other people should proceed with caution.
By default, Kafka stores all its data in the /tmp folder, which gets cleared every time you restart. You can change the location of those files by changing the log.dirs property in config/server.properties.
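A quick way to check whether a broker is still configured to keep its data under /tmp is to read log.dirs out of the properties file. The sketch below assumes the simple key=value layout of server.properties; read_log_dirs and dirs_under_tmp are hypothetical helper names.

```python
def read_log_dirs(properties_text):
    """Return the directories listed under log.dirs in a server.properties body."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "log.dirs":
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


def dirs_under_tmp(dirs):
    """Pick out the directories that live in /tmp."""
    return [d for d in dirs if d == "/tmp" or d.startswith("/tmp/")]


sample = "# broker settings\nbroker.id=0\nlog.dirs=/tmp/kafka-logs\n"
risky = dirs_under_tmp(read_log_dirs(sample))
```

For a real check, the text would come from reading config/server.properties; the sample string here mirrors the value shipped in Kafka's default configuration.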
For me, this happened when I started a consumer to fetch messages from the broker cluster while the Kafka server was shut down.
So it returned the message "no brokers found when trying to rebalance".
When I restarted the Kafka server, the error disappeared.
For the producer and consumer to work, you have to start a broker. The broker is what mediates message passing from producer to consumer.
sh kafka/bin/kafka-server-start.sh kafka/config/server.properties
The broker runs on port 9092 by default.
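Before starting a consumer after a reboot, it can be useful to confirm that something is actually listening on the broker port. The sketch below only tests whether a TCP connection succeeds, which is an assumption-laden proxy: an open port does not prove the broker is healthy. port_is_open is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open("localhost", 9092) checks the default broker port, and port_is_open("localhost", 2181) checks the ZooKeeper port used in the commands above.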

Resources