I don't want to use one consumer for all topics. Instead, I want to create one consumer per topic, as shown below, to improve consumption efficiency:
val kafkaParams = Map(
ConsumerConfig.GROUP_ID_CONFIG -> group,
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> deserialization,
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> deserialization
)
//1.1 create first consumer
val inputDStream1: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, Set(topic1))
//1.2 create second consumer
val inputDStream2: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, Set(topic2))
//1.3 create third consumer
val inputDStream3: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, Set(topic3))
//1.4 create fourth consumer
val inputDStream4: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, Set(topic4))
//2.1 then union all DStreams
val allStream = inputDStream1
.union(inputDStream2)
.union(inputDStream3)
.union(inputDStream4)
The program runs 5 to 6 batches normally, but then it gets stuck: the Streaming tab of the Spark web UI cannot be opened, the Kafka consumer group keeps rebalancing, and the Kafka consumer is closed. It seems that there is a problem with Kafka offset commits.
I referred to "Level of Parallelism in Data Receiving".
I consume multiple Kafka topics. The fields of each topic are processed differently but produce the same result type, and the results are unioned together. I then use updateStateByKey to get the earliest and latest results for each key.
However, the program runs slower and slower, and sometimes Spark fails with: Caused by: java.lang.OutOfMemoryError: Java heap space
If anything is unclear, I will add details promptly. Thank you for your answers.
//1. set checkpoint
scc.sparkContext.setCheckpointDir("checkpoint")
//2. kafka Direct
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "192.168.44.10:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (true: java.lang.Boolean)
)
val topics = Array("test1", "test2", "test3", "test4")
val kafkaStream = KafkaUtils.createDirectStream[String, String](
scc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](
topics,
kafkaParams
)
)
//3. Different kafka topics are handled differently and return the same result type
val dstream1 = kafkaStream.filter(_.topic == "test1").map()...
val dstream2 = kafkaStream.filter(_.topic == "test2").map()...
val dstream3 = kafkaStream.filter(_.topic == "test3").map()...
val dstream4 = kafkaStream.filter(_.topic == "test4").map()...
//4. then combine all results together
val dstream = dstream1.union(dstream2).union(dstream3).union(dstream4)
//5. updateStateByKey
dstream.updateStateByKey()
//6. save result to hdfs
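For step 5, the update function could look something like the sketch below. It is only an illustration under assumptions: Event, its timestamp and payload fields, and the name updateEarliestLatest are hypothetical, since the post does not show the record type. The function has the (Seq[V], Option[S]) => Option[S] shape that updateStateByKey expects, and it keeps only two records per key, so the state for a key does not grow from batch to batch.

```scala
object EarliestLatest {
  // Hypothetical record type; the original post does not show its fields.
  final case class Event(timestamp: Long, payload: String)

  // State kept per key: (earliest event, latest event).
  type MinMax = (Event, Event)

  // Merges the new events of one batch with the previous state for a key.
  def updateEarliestLatest(batch: Seq[Event], state: Option[MinMax]): Option[MinMax] = {
    // Candidates are the new events plus the two events already held in the state.
    val candidates = batch ++ state.toSeq.flatMap { case (lo, hi) => Seq(lo, hi) }
    if (candidates.isEmpty) None // no data and no previous state for this key
    else Some((candidates.minBy(_.timestamp), candidates.maxBy(_.timestamp)))
  }
}
```

With a DStream[(String, Event)], it could be passed as dstream.updateStateByKey(EarliestLatest.updateEarliestLatest _). Returning None from the update function removes the key from the state, which may help if some keys eventually stop receiving data.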
I am new to Spark Streaming. I am trying to do some exercises on fetching data from Kafka and joining it with a Hive table. I am not sure how to do a JOIN in Spark Streaming (not Structured Streaming). Here is my code:
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "dofff2.dl.uk.feefr.com:8002",
"security.protocol" -> "SASL_PLAINTEXT",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "1",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("csvstream")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val strmk = stream.map(record => (record.value,record.timestamp))
Now I want to join with one of the tables in Hive. In Spark Structured Streaming I can directly call spark.table("table name") and do a join, but how can I do it in Spark Streaming, since everything there is based on RDDs? Can someone help me?
You need transform.
Something like this is required:
val dataset: RDD[(String, String)] = ... // From Hive
val windowedStream = stream.window(Seconds(20))... // From dStream
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
From the manuals:
The transform operation (along with its variations like transformWith)
allows arbitrary RDD-to-RDD functions to be applied on a DStream. It
can be used to apply any RDD operation that is not exposed in the
DStream API. For example, the functionality of joining every batch in
a data stream with another dataset is not directly exposed in the
DStream API. However, you can easily use transform to do this. This
enables very powerful possibilities.
An example of this can be found here:
How to join a DStream with a non-stream file?
The following guide helps: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
I am doing some hands-on practice with Spark, and I tried to use Spark Streaming to read data from a Kafka topic and store that data in an Elasticsearch index.
I am running my code from my IDE.
I added some messages to Kafka and ran my streaming context program. It read the data, but then the program stopped.
So if I add new data to Kafka, I have to run my streaming context program again.
I want the streaming context to keep 'listening' to the Kafka broker, so that I do not have to rerun it each time I add messages to the Kafka broker.
Here is my code:
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreams2")
val ssc = new StreamingContext(conf, Seconds(1))
val kafkaParams = Map(
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[LongDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "spark-streaming-notes2",
"auto.offset.reset" -> "latest"
)
// List of topics you want to listen for from Kafka
val topics = List("inputstream-sink")
val lines = KafkaUtils.createDirectStream[String, String](ssc,
PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
val word = lines.map(_.value())
word.print()
insertIntoIndexes()
ssc.start()
ssc.stop(stopSparkContext = false)
I wrote a Spark Streaming application that receives data from Kafka using KafkaUtils, and I want to print the data I receive from Kafka. Here is my code (I use spark-submit to run my Spark Streaming job):
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
messages.print()
When I run this, it works fine. If the input from the Kafka producer is a, b, c, I get the following result from Spark Streaming:
Time: 1476481700000 ms
-------------------------------------------
(null,a)
(null,b)
(null,c)
But if I add one line to count the number of lines, messages.print() no longer seems to work. The code is shown below:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
messages.print()
messages.count().print()
I am getting the following result:
-------------------------------------------
Time: 1476481800000 ms
-------------------------------------------
4
Only the count is printed; the data is not.
My question is why messages.print() is not executed after I add messages.count().print().
Another question is what null stands for in the tuples (null,a), (null,b), (null,c).
There is no issue with print() and it will print both messages and count like below. Scroll and check your log.
-------------------------------------------
Time: 1476481700000 ms
-------------------------------------------
(null,a)
(null,b)
(null,c)
-------------------------------------------
Time: 1476481800000 ms
-------------------------------------------
4
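KafkaUtils.createDirectStream returns a DStream of (Kafka message key, Kafka message value) pairs, so the null is the message key, which is null when the producer sends a message without a key. Check this and this post related to the null key.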
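To make the null concrete: with this API each element is a (message key, message value) tuple, and the key is null when the producer sends a message without a key. Below is a small sketch of handling that safely. The helper name describe and the "<no key>" placeholder are made up for illustration and are not part of Spark or Kafka.

```scala
object NullKeyExample {
  // Formats one (key, value) tuple from the stream.
  // Option(null) evaluates to None, so a missing key is replaced by a placeholder.
  def describe(record: (String, String)): String = {
    val (key, value) = record
    val shownKey = Option(key).getOrElse("<no key>")
    s"key=$shownKey value=$value"
  }
}
```

It could be applied to the stream as messages.map(NullKeyExample.describe).print().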
Your code should work, but here is an alternative. This approach is only meant for testing or learning. Instead of performing two actions, you can achieve the end goal with a single action:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
//A DStream has no collect(), so collect each batch RDD inside foreachRDD.
messages.foreachRDD { rdd =>
//Cache your RDD before you perform any heavyweight operations.
rdd.cache()
val result = rdd.collect()
println(result.size + " size")
result.foreach { input => println(input) }
}
Can we share Spark Streaming state between two DStreams?
Basically I want to create/update state using first stream and enrich second stream using state.
Example: I have modified StatefulNetworkWordCount example. I am creating state using first stream and enriching second stream with count of first stream.
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
val mappingFuncForFirstStream = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val mappingFuncForSecondStream = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = state.getOption.getOrElse(0)
val output = (word, sum)
Some(output)
}
// first stream
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
.flatMap(r=>r._2.split(" "))
.map(x => (x, 1))
.mapWithState(StateSpec.function(mappingFuncForFirstStream).initialState(initialRDD).timeout(Minutes(10)))
.print(1)
// second stream
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams2, mergeTopicSet)
.flatMap(r=>r._2.split(" "))
.map(x => (x, 1))
.mapWithState(StateSpec.function(mappingFuncForSecondStream).initialState(initialRDD).timeout(Minutes(10)))
.print(50)
In checkpointing directory, I can see two different state RDDs.
I am using spark-1.6.1 and kafka-0.8.2.1
It's possible to access the state maintained by mapWithState as a DStream of (key, state) pairs by calling stateSnapshots() on the DStream that mapWithState returns.
So, inspired by your example:
val firstDStream = ???
val secondDStream = ???
val firstDStreamSMapped = firstDStream.mapWithState(...)
val firstStreamState = firstDStreamSMapped.stateSnapshots()
// we want to use the state of Stream 1 to enrich Stream 2. The keys of both streams are required to match.
val enrichedStream = secondDStream.join(firstStreamState)
... do stuff with enrichedStream ...
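As a rough model of what the join yields for a single batch, here is a plain-collections sketch. It is not Spark code, and enrich and the sample data are made up for illustration. Like join on pair DStreams, it keeps only the keys that are present on both sides.

```scala
object EnrichModel {
  // Models one batch: pairs from stream 2 joined against the state snapshot of stream 1.
  // Keys that are missing from the state are dropped, as with an inner join.
  def enrich(batch: Seq[(String, Int)], state: Map[String, Int]): Seq[(String, (Int, Int))] =
    batch.flatMap { case (word, one) =>
      state.get(word).map(count => (word, (one, count)))
    }
}

// Example: only "hello" has state, so "spark" is dropped from the result.
val enriched = EnrichModel.enrich(
  Seq(("hello", 1), ("spark", 1)),
  Map("hello" -> 3, "world" -> 1)
)
```

If words without state should be kept in the output, leftOuterJoin on the pair DStream may be a better fit than join.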
This method may be helpful for you:
ssc.union(Seq[DStream[T]])