I am trying out some hands-on on spark and I tried to use spark streaming to read data from a kafka topic and store that data in a elasticsearch index.
I am trying to run my code from my ide.
I added some messages in Kafka and ran my Kafka Streaming context program and it read the data but after that the program stopped.
So, if I add new data in kafka , again I have to run my streaming context program.
I want the streaming context to keep 'listening' to the kafka broker , and I should not be running it each time I add some messages in Kafka broker.
Here is my code:
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreams2")
val ssc = new StreamingContext(conf, Seconds(1))
val kafkaParams = Map(
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[LongDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "spark-streaming-notes2",
"auto.offset.reset" -> "latest"
)
// List of topics you want to listen for from Kafka
val topics = List("inputstream-sink")
val lines = KafkaUtils.createDirectStream[String, String](ssc,
PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
val word = lines.map(_.value())
word.print()
insertIntoIndexes()
ssc.start()
ssc.stop(stopSparkContext = false)
Related
Consume multiple topics,Different kafka topic fields are processed differently and return the same result type,and union together,use updateStateByKey updateStateByKey Want to get the earliest and latest results corresponding to each key。
But, The program will run slower and slower ,sometimes spark Caused by: java.lang.OutOfMemoryError: Java heap space
If there is something that is unclear, I will add it in time, thank you for your enthusiastic answer
//1. set checkpoint
scc.sparkContext.setCheckpointDir("checkpoint")
//2. kafka Direct
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "192.168.44.10:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (true: java.lang.Boolean)
)
val topics = Array("test1", "test2", "test3", "test4")
val kafkaStream = KafkaUtils.createDirectStream[String, String](
scc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](
topics,
kafkaParams
)
) val topics = Array("test")
val kafkaStream = KafkaUtils.createDirectStream[String, String](
scc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](
topics,
kafkaParams
)
)
//3. Different kafka topics are handled differently and return the same result
dstream1 = kafkaStream.filter("test1").map()...
dstream1 = kafkaStream.filter("test2").map()...
dstream1 = kafkaStream.filter("test3").map()...
dstream1 = kafkaStream.filter("test4").map()...
//4. then combine all results together
dstream = dstream1.union(dstream2).union(dstream3).union(dstream4)
//5. updatestatebykey
dstream.updatestatebykey()
//6. save result to hdfs
Let us say I have just launched a Kafka direct stream + spark streaming application. For the first batch, the Streaming Context in the driver program connects to the Kafka and fetches startOffset and endOffset. Then, it launches a spark job with these start and end offset ranges for the executors to fetch records from Kafka. My question starts here. When its time for the second batch, the Streaming context connects to the Kafka for the start and end offset ranges. How is Kafka able to give these ranges, when there is no consumer group (as Direct stream does not take into account group.id) that allows to store the last commit offset value?
There is always a Consumer Group when working with the Kafka Consumer API. It doesn't matter what kind of Stream you are dealing with (Spark Direct Streaming, Spark Structured Streaming, Java/Scala API of Kafka Consumer...).
as Direct stream does not take into account group.id
Have a look at the Spark + Kafka integration Guide for direct streaming (code example for spark-streaming-kafka010) on how to declare a consumer group:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092,anotherhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
Even if you are not declaring a consumer group in your configuration, there will be still a (random) consumer group created for you.
Check your logs to see which group.id has been used in your application.
Can i know if kafka consumer can read specific records when from and until offsets of partitions of a topic are known.
Use case is in my spark streaming application few batch are not processed(inserted to table) in this case i want to read only missed data. I am storing the topics details i.e partitions and offsets.
Can someone let me know if this can be achieved reading from topic when offsets are known.
If you want to process set of messages, that is defined by starting and ending offset in spark streaming you can use following code:
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "groupId"
)
val offsetRanges = Array(
OffsetRange("input", 0, 2, 4) // <-- topic name, partition number, fromOffset, untilOffset
)
val sparkContext: SparkContext = ???
val rdd = KafkaUtils.createRDD(sparkContext, kafkaParams.asJava, offsetRanges, PreferConsistent)
// other proccessing and saving
More details regarding integration spark streaming and Kafka can be found: https://spark.apache.org/docs/2.4.0/streaming-kafka-0-10-integration.html
I am new to spark streaming. I am trying to do some exercises on fetching data from kafka and joining with hive table.i am not sure how to do JOIN in spark streaming (not the structured streaming). Here is my code
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "dofff2.dl.uk.feefr.com:8002",
"security.protocol" -> "SASL_PLAINTEXT",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "1",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("csvstream")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val strmk = stream.map(record => (record.value,record.timestamp))
Now i want to do join on one of the table in hive. In spark structured streaming i can directly call spark.table("table nanme") and do some join, but in spark streaming how can i do it since its everything based on RDD. can some one help me ?
You need transform.
Something like this is required:
val dataset: RDD[String, String] = ... // From Hive
val windowedStream = stream.window(Seconds(20))... // From dStream
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
From the manuals:
The transform operation (along with its variations like transformWith)
allows arbitrary RDD-to-RDD functions to be applied on a DStream. It
can be used to apply any RDD operation that is not exposed in the
DStream API. For example, the functionality of joining every batch in
a data stream with another dataset is not directly exposed in the
DStream API. However, you can easily use transform to do this. This
enables very powerful possibilities.
An example of this can be found here:
How to join a DStream with a non-stream file?
The following guide helps: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
I have few questions on spark streaming with Kafka and HBase.
Below is my program for spark streaming,here i am using zookeeper configuartions to connect to Kafka and Hbase.
Do we really need this configuration in the streaming code? Or i am doing it wrong
If am using hadoop distribution such as Hortonworks or Cloudera, there should be provision to configure spark with kafka and Hbase, so that my spark stream code should only take parameter such as kafka topic and Hbase table no zoo keeper and other configurations. If this can be done can you please help me with the steps.
object KafkaSparkStream{
def main(args: Array[String]): Unit =
{
var arg = Array("10.74.163.163:9092,10.74.163.154:9092", "10.74.163.154:2181", "test_topic")
val Array(broker, zk, topic) = arg
val conf = new SparkConf()
.setAppName("KafkaSparkStreamToHbase")
.setMaster("local[2]");
//.setMaster("yarn-client")
val ssc = new StreamingContext(conf, Seconds(5))
val kafkaConf = Map("metadata.broker.list" -> broker,
"zookeeper.connect" -> zk,
"group.id" -> "kafka-spark-streaming-example",
"zookeeper.connection.timeout.ms" -> "1000")
/* Kafka integration with reciever */
val lines = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
ssc, kafkaConf, Map(topic -> 1),
StorageLevel.MEMORY_ONLY_SER).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.foreachRDD(rdd => {
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "stream_count")
conf.set("hbase.zookeeper.quorum", "10.74.163.154:2181")
conf.set("hbase.master", "HOSTNAME:16000");
conf.set("hbase.rootdir", "file:///tmp/hbase")
val jobConf = new Configuration(conf)
jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].getName)
jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)
//rdd.saveAsNewAPIHadoopDataset(jobConf)
rdd.map(convert).saveAsNewAPIHadoopDataset(jobConf)
})
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
The way to go with HBase is to add your hbase-site.xml configuration file to Spark classpath.
For kafka you can use https://github.com/typesafehub/config to load properties from custom configuration files.
In order to work with this config files you have to:
set --driver-class-path <dir with the config file>
set --files <configuration file> to copy this file to each executor's working dir
set spark.executor.extraClassPath=./ to add each executor's working dir to its classpath