Spark Stream Kafka and Hbase Config - apache-spark

I have a few questions on Spark Streaming with Kafka and HBase.
Below is my Spark Streaming program; here I am using ZooKeeper configurations to connect to Kafka and HBase.
Do we really need this configuration in the streaming code, or am I doing it wrong?
If I am using a Hadoop distribution such as Hortonworks or Cloudera, there should be a way to configure Spark with Kafka and HBase so that my streaming code only takes parameters such as the Kafka topic and the HBase table, with no ZooKeeper or other configuration. If this can be done, can you please help me with the steps?
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaSparkStream {
  def main(args: Array[String]): Unit = {
    val arg = Array("10.74.163.163:9092,10.74.163.154:9092", "10.74.163.154:2181", "test_topic")
    val Array(broker, zk, topic) = arg

    val conf = new SparkConf()
      .setAppName("KafkaSparkStreamToHbase")
      .setMaster("local[2]") //.setMaster("yarn-client")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaConf = Map(
      "metadata.broker.list" -> broker,
      "zookeeper.connect" -> zk,
      "group.id" -> "kafka-spark-streaming-example",
      "zookeeper.connection.timeout.ms" -> "1000")

    /* Kafka integration with receiver */
    val lines = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc, kafkaConf, Map(topic -> 1), StorageLevel.MEMORY_ONLY_SER).map(_._2)

    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)

    wordCounts.foreachRDD { rdd =>
      val conf = HBaseConfiguration.create()
      conf.set(TableOutputFormat.OUTPUT_TABLE, "stream_count")
      conf.set("hbase.zookeeper.quorum", "10.74.163.154:2181")
      conf.set("hbase.master", "HOSTNAME:16000")
      conf.set("hbase.rootdir", "file:///tmp/hbase")
      val jobConf = new Configuration(conf)
      jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
      jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].getName)
      jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)
      // 'convert' (not shown in the question) maps each (word, count) pair to the HBase output key/value
      rdd.map(convert).saveAsNewAPIHadoopDataset(jobConf)
    }

    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

The way to go with HBase is to add your hbase-site.xml configuration file to the Spark classpath.
For Kafka you can use https://github.com/typesafehub/config to load properties from custom configuration files.
To make such a config file visible to the driver and the executors you have to:
set --driver-class-path <dir with the config file>
set --files <configuration file> to copy the file to each executor's working dir
set spark.executor.extraClassPath=./ to add each executor's working dir to its classpath
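For example, a minimal sketch of loading the Kafka settings with Typesafe Config; the file name kafka.conf and the keys under it are assumptions for illustration, not part of the original code:
import com.typesafe.config.ConfigFactory

// kafka.conf is a custom file shipped with --files and found through the classpath
// settings above; the file name and the keys below are illustrative only.
val kafkaSettings = ConfigFactory.parseResources("kafka.conf").resolve()

val kafkaConf = Map(
  "metadata.broker.list" -> kafkaSettings.getString("kafka.brokers"),
  "group.id" -> kafkaSettings.getString("kafka.group.id"))

val topic = kafkaSettings.getString("kafka.topic")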

Related

Structured Streaming: Reading from multiple Kafka topics at once

I have a Spark Structured Streaming application which has to read from 12 Kafka topics (different schemas, Avro format) at once, deserialize the data and store it in HDFS. When I read from a single topic with my code it works fine and without errors, but when I run multiple queries together I get the following error:
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
  val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
  val topic_list = List("topic1", "topic2", "topic3", "topic4")

  topic_list.foreach(x => {
    kafkaProps.update("subscribe", x)
    val source = Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
    val schemaParser = new Schema.Parser
    val schema = schemaParser.parse(source)
    val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]

    val kafkaStreamData = spark
      .readStream
      .format("kafka")
      .options(kafkaProps)
      .load()

    val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))

    val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
      .withColumn("rows", udfDeserialize(col("value")))
      .select("rows.*")

    val query = transformedDeserializedData
      .writeStream
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .format("parquet")
      .option("path", "/output/topics/" + x)
      .option("checkpointLocation", checkpointLocation + "//" + x)
      .start()
  })
  spark.streams.awaitAnyTermination()
}
Alternative: you can use Kafka Connect (from Confluent), NiFi, StreamSets, etc., since your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools installed. You state that the small-files problem is not an issue, so be it.
From Apache Kafka 0.9 onwards you can use the Kafka Connect API for a Kafka --> HDFS sink (various HDFS formats are supported). You do need a Kafka Connect cluster, but it runs against your existing Kafka cluster in any event, so it is not a big deal; someone does need to maintain it, though.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs
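For a rough picture, a standalone HDFS sink connector config looks roughly like the following; the connector class and property names come from the kafka-connect-hdfs quickstart, while the topic list, HDFS URL and flush size are placeholders:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=3
topics=topic1,topic2,topic3
hdfs.url=hdfs://namenode:8020
flush.size=1000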

DSE Spark Streaming: Long active batches queue

I have the following code:
val conf = new SparkConf()
  .setAppName("KafkaReceiver")
  .set("spark.cassandra.connection.host", "192.168.0.78")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "4g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "3")
  .set("spark.executor.cores", "3")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.io.compression.codec", "snappy")
  .set("spark.rdd.compress", "true")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "200")
  .set("spark.streaming.receiver.maxRate", "500")

val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)

val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "192.168.0.113:9092",
  "group.id" -> "test-group-aditya",
  "auto.offset.reset" -> "largest")

val topics = Set("random")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
I'm running the code through spark-submit with the following command:
dse> bin/dse spark-submit --class test.kafkatesting /home/aditya/test.jar
I have a three-node Cassandra DSE cluster installed on different machines. Whenever I run the application, it takes so much data and starts creating a queue of active batches, which in turn creates a backlog and a long scheduling delay. How can I increase the performance and control the queue such that it receives a new batch only after it finishes executing the current batch?
I found the solution by doing some optimisation in the code. Instead of saving the RDD, try to create a DataFrame; saving a DataFrame to Cassandra is much faster than saving an RDD. Also, increase the number of cores and the executor memory in order to achieve good results.
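For reference, a minimal sketch of writing a DataFrame to Cassandra with the spark-cassandra-connector DataFrame API; the keyspace and table names below are placeholders, and building the DataFrame from the streamed records is omitted:
import org.apache.spark.sql.{DataFrame, SaveMode}

// Write a DataFrame of streamed records to Cassandra via the DataFrame API.
// "my_keyspace" and "my_table" are placeholders.
def saveToCassandra(df: DataFrame): Unit = {
  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
    .mode(SaveMode.Append)
    .save()
}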

Spark Streaming for Kafka: unable to read data from Kafka topic

I am doing some hands-on work with Spark, and I tried to use Spark Streaming to read data from a Kafka topic and store that data in an Elasticsearch index.
I am running my code from my IDE.
I added some messages to Kafka and ran my streaming context program; it read the data, but after that the program stopped.
So if I add new data to Kafka, I have to run my streaming context program again.
I want the streaming context to keep 'listening' to the Kafka broker; I should not have to rerun it each time I add messages to the broker.
Here is my code:
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreams2")
val ssc = new StreamingContext(conf, Seconds(1))

val kafkaParams = Map(
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[LongDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-notes2",
  "auto.offset.reset" -> "latest"
)

// List of topics you want to listen for from Kafka
val topics = List("inputstream-sink")

val lines = KafkaUtils.createDirectStream[String, String](ssc,
  PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)

val word = lines.map(_.value())
word.print()

insertIntoIndexes()

ssc.start()
ssc.stop(stopSparkContext = false)
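Note that the snippet stops the streaming context right after starting it. A driver that should keep 'listening' to the broker normally blocks on awaitTermination after start; keeping the rest of the code above unchanged, the tail would look roughly like this:
ssc.start()
// Block the driver so the stream keeps consuming until the job is shut down,
// instead of stopping the context immediately after start.
ssc.awaitTermination()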

Reading data from Hbase using Get command in Spark

I want to read data from an HBase table using the Get command when I also have the key of the row. I want to do that in my Spark Streaming application; is there any source code someone can share?
You can use Spark's newAPIHadoopRDD to read the HBase table, which returns an RDD.
For example:
val sparkConf = new SparkConf().setAppName("Hbase").setMaster("local")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "table"
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + rdd.count())
sc.stop()
Or you can use any Spark-HBase connector, such as the Hortonworks HBase connector.
https://github.com/hortonworks-spark/shc
You can also use the Spark-Phoenix API.
https://phoenix.apache.org/phoenix_spark.html
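If you only need a single row by its key, as the question asks, here is a minimal sketch using the plain HBase client Get from inside the Spark application; the table name, column family/qualifier and row key are placeholders:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "localhost:2181")

// Open a connection, fetch one row by key, then clean up.
val connection = ConnectionFactory.createConnection(hbaseConf)
val table = connection.getTable(TableName.valueOf("table"))
try {
  val result = table.get(new Get(Bytes.toBytes("rowKey1")))
  val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
  println(s"Value for rowKey1: $value")
} finally {
  table.close()
  connection.close()
}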

apache spark streaming - kafka - reading older messages

I am trying to read older messages from Kafka with Spark Streaming. However, I am only able to retrieve messages as they are sent in real time (i.e., if I publish new messages while my Spark program is running, then I get those messages).
I am changing my group ID and consumer ID to make sure ZooKeeper isn't simply withholding messages it knows my program has already seen.
Assuming Spark sees the offset in ZooKeeper as -1, shouldn't it read all the old messages in the topic? Am I just misunderstanding the way a Kafka queue can be used? I'm very new to Spark and Kafka, so I can't rule out that I'm simply misunderstanding something.
package com.kibblesandbits

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import net.liftweb.json._

object KafkaStreamingTest {
  val cfg = new ConfigLoader().load
  val zookeeperHost = cfg.zookeeper.host
  val zookeeperPort = cfg.zookeeper.port
  val zookeeper_kafka_chroot = cfg.zookeeper.kafka_chroot
  implicit val formats = DefaultFormats

  def parser(json: String): String = json

  def main(args: Array[String]) {
    val zkQuorum = "test-spark02:9092"
    val group = "myGroup99"
    val topic = Map("testtopic" -> 1)

    val sparkContext = new SparkContext("local[3]", "KafkaConsumer1_New")
    val ssc = new StreamingContext(sparkContext, Seconds(3))

    val json_stream = KafkaUtils.createStream(ssc, zkQuorum, group, topic)
    val gp = json_stream.map(_._2).map(parser)
    gp.saveAsTextFiles("/tmp/sparkstreaming/mytest", "json")
    ssc.start()
  }
}
When running this, I see the following message, so I am confident that it's not simply missing the messages because an offset is already set.
14/12/05 13:34:08 INFO ConsumerFetcherManager: [ConsumerFetcherManager-1417808045047]
Added fetcher for partitions ArrayBuffer(
  [[testtopic,0], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
  [[testtopic,1], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
  [[testtopic,2], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
  [[testtopic,3], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
  [[testtopic,4], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092])
Then, if I publish 1000 new messages, I can see those 1000 messages saved in my temp directory. But I don't know how to read the existing messages, which at this point should number in the tens of thousands.
Use the alternative factory method on KafkaUtils that lets you provide a configuration to the Kafka consumer:
def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Map[String, Int],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[(K, V)]
Then build a map with your Kafka configuration and add the parameter 'auto.offset.reset' set to 'smallest':
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> groupId,
  "zookeeper.connection.timeout.ms" -> "10000",
  "auto.offset.reset" -> "smallest"
)
Provide that config to the factory method above. "auto.offset.reset" -> "smallest" tells the consumer to start from the smallest offset available in your topic.
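Putting it together against the code in the question, a minimal sketch of the call; the String decoders and the storage level are common choices, not something stated in the original post:
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// zkQuorum, group and ssc come from the question's code above.
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group,
  "zookeeper.connection.timeout.ms" -> "10000",
  "auto.offset.reset" -> "smallest")

val json_stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("testtopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)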
