Creating Two DStream from Kafka topic in Spark Streaming not working - apache-spark

In my spark streaming application . I am creating two DStream from two Kafka topic. I am doing so , because i need to process the two DStream differently. Below is the code example:
object KafkaConsumerTest3 {
var sc:SparkContext = null
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
val Array(zkQuorum, group, topics1, topics2, numThreads) = Array("localhost:2181", "group3", "test_topic4", "test_topic5","5")
val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val topicMap1 = topics1.split(",").map((_, numThreads.toInt)).toMap
val topicMap2 = topics2.split(",").map((_, numThreads.toInt)).toMap
val lines2 = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap2).map(_._2)
val lines1 = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap1).map(_._2)
lines2.foreachRDD{rdd =>
rdd.foreach { println }}
lines1.foreachRDD{rdd =>
rdd.foreach { println }}
ssc.start()
ssc.awaitTermination()
}
}
Both the topics may or may not have data . In my case first topic is not getting data currently but the second topic is getting. But my spark application is not printing any data. And there is no exception as well.
Is there anything i am missing? or how do i resolve this issue.

Found out the issue with above code. The problem is that we have used master as local[2] and we are registering two receiver.Increasing the number of thread solve the problem.

Related

Issue in creating Multiple Avro Flume Sink in Spark Streaming

I need to connect multiple Flume sinks with Spark Streaming
this is my flume file :
agent1.sinks.sink1a.type = avro
agent1.sinks.sink1a.hostname = localhost
agent1.sinks.sink1a.port = 9091
agent1.sinks.sink1b.type = avro
agent1.sinks.sink1b.hostname = localhost
agent1.sinks.sink1b.port = 9092
But only 9091 port is connecting 9092 is not able to connect
Here is my spark code to create multiple Flume Streams :
val sparkConf = new SparkConf().setAppName("WordCount")
val ssc = new StreamingContext(sparkConf, Seconds(20))
val rawLines = FlumeUtils.createStream(ssc,"localhost", 9091)
val rawLines1 = FlumeUtils.createStream(ssc,"localhost", 9092)
val lines = rawLines.map{record => {
(new String(record.event.getBody().array()))}}
val lines1 = rawLines1.map{record1 => {
(new String(record1.event.getBody().array()))}}
val lines_combined = lines.union(lines1)
val words = lines_combined.flatMap(_.split(" "))
What am I doing wrong ?.Any help regarding the same would be great.

Serialization of transform function in checkpointing

I'm trying to understand Spark Streaming's RDD transformations and checkpointing in the context of serialization. Consider the following example Spark Streaming app:
private val helperObject = HelperObject()
private def createStreamingContext(): StreamingContext = {
val conf = new SparkConf()
.setAppName(Constants.SparkAppName)
.setIfMissing("spark.master", Constants.SparkMasterDefault)
implicit val streamingContext = new StreamingContext(
new SparkContext(conf),
Seconds(Constants.SparkStreamingBatchSizeDefault))
val myStream = StreamUtils.createStream()
myStream.transform(transformTest(_)).print()
streamingContext
}
def transformTest(rdd: RDD[String]): RDD[String] = {
rdd.map(str => helperObject.doSomething(str))
}
val ssc = StreamingContext.getOrCreate(Settings.progressDir,
createStreamingContext)
ssc.start()
while (true) {
helperObject.setData(...)
}
From what I've read in other SO posts, transformTest will be invoked on the driver program once for every batch after streaming starts. Assuming createStreamingContext is invoked (no checkpoint is available), I would expect that the instance of helperObject defined up top would be serialized out to workers once per batch, hence picking up the changes applied to it via helperObject.setData(...). Is this the case?
Now, if createStreamingContext is not invoked (a checkpoint is available), then I would expect that the instance of helperObject cannot possibly be picked up for each batch, since it can't have been captured if createStreamingContext is not executed. Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Is it possible to update helperObject throughout execution from the driver program when using checkpointing? If so, what's the best approach?
If helperObject is going to be serialized to each executors?
Ans: Yes.
val helperObject = Instantiate_SomeHow()
rdd.map{_.SomeFunctionUsing(helperObject)}
Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Ans Yes.
If you wish to refresh your helperObject behaviour for each RDD operation you can still do that by making your helperObject more intelligent and not sending the helperObject directly but via a function which has the following signature () => helperObject_Class.
Since it is a function it is serializable. It is a very common design pattern used for sending objects that are not serializable e.g. database connection object or for your fun use case.
An example is given from Kafka Exactly once semantics using database
package example
import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import scalikejdbc._
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkContext, SparkConf, TaskContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}
/** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
Offsets and results will be stored per-batch, on the driver
*/
object TransactionalPerBatch {
def main(args: Array[String]): Unit = {
val conf = ConfigFactory.load
val kafkaParams = Map(
"metadata.broker.list" -> conf.getString("kafka.brokers")
)
val jdbcDriver = conf.getString("jdbc.driver")
val jdbcUrl = conf.getString("jdbc.url")
val jdbcUser = conf.getString("jdbc.user")
val jdbcPassword = conf.getString("jdbc.password")
val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
ssc.start()
ssc.awaitTermination()
}
def setupSsc(
kafkaParams: Map[String, String],
jdbcDriver: String,
jdbcUrl: String,
jdbcUser: String,
jdbcPassword: String
)(): StreamingContext = {
val ssc = new StreamingContext(new SparkConf, Seconds(60))
SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
// begin from the the offsets committed to the database
val fromOffsets = DB.readOnly { implicit session =>
sql"select topic, part, off from txn_offsets".
map { resultSet =>
TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
}.list.apply().toMap
}
val stream: InputDStream[(String,Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
ssc, kafkaParams, fromOffsets,
// we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
stream.foreachRDD { rdd =>
// Note this block is running on the driver
// Cast the rdd to an interface that lets us get an array of OffsetRange
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// simplest possible "metric", namely a count of messages per topic
// Notice the aggregation is done using spark methods, and results collected back to driver
val results = rdd.reduceByKey {
// This is the only block of code running on the executors.
// reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
_+_
}.collect
// Back to running on the driver
// localTx is transactional, if metric update or offset update fails, neither will be committed
DB.localTx { implicit session =>
// store metric results
results.foreach { pair =>
val (topic, metric) = pair
val metricRows = sql"""
update txn_data set metric = metric + ${metric}
where topic = ${topic}
""".update.apply()
if (metricRows != 1) {
throw new Exception(s"""
Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
""")
}
}
// store offsets
offsetRanges.foreach { osr =>
val offsetRows = sql"""
update txn_offsets set off = ${osr.untilOffset}
where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
""".update.apply()
if (offsetRows != 1) {
throw new Exception(s"""
Got $offsetRows rows affected instead of 1 when attempting to update offsets for
${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
Was a partition repeated after a worker failure?
""")
}
}
}
}
ssc
}
}

unable to consume kafka message through spark stream

I am trying to consume message from kafka producer through spark streaming program .
Here is my program
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
// lines.print()
lines.foreachRDD(rdd=>{
rdd.foreach(message=>
println(message))
})
The above program is running successfully. But I could not see any message get printed.
Set your master url using "local[*]"
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
You can also try to call collect() and see if you get messages.
lines.foreachRDD { rdd =>
rdd.collect().foreach(println)
}

Spark always using an existing SparkContext every time i run job with spark-submit

I have deployed the jar of spark on cluster. I submit the spark job using spark-submit command follow by my project jar.
I have many Spak Conf in my project. Conf will be decided based on which class i am running but every time i run the spark job i got this warning
7/01/09 07:32:51 WARN SparkContext: Use an existing SparkContext, some
configuration may not take effect.
Query Does it mean SparkContext is already there and my job is picking this.
Query Why come configurations are not taking place
Code
private val conf = new SparkConf()
.setAppName("ELSSIE_Ingest_Cassandra")
.setMaster(sparkIp)
.set("spark.sql.shuffle.partitions", "8")
.set("spark.cassandra.connection.host", cassandraIp)
.set("spark.sql.crossJoin.enabled", "true")
object SparkJob extends Enumeration {
val Program1, Program2, Program3, Program4, Program5 = Value
}
object ElssieCoreContext {
def getSparkSession(sparkJob: SparkJob.Value = SparkJob.RnfIngest): SparkSession = {
val sparkSession = sparkJob match {
case SparkJob.Program1 => {
val updatedConf = conf.set("spark.cassandra.output.batch.size.bytes", "2048").set("spark.sql.broadcastTimeout", "2000")
SparkSession.builder().config(updatedConf).getOrCreate()
}
case SparkJob.Program2 => {
val updatedConf = conf.set("spark.sql.broadcastTimeout", "2000")
SparkSession.builder().config(updatedConf).getOrCreate()
}
}
}
And In Program1.scala i call by
val spark = ElssieCoreContext.getSparkSession()
val sc = spark.sparkContext

How to work with real time streaming data/logs using spark streaming?

I am newbie to Spark and Scala.
I want to implement a REAL TIME Spark Consumer which could read the network logs on per minute basis [fetching around 1GB of JSON log lines/minute] from Kafka Publisher and finally store the aggregated values in ElasticSearch.
Aggregations is based on few values [like bytes_in, bytes_out etc] using composite key [like : client MAC, client IP, server MAC, Server IP etc].
Spark Consumer which I have written is:
object LogsAnalyzerScalaCS{
def main(args : Array[String]) {
val sparkConf = new SparkConf().setAppName("LOGS-AGGREGATION")
sparkConf.set("es.nodes", "my ip address")
sparkConf.set("es.port", "9200")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.nodes.discovery", "false")
val elasticResource = "conrec_1min/1minute"
val ssc = new StreamingContext(sparkConf, Seconds(30))
val zkQuorum = "my zk quorum IPs:2181"
val consumerGroupId = "LogsConsumer"
val topics = "Logs"
val topicMap = topics.split(",").map((_,3)).toMap
val json = KafkaUtils.createStream(ssc, zkQuorum, consumerGroupId, topicMap)
val logJSON = json.map(_._2)
try{
logJSON.foreachRDD( rdd =>{
if(!rdd.isEmpty()){
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
val df = sqlContext.read.json(rdd)
val groupedData =
((df.groupBy("id","start_time_formated","l2_c","l3_c",
"l4_c","l2_s","l3_s","l4_s")).agg(count("f_id") as "total_f", sum("p_out") as "total_p_out",sum("p_in") as "total_p_in",sum("b_out") as "total_b_out",sum("b_in") as "total_b_in", sum("duration") as "total_duration"))
val dataForES = groupedData.withColumnRenamed("start_time_formated", "start_time")
dataForES.saveToEs(elasticResource)
dataForES.show();
}
})
}
catch{
case e: Exception => print("Exception has occurred : "+e.getMessage)
}
ssc.start()
ssc.awaitTermination()
}
object SQLContextSingleton {
#transient private var instance: org.apache.spark.sql.SQLContext = _
def getInstance(sparkContext: SparkContext): org.apache.spark.sql.SQLContext = {
if (instance == null) {
instance = new org.apache.spark.sql.SQLContext(sparkContext)
}
instance
}
}
}
First of all I would like to know if at all my approach is correct or not [considering I need 1 min logs aggregation]?
There seems to be an issue using this code:
This Consumer will pull data from the Kafka broker every 30 seconds
and saving the final aggregation to Elasticsearch for that 30
sec data, hence increasing the number of rows in Elasticsearch for
unique key [at least 2 entries per one minute]. UI tool [
let's say Kibana] needs to do further aggregation. If I increase the
polling time from 30 sec to 60 sec then it takes a lot of time to
aggregate and hence not at all remains real time.
I want to implement it in such a way that in ElasticSearch only one
row per key should get saved. Hence I want to perform aggregation
till the time I am not getting new key values in my dataset which is
getting pulled from Kafka broker [per minute basis]. After doing
some googling I have found that this could be achieved using
groupByKey() and updateStateByKey() functions but I am not able to
make out how I could use this in my case [should I convert the JSON
Log lines into a string of log line with flat values and then use
these functions there]? If I will use these functions then when
should I save the final aggregated values into ElasticSearch?
Is there any other way of achieving it?
Your quick help will be appreciated.
Regards,
Bhupesh
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Main {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(15))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)//,localhost:9094,localhost:9095"
val topics = Array("test")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val out = stream.map(record =>
record.value
)
val words = out.flatMap(_.split(" "))
val count = words.map(word => (word, 1))
val wdc = count.reduceByKey(_+_)
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
wdc.foreachRDD{rdd=>
val es = sqlContext.createDataFrame(rdd).toDF("word","count")
import org.elasticsearch.spark.sql._
es.saveToEs("wordcount/testing")
es.show()
}
ssc.start()
ssc.awaitTermination()
}
}
To see full example and sbt
apache-sparkscalahadoopkafkaapache-spark-sql spark-streamingapache-spark-2.0elastic

Resources