How to save data to process later after stopping DirectStream in Spark Streaming?

I am creating the Kafka direct stream below:
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
Then I save the values as:
val lines = messages.map(_.value)
Then I stop the streaming context when there are no further offsets to consume, as follows:
lines.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    messages.stop()
    ssc.stop(false)
  } else {
  }
})
Then I print the lines as follows:
lines.print()
Then I start the stream:
ssc.start()
This works fine: it reads the RDDs, prints the first 10 records, stops the messages stream, and stops the streaming context. But when I then execute lines.print() again, it throws an exception saying that new inputs, transformations, or outputs cannot be added after stopping the StreamingContext.
How do I achieve my goal? I am running this in spark-shell, not as a compiled binary (a mandatory requirement).
Here is what I actually want to achieve:
1) Consume all json records from the kafka topic.
2) Stop fetching further records. (It is guaranteed that after consuming, no new records will be added to the Kafka topic, so I don't want to keep processing empty batches.)
3) Do some preprocessing by extracting some fields from the JSON records.
4) Do further operation on the preprocessed data.
5) Done.

When you call lines.print() again, it tries to register a new output operation on the stream, which would run the transformation messages.map(_.value) again. Because you have already stopped the context, this fails.
Save the contents of the lines variable by performing an action (inside foreachRDD) before stopping the context.
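One possible way to follow this advice, sketched under assumptions rather than taken from the original posts: collect every non-empty batch on the driver inside foreachRDD, and stop the context from the main thread once an empty batch has been seen. The snippet is meant to be pasted into spark-shell (where sc is predefined). A queueStream stands in for the Kafka direct stream so that it runs without a broker, and the names batches, collected, drained and saved are illustrative. It assumes the whole topic fits in driver memory and that the first batch is not empty.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext(sc, Seconds(1))

// Stand-in for the Kafka direct stream: two batches of records, then empty batches.
val batches = mutable.Queue[RDD[String]](
  sc.parallelize(Seq("a", "b"), 1),
  sc.parallelize(Seq("c"), 1))
val lines = ssc.queueStream(batches, oneAtATime = true)

// Driver-side buffer that outlives the streaming context.
val collected = mutable.ArrayBuffer.empty[String]
@volatile var drained = false

lines.foreachRDD { rdd =>
  if (rdd.isEmpty()) drained = true   // nothing left to consume
  else collected ++= rdd.collect()    // action: materialize the batch on the driver
}

ssc.start()
// Poll from the main thread instead of calling stop() inside foreachRDD.
while (!drained) ssc.awaitTerminationOrTimeout(500)
ssc.stop(stopSparkContext = false)

// The records are now an ordinary RDD, usable after the streaming context has stopped.
val saved = sc.parallelize(collected.toList)
println(saved.count())
```

With the real Kafka stream, only the definition of lines would change; steps 3) and 4) of the question could then operate on saved.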

Related

Spark mapPartitions Issue

I am using Spark mapPartitions on my DataFrame. The use case is that I should submit one job (either by calling a Lambda or by sending an SQS message) for each partition.
I am partitioning on a custom-formatted date column and logging the number of partitions before and after, and that works as expected.
However, when I look at the total number of jobs, it is more than the number of partitions. For some of the partitions there are two or three jobs!
Here is the code I am using:
import java.util.concurrent.atomic.{AtomicLong, AtomicReference}
import org.apache.spark.TaskContext

val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
  partition => {
    val partitionObjectList = new java.util.ArrayList[String]()
    logger.info("partitionIndex = {}", TaskContext.getPartitionId())
    val partitionCounter: AtomicLong = new AtomicLong(0)
    val partitionSize: AtomicLong = new AtomicLong(0)
    val paritionColumnName: AtomicReference[String] = new AtomicReference[String]()
    // Iterate over the objects in the given partition
    val updatedPartition = partition.map( record => {
      import yearMonthQueryDF.sparkSession.implicits._
      partitionCounter.incrementAndGet()
      val recordSize: Long = Integer.parseInt(record.getAs[String]("object_size")).toLong
      partitionSize.addAndGet(recordSize) // accumulate the partition's total size
      partitionObjectList.add(record.getAs[String]("object_key"))
      paritionColumnName.set(record.getAs[String]("partition_column_name"))
      record
    }).toList // force evaluation so the counters are filled before logging
    logger_ref.info("No.of Elements in Partition ["+paritionColumnName.get()+"] are =["+partitionCounter.get()+"] Total Size=["+partitionSize.get()+"]")
    // Submit a job for the partition
    // jobUtil.submitJob(paritionColumnName.get(),partitionObjectList,partitionSize.get())
    updatedPartition.toIterator
  }
)
Another thing that makes debugging harder is that the logging statements inside mapPartitions() are not found in the container error logs. (Since they are executed on the worker nodes rather than on the master node, I expected to find them in the container logs rather than in the master node's logs. I still need to figure out why I only see stderr logs but not stdout logs on the containers.)
Thanks
Sateesh

Message getting lost in Kafka + Spark Streaming

I am facing data loss in Spark Streaming with Kafka. My use case is as follows:
A Spark Streaming (DirectStream) application reads messages from a Kafka topic and processes them.
Based on the processed message, the app writes it to one of several Kafka topics, e.g. if the message is harmonized it is written to the harmonized topic, otherwise to the unharmonized topic.
Now, the problem is that during streaming I am somehow losing some messages, i.e. not all incoming messages are written to the harmonized or unharmonized topics.
For example, if the app receives 30 messages in one batch, sometimes it writes all of them to the output topics (the expected behaviour), but sometimes it writes only 27 (3 messages are lost; this number can vary).
Following is the version I am using:
Spark 1.6.0
Kafka 0.9
The Kafka topic configuration is as follows:
num of brokers: 3
num replication factor: 3
num of partitions: 3
Following are the properties I am using for kafka:
val props = new Properties()
props.put("metadata.broker.list", properties.getProperty("metadataBrokerList"))
props.put("auto.offset.reset", properties.getProperty("autoOffsetReset"))
props.put("group.id", properties.getProperty("group.id"))
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("outTopicHarmonized", properties.getProperty("outletKafkaTopicHarmonized"))
props.put("outTopicUnharmonized", properties.getProperty("outletKafkaTopicUnharmonized"))
props.put("acks", "all");
props.put("retries", "5");
props.put("request.required.acks", "-1")
Following is the piece of code where I am writing processed messages to Kafka:
val schemaRdd2 = finalHarmonizedDF.toJSON
schemaRdd2.foreachPartition { partition =>
  val producerConfig = new ProducerConfig(props)
  val producer = new Producer[String, String](producerConfig)
  partition.foreach { row =>
    if (debug) println(row.mkString)
    val keyedMessage = new KeyedMessage[String, String](props.getProperty("outTopicHarmonized"),
      null, row.toString())
    producer.send(keyedMessage)
  }
  //hack, should be done with the flush
  Thread.sleep(1000)
  producer.close()
}
I have explicitly added sleep(1000) for testing purpose.
But this is also not solving the problem :(
Any suggestion would be appreciated.
Try to tune the batchDuration parameter (when initializing StreamingContext ) to a number larger than the processing time of each rdd. This solved my problem.
Because you don't want to lose any messages, you need delivery semantics that rule out data loss. Note that acks='all' on its own does not give 'exactly once' delivery; combined with retries it gives at-least-once delivery, which avoids loss but can produce duplicates. You have already set acks='all'.
According to this resource [1], the acks='all' property must be used in conjunction with the min.insync.replicas property (a broker/topic-level setting).
[1] https://www.linkedin.com/pulse/kafka-producer-delivery-semantics-sylvester-daniel/
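The comment in the question's code says the sleep "should be done with the flush". A hedged sketch of what that could look like with the newer producer client (org.apache.kafka.clients.producer.KafkaProducer, which has a flush() method) follows. Note that acks and retries are settings of this newer client, whereas metadata.broker.list, serializer.class and request.required.acks belong to the older Scala producer used in the question. The function names durableProducerProps and sendPartition, and the brokers and topic parameters, are illustrative assumptions, and kafka-clients is assumed to be on the classpath.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Settings for the newer producer client; `brokers` is a placeholder such as "host1:9092".
def durableProducerProps(brokers: String): Properties = {
  val p = new Properties()
  p.put("bootstrap.servers", brokers)
  p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  p.put("acks", "all")   // wait for all in-sync replicas to acknowledge
  p.put("retries", "5")  // retry transient send failures
  p
}

// Intended to be called from foreachPartition: one producer per partition.
def sendPartition(records: Iterator[String], brokers: String, topic: String): Unit = {
  val producer = new KafkaProducer[String, String](durableProducerProps(brokers))
  try {
    records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
    producer.flush() // block until all buffered sends have completed
  } finally {
    producer.close()
  }
}
```

In the question's code this would be wired in as schemaRdd2.foreachPartition(part => sendPartition(part, brokers, topic)), replacing the Thread.sleep(1000) hack.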

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application that needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages. I verified with the console consumer that the messages were posted successfully.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
The following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}
private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is "+z)
  //edit
  lines.foreachRDD(rdd => rdd.map(println)
  // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
******EDIT****
I have used
lines.foreachRDD(rdd => rdd.map(println))
as well in the actual code, but that does not work either. I set the retention period as mentioned in the post "Kafka spark directStream can not get data", but the problem still exists.
Your process method extends the DStream pipeline but adds no output operator, and it is an output operator that gets the pipeline executed every batch interval.
You can "see" it by reading the signature of the count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a dstream of Kafka records that you transform into a dstream of single values (the result of count). Nothing causes it to be output (to the console or any other sink).
You have to end the pipeline with an output operator, as described in the official documentation, Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. By design, Spark Streaming's DStream has no notion of being an output dstream. It is DStreamGraph that knows about input and output dstreams and can differentiate between them.
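A minimal sketch of a process method that does end in output operators, assuming the same DStream[String] as in the question. It also relies on a general Spark fact that the answer does not spell out: rdd.map(println) is itself only a lazy transformation, so it never runs unless an action follows it.

```scala
import org.apache.spark.streaming.dstream.DStream

def process(lines: DStream[String]): Unit = {
  // count() only builds a DStream[Long]; print() is the output operator that
  // makes each batch's count show up on the driver console.
  lines.count().print()

  // rdd.map(println) would be a lazy transformation that never runs on its own.
  // take(10) is an action, so these records are printed on the driver.
  lines.foreachRDD { rdd => rdd.take(10).foreach(println) }
}
```

As in the question, process(lines) has to be called before ssc.start(), because output operations cannot be registered once the context is running.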

File is overwritten while using saveAsNewAPIHadoopFile

We are using Spark 1.4 for Spark Streaming. Kafka is the data source for the stream.
Records are published to Kafka every second. Our requirement is to store the records published to Kafka in a single folder per minute. The stream reads records every five seconds. For instance, records published between 12:00 PM and 12:01 PM are stored in folder "1200", those between 12:01 PM and 12:02 PM in folder "1201", and so on.
The code I wrote is as follows:
// First group records in RDD by date
stream.foreachRDD (rddWithinStream -> {
  JavaPairRDD<String, Iterable<String>> rddGroupedByDirectory = rddWithinStream.mapToPair(t -> {
    return new Tuple2<String, String> (targetHadoopFolder, t._2());
  }).groupByKey();
  // All records grouped by folders they will be stored in
  // Create RDD for each target folder.
  for (String hadoopFolder : rddGroupedByDirectory.keys().collect()) {
    JavaPairRDD <String, Iterable<String>> rddByKey = rddGroupedByDirectory.filter(groupedTuples -> {
      return groupedTuples._1().equals(hadoopFolder);
    });
    // And store it in Hadoop
    rddByKey.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
  }
});
Since the stream processes data every five seconds, saveAsNewAPIHadoopFile gets invoked multiple times per minute. This causes the "Part-00000" file to be overwritten every time.
I was expecting that, in the directory specified by the "directory" parameter, saveAsNewAPIHadoopFile would keep creating part-0000N files even when I have a single worker node.
Any help/alternatives are greatly appreciated.
Thanks.
In this case you have to build the output path and filename yourself. Incremental file naming works only when the output operation is called directly on the DStream (not on each RDD).
The function passed to stream.foreachRDD can receive the Time of each micro-batch. From the Spark documentation:
def foreachRDD(foreachFunc: (RDD[T], Time) ⇒ Unit)
So you can save each RDD as follows:
stream.foreachRDD((rdd, time) -> {
  String directory = timeToDirName(prefix, time);
  rdd.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
});
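The answer does not define timeToDirName, so the following is only a hypothetical sketch of such a helper. It takes the batch time as epoch milliseconds (which Spark's Time exposes as milliseconds) instead of the Time object itself, so that it stays free of Spark imports. The layout prefix/yyyyMMdd/HHmm/batch-<ms> and the use of UTC are assumptions: the minute folder follows the question's requirement, and the per-batch subfolder keeps successive five-second batches from overwriting each other's part files.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object BatchDirs {
  // Formats the date and the minute of day, e.g. "19700101/1200"; UTC is an assumption.
  private val minuteFormat =
    DateTimeFormatter.ofPattern("yyyyMMdd/HHmm").withZone(ZoneOffset.UTC)

  // Hypothetical helper: one folder per minute, one subfolder per micro-batch.
  def timeToDirName(prefix: String, batchTimeMs: Long): String = {
    val minute = minuteFormat.format(Instant.ofEpochMilli(batchTimeMs))
    s"$prefix/$minute/batch-$batchTimeMs"
  }
}
```

Inside foreachRDD it could be called as BatchDirs.timeToDirName(prefix, time.milliseconds), where prefix is the base output directory.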
You can try this -
Split process into 2 steps :
Step-1 :- Write Avro file using saveAsNewAPIHadoopFile to <temp-path>
Step-2 :- Move file from <temp-path> to <actual-target-path>
Hope this is helpful.

Handle database connection inside spark streaming

I am not sure I correctly understand how Spark handles database connections, or how to reliably run a large number of database update operations inside Spark without potentially breaking the Spark job. This is a code snippet I have been using (simplified for illustration):
val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = mongodb[BSONCollection](conf.getString("mongo.collections.dailyInventory"))
val stream: InputDStream[(String,String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
ssc, kafkaParams, fromOffsets,
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()));
def processRDD(rddElem: RDD[(String, String)]): Unit = {
val df = rddElem.map(line => {
...
}).flatMap(x => x).toDF()
if (!isEmptyDF(df)) {
var mongoF: Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]] = Seq();
val dfF2 = df.groupBy($"CountryCode", $"Width", $"Height", $"RequestType", $"Timestamp").agg(sum($"Frequency")).collect().map(row => {
val countryCode = row.getString(0); val width = row.getInt(1); val height = row.getInt(2);
val requestType = row.getInt(3); val timestamp = row.getLong(4); val frequency = row.getLong(5);
val endTimestamp = timestamp + 24*60*60; //next day
val updateOp = dailyInventoryCol.updateModifier(BSONDocument("$inc" -> BSONDocument("totalFrequency" -> frequency)), false, true)
val f: Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult] =
dailyInventoryCol.findAndModify(BSONDocument("width" -> width, "height" -> height, "country_code" -> countryCode, "request_type" -> requestType,
"startTs" -> timestamp, "endTs" -> endTimestamp), updateOp)
f
})
mongoF = mongoF ++ dfF2
//split into small chunk to avoid drying out the mongodb connection
val futureList: List[Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]]] = mongoF.grouped(200).toList
//future list
futureList.foreach(seqF => {
Await.result(Future.sequence(seqF), 40.seconds)
});
}
}
stream.foreachRDD(processRDD(_))
Basically, I am using ReactiveMongo (Scala). For each RDD, I convert it into a DataFrame, group/extract the necessary data, and then fire a large number of update queries against MongoDB. I want to ask:
1) I am using Mesos to deploy Spark on 3 servers and have one more server for the Mongo database. Is this the correct way to handle the database connection? My concern is whether the database connection/pool is opened at the beginning of the Spark job and maintained properly (surviving timeouts and network errors via failover) for the whole duration of the job (weeks, months...), or whether it is closed when each batch finishes. Given that a job might be scheduled on different servers, does that mean each batch opens a different set of DB connections?
2) What happens if an exception occurs while executing queries? Will the Spark job for that batch fail while the next batch continues?
3) If there are too many queries (2000+) to run against the Mongo database and the execution time exceeds the configured Spark batch duration (2 minutes), will that cause a problem? I noticed that with my current setup, after about 2-3 days, all batches are queued up as "Process" in the Spark web UI and none is able to exit properly (if I disable the Mongo update part, I can run for a week without problems). This basically hangs all batch jobs until I restart/resubmit the job.
Thanks a lot. I would appreciate any help with this issue.
Please read the "Design Patterns for using foreachRDD" section in http://spark.apache.org/docs/latest/streaming-programming-guide.html. It should clear up your doubts about how connections should be created and used.
Secondly, I would suggest keeping the direct update operations separate from your Spark job. A better approach would be for the Spark job to process the data and post the results to a Kafka queue, and to have another dedicated process/job that reads from that queue and performs the insert/update operations on MongoDB.
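The pattern in that section of the guide boils down to opening the connection on the executor, once per partition, rather than on the driver. A minimal sketch of the per-partition part in plain Scala follows. Sink, writePartition and openMongoSink are hypothetical names introduced for illustration and are not ReactiveMongo or Spark APIs; a real Sink would wrap a MongoDB connection.

```scala
// Hypothetical minimal sink abstraction; a real one would wrap a MongoDB connection.
trait Sink[T] extends AutoCloseable {
  def write(record: T): Unit
}

// Opens one sink per partition, writes every record, and always closes the sink.
// Returns the number of records written.
def writePartition[T](records: Iterator[T], open: () => Sink[T]): Int = {
  val sink = open()
  try {
    var written = 0
    records.foreach { r =>
      sink.write(r)
      written += 1
    }
    written
  } finally {
    sink.close()
  }
}

// Wiring inside a streaming job (runs on the executors, once per partition):
//   stream.foreachRDD { rdd =>
//     rdd.foreachPartition(part => writePartition(part, () => openMongoSink()))
//   }
```

Because the sink is created inside foreachPartition, nothing connection-related has to be serialized from the driver, and a failed write surfaces as a failed task for that batch instead of leaving a connection open.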
