Join two Spark DStreams with complex nested structure - apache-spark

I have implemented a Spark custom receiver to receive DStreams from an HTTP/REST endpoint, as follows:
val mem1Total:ReceiverInputDStream[String] = ssc.receiverStream(new CustomReceiver("httpURL1"))
val dstreamMem1:DStream[String] = mem1Total.window(Durations.seconds(30), Durations.seconds(10))
val mem2Total:ReceiverInputDStream[String] = ssc.receiverStream(new CustomReceiver("httpURL2"))
val dstreamMem2:DStream[String] = mem2Total.window(Durations.seconds(30), Durations.seconds(10))
Each stream has the following schema
val schema = StructType(Seq(
  StructField("status", StringType),
  StructField("data", StructType(Seq(
    StructField("resultType", StringType),
    StructField("result", ArrayType(StructType(Array(
      StructField("metric", StructType(Seq(
        StructField("application", StringType),
        StructField("component", StringType),
        StructField("instance", StringType)
      ))),
      StructField("value", ArrayType(StringType))
    ))))
  )))
))
Here is how far I could go in extracting the features I need from dstreamMem1:
dstreamMem1.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDS()
    .selectExpr("cast (value as string) as myData")
    .select(from_json($"myData", schema).as("myDataEvent"))
    .select($"myDataEvent.data.*")
    .select(explode($"result").as("flat"))
    .select($"flat.metric.*", $"flat.value".getItem(0).as("value1"), $"flat.value".getItem(1).as("value2"))
}
However, I can't figure out how to join dstreamMem1 with dstreamMem2 while also dealing with the complex structure. I could do a union of dstreamMem1 and dstreamMem2, but that won't work in my case because the "value" fields represent different things in each stream. Any ideas, please?
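To make the goal concrete, here is a rough, untested sketch of the kind of join I am after; parseMetrics is only a placeholder for parsing one JSON payload into pairs keyed by (application, component, instance):
// Untested sketch: key both windowed streams by (application, component, instance)
// and join them, so the differently named "value" fields end up side by side.
// parseMetrics is a placeholder and not shown here.
val keyed1: DStream[((String, String, String), (String, String))] = dstreamMem1.flatMap(parseMetrics)
val keyed2: DStream[((String, String, String), (String, String))] = dstreamMem2.flatMap(parseMetrics)

keyed1.join(keyed2).foreachRDD { rdd =>
  // rdd: RDD[((application, component, instance), ((mem1Value1, mem1Value2), (mem2Value1, mem2Value2)))]
}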
Edit#1
Based on the following resources
How to create a custom streaming data source?
https://github.com/apache/spark/pull/21145
https://github.com/hienluu/structured-streaming-sources/tree/master/streaming-sources/src/main/scala/org/structured_streaming_sources/twitter
I have been able to create the following class:
class SSPStreamMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader with Logging {

  private val httpURL = options.get(SSPStreamingSource.HTTP_URL).orElse("") //.toString()
  private val numPartitions = options.get(SSPStreamingSource.NUM_PARTITIONS).orElse("5").toInt
  private val queueSize = options.get(SSPStreamingSource.QUEUE_SIZE).orElse("512").toInt
  private val debugLevel = options.get(SSPStreamingSource.DEBUG_LEVEL).orElse("debug").toLowerCase

  private val NO_DATA_OFFSET = SSPOffset(-1)
  private var startOffset: SSPOffset = new SSPOffset(-1)
  private var endOffset: SSPOffset = new SSPOffset(-1)
  private var currentOffset: SSPOffset = new SSPOffset(-1)
  private var lastReturnedOffset: SSPOffset = new SSPOffset(-2)
  private var lastOffsetCommitted: SSPOffset = new SSPOffset(-1)

  private var incomingEventCounter = 0
  private var stopped: Boolean = false
  private var acsURLConn: HttpURLConnection = null
  private var worker: Thread = null

  private val sspList: ListBuffer[StreamingQueryStatus] = new ListBuffer[StreamingQueryStatus]()
  private var sspQueue: BlockingQueue[StreamingQueryStatus] = null

  initialize()

  private def initialize(): Unit = synchronized {
    log.warn(s"Inside initialize ....")
    sspQueue = new ArrayBlockingQueue(queueSize)
    new Thread("Socket Receiver") {
      log.warn(s"Inside thread ....")
      override def run() {
        log.warn(s"Inside run ....")
        receive()
      }
    }.start()
  }
  private def receive(): Unit = {
    log.warn(s"Inside recieve() ....")
    var userInput: String = null
    acsURLConn = new AccessACS(httpURL).getACSConnection()

    // Until stopped or connection broken continue reading
    val reader = new BufferedReader(
      new InputStreamReader(acsURLConn.getInputStream(), java.nio.charset.StandardCharsets.UTF_8))
    userInput = reader.readLine()

    while (!stopped) {
      // poll tweets from queue
      val tweet: StreamingQueryStatus = sspQueue.poll(100, TimeUnit.MILLISECONDS)
      if (tweet != null) {
        sspList.append(tweet)
        currentOffset = currentOffset + 1
        incomingEventCounter = incomingEventCounter + 1
      }
    }
    reader.close()
  }
  override def planInputPartitions(): java.util.List[InputPartition[org.apache.spark.sql.catalyst.InternalRow]] = {
    synchronized {
      log.warn(s"Inside planInputPartitions ....")
      //initialize()
      val startOrdinal = startOffset.offset.toInt + 1
      val endOrdinal = endOffset.offset.toInt + 1
      internalLog(s"createDataReaderFactories: sOrd: $startOrdinal, eOrd: $endOrdinal, " +
        s"lastOffsetCommitted: $lastOffsetCommitted")

      val newBlocks = synchronized {
        val sliceStart = startOrdinal - lastOffsetCommitted.offset.toInt - 1
        val sliceEnd = endOrdinal - lastOffsetCommitted.offset.toInt - 1
        assert(sliceStart <= sliceEnd, s"sliceStart: $sliceStart sliceEnd: $sliceEnd")
        sspList.slice(sliceStart, sliceEnd)
      }

      newBlocks.grouped(numPartitions).map { block =>
        new SSPStreamBatchTask(block).asInstanceOf[InputPartition[org.apache.spark.sql.catalyst.InternalRow]]
      }.toList.asJava
    }
  }
  override def setOffsetRange(start: Optional[Offset], end: Optional[Offset]): Unit = {
    if (start.isPresent && start.get().asInstanceOf[SSPOffset].offset != currentOffset.offset) {
      internalLog(s"setOffsetRange: start: $start, end: $end currentOffset: $currentOffset")
    }
    this.startOffset = start.orElse(NO_DATA_OFFSET).asInstanceOf[SSPOffset]
    this.endOffset = end.orElse(currentOffset).asInstanceOf[SSPOffset]
  }

  override def getStartOffset(): Offset = {
    internalLog("getStartOffset was called")
    if (startOffset.offset == -1) {
      throw new IllegalStateException("startOffset is -1")
    }
    startOffset
  }

  override def getEndOffset(): Offset = {
    if (endOffset.offset == -1) {
      currentOffset
    } else {
      if (lastReturnedOffset.offset < endOffset.offset) {
        internalLog(s"** getEndOffset => $endOffset)")
        lastReturnedOffset = endOffset
      }
      endOffset
    }
  }

  override def commit(end: Offset): Unit = {
    internalLog(s"** commit($end) lastOffsetCommitted: $lastOffsetCommitted")
    val newOffset = SSPOffset.convert(end).getOrElse(
      sys.error(s"SSPStreamMicroBatchReader.commit() received an offset ($end) that did not " +
        s"originate with an instance of this class")
    )
    val offsetDiff = (newOffset.offset - lastOffsetCommitted.offset).toInt
    if (offsetDiff < 0) {
      sys.error(s"Offsets committed out of order: $lastOffsetCommitted followed by $end")
    }
    sspList.trimStart(offsetDiff)
    lastOffsetCommitted = newOffset
  }
  override def stop(): Unit = {
    log.warn(s"There is a total of $incomingEventCounter events that came in")
    stopped = true
    if (acsURLConn != null) {
      try {
        //acsURLConn.disconnect()
      } catch {
        case e: IOException =>
      }
    }
  }

  override def deserializeOffset(json: String): Offset = {
    SSPOffset(json.toLong)
  }

  override def readSchema(): StructType = {
    SSPStreamingSource.SCHEMA
  }

  private def internalLog(msg: String): Unit = {
    debugLevel match {
      case "warn" => log.warn(msg)
      case "info" => log.info(msg)
      case "debug" => log.debug(msg)
      case _ =>
    }
  }
}
object SSPStreamingSource {
  val HTTP_URL = "httpURL"
  val DEBUG_LEVEL = "debugLevel"
  val NUM_PARTITIONS = "numPartitions"
  val QUEUE_SIZE = "queueSize"

  val SCHEMA = StructType(Seq(
    StructField("status", StringType),
    StructField("data", StructType(Seq(
      StructField("resultType", StringType),
      StructField("result", ArrayType(StructType(Array(
        StructField("metric", StructType(Seq(
          StructField("application", StringType),
          StructField("component", StringType),
          StructField("instance", StringType)
        ))),
        StructField("value", ArrayType(StringType))
      ))))
    )))
  ))
}
class SSPStreamBatchTask(sspList: ListBuffer[StreamingQueryStatus]) extends InputPartition[Row] {
  override def createPartitionReader(): InputPartitionReader[Row] = new SSPStreamBatchReader(sspList)
}

class SSPStreamBatchReader(sspList: ListBuffer[StreamingQueryStatus]) extends InputPartitionReader[Row] {

  private var currentIdx = -1

  override def next(): Boolean = {
    // Return true as long as the new index is in the seq.
    currentIdx += 1
    currentIdx < sspList.size
  }

  override def get(): Row = {
    val tweet = sspList(currentIdx)
    Row(tweet.json)
  }

  override def close(): Unit = {}
}
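For context, the provider class behind .format(providerClassName) and the "Inside createMicroBatchReader() ...." log lines below is not shown in this post; a simplified, hypothetical sketch of what such a provider looks like against the Spark 2.3/2.4 DataSourceV2 API would be:
import java.util.Optional
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
import org.apache.spark.sql.types.StructType

// Hypothetical sketch only; the actual implementation is omitted from the question.
class SSPStreamingSource extends DataSourceV2 with MicroBatchReadSupport {
  override def createMicroBatchReader(
      schema: Optional[StructType],
      checkpointLocation: String,
      options: DataSourceOptions): MicroBatchReader = {
    new SSPStreamMicroBatchReader(options)
  }
}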
The source is then used as follows:
val a = sparkSession.readStream
  .format(providerClassName)
  .option(SSPStreamingSource.HTTP_URL, httpMemTotal)
  .load()

a.printSchema()

a.writeStream
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/home/localCheckpoint1") //local
  .start("/home/sparkoutput/aa00a01")
Here is the error. I'm yet to crack this :(
18/11/29 13:33:28 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
18/11/29 13:33:28 WARN SSPStreamingSource: Inside createMicroBatchReader() ....
18/11/29 13:33:28 WARN SSPStreamMicroBatchReader: Inside initialize ....
18/11/29 13:33:28 WARN SSPStreamMicroBatchReader: Inside thread ....
18/11/29 13:33:28 WARN SSPStreamMicroBatchReader: There is a total of 0 events that came in
18/11/29 13:33:28 WARN SSPStreamMicroBatchReader: Inside run ....
18/11/29 13:33:28 WARN SSPStreamMicroBatchReader: Inside recieve() ....
root
|-- status: string (nullable = true)
|-- data: struct (nullable = true)
| |-- resultType: string (nullable = true)
| |-- result: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- metric: struct (nullable = true)
| | | | |-- application: string (nullable = true)
| | | | |-- component: string (nullable = true)
| | | | |-- instance: string (nullable = true)
| | | |-- value: array (nullable = true)
| | | | |-- element: string (containsNull = true)
18/11/29 13:33:30 INFO MicroBatchExecution: Starting [id = f15252df-96d8-45b4-a6db-83fd4c7aed71, runId = 65a6dc28-5eb4-468a-80c3-f547504689d7]. Use file:///home/localCheckpoint1 to store the query checkpoint.
18/11/29 13:33:30 WARN SSPStreamingSource: Inside createMicroBatchReader() ....
18/11/29 13:33:30 WARN SSPStreamMicroBatchReader: Inside initialize ....
18/11/29 13:33:30 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:168)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:513)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:573)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:572)
at myproject.spark.predictive_monitoring.predictmyproject$.run(predictmyproject.scala:99)
at myproject.spark.predictive_monitoring.predictmyproject$.main(predictmyproject.scala:31)
at myproject.spark.predictive_monitoring.predictmyproject.main(predictmyproject.scala)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:168)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:513)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:573)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:572)
at myproject.spark.predictive_monitoring.predictmyproject$.run(predictmyproject.scala:99)
at myproject.spark.predictive_monitoring.predictmyproject$.main(predictmyproject.scala:31)
at myproject.spark.predictive_monitoring.predictmyproject.main(predictmyproject.scala)
18/11/29 13:33:30 INFO SparkContext: Invoking stop() from shutdown hook
18/11/29 13:33:30 WARN SSPStreamMicroBatchReader: Inside thread ....
18/11/29 13:33:30 WARN SSPStreamMicroBatchReader: Inside run ....
18/11/29 13:33:30 WARN SSPStreamMicroBatchReader: Inside recieve() ....
18/11/29 13:33:30 INFO MicroBatchExecution: Using MicroBatchReader [myproject.spark.predictive_monitoring.SSPStreamMicroBatchReader#74cc1ddc] from DataSourceV2 named 'myproject.spark.predictive_monitoring.SSPStreamingSource' [myproject.spark.predictive_monitoring.SSPStreamingSource#7e503c3]
18/11/29 13:33:30 INFO SparkUI: Stopped Spark web UI at http://172.16.221.232:4040
18/11/29 13:33:30 ERROR MicroBatchExecution: Query [id = f15252df-96d8-45b4-a6db-83fd4c7aed71, runId = 65a6dc28-5eb4-468a-80c3-f547504689d7] terminated with error
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:76)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:838)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
myproject.spark.predictive_monitoring.predictmyproject$.run(predictmyproject.scala:37)
myproject.spark.predictive_monitoring.predictmyproject$.main(predictmyproject.scala:31)
myproject.spark.predictive_monitoring.predictmyproject.main(predictmyproject.scala)
The currently active SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:76)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:838)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
myproject.spark.predictive_monitoring.predictmyproject$.run(predictmyproject.scala:37)
myproject.spark.predictive_monitoring.predictmyproject$.main(predictmyproject.scala:31)
myproject.spark.predictive_monitoring.predictmyproject.main(predictmyproject.scala)
at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:100)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:91)
at org.apache.spark.sql.SparkSession.cloneSession(SparkSession.scala:256)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:268)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
18/11/29 13:33:30 WARN SSPStreamMicroBatchReader: There is a total of 0 events that came in
18/11/29 13:33:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/11/29 13:33:30 INFO MemoryStore: MemoryStore cleared
18/11/29 13:33:30 INFO BlockManager: BlockManager stopped
18/11/29 13:33:30 INFO BlockManagerMaster: BlockManagerMaster stopped
18/11/29 13:33:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/11/29 13:33:31 INFO SparkContext: Successfully stopped SparkContext

Related

Cogrouping not supported in streaming DataSet/DataFrames

Executing action:Metrics:P{"input":"tripMetrics","isEnabled":"true","class":"com.mobileum.wcmodel.execution.actions.SaveStreamAction","properties":{"path":"output/tripMetrics","triggerWindow":"5 minutes","checkpointLocation":"output/checkpoints/tripMetrics","format":"console","queryName":"GtpDetailModel"}}}
Exception in thread "main" org.apache.spark.sql.AnalysisException: CoGrouping with a streaming DataFrame/Dataset is not supported;;
I have a use case where I have to cogroup two datasets in streaming. However, when doing so I get an exception that cogrouping of Datasets/DataFrames in streaming is not supported.
#Override
public List<Dataset<Row>> transform(SparkSession sparkSession, Map<String, Dataset<Row>> inputDatasets, Properties properties) {
Encoder<Row> encoder = RowEncoder.apply((StructType)new CatalystSqlParser(sparkSession.sqlContext().conf()).parseDataType("struct<hostnetworkid:string,partnercountryid:string>"));
try {
Iterator<Map.Entry<String,Dataset<Row>>> itr= inputDatasets.entrySet().iterator();
Dataset<Row> trip = null;
Dataset<Row> registration= null;
while(itr.hasNext()){
trip=itr.next().getValue();
registration=itr.next().getValue();
}
KeyValueGroupedDataset<Long, TripModel> tripKeyValueGroupedDataset =
trip.map((MapFunction<Row, TripModel>) TripModel :: new , Encoders.bean(TripModel.class))
.groupByKey((MapFunction<TripModel, Long>) TripModel::getKey, Encoders.LONG());
KeyValueGroupedDataset<Long, RegistrationModel> regKeyValueGroupedDataset =
registration.map((MapFunction<Row, RegistrationModel>) RegistrationModel :: new , Encoders.bean(RegistrationModel.class))
.groupByKey((MapFunction<RegistrationModel, Long>) RegistrationModel::getKey, Encoders.LONG());
Dataset<Row> cogrouped = tripKeyValueGroupedDataset.cogroup(regKeyValueGroupedDataset, (CoGroupFunction<Long,TripModel, RegistrationModel, Row>) ( key, it1, it2) ->
{
Iterable<TripModel> iterable = () -> it1;
List<TripModel> tripModelList = StreamSupport
.stream(iterable.spliterator(), false)
.collect(Collectors.toList());
List<Row> a1 = new ArrayList<Row>();
a1.add(RowFactory.create(tripModelList.get(0).getCosid(),"asdf"));
return a1.iterator();

Spark GC Overhead limit exceeded error message

I am running the code below in Spark to compare the data stored in a CSV file and a Hive table. My data file is about 1.5 GB and has about 0.2 billion rows. When I run the code, I get a GC overhead limit exceeded error. I am not sure why I am getting this error, and I have searched various articles.
The error comes at Test 3 step sourceDataFrame.except(targetRawData).count > 0
I am not sure if there is any memory leak or not. How can I debug and resolve the same?
import org.apache.spark.sql.hive._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.text._
import java.util.Date
import scala.util._
import org.apache.spark.sql.hive.HiveContext
//val conf = new SparkConf().setAppName("Simple Application")
//val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
val spark: SparkSession = SparkSession.builder().appName("Simple Application").config("spark.master", "local").getOrCreate()
// set source and target location
//val sourceDataLocation = "hdfs://localhost:9000/sourcec.txt"
val sourceDataLocation = "s3a://rbspoc-sas/sas_valid_large.txt"
val targetTableName = "temp_TableA"
// Extract source data
println("Extracting SAS source data from csv file location " + sourceDataLocation);
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sourceRawCsvData = sc.textFile(sourceDataLocation)
println("Extracting target data from hive table " + targetTableName)
val targetRawData = hc.sql("Select datetime,load_datetime,trim(source_bank) as source_bank,trim(emp_name) as emp_name,header_row_count, emp_hours from " + targetTableName)
// Add the test cases here
// Test 1 - Validate the Structure
println("Validating the table structure...")
var startTime = getTimestamp()
val headerColumns = sourceRawCsvData.first().split(",").to[List]
val schema = TableASchema(headerColumns)
val sourceData = sourceRawCsvData.mapPartitionsWithIndex((index, element) => if (index == 0) element.drop(1) else element)
.map(_.split(",").toList)
.map(row)
val sourceDataFrame = spark.createDataFrame(sourceData,schema)
//val sourceDataFrame = sourceDataFrame.toDF(sourceDataFrame.columns map(_.toLowerCase): _*)
val sourceSchemaList = flatten(sourceDataFrame.schema).map(r => r.dataType.toString).toList
val targetSchemaList = flatten(targetRawData.schema).map(r => r.dataType.toString).toList
var endTime = getTimestamp()
if (sourceSchemaList.diff(targetSchemaList).length > 0) {
println("Updating StructureValidation result in table...")
UpdateResult(targetTableName, startTime, endTime, 1, s"FAILED: $targetTableName failed StructureValidation. ")
// Force exit here if needed
// sys.exit(1)
} else {
println("Updating StructureValidation result in table...")
UpdateResult(targetTableName, startTime, endTime, 0, s"SUCCESS: $targetTableName passed StructureValidation. ")
}
// Test 2 - Validate the Row count
println("Validating the Row count...")
startTime = getTimestamp()
// check the row count.
val sourceCount = sourceData.count()
val targetCount = targetRawData.count()
endTime = getTimestamp()
if (sourceCount != targetCount){
println("Updating RowCountValidation result in table...")
// Update the result in the table
UpdateResult(targetTableName, startTime, endTime, 1, s"FAILED: $targetTableName failed RowCountValidation. Source count:$sourceCount and Target count:$targetCount")
// Force exit here if needed
//sys.exit(1)
}
else{
println("Updating RowCountValidation result in table...")
// Update the result in the table
UpdateResult(targetTableName, startTime, endTime, 0, s"SUCCESS: $targetTableName passed RowCountValidation. Source count:$sourceCount and Target count:$targetCount")
}
// Test 3 - Validate the data
println("Comparing source and target data...")
startTime = getTimestamp()
if (sourceDataFrame.except(targetRawData).count > 0 ){
endTime = getTimestamp()
// Update the result in the table
println("Updating DataValidation result in table...")
UpdateResult(targetTableName, startTime, endTime, 1, s"FAILED: $targetTableName failed DataMatch validation")
// Force exit here if needed
// sys.exit(1)
}
else{
endTime = getTimestamp()
println("Updating DataValidation result in table...")
// Update the result in the table
UpdateResult(targetTableName, startTime, endTime, 0, s"SUCCESS: $targetTableName passed DataMatch validation")
}
// Test 4 - Calculate the average and variance of Int or Dec columns
// Test 5 - String length validation
def UpdateResult(tableName: String, startTime: String, endTime: String, returnCode: Int, description: String){
val insertString = s"INSERT INTO TABLE TestResult VALUES( FROM_UNIXTIME(UNIX_TIMESTAMP()),'$startTime','$endTime','$tableName',$returnCode,'$description')"
val a = hc.sql(insertString)
}
def TableASchema(columnName: List[String]): StructType = {
StructType(
Seq(
StructField(name = "datetime", dataType = TimestampType, nullable = true),
StructField(name = "load_datetime", dataType = TimestampType, nullable = true),
StructField(name = "source_bank", dataType = StringType, nullable = true),
StructField(name = "emp_name", dataType = StringType, nullable = true),
StructField(name = "header_row_count", dataType = IntegerType, nullable = true),
StructField(name = "emp_hours", dataType = DoubleType, nullable = true)
)
)
}
def row(line: List[String]): Row = {
Row(convertToTimestamp(line(0).trim), convertToDate(line(1).trim), line(2).trim, line(3).trim, line(4).toInt, line(5).toDouble)
}
def convertToTimestamp(s: String) : Timestamp = s match {
case "" => null
case _ => {
val format = new SimpleDateFormat("ddMMMyyyy:HH:mm:ss")
Try(new Timestamp(format.parse(s).getTime)) match {
case Success(t) => t
case Failure(_) => null
}
}
}
def convertToDate(s: String) : Timestamp = s match {
case "" => null
case _ => {
val format = new SimpleDateFormat("ddMMMyyyy")
Try(new Timestamp(format.parse(s).getTime)) match {
case Success(t) => t
case Failure(_) => null
}
}
}
def flatten(scheme: StructType): Array[StructField] = scheme.fields.flatMap { f =>
f.dataType match {
case struct:StructType => flatten(struct)
case _ => Array(f)
}
}
def getTimestamp(): String = {
val now = java.util.Calendar.getInstance()
val timestampFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
timestampFormat.format(now.getTime())
}
Exception is below:
17/12/21 05:18:40 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(8,0,ShuffleMapTask,TaskKilled(stage cancelled),org.apache.spark.scheduler.TaskInfo#78db3052,null)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 8.0 failed 1 times, most recent failure: Lost task 17.0 in stage 8.0 (TID 323, localhost, executor driver): java.lang.OutOfMemoryError: GC overhead limit exceeded
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
... 53 elided
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
scala> 17/12/21 05:18:40 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: /tmp/spark-6f345216-41df-4fd6-8e3d-e34d49e28f0c
java.io.IOException: Failed to delete: /tmp/spark-6f345216-41df-4fd6-8e3d-e34d49e28f0c
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1031)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Your Spark process is wasting too much time in garbage collection: most of the CPU is consumed by GC and the processing never completes, because you are running out of executor memory. You can try the options below.
Tune the properties spark.storage.memoryFraction and spark.memory.storageFraction. You can also give the executors more memory when submitting, e.g. spark-submit ... --executor-memory 4096m --num-executors 20 ...
Or change the GC policy: check the current GC settings and switch to G1 by setting -XX:+UseG1GC.
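For illustration, a rough sketch of setting the equivalent options programmatically via SparkConf (spark.executor.memory and spark.executor.extraJavaOptions are standard Spark configuration keys; the values are only examples):
import org.apache.spark.SparkConf

// Example only: more executor memory plus the G1 collector for the executor JVMs.
// These must be set before the SparkContext / SparkSession is created.
val tunedConf = new SparkConf()
  .setAppName("Simple Application")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")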

MapWithState gives java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast while recovering from checkpoint

I am facing an issue with a Spark Streaming job where I am trying to use broadcast, mapWithState and checkpointing together.
Following is the usage:
Since I have to pass some connection object (which is not Serializable) to the executors, I am using org.apache.spark.broadcast.Broadcast
Since we have to maintain some cached information, I am using stateful streaming with mapWithState
Also I am using checkpointing of my streaming context
I also need to pass the broadcasted connection object into the mapWithState for fetching some data from an external source.
The flow works just fine when the context is created fresh. However, when I crash the application and try to recover from the checkpoint, I get a ClassCastException.
I have put a small code snippet based on an example from asyncified.io on GitHub to reproduce the issue:
My broadcast logic is yuvalitzchakov.utils.KafkaWriter.scala
The dummy logic of the application is yuvalitzchakov.stateful.SparkStatefulRunnerWithBroadcast.scala
Dummy snippet of the code:
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark-stateful-example")
...
val prop = new Properties()
...
val config: Config = ConfigFactory.parseString(prop.toString)
val sc = new SparkContext(sparkConf)
val ssc = StreamingContext.getOrCreate(checkpointDir, () => {
println("creating context newly")
clearCheckpoint(checkpointDir)
val streamingContext = new StreamingContext(sc, Milliseconds(batchDuration))
streamingContext.checkpoint(checkpointDir)
...
val kafkaWriter = SparkContext.getOrCreate().broadcast(kafkaErrorWriter)
...
val stateSpec = StateSpec.function((key: Int, value: Option[UserEvent], state: State[UserSession]) =>
updateUserEvents(key, value, state, kafkaWriter)).timeout(Minutes(jobConfig.getLong("timeoutInMinutes")))
kafkaTextStream
.transform(rdd => {
offsetsQueue.enqueue(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
rdd
})
.map(deserializeUserEvent)
.filter(_ != UserEvent.empty)
.mapWithState(stateSpec)
.foreachRDD { rdd =>
...
some logic
...
streamingContext
})
}
ssc.start()
ssc.awaitTermination()
def updateUserEvents(key: Int,
value: Option[UserEvent],
state: State[UserSession],
kafkaWriter: Broadcast[KafkaWriter]): Option[UserSession] = {
...
kafkaWriter.value.someMethodCall()
...
}
I get the following error when
kafkaWriter.value.someMethodCall()
is executed:
17/08/01 21:20:38 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 4)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to yuvalitzchakov.utils.KafkaWriter
at yuvalitzchakov.stateful.SparkStatefulRunnerWithBroadcast$.updateUserSessions$1(SparkStatefulRunnerWithBroadcast.scala:144)
at yuvalitzchakov.stateful.SparkStatefulRunnerWithBroadcast$.updateUserEvents(SparkStatefulRunnerWithBroadcast.scala:150)
at yuvalitzchakov.stateful.SparkStatefulRunnerWithBroadcast$$anonfun$2.apply(SparkStatefulRunnerWithBroadcast.scala:78)
at yuvalitzchakov.stateful.SparkStatefulRunnerWithBroadcast$$anonfun$2.apply(SparkStatefulRunnerWithBroadcast.scala:77)
at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:181)
at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1005)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Basically, kafkaWriter is the broadcast variable and kafkaWriter.value should return the broadcast KafkaWriter, but it returns a SerializableConfiguration instead, which cannot be cast to the desired object.
Thanks in advance for help!
A broadcast variable cannot be used with mapWithState (or with transformation operations in general) if we need to recover from a checkpoint directory in Spark Streaming. In that case it can only be used inside output operations, because it requires the SparkContext to lazily re-initialize the broadcast:
class JavaWordBlacklist {

  private static volatile Broadcast<List<String>> instance = null;

  public static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
    if (instance == null) {
      synchronized (JavaWordBlacklist.class) {
        if (instance == null) {
          List<String> wordBlacklist = Arrays.asList("a", "b", "c");
          instance = jsc.broadcast(wordBlacklist);
        }
      }
    }
    return instance;
  }
}

class JavaDroppedWordsCounter {

  private static volatile LongAccumulator instance = null;

  public static LongAccumulator getInstance(JavaSparkContext jsc) {
    if (instance == null) {
      synchronized (JavaDroppedWordsCounter.class) {
        if (instance == null) {
          instance = jsc.sc().longAccumulator("WordsInBlacklistCounter");
        }
      }
    }
    return instance;
  }
}

wordCounts.foreachRDD((rdd, time) -> {
  // Get or register the blacklist Broadcast
  Broadcast<List<String>> blacklist = JavaWordBlacklist.getInstance(new JavaSparkContext(rdd.context()));
  // Get or register the droppedWordsCounter Accumulator
  LongAccumulator droppedWordsCounter = JavaDroppedWordsCounter.getInstance(new JavaSparkContext(rdd.context()));
  // Use blacklist to drop words and use droppedWordsCounter to count them
  String counts = rdd.filter(wordCount -> {
    if (blacklist.value().contains(wordCount._1())) {
      droppedWordsCounter.add(wordCount._2());
      return false;
    } else {
      return true;
    }
  }).collect().toString();
  String output = "Counts at time " + time + " " + counts;
});
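For reference, a rough Scala sketch of the same lazy-singleton pattern adapted to a KafkaWriter like the one in the question (the holder object and its method are hypothetical; only SparkContext.broadcast is an actual Spark call):
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical sketch: the broadcast is (re)created lazily from the SparkContext
// inside an output operation, so it also works after recovering from a checkpoint.
object KafkaWriterHolder {
  @volatile private var instance: Broadcast[KafkaWriter] = _

  def getInstance(sc: SparkContext, writer: => KafkaWriter): Broadcast[KafkaWriter] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(writer)
        }
      }
    }
    instance
  }
}
The broadcast would then be looked up with getInstance(rdd.sparkContext, ...) inside foreachRDD (an output operation) rather than being captured by the mapWithState function.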

Spark Struct structfield names getting changed in UDF

I am trying to pass a struct to a UDF in Spark. The struct's field names are getting changed, renamed to the column position. How do I fix it?
object TestCSV {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("localTest").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val inputData = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter","|")
.option("header", "true")
.load("test.csv")
inputData.printSchema()
inputData.show()
val groupedData = inputData.withColumn("name",struct(inputData("firstname"),inputData("lastname")))
val udfApply = groupedData.withColumn("newName",processName(groupedData("name")))
udfApply.show()
}
def processName = udf((input:Row) =>{
println(input)
println(input.schema)
Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
})
}
Output:
root
|-- id: string (nullable = true)
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
+---+---------+--------+
| id|firstname|lastname|
+---+---------+--------+
| 1| jack| reacher|
| 2| john| Doe|
+---+---------+--------+
Error:
[jack,reacher]
StructType(StructField(i[1],StringType,true), StructField(i[2],StringType,true))
17/03/08 09:45:35 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.IllegalArgumentException: Field "firstname" does not exist.
What you are encountering is really strange. After playing around a bit I finally figured out that it may be related to a problem with the optimizer engine. It seems that the problem is not the UDF but the struct function.
I got it to work (Spark 1.6.3) by caching groupedData; without caching I get your reported exception:
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
object Demo {
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[1]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions._
def processName = udf((input: Row) => {
Map("firstName" -> input.getAs[String]("firstname"), "lastName" -> input.getAs[String]("lastname"))
})
val inputData =
sc.parallelize(
Seq(("1", "Kevin", "Costner"))
).toDF("id", "firstname", "lastname")
val groupedData = inputData.withColumn("name", struct(inputData("firstname"), inputData("lastname")))
.cache() // does not work without cache
val udfApply = groupedData.withColumn("newName", processName(groupedData("name")))
udfApply.show()
}
}
Alternatively you can use the RDD API to make your struct, but this is not really nice:
case class Name(firstname:String,lastname:String) // define outside main
val groupedData = inputData.rdd
.map{r =>
(r.getAs[String]("id"),
Name(
r.getAs[String]("firstname"),
r.getAs[String]("lastname")
)
)
}
.toDF("id","name")

Spark How to RDD[JSONObject] to Dataset

I am reading data from an RDD whose elements are of type com.google.gson.JsonObject. I am trying to convert it into a Dataset but have no clue how to do this.
import com.google.gson.{JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.{SparkSession}
object tmp {
class people(name: String, age: Long, phone: String)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val parser = new JsonParser();
val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject()
val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject()
val PairRDD = sc.parallelize(List(
(new LongWritable(1l), jsonObject1),
(new LongWritable(2l), jsonObject2)
))
val rdd1 =PairRDD.map(element => element._2)
import spark.implicits._
//How to create Dataset as schema People from rdd1?
}
}
Even trying to print rdd1 elements throws
object not serializable (class: org.apache.hadoop.io.LongWritable, value: 1)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (1,{"name":"abc","age":23,"phone":"0208"}))
Basically, I get this RDD[(LongWritable, JsonObject)] from a BigQuery table, and I want to convert it to a Dataset so I can apply SQL for transformations.
I've left phone out of the second record intentionally; BigQuery returns nothing for an element with a null value.
Thanks for the clarification. You need to register the classes with Kryo. The following should work. I am running in spark-shell, so I had to stop the old context and create a new SparkContext with a config that includes the registered Kryo classes.
import com.google.gson.{JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.SparkContext
sc.stop()
val conf = sc.getConf
conf.registerKryoClasses( Array(classOf[LongWritable], classOf[JsonParser] ))
conf.get("spark.kryo.classesToRegister")
val sc = new SparkContext(conf)
val parser = new JsonParser();
val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject()
val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject()
val pairRDD = sc.parallelize(List(
(new LongWritable(1l), jsonObject1),
(new LongWritable(2l), jsonObject2)
))
val rdd = pairRDD.map(element => element._2)
rdd.collect()
// res9: Array[com.google.gson.JsonObject] = Array({"name":"abc","age":23,"phone":"0208"}, {"name":"xyz","age":33})
val jsonstrs = rdd.map(e=>e.toString).collect()
val df = spark.read.json( sc.parallelize(jsonstrs) )
df.printSchema
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- phone: string (nullable = true)
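If a typed Dataset rather than a DataFrame is wanted, one option is to turn the question's people class into a case class and use as[people]; a rough sketch, assuming the df built above:
// Sketch only: `people` must be a case class for Spark to derive an encoder.
case class people(name: String, age: Long, phone: String)

import spark.implicits._
val ds = df.as[people]   // the missing phone in the second record comes through as null
ds.show()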
