Creating an RDD from ConsumerRecord Value in Spark Streaming - apache-spark

I am trying to create an XmlRelation from a ConsumerRecord value.
val value = record.value()
logger.info(".processRecord() : Value ={}", value)
if (value != null) {
  val rdd = spark.sparkContext.parallelize(List(new String(value)))
  // ...
}
However, when I try to create an RDD from the value, I get a NullPointerException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Is this because I cannot create an RDD on the worker nodes, since the SparkContext is not available there? Obviously I cannot send the records back to the driver, as this is an infinite stream.
What alternatives do I have?
One alternative is to write the record data, along with the header info, to another topic and have another streaming job process it.
The ConsumerRecord value I am getting is a String (XML), and I want to parse it with an existing schema into an RDD and process it further.
Thanks
Sateesh

I was able to make it work with the following code:
val xmlStringDF:DataFrame = batchDF.selectExpr("value").filter($"value".isNotNull)
logger.info(".convert() : xmlStringDF Schema ={}",xmlStringDF.schema.treeString)
val rdd: RDD[String] = xmlStringDF.as[String].rdd
logger.info(".convert() : Before converting String DataFrame into XML DataFrame")
val relation = XmlRelation(
  () => rdd,
  None,
  parameters.toMap,
  xmlSchema)(spark.sqlContext)
val xmlDF = spark.baseRelationToDataFrame(relation)

Related

How to collect a streaming dataset (to a Scala value)?

How can I store a dataframe value in a Scala variable?
I need to store values from the dataframe below (assuming the "timestamp" column produces the same value in every row) in a variable, and later I need to use this variable somewhere.
I have tried the following:
val spark =SparkSession.builder().appName("micro").
enableHiveSupport().config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db").
getOrCreate()
val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id", unix_timestamp($"time_xml", "dd/MM/yyyy HH:mm:ss").cast(TimestampType).as("timestamp")) // "MM" is month; lowercase "mm" would parse minutes
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
It throws the following error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way I can store some dataframe values in a variable and use them later?
"Is there any way I can store some dataframe values to a variable and use later?"
That's not possible in Spark Structured Streaming: a streaming query never ends, so collect cannot be expressed on it.
"and later I need to use this variable somewhere"
This "later" has to be another streaming query that you join with this one to produce a result.
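If the value is only needed once per micro-batch, one option is foreachBatch (available since Spark 2.4, so this assumes a recent enough version). It hands each micro-batch to your function as an ordinary DataFrame, on which actions such as head or collect are allowed. A minimal sketch, not the asker's actual job: the built-in "rate" source stands in for the CSV stream, and the names PerBatchValue and handleBatch are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

object PerBatchValue {
  // Invoked once per micro-batch; batchDF is a plain (non-streaming) DataFrame here,
  // so actions like head() are permitted.
  def handleBatch(batchDF: DataFrame, batchId: Long): Unit = {
    val row = batchDF.agg(max("timestamp")).head()
    val latest = Option(row.getTimestamp(0)) // None when the batch is empty
    println(s"batch $batchId latest timestamp: $latest")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("per-batch-value")
      .master("local[2]") // assumption: local run for illustration
      .getOrCreate()

    // Assumption: the "rate" source replaces the CSV stream; it emits (timestamp, value) rows.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    // Passing a method with an explicit Unit return type avoids overload ambiguity
    // between the Scala and Java variants of foreachBatch.
    val query = stream.writeStream.foreachBatch(handleBatch _).start()

    query.awaitTermination(10000) // run for about 10 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

The value still only exists inside the per-batch function; anything that must outlive a batch has to be written to a sink or external store from there.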

Spark Streaming: Using external data during stream transformation

I have to filter data points in a stream based on a condition that refers to external data. I have loaded the external data into a DataFrame (so that I can query it through the SQL interface). But when I try to query the DataFrame, I find that it cannot be accessed inside the transformation (filter) function (sample code below).
// DStream is created and temp table called 'locations' is registered
dStream.filter(dp => {
val responseDf = sqlContext.sql("select location from locations where id='001'")
responseDf.show() //nothing is displayed
// some condition evaluation using responseDf
true
})
Am I doing something wrong? If so, what would be a better approach to load external data in memory and query it during the stream transformation stage?
Using SparkSession instead of SQLContext solved the issue. Code below:
val sparkSession = SparkSession.builder().appName("APP").getOrCreate()
val df = sparkSession.createDataFrame(locationRepo.getLocationInfo, classOf[LocationVO])
df.createOrReplaceTempView("locations")
val dStream: DStream[StreamDataPoint] = getdStream()
dStream.filter(dp => {
val sparkAppSession = SparkSession.builder().appName("APP").getOrCreate()
val responseDf = sparkAppSession.sql("select location from locations where id='001'")
responseDf.show() // this prints the results
// some condition evaluation using responseDf
true
})

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application that needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages, and I verified with the console consumer that the messages were posted successfully.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
The following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}
private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)
  // edit
  lines.foreachRDD(rdd => rdd.map(println)) // <-- Why does this **not** print?
}
Any suggestions on how to resolve this issue?
EDIT:
I have also used
lines.foreachRDD(rdd => rdd.map(println))
in the actual code, but that does not work either. I set the retention period as mentioned in the post "Kafka spark directStream can not get data", but the problem still exists.
Your process is a continuation of a DStream pipeline with no output operator that gets the pipeline executed every batch interval.
You can "see" it by reading the signature of count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a dstream of Kafka records that you transform into a dstream of single values (the result of count). Nothing in the pipeline outputs it (to the console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. By design, Spark Streaming's DStream has no notion of being an output dstream. It is the DStreamGraph that knows about, and can differentiate between, input and output dstreams.
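A minimal sketch of ending the pipeline with output operators, assuming a local run in which queueStream stands in for the Kafka stream (the object name and sample data are made up). print() is an output operator, so the count is actually computed every batch. Inside foreachRDD, an RDD action such as foreach is needed, because rdd.map(println) is itself only a lazy transformation and never runs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object CountWithOutputOperator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Assumption: a queue of RDDs replaces the Kafka input for this illustration.
    val queue = mutable.Queue[RDD[String]]()
    queue += ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val lines = ssc.queueStream(queue)

    // Output operator: prints the single-element count RDD of every batch.
    lines.count().print()

    // Output operator plus an RDD action: foreach runs the println,
    // whereas rdd.map(println) would only define a new RDD.
    // On a cluster this output goes to the executors' stdout; in local mode it shows in the console.
    lines.foreachRDD(rdd => rdd.foreach(println))

    ssc.start()
    ssc.awaitTerminationOrTimeout(10000) // let a few batches run, then stop
    ssc.stop()
  }
}
```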

RDD toDF() : Erroneous Behavior

I built a Spark Streaming app that fetches content from a Kafka queue and is meant to put the data into a MySQL table after some pre-processing and structuring.
I call the foreachRDD method on the stream. The issue I'm facing is that data is lost between the call to saveAsTextFile on the RDD and the DataFrame's write method with format("csv"). I can't seem to pinpoint why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
stream.foreachRDD {
rdd => {
rdd.saveAsTextFile("/Users/jarvis/rdds/"+new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date)+"_rdd")
import spark.implicits._
val messagesDF = rdd.map(_.split("\t")).map( w => { Record ( w(0), autoTag( w(1),w(4) ) , w(2), w(3), w(4), w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0) )}).toDF("recordTS","tag","channel_url","title","description","link","pub_TS")
messagesDF.write.format("csv").save(dumpPath+new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date)+"_DF")
}
}
ssc.start()
ssc.awaitTermination()
There's data loss, i.e. many rows don't make it from the RDD to the DataFrame.
There's also duplication: many rows that do reach the DataFrame are repeated many times.
Found the error. I had misunderstood the format of the ingested data.
The data was expected to be "\t\t\t...", so each row was supposed to be split off at "\n".
However, the actual data was:
"\t\t\t...\n\t\t\t...\n"
So the rdd.map(...) operation needed another map to split at every "\n".
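A minimal sketch of that extra splitting step, written against plain Scala collections so it runs without Spark; the same flatMap/map shape applies to an RDD[String]. The object and method names are hypothetical, and the sample payload is made up.

```scala
object SplitRecords {
  // Each incoming message may hold several newline-terminated records.
  // flatMap turns one message into many records before the fields are split on tabs.
  def parse(messages: Seq[String]): Seq[Array[String]] =
    messages
      .flatMap(_.split("\n"))   // one element per record
      .filter(_.nonEmpty)       // drop blank lines
      .map(_.split("\t"))       // one array of fields per record

  def main(args: Array[String]): Unit = {
    val payload = Seq("ts1\ttitle1\nts2\ttitle2\n")
    parse(payload).foreach(fields => println(fields.mkString(" | ")))
  }
}
```

Using flatMap rather than map matters here: map would keep one element per message, which is how rows were going missing in the first place.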

How to join a DStream with a non-stream file?

I'd like to join every RDD in a DStream with a non-streaming, unchanging reference file. Here is my code:
val sparkConf = new SparkConf().setAppName("LogCounter")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc = new SparkContext()
val geoData = sc.textFile("data/geoRegion.csv")
.map(_.split(','))
.map(line => (line(0), (line(1),line(2),line(3),line(4))))
val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val goodIPsFltrBI = lines.filter(...).map(...).filter(...) // details removed for brevity
val vdpJoinedGeo = goodIPsFltrBI.transform(rdd =>rdd.join(geoData))
I'm getting many, many errors, the most common being:
14/11/19 19:58:23 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: http://10.102.71.92:40764/broadcast_1
I think I should be broadcasting geoData instead of reading it in with each task (it's a 100MB file), but I'm not sure where to put the code that initializes geoData the first time.
Also I'm not sure if geoData is even defined correctly (maybe it should use ssc instead of sc?). The documentation I've seen just lists the transform and join but doesn't show how the static file was created.
Any ideas on how to broadcast geoData and then join it to each streaming RDD?
FileNotFoundException:
The geoData textFile is loaded on all workers from the provided location ("data/geoRegion.csv"). Most probably this file is only available on the driver, so the workers cannot load it and throw a file-not-found exception.
Broadcast variable:
Broadcast variables are defined on the driver and used on the workers by unwrapping the broadcast container to get the content.
This means that the data held by the broadcast variable must be loaded by the driver before the job is defined.
This might solve two problems in this case: assuming that the geoRegion.csv file is located on the driver node, it allows the data to be loaded properly on the driver and spread efficiently over the cluster.
In the code above, replace the geoData loading with a version that reads the local file:
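import scala.io.Source

// Read the reference file on the driver into an in-memory Map keyed by the first column
val geoData = Source.fromFile("data/geoRegion.csv").getLines
  .map(_.split(','))
  .map(line => (line(0), (line(1), line(2), line(3), line(4)))).toMap
val geoDataBC = sc.broadcast(geoData)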
To use it, you access the broadcast contents within a closure. Note that you will get access to the map previously wrapped in the broadcast variable: it's a simple object, not an RDD, so in this case you cannot use join to merge the two datasets. You could use flatMap instead:
val vdpJoinedGeo = goodIPsFltrBI.flatMap { ip => geoDataBC.value.get(ip).map(data => (ip, data)) }
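To see why flatMap over a Map lookup behaves like an inner join, here is a minimal sketch on plain Scala collections (no Spark needed). The names LookupJoin and joinWithLookup and the sample data are hypothetical; in the streaming job, the Map would be geoDataBC.value.

```scala
object LookupJoin {
  type Geo = (String, String, String, String)

  // Map.get returns an Option; flatMap keeps the Some results and drops the None ones,
  // so keys without a match disappear, just as in an inner join.
  def joinWithLookup(ips: Seq[String], geo: Map[String, Geo]): Seq[(String, Geo)] =
    ips.flatMap(ip => geo.get(ip).map(data => (ip, data)))

  def main(args: Array[String]): Unit = {
    val geo: Map[String, Geo] = Map("10.0.0.1" -> ("US", "CA", "SF", "94105"))
    joinWithLookup(Seq("10.0.0.1", "10.0.0.2"), geo).foreach(println)
  }
}
```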
