structured streaming writing to multiple streams - apache-spark

My scenario
I get data from a stream and call a UDF which returns a JSON string. One of the attributes in the JSON string is UniqueId, which the UDF generates with Guid.NewGuid() (C#).
The DataFrame output of the UDF is written to multiple streams/sinks based on some filter.
Issue:
Each sink gets a new value for the UniqueId that was generated by the UDF. How can I keep the same UniqueId across all sinks?
If each sink gets a different value for UniqueId, does that mean my UDF is called multiple times, once for each sink?
If the UDF is invoked more than once, what is the option to have it called once and then just write the same data to the different sinks?
inData = spark.readStream.format("eventhub").load()
udfdata = inData.select(from_json(myudf("column"), schema).alias("result")).select("result.*")
filter1 = udfdata.filter("column == 'filter1'")
filter2 = udfdata.filter("column == 'filter2'")
# write filter1 to two different sinks
filter1.writeStream.format("delta").start(table1)
filter1.writeStream.format("eventhub").start()
# write filter2 to two different sinks
filter2.writeStream.format("delta").start(table2)
filter2.writeStream.format("eventhub").start()

Each time you call .writeStream...start() you are creating a new, independent streaming query.
This means that for each output sink you define, Spark reads the input source again and processes the DataFrame again.
If you want to read and process the data only once and then write it out to multiple sinks, you can use the foreachBatch sink as a workaround:
inData = spark.readStream.format("eventhub").load()
udfdata = inData.select(from_json(myudf("column"), schema).alias("result")).select("result.*")

def filter_and_output(udfdata, batchId):
    # At this point udfdata is a batch DataFrame, no longer a streaming DataFrame
    udfdata.cache()

    filter1 = udfdata.filter("column == 'filter1'")
    filter2 = udfdata.filter("column == 'filter2'")

    # write filter1
    filter1.write.format("delta").save(table1)
    filter1.write.format("eventhub").save()

    # write filter2
    filter2.write.format("delta").save(table2)
    filter2.write.format("eventhub").save()

    udfdata.unpersist()

udfdata.writeStream.foreachBatch(filter_and_output).start()
You can learn more about foreachBatch in the Spark Structured Streaming documentation.
To answer your questions:
If you use foreachBatch, your data will be processed only once and you will have the same UniqueId for all sinks.
Yes; each independent streaming query re-executes the whole plan, so the UDF is invoked once per sink.
Using foreachBatch will solve the issue.

Related

How to print/log outputs within foreachBatch function?

Using table streaming, I am trying to write a stream using foreachBatch:
df.writeStream
  .format("delta")
  .foreachBatch(WriteStreamToDelta)
  ...
WriteStreamToDelta looks like:
def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF."some_transformations"
    print(microDFWrangled.count())  # <-- How do I achieve the equivalent of this?
    microDFWrangled.writeStream...
I would like to view the number of rows in:
the notebook, below the writeStream cell
the driver log
Create a list and append the number of rows of each micro batch to it.
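A minimal sketch of that idea in Scala (the question's code is PySpark, but the same pattern applies there with a Python list and print); writeStreamToDelta, microBatchCounts and the Delta path are illustrative names, not from the original post:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame

// Driver-side list of per-batch row counts (foreachBatch runs its body on the driver)
val microBatchCounts = ArrayBuffer[Long]()

val writeStreamToDelta: (DataFrame, Long) => Unit = (microDF, batchId) => {
  val microDFWrangled = microDF            // stand-in for the question's "some_transformations"
  val n = microDFWrangled.count()          // an action on a batch DataFrame, allowed here
  microBatchCounts += n                    // keep the history of counts
  println(s"batch $batchId: $n rows")      // shows up in the notebook cell output / driver log
  microDFWrangled.write.format("delta").mode("append").save("/tmp/delta/target")  // hypothetical path
}

df.writeStream.foreachBatch(writeStreamToDelta).start()
Because the foreachBatch function executes on the driver, both the println output and the appended counts are visible from the notebook and the driver log.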

Apache spark custom log unfiltered data (LazyLogging)

I'm filtering a column to comply with some validations, and I can filter using Spark built-in functions,
but I need to log the invalid data with a proper message (I am using LazyLogging). Is there any way I can do it without a custom UDF, so that I keep Spark's optimizations?
For example, filtering names that are no longer than 20 characters:
df.filter(length($"name") <= lit(20))
In this scenario, how can I log the names that are more than 20 characters long without a custom UDF?
If the result of the filter operation is small enough to fit into your driver, you can collect it and write it to your default logger.
val logCollection = df.filter(length($"name") > lit(20)).collect()
logCollection.foreach(row => logger.info(row.toString))
As an alternative you can create a separate stream by applying another writeStream format to write the names into a database, console etc. Just keep in mind that when you do this, you will actually create multiple streaming queries within your SparkSession which are consuming the data independently:
val originalDf = df.[...]
val logDf = df.filter(length($"name") > lit(20))
val originalQuery = originalDf.writeStream.[...].start() // keep logic as is
val logQuery = logDf.writeStream.format("console").[...].start()
spark.streams.awaitAnyTermination()

RDD String to Spark csv Reader

I want to read the RDD[String] using the Spark CSV reader. The reason I am doing this is that I need to filter some records before using the CSV reader.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file")
I need to read the fileRDD using the Spark CSV reader. I don't want to write the filtered data back to a file, since that increases the I/O on HDFS. I have looked into the options available in the Spark CSV reader, but didn't find any.
spark.read.csv(file)
Sample Data
PHM|MERC|PHARMA|BLUEDRUG|50
CLM|BSH|CLAIM|VISIT|HSA|EMPLOYER|PAID|250
PHM|GSK|PHARMA|PARAC|70
CLM|UHC|CLAIM|VISIT|HSA|PERSONAL|PAID|72
As you can see, records starting with PHM have one number of columns and records starting with CLM have a different number of columns. That is the reason I am filtering first and then applying a schema; PHM and CLM records have different schemas.
val fileRDD: RDD[String] = spark.sparkContext.textFile("file").filter(_.startsWith("PHM"))
spark.read.schema(phcSchema).csv(fileRDD.toDS())
Since Spark 2.2, the .csv method can read a Dataset of strings, so it can be implemented this way:
import spark.implicits._

val rdd: RDD[String] = spark.sparkContext.textFile("csv.txt")
// ... do filtering
spark.read.csv(rdd.toDS())
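Since the question also needs a schema for the filtered PHM records, a fuller sketch could look like the following; the column names in phcSchema and the input path are illustrative assumptions, not from the original post:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("rdd-csv").getOrCreate()
import spark.implicits._

// Hypothetical schema for the 5-column, pipe-separated PHM records shown above
val phcSchema = StructType(Seq(
  StructField("rec_type", StringType),
  StructField("manufacturer", StringType),
  StructField("category", StringType),
  StructField("drug", StringType),
  StructField("price", IntegerType)
))

// Filter the raw lines first, then hand them to the CSV reader without writing an intermediate file
val phmLines = spark.sparkContext.textFile("file").filter(_.startsWith("PHM"))

val phmDf = spark.read
  .option("sep", "|")
  .schema(phcSchema)
  .csv(phmLines.toDS())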

How to collect a streaming dataset (to a Scala value)?

How can I store a DataFrame value in a Scala variable?
I need to store values from the below DataFrame (assuming the "timestamp" column produces the same value in every row) in a variable, and later I need to use this variable somewhere else.
I have tried the following:
val spark = SparkSession.builder().appName("micro")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db")
  .getOrCreate()
val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id", unix_timestamp($"time_xml", "dd/MM/yyyy HH:mm:ss").cast(TimestampType).as("timestamp"))
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
It's throwing an error saying the following:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way I can store some DataFrame values in a variable and use them later?
Is there any way I can store some DataFrame values in a variable and use them later?
That's not possible in Spark Structured Streaming, since a streaming query never ends and so collect cannot be expressed on a streaming Dataset.
and later I need to use this variable somewhere
This "later" has to be another streaming query that you could join with to produce a result.
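If having the value once per micro-batch is enough, one possible workaround sketch (Spark 2.4+, reusing xmlDf_temp and spark from the question; grabTimestamp and collectTime are illustrative names) is foreachBatch, which hands you a plain batch DataFrame that collect works on:
import org.apache.spark.sql.DataFrame
import spark.implicits._

// Driver-side holder for the most recent value; only meaningful on the driver
var collectTime: Option[String] = None

val grabTimestamp: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  // Inside foreachBatch the data is a plain batch DataFrame, so collect() is allowed
  batchDF.select($"timestamp".cast("string")).as[String].collect().headOption.foreach { t =>
    collectTime = Some(t)
    // use t here, e.g. to drive a downstream batch write
  }
}

xmlDf_temp.writeStream.foreachBatch(grabTimestamp).start()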

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application which needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages, and I verified from the console consumer that the messages were successfully posted.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
Following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in the Kafka topic
  ssc.start()
  ssc.awaitTermination()
}

private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)

  // edit
  lines.foreachRDD(rdd => rdd.map(println)) // <-- Why does this **not** print?
}
Any suggestions on how to resolve this issue?
EDIT
I have used
lines.foreachRDD(rdd => rdd.map(println))
in the actual code as well, but that is also not working. I set the retention period as mentioned in the post Kafka spark directStream can not get data, but the problem still exists.
Your process method is a continuation of the DStream pipeline, but it has no output operator, which is what gets a pipeline executed every batch interval.
You can "see" it by reading the signature of count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So you have a DStream of Kafka records that you transform into a DStream of single values (the result of count), but there is nothing to output them (to the console or any other sink).
You have to end the pipeline with an output operator, as described in the official documentation under Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. By design, Spark Streaming's DStream has no notion of being an output dstream; it is the DStreamGraph that knows about and differentiates between input and output dstreams.
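As a sketch, two common ways to end the pipeline in the code above (using the existing lines DStream) are shown below; both are output operators or RDD actions, so they actually trigger execution:
// Option 1: print() is an output operator; it prints the first elements of every
// batch, here the single per-batch value produced by count()
lines.count().print()

// Option 2: inside foreachRDD, use an action such as foreach or take instead of map
// (map is a lazy transformation, which is why rdd.map(println) never runs)
lines.foreachRDD { rdd =>
  println(s"count of lines in this batch: ${rdd.count()}")
  rdd.take(10).foreach(println)  // print a small sample on the driver
}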
