Aggregation with distinct count in Spark structured streaming throwing error - apache-spark

I am trying to get a distinct count of id per group of Parentgroup, childgroup and MountingType in Spark Structured Streaming.
Code: the code below throws an error
.withWatermark("timestamp", "1 minutes")
val aggDF = JSONDF.groupBy("Parentgroup","childgroup","MountingType")
.agg(countDistinct("id"))
Error:
Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark
Could someone please help me with how to aggregate and write to CSV in Structured Streaming?
Thanks a lot.
Data:
{"id":"7CE3A7CA","Faulttime":1544362500,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"7CE3A7CA","Faulttime":1544362509,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"010004FF,"Faulttime":1551339188,"name":"Philips","Parentgroup":"Light","childgroup":"Other","MountingType":"Solder"}
{"id":"010004FF","Faulttime":1551339188,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"Solder"}
{"id":"010004FF,"Faulttime":1551339191,"name":"Sansui","Parentgroup":"AC","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"CE361405","Faulttime":1552159061,"name":"Hyndai","Parentgroup":"SBAR","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"CE361405","Faulttime":1552159061,"name":"sony","Parentgroup":"TV","childgroup":"Other","MountingType":"SurfaceMount"}
{"id":"7BE446C0","Faulttime":1553022095,"name":"Sony","Parentgroup":"TV","childgroup":"Other","MountingType":"Solder"}
{"id":"7BE446C0","Faulttime":1553022095,"name":"Philips","Parentgroup":"LIGHT","childgroup":"Other","MountingType":"Solder"}

groupBy operations on a streaming DataFrame need to specify a window or time period.
Try this pseudo code:
val JSONDF = df.withWatermark("timestamp", "1 minutes")
val aggDF = JSONDF
  .groupBy(window($"timestamp", "5 minutes", "1 minutes"), $"Parentgroup", $"childgroup", $"MountingType")
  .agg(countDistinct("id"))
Reference :
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
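For completeness, here is a minimal end-to-end sketch of aggregating and writing to CSV (the event-time derivation, the output paths and the use of approx_count_distinct are assumptions on my part; exact countDistinct may not be supported on streaming DataFrames in all Spark versions):
import org.apache.spark.sql.functions._

// Derive an event-time column from the epoch-seconds Faulttime field and watermark it.
val withEventTime = JSONDF
  .withColumn("timestamp", col("Faulttime").cast("timestamp"))
  .withWatermark("timestamp", "1 minutes")

// Group by a time window plus the business keys, with an approximate distinct count of id.
val aggDF = withEventTime
  .groupBy(window(col("timestamp"), "5 minutes"), col("Parentgroup"), col("childgroup"), col("MountingType"))
  .agg(approx_count_distinct("id").as("unique_ids"))

// File sinks only support append mode; a window's row is emitted once the watermark passes it.
aggDF
  .select(col("window.start").as("window_start"), col("Parentgroup"), col("childgroup"), col("MountingType"), col("unique_ids"))
  .writeStream
  .outputMode("append")
  .format("csv")
  .option("path", "/tmp/agg_output")              // placeholder output path
  .option("checkpointLocation", "/tmp/agg_ckpt")  // placeholder checkpoint path
  .start()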

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to process data from a Cassandra database. My question is: how do I use the data that is in the input dataframe (Kafka) as "where" parameters in a Cassandra "select" without getting the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
.readStream
.format("kafka")
.options(
Map("kafka.bootstrap.servers"-> kafka_bootstrap,
"subscribe" -> kafka_topic,
"startingOffsets"-> "latest",
"fetchOffset.numRetries"-> "5",
"kafka.group.id"-> groupId
))
.load()
I get this error whenever I try to store the dataframe values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With this, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
  val messageDS: Dataset[Message] = messageToProcess.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  val mensagem = messageToProcess
  mensagem
}
When you need to use data in Kafka to query data in Cassandra, such an operation is a typical join between two datasets: you don't need to call .collect to find entries, you just do the join. It's quite a common pattern to enrich data in Kafka with data from an external dataset, and Cassandra provides low-latency lookups for that.
Your code could look like the following (you'll need to configure the so-called DirectJoin, see the link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._

val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode data in Kafka into columns ...

val cassdata = spark.read.cassandraFormat("table", "keyspace").load

val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ... // process joined data
val query = processed.writeStream. ... .start() // output data somewhere
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
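The "decode data in Kafka into columns" step could look roughly like this (the JSON schema and field names are assumptions, just to show the shape of that step):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Assumed schema: the Kafka value is JSON containing the column we later join on.
val msgSchema = new StructType()
  .add("some_column", StringType)
  .add("country", StringType)

val decoded = df
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), msgSchema).as("msg"))
  .select("msg.*")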
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes; see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this, you can use a foreachBatch output sink to collect the rows you need at each micro-batch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // microBatchDf is no longer a streaming dataframe
  // (you can check with microBatchDf.isStreaming)
  val messageDS: Dataset[Message] = microBatchDf.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  // ...
}.start()

org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;

Below is my Streaming Data Frame created from a weblog file:
val finalDf = joinedDf
.groupBy(window($"dateTime", "10 seconds"))
.agg(
max(col("datetime")).as("visitdate"),
count(col("ipaddress")).as("number_of_records"),
collect_list("ipaddress").as("ipaddress")
)
.select(col("window"),col("visitdate"),col("number_of_records"),explode(col("ipaddress")).as("ipaddress"))
.join(joinedDf,Seq("ipaddress"))
.select(
col("window"),
col("category").as("category_page_category"),
col("category"),
col("calculation1"),
hour(col("dateTime")).as("hour_label").cast("String"),
col("dateTime").as("date_label").cast("String"),
minute(col("dateTime")).as("minute_label").cast("String"),
col("demography"),
col("fullname").as("full_name"),
col("ipaddress"),
col("number_of_records"),
col("endpoint").as("pageurl"),
col("pageurl").as("page_url"),
col("username"),
col("visitdate"),
col("productname").as("product_name")
).dropDuplicates().toDF()
There are no aggregations performed on this Data Frame earlier in the pipeline.
I have applied aggregation only once, but I am still getting the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Multiple streaming aggregations are not supported with streaming
DataFrames/Datasets;
There are indeed two aggregations here. The first one is explicit:
.groupBy(...).agg(...)
The second one is required for
.dropDuplicates()
which is implemented as
.groupBy(...).agg(first(...), ...)
You'll have to redesign your pipeline.
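One possible redesign, shown only as a sketch (the watermark, sink format and paths are assumptions): keep the single streaming windowed aggregation and move the deduplication into foreachBatch, where each micro-batch is a plain non-streaming DataFrame, so dropDuplicates no longer counts as a second streaming aggregation:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val aggregated = joinedDf
  .withWatermark("dateTime", "1 minutes")           // assumed watermark for append mode
  .groupBy(window(col("dateTime"), "10 seconds"))
  .agg(
    max(col("dateTime")).as("visitdate"),
    count(col("ipaddress")).as("number_of_records"),
    collect_list("ipaddress").as("ipaddress")
  )

aggregated.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // batchDf is not streaming here, so dropDuplicates (and further joins) are allowed.
    batchDf.dropDuplicates()
      .write.mode("append")
      .parquet("/tmp/weblog_output")                // assumed sink and path
  }
  .option("checkpointLocation", "/tmp/weblog_ckpt") // assumed checkpoint path
  .start()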

Queries with streaming sources must be executed with writeStream.start()

I have a structured stream dataframe tempDataFrame2 consisting of Field1. I am trying to calculate the approxQuantile of Field1. However, whenever I type
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0) I get the following error message:
Queries with streaming sources must be executed with writeStream.start()
Below is the code snippet:
val tempDataFrame2 = ... // a structured streaming dataframe
// Calculate IQR
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)
// Filter messages
val tempDataFrame3 = tempDataFrame2.filter("Some working filter")
val query = tempDataFrame2.writeStream.outputMode("append").queryName("table").format("console").start()
query.awaitTermination()
I have already gone through these two links from SO: Link1 and Link2. Unfortunately, I am not able to relate those responses to my problem.
Edit
After reading the comments, the following is how I am planning to go ahead (see the sketch after this list):
1) Read all the uncommitted offsets from the Kafka topic.
2) Save them to a dataframe variable.
3) Stop the structured streaming query so that I don't read from the Kafka topic anymore.
4) Start processing the saved dataframe from step 2).
But now I am not sure how to proceed:
1) How do I know that there are no more records left to consume in the Kafka topic, so that I can stop the streaming query?
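A rough sketch of steps 1) to 4) above (the connection details and the decoding of value into Field1 are assumptions): read the topic between explicit offsets as a batch DataFrame rather than as a stream, so that batch-only operations such as approxQuantile become available:
// Batch read of the Kafka topic between explicit offsets (values are placeholders).
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my_topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

// Assumed decoding: the Kafka value holds Field1 as text castable to double.
val decoded = batchDf
  .selectExpr("CAST(value AS STRING) AS raw")
  .selectExpr("CAST(raw AS DOUBLE) AS Field1")

// Works here because decoded is a batch DataFrame, not a streaming one.
val Array(q1, q3) = decoded.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)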

Spark Streaming aggregation and filter in the same window

I've got a fairly easy task: events are coming in and I want to filter those with a higher value than the per-group average, by key, in the same window.
I think this is the relevant part of the code:
val avgfuel = events
.groupBy(window($"enqueuedTime", "30 seconds"), $"weatherCondition")
.agg(avg($"fuelEfficiencyPercentage") as "avg_fuel")
val joined = events.join(avgfuel, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"avg_fuel")
val streamingQuery1 = joined.writeStream
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", checkpointLocation)
  .format("json")
  .option("path", containerOutputLocation)
  .start()
events is a DStream.
The problem is that I'm getting empty files in the output location.
I'm using Databricks 3.5 - Spark 2.2.1 with Scala 2.11
What have I done wrong?
Thanks!
EDIT: more complete code:
val inputStream = spark.readStream
.format("eventhubs") // working with azure event hubs
.options(eventhubParameters)
.load()
val schema = (new StructType)
.add("id", StringType)
.add("latitude", StringType)
.add("longitude", StringType)
.add("tirePressure", FloatType)
.add("fuelEfficiencyPercentage", FloatType)
.add("weatherCondition", StringType)
val df1 = inputStream.select($"body".cast("string").as("value")
, from_unixtime($"enqueuedTime").cast(TimestampType).as("enqueuedTime")
).withWatermark("enqueuedTime", "1 minutes")
val df2 = df1.select(from_json(($"value"), schema).as("body")
, $"enqueuedTime")
val df3 = df2.select(
$"enqueuedTime"
, $"body.id".cast("integer")
, $"body.latitude".cast("float")
, $"body.longitude".cast("float")
, $"body.tirePressure"
, $"body.fuelEfficiencyPercentage"
, $"body.weatherCondition"
)
val avgfuel = df3
.groupBy(window($"enqueuedTime", "10 seconds"), $"weatherCondition" )
.agg(avg($"fuelEfficiencyPercentage") as "fuel_avg", stddev($"fuelEfficiencyPercentage") as "fuel_stddev")
.select($"weatherCondition", $"fuel_avg")
val broadcasted = sc.broadcast(avgfuel)
val joined = df3.join(broadcasted.value, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"fuel_avg")
val streamingQuery1 = joined.writeStream.
outputMode("append").
trigger(Trigger.ProcessingTime("10 seconds")).
option("checkpointLocation", checkpointLocation).
format("json").option("path", outputLocation).start()
This executes without errors and after a while results start to be written. It might be due to the broadcast of the aggregation result, but I'm not sure.
Small investigation ;)
events can't be a DStream, because you are able to call Dataset operations on it; it must be a Dataset.
Stream-stream joins are not allowed in Spark 2.2. I've tried to run your code with events as a rate source and I get:
org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;;
Join Inner, (value#1L = eventValue#41L)
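For reference, a minimal reproduction of that check using the built-in rate source (on Spark 2.3+ inner stream-stream joins are supported, so this only fails on 2.2):
// Two streaming Datasets from the built-in rate source (columns: timestamp, value).
val left = spark.readStream.format("rate").load()
val right = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "eventValue")

// On Spark 2.2 this join raises the AnalysisException quoted above.
val joined = left.join(right, left("value") === right("eventValue"))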
The result is quite unexpected: you probably used read instead of readStream, so you created a static Dataset rather than a streaming one. Change it to readStream and it will work, of course only after upgrading to 2.3.
The code, apart from the points above, is correct and should run correctly on Spark 2.3. Note that you must also change the output mode to complete instead of append, because you are doing an aggregation.

Union of Streaming Dataframe and Batch Dataframe in Spark Structured Streaming

How can I combine streaming dataframe and batch dataframe together in Spark Structured Streaming ?
You can do a normal join:
val stream = spark.readStream(....)
val batch = spark.read....
val joinSandB = stream.join(batch, "someColumn")
See more in the documentation.
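A slightly more complete sketch (the Kafka topic, parquet path and column names are assumptions): the streaming side is read with readStream, the static side with read, and the joined result is written out with writeStream:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-batch-join").getOrCreate()

// Streaming side: Kafka, with the join key extracted from the message key.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS someColumn", "CAST(value AS STRING) AS payload")

// Static side: a batch DataFrame with a matching someColumn column.
val batch = spark.read.parquet("/data/reference")

// Stream-static inner joins are supported out of the box.
val joinSandB = stream.join(batch, Seq("someColumn"))

joinSandB.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/join_ckpt")
  .start()
  .awaitTermination()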
