Spark Streaming aggregation and filter in the same window - apache-spark

I have a fairly simple task: events are coming in and, within each window, I want to keep only the events whose value is higher than the per-key group average for that window.
I think this is the relevant part of the code:
val avgfuel = events
.groupBy(window($"enqueuedTime", "30 seconds"), $"weatherCondition")
.agg(avg($"fuelEfficiencyPercentage") as "avg_fuel")
val joined = events.join(avgfuel, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"avg_fuel")
val streamingQuery1 = joined.writeStream
.outputMode("append")
.trigger(Trigger.ProcessingTime("10 seconds"))
.option("checkpointLocation", checkpointLocation)
.format("json").option("path", containerOutputLocation)
.start()
events is a DStream.
The problem is that I'm getting empty files in the output location.
I'm using Databricks 3.5 - Spark 2.2.1 with Scala 2.11
What have I done wrong?
Thanks!
EDIT: here is more complete code -
val inputStream = spark.readStream
.format("eventhubs") // working with azure event hubs
.options(eventhubParameters)
.load()
val schema = (new StructType)
.add("id", StringType)
.add("latitude", StringType)
.add("longitude", StringType)
.add("tirePressure", FloatType)
.add("fuelEfficiencyPercentage", FloatType)
.add("weatherCondition", StringType)
val df1 = inputStream.select($"body".cast("string").as("value")
, from_unixtime($"enqueuedTime").cast(TimestampType).as("enqueuedTime")
).withWatermark("enqueuedTime", "1 minutes")
val df2 = df1.select(from_json(($"value"), schema).as("body")
, $"enqueuedTime")
val df3 = df2.select(
$"enqueuedTime"
, $"body.id".cast("integer")
, $"body.latitude".cast("float")
, $"body.longitude".cast("float")
, $"body.tirePressure"
, $"body.fuelEfficiencyPercentage"
, $"body.weatherCondition"
)
val avgfuel = df3
.groupBy(window($"enqueuedTime", "10 seconds"), $"weatherCondition" )
.agg(avg($"fuelEfficiencyPercentage") as "fuel_avg", stddev($"fuelEfficiencyPercentage") as "fuel_stddev")
.select($"weatherCondition", $"fuel_avg")
val broadcasted = sc.broadcast(avgfuel)
val joined = df3.join(broadcasted.value, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"fuel_avg")
val streamingQuery1 = joined.writeStream.
outputMode("append").
trigger(Trigger.ProcessingTime("10 seconds")).
option("checkpointLocation", checkpointLocation).
format("json").option("path", outputLocation).start()
This executes without errors, and after a while results start to be written. It might be due to the broadcast of the aggregation result, but I'm not sure.

Small investigation ;)
events can't be a DStream, because you are able to use Dataset operations on it - it must be a Dataset.
Stream-stream joins are not allowed in Spark 2.2. I tried to run your code with events as a rate source and I get:
org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;;
Join Inner, (value#1L = eventValue#41L)
This result is quite unexpected - probably you used read instead of readStream, so you created a static Dataset rather than a streaming one. Change it to readStream and it will work - of course, after an upgrade to 2.3.
The code - apart from the comments above - is correct and should run correctly on Spark 2.3. Note that you must also change the output mode to complete instead of append, because you are doing an aggregation.
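A quick way to check which kind of Dataset you actually have is the isStreaming flag; a minimal sketch using the built-in rate source (the JSON path is just a placeholder):
// Static Dataset - created with spark.read, isStreaming is false
val staticDf = spark.read.json("/some/path")
println(staticDf.isStreaming) // false
// Streaming Dataset - created with spark.readStream, isStreaming is true
val streamingDf = spark.readStream
.format("rate") // built-in test source that generates rows continuously
.option("rowsPerSecond", 10)
.load()
println(streamingDf.isStreaming) // true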

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to process data from a Cassandra database. My question is: how do I use the data that is in the input dataframe (Kafka) as "where" parameters in a Cassandra "select" without getting the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
.readStream
.format("kafka")
.options(
Map("kafka.bootstrap.servers"-> kafka_bootstrap,
"subscribe" -> kafka_topic,
"startingOffsets"-> "latest",
"fetchOffset.numRetries"-> "5",
"kafka.group.id"-> groupId
))
.load()
I get this error whenever I try to store the dataframe values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With it, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
val messageDS: Dataset[Message] = messageToProcess.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
val mensagem = messageToProcess
mensagem
}
When you need to use data in Kafka to query data in Cassandra, such an operation is a typical join between two datasets - you don't need to call .collect to find the entries, you just do the join. Enriching data in Kafka with data from an external dataset is quite a typical thing, and Cassandra provides low-latency lookups for it.
Your code could look like the following (you'll need to configure the so-called DirectJoin, see the link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._
val df = spark.readStream.format("kafka")
.options(Map(...)).load()
// ... decode data in Kafka into columns
val cassdata = spark.read.cassandraFormat("table", "keyspace").load
val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = joined // ... process joined data
val query = processed.writeStream /* ... output data somewhere ... */ .start()
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
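A minimal sketch of the session setup for that, assuming Spark Cassandra Connector 2.5+ (where the direct join optimization is enabled through the connector's Catalyst extensions; the application name and host are placeholders):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("kafka-cassandra-enrichment") // placeholder name
// Registers the connector's Catalyst extensions, which enable direct joins against Cassandra
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
.getOrCreate()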
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes, see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need at each microbatch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
// Now microBatchDf is no longer a streaming dataframe
// you can check with microBatchDf.isStreaming
val messageDS: Dataset[Message] = microBatchDf.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
// ...
}.start()

Reusing an Event Hub stream for multiple queries in Azure Data Bricks

In Azure Databricks Structured Streaming (a Scala notebook, connected to Azure IoT Hub) I am opening a stream on the Event Hub-compatible endpoint of Azure IoT Hub. I then parse the incoming stream based on the structured schema and create 3 queries (groupBy) on the same stream.
Most of the time (not always, it seems) I get an exception on one of the display queries about an epoch value on the partition (see below).
I am using a dedicated consumer group that no other application is reading from. So I would guess that opening 1 stream and running multiple streaming queries against it should be supported?
Any suggestions, explanations, or ideas to solve this? (I would like to avoid having to create 3 consumer groups and define the stream 3 times.)
Exception example:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 3 in stage 1064.0 failed 4 times, most recent failure: Lost task
3.3 in stage 1064.0 (TID 24790, 10.139.64.10, executor 7): java.util.concurrent.CompletionException:
com.microsoft.azure.eventhubs.ReceiverDisconnectedException: New
receiver with higher epoch of '0' is created hence current receiver
with epoch '0' is getting disconnected. If you are recreating the
receiver, make sure a higher epoch is used. TrackingId:xxxx,
SystemTracker:iothub-name|databricks-db,
Timestamp:2019-02-18T15:25:19, errorContext[NS: yyy, PATH:
savanh-traffic-camera2/ConsumerGroups/databricks-db/Partitions/3,
REFERENCE_ID: a0e445_7319_G2_1550503505013, PREFETCH_COUNT: 500,
LINK_CREDIT: 500, PREFETCH_Q_LEN: 0]
This is my code: (cleaned up)
// Define schema and create incoming camera eventstream
val cameraEventSchema = new StructType()
.add("TrajectId", StringType)
.add("EventTime", StringType)
.add("Country", StringType)
.add("Make", StringType)
val iotHubParameters =
EventHubsConf(cameraHubConnectionString)
.setConsumerGroup("databricks-db")
.setStartingPosition(EventPosition.fromEndOfStream)
val incomingStream = spark.readStream.format("eventhubs").options(iotHubParameters.toMap).load()
// Define parsing query selecting the required properties from the incoming telemetry data
val cameraMessages =
incomingStream
.withColumn("Offset", $"offset".cast(LongType))
.withColumn("Time (readable)", $"enqueuedTime".cast(TimestampType))
.withColumn("Timestamp", $"enqueuedTime".cast(LongType))
.withColumn("Body", $"body".cast(StringType))
// Select the event hub fields so we can work with them
.select("Offset", "Time (readable)", "Timestamp", "Body")
// Parse the "Body" column as a JSON Schema which we defined above
.select(from_json($"Body", cameraEventSchema) as "cameraevents")
// Now select the values from our JSON Structure and cast them manually to avoid problems
.select(
$"cameraevents.TrajectId".cast("string").alias("TrajectId"),
$"cameraevents.EventTime".cast("timestamp").alias("EventTime"),
$"cameraevents.Country".cast("string").alias("Country"),
$"cameraevents.Make".cast("string").alias("Make")
)
.withWatermark("EventTime", "10 seconds")
val groupedDataFrame =
cameraMessages
.groupBy(window($"EventTime", "5 seconds") as 'window)
.agg(count("*") as 'count)
.select($"window".getField("start") as 'window, $"count")
display(groupedDataFrame)
val makeDataFrame =
cameraMessages
.groupBy("Make")
.agg(count("*") as 'count)
.sort($"count".desc)
display(makeDataFrame)
val countryDataFrame =
cameraMessages
.groupBy("Country")
.agg(count("*") as 'count)
.sort($"count".desc)
display(countryDataFrame)
You can store the stream data in a table or a file location and then run multiple queries against that table or file; all of them run in real time. Each streaming query opens its own receiver against the Event Hub partitions, and only one epoch receiver per consumer group and partition can be active at a time, so the queries keep disconnecting each other; persisting the stream once and querying the stored data avoids this.
For a file, you need to specify the schema when reading the data back into a dataframe, so it's good practice to write the stream data into a table.
cameraMessages.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation","/data/events/_checkpoints/data_file")
.table("events")
Now you can execute your queries on the table 'events'.
And for the dataframe -
val cameraMessages = spark.readStream.table("events")
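For example, a sketch of the two grouped queries from the question, now reading from the table instead of opening extra receivers (assumes a Databricks notebook and the column names from the schema above):
import org.apache.spark.sql.functions.count
val events = spark.readStream.table("events")
// Each display() starts its own streaming query, but both read from the table, not from the Event Hub
display(events.groupBy("Make").agg(count("*") as 'count).sort($"count".desc))
display(events.groupBy("Country").agg(count("*") as 'count).sort($"count".desc))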
I have faced the same issue while using Event Hub, and the above trick worked for me.
To use a file instead of a table:
//Write/Append streaming data to file
cameraMessages.writeStream
.format("parquet")
.outputMode("append")
.option("checkpointLocation", "/FileStore/StreamCheckPoint.parquet")
.option("path","/FileStore/StreamData")
.start()
//Read data from the file, we need to specify the schema for it
val Schema = (
new StructType()
.add(StructField("TrajectId", StringType))
.add(StructField("EventTime", TimestampType))
.add(StructField("Country", StringType))
.add(StructField("Make", StringType))
)
val cameraMessages = (
sqlContext.readStream
.option("maxEventsPerTrigger", 1)
.schema(Schema)
.parquet("/FileStore/StreamData")
)

Getting error saying "Queries with streaming sources must be executed with writeStream.start()" on spark structured streaming [duplicate]

This question already has answers here:
How to display a streaming DataFrame (as show fails with AnalysisException)?
(2 answers)
Closed 4 years ago.
I am getting some issues while executing Spark SQL on top of Spark Structured Streaming; the error output is shown further below.
Here is my code:
object sparkSqlIntegration {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("StructuredStreaming")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.config("spark.sql.streaming.checkpointLocation", "file:///C:/checkpoint")
.getOrCreate()
setupLogging()
val userSchema = new StructType().add("name", "string").add("age", "integer")
// Create a stream of text files dumped into the logs directory
val rawData = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")
// Must import spark.implicits for conversion to DataSet to work!
import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")
val query = rawData.writeStream.outputMode("append").format("console").start()
// Keep going until we're stopped.
query.awaitTermination()
spark.stop()
}
}
During execution, I am getting the following error. As I am new to streaming, can anyone tell me how I can execute Spark SQL queries on Spark Structured Streaming?
2018-12-27 16:02:40 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, LAPTOP-5IHPFLOD, 6829, None)
2018-12-27 16:02:41 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6731787b{/metrics/json,null,AVAILABLE,#Spark}
sql results here
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///C:/Users/R/Documents/spark-poc-centri/csvFolder]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
You don't need any of these lines:
import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")
Most importantly, select * isn't needed. When you print the dataframe, you already see all of the columns, so you also don't need to register the temp view just to give it a name.
And when you use format("console"), that eliminates the need for .show().
Refer to the Spark examples for reading from a network socket and output to console.
val words = // omitted ... some Streaming DataFrame
// Generating a running word count
val wordCounts = words.groupBy("value").count()
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
Takeaway - use DataFrame operations like .select() and .groupBy() rather than raw SQL.
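A sketch of that applied to the rawData stream above (the age filter is just an illustration, not part of the original code):
import spark.implicits._
// Equivalent of "SELECT name, age FROM updates WHERE age > 21", expressed with DataFrame operations
val adults = rawData.select("name", "age").where($"age" > 21)
val query = adults.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()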
Or you can use Spark Streaming (the older DStream API). As shown in those examples, you foreachRDD over each stream batch, convert it to a DataFrame, and then query that DataFrame:
/** Case class for converting RDD to DataFrame */
case class Record(word: String)
val words = // omitted ... some DStream
// Convert RDDs of the words DStream to DataFrame and run SQL query
words.foreachRDD { (rdd: RDD[String], time: Time) =>
// Get the singleton instance of SparkSession
val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
import spark.implicits._
// Convert RDD[String] to RDD[case class] to DataFrame
val wordsDataFrame = rdd.map(w => Record(w)).toDF()
// Creates a temporary view using the DataFrame
wordsDataFrame.createOrReplaceTempView("words")
// Do word count on table using SQL and print it
val wordCountsDataFrame =
spark.sql("select word, count(*) as total from words group by word")
println(s"========= $time =========")
wordCountsDataFrame.show()
}
ssc.start()
ssc.awaitTermination()

Issue with HBase in spark streaming

I have an issue with performance when reading data from HBase in Spark Streaming. It is taking more than 5 minutes just to read data from HBase for 3 records. Below is the logic that I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter => {
val context = TaskContext.get
logger.info((s"Process for partition: ${context.partitionId} "))
val hbaseConf = HBaseConfiguration.create()
//hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
//val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
val connection = ConnectionFactory.createConnection(hbaseConf)
val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
.......
})
I have also used BulkGet. It takes around 5 seconds to process 90K messages (maybe because the API uses HBaseContext and we don't have to create any HBaseConnection). But I cannot use this, as the output of BulkGet is an RDD and I have to do a leftOuterJoin to join the BulkGet RDD with the actual RDD from Kafka. I assume this is not the correct approach, as it involves the steps below. Moreover, I have to process all 90K messages in 1 second.
Fetch distinct customer IDs from the RDD read from Kafka before passing them to BulkGet.
It also involves shuffling, as I have to leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of a join, since the BulkGet output is an RDD).
Can anyone please help me understand the performance issue when I create an HBaseConnection in mapPartitions? I have also tried setting driver-class-path.
Thanks
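One common pattern for the connection cost, sketched here as an untested assumption rather than a verified fix, is to create the HBase connection once per executor JVM in a lazy singleton and reuse it across batches:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
// Hypothetical helper: the lazy val is initialised once per executor JVM,
// so mapPartitions reuses the same connection instead of opening a new one per batch
object HBaseConnectionHolder {
lazy val connection: Connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
}
messages.mapPartitions { iter =>
val table = HBaseConnectionHolder.connection.getTable(TableName.valueOf("prod:CustomerData"))
// ... issue Gets against `table` for the records in `iter` ...
iter
}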

Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown

Intent
I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size to raw C* data is [several streaming messages to millions of C* rows], BUT the join always yields exactly ONE result [1:1] per message. After the join the resulting DF is eventually stored to another C* table.
Problem
Even though I'm joining the two DFs on the full Cassandra primary key and pushing the corresponding filter to C*, it seems that Spark is loading the whole C* data-set into memory before actually joining (which I'd like to prevent by using the filter/predicate pushdown). This leads to a lot of shuffling and tasks being spawned, hence the "simple" join takes forever...
def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("test")
.set("spark.cassandra.connection.host", "xxx")
.set("spark.cassandra.connection.keep_alive_ms", "30000")
.setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.sparkContext.setLogLevel("INFO")
// Initialise Kafka
val kafkaTopics = Set[String]("xxx")
val kafkaParams = Map[String, String](
"metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
"auto.offset.reset" -> "smallest")
// Kafka stream
val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)
// Executed on the driver
messages.foreachRDD { rdd =>
// Create an instance of SQLContext
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
// Map MyMsg RDD
val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}
// Convert RDD[MyMsg] to DataFrame
val MyMsgDf = MyMsgRdd.toDF()
.select(
$"prim1Id" as 'prim1_id,
$"prim2Id" as 'prim2_id,
$...
)
// Load DataFrame from C* data-source
val base_data = base_data_df.getInstance(sqlContext)
// Left join on prim1Id and prim2Id
val joinedDf = MyMsgDf.join(base_data,
MyMsgDf("prim1_id") === base_data("prim1_id") &&
MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
.filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
&& base_data("prim2_id").isin(MyMsgDf("prim2_id")))
joinedDf.show()
joinedDf.printSchema()
// Select relevant fields
// Persist
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
Environment
Spark 1.6
Cassandra 2.1.12
Cassandra-Spark-Connector 1.5-RC1
Kafka 0.8.2.2
SOLUTION
From discussions on the DataStax Spark Connector for Apache Cassandra mailing list:
Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown
How to create a DF from CassandraJoinRDD
I've learned the following:
Quoting Russell Spitzer
This wouldn't be a case of predicate pushdown. This is a join on a partition key column. Currently only joinWithCassandraTable supports this direct kind of join although we are working on some methods to try to have this automatically done within Spark.
Dataframes can be created from any RDD which can have a schema applied to it. The easiest thing to do is probably to map your joinedRDD[x,y] to Rdd[JoinedCaseClass] and then call toDF (which will require importing your sqlContext implicits.) See the DataFrames documentation here for more info.
So the actual implementation now resembles something like
// Join myMsg RDD with myCassandraTable
val joinedMsgRdd = myMsgRdd.joinWithCassandraTable(
"keyspace",
"myCassandraTable",
AllColumns,
SomeColumns(
"prim1_id",
"prim2_id"
)
).map{case (myMsg, cassandraRow) =>
JoinedMsg(
foo = myMsg.foo,
bar = cassandraRow.getString("bar") // CassandraRow values are read via typed getters
)
}
// Convert RDD[JoinedMsg] to DataFrame
val myJoinedDf = joinedMsgRdd.toDF()
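After that, persisting the joined result to another C* table could look like the following sketch (the table and keyspace names are placeholders):
// Write the joined DataFrame back to Cassandra via the connector's DataSource
myJoinedDf.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "target_table", "keyspace" -> "keyspace"))
.mode("append")
.save()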
Have you tried joinWithCassandraTable? It should push down to C* all the keys you are looking for...
