Reusing an Event Hub stream for multiple queries in Azure Databricks

In Azure Databricks Structured Streaming (Scala notebook, connected to Azure IoT Hub) I am opening a stream on the Event Hub-compatible endpoint of Azure IoT Hub. I then parse the incoming stream against the structured schema and create 3 queries (groupBy) on the same stream.
Most of the time (not always, it seems) I get an exception on one of the display queries, complaining about an epoch value on the partition (see below).
I am using a dedicated consumer group that no other application reads from, so I would expect that opening 1 stream and running multiple streaming queries against it is supported?
Any suggestions, explanations or ideas to solve this? (I would like to avoid having to create 3 consumer groups and define the stream 3 times.)
Exception example:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 3 in stage 1064.0 failed 4 times, most recent failure: Lost task
3.3 in stage 1064.0 (TID 24790, 10.139.64.10, executor 7): java.util.concurrent.CompletionException:
com.microsoft.azure.eventhubs.ReceiverDisconnectedException: New
receiver with higher epoch of '0' is created hence current receiver
with epoch '0' is getting disconnected. If you are recreating the
receiver, make sure a higher epoch is used. TrackingId:xxxx,
SystemTracker:iothub-name|databricks-db,
Timestamp:2019-02-18T15:25:19, errorContext[NS: yyy, PATH:
savanh-traffic-camera2/ConsumerGroups/databricks-db/Partitions/3,
REFERENCE_ID: a0e445_7319_G2_1550503505013, PREFETCH_COUNT: 500,
LINK_CREDIT: 500, PREFETCH_Q_LEN: 0]
This is my code: (cleaned up)
// Define schema and create incoming camera eventstream
val cameraEventSchema = new StructType()
  .add("TrajectId", StringType)
  .add("EventTime", StringType)
  .add("Country", StringType)
  .add("Make", StringType)

val iotHubParameters =
  EventHubsConf(cameraHubConnectionString)
    .setConsumerGroup("databricks-db")
    .setStartingPosition(EventPosition.fromEndOfStream)

val incomingStream = spark.readStream.format("eventhubs").options(iotHubParameters.toMap).load()

// Define parsing query selecting the required properties from the incoming telemetry data
val cameraMessages =
  incomingStream
    .withColumn("Offset", $"offset".cast(LongType))
    .withColumn("Time (readable)", $"enqueuedTime".cast(TimestampType))
    .withColumn("Timestamp", $"enqueuedTime".cast(LongType))
    .withColumn("Body", $"body".cast(StringType))
    // Select the event hub fields so we can work with them
    .select("Offset", "Time (readable)", "Timestamp", "Body")
    // Parse the "Body" column as a JSON Schema which we defined above
    .select(from_json($"Body", cameraEventSchema) as "cameraevents")
    // Now select the values from our JSON Structure and cast them manually to avoid problems
    .select(
      $"cameraevents.TrajectId".cast("string").alias("TrajectId"),
      $"cameraevents.EventTime".cast("timestamp").alias("EventTime"),
      $"cameraevents.Country".cast("string").alias("Country"),
      $"cameraevents.Make".cast("string").alias("Make")
    )
    .withWatermark("EventTime", "10 seconds")

val groupedDataFrame =
  cameraMessages
    .groupBy(window($"EventTime", "5 seconds") as 'window)
    .agg(count("*") as 'count)
    .select($"window".getField("start") as 'window, $"count")

display(groupedDataFrame)

val makeDataFrame =
  cameraMessages
    .groupBy("Make")
    .agg(count("*") as 'count)
    .sort($"count".desc)

display(makeDataFrame)

val countryDataFrame =
  cameraMessages
    .groupBy("Country")
    .agg(count("*") as 'count)
    .sort($"count".desc)

display(countryDataFrame)

You can store the stream data in a table or a file location and then run the multiple queries against that table or file; everything still runs in real time.
For a file you need to specify the schema when reading the data back into a dataframe, so it is easier to write the stream data to a table.
cameraMessages.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation","/data/events/_checkpoints/data_file")
.table("events")
Now you can execute your queries on the table 'events'.
And to get the dataframe back:
val cameraMessages = spark.readStream.table("events")
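With that in place, the aggregation queries from the question can read from this single stream instead of each opening its own Event Hubs receiver; for example, the Make query stays the same apart from its source:

// Same query as in the question, now fed from the 'events' table stream
val makeDataFrame =
  cameraMessages
    .groupBy("Make")
    .agg(count("*") as 'count)
    .sort($"count".desc)
display(makeDataFrame)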
I have faced the same issue while using Event Hubs, and the above trick worked for me.
For using a file instead of a table:
// Write/append streaming data to files
cameraMessages.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "/FileStore/StreamCheckPoint.parquet")
  .option("path", "/FileStore/StreamData")
  .start()
// Read the data back from the files; we need to specify the schema for it
val Schema = new StructType()
  .add(StructField("TrajectId", StringType))
  .add(StructField("EventTime", TimestampType))
  .add(StructField("Country", StringType))
  .add(StructField("Make", StringType))

val cameraMessages = spark.readStream
  .option("maxFilesPerTrigger", 1) // for a file source the rate-limit option is maxFilesPerTrigger (maxEventsPerTrigger is an Event Hubs option)
  .schema(Schema)
  .parquet("/FileStore/StreamData")

Related

Custom aggregator for Spark Structured Streaming from Event Hub

I've created a simple Spark Structured Streaming job which reads events from an Azure event hub and aggregates them per session window (where the output goes is irrelevant for this post). The job is structured like the following snippet (some details have been omitted for readability):
val connStr = new com.microsoft.azure.eventhubs.ConnectionStringBuilder()

val customEventhubParameters =
  EventHubsConf(connStr.toString())
    .setMaxEventsPerTrigger(1000)

val incomingStream = spark.readStream.format("eventhubs").options(customEventhubParameters.toMap).load()

val eventhubs = incomingStream.select($"body" cast "string")

val testSchema = new StructType()

val messages = incomingStream
  .select($"body".cast(StringType))
  .alias("body")
  .select(from_json($"body", testSchema))
  .select("from_json(body).*")

val windowedCounts = messages
  .withWatermark("timestamp", "10 seconds")
  .groupBy($"body_column", session_window($"timestamp", "1 minute"))
  .count()
Now, I'd like to replace the count aggregate function with a custom implementation.
I've tried to extend Aggregator (following the docs at https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html) and it works for input from a data frame created in the notebook, but fails to execute against the event hub with this exception:
Task not serializable: java.io.NotSerializableException: com.microsoft.azure.eventhubs.ConnectionStringBuilder
Is there really no way to create custom aggregators for streaming data?
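For reference, the Aggregator pattern from those docs looks roughly like the sketch below; it is illustrative only (a plain sum over a hypothetical Long column, not the poster's logic). Since the stack trace names ConnectionStringBuilder, the closure being serialized may be capturing connStr from the notebook rather than the aggregator itself, so keeping the aggregator in a self-contained object that references nothing from notebook scope is one thing worth checking.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Illustrative aggregator: a simple sum over a Long column.
// It references nothing from the surrounding notebook, so only this small object
// (and not e.g. a ConnectionStringBuilder) has to be serialized with the task.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val longSum = udaf(LongSum)

// Hypothetical usage in place of count():
// messages
//   .withWatermark("timestamp", "10 seconds")
//   .groupBy($"body_column", session_window($"timestamp", "1 minute"))
//   .agg(longSum($"some_long_column"))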

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to query data from a Cassandra database. My question is: how do I use the data in the input dataframe (Kafka) as "where" parameters in a Cassandra "select" without hitting the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
  .readStream
  .format("kafka")
  .options(Map(
    "kafka.bootstrap.servers" -> kafka_bootstrap,
    "subscribe" -> kafka_topic,
    "startingOffsets" -> "latest",
    "fetchOffset.numRetries" -> "5",
    "kafka.group.id" -> groupId
  ))
  .load()
I get this error whenever I try to store the dataframe values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables; with it, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
  val messageDS: Dataset[Message] = messageToProcess.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  val mensagem = messageToProcess
  mensagem
}
When you need to use data from Kafka to query data in Cassandra, such an operation is a typical join between two datasets - you don't need to call .collect to find entries, you just do the join. Enriching data in Kafka with data from an external dataset is quite a typical pattern, and Cassandra provides low-latency lookups.
Your code could look as follows (you'll need to configure the so-called DirectJoin, see the link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._

val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode data in Kafka into columns

val cassdata = spark.read.cassandraFormat("table", "keyspace").load

val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ... // process joined data

val query = processed.writeStream /* ... output data somewhere ... */ .start()
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
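As a minimal setup sketch (assuming Spark Cassandra Connector 2.5+; the host name is a placeholder, and the direct-join tuning itself is what the post covers), the connector's Catalyst extensions have to be registered on the session for the join optimization to kick in:

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming Spark Cassandra Connector 2.5+; "cassandra-host" is a placeholder
val spark = SparkSession.builder()
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "cassandra-host")
  .getOrCreate()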
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes; see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to call messageDS.collect() on a streaming dataset, which is not allowed. To achieve this you can use a foreachBatch output sink to collect the rows you need at each micro-batch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // microBatchDf is no longer a streaming dataframe
  // (you can check with microBatchDf.isStreaming)
  val messageDS: Dataset[Message] = microBatchDf.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  // ...
}.start()

Queries with streaming sources must be executed with writeStream.start()

I have a structured streaming dataframe tempDataFrame2 consisting of Field1. I am trying to calculate the approxQuantile of Field1. However, whenever I run
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0) I get the following error message:
Queries with streaming sources must be executed with writeStream.start()
Below is the code snippet:
val tempDataFrame2 = ... // a structured streaming dataframe
// Calculate IQR
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)
// Filter messages
val tempDataFrame3 = tempDataFrame2.filter("Some working filter")
val query = tempDataFrame2.writeStream.outputMode("append").queryName("table").format("console").start()
query.awaitTermination()
I have already gone through these two links from SO: Link1 Link2. Unfortunately, I am not able to relate those responses to my problem.
Edit
After reading the comments, this is the way I am planning to go ahead:
1) Read all the uncommitted offsets from the Kafka topic.
2) Save them to a dataframe variable.
3) Stop the structured streaming so that I don't read from the Kafka topic anymore.
4) Start processing the saved dataframe from step 2).
But now I am not sure how to proceed - for example, how do I know that there are no more records left to consume in the Kafka topic, so that I can stop the streaming query? (A sketch of one possible approach follows.)
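One hedged way to approach steps 1-4 is to read the currently available offset range as a batch query instead of a stream: the Kafka source accepts startingOffsets/endingOffsets for batch reads, the query stops by itself once that range is consumed, and approxQuantile is allowed on the resulting static dataframe. Broker, topic and parsing below are placeholders:

// Sketch only: bounded batch read of whatever is in the topic right now
val staticDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "my-topic")                  // placeholder
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")                // batch-only option: bounds the read
  .load()
  .selectExpr("CAST(value AS STRING) AS body")

// ... parse body into Field1 (e.g. with from_json), then the stat call is legal:
// val Array(q1, q3) = parsedDf.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)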

How to do a fast insertion of the data in a Kafka topic inside a Hive Table?

I have a Kafka topic in which I have received around 500k events.
Currently, I need to insert those events into a Hive table.
Since events are time-driven, I decided to use the following strategy:
1) Define a path inside HDFS, which I call users. Inside this path there will be several Parquet files, each one corresponding to a certain date, e.g. 20180412, 20180413, 20180414, etc. (format YYYYMMDD).
2) Create a Hive table and use the date in the format YYYYMMDD as a partition. The idea is to use each of the files inside the users HDFS directory as a partition of the table, by simply adding the corresponding Parquet file through the commands:
ALTER TABLE users DROP IF EXISTS PARTITION
(fecha='20180412') ;
ALTER TABLE users ADD PARTITION
(fecha='20180412') LOCATION '/users/20180412';
3) Read the data from the Kafka topic, iterating from the earliest event, get the date value from each event (the dateClient parameter), and given that date value, insert the record into the corresponding Parquet file.
4) In order to accomplish point 3, I read each event and saved it to a temporary HDFS file, which I then read with Spark and converted into a DataFrame.
5) Using Spark, I managed to insert the DataFrame values into the Parquet file.
The code follows this approach:
val conf = ConfigFactory.parseResources("properties.conf")
val brokersip = conf.getString("enrichment.brokers.value")
val topics_in = conf.getString("enrichment.topics_in.value")
val spark = SparkSession
  .builder()
  .master("yarn")
  .appName("ParaTiUserXY")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

val properties = new Properties
properties.put("key.deserializer", classOf[StringDeserializer])
properties.put("value.deserializer", classOf[StringDeserializer])
properties.put("bootstrap.servers", brokersip)
properties.put("auto.offset.reset", "earliest")
properties.put("group.id", "UserXYZ2")

// Schema to transform the values from the Kafka topic into JSON
val my_schema = new StructType()
  .add("longitudCliente", StringType)
  .add("latitudCliente", StringType)
  .add("dni", StringType)
  .add("alias", StringType)
  .add("segmentoCliente", StringType)
  .add("timestampCliente", StringType)
  .add("dateCliente", StringType)
  .add("timeCliente", StringType)
  .add("tokenCliente", StringType)
  .add("telefonoCliente", StringType)
val consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe( util.Collections.singletonList("geoevents") )
val fs = {
  val conf = new Configuration()
  FileSystem.get(conf)
}

val temp_path: Path = new Path("hdfs:///tmp/tmpstgtopics")
if (fs.exists(temp_path)) {
  fs.delete(temp_path, true)
}

while (true) {
  val records = consumer.poll(100)
  for (record <- records.asScala) {
    val data = record.value.toString
    val dataos: FSDataOutputStream = fs.create(temp_path)
    val bw: BufferedWriter = new BufferedWriter(new OutputStreamWriter(dataos, "UTF-8"))
    bw.append(data)
    bw.close

    val data_schema = spark.read.schema(my_schema).json("hdfs:///tmp/tmpstgtopics")
    val fechaCliente = data_schema.select("dateCliente").first.getString(0)

    if (fechaCliente < date) {
      data_schema.select("longitudCliente", "latitudCliente", "dni", "alias",
          "segmentoCliente", "timestampCliente", "dateCliente", "timeCliente",
          "tokenCliente", "telefonoCliente")
        .coalesce(1)
        .write.mode(SaveMode.Append)
        .parquet("/desa/landing/parati/xyusers/" + fechaCliente)
    } else {
      break
    }
  }
}
consumer.close()
However, this method takes around 1 second to process each record in my cluster. At that rate, it would take around 6 days to process all the events I have.
Is this the optimal way to insert the whole set of events from a Kafka topic into a Hive table?
What other alternatives exist, or which improvements could I make to my code in order to speed it up?
Other than the fact that you're not using Spark Streaming correctly to poll from Kafka (you wrote a vanilla Scala Kafka consumer with a while loop), and that coalesce(1) will always be a bottleneck because it forces a single executor to collect the records, I'll just say you're really reinventing the wheel here (see the sketch below this paragraph).
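As a rough sketch of what the same pipeline could look like in Structured Streaming (reusing brokersip and my_schema from the question; the checkpoint location is a placeholder):

import org.apache.spark.sql.functions.{col, from_json}

// Rough sketch: let Structured Streaming poll Kafka and write date-partitioned Parquet
// instead of the manual consumer loop. brokersip and my_schema come from the question;
// the checkpoint location is a placeholder.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokersip)
  .option("subscribe", "geoevents")
  .option("startingOffsets", "earliest")
  .load()

val parsed = kafkaDf
  .select(from_json(col("value").cast("string"), my_schema).as("evt"))
  .select("evt.*")

parsed.writeStream
  .format("parquet")
  .option("path", "/desa/landing/parati/xyusers")
  .option("checkpointLocation", "/desa/landing/parati/_checkpoints/xyusers") // placeholder
  .partitionBy("dateCliente") // produces dateCliente=YYYYMMDD subdirectories
  .outputMode("append")
  .start()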
What other alternatives exist
Those that I know of, all open source:
Gobblin (replaces Camus) by LinkedIn
Kafka Connect w/ HDFS Sink Connector (built into Confluent Platform, but also builds from source on Github)
Streamsets
Apache NiFi
Secor by Pinterest
For any of those, it would be beneficial to have JSON- or Avro-encoded Kafka messages rather than a flat string. That way, you can drop the files as-is onto a Hive SerDe and not parse them while consuming them. If you cannot edit the producer code, make a separate Kafka Streams job that takes the raw string data, parses it, and writes to a new topic in Avro or JSON.
If you choose Avro (which you really should, for Hive support), you can use the Confluent Schema Registry. Or, if you're running Hortonworks, they offer a similar registry.
Hive on Avro performs far better than on text or JSON. Avro can easily be transformed into Parquet, and I believe each of the above options offers at least Parquet support, while some can also do ORC (Kafka Connect doesn't do ORC at this time).
Each of the above also supports some level of automatic Hive partition generation based on the Kafka record time.
You can improve the parallelism by increasing the number of partitions of the Kafka topic and having one or more consumer groups with multiple consumers, consuming one-to-one with each partition.
As cricket_007 mentioned, you can use one of the open-source frameworks, or you can have more consumer groups consuming the same topic to offload the data.

Spark Streaming aggregation and filter in the same window

I've a fairly easy task - events are coming in and I want to filter those with a value higher than the group average (per key) within the same window.
I think this is the relevant part of the code:
val avgfuel = events
  .groupBy(window($"enqueuedTime", "30 seconds"), $"weatherCondition")
  .agg(avg($"fuelEfficiencyPercentage") as "avg_fuel")

val joined = events.join(avgfuel, Seq("weatherCondition"))
  .filter($"fuelEfficiencyPercentage" > $"avg_fuel")

val streamingQuery1 = joined.writeStream
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", checkpointLocation)
  .format("json").option("path", containerOutputLocation).start()
events is a DStream.
The problem is that I'm getting empty files in the output location.
I'm using Databricks 3.5 - Spark 2.2.1 with Scala 2.11
What have I done wrong?
Thanks!
EDIT: a more complete version of the code -
val inputStream = spark.readStream
.format("eventhubs") // working with azure event hubs
.options(eventhubParameters)
.load()
val schema = (new StructType)
.add("id", StringType)
.add("latitude", StringType)
.add("longitude", StringType)
.add("tirePressure", FloatType)
.add("fuelEfficiencyPercentage", FloatType)
.add("weatherCondition", StringType)
val df1 = inputStream.select($"body".cast("string").as("value")
, from_unixtime($"enqueuedTime").cast(TimestampType).as("enqueuedTime")
).withWatermark("enqueuedTime", "1 minutes")
val df2 = df1.select(from_json(($"value"), schema).as("body")
, $"enqueuedTime")
val df3 = df2.select(
$"enqueuedTime"
, $"body.id".cast("integer")
, $"body.latitude".cast("float")
, $"body.longitude".cast("float")
, $"body.tirePressure"
, $"body.fuelEfficiencyPercentage"
, $"body.weatherCondition"
)
val avgfuel = df3
.groupBy(window($"enqueuedTime", "10 seconds"), $"weatherCondition" )
.agg(avg($"fuelEfficiencyPercentage") as "fuel_avg", stddev($"fuelEfficiencyPercentage") as "fuel_stddev")
.select($"weatherCondition", $"fuel_avg")
val broadcasted = sc.broadcast(avgfuel)
val joined = df3.join(broadcasted.value, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"fuel_avg")
val streamingQuery1 = joined.writeStream
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", checkpointLocation)
  .format("json").option("path", outputLocation).start()
This executes without errors and after a while results start to be written. It might be due to the broadcast of the aggregation result, but I'm not sure.
Small investigation ;)
events can't be a DStream, because you are able to use Dataset operations on it - it must be a Dataset.
Stream-stream joins are not allowed in Spark 2.2. I've tried to run your code with events as a rate source and I get:
org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;;
Join Inner, (value#1L = eventValue#41L)
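For reference, a rate-source stand-in for events looks roughly like this (it only yields timestamp and value columns, so the column names differ from the real Event Hubs stream):

// Built-in rate source as a stand-in for the Event Hubs stream (repro only)
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load() // columns: timestamp, value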
The result is quite unexpected - probably you used read instead of readStream, so you created a static Dataset rather than a streaming one. Change it to readStream and it will work - of course, only after upgrading to 2.3.
The code - the comments above aside - is correct and should run correctly on Spark 2.3. Note that you must also change the output mode to complete instead of append, because you are doing an aggregation.
