I have a two-line Spark Structured Streaming job that copies data from one Kafka topic to another.
Is it possible to publish/view the number of events consumed/produced in the Spark UI?
Before Spark 3.0, the "Streaming" tab in the Spark Web UI is not available for Structured Streaming, only for the DStream-based Direct API. Starting with version 3.x, the UI includes a "Structured Streaming" tab.
However, there is another easy way of displaying the number of events processed by a Spark Structured Streaming job.
You could use a StreamingQueryListener:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.QueryProgressEvent
class CountNumRecordsListener extends StreamingQueryListener {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = { }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"""numInputRows: ${event.progress.numInputRows}""")
  }
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = { }
}
With that class you can then add a listener to your stream application (where spark is your SparkSession).
val countNumRecordsListener = new CountNumRecordsListener
spark.streams.addListener(countNumRecordsListener)
The StreamingQueryProgress class has even further information to help you understand the data processing of your streaming job.
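As a rough sketch of what else StreamingQueryProgress exposes, the listener below prints the rows consumed and produced per micro-batch, plus the offsets of each source. The names ConsumedProducedListener and ProgressSummary are made up for illustration. The sink.numOutputRows field assumes Spark 3.0 or later, and it is -1 when the sink does not report a count, so treat the "produced" figure as best effort.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical helper: formats one summary line per micro-batch.
object ProgressSummary {
  def format(batchId: Long, numInputRows: Long, numOutputRows: Long): String =
    s"batch=$batchId consumed=$numInputRows produced=$numOutputRows"
}

// Hypothetical listener name; register it with spark.streams.addListener(...).
class ConsumedProducedListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Assumption: Spark 3.0+, where SinkProgress has numOutputRows (-1 if not reported).
    println(ProgressSummary.format(p.batchId, p.numInputRows, p.sink.numOutputRows))
    // Per-source detail: description, row count, and the offset range of this batch.
    p.sources.foreach { s =>
      println(s"  source=${s.description} rows=${s.numInputRows} start=${s.startOffset} end=${s.endOffset}")
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
```

The output goes to the driver's stdout, so on a cluster it ends up in the driver log rather than in the Spark UI.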
I have a Flink job that reads data from Kafka into a table which is emitted into a DataStream on which I apply a filter function and then convert the data stream back to a table which writes data back to Kafka.
I want to test the functionality of the filter function. I am writing unit tests in Groovy using Spock (framework). In my unit test I am calling the Flink job with the Table SQL string with details about the Kafka topic, however, I am confused on how to load the right StreamExecution and TableEnvironment because when I create a new object of my Flink class, those values are null and I don't have getters/setters to set everything up because that would make the code really messy.
The following is my logic. My question is: can I use the Apache Flink APIs just as seamlessly from Groovy, or are there many layers/pitfalls, and how can I better approach these tests?
class DataStreamTests extends Specification {
    @Autowired
    ApplicationConfiguration configuration;
    FlinkStreaming streaming = new FlinkStreaming();
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    EnvironmentSettings settings = EnvironmentSettings
            .newInstance()
            .inStreamingMode()
            .build();
    final StreamTableEnvironment tableEnvironment = TableEnvironment.create(settings);
    def resourcePath = "cars/porche.txt"
    StreamSpec streamSpec = ConfigurationParser.createStreamSpec(getClass().getResource(resourcePath).text)
    def "create a new input stream from input table sql"() {
        given:
        DataStream<String> streamRecords = env.readTextFile("/streaming_signals/pedals.txt")
        streaming.setConfiguration(configuration)
        streaming.setTableEnvironment(tableEnvironment);
        when:
        String tableSpec = streaming.createTableSpec(streamSpec);
        DataStream<Row> rawStream = streaming.getFilteredStream(streamSpec, tableSpec)
        DataStream<String> comapareStreams = rawStream.map(new MapFunction<Row, String>() {
            @Override
            public String map(Row record) {
                // Logic to compare stream received with test stream
            }
        });
        then:
        // Comparison Logic
    }
}
org.apache.flink.table.api.ValidationException: Could not find any factories that implement 'org.apache.flink.table.delegation.ExecutorFactory' in the classpath.
at org.apache.flink.table.factories.FactoryUtil.discoverFactory(FactoryUtil.java:385)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:295)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.create(TableEnvironmentImpl.java:266)
at org.apache.flink.table.api.TableEnvironment.create(TableEnvironment.java:95)
at com.streaming.DataStreamTests.$spock_initializeFields(DataStreamTests.groovy:38
We are running into a problem where -- for one of our applications --
we don't see any evidence of batches being processed in the Structured
Streaming tab of the Spark UI.
I have written a small program (below) to reproduce the issue.
A self-contained project that allows you to build the app, along with scripts that facilitate upload to AWS, and details on how to run and reproduce the issue can be found here: https://github.com/buildlackey/spark-struct-streaming-metrics-missing-on-aws (The github version of the app is a slightly evolved version of what is presented below, but it illustrates the problem of Spark streaming metrics not showing up.)
The program can be run 'locally' -- on someone's laptop in local[*] mode (say with a dockerized Kafka instance),
or on an EMR cluster. For local mode operation you invoke the main method with 'localTest' as the first
argument.
In our case, when we run on the EMR cluster, pointing to a topic
where we know there are many data records (we read from 'earliest'), we
see that THERE ARE INDEED NO BATCHES PROCESSED -- on the cluster for some reason...
In the local[*] case we CAN see batches processed.
To capture evidence of this, I wrote a foreachBatch handler that simply does a
toLocalIterator.asScala.toList.mkString("\n") on the Dataset of each batch, and then dumps the
resultant string to a file. Running locally, I see evidence of the
captured records in the temporary file. HOWEVER, when I run on
the cluster and SSH into one of the executors, I see NO SUCH
files. I also checked the master node: no files matching the pattern 'Missing'.
So it appears that batches are not triggering on the cluster. Our Kafka has plenty of data, and
when running on the cluster the logs show we are churning through messages at increasing offsets:
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542913 requested 4596542913
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542914 requested 4596542914
Note to get the logs we are using:
yarn logs --applicationId <appId>
which should get both driver and executor logs for the entire run (when app terminates)
Now, in the local[*] case we CAN see batches processed. The evidence is that we see a file whose name
is matching the pattern 'Missing' in our tmp folder.
I am including my simple demo program below. If you can spot the issue and clue us in, I'd be very grateful!
// Please forgive the busy code.. i stripped this down from a much larger system....
import com.typesafe.scalalogging.StrictLogging
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.{Dataset, SparkSession}
import java.io.File
import java.util
import scala.collection.JavaConverters.asScalaIteratorConverter
import scala.concurrent.duration.Duration
object AwsSupportCaseFailsToYieldLogs extends StrictLogging {
  case class KafkaEvent(fooMsgKey: Array[Byte],
                        fooMsg: Array[Byte],
                        topic: String,
                        partition: String,
                        offset: String) extends Serializable
  case class SparkSessionConfig(appName: String, master: String) {
    def sessionBuilder(): SparkSession.Builder = {
      val builder = SparkSession.builder
      builder.master(master)
      builder
    }
  }
  case class KafkaConfig(kafkaBootstrapServers: String, kafkaTopic: String, kafkaStartingOffsets: String)
  def sessionFactory: (SparkSessionConfig) => SparkSession = {
    (sparkSessionConfig) => {
      sparkSessionConfig.sessionBuilder().getOrCreate()
    }
  }
  def main(args: Array[String]): Unit = {
    val (sparkSessionConfig, kafkaConfig) =
      if (args.length >= 1 && args(0) == "localTest") {
        getLocalTestConfiguration
      } else {
        getRunOnClusterConfiguration
      }
    val spark: SparkSession = sessionFactory(sparkSessionConfig)
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    val dataSetOfKafkaEvent: Dataset[KafkaEvent] = spark.readStream.
      format("kafka").
      option("subscribe", kafkaConfig.kafkaTopic).
      option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers).
      option("startingOffsets", kafkaConfig.kafkaStartingOffsets).
      load.
      select(
        $"key" cast "binary",
        $"value" cast "binary",
        $"topic",
        $"partition" cast "string",
        $"offset" cast "string").map { row =>
        KafkaEvent(
          row.getAs[Array[Byte]](0),
          row.getAs[Array[Byte]](1),
          row.getAs[String](2),
          row.getAs[String](3),
          row.getAs[String](4))
      }
    val initDF = dataSetOfKafkaEvent.map { item: KafkaEvent => item.toString }
    val function: (Dataset[String], Long) => Unit =
      (dataSetOfString, batchId) => {
        val iter: util.Iterator[String] = dataSetOfString.toLocalIterator()
        val lines = iter.asScala.toList.mkString("\n")
        val outfile = writeStringToTmpFile(lines)
        println(s"writing to file: ${outfile.getAbsolutePath}")
        logger.error(s"writing to file: ${outfile.getAbsolutePath} / $lines")
      }
    val trigger = Trigger.ProcessingTime(Duration("1 second"))
    initDF.writeStream
      .foreachBatch(function)
      .trigger(trigger)
      .outputMode("append")
      .start
      .awaitTermination()
  }
  private def getLocalTestConfiguration: (SparkSessionConfig, KafkaConfig) = {
    val sparkSessionConfig: SparkSessionConfig =
      SparkSessionConfig(master = "local[*]", appName = "dummy2")
    val kafkaConfig: KafkaConfig =
      KafkaConfig(
        kafkaBootstrapServers = "localhost:9092",
        kafkaTopic = "test-topic",
        kafkaStartingOffsets = "earliest")
    (sparkSessionConfig, kafkaConfig)
  }
  private def getRunOnClusterConfiguration = {
    val sparkSessionConfig: SparkSessionConfig = SparkSessionConfig(master = "yarn", appName = "AwsSupportCase")
    val kafkaConfig: KafkaConfig =
      KafkaConfig(
        kafkaBootstrapServers= "kafka.foo.bar.broker:9092", // TODO - change this for kafka on your EMR cluster.
        kafkaTopic= "mongo.bongo.event", // TODO - change this for kafka on your EMR cluster.
        kafkaStartingOffsets = "earliest")
    (sparkSessionConfig, kafkaConfig)
  }
  def writeStringFile(string: String, file: File): File = {
    java.nio.file.Files.write(java.nio.file.Paths.get(file.getAbsolutePath), string.getBytes).toFile
  }
  def writeStringToTmpFile(string: String, deleteOnExit: Boolean = false): File = {
    val file: File = File.createTempFile("streamingConsoleMissing", "sad")
    if (deleteOnExit) {
      // Schedule removal at JVM exit; file.delete() would remove the file before it is written.
      file.deleteOnExit()
    }
    writeStringFile(string, file)
  }
}
I have encountered a similar issue, and setting maxOffsetsPerTrigger fixed it. Actually, it is not really an issue.
All logs and metrics for a batch are only printed or shown after
that batch has finished. That's the reason why you can't see the job making
progress.
If maxOffsetsPerTrigger doesn't solve the issue, you could try consuming from the latest offset to confirm that the processing logic is correct.
This is a provisional answer. One of our team members has a theory that looks pretty likely. Here it is: Batches ARE getting processed (this is demonstrated better by the version of the program I linked to on github), but we are thinking that since there is so much backed up in the topic on our cluster that the processing (from earliest) of the first batch takes a very long time, hence when looking at the cluster we see zero batches processed... even though there is clearly work being done. It might be that the solution is to use maxOffsetsPerTrigger to gate the amount of incoming traffic (when starting from earliest and working w/ a topic that has huge volumes of data). We are working on confirming this.
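A minimal sketch of gating the input with maxOffsetsPerTrigger, assuming the Kafka source from the program above. The object name RateLimitedKafkaSource and the cap of 100000 are placeholders, not tuned values; pick a number that one trigger interval can realistically process on your cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RateLimitedKafkaSource {
  // Hypothetical cap: total number of offsets read per trigger, split across topic partitions.
  val MaxOffsetsPerTrigger: Long = 100000L

  def read(spark: SparkSession, bootstrapServers: String, topic: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      // Without this cap, the first micro-batch tries to read the whole backlog at once.
      .option("maxOffsetsPerTrigger", MaxOffsetsPerTrigger)
      .load()
}
```

With the cap in place, each micro-batch is bounded, so completed batches (and their metrics) should start appearing shortly after startup instead of only after the entire backlog has been read.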
I am consuming from a Kafka topic using Spark Streaming (Kafka direct stream).
The data in this topic arrives every 5 minutes from another source.
I need to process the data that arrives every 5 minutes and convert it into a Spark DataFrame.
However, a stream is a continuous flow of data.
My issue is: how do I determine that I am done reading the first set of data that was loaded into the Kafka topic (so that I can convert it into a DataFrame and start my work)?
I know I can set the batch interval (in JavaStreamingContext) to a certain value, but even then I can never be sure how much time the source will take to push the data to the topic.
Any suggestions are welcome.
If I understand your question correctly, you would like to not create a batch until all the data for the 5 minutes' worth of input has been read.
Spark does not provide an API like that out of the box.
You can, however, use a sliding window on your received stream to achieve part of what you want. (See last example code)
The other (harder) way is to implement your own org.apache.spark.streaming.util.ManualClock to achieve what you need.
ManualClock is a private class, so the wrapper has to be declared inside Spark's own package namespace:
package org.apache.spark.streaming
import org.apache.spark.util.ManualClock
object ClockWrapper {
  def advance(ssc: StreamingContext, timeToAdd: Duration): Unit = {
    val manualClock = ssc.scheduler.clock.asInstanceOf[ManualClock]
    manualClock.advance(timeToAdd.milliseconds)
  }
}
Then in your own class:
import org.apache.spark.streaming.{ClockWrapper, Duration, Seconds, StreamingContext}
//elided.
override def sparkConfig: Map[String, String] = {
  super.sparkConfig + ("spark.streaming.clock" -> "org.apache.spark.streaming.util.ManualClock")
}
def ssc: StreamingContext = _ssc
def advanceClock(timeToAdd: Duration): Unit = {
  //Only if some other conditions are met..
  ClockWrapper.advance(_ssc, timeToAdd)
}
def advanceClockOneBatch(): Unit = {
  advanceClock(Duration(batchDuration.milliseconds))
}
State-based stream management can be done using the mapWithState API.
object StatefulStreamOperation {
  val sparkConf = new SparkConf().setAppName("")
  // Create the context with a 1 second batch size
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  ssc.checkpoint(".")
  // Pseudocode: the key type (a batch time rounded to a multiple of 5 minutes), UserClass,
  // and initialRDD are placeholders that you have to supply.
  val incoming: DStream[(batchTimeMultipleOf5Minute, UserClass)]
  val mappingFunc = (key: batchTimeMultipleOf5Minute, value: Option[UserClass], state: State[UserClass]) => {
    // Do whatever you need to do to the data.
    // say (result, newState) = Some_Cool_operation(value, state)
    state.update(newState)
    result
  }
  val stateDstream = incoming.mapWithState(
    StateSpec.function(mappingFunc).initialState(initialRDD))
  // Do something with the result.
  ssc.start()
  ssc.awaitTermination()
}
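For the window-based option, here is a minimal sketch, assuming a DStream[String] that carries the record values of the Kafka direct stream. The names FiveMinuteWindows and lines are hypothetical, and the 5-minute window and slide only mirror the arrival interval from the question; both must be multiples of the batch interval. A window is aligned to the streaming clock, not to the moment the source finishes pushing, so this only approximates "one DataFrame per upload".

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

object FiveMinuteWindows {
  // Assumption: data arrives roughly every 5 minutes, so window length == slide interval.
  val WindowMinutes: Long = 5L

  // `lines` stands in for the values of the Kafka direct stream (hypothetical name).
  def process(lines: DStream[String], spark: SparkSession): Unit = {
    import spark.implicits._
    lines
      .window(Minutes(WindowMinutes), Minutes(WindowMinutes))
      .foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          // One DataFrame per non-overlapping 5-minute window.
          val df = rdd.toDF("value")
          df.show(10, truncate = false)
        }
      }
  }
}
```

If an upload can straddle a window boundary, the records end up split across two DataFrames; the mapWithState approach is the better fit when that matters.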
I am trying to use the ForeachWriter interface in Spark 2.1, but I cannot use it.
It will be supported in Spark 2.2.0. To learn how to use it, I suggest you read this blog post: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
You can try Spark 2.2.0 RC2 [1] or just wait for the final release.
Another option is taking a look at this blog if you cannot use Spark 2.2.0+:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
It has a very simple Kafka sink and maybe that's enough for you.
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html
The first thing to know is that if you are working with Spark Structured Streaming and processing streaming data, you will have a streaming Dataset.
That being said, the way to write this streaming Dataset is by using a ForeachWriter; you got that right.
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[Commons.UserEvent] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: Commons.UserEvent) = {
    processRow(value)
  }
  override def close(errorOrNull: Throwable) = {}
}
val query =
  ds.writeStream.queryName("aggregateStructuredStream").outputMode("complete").foreach(writer).start
And the function that writes to the topic will look like:
private def processRow(value: Commons.UserEvent) = {
  /*
   * Producer.send(topic, data)
   */
}
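To make the Producer.send placeholder concrete, here is one possible sketch of a ForeachWriter that owns a Kafka producer. It assumes String payloads and the kafka-clients library on the classpath; the class name KafkaForeachWriter and the servers/topic parameters are made up for illustration, and a real job would serialize Commons.UserEvent instead of passing strings.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Hypothetical writer: `servers` and `topic` are supplied by the caller.
class KafkaForeachWriter(servers: String, topic: String) extends ForeachWriter[String] {
  // Created in open() on the executor, because KafkaProducer is not serializable.
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer[String, String](KafkaForeachWriter.producerProps(servers))
    true // process every partition/version
  }

  override def process(value: String): Unit = {
    // Asynchronous send; delivery errors are not checked in this sketch.
    producer.send(new ProducerRecord[String, String](topic, value))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close() // close() flushes pending records
  }
}

object KafkaForeachWriter {
  def producerProps(servers: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", servers)
    props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }
}
```

It would be plugged in as ds.writeStream.foreach(new KafkaForeachWriter("localhost:9092", "out-topic")).start(), where both arguments are example values. Opening one producer per partition and batch is simple but not cheap, so on Spark 2.2.0+ the built-in Kafka sink is probably the better choice.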
Lately, as part of scientific research, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose is to gain insight into the commit-build relationship, in order to perform further analyses.
For this, I've implemented the following Travis custom receiver:
object TravisUtils {
  def createStream(ctx : StreamingContext, storageLevel: StorageLevel) : ReceiverInputDStream[Build] = new TravisInputDStream(ctx, storageLevel)
}
private[streaming]
class TravisInputDStream(ctx : StreamingContext, storageLevel : StorageLevel) extends ReceiverInputDStream[Build](ctx) {
  def getReceiver() : Receiver[Build] = new TravisReceiver(storageLevel)
}
private[streaming]
class TravisReceiver(storageLevel: StorageLevel) extends Receiver[Build](storageLevel) with Logging {
  def onStart() : Unit = {
    new BuildStream().addListener(new BuildListener {
      override def onBuildsReceived(numberOfBuilds: Int): Unit = {
      }
      override def onBuildRepositoryReceived(build: Build): Unit = {
        store(build)
      }
      override def onException(e: Exception): Unit = {
        reportError("Exception while streaming travis", e)
      }
    })
  }
  def onStop() : Unit = {
  }
}
The receiver uses my custom-made Travis API library (developed in Java using Apache Async Client). However, the problem is the following: the data that I should be receiving is continuous and changing, i.e. it is being pushed to Travis and GitHub constantly. As an example, consider that GitHub records approximately 350 events per second, including push events, commit comments and similar.
But when streaming from either GitHub or Travis, I do get the data for the first two batches, and afterwards the RDDs that are part of the DStream are empty, although there is data to be streamed!
So far I've checked a couple of things, including the HttpClient used for issuing requests to the API, but none of them actually solved this problem.
Therefore, my question is: what could be going on? Why isn't Spark streaming the data after period x passes? Below, you may find the context and configuration I set up:
val configuration = new SparkConf().setAppName("StreamingSoftwareAnalytics").setMaster("local[2]")
val ctx = new StreamingContext(configuration, Seconds(3))
val stream = GitHubUtils.createStream(ctx, StorageLevel.MEMORY_AND_DISK_SER)
// RDD IS EMPTY - that is what is happening!
stream.window(Seconds(9)).foreachRDD(rdd => {
  if (rdd.isEmpty()) {println("RDD IS EMPTY")} else {rdd.collect().foreach(event => println(event.getRepo.getName + " " + event.getId))}
})
ctx.start()
ctx.awaitTermination()
Thanks in advance!