I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found.
My question: Does anyone have a working example of using PySpark Structured Streaming with a sink that produces output visible in Apache Zeppelin? Ideally it would also use the socket source, as that's easy to test with.
I'm using:
Ubuntu 16.04
spark-2.2.0-bin-hadoop2.7
zeppelin-0.7.3-bin-all
Python3
I've based my code on the structured_network_wordcount.py example. It works when run from the PySpark shell (./bin/pyspark --master local[2]); I see tables per batch.
%pyspark
# structured streaming
from pyspark.sql.functions import *
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', 'localhost')\
    .option('port', 9999)\
    .option('includeTimestamp', 'true')\
    .load()
# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array into multiple rows
words = lines.select(
    explode(split(lines.value, ' ')).alias('word'),
    lines.timestamp
)
# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, '10 seconds', '1 seconds'),
    words.word
).count().orderBy('window')
# Start running the query that prints the windowed word counts to the console
query = windowedCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .option('truncate', 'false')\
    .start()
print("Starting...")
query.awaitTermination(20)
I'd expect to see printouts of results for each batch, but instead I just see Starting..., and then False, the return value of query.awaitTermination(20).
In a separate terminal I'm entering some data into a nc -lk 9999 netcat session while the above is running.
The console sink is not a good choice for an interactive notebook-based workflow. Even in Scala, where the output can be captured, it requires an awaitTermination call (or equivalent) in the same paragraph, effectively blocking the note.
%spark
spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .option("includeTimestamp", "true")
  .load()
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination() // Block execution, to force Zeppelin to capture the output
The chained awaitTermination could be replaced with a standalone call in the same paragraph; that would work as well:
%spark
val query = df
  .writeStream
  ...
  .start()

query.awaitTermination()
Without it, Zeppelin has no reason to wait for any output. PySpark just adds another problem on top of that - indirect execution. Because of that, even blocking the query won't help you here.
Moreover, continuous output from the stream can cause rendering issues and memory problems when browsing the note (it might be possible to use the Zeppelin display system via InterpreterContext or the REST API to achieve somewhat more sensible behavior, where the output is overwritten or periodically cleared).
A much better choice for testing with Zeppelin is the memory sink. This way you can start a query without blocking:
%pyspark
query = (windowedCounts
    .writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("some_name")
    .start())
and query the result on demand in another paragraph:
%pyspark
spark.table("some_name").show()
It can be coupled with reactive streams or a similar solution to provide interval-based updates.
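For example, a minimal polling sketch (not part of the original answer; it assumes the memory sink query above was registered as some_name) could refresh the table every few seconds in its own paragraph:

%pyspark
import time

# re-query the in-memory table every 5 seconds; bounded so the paragraph
# eventually returns instead of blocking the note forever
for _ in range(12):
    spark.table("some_name").show(truncate=False)
    time.sleep(5)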
It is also possible to use StreamingQueryListener with Py4j callbacks to couple rx with onQueryProgress events, although query listeners are not supported in PySpark and it requires a bit of code to glue things together. Scala interface:
package com.example.spark.observer

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

trait PythonObserver {
  def on_next(o: Object): Unit
}

class PythonStreamingQueryListener(observer: PythonObserver)
    extends StreamingQueryListener {

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    observer.on_next(event)
  }

  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
}
build a jar, adjusting the build definition to reflect the desired Scala and Spark versions:
scalaVersion := "2.11.8"

val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
put it on the Spark classpath, and patch StreamingQueryManager:
%pyspark
from pyspark.sql.streaming import StreamingQueryManager
from pyspark import SparkContext

def addListener(self, listener):
    jvm = SparkContext._active_spark_context._jvm
    jlistener = jvm.com.example.spark.observer.PythonStreamingQueryListener(
        listener
    )
    self._jsqm.addListener(jlistener)
    return jlistener

StreamingQueryManager.addListener = addListener
start the callback server:
%pyspark
sc._gateway.start_callback_server()
and add the listener:
%pyspark
from rx.subjects import Subject

class StreamingObserver(Subject):
    class Java:
        implements = ["com.example.spark.observer.PythonObserver"]

observer = StreamingObserver()
spark.streams.addListener(observer)
Finally, you can subscribe and block execution:
%pyspark
(observer
    .map(lambda p: p.progress().name())
    # .filter() can be used to print only for a specific query
    .subscribe(lambda n: spark.table(n).show() if n else None))

input()  # Block execution to capture the output
The last step should be executed after you have started the streaming query.
It is also possible to skip rx and use a minimal observer like this:
class StreamingObserver(object):
    class Java:
        implements = ["com.example.spark.observer.PythonObserver"]

    def on_next(self, value):
        try:
            name = value.progress().name()
            if name:
                spark.table(name).show()
        except:
            pass
It gives a bit less control than the Subject (one caveat is that it can interfere with other code printing to stdout and can only be stopped by removing the listener; with a Subject you can easily dispose of the subscribed observer once you're done), but otherwise it should work more or less the same.
Note that any blocking action will be sufficient to capture the output from the listener, and it doesn't have to be executed in the same cell. For example,
%pyspark
observer = StreamingObserver()
spark.streams.addListener(observer)
and
%pyspark
import time
time.sleep(42)
would work in a similar way, printing the table for a defined time interval.
For completeness, you can implement StreamingQueryManager.removeListener.
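A minimal sketch of such a patch (an assumption, not part of the original answer; it mirrors the addListener patch above and reuses its _jsqm handle):

%pyspark
from pyspark.sql.streaming import StreamingQueryManager

def removeListener(self, jlistener):
    # jlistener is the JVM-side listener returned by the patched addListener above
    self._jsqm.removeListener(jlistener)

StreamingQueryManager.removeListener = removeListener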
zeppelin-0.7.3-bin-all uses Spark 2.1.0 (so there is no rate format to test Structured Streaming with, unfortunately).
Make sure that when you start a streaming query with the socket source, nc -lk 9999 has already been started (as the query simply stops otherwise).
Also make sure that the query is indeed up and running.
val lines = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load

val q = lines.writeStream.format("console").start
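From PySpark, as used in the question, a quick sanity check along these lines can confirm this (a sketch, assuming query is the handle returned by writeStream.start() in the question's code):

# assumes `query` is the StreamingQuery handle returned by writeStream.start()
print(query.isActive)      # True while the query is running
print(query.status)        # current status, including a descriptive message
print(query.lastProgress)  # None until at least one batch has completed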
It's indeed true that you won't be able to see the output in a Zeppelin notebook, possibly because:
Streaming queries start on their own threads (that seems to be outside Zeppelin's reach)
The console sink writes to standard output (it uses the Dataset.show operator on that separate thread).
All this makes "intercepting" the output unavailable in Zeppelin.
So we come to answer the real question:
Where is the standard output written to in Zeppelin?
Well, with a very limited understanding of the internals of Zeppelin, I thought it could be logs/zeppelin-interpreter-spark-[hostname].log, but unfortunately I could not find the output from the console sink there. That's where you can find logs from Spark (and Structured Streaming in particular) components that use log4j, which the console sink does not.
It looks as if your only long-term solution is to write your own console-like custom sink that uses a log4j logger. Honestly, it is not as hard as it sounds. Follow the sources of the console sink.
Related
I have the following streaming code in a Databricks notebook (Python).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("MyTest") \
    .getOrCreate()

# Create a streaming DataFrame
lines = spark.readStream \
    .format("delta") \
    .table("myschema.streamTest")
In notebook 2, I have
def foreach_batch_function(df, epoch_id):
    test = df
    print(test['simplecolumn'])
    display(test['simplecolumn'])
    test['simplecolumn'].display

lines.writeStream.outputMode("append").foreachBatch(foreach_batch_function).format('console').start()
When I execute the above, where can I see the output from the .display function? I looked inside the cluster driver logs and I don't see anything. I also don't see anything in the notebook itself when it is executed, except a successfully initialized and running stream. I do see that the dataframe parameter's data is displayed in the console, but I am trying to see that assigning test was successful.
I am trying to carry out this manipulation as a precursor to time series operations over mini batches for real-time model scoring in Python, but I am struggling to get the basics right in the structured streaming world. A working model functions but executes every 10-15 minutes; I would like to make it real-time via streams, hence this question.
You're mixing different things together - I recommend reading the initial parts of the Structured Streaming documentation, or chapter 8 of the Learning Spark, 2nd ed. book (freely available from here).
You can use the display function directly on the stream (preferably with checkpointLocation and maybe trigger parameters, as described in the documentation):
display(lines)
Regarding the scoring - it's usually done by defining a user-defined function and applying it to the stream via the select or withColumn functions of the DataFrame. The easiest way is to register a model in the MLflow registry and then load the model with built-in functions, like:
import mlflow.pyfunc
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
preds = lines.withColumn("predictions", pyfunc_udf(params...))
Look into that notebook for examples.
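As a rough follow-on sketch (not from the original answer; the table name and checkpoint path are placeholders, and toTable requires Spark 3.1+ or a recent Databricks runtime), the scored stream would then typically be written out rather than printed:

(preds.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/scored")  # placeholder path
    .toTable("myschema.scored_predictions"))                  # placeholder table name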
I'm trying to develop a small Spark app (using Scala) to read messages from Kafka (Confluent) and write (insert) them into a Hive table. Everything works as expected, except for one important feature - managing offsets when the application is restarted (submitted). It confuses me.
Cut from my code:
def main(args: Array[String]): Unit = {

  val sparkSess = SparkSession
    .builder
    .appName("Kafka_to_Hive")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()

  sparkSess.sparkContext.setLogLevel("ERROR")

  // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
  sparkSess.udf.register("deserialize", (bytes: Array[Byte]) =>
    DeserializerWrapper.deserializer.deserialize(bytes)
  )

  val kafkaDataFrame = sparkSess
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("group.id", "kafka-to-hive-1")
    // ------> which Kafka options do I need to set here for starting from last right offset to ensure completeness of data and "exactly once" writing? <--------
    .option("failOnDataLoss", (false: java.lang.Boolean))
    .option("subscribe", "some_topic")
    .load()

  import org.apache.spark.sql.functions._

  // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
  val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")
  val df = valueDataFrame.select(
      from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
    .select("parsed_value.*")

  df.writeStream
    .foreachBatch((batchDataFrame, batchId) => {
      batchDataFrame.createOrReplaceTempView("`some_view_name`")
      val sqlText = "SELECT * FROM `some_view_name` a where some_field='some value'"
      val batchDataFrame_view = batchDataFrame.sparkSession.sql(sqlText)
      batchDataFrame_view.write.insertInto("default.some_hive_table")
    })
    .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
    .start()
    .awaitTermination()
}
Questions (the questions are related to each other):
Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?
Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")
What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)
"Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?"
You would need to set startingOffsets=latest and clean up the checkpoint files.
"Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")"
Similar to the first question, if you set startingOffsets as a JSON string, you need to delete the checkpoint files. Otherwise, the Spark application will always fetch the information stored in the checkpoint files and override the settings given in the startingOffsets option.
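As an illustration (a PySpark sketch of the same idea, even though the question's code is Scala; the topic, table, and checkpoint names are taken from the question), startingOffsets is only honored on the first start, after which the checkpoint takes precedence:

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "some_topic")
    # honored only when the checkpoint below does not exist yet
    .option("startingOffsets", "latest")
    .load())

(df.writeStream
    .foreachBatch(lambda batch_df, batch_id: batch_df.write.insertInto("default.some_hive_table"))
    # committed offsets are tracked here; delete this directory to make startingOffsets apply again
    .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
    .start())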
"What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)"
Asking about "the right way" might lead to opinion based answers and is therefore off-topic on Stackoverflow. Anyway, using Spark Structured Streaming is already a mature and production-ready approach in my experience. However, it is always worth also looking into KafkaConnect.
I have a self-referencing protobuf schema:
message A {
    uint64 timestamp = 1;
    repeated A fields = 2;
}
I am generating the corresponding Scala classes using scalaPB and then trying to decode the messages which are consumed from Kafka stream, following these steps:
def main(args: Array[String]) {
  val spark = SparkSession.builder
    .master("local")
    .appName("spark session example")
    .getOrCreate()

  import spark.implicits._

  val ds1 = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "student")
    .load()

  val ds2 = ds1.map(row=> row.getAs[Array[Byte]]("value")).map(Student.parseFrom(_))

  val query = ds2.writeStream
    .outputMode("append")
    .format("console")
    .start()

  query.awaitTermination()
}
This is a related question here on StackOverflow.
However, Spark Structured Streaming throws a cyclic reference error at this line.
val ds2 = ds1.map(row=> row.getAs[Array[Byte]]("value")).map(Student.parseFrom(_))
I understand it is because of the recursive reference, which can be handled in Spark only at the driver (basically at the RDD or Dataset level). Has anyone figured out a workaround for this, to enable recursive calling through a UDF, for instance?
It turns out this is due to a limitation in the way the Spark architecture is designed. To process huge amounts of data, the code is distributed over all the worker nodes along with a portion of the data, and the results are coordinated through a master node. Since there is nothing on the worker nodes to keep track of the stack, recursion is not allowed on a worker, only at the driver level.
In short, with the current build of Spark it is not possible to do this kind of recursive parsing. The best option is to move to Java, which has similar libraries and easily parses a recursive protobuf file.
Framing the question
This question stems from the following problem:
I want to test a Spark structured streaming [2.2.X or 2.3.x] application that reads its input from Kafka (without a from beginning flag).
The app essentially reads like this:
val sparkSession = SparkSession.builder.getOrCreate()
val lines = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe", "test")
.load()
Once the app is started and running, it may take an arbitrary amount of time for it to start listening to the Kafka topic.
How can I post the input data to Kafka, after waiting the least time possible?
Naive solution
A simple solution to the problem would be to wait a large arbitrary amount of time after starting the app:
startApplication()
Thread.sleep(10*1000)
postInputDataToKafka()
This is problematic on 2 accounts:
- Not all environments are equal, and some may take longer than you expected
- It's wasteful
Complex solution
Another option would be to use a global supervisor, meaning, to have some process that coordinates the test.
Meaning, the same process that starts the application waits to receive a signal from it that it's ready to listen. After this signal is received, it then starts posting the input data.
This approach requires the application to send such a signal; my question is how to do so.
You can wait until StreamingQuery.lastProgress returns a non-null value, such as
import org.apache.spark.sql.streaming.StreamingQuery

val q: StreamingQuery = ... // start a streaming query

while (q.lastProgress == null) {
  Thread.sleep(100)
}

postInputDataToKafka()
I'm trying to restart a streaming query in Spark using the code below in place of query.awaitTermination(). The code will sit inside an infinite loop, look for a trigger to restart the query, and then execute the code below. Basically, I'm trying to refresh a cached DataFrame.
query.processAllAvailable()
query.stop()

// oldDF is a cached DataFrame created from a GlobalTempView, which is about 150GB in size.
oldDF.unpersist()

val inputDf: DataFrame = readFile(spec, sparkSession) // read the file from S3 or any other source
val recreateddf = inputDf.persist()

// Start the query // here should I start the query again by invoking readStream?
But when I looked into the Spark documentation, it says:
processAllAvailable() - Blocks until all available data in the source has been processed and committed to the sink. This method is intended for testing. Note that in the case of continually arriving data, this method may block forever. Additionally, this method is only guaranteed to block until data that has been synchronously appended to a Source prior to invocation (i.e. getOffset must immediately reflect the addition).
stop() - Stops the execution of this query if it is running. This method blocks until the threads performing execution have stopped.
So what's the better way to restart the query without stopping my Spark streaming application?
This has worked for me.
Below is the scenario I followed in Spark 2.4.5 for left outer join and left join. The process below pushes Spark to read the latest dimension data changes.
The process is for a stream join with a batch dimension (always updated).
Step 1: Before starting the Spark streaming job, make sure the dimension batch data folder has only one file, and the file should have at least one record (for some reason placing an empty file does not work).
Step 2: Start your streaming job and add a stream record to the Kafka stream.
Step 3: Overwrite the dimension data with new values (the file should keep the same name, and the dimension folder should still have only one file). Note: don't use Spark to write to this folder; use Java or Scala filesystem I/O to overwrite the file, or use bash to delete the file and replace it with a new data file of the same name.
Step 4: In the next batch, Spark is able to read the updated dimension data while joining with the Kafka stream.
Sample Code:-
package com.databroccoli.streaming.streamjoinupdate

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastStreamJoin3 {

  def main(args: Array[String]): Unit = {

    @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
    Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
    Logger.getLogger("io.netty").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()

    val schemaUntyped1 = StructType(
      Array(
        StructField("id", StringType),
        StructField("customrid", StringType),
        StructField("customername", StringType),
        StructField("countrycode", StringType),
        StructField("timestamp_column_fin_1", TimestampType)
      ))

    val schemaUntyped2 = StructType(
      Array(
        StructField("id", StringType),
        StructField("countrycode", StringType),
        StructField("countryname", StringType),
        StructField("timestamp_column_fin_2", TimestampType)
      ))

    val factDf1 = spark.readStream
      .schema(schemaUntyped1)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/fact")

    val dimDf3 = spark.read
      .schema(schemaUntyped2)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/dimension")
      .withColumnRenamed("id", "id_2")
      .withColumnRenamed("countrycode", "countrycode_2")

    import spark.implicits._

    factDf1
      .join(
        dimDf3,
        $"countrycode_2" <=> $"countrycode",
        "inner"
      )
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination
  }
}
Your question is a little unclear (the second piece of code doesn't use the df you want to persist, so I am not sure how you intend to integrate them... I assume a join?).
We had a similar issue (using Spark 2.1), and solved it by creating a custom implementation of Sink (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala) where the data is loaded in addBatch. Since your settings indicate that you are only processing one file at a time and there is no watermarking, you can probably cram your logic into the addBatch method... though this is kind of hacky (our use case was slightly different, I believe).
If Spark 2.2 is an option, then you are in luck. Spark 2.2 adds the "run once" trigger that allows you to use the Spark Streaming API for batch jobs (which is essentially what you are trying to do). If you modify your write stream to use this new trigger, then the infinite loop might work (though I have never tried it). You might be better off using an external scheduler to run the streaming job in batch mode. You can read more about the Run Once trigger here: https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
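A minimal PySpark sketch of that trigger (paths and sink format are placeholders; in Scala the equivalent is Trigger.Once()):

query = (df.writeStream
    .format("parquet")                                          # placeholder sink
    .option("path", "/tmp/output/run-once")                     # placeholder output path
    .option("checkpointLocation", "/tmp/checkpoints/run-once")  # placeholder checkpoint path
    .trigger(once=True)  # process everything currently available, then stop
    .start())
query.awaitTermination()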
If you are using EMR, then Spark 2.2 isn't available yet... but I have heard it will be out in the next couple of weeks (fingers crossed).
You can find some complete Sink implementation examples here: https://github.com/holdenk/spark-structured-streaming-ml/blob/master/src/main/scala/com/high-performance-spark-examples/structuredstreaming/CustomSink.scala