Databricks: Structured Streaming data assignment and display

I have the following streaming code in a Databricks notebook (Python).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
    .builder \
    .appName("MyTest") \
    .getOrCreate()
# Create a streaming DataFrame
lines = spark.readStream \
    .format("delta") \
    .table("myschema.streamTest")
In notebook 2, I have
def foreach_batch_function(df, epoch_id):
    test = df
    print(test['simplecolumn'])
    display(test['simplecolumn'])
    test['simplecolumn'].display
lines.writeStream.outputMode("append").foreachBatch(foreach_batch_function).format('console').start()
When I execute the above where can I see the output from the .display function? I looked inside the cluster driver logs and I don't see anything. I also don't see anything in the notebook itself when executed except a successfully initialized and executing stream. I do see that the dataframe parameter data is displayed in console but I am trying to see that assigning test was successful.
I am trying to carry out this manipulation as a precursor to time series operations over mini batches for real-time model scoring and in python - but I am struggling to get the basics right in the structured streaming world. A working model functions but executes every 10-15 minutes. I would like to make it realtime via streams and hence this question.

You're mixing different things together. I recommend reading the initial parts of the Structured Streaming documentation or chapter 8 of the Learning Spark, 2nd edition book (freely available from here).
You can use the display function directly on the stream (ideally with the checkpointLocation and, possibly, trigger parameters, as described in the documentation):
display(lines)
Regarding scoring: it's usually done by defining a user-defined function and applying it to the stream through the DataFrame's select or withColumn functions. The easiest way is to register the model in the MLflow registry and then load it with the built-in functions, like:
import mlflow.pyfunc
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
preds = lines.withColumn("predictions", pyfunc_udf(params...))
Look into that notebook for examples.
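On the foreachBatch part of the question: the function receives an ordinary (non-streaming) DataFrame, so the assignment to test always succeeds. What shows no data is the indexing, because df['simplecolumn'] is only a Column expression. A minimal sketch of inspecting one column per micro-batch follows. The checkpoint_path argument and the start_query helper are illustrative names, and the sketch deliberately chains no .format(...) call, on the assumption that a sink set after foreachBatch would replace it. Where the printed rows end up (notebook cell or driver log) depends on the runtime.

```python
def foreach_batch_function(batch_df, epoch_id):
    # batch_df is a regular DataFrame holding only this micro-batch.
    # batch_df["simplecolumn"] would be a Column expression, not data,
    # so select the column and call show() to print actual rows.
    print(f"batch {epoch_id}")
    batch_df.select("simplecolumn").show(truncate=False)


def start_query(lines, checkpoint_path):
    # `lines` is the streaming DataFrame from the first notebook;
    # `checkpoint_path` is a placeholder for a writable location.
    return (
        lines.writeStream
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .foreachBatch(foreach_batch_function)
        .start()
    )
```

Calling start_query(lines, "/some/checkpoint/dir") returns the StreamingQuery handle, which can be stopped with .stop() once you have seen enough batches.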

Related

Can we use row_number() in PySpark Structured Streaming?

The PySpark SQL functions reference on the row_number() function says
returns a sequential number starting at 1 within a window partition
implying that the function works only on windows. Trying
df.select('*', row_number())
predictably gives a
Window function row_number() requires an OVER clause
exception.
Now, .over() seems to work only with WindowSpec because
from pyspark.sql.functions import window, row_number
...
df.select('*', row_number().over(window('time', '5 minutes')))
gives a
TypeError: window should be WindowSpec
exception.
However, according to this comment on the ASF Jira:
By time-window we described what time windows are supported in SS natively.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#types-of-time-windows
Window spec is not supported. This defines the boundary of window as non-timed manner, the offset(s) of the row, which is hard to track in streaming context.
So WindowSpec is generally not supported in Structured Streaming, which leads to the conclusion that the row_number() function is not supported in Structured Streaming. Is that correct? Just want to make sure I'm not missing anything here.
First, your imports are wrong:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
Second, try it like this:
partition_columns = Window.partitionBy(
    df.column1,
    df.column2,
    ...
).orderBy(df.col...)
df = df.withColumn('your_new_column_rank', row_number().over(partition_columns))
We usually use window functions to deduplicate records in Structured Streaming. The documentation says they cannot be used on a streaming DataFrame, because the function cannot access already-saved data the way it can in batch, but you can try setting a watermark, like this:
df = df.withWatermark("timestamp", "10 minutes").withColumn('your_new_column_rank', row_number().over(partition_columns))
or you can try using a watermark together with the dropDuplicates function.
Another way to do it is through foreachBatch:
def func(batch_df, batch_id):
    partition_columns = Window.partitionBy(
        batch_df.column1,
        batch_df.column2,
        ...
    ).orderBy(batch_df.col...)
    batch_df = batch_df.withColumn('your_new_column_rank',
        row_number().over(partition_columns))
    ...
writer = sdf.writeStream.foreachBatch(func)
As above, inside func you have a micro-batch DataFrame that is not a streaming DataFrame, so you can use functions that are unavailable on a streaming one.
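One consequence of the foreachBatch route is worth spelling out: the numbering is computed per micro-batch, so it restarts in every batch and is not a global row number over the whole stream. The following plain-Python model (no Spark involved, with made-up keys id and ts) only illustrates that semantics; it is not how Spark implements the function.

```python
from itertools import groupby
from operator import itemgetter


def row_number_per_batch(rows, partition_key, order_key):
    """Model row_number() over partitionBy(partition_key).orderBy(order_key)."""
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        # Numbering starts again at 1 for every partition in this batch.
        for n, row in enumerate(group, start=1):
            numbered.append({**row, "rank": n})
    return numbered


# Two consecutive micro-batches, each ranked on its own.
batch_1 = [{"id": "a", "ts": 2}, {"id": "a", "ts": 1}, {"id": "b", "ts": 5}]
batch_2 = [{"id": "a", "ts": 3}]

ranked_1 = row_number_per_batch(batch_1, "id", "ts")
ranked_2 = row_number_per_batch(batch_2, "id", "ts")
```

The row with id "a" in batch_2 gets rank 1 again, even though two earlier rows for the same id were seen in batch_1. If you need deduplication across batches, the function has to compare against the target table as well.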

Adding / removing columns from schema with Kafka source without restarting the session using PySpark Structured Streaming

I'm using PySpark 3.2.0, I'm pretty new to Structured Streaming, and I couldn't find an answer to this question.
I want to read JSON data from a Kafka topic using a predefined schema like the following (code related to initialization / connections is omitted):
# The skeleton schema is defined in 'schema.py'
skeleton_schema = get_skeleton_schema()
df = df.selectExpr("CAST(value AS STRING)") \
    .select(from_json("value", skeleton_schema).alias("data")) \
    .select(col("data.*"))
...
# start() returns the StreamingQuery handle; awaitTermination() lives on it, not on the DataFrame
query = df.writeStream \
    .format("console") \
    .outputMode("append") \
    .trigger(processingTime='5 minutes') \
    .start()
query.awaitTermination()
I want to be able to modify the skeleton_schema (e.g add/remove columns) in the 'schema.py' file and to have those changes reflected to future triggers. Is there a way to achieve this? If not, is there a different mechanism to update the schema without restarting the session?
Unless the get_skeleton_schema() function itself is run per batch and not cached (for example, it calls an external REST API, queries a database, or parses some file), which is not the case in the code shown, then no, it's not possible to change the schema at runtime.
Keep in mind that there's no guarantee that all records in the same batch will have the same schema.
You'd need to consume the columns as bytes and then use a ForeachWriter implementation to handle this, but I'm not familiar enough with PySpark to give an example.
Depending on where you're actually going to write the data (not the console; for example, Mongo or Snowflake), you could look at using Kafka Connect with Avro or Protobuf serialization rather than JSON. Your producers would then decide when to introduce or remove columns in a backwards-compatible manner, enforced by a Schema Registry, and your consumers wouldn't have to change or define any schema themselves.
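To make the "run per batch" idea concrete, here is a rough sketch that uses foreachBatch rather than a ForeachWriter (a swapped-in technique, not what the answer above describes). It assumes the schema is kept in a JSON file in the format produced by StructType.jsonValue(), instead of schema.py, and that the stream still carries the raw Kafka value column. The names load_schema_dict, make_batch_parser, and sink are made up for illustration.

```python
import json


def load_schema_dict(path):
    # Re-read the schema definition from disk; meant to be called once per batch.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def make_batch_parser(schema_path, sink):
    """Build a foreachBatch callback that parses each batch with the schema on disk."""

    def parse_batch(batch_df, batch_id):
        # Imported lazily so load_schema_dict stays usable without Spark installed.
        from pyspark.sql.functions import col, from_json
        from pyspark.sql.types import StructType

        schema = StructType.fromJson(load_schema_dict(schema_path))
        parsed = (
            batch_df
            .select(from_json(col("value").cast("string"), schema).alias("data"))
            .select("data.*")
        )
        # `sink` is a hypothetical callable that writes or shows the parsed batch.
        sink(parsed, batch_id)

    return parse_batch
```

It would be wired up along the lines of raw_df.writeStream.foreachBatch(make_batch_parser("schema.json", lambda d, i: d.show())).start(), where raw_df is the unparsed Kafka stream. Edits to the file take effect on the next trigger, but the caveat above still applies: records within one batch are all parsed with the same schema, whatever they actually contain.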

PySpark structured streaming apply udf to window

I am trying to apply a pandas udf to a window of a pyspark structured stream. The problem is that as soon as the stream has caught up with the current state all new windows only contain a single value somehow.
As you can see in the screenshot all windows after 2019-10-22T15:34:08.730+0000 only contain a single value. The code used to generate this is this:
@pandas_udf("Count long, Resampled long, Start timestamp, End timestamp", PandasUDFType.GROUPED_MAP)
def myudf(df):
    df = df.dropna()
    df = df.set_index("Timestamp")
    df.sort_index(inplace=True)
    # resample the dataframe
    resampled = pd.DataFrame()
    oidx = df.index
    nidx = pd.date_range(oidx.min(), oidx.max(), freq="30S")
    resampled["Value"] = df.Value.reindex(oidx.union(nidx)).interpolate('index').reindex(nidx)
    return pd.DataFrame([[len(df.index), len(resampled.index), df.index.min(), df.index.max()]], columns=["Count", "Resampled", "Start", "End"])
predictionStream = sensorStream.withWatermark("Timestamp", "90 minutes").groupBy(col("Name"), window(col("Timestamp"), "70 minutes", "5 minutes"))
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .start()
The stream does get new values every 5 minutes. It's just that the window somehow only takes values from the last batch even though the watermark should not have expired.
Is there anything I am doing wrong? I already tried playing with the watermark; that had no effect on the result. I need all values of the window inside the udf.
I am running this in databricks on a cluster set to 5.5 LTS ML (includes Apache Spark 2.4.3, Scala 2.11)
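For reference when reading the window column in the output: with a 70-minute window sliding every 5 minutes, every event belongs to 14 overlapping windows. The sketch below reproduces that arithmetic in plain Python, assuming window starts are aligned to multiples of the slide interval counted from the Unix epoch (the default when no startTime offset is given). The function name sliding_windows is made up, and the timestamp is the one mentioned above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(ts, duration, slide):
    """Return every [start, end) window of length `duration` that contains ts."""
    # Latest window start at or before ts, aligned to a multiple of `slide`.
    start = EPOCH + ((ts - EPOCH) // slide) * slide
    windows = []
    # Walk backwards one slide at a time while the window still covers ts.
    while start + duration > ts:
        windows.append((start, start + duration))
        start -= slide
    return sorted(windows)


ts = datetime(2019, 10, 22, 15, 34, 8, tzinfo=timezone.utc)
windows = sliding_windows(ts, timedelta(minutes=70), timedelta(minutes=5))
```

Here windows holds 14 entries, the earliest starting at 14:25 and the latest at 15:30. A value arriving in a new batch is therefore expected to show up in 14 different groups, each of which should also contain the older values for its window.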
It looks like you could specify the output mode you want for your writeStream.
See the documentation on Output Modes.
By default it's using Append Mode:
This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink.
Try using:
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .outputMode("complete") \
    .start()
I found a Spark JIRA issue concerning this problem, but it was closed without resolution. The bug appears to be (and I confirmed this independently on Spark 3.1.1) that the Pandas UDF is executed on every trigger with only the data received since the last trigger. So you are likely only processing a subset of the data you want to take into account on each trigger. Grouped Map Pandas UDFs do not appear to be functional for Structured Streaming with a Delta table source. Please do follow up if you previously found a solution; otherwise I'll just leave this here for folks who also find this thread.
Edit: There's some discussion in the Databricks forums about first doing a streaming aggregation and following that up with a Pandas UDF (that will likely expect a single record with columns containing arrays) as shown below. I tried it. It works. However, my batch duration is high and I'm uncertain how much this additional work is contributing to it.
agg_exprs = [f.collect_list('col_of_interest_1'),
             f.collect_list('col_of_interest_2'),
             f.collect_list('col_of_interest_3')]
# agg() takes the expressions as separate arguments, so unpack the list
intermediate_sdf = source_sdf.groupBy('time_window', ...).agg(*agg_exprs)
final_sdf = intermediate_sdf.groupBy('time_window', ...).applyInPandas(func, schema)

How to get the output from console streaming sink in Zeppelin?

I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found.
My question: Does anyone have a working example of using PySpark Structured Streaming with a sink that produces output visible in Apache Zeppelin? Ideally it would also use the socket source, as that's easy to test with.
I'm using:
Ubuntu 16.04
spark-2.2.0-bin-hadoop2.7
zeppelin-0.7.3-bin-all
Python3
I've based my code on the structured_network_wordcount.py example. It works when run from the PySpark shell (./bin/pyspark --master local[2]); I see tables per batch.
%pyspark
# structured streaming
from pyspark.sql.functions import *
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', 'localhost')\
    .option('port', 9999)\
    .option('includeTimestamp', 'true')\
    .load()
# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array into multiple rows
words = lines.select(
    explode(split(lines.value, ' ')).alias('word'),
    lines.timestamp
)
# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, '10 seconds', '1 seconds'),
    words.word
).count().orderBy('window')
# Start running the query that prints the windowed word counts to the console
query = windowedCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .option('truncate', 'false')\
    .start()
print("Starting...")
query.awaitTermination(20)
I'd expect to see printouts of results for each batch, but instead I just see Starting..., and then False, the return value of query.awaitTermination(20).
In a separate terminal I'm entering some data into a nc -lk 9999 netcat session while the above is running.
The console sink is not a good choice for an interactive notebook-based workflow. Even in Scala, where the output can be captured, it requires an awaitTermination call (or equivalent) in the same paragraph, effectively blocking the note.
%spark
spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .option("includeTimestamp", "true")
  .load()
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination() // Block execution, to force Zeppelin to capture the output
The chained awaitTermination could be replaced with a standalone call in the same paragraph, which would work as well:
%spark
val query = df
  .writeStream
  ...
  .start()
query.awaitTermination()
Without it, Zeppelin has no reason to wait for any output. PySpark adds another problem on top of that: indirect execution. Because of that, even blocking the query won't help you here.
Moreover, continuous output from the stream can cause rendering issues and memory problems when browsing the note (it might be possible to use the Zeppelin display system, via InterpreterContext or the REST API, to achieve more sensible behavior, where the output is overwritten or periodically cleared).
A much better choice for testing with Zeppelin is the memory sink. This way you can start a query without blocking:
%pyspark
query = (windowedCounts
    .writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("some_name")
    .start())
and query the result on demand in another paragraph:
%pyspark
spark.table("some_name").show()
It can be coupled with reactive streams or a similar solution to provide interval-based updates.
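The simplest form of interval-based updates needs no extra machinery at all: a bounded polling loop in its own paragraph. This is only a sketch; poll_table and its parameters are made-up names, and it assumes the memory-sink query registered under some_name is already running.

```python
import time


def poll_table(spark, table_name, interval_seconds=5.0, iterations=3, sleep=time.sleep):
    """Show the in-memory result table a fixed number of times, pausing in between."""
    for i in range(iterations):
        spark.table(table_name).show(truncate=False)
        # No need to wait after the final read.
        if i < iterations - 1:
            sleep(interval_seconds)
```

Running poll_table(spark, "some_name") in a %pyspark paragraph blocks that paragraph for roughly ten seconds and prints three snapshots. Keeping the loop bounded avoids the unbounded-output problem mentioned earlier.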
It is also possible to use StreamingQueryListener with Py4j callbacks to couple rx with onQueryProgress events, although query listeners are not supported in PySpark and require a bit of code to glue things together. Scala interface:
package com.example.spark.observer
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
trait PythonObserver {
  def on_next(o: Object): Unit
}
class PythonStreamingQueryListener(observer: PythonObserver)
    extends StreamingQueryListener {
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    observer.on_next(event)
  }
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
}
Build a jar, adjusting the build definition to reflect the desired Scala and Spark versions:
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
Put it on the Spark classpath and patch StreamingQueryManager:
%pyspark
from pyspark.sql.streaming import StreamingQueryManager
from pyspark import SparkContext
def addListener(self, listener):
    jvm = SparkContext._active_spark_context._jvm
    jlistener = jvm.com.example.spark.observer.PythonStreamingQueryListener(
        listener
    )
    self._jsqm.addListener(jlistener)
    return jlistener
StreamingQueryManager.addListener = addListener
Start the callback server:
%pyspark
sc._gateway.start_callback_server()
and add the listener:
%pyspark
from rx.subjects import Subject
class StreamingObserver(Subject):
    class Java:
        implements = ["com.example.spark.observer.PythonObserver"]
observer = StreamingObserver()
spark.streams.addListener(observer)
Finally, you can use subscribe and block execution:
%pyspark
(observer
    .map(lambda p: p.progress().name())
    # .filter() can be used to print only for a specific query
    .subscribe(lambda n: spark.table(n).show() if n else None))
input()  # Block execution to capture the output
The last step should be executed after you have started the streaming query.
It is also possible to skip rx and use a minimal observer like this:
class StreamingObserver(object):
    class Java:
        implements = ["com.example.spark.observer.PythonObserver"]
    def on_next(self, value):
        try:
            name = value.progress().name()
            if name:
                spark.table(name).show()
        except: pass
It gives a bit less control than the Subject (one caveat is that this can interfere with other code printing to stdout, and it can be stopped only by removing the listener; with Subject you can easily dispose of the subscribed observer once you're done), but otherwise it should work more or less the same.
Note that any blocking action will be sufficient to capture the output from the listener and it doesn't have to be executed in the same cell. For example
%pyspark
observer = StreamingObserver()
spark.streams.addListener(observer)
and
%pyspark
import time
time.sleep(42)
would work in a similar way, printing the table for a defined time interval.
For completeness you can implement StreamingQueryManager.removeListener.
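A possible shape for that counterpart, mirroring the addListener patch shown earlier in this answer and targeting the same Spark 2.x setup. It relies on that patch having returned the JVM-side listener object, and on the JVM StreamingQueryManager exposing removeListener. The install helper is a made-up convenience so the patch can be applied in one call.

```python
def removeListener(self, jlistener):
    """Detach a JVM listener previously returned by the patched addListener."""
    self._jsqm.removeListener(jlistener)


def install(manager_cls):
    # manager_cls is expected to be pyspark.sql.streaming.StreamingQueryManager.
    manager_cls.removeListener = removeListener
```

After install(StreamingQueryManager), keep the return value of the add call, as in jlistener = spark.streams.addListener(observer), and later pass it to spark.streams.removeListener(jlistener) to stop the printing.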
zeppelin-0.7.3-bin-all uses Spark 2.1.0 (so no rate format to test Structured Streaming with unfortunately).
Make sure that when you start a streaming query with socket source nc -lk 9999 has already been started (as the query simply stops otherwise).
Also make sure that the query is indeed up and running.
val lines = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load
val q = lines.writeStream.format("console").start
It's indeed true that you won't be able to see the output in a Zeppelin notebook possibly because:
Streaming queries start on their own threads (that seems to be outside Zeppelin's reach)
console sink writes to standard output (uses Dataset.show operator on that separate thread).
All this makes "intercepting" the output not available in Zeppelin.
So we come to answer the real question:
Where is the standard output written to in Zeppelin?
Well, with a very limited understanding of the internals of Zeppelin, I thought it could be logs/zeppelin-interpreter-spark-[hostname].log, but unfortunately I could not find the output from the console sink there. That's where you can find the logs from Spark (and Structured Streaming in particular) that go through log4j, which the console sink does not use.
It looks as if your only long-term solution is to write your own console-like custom sink and use a log4j logger. Honestly, that is not as hard as it sounds. Follow the sources of the console sink.

how to connect spark streaming with cassandra?

I'm using
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
and cassandra is listening on
rpc_address:127.0.1.1
rpc_port:9160
For example, to connect kafka and spark-streaming, while listening to kafka every 4 seconds, I have the following spark job
sc = SparkContext(conf=conf)
stream=StreamingContext(sc,4)
map1={'topic_name':1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
And spark-streaming keeps listening to kafka broker every 4 seconds and outputs the contents.
Same way, I want spark streaming to listen to cassandra and output the contents of the specified table every say 4 seconds.
How to convert the above streaming code to make it work with cassandra instead of kafka?
The non-streaming solution
I can obviously keep running the query in an infinite loop but that's not true streaming right?
spark job:
from __future__ import print_function
import time
import sys
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.sql import SQLContext
from pyspark.streaming import *
sc = SparkContext(appName="sparkcassandra")
while(True):
    time.sleep(5)
    sqlContext = SQLContext(sc)
    stream=StreamingContext(sc,4)
    lines = stream.socketTextStream("127.0.1.1", 9160)
    sqlContext.read.format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="keyspace2")\
        .load()\
        .show()
run like this
sudo ./bin/spark-submit --packages \
datastax:spark-cassandra-connector:1.4.1-s_2.10 \
examples/src/main/python/sparkstreaming-cassandra2.py
and I get the table values, which roughly look like
lastname|age|city|email|firstname
So what's the correct way of "streaming" the data from cassandra?
Currently the "Right Way" to stream data from C* is not to Stream Data from C* :) Instead it usually makes much more sense to have your message queue (like Kafka) in front of C* and Stream off of that. C* doesn't easily support incremental table reads although this can be done if the clustering key is based on insert time.
If you are interested in using C* as a streaming source be sure to check out and comment on
https://issues.apache.org/jira/browse/CASSANDRA-8844
Change Data Capture
Which is most likely what you are looking for.
If you are actually just trying to read the full table periodically and do something with it, you may be best off with just a cron job launching a batch operation, as you really have no way of recovering state anyway.
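For the incremental-read case mentioned above (clustering key based on insert time), the polling logic amounts to remembering a high-water mark between reads. The sketch below shows only that bookkeeping in plain Python. The inserted_at column, the fetch_since callable, and the in-memory table are stand-ins made up for illustration; a real version would issue a range query against Cassandra instead.

```python
def poll_new_rows(fetch_since, last_seen):
    """Fetch rows inserted after `last_seen`; return them with the new high-water mark."""
    rows = fetch_since(last_seen)
    if rows:
        last_seen = max(row["inserted_at"] for row in rows)
    return rows, last_seen


# In-memory stand-in for a table clustered by insert time.
table = [
    {"inserted_at": 1, "email": "a@example.com"},
    {"inserted_at": 2, "email": "b@example.com"},
]


def fetch_since(mark):
    # Stand-in for a range query on the insert-time clustering key.
    return [row for row in table if row["inserted_at"] > mark]


first, mark = poll_new_rows(fetch_since, 0)
table.append({"inserted_at": 3, "email": "c@example.com"})
second, mark = poll_new_rows(fetch_since, mark)
```

The second poll returns only the newly appended row. Note that mark lives in memory here, which is exactly the state-recovery problem described above: after a restart the job has to either persist the mark somewhere or re-read the table.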
Currently, Cassandra is not natively supported as a streaming source in Spark 1.6; you must implement a custom receiver for your own case (listen to Cassandra and output the contents of the specified table every, say, 4 seconds).
Please refer to the implementation guide:
Spark Streaming Custom Receivers

Resources