Databricks: Queries with streaming sources must be executed with writeStream.start() - apache-spark

I know there are several other questions about this error message, but none seems to relate to the problem I am currently facing. I am streaming from a JSON file (this part works):
gamingEventDF = (spark
    .readStream
    .schema(eventSchema)
    .option('streamName','mobilestreaming_demo')
    .option("maxFilesPerTrigger", 1)
    .json(inputPath)
)
Next I want to use writeStream to append it to a table:
def writeToBronze(sourceDataframe, bronzePath, streamName):
    (sourceDataframe.rdd
        .spark
        .writeStream.format("delta")
        .option("checkpointLocation", bronzePath + "/_checkpoint")
        .queryName(streamName)
        .outputMode("append")
        .start(bronzePath)
    )
When I now run:
writeToBronze(gamingEventDF, outputPathBronze, "bronze_stream")
I am getting the error: AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Btw: when I delete the .rdd, I am getting another error ('DataFrame' object has no attribute 'spark')
Any idea what I got wrong?
Thanks a lot

The writeStream method is available on the DataFrame class, not on SparkSession, so call it directly on your streaming DataFrame (no .rdd, no .spark).
The code below should work for you:
def writeToBronze(sourceDataframe, bronzePath, streamName):
    (sourceDataframe
        .writeStream.format("delta")
        .option("checkpointLocation", bronzePath + "/_checkpoint")
        .queryName(streamName)
        .outputMode("append")
        .start(bronzePath)
        .awaitTermination())
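As a side note, awaitTermination() blocks the calling cell until the stream stops. If you would rather keep the notebook responsive and monitor the stream, a minimal variant (sketch only; write_to_bronze_nonblocking is a hypothetical name) could return the query handle instead:
def write_to_bronze_nonblocking(source_df, bronze_path, stream_name):
    # Same write as above, but return the StreamingQuery instead of blocking on it
    return (source_df
            .writeStream.format("delta")
            .option("checkpointLocation", bronze_path + "/_checkpoint")
            .queryName(stream_name)
            .outputMode("append")
            .start(bronze_path))

query = write_to_bronze_nonblocking(gamingEventDF, outputPathBronze, "bronze_stream")
query.status        # current status of the running stream
query.lastProgress  # metrics for the most recent micro-batch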

Related

Sink from Delta Live Table to Kafka, initial sink works, but any subsequent updates fail

I have a DLT pipeline that ingests a topic from my Kafka stream and transforms it into a DLT table, and I then want to write that table back into Kafka under a new topic.
So far I have this working, but only on the first load of the table; any subsequent update crashes my read and write streams.
My DLT table updates correctly, so I see updates flowing from my pipeline into the gold table:
CREATE OR REFRESH LIVE TABLE deal_gold1
TBLPROPERTIES ("quality" = "gold")
COMMENT "Gold Deals"
AS SELECT
documentId,
eventTimestamp,
substring(fullDocument.owner_id, 11, 24) as owner_id,
fullDocument.owner_type as owner_type,
substring(fullDocument.account_id, 11, 24) as account_id,
substring(fullDocument.manager_account_id, 11, 24) as manager_account_id,
fullDocument.hubspot_deal_id as hubspot_deal_id,
fullDocument.stage as stage,
fullDocument.status as status,
fullDocument.title as title
FROM LIVE.deal_bronze_cleansed
but then when I try to read from it via a separate notebook, these updates cause it to crash
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
# this one is the problem not the write stream
df = spark.readStream.format("delta").table("deal_stream_test.deal_gold1")
display(df)
writeStream= (
df
.selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.outputMode("append")
.option("ignoreChanges", "true")
.option("checkpointLocation", "/tmp/benperram21/checkpoint")
.option("kafka.bootstrap.servers", confluentBootstrapServers)
.option("ignoreChanges", "true")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
.option("kafka.ssl.endpoint.identification.algorithm", "https")
.option("kafka.sasl.mechanism", "PLAIN")
.option("topic", confluentTopicName)
.start()
)
I was looking around and can see this might be a result of the table not being read as "Append". Any thoughts on this? Everything works up until the updates.
Right now DLT doesn't support output to arbitrary sinks. Also, all Spark operations should be done inside the nodes of the execution graph (functions labeled with dlt.table or dlt.view).
Right now the workaround would be to run that notebook outside of the DLT pipeline, as a separate task in a multi-task job (workflow).
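For illustration, a minimal sketch of what that separate task could look like, assuming the same gold table and Confluent variables as in the question. Note that ignoreChanges is a Delta read option, so for a table that gets updated in place it belongs on the readStream rather than on the writeStream:
# Standalone notebook/task outside the DLT pipeline (sketch, reusing the question's variables)
df = (spark.readStream.format("delta")
      .option("ignoreChanges", "true")   # tolerate update/delete commits in the source Delta table
      .table("deal_stream_test.deal_gold1"))

(df.selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
   .writeStream
   .format("kafka")
   .outputMode("append")
   .option("checkpointLocation", "/tmp/benperram21/checkpoint")
   .option("kafka.bootstrap.servers", confluentBootstrapServers)
   .option("kafka.security.protocol", "SASL_SSL")
   .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
   .option("kafka.ssl.endpoint.identification.algorithm", "https")
   .option("kafka.sasl.mechanism", "PLAIN")
   .option("topic", confluentTopicName)
   .start())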

Continuous data generator from Azure Databricks to Azure Event Hubs using Spark with the Kafka API, but no data is streamed

I'm trying to implement a continuous data generator from Databricks to an Event Hub.
My idea was to generate some data in a .csv file and then create a data frame with that data. In a loop, I call a function that executes a query to stream that data to the Event Hub. I'm not sure whether the idea is sound, whether Spark can handle writing from the same data frame repeatedly, or whether I have understood correctly how queries work.
The code looks like this:
def write_to_event_hub(
    df: DataFrame,
    topic: str,
    bootstrap_servers: str,
    config: str,
    checkpoint_path: str,
):
    return (
        df.writeStream.format("kafka")
        .option("topic", topic)
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", config)
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start()
    )
while True:
    query = write_to_event_hub(
        streaming_df,
        topic,
        bootstrap_servers,
        sasl_jaas_config,
        "/checkpoint",
    )
    query.awaitTermination()
    print("Wrote once")
    time.sleep(5)
I want to mention that this is how I read data from the CSV file (I have it in DBFS) and I also have the schema for it:
streaming_df = (
spark.readStream.format("csv")
.option("header", "true")
.schema(location_schema)
.load(f"{path}")
)
It looks like no data is written even though the message "Wrote once" is printed. Any ideas how to handle this? Thank you!
The problem is that you're using readStream to get the CSV data, so it will wait until new data is pushed to the directory with the CSV files (the checkpoint tracks which files were already processed, so re-running with the same checkpoint finds nothing new). But really, you don't need readStream/writeStream here at all - the Kafka connector works just fine in batch mode, so your code should be:
df = read_csv_file()
while True:
write_to_kafka(df)
sleep(5)
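For completeness, a minimal sketch of that batch approach with the Kafka sink, reusing the variables from the question (topic, bootstrap_servers, sasl_jaas_config, location_schema, path):
import time

# Batch read of the CSV files, done once outside the loop
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(location_schema)
      .load(path))

while True:
    # Batch write to Event Hubs via the Kafka API: .write instead of .writeStream, no checkpoint needed
    (df.selectExpr("to_json(struct(*)) AS value")
       .write.format("kafka")
       .option("kafka.bootstrap.servers", bootstrap_servers)
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.jaas.config", sasl_jaas_config)
       .option("topic", topic)
       .save())
    print("Wrote once")
    time.sleep(5)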

Running a Spark Streaming job in Zeppelin throws connection refused 8998 error

I'm working in a virtual machine. I run a Spark Streaming job which I basically copied from a Databricks tutorial.
%pyspark
query = (
streamingCountsDF
.writeStream
.format("memory") # memory = store in-memory table
.queryName("counts") # counts = name of the in-memory table
.outputMode("complete") # complete = all the counts should be in the table
.start()
)
Py4JJavaError: An error occurred while calling o101.start.
: java.net.ConnectException: Call From VirtualBox/127.0.1.1 to localhost:8998 failed on connection exception: java.net.ConnectException:
I checked and there is no service listening on port 8998. I learned that this port is associated with the Apache Livy server, which I am not using. Can someone point me in the right direction?
Ok, so I fixed this issue. First, I added 'file://' when specifying the input folder. Second, I added a checkpoint location. See code below:
inputFolder = 'file:///home/sallos/tmp/'
streamingInputDF = (
spark
.readStream
.schema(schema)
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.csv(inputFolder)
)
streamingCountsDF = (
streamingInputDF
.groupBy(
streamingInputDF.SrcIPAddr,
window(streamingInputDF.Datefirstseen, "30 seconds"))
.sum('Bytes').withColumnRenamed("sum(Bytes)", "sum_bytes")
)
query = (
streamingCountsDF
.writeStream.format("memory")
.queryName("sumbytes")
.outputMode("complete")
.option("checkpointLocation","file:///home/sallos/tmp_checkpoint/")
.start()
)
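Since the memory sink registers an in-memory table under the queryName, the running aggregation can then be inspected from another paragraph, for example:
spark.sql("SELECT * FROM sumbytes ORDER BY sum_bytes DESC").show()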

writing corrupt data from kafka / json datasource in spark structured streaming

In Spark batch jobs I usually have a JSON data source written to a file and can use the corrupt-column feature of the DataFrame reader to write the corrupt data out to a separate location, and another reader to write the valid data, both from the same job. (The data is written as Parquet.)
But in Spark Structured Streaming I'm first reading the stream in via Kafka as a string and then using from_json to get my DataFrame. from_json uses JsonToStructs, which uses FailFast mode in the parser and does not return the unparsed string as a column in the DataFrame (see the note in the Ref below). How, then, can I write corrupt data that doesn't match my schema, and possibly invalid JSON, to another location using Structured Streaming?
Finally, in the batch case the same job can write both DataFrames, but Spark Structured Streaming requires special handling for multiple sinks. So, in Spark 2.3.1 (my current version), how should both the corrupt and the valid streams be written properly?
Ref: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Expression-JsonToStructs.html
val rawKafkaDataFrame=spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", config.broker)
.option("kafka.ssl.truststore.location", path.toString)
.option("kafka.ssl.truststore.password", config.pass)
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.security.protocol", "SSL")
.option("subscribe", config.topic)
.option("startingOffsets", "earliest")
.load()
val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
// does not provide a corrupt column or way to work with corrupt
jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
When you convert from string to JSON, if the value cannot be parsed with the schema provided, from_json returns null. You can filter on the null values and select the original string, something like this:
val jsonDF = jsonDataFrame.withColumn("json", from_json(col("value"), schema))
val invalidJsonDF = jsonDF.filter(col("json").isNull).select("value")
I was just trying to figure out the _corrupt_record equivalent for structured streaming as well. Here's what I came up with; hopefully it gets you closer to what you're looking for:
// add a status column to partition our output by
// optional: only keep the unparsed json if it was corrupt
// writes up to 2 subdirs: 'out.par/status=OK' and 'out.par/status=CORRUPT'
// additional status codes for validation of nested fields could be added in similar fashion
df.withColumn("struct", from_json($"value", schema))
.withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
.withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
.write
.partitionBy("status")
.parquet("out.par")
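The snippet above ends with a batch write; for a streaming query the same partition-by-status idea should carry over to writeStream. A PySpark sketch, assuming the same schema, a string value column, and illustrative paths:
from pyspark.sql.functions import from_json, when, lit, col

(df.withColumn("struct", from_json(col("value"), schema))
   .withColumn("status", when(col("struct").isNull, lit("CORRUPT")).otherwise(lit("OK")))
   .withColumn("value", when(col("status") == "CORRUPT", col("value")))   # keep raw json only for corrupt rows
   .writeStream
   .format("parquet")
   .option("checkpointLocation", "out.par/_checkpoint")
   .partitionBy("status")
   .outputMode("append")
   .start("out.par"))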

multiple writeStream with spark streaming

I am working with Spark Streaming and I am facing some issues trying to implement multiple write streams.
Below is my code:
DataWriter.writeStreamer(firstTableData,"parquet",CheckPointConf.firstCheckPoint,OutputConf.firstDataOutput)
DataWriter.writeStreamer(secondTableData,"parquet",CheckPointConf.secondCheckPoint,OutputConf.secondDataOutput)
DataWriter.writeStreamer(thirdTableData,"parquet", CheckPointConf.thirdCheckPoint,OutputConf.thirdDataOutput)
where writeStreamer is defined as follows:
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String) = {
  val query = input
    .writeStream
    .format("orc")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
  query.awaitTermination()
}
The problem I am facing is that only the first table is written by writeStream; nothing happens for the other tables.
Do you have any idea about this, please?
query.awaitTermination() should be done after the last stream is created.
The writeStreamer function can be modified to return a StreamingQuery and not call awaitTermination at that point (as it is blocking):
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String): StreamingQuery = {
  input
    .writeStream
    .format("orc")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
}
then you will have:
val query1 = DataWriter.writeStreamer(...)
val query2 = DataWriter.writeStreamer(...)
val query3 = DataWriter.writeStreamer(...)
query3.awaitTermination()
If you want the writers to run in parallel you can use
sparkSession.streams.awaitAnyTermination()
and remove query.awaitTermination() from the writeStreamer method.
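For illustration, the pattern looks like this (a PySpark sketch with hypothetical DataFrames df1/df2/df3; spark.streams is the session's StreamingQueryManager):
# Start all writers first; each .start() returns a StreamingQuery without blocking.
q1 = df1.writeStream.format("orc").option("checkpointLocation", "/chk/1").option("path", "/out/1").outputMode("append").start()
q2 = df2.writeStream.format("orc").option("checkpointLocation", "/chk/2").option("path", "/out/2").outputMode("append").start()
q3 = df3.writeStream.format("orc").option("checkpointLocation", "/chk/3").option("path", "/out/3").outputMode("append").start()

# Block until any of the active queries terminates or fails.
spark.streams.awaitAnyTermination()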
By default the number of concurrent jobs is 1, which means only 1 job will be active at a time.
Did you try increasing the number of possible concurrent jobs in the Spark conf?
sparkConf.set("spark.streaming.concurrentJobs","3")
Not an official source: http://why-not-learn-something.blogspot.com/2016/06/spark-streaming-performance-tuning-on.html
