How to write a Structured Streaming query into a Hive table directly? - apache-spark

I want to achieve something like this:
df.writeStream
  .saveAsTable("dbname.tablename")
  .format("parquet")
  .option("path", "/user/hive/warehouse/abc/")
  .option("checkpointLocation", "/checkpoint_path")
  .outputMode("append")
  .start()
I am open to suggestions. I know Kafka Connect could be one of the options, but how can I achieve this using Spark? A possible workaround may be what I am looking for.
Thanks in advance!

Spark Structured Streaming does not support writing the result of a streaming query to a Hive table directly; you must write to a path instead.
For Spark 2.4 the documentation suggests foreachBatch, but I have not tried it myself.
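A minimal sketch of that foreachBatch approach, assuming a streaming DataFrame df and a Hive-enabled SparkSession; the table name and checkpoint path are the placeholders from the question:
import org.apache.spark.sql.DataFrame

// Reuse the batch writer inside foreachBatch: the batch API does support
// saveAsTable, so each micro-batch is appended to the Hive table.
val query = df.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/checkpoint_path")
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.write
      .mode("append")
      .format("parquet")
      .saveAsTable("dbname.tablename")
  }
  .start()

query.awaitTermination()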

Related

Streaming from a Delta Live Table in Databricks to a Kafka instance

I have the following live table
and I'm looking to write it into a stream, to be written back to my Kafka source.
I've seen in the Apache Spark docs that I can use writeStream (I've already used readStream to get the data out of my Kafka stream). But how do I transform the table into the form writeStream needs so I can use it?
I'm fairly new to both Kafka and the data world, so any further explanation is welcome here.
writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "updates")
  .start()
Thanks in Advance,
Ben
As of right now, Delta Live Tables can only write data as Delta tables; it's not possible to write in other formats. You can implement a workaround by creating a Databricks workflow that consists of two tasks (with or without a dependency between them, depending on whether the pipeline is triggered or continuous):
A DLT pipeline that does the actual data processing
A task (a notebook is the easiest way) that reads the table generated by DLT as a stream and writes its content into Kafka, with something like this:
df = spark.readStream.format("delta").table("database.table_name")
(df.writeStream.format("kafka")
   .option("kafka....", "")  # kafka.bootstrap.servers, topic, etc.
   .trigger(availableNow=True)  # if it's not continuous
   .start()
)
P.S. If you have a solution architect or customer success engineer attached to your Databricks account, you can communicate this requirement to them for product prioritization.
The transformation is done after the read stream is started:
read_df = spark.readStream.format('kafka') ...        # other options
processed_df = read_df.withColumn('some column', some_calculation)
(processed_df.writeStream.format('parquet') ...       # other options
    .start())
The Spark documentation is helpful and detailed, but some articles are not written for beginners. You can look on YouTube or read introductory articles to help you get started, like this one.

Can't read via Apache Spark Structured Streaming from Hive Table

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
    .table("table_name") \
    .repartition("filename") \
    .writeStream \
    .outputMode("append") \
    .trigger(processingTime="10 minutes") \
    .foreachBatch(perBatch)
I have tried every possible combination, including the simplest queries possible. Reading via the parquet method directly from the specified folder works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following...
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the paragraph above clearly suggests it should be possible.
Any assistance to help get this working would be really great and simplify my project a lot.
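For reference, a minimal sketch (in Scala) of the documented table-to-table pattern from Spark 3.1+; the table names and checkpoint path are hypothetical, and this sketch only illustrates the API rather than resolving the HiveFileFormat error above:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("stream-table-example")
  .enableHiveSupport()
  .getOrCreate()

// DataStreamReader.table(): read a catalog table as a streaming DataFrame.
val streamDf = spark.readStream.table("source_table")

// DataStreamWriter.toTable(): write the streaming DataFrame back to a table.
val query = streamDf.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/stream_table_example")
  .toTable("target_table")

query.awaitTermination()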

Databricks structured streaming with Snowflake as source?

Is it possible to use a Snowflake table as a source for spark structured streaming in Databricks? When I run the following pyspark code:
options = dict(sfUrl=our_snowflake_url,
               sfUser=user,
               sfPassword=password,
               sfDatabase=database,
               sfSchema=schema,
               sfWarehouse=warehouse)

df = spark.readStream.format("snowflake") \
    .schema(final_struct) \
    .options(**options) \
    .option("dbtable", "BASIC_DATA_TEST") \
    .load()
I get this warning:
java.lang.UnsupportedOperationException: Data source snowflake does not support streamed reading
I haven't been able to find anything in the Spark Structured Streaming Docs that explicitly says Snowflake is supported as a source, but I'd like to make sure I'm not missing anything obvious.
Thanks!
The Spark Snowflake connector currently does not support using the .writeStream/.readStream calls from Spark Structured Streaming.

Using a dynamic index on Spark Structured Streaming with ES Hadoop

I have Elasticsearch 6.4.2 and Spark 2.2.0
Currently I have a working example where I can send data from a Dataset into Elasticsearch via the writeStream (Structured Streaming) API:
ds.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  .option("checkpointLocation", "hdfs://X.X.X.X:9000/tmp")
  .option("es.resource.write", "index/doc")
  .option("es.nodes", "X.X.X.X")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
However, I am interested in using dynamic index names to create a new index based on the date of the event. Per the documentation, it is supposedly possible to do that using the es.resource.write configuration with a special format:
.option("es.resource.write","index-{myDateField}/doc")
Despite all my efforts, when I try to run the code with the curly braces in place, it immediately crashes, stating that an illegal character '{' was detected.
Does the writeStream API currently support this configuration?
spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)
  .csv("/data/csv/") // csv file home
  .writeStream
  .format("es")
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "file:/data/job/spark/checkpointLocation/example/StructuredEs")
  .start("structured.es.example.{name}.{date|yyyy-MM}") // dynamic index pattern; ES 7+, so no type
  .awaitTermination()

How to stop streaming from a Kafka topic after a limited time duration or record count?

My ultimate goal is to see whether a Kafka topic is running and whether the data in it is good, and otherwise fail / throw an error.
If I could pull just 100 messages, or pull for just 60 seconds, I think I could accomplish what I want. But all the streaming examples / questions I have found online have no intention of shutting down the streaming connection.
Here is the best working code I have so far. It pulls data and displays it, but it keeps trying to pull more data, and if I try to access it in the next line, it hasn't had a chance to pull the data yet. I assume I need some sort of callback. Has anyone done something similar? Is this the best way of going about this?
I am using databricks notebooks to run my code
import org.apache.spark.sql.functions.{explode, split}

val kafka = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<kafka server>:9092")
  .option("subscribe", "<topic>")
  .option("startingOffsets", "earliest")
  .load()

val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))

display(df.select($"word"))
The trick is that you don't need streaming at all. The Kafka source supports batch queries if you replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
  .read
  .format("kafka")
  ... // Remaining options
  .load()
You can find examples in the Kafka streaming documentation.
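A minimal sketch of that batch approach, reusing the placeholder server and topic from the question; the endingOffsets option is what bounds the read:
// Bounded batch read from Kafka: no streaming query, no shutdown logic needed.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "<kafka server>:9092")
  .option("subscribe", "<topic>")
  .option("startingOffsets", "earliest")
  // "latest" reads everything currently in the topic; per-partition JSON
  // offsets (e.g. """{"<topic>":{"0":100}}""") can cap how far the read goes.
  .option("endingOffsets", "latest")
  .load()

// The result is a plain DataFrame, so it can be inspected immediately.
display(df.selectExpr("CAST(value AS STRING) AS value").limit(100))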
For streaming queries you can use the Once trigger, although it might not be the best choice in this case:
df.writeStream
  .trigger(Trigger.Once)
  ... // Handle the output, for example with the foreach sink (?)
You could also use the standard Kafka client to fetch some data without starting a SparkSession.
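A rough sketch of that client-side check using the plain Java Kafka consumer (org.apache.kafka:kafka-clients); the server, topic, group id, and 60-second poll window are placeholders:
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "<kafka server>:9092")
props.put("group.id", "topic-health-check")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("<topic>"))

// Poll for up to 60 seconds, then stop and inspect whatever arrived.
val records = consumer.poll(Duration.ofSeconds(60)).asScala.toSeq
consumer.close()

if (records.isEmpty) {
  throw new RuntimeException("No messages received from <topic> within 60 seconds")
} else {
  records.take(100).foreach(r => println(s"offset=${r.offset()} value=${r.value()}"))
}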
