Structured Streaming (Spark 2.3.0) cannot write to Parquet file sink when submitted as a job - apache-spark

I'm consuming from Kafka and writing to parquet in EMRFS. Below code works in spark-shell:
val filesink_query = outputdf.writeStream
.partitionBy(<some column>)
.format("parquet")
.option("path", <some path in EMRFS>)
.option("checkpointLocation", "/tmp/ingestcheckpoint")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start
SBT is able to package the code without errors. When the .jar is sent to spark-submit, the job is accepted and stays in running state forever without writing data to HDFS.
There is no ERROR in the .inprogress log
Some posts suggest that a large watermark duration can cause it, but I have not set a custom watermark duration.

I can write to parquet using Pyspark, I put you my code in case that will be useful:
stream = self.spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", self.kafka_bootstrap_servers) \
.option("subscribe", self.topic) \
.option("startingOffsets", self.startingOffsets) \
.option("max.poll.records", self.max_poll_records) \
.option("auto.commit.interval.ms", self.auto_commit_interval_ms) \
.option("session.timeout.ms", self.session_timeout_ms) \
.option("key.deserializer", self.key_deserializer) \
.option("value.deserializer", self.value_deserializer) \
.load()
self.query = stream \
.select(col("value")) \
.select((self.proto_function("value")).alias("value_udf")) \
.select(*columns,
date_format(column_time, "yyyy").alias("date").alias("year"),
date_format(column_time, "MM").alias("date").alias("month"),
date_format(column_time, "dd").alias("date").alias("day"),
date_format(column_time, "HH").alias("date").alias("hour"))
query = self.query \
.writeStream \
.format("parquet") \
.option("checkpointLocation", self.path) \
.partitionBy("year", "month", "day", "hour") \
.option("path", self.path) \
.start()
Also, you need to run the code in that way: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 <code>

Related

Spark Structured Streaming inconsistent output to multiple sinks

I am using spark structured streaming to read data from Kafka and apply some udf to the dataset. The code as below :
calludf = F.udf(lambda x: function_name(x))
dfraw = spark.readStream.format('kafka') \
.option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
.option('subscribe', topic_name) \
.load()
df = dfraw.withColumn("value", F.col('value').cast('string')).withColumn('value', calludf(F.col('value')))
ds = df.selectExpr("CAST(value AS STRING)") \
.writeStream \
.format('console') \
.option('truncate', False) \
.start()
dsf = df.selectExpr("CAST (value AS STRING)") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
.option("topic", topic_name_two) \
.option("checkpointLocation", checkpoint_location) \
.start()
ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them failed due to some issue with the data which is understandable. The console displays rest of the 8 processed dataframes BUT only 6 of those 8 processed dataframes are written to the Kafka topic using dsf steaming query. Even though I have added checkpoint location to it but it is still not working.
PS: Do let me know if you have any suggestion regarding the code as well. I am new to spark structured streaming so maybe there is something wrong with the way I am doing it.

How to make Spark streams execute sequentially

Issue
I have a job that executes two streams in total but I want the last one to start after the first stream has finished since the first stream saves events from the readstream in a DeltaTable that serve as input for the second stream. The problem is that what is being added in the first stream is not available in the second stream, in the current notebook run, because they start simultaneously.
Is there a way to enforce the order while running it from the same notebook?
I've tried the awaitTermination function but discovered this does not solve my problem. Some pseudocode:
def main():
# Read eventhub
metricbeat_df = spark \
.readStream \
.format("eventhubs") \
.options(**eh_conf) \
.load()
# Save raw events
metricbeat_df.writeStream \
.trigger({"once": True}) \
.format("delta") \
.partitionBy("year", "month", "day") \
.outputMode("append") \
.option("checkpointLocation", "dbfs:/...") \
.queryName("query1") \
.table("my_db.raw_events")
# Parse events
metricbeat_df = spark.readStream \
.format("delta") \
.option("ignoreDeletes", True) \
.table("my_db.raw_events")
# *Do some transformations here*
metricbeat_df.writeStream \
.trigger({"once": True}) \
.format("delta") \
.partitionBy("year", "month", "day") \
.outputMode("append") \
.option("checkpointLocation", "dbfs:/...") \
.queryName("query2") \
.table("my_db.joined_bronze_events")
TLDR
To summarize the issue: when I run the code above, query1 and query2 start at the same time which causes that my_db.joined_bronze_events is a bit behind my_db.raw_events because what is being added in query1 is not available in query2 in the current run (it will be in the next run of course).
Is there a way to enforce that query2 will not start until query1 has finished while still running it in the same notebook?
As you are using the option Trigger.once, you can make use of the processAllAvailable method in your StreamingQuery:
def main():
# Read eventhub
# note that I have changed the variable name to metricbeat_df1
metricbeat_df1 = spark \
.readStream \
.format("eventhubs") \
.options(**eh_conf) \
.load()
# Save raw events
metricbeat_df1.writeStream \
.trigger({"once": True}) \
.format("delta") \
.partitionBy("year", "month", "day") \
.outputMode("append") \
.option("checkpointLocation", "dbfs:/...") \
.queryName("query1") \
.table("my_db.raw_events") \
.processAllAvailable()
# Parse events
# note that I have changed the variable name to metricbeat_df2
metricbeat_df2 = spark.readStream \
.format("delta") \
.option("ignoreDeletes", True) \
.table("my_db.raw_events")
# *Do some transformations here*
metricbeat_df2.writeStream \
.trigger({"once": True}) \
.format("delta") \
.partitionBy("year", "month", "day") \
.outputMode("append") \
.option("checkpointLocation", "dbfs:/...") \
.queryName("query2") \
.table("my_db.joined_bronze_events") \
.processAllAvailable()
Note, that I have changed the dataframe names as they should not be the same for both streaming queries.
The method processAllAvailable is described as:
"Blocks until all available data in the source has been processed and committed to the sink. This method is intended for testing. Note that in the case of continually arriving data, this method may block forever. Additionally, this method is only guaranteed to block until data that has been synchronously appended data to a org.apache.spark.sql.execution.streaming.Source prior to invocation. (i.e. getOffset must immediately reflect the addition)."

Databricks: Structured Stream fails with TimeoutException

I want to create a structured stream in databricks with a kafka source.
I followed the instructions as described here. My script seems to start, however it fails with the first element of the stream. The stream itsellf works fine and produces results and works (in databricks) when I use confluent_kafka, thus there seems to be a different issue I am missing:
After the initial stream is processed, the script times out:
java.util.concurrent.TimeoutException: Stream Execution thread for stream [id = 80afdeed-9266-4db4-85fa-66ccf261aee4,
runId = b564c626-9c74-42a8-8066-f1f16c7ab53d] failed to stop within 36000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.`
WHAT I TRIED: looking at SO and finding this answer, to which I included
spark.conf.set("spark.sql.streaming.stopTimeout", 36000)
into my setup - which changed nothing.
Any input is highly appreciated!
from pyspark.sql import functions as F
from pyspark.sql.types import *
# Define a data schema
schema = StructType() \
.add('PARAMETERS_TEXTVALUES_070_VALUES', StringType())\
.add('ID', StringType())\
.add('PARAMETERS_TEXTVALUES_001_VALUES', StringType())\
.add('TIMESTAMP', TimestampType())
df = spark \
.readStream \
.format("kafka") \
.option("host", "stream.xxx.com") \
.option("port", 12345)\
.option('kafka.bootstrap.servers', 'stream.xxx.com:12345') \
.option('subscribe', 'stream_test.json') \
.option("startingOffset", "earliest") \
.load()
df_word = df.select(F.col('key').cast('string'),
F.from_json(F.col('value').cast('string'), schema).alias("parsed_value"))
df_word \
.writeStream \
.format("parquet") \
.option("path", "dbfs:/mnt/streamfolder/stream/") \
.option("checkpointLocation", "dbfs:/mnt/streamfolder/check/") \
.outputMode("append") \
.start()
my stream output data looks like this:
"PARAMETERS_TEXTVALUES_070_VALUES":'something'
"ID":"47575963333908"
"PARAMETERS_TEXTVALUES_001_VALUES":12345
"TIMESTAMP": "2020-10-22T15:06:42.507+02:00"
Furthermore, stream and check folders are filled with 0-b files, except for metadata, which includes the ìd from the error above.
Thanks and stay safe.

Pyspark Kafka structured streaming: error while writing out

I am able to read a stream from a Kafka topic and write the (transformed) data back to another Kafka topic in two different steps in PySpark. The code to do that is as follows:
# Define Stream:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "instream") \
.load()
# Transform
matchdata = df.select(from_json(F.col("value").cast("string"),schema).alias("value"))\
.select(F.col('value').cast("string"))
# Stream the data, from a Kafka topic to a Spark in-memory table
query = matchdata \
.writeStream \
.format("memory") \
.queryName("PositionTable") \
.outputMode("append") \
.start()
query.awaitTermination(5)
# Create a new dataframe after stream completes:
tmp_df=spark.sql("select * from PositionTable")
# Write data to a different Kafka topic
tmp_df \
.write \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "outstream") \
.save()
The code above works as expected: the data in Kafka topic "instream" is read in PySpark, and then PySpark can write out data to Kafka topic "outstream".
However, I would like to read the stream in and write the transformed data back out immediately (the stream will be unbounded and we would like insights immediately as the data rolls in). Following the documentation, I replaced the query above with the following:
query = matchdata \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "outstream") \
.option("checkpointLocation", "/path/to/HDFS/dir") \
.start()
This does not appear to work.
There is no error message, so I do not know what is wrong. I've also tried windowing and aggregating within windows, but that also does not work. Any advice will be appreciated!
Ok, I found the problem. The main reason was that the subdirectory "path/to/HDFS/dir" has to exist. After creating that directory the code ran as expected. It would have been nice if an error message stated something along those lines.

Spark Streaming: Write dataframe to ElasticSearch

I using the following code to write a stream to elasticsearch from python (pyspark) application.
#Streaming code
query = df.writeStream \
.outputMode("append") \
.format("org.elasticsearch.spark.sql") \
.option("checkpointLocation", "/tmp/") \
.option("es.resource", "logs/raw") \
.option("es.nodes", "localhost") \
.start()
query.awaitTermination()
If I write the results to the console it works fine, also, if I write to ES - not in streaming mode, it works ok. This is the code I used to write to ES:
#Not streaming
df.write.format("org.elasticsearch.spark.sql") \
.mode('append') \
.option("es.resource", "log/raw") \
.option("es.nodes", "localhost").save("log/raw")
The thing is, I can't debug it, the code is running, but nothing is written to ES (in streaming mode).
Thanks,
Eventually did work out for me, the problem was technical (needed vpn)
query = df.writeStream \
.outputMode("append") \
.queryName("writing_to_es") \
.format("org.elasticsearch.spark.sql") \
.option("checkpointLocation", "/tmp/") \
.option("es.resource", "index/type") \
.option("es.nodes", "localhost") \
.start()
query.awaitTermination()
Code:
val stream = df
.writeStream
.option("checkpointLocation", checkPointDir)
.format("es")
.start("realtime/data")
SBT Dependency:
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "6.2.4"

Resources