Why an exception while writing an aggregated dataframe to a file sink? - apache-spark

I'm performing an aggregation on a streaming dataframe and trying to write the result to an output directory, but I'm getting an exception saying
pyspark.sql.utils.AnalysisException: 'Data source json does not support Update output mode;
I get a similar error with the "complete" output mode.
This is my code:
from pyspark.sql.functions import col, count

# Count events per host and timestamp, then keep only hosts above the threshold
grouped_df = logs_df.groupBy('host', 'timestamp').agg(count('host').alias('total_count'))
result_host = grouped_df.filter(col('total_count') > threshold)

writer_query = result_host.writeStream \
    .format("json") \
    .queryName("JSON Writer") \
    .outputMode("update") \
    .option("path", "output") \
    .option("checkpointLocation", "chk-point-dir") \
    .trigger(processingTime="1 minute") \
    .start()

writer_query.awaitTermination()

File sinks only support the "append" output mode; see the "Supported Output Modes" column of the "Output Sinks" table in the Structured Streaming Programming Guide.
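Because a streaming aggregation in append mode needs an event-time watermark before finalized groups can be emitted, one way to keep the JSON file sink is roughly the following (a minimal sketch, assuming timestamp is an event-time column and a 10-minute watermark delay is acceptable):

from pyspark.sql.functions import col, count

# Append mode with an aggregation needs a watermark so Spark knows when a
# group is final and can be flushed to the file sink
grouped_df = logs_df \
    .withWatermark('timestamp', '10 minutes') \
    .groupBy('host', 'timestamp') \
    .agg(count('host').alias('total_count'))
result_host = grouped_df.filter(col('total_count') > threshold)

writer_query = result_host.writeStream \
    .format("json") \
    .queryName("JSON Writer") \
    .outputMode("append") \
    .option("path", "output") \
    .option("checkpointLocation", "chk-point-dir") \
    .trigger(processingTime="1 minute") \
    .start()

The trade-off is that a group's result only appears after the watermark has passed its timestamp, so output is delayed by roughly the watermark interval.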

Related

Spark Structured Streaming inconsistent output to multiple sinks

I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is below:
from pyspark.sql import functions as F

# Wrap the custom function as a UDF
calludf = F.udf(lambda x: function_name(x))

# Read the raw stream from Kafka
dfraw = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
    .option('subscribe', topic_name) \
    .load()

# Cast the Kafka value to string and apply the UDF
df = dfraw.withColumn("value", F.col('value').cast('string')).withColumn('value', calludf(F.col('value')))

# Sink 1: console
ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .start()

# Sink 2: Kafka
dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()

ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them fail due to some issue with the data, which is understandable. The console displays the remaining 8 processed dataframes, BUT only 6 of those 8 are written to the Kafka topic by the dsf streaming query. Even though I have added a checkpoint location to it, it is still not working.
PS: Do let me know if you have any suggestions regarding the code as well. I am new to Spark Structured Streaming, so maybe there is something wrong with the way I am doing it.
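One structural note on the code (a sketch only, and not necessarily the cause of the missing records): ds.awaitTermination() blocks on the console query alone, so the Kafka query is only awaited after the first one ends. When several queries run in the same application, PySpark's StreamingQueryManager can wait on all of them:

# Start both sinks first, then block until any active query terminates
ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .start()

dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()

spark.streams.awaitAnyTermination()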

upsert (merge) delta with spark structured streaming

I need to upsert data in real time (with Spark Structured Streaming) in Python.
The data is read in real time (CSV format) and then written as a Delta table (we want to update existing data, which is why we use MERGE INTO from Delta).
I am using the Delta engine with Databricks.
I coded this:
from delta.tables import *
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.streaming.schemaInference", "true") \
    .appName("SparkTest") \
    .getOrCreate()

# CSV data that we read in real time
sourcedf = spark.readStream.format("csv") \
    .option("header", True) \
    .load("/mnt/user/raw/test_input")

spark.conf.set("spark.sql.shuffle.partitions", "1")

# Create an empty Delta table with the source schema
spark.createDataFrame([], sourcedf.schema) \
    .write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("deltaTable")

# Merge each micro-batch into the Delta table
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

sourcedf.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
    .option("path", "/mnt/user/raw/PARQUET/output") \
    .start() \
    .awaitTermination()
but nothing gets written to the output path as expected. The checkpoint path gets filled as expected, and a display of the Delta table gives me results too:
display(table("deltaTable"))
In the Spark UI I see the writeStream step:
sourcedf.writeStream \ .format("delta") \ ....
first at Snapshot.scala:156+details
RDD: Delta Table State #1 - dbfs:/user/hive/warehouse/deltatable/_delta_log
Any idea how to fix this so I can upsert CSV data into Delta tables in S3 in real time with Spark?
Best regards
Apologies for a late reply, but just in case anyone else has the same problem: I have found the below worked for me. I wonder if it is because you didn't use "cloudFiles" on your readStream to make use of Auto Loader:
%python
# Read the CSV input with Auto Loader (cloudFiles)
sourcedf = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.includeExistingFiles", "true") \
    .schema(csvSchema) \
    .load("/mnt/user/raw/test_input")

%sql
CREATE TABLE IF NOT EXISTS deltaTable (
  col1 int NOT NULL,
  col2 string NOT NULL,
  col3 bigint,
  col4 int
)
USING DELTA
LOCATION '/mnt/user/raw/PARQUET/output'

%python
# Merge each micro-batch into the Delta table
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

%python
sourcedf.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
    .start("/mnt/user/raw/PARQUET/output")

How to make Spark streams execute sequentially

Issue
I have a job that executes two streams, but I want the second one to start only after the first has finished, since the first stream saves events from the readStream into a Delta table that serves as input for the second stream. The problem is that what is added by the first stream is not available to the second stream in the current notebook run, because they start simultaneously.
Is there a way to enforce the order while running it from the same notebook?
I've tried the awaitTermination function but discovered it does not solve my problem. Some pseudocode:
def main():
    # Read eventhub
    metricbeat_df = spark \
        .readStream \
        .format("eventhubs") \
        .options(**eh_conf) \
        .load()

    # Save raw events
    metricbeat_df.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query1") \
        .table("my_db.raw_events")

    # Parse events
    metricbeat_df = spark.readStream \
        .format("delta") \
        .option("ignoreDeletes", True) \
        .table("my_db.raw_events")

    # *Do some transformations here*
    metricbeat_df.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query2") \
        .table("my_db.joined_bronze_events")
TLDR
To summarize the issue: when I run the code above, query1 and query2 start at the same time, which causes my_db.joined_bronze_events to lag a bit behind my_db.raw_events, because what is added by query1 is not available to query2 in the current run (it will be in the next run, of course).
Is there a way to enforce that query2 will not start until query1 has finished while still running it in the same notebook?
As you are using the option Trigger.once, you can make use of the processAllAvailable method in your StreamingQuery:
def main():
    # Read eventhub
    # note that I have changed the variable name to metricbeat_df1
    metricbeat_df1 = spark \
        .readStream \
        .format("eventhubs") \
        .options(**eh_conf) \
        .load()

    # Save raw events and block until all available data has been processed
    metricbeat_df1.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query1") \
        .table("my_db.raw_events") \
        .processAllAvailable()

    # Parse events
    # note that I have changed the variable name to metricbeat_df2
    metricbeat_df2 = spark.readStream \
        .format("delta") \
        .option("ignoreDeletes", True) \
        .table("my_db.raw_events")

    # *Do some transformations here*
    metricbeat_df2.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .partitionBy("year", "month", "day") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/...") \
        .queryName("query2") \
        .table("my_db.joined_bronze_events") \
        .processAllAvailable()
Note that I have changed the dataframe names, as they should not be the same for both streaming queries.
The method processAllAvailable is described as:
"Blocks until all available data in the source has been processed and committed to the sink. This method is intended for testing. Note that in the case of continually arriving data, this method may block forever. Additionally, this method is only guaranteed to block until data that has been synchronously appended data to a org.apache.spark.sql.execution.streaming.Source prior to invocation. (i.e. getOffset must immediately reflect the addition)."

Databricks: Structured Stream fails with TimeoutException

I want to create a structured stream in Databricks with a Kafka source.
I followed the instructions described here. My script seems to start, but it fails on the first element of the stream. The stream itself works fine and produces results, and it works (in Databricks) when I use confluent_kafka, so there seems to be a different issue I am missing:
After the initial stream is processed, the script times out:
java.util.concurrent.TimeoutException: Stream Execution thread for stream [id = 80afdeed-9266-4db4-85fa-66ccf261aee4,
runId = b564c626-9c74-42a8-8066-f1f16c7ab53d] failed to stop within 36000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.
WHAT I TRIED: searching SO, I found this answer, based on which I included
spark.conf.set("spark.sql.streaming.stopTimeout", 36000)
into my setup - which changed nothing.
Any input is highly appreciated!
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Define a data schema
schema = StructType() \
    .add('PARAMETERS_TEXTVALUES_070_VALUES', StringType()) \
    .add('ID', StringType()) \
    .add('PARAMETERS_TEXTVALUES_001_VALUES', StringType()) \
    .add('TIMESTAMP', TimestampType())

# Note: "host" and "port" are not standard options of the Kafka source (the broker is
# configured via kafka.bootstrap.servers), and the documented option name is
# "startingOffsets" (plural) rather than "startingOffset"
df = spark \
    .readStream \
    .format("kafka") \
    .option("host", "stream.xxx.com") \
    .option("port", 12345) \
    .option('kafka.bootstrap.servers', 'stream.xxx.com:12345') \
    .option('subscribe', 'stream_test.json') \
    .option("startingOffset", "earliest") \
    .load()

# Parse the Kafka key and JSON value
df_word = df.select(F.col('key').cast('string'),
                    F.from_json(F.col('value').cast('string'), schema).alias("parsed_value"))

# Write the parsed stream to Parquet files
df_word \
    .writeStream \
    .format("parquet") \
    .option("path", "dbfs:/mnt/streamfolder/stream/") \
    .option("checkpointLocation", "dbfs:/mnt/streamfolder/check/") \
    .outputMode("append") \
    .start()
my stream output data looks like this:
"PARAMETERS_TEXTVALUES_070_VALUES":'something'
"ID":"47575963333908"
"PARAMETERS_TEXTVALUES_001_VALUES":12345
"TIMESTAMP": "2020-10-22T15:06:42.507+02:00"
Furthermore, the stream and check folders are filled with 0-byte files, except for metadata, which includes the id from the error above.
Thanks and stay safe.

How to include partitioned column in pyspark dataframe read method

I am writing an Avro file from a Parquet file. I have read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST") \
    .load("/Users/rashmik/flight-time.parquet")
Writing data
I have written the file in Avro format as below:
dfParquetRePartitioned.write \
    .format("avro") \
    .mode("overwrite") \
    .option("path", "datasink/avro") \
    .partitionBy("OP_CARRIER") \
    .option("maxRecordsPerFile", 100000) \
    .save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from datasink/avro directory. I am using the below code to read from datasink/avro
dfAvro = spark.read.format("avro") \
    .option("mode", "FAILFAST") \
    .load("datasink/avro/OP_CARRIER=AA")
It reads the data successfully, but as expected the OP_CARRIER column is not available in the dfAvro dataframe, since it is a partition column of the first job. Now my requirement is to also include the OP_CARRIER field in the second dataframe, i.e. in dfAvro. Could somebody help me with this?
I am referring to the Spark documentation, but I am not able to locate the relevant information. Any pointer will be very helpful.
You can replicate the same column value under a different alias.

from pyspark.sql.functions import col

# Duplicate the partition column under a new name so it survives partitionBy
dfParquetRePartitioned.withColumn("OP_CARRIER_1", col("OP_CARRIER")) \
    .write \
    .format("avro") \
    .mode("overwrite") \
    .option("path", "datasink/avro") \
    .partitionBy("OP_CARRIER") \
    .option("maxRecordsPerFile", 100000) \
    .save()

This would give you what you wanted, but under a different alias.
Or you can also do it during the read. If the location is dynamic, you can easily append the column:

from pyspark.sql.functions import lit

# Derive the partition column name and value from the path itself
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
    .option("mode", "FAILFAST") \
    .load(path) \
    .withColumn(newcol[0], lit(newcol[1]))
If the value is static, it is much easier to add it during the data read.
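Another option worth noting (a sketch, not part of the original answer) is the basePath data source option: pointing basePath at the table root while loading a single partition directory lets Spark's partition discovery keep the partition column in the result.

# Pointing basePath at the table root keeps OP_CARRIER as a regular column
# even though only the OP_CARRIER=AA partition directory is loaded
dfAvro = spark.read.format("avro") \
    .option("mode", "FAILFAST") \
    .option("basePath", "datasink/avro") \
    .load("datasink/avro/OP_CARRIER=AA")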
