How to output of structured stream without truncation in Pyspark? - apache-spark

I am trying to output structured stream result to console:
.writeStream \
.outputMode("append") \
.format("console") \
.start()
output table looks like that:
+--------------------+--------+--------+
| column#1|column#2|column#3|
+--------------------+--------+--------+
|08/25/2022 00:00:...|abcde...|12345...|
+--------------------+--------+--------+
how can I output the whole content without truncation? expected result is as using show(truncate=False):
+--------------------+--------+--------+
| column#1|column#2|column#3|
+--------------------+--------+--------+
|08/25/2022 00:00:00|abcdefgh|12345678|
+--------------------+--------+--------+

Related

Spark Structured Streaming inconsistent output to multiple sinks

I am using spark structured streaming to read data from Kafka and apply some udf to the dataset. The code as below :
calludf = F.udf(lambda x: function_name(x))
dfraw = spark.readStream.format('kafka') \
.option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
.option('subscribe', topic_name) \
.load()
df = dfraw.withColumn("value", F.col('value').cast('string')).withColumn('value', calludf(F.col('value')))
ds = df.selectExpr("CAST(value AS STRING)") \
.writeStream \
.format('console') \
.option('truncate', False) \
.start()
dsf = df.selectExpr("CAST (value AS STRING)") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
.option("topic", topic_name_two) \
.option("checkpointLocation", checkpoint_location) \
.start()
ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them failed due to some issue with the data which is understandable. The console displays rest of the 8 processed dataframes BUT only 6 of those 8 processed dataframes are written to the Kafka topic using dsf steaming query. Even though I have added checkpoint location to it but it is still not working.
PS: Do let me know if you have any suggestion regarding the code as well. I am new to spark structured streaming so maybe there is something wrong with the way I am doing it.

Spark structured streaming truncates Kafka value string to 4095

The following code
builder = SparkSession.builder\
.appName("PythonTest11")
spark = builder.getOrCreate()
#spark.conf.set("spark.sql.debug.maxToStringFields", 10000)
# Subscribe to 1 topic
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
.option("subscribe", dataFlowTopic) \
.load()
df = df \
.selectExpr("LENGTH(value)")
#.selectExpr("CAST(value as string)") \
df.printSchema()
# Start running the query that prints the running counts to the console
query = df \
.writeStream \
.outputMode('append') \
.format('console') \
.start()
query.awaitTermination()
prints
+-------------+
|length(value)|
+-------------+
| 4095|
+-------------+
for any big message, i.e. it truncates incoming strings.
How to fix this?
It was something like console truncation or something. Not Kafka or Spark problem.
First I was running
# kafka-console-producer.sh --topic dataflow --bootstrap-server localhost:9092
and then pasting messages to it's command line and truncation was occurring.
The I ran
# kafka-console-producer.sh --topic dataflow --bootstrap-server localhost:9092 < row01.json
with the same data inside row01.json and it worked without truncation.

Databricks: Structured Stream fails with TimeoutException

I want to create a structured stream in databricks with a kafka source.
I followed the instructions as described here. My script seems to start, however it fails with the first element of the stream. The stream itsellf works fine and produces results and works (in databricks) when I use confluent_kafka, thus there seems to be a different issue I am missing:
After the initial stream is processed, the script times out:
java.util.concurrent.TimeoutException: Stream Execution thread for stream [id = 80afdeed-9266-4db4-85fa-66ccf261aee4,
runId = b564c626-9c74-42a8-8066-f1f16c7ab53d] failed to stop within 36000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.`
WHAT I TRIED: looking at SO and finding this answer, to which I included
spark.conf.set("spark.sql.streaming.stopTimeout", 36000)
into my setup - which changed nothing.
Any input is highly appreciated!
from pyspark.sql import functions as F
from pyspark.sql.types import *
# Define a data schema
schema = StructType() \
.add('PARAMETERS_TEXTVALUES_070_VALUES', StringType())\
.add('ID', StringType())\
.add('PARAMETERS_TEXTVALUES_001_VALUES', StringType())\
.add('TIMESTAMP', TimestampType())
df = spark \
.readStream \
.format("kafka") \
.option("host", "stream.xxx.com") \
.option("port", 12345)\
.option('kafka.bootstrap.servers', 'stream.xxx.com:12345') \
.option('subscribe', 'stream_test.json') \
.option("startingOffset", "earliest") \
.load()
df_word = df.select(F.col('key').cast('string'),
F.from_json(F.col('value').cast('string'), schema).alias("parsed_value"))
df_word \
.writeStream \
.format("parquet") \
.option("path", "dbfs:/mnt/streamfolder/stream/") \
.option("checkpointLocation", "dbfs:/mnt/streamfolder/check/") \
.outputMode("append") \
.start()
my stream output data looks like this:
"PARAMETERS_TEXTVALUES_070_VALUES":'something'
"ID":"47575963333908"
"PARAMETERS_TEXTVALUES_001_VALUES":12345
"TIMESTAMP": "2020-10-22T15:06:42.507+02:00"
Furthermore, stream and check folders are filled with 0-b files, except for metadata, which includes the ìd from the error above.
Thanks and stay safe.

Why an exception while writing an aggregated dataframe to a file sink?

I'm performing an aggregation on a streaming dataframe and trying to write the result to an output directory. But I'm getting an exception saying
pyspark.sql.utils.AnalysisException: 'Data source json does not support Update output mode;
I'm getting similar error with "complete" output mode.
This is my code:
grouped_df = logs_df.groupBy('host', 'timestamp').agg(count('host').alias('total_count'))
result_host = grouped_df.filter(col('total_count') > threshold)
writer_query = result_host.writeStream \
.format("json") \
.queryName("JSON Writer") \
.outputMode("update") \
.option("path", "output") \
.option("checkpointLocation", "chk-point-dir") \
.trigger(processingTime="1 minute") \
.start()
writer_query.awaitTermination()
FileSinks only support "append" mode according to documentation on OutputSinks, see "supported output modes" in below table.

How to write parquet files from streaming query?

I'm reading from a CSV file using Spark 2.2 structured streaming.
My query for writing the result to the console is this:
val consoleQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.writeStream
.format("console")
.option("truncate", value = false)
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Complete())
The result looks fine:
+---------------------------------------------+-------------+-----+
|window |id |count|
+---------------------------------------------+-------------+-----+
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000001|1 |
|[2017-02-17 09:00:00.0,2017-02-17 10:00:00.0]|EXC0000000002|8 |
|[2017-02-17 08:00:00.0,2017-02-17 09:00:00.0]|EXC2200002 |1 |
+---------------------------------------------+-------------+-----+
But when writing it to a Parquet file
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
and reading it in with another job,
val data = spark.read.parquet("src/main/resources/parquet/")
the result is this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
TL;DR parquetQuery has not been started and so no output from a streaming query.
Check out the type of parquetQuery which is org.apache.spark.sql.streaming.DataStreamWriter which is simply a description of a query that at some point is supposed to be started. Since it was not, the query has never been able to do anything that would write the stream.
Add start at the very end of parquetQuery declaration (right after or as part of the call chain).
val parquetQuery = exceptions
.withWatermark("time", "5 years")
.groupBy(window($"time", "1 hour"), $"id")
.count()
.coalesce(1)
.writeStream
.format("parquet")
.option("path", "src/main/resources/parquet")
.trigger(Trigger.ProcessingTime(10.seconds))
.option("checkpointLocation", "src/main/resources/checkpoint")
.outputMode(OutputMode.Append())
.start // <-- that's what you miss

Resources