PySpark structured streaming output sink as Kafka giving error - apache-spark

Using Kafka 0.9.0 and Spark 2.1.0, I am using PySpark Structured Streaming to compute results and write them to a Kafka topic. I am referring to the Spark docs for this:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Now, when I run the following command (output mode is "complete" since the query aggregates the streaming data):
(mydataframe.writeStream
.outputMode("complete")
.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "topicname")
.option("checkpointLocation","/data/checkpoint/1")
.start())
It gives me the error below:
ERROR StreamExecution: Query [id = 0686130b-8668-48fa-bdb7-b79b63d82680, runId = b4b7494f-d8b8-416e-ae49-ad8498dfe8f2] terminated with error
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:73)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:73)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.kafka010.KafkaWriter$.validateQuery(KafkaWriter.scala:72)
at org.apache.spark.sql.kafka010.KafkaWriter$.write(KafkaWriter.scala:88)
at org.apache.spark.sql.kafka010.KafkaSink.addBatch(KafkaSink.scala:38)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:502)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
I am not sure what attribute 'value' it expects. I need help resolving this.
The console output sink produces the correct output on the console, so the code itself seems to work fine. The issue occurs only when Kafka is used as the output sink.

I am not sure what attribute 'value' it expects. I need help resolving this.
Your mydataframe needs a column named value (of either StringType or BinaryType) containing the payload (message) you want to send to Kafka.
Currently you are trying to write to Kafka, but you are not describing which data should be written.
One way to obtain such a column is to rename an existing column using .withColumnRenamed. If you want to write multiple columns, it is usually a good idea to create a single column containing a JSON representation of the row, which you can obtain with the to_json function from pyspark.sql.functions. But beware of .toJSON!
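For example, a minimal sketch of that fix, assuming the aggregated result is the mydataframe from the question (whatever columns it has end up inside the JSON payload):

# Pack every column of each row into one JSON string named 'value', which the Kafka sink requires.
kafka_ready = mydataframe.selectExpr("to_json(struct(*)) AS value")

(kafka_ready.writeStream
    .outputMode("complete")
    .format("kafka")
    .option("kafka.bootstrap.servers", "x.x.x.x:9092")
    .option("topic", "topicname")
    .option("checkpointLocation", "/data/checkpoint/1")
    .start())

Note that, as the next answer points out, this alone will not help on Spark 2.1.0, because the Kafka sink itself only arrived in 2.2.0.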

Spark 2.1.0 does not support Kafka as an output sink; it was introduced in 2.2.0, as per the documentation.
See also this answer, which links to the commit introducing the feature and provides an alternative solution, as well as this JIRA, which added the documentation in 2.2.1.

Related

Can't read via Apache Spark Structured Streaming from Hive Table

When I try to read from a Hive table with the following code, I get the error buildReader is not supported for HiveFileFormat from the Spark driver pod.
spark.readStream \
.table("table_name") \
.repartition("filename") \
.writeStream \
.outputMode("append") \
.trigger(processingTime="10 minutes") \
.foreachBatch(perBatch)
I have tried every possible combination, including the simplest queries possible. Reading directly from the specified folder via the parquet method works, as does Spark SQL without streaming, but reading with Structured Streaming via readStream does not.
The documentation says the following...
Since Spark 3.1, you can also use DataStreamReader.table() to read tables as streaming DataFrames and use DataStreamWriter.toTable() to write streaming DataFrames as tables:
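For reference, a minimal sketch of that documented pattern (the table names and checkpoint path are hypothetical):

# Read a table as a streaming DataFrame and write the stream back out to another table (Spark 3.1+)
(spark.readStream
    .table("source_table")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/source_to_target")
    .toTable("target_table"))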
I'm using the latest Spark 3.2.1. Although reading from a table is not shown in the examples, the quoted paragraph clearly suggests it should be possible.
Any assistance to help get this working would be really great and simplify my project a lot.

Writing data as JSON array with Spark Structured Streaming

I have to write data from Spark Structured Streaming as a JSON array. I have tried the code below:
df.selectExpr("to_json(struct(*)) AS value").toJSON
which returns a Dataset[String], but I am unable to write it as a JSON array.
Current Output:
{"name":"test","id":"id"}
{"name":"test1","id":"id1"}
Expected Output:
[{"name":"test","id":"id"},{"name":"test1","id":"id1"}]
Edit (moving comments into question):
After using the proposed collect_list method, I am getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
Then I tried something like this -
withColumn("timestamp", unix_timestamp(col("event_epoch"), "MM/dd/yyyy hh:mm:ss aa")) .withWatermark("event_epoch", "1 minutes") .groupBy(col("event_epoch")) .agg(max(col("event_epoch")).alias("timestamp"))
But I don't want to add a new column.
You can use the built-in SQL function collect_list for this. It collects and returns a list of non-unique elements (in contrast to collect_set, which returns only unique elements).
From the source code for collect_list you will see that this is an aggregation function. The Output Modes section of the Structured Streaming Programming Guide highlights that, for aggregations without a watermark, only the output modes "complete" and "update" are supported.
As I understand from your comments, you do not wish to add a watermark or new columns. Also, the error you are facing,
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
reminds you not to use the output mode "append".
In the comments, you have mentioned that you plan to produce the results into a Kafka message: one big JSON array as one Kafka value. The complete code could look like this:
import org.apache.spark.sql.streaming.Trigger // needed for Trigger.ProcessingTime

val df = spark.readStream
  .[...] // in my test I am reading from a Kafka source
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition")
  // do not forget to convert your data into a String before writing to Kafka
  .selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value")

df.writeStream
  .format("kafka")
  .outputMode("complete")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .trigger(Trigger.ProcessingTime(10000))
  .start()
  .awaitTermination()
Given the key/value pairs (k1,v1), (k2,v2), and (k3,v3) as inputs you will get a value in the Kafka topic that contains all selected data as a JSON Array:
[{"key":"k1","value":"v1","offset":7,"partition":0}, {"key":"k2","value":"v2","offset":8,"partition":0}, {"key":"k3","value":"v3","offset":9,"partition":0}]
Tested with Spark 3.0.1 and Kafka 2.5.0.

Unable to Count the documents using spark structured streaming

I am trying to use Couchbase as the streaming source for Spark Structured Streaming, using the Spark connector.
val records = spark.readStream
.format("com.couchbase.spark.sql").schema(schema)
.load()
And I have this query
records
.groupBy("type")
.count()
.writeStream
.outputMode("complete")
.format("console")
.start()
.awaitTermination()
For this query I am not getting the correct output. My query output table looks like this:
Batch: 0
20/04/14 14:28:00 INFO CodeGenerator: Code generated in 10.538654 ms
20/04/14 14:28:00 INFO WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#17fe0ec7 committed.
+----+-----+
|type|count|
+----+-----+
+----+-----+
However, if I use Couchbase to fetch the documents in a non-streaming read, like
val cdr = spark.read.couchbase(EqualTo("type", "cdr"))
cdr.count()
The schema is correctly inferred for this non-streaming operation, and I used the same schema for the structured streaming query as well:
INFO N1QLRelation: Inferred schema is StructType(StructField(META_ID,StringType,true), StructField(_class,StringType,true), StructField(accountId,StringType,true),
and cdr.count() gives the correct output (28).
Please let me know why this is not working with structured streaming.
This is probably because you are streaming only what has changed from now onward and not the past events.
If you would like to stream everything "from the beginning" you need to specify that.
Example is shown in this blog post: https://blog.couchbase.com/couchbase-spark-connector-2-0-0-released/
Basically, in your stream you need to specify the following line:
.couchbaseStream(from = FromBeginning, to = ToInfinity)

Spark structured streaming: what are the possible usages of queryName() setting?

As per the Structured Streaming Programming Guide, queryName("myTableName") is used to define the in-memory table name when the output sink is format("memory"):
aggDF
.writeStream
.queryName("aggregates") // this query name will be the table name
.outputMode("complete")
.format("memory")
.start()
spark.sql("select * from aggregates").show() // interactively query in-memory table
The Spark source code for DataStreamWriter.scala documents queryName() as:
Specifies the name of the [[StreamingQuery]] that can be started with start().
This name must be unique among all the currently active queries in the associated SQLContext.
QUESTION: are there any other possible usages of the queryName() setting? Spark job logs? Details in the progress monitoring of the query?
I came across the following three usages of the queryName:
As mentioned by OP and documented in the Structured Streaming Guide it is used to define the in-memory table name when the output sink is of format "memory".
The queryName defines the value of event.progress.name, where the event is a QueryProgressEvent within a StreamingQueryListener (see the sketch after this list).
It is also used in the description column of the Spark Web UI (see the screenshot, where I set queryName("StackoverflowTest")).
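For usage (2), a minimal PySpark sketch of such a listener (the Python listener API is only available in newer Spark releases, roughly 3.4+; the class name is mine, and an active SparkSession called spark is assumed):

from pyspark.sql.streaming import StreamingQueryListener

class QueryNameLogger(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"Query started: {event.name}")                 # the name set via queryName()
    def onQueryProgress(self, event):
        print(f"Progress for query: {event.progress.name}")   # queryName shows up here too
    def onQueryIdle(self, event):
        pass
    def onQueryTerminated(self, event):
        print(f"Query terminated: {event.id}")

spark.streams.addListener(QueryNameLogger())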
Adding to @mike's answer, I want to mention that in Databricks (which uses Spark at its core) you can use the defined query name in conjunction with the function untilStreamIsReady().
For example, if you define the streaming query StackoverflowTest, then you could execute untilStreamIsReady('StackoverflowTest') to wait until the query is ready and started (sorry for being Captain Obvious).
I must say I could not find a direct reference for this function in the official documentation, but found it in the following links:
In Spark Streaming, is there a way to detect when a batch has finished?
example of usage: https://youtu.be/KLD10xn4sX8?t=1219
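The helper itself does not appear in the open-source API, but a rough equivalent built only on public PySpark calls might look like this (a sketch; the function name, polling interval, and timeout are mine):

import time

def until_stream_is_ready(spark, query_name, timeout_sec=60):
    # Poll the active streaming queries until one with the given queryName
    # is running and has reported at least one progress update.
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        matches = [q for q in spark.streams.active if q.name == query_name]
        if matches and matches[0].isActive and len(matches[0].recentProgress) > 0:
            return True
        time.sleep(1)
    return False

# until_stream_is_ready(spark, "StackoverflowTest")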

Kafka spark streaming dynamic schema

I am struggling with Kafka Spark streaming and a dynamic schema.
I am consuming from Kafka (KafkaUtils.createDirectStream). Each message is JSON, fields can be nested, and any field can appear in some messages and be missing from others.
The only thing I found is to do:
Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)
case class MyTyp(column1: Option[Any], column2: Option[Any]....)
I am not sure whether this will cover fields that may or may not appear, and nested fields.
Any approval/other ideas/general help will be appreciated...
After long integration and trials, I found two ways to handle schema-less Kafka consumption:
1) Run each message through an "editing/validation" lambda function. Not my favorite.
2) In Spark, on each micro batch, obtain the flattened schema and intersect it with the needed columns, then use Spark SQL to query the frame for the needed data. That worked for me.
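A minimal PySpark sketch of approach 2, under stated assumptions (the needed column names, the helper name, and feeding the raw JSON payloads of one micro batch as plain strings are all mine; flattening of nested fields is omitted for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
NEEDED_COLUMNS = {"column1", "column2"}  # hypothetical columns we care about

def handle_micro_batch(json_strings):
    # json_strings: the raw JSON payloads (Kafka values) of one micro batch
    batch_df = spark.read.json(spark.sparkContext.parallelize(json_strings))
    # Keep only the needed columns that actually appear in this batch's inferred schema
    present = [c for c in batch_df.columns if c in NEEDED_COLUMNS]
    batch_df.select(*present).createOrReplaceTempView("events")
    return spark.sql("SELECT * FROM events")

# Example: two messages whose fields differ
handle_micro_batch(['{"column1": 1, "column2": "a"}', '{"column1": 2, "extra": true}']).show()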
