Spark Structured Streaming Batch - apache-spark

I am running batch in Structured programming of Spark. The below snippet code throws error saying "kafka is not a valid Spark SQL Data Source;". The version I am using for the same is --> spark-sql-kafka-0-10_2.10. Your help is appreciated. Thanks.
Dataset<Row> df = spark
.read()
.format("kafka")
.option("kafka.bootstrap.servers", "*****")
.option("subscribePattern", "test.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load();
Exception in thread "main" org.apache.spark.sql.AnalysisException: kafka is not a valid Spark SQL Data Source.;

I had the same problem and like me you are using read instead of readStream.
Changing spark.read() to spark.readStream worked fine for me.

Use the spark-submit mechanism and pass along -jars spark-sql-kafka-0-10_2.11-2.1.1.jar
Adjust the version of kafka, scala and spark in that library according to ur own situation.

Related

Apache Spark with kafka stream - Missing Kafka

I have trying to setup the Apache Spark with kafka and wrote simple program in local and its failing and not able figure out from debug.
build.gradle.kts
implementation ("org.jetbrains.kotlin:kotlin-stdlib:1.4.0")
implementation ("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0_2.12:1.0.0-preview1")
compileOnly("org.apache.spark:spark-sql_2.12:3.0.0")
implementation("org.apache.kafka:kafka-clients:3.0.0")
Main function code is
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Ship metrics").orCreate
val shipmentDataFrame = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.option("includeHeaders", "true")
.load()
val query = shipmentDataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
query.writeStream()
.format("console")
.outputMode("append")
.start()
.awaitTermination()
and getting error :
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:194)
at com.tgt.ff.axon.shipmetriics.stream.ShipmentStream.run(ShipmentStream.kt:23)
at com.tgt.ff.axon.shipmetriics.ApplicationKt.main(Application.kt:12)
21/12/25 22:22:56 INFO SparkContext: Invoking stop() from shutdown hook
The Kotlin API for Spark by JetBrains (https://github.com/Kotlin/kotlin-spark-api) has support for streaming since the 1.1.0 update.
There is also an example with a Kafka stream which might be of help to you: https://github.com/Kotlin/kotlin-spark-api/blob/spark-3.2/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/streaming/KotlinDirectKafkaWordCount.kt
It does use the Spark DStream API instead of the Spark Structured Streaming API you appear to be using.
You can, of course, also still use the structured streaming one, if you prefer that, but then it needs to be deployed like is described here.

Spark-Streaming hangs with kafka starting offset at earliest (Kafka 2, spark 2.4.3)

i'm having an issue with Spark-Streaming and Kafka. While running a sample program to consume from a Kafka topic and output micro-batched results to the terminal, my job seems to hang when i set the option:
df.option("startingOffsets", "earliest")
Starting the job from the latest offset works fine, results are printed to the terminal as each micro batch streams through.
I was thinking maybe this was a resouces issue--i'm trying to read from a topic with quite a bit of data. However i don't seem to have memory/cpu issues (running this job with a local[*] cluster). The job never really seems to start, but just hangs on the line:
19/09/17 15:21:37 INFO Metadata: Cluster ID: JFXVL24JQ3K4CEbE-VA58A
val sc = new SparkConf().setMaster("local[*]").setAppName("spark-test")
val streamContext = new StreamingContext(sc, Seconds(1))
val spark = SparkSession.builder().appName("spark-test")
.getOrCreate()
val topic = "topic.with.alotta.data"
//subscribe tokafka
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
//write
df.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
I'd expect to see results printed to the console....but, the application just seems to hang as I mentioned. Any thoughts? It feels like a spark resource issue (because i'm running a local "cluster" against a topic that has a lot of data. Is there something about the nature of streaming dataframes that i'm missing?
Writing to console causes all data to be collected in memory in the driver every trigger. Since you're currently not limiting the size of your batches, this means the entire topic contents is being accumulated in the driver. See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#output-sinks
Setting a limit on your batch sizes should fix your issue.
Try adding the maxOffsetsPerTrigger setting when reading from Kafka...
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.load()
See https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html for details.

Spark Streaming failing due to error on a different Kafka topic than the one being read

For the following write topic/read topic air2008rand tandem :
import org.apache.spark.sql.streaming.Trigger
(spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "air2008rand")
.load()
.groupBy('value.cast("string").as('key))
.agg(count("*").cast("string") as 'value)
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("includeTimestamp", true)
.option("topic","t1")
.trigger(Trigger.ProcessingTime("2 seconds"))
.outputMode("update")
.option("checkpointLocation","/tmp/cp")
.start)
There is an error generated due to a a different topic air2008m1-0:
scala> 19/07/14 13:27:22 ERROR MicroBatchExecution: Query [id = 711d44b2-3224-4493-8677-e5c8cc4f3db4, runId = 68a3519a-e9cf-4a82-9d96-99be833227c0]
terminated with error
java.lang.IllegalStateException: Set(air2008m1-0) are gone.
Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.org$apache$spark$sql$kafka010$KafkaMicroBatchReader$$reportDataLoss(KafkaMicroBatchReader.scala:261)
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.planInputPartitions(KafkaMicroBatchReader.scala:124)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions$lzycompute(DataSourceV2ScanExec.scala:76)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions(DataSourceV2ScanExec.scala:75)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:65)
This behavior is repeatable by stopping the read/write code (in spark-shell repl) and then re-running it.
Why is there "cross-talk" between different kafka topics here?
The problem is due to a checkpoint directory containing data from an earlier spark streaming operation. The resolution is to change the checkpoint directory.
The solution was found as a comment (from #jaceklaskowski himself) in this question [IllegalStateException]: Spark Structured Streaming is termination Streaming Query with Error

How to load all records from kafka topic using spark in batch mode

I want to load all records from kafka topic using spark, but all examples which I have seen were using spark-streaming. How can can I load messages fwom kafka exactly once?
Exact steps are listed in the official documentation, for example:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
However "all records" is rather poorly defined if the source is continuous stream, as the result depends on the point in time, when query is executed.
Additionally you should keep in mind that parallelism is limited by the partitions of the Kafka topic, so you have to be careful not to overwhelm the cluster.

How to write streaming dataset to Kafka?

I'm trying to do some enrichment to the topics data. Therefore read from Kafka sink back to Kafka using Spark structured streaming.
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("subscribe", "topicname")
.load()
val enriched = ds.select("key", "value", "topic").as[(String, String, String)].map(record => enrich(record._1,
record._2, record._3)
val query = enriched.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("topic", "desttopic")
.start()
But im getting an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source kafka does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:266)
at kafka_bridge.KafkaBridge$.main(KafkaBridge.scala:319)
at kafka_bridge.KafkaBridge.main(KafkaBridge.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Any workarounds?
As T. Gawęda mentioned, there is no kafka format to write streaming datasets to Kafka (i.e. a Kafka sink).
The currently recommended solution in Spark 2.1 is to use foreach operator.
The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java. To use this, you will have to implement the interface ForeachWriter (Scala/Java docs), which has methods that get called whenever there is a sequence of rows generated as output after a trigger. Note the following important points.
Spark 2.1 (which is currently the latest release of Spark) doesn't have it. The next release - 2.2 - will have Kafka Writer, see this commit.
Kafka Sink is the same as Kafka Writer.
Try this
ds.map(_.toString.getBytes).toDF("value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092"))
.option("topic", topic)
.start
.awaitTermination()

Resources