Apache Spark with Kafka stream - Missing Kafka data source - apache-spark

I am trying to set up Apache Spark with Kafka. I wrote a simple program that runs locally, but it fails and I am not able to figure out why from debugging.
build.gradle.kts
implementation ("org.jetbrains.kotlin:kotlin-stdlib:1.4.0")
implementation ("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0_2.12:1.0.0-preview1")
compileOnly("org.apache.spark:spark-sql_2.12:3.0.0")
implementation("org.apache.kafka:kafka-clients:3.0.0")
The main function code is:
val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("Ship metrics").orCreate

val shipmentDataFrame = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test")
    .option("includeHeaders", "true")
    .load()

val query = shipmentDataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query.writeStream()
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()
and I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:194)
at com.tgt.ff.axon.shipmetriics.stream.ShipmentStream.run(ShipmentStream.kt:23)
at com.tgt.ff.axon.shipmetriics.ApplicationKt.main(Application.kt:12)
21/12/25 22:22:56 INFO SparkContext: Invoking stop() from shutdown hook

The Kotlin API for Spark by JetBrains (https://github.com/Kotlin/kotlin-spark-api) has support for streaming since the 1.1.0 update.
There is also an example with a Kafka stream which might be of help to you: https://github.com/Kotlin/kotlin-spark-api/blob/spark-3.2/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/streaming/KotlinDirectKafkaWordCount.kt
It does use the Spark DStream API instead of the Spark Structured Streaming API you appear to be using.
You can, of course, also still use Structured Streaming if you prefer that, but then the application needs to be deployed as described in the Structured Streaming + Kafka Integration Guide referenced in the error message.
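If you stay with Structured Streaming, the usual cause of the "Failed to find data source: kafka" error is that the Kafka integration module is not on the classpath. A minimal sketch of the dependency to add to the build.gradle.kts above, assuming Spark 3.0.0 and Scala 2.12 as in the question's other dependencies:

// Kafka source/sink for Structured Streaming; keep the Scala suffix and version
// aligned with the spark-sql_2.12:3.0.0 dependency above
implementation("org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0")

Alternatively, when submitting with spark-submit, the same module can be supplied with --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0.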

Related

Spark Streaming hangs with Kafka starting offset at earliest (Kafka 2, Spark 2.4.3)

I'm having an issue with Spark Streaming and Kafka. While running a sample program to consume from a Kafka topic and output micro-batched results to the terminal, my job seems to hang when I set the option:
df.option("startingOffsets", "earliest")
Starting the job from the latest offset works fine, results are printed to the terminal as each micro batch streams through.
I was thinking maybe this was a resources issue, since I'm trying to read from a topic with quite a bit of data. However, I don't seem to have memory/CPU issues (I'm running this job with a local[*] cluster). The job never really seems to start, but just hangs on the line:
19/09/17 15:21:37 INFO Metadata: Cluster ID: JFXVL24JQ3K4CEbE-VA58A
val sc = new SparkConf().setMaster("local[*]").setAppName("spark-test")
val streamContext = new StreamingContext(sc, Seconds(1))
val spark = SparkSession.builder().appName("spark-test")
  .getOrCreate()
val topic = "topic.with.alotta.data"

// subscribe to the kafka topic
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()

// write each micro-batch to the console
df.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
I'd expect to see results printed to the console, but the application just seems to hang as I mentioned. Any thoughts? It feels like a Spark resource issue (because I'm running a local "cluster" against a topic that has a lot of data). Is there something about the nature of streaming DataFrames that I'm missing?
Writing to the console causes all data to be collected in memory in the driver on every trigger. Since you're currently not limiting the size of your batches, this means the entire contents of the topic is being accumulated in the driver. See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#output-sinks
Setting a limit on your batch sizes should fix your issue.
Try adding the maxOffsetsPerTrigger setting when reading from Kafka...
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.load()
See https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html for details.

Spark Streaming failing due to error on a different Kafka topic than the one being read

For the following read/write tandem on the topic air2008rand:
import org.apache.spark.sql.streaming.Trigger
(spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("subscribe", "air2008rand")
  .load()
  .groupBy('value.cast("string").as('key))
  .agg(count("*").cast("string") as 'value)
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("includeTimestamp", true)
  .option("topic", "t1")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .outputMode("update")
  .option("checkpointLocation", "/tmp/cp")
  .start)
There is an error generated that refers to a different topic, air2008m1-0:
scala> 19/07/14 13:27:22 ERROR MicroBatchExecution: Query [id = 711d44b2-3224-4493-8677-e5c8cc4f3db4, runId = 68a3519a-e9cf-4a82-9d96-99be833227c0]
terminated with error
java.lang.IllegalStateException: Set(air2008m1-0) are gone.
Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.org$apache$spark$sql$kafka010$KafkaMicroBatchReader$$reportDataLoss(KafkaMicroBatchReader.scala:261)
at org.apache.spark.sql.kafka010.KafkaMicroBatchReader.planInputPartitions(KafkaMicroBatchReader.scala:124)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions$lzycompute(DataSourceV2ScanExec.scala:76)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.partitions(DataSourceV2ScanExec.scala:75)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:65)
This behavior is repeatable by stopping the read/write code (in spark-shell repl) and then re-running it.
Why is there "cross-talk" between different kafka topics here?
The problem is due to a checkpoint directory containing data from an earlier Spark streaming operation. The resolution is to change the checkpoint directory.
The solution was found in a comment (from @jaceklaskowski himself) on this question: [IllegalStateException]: Spark Structured Streaming is terminating Streaming Query with Error
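A minimal sketch of that fix against the query above, where the essential change is pointing the stream at a fresh checkpoint directory ("/tmp/cp-air2008rand-t1" is just an arbitrary example path, not one from the question):

import org.apache.spark.sql.functions.count
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._

// Same read/aggregate/write pipeline as in the question; the change is the
// checkpointLocation, which must be a directory no earlier streaming query has
// written to, so no stale topic/partition state is picked up.
(spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("subscribe", "air2008rand")
  .load()
  .groupBy('value.cast("string").as('key))
  .agg(count("*").cast("string") as 'value)
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "t1")
  .option("checkpointLocation", "/tmp/cp-air2008rand-t1") // fresh directory instead of the reused /tmp/cp
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .outputMode("update")
  .start)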

Configuring StreamingContext in Apache Zeppelin

My goal is to read streaming data from a stream (in my case AWS Kinesis) and then query the data. The problem is that I want to query the last 5 minutes of data on every batch interval. What I found is that it is possible to keep the data in a stream for a certain period (using the StreamingContext.remember(Duration duration) method). Zeppelin's Spark interpreter creates the SparkSession automatically, and I don't know how to configure the StreamingContext. Here's what I do:
val df = spark
.readStream
.format("kinesis")
.option("streams", "test")
.option("endpointUrl", "kinesis.us-west-2.amazonaws.com")
.option("initialPositionInStream", "latest")
.option("format", "csv")
.schema(//schema definition)
.load
So far so good. Then, as far as I can see, the streaming context is started when the write stream is set up and started:
df.writeStream
.format(//output source)
.outputMode("complete")
.start()
But having only the SparkSession, I don't know how to achieve a query over the last X minutes of data. Any suggestions?

Spark Structured Streaming Batch

I am running a batch query against Kafka with Spark Structured Streaming's Kafka source. The snippet below throws an error saying "kafka is not a valid Spark SQL Data Source". The version I am using is spark-sql-kafka-0-10_2.10. Your help is appreciated. Thanks.
Dataset<Row> df = spark
.read()
.format("kafka")
.option("kafka.bootstrap.servers", "*****")
.option("subscribePattern", "test.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load();
Exception in thread "main" org.apache.spark.sql.AnalysisException: kafka is not a valid Spark SQL Data Source.;
I had the same problem; like me, you are using read instead of readStream.
Changing spark.read() to spark.readStream worked fine for me.
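A minimal Scala sketch of that change (the question's snippet is Java, but the fix is the same), assuming a SparkSession named spark and the same subscribe pattern; note that endingOffsets applies only to batch queries, so it is dropped here:

// Read the topic as a stream instead of a batch Dataset
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder for the masked servers in the question
  .option("subscribePattern", "test.*")
  .option("startingOffsets", "earliest")
  .load()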
Use the spark-submit mechanism and pass along --jars spark-sql-kafka-0-10_2.11-2.1.1.jar.
Adjust the versions of Kafka, Scala and Spark in that library according to your own situation.

How to write streaming dataset to Kafka?

I'm trying to do some enrichment of a topic's data, therefore reading from Kafka and writing back to Kafka using Spark Structured Streaming.
val ds = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("subscribe", "topicname")
.load()
val enriched = ds.select("key", "value", "topic").as[(String, String, String)]
  .map(record => enrich(record._1, record._2, record._3))
val query = enriched.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("group.id", groupId)
.option("topic", "desttopic")
.start()
But I'm getting an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source kafka does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:266)
at kafka_bridge.KafkaBridge$.main(KafkaBridge.scala:319)
at kafka_bridge.KafkaBridge.main(KafkaBridge.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Any workarounds?
As T. Gawęda mentioned, there is no kafka format to write streaming datasets to Kafka (i.e. a Kafka sink).
The currently recommended solution in Spark 2.1 is to use the foreach operator.
The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java. To use this, you will have to implement the interface ForeachWriter (Scala/Java docs), which has methods that get called whenever there is a sequence of rows generated as output after a trigger. Note the following important points.
Spark 2.1 (which is currently the latest release of Spark) doesn't have it. The next release, 2.2, will have a Kafka writer; see this commit.
Kafka Sink is the same as Kafka Writer.
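For completeness, a minimal sketch of the foreach approach on Spark 2.1, assuming the enriched stream has been shaped into a Dataset[(String, String)] of key/value pairs; the KafkaForeachWriter class name and its producer settings are illustrative, not part of Spark or of the question's code:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Publishes each (key, value) row of a streaming Dataset to Kafka.
// A producer is opened per partition/epoch in open() and closed in close().
class KafkaForeachWriter(bootstrapServers: String, topic: String)
    extends ForeachWriter[(String, String)] {

  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(record: (String, String)): Unit =
    producer.send(new ProducerRecord(topic, record._1, record._2))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}

Usage, with enriched as a Dataset[(String, String)] and the names from the question:

enriched.writeStream
  .foreach(new KafkaForeachWriter(bootstrapServers, "desttopic"))
  .start()
  .awaitTermination()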
Try this
ds.map(_.toString.getBytes).toDF("value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092"))
.option("topic", topic)
.start
.awaitTermination()
