Cassandra Sink for PySpark Structured Streaming from Kafka topic - apache-spark

I want to write Structured Streaming data into Cassandra using the PySpark Structured Streaming API.
My data flow is as follows:
REST API -> Kafka -> Spark Structured Streaming (PySpark) -> Cassandra
Versions are below:
Spark version: 2.4.3
DataStax DSE: 6.7.6-1
Initialize Spark:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Analytics") \
    .config("kafka.bootstrap.servers", "localhost:9092") \
    .config("spark.cassandra.connection.host", "localhost:9042") \
    .getOrCreate()
Subscribe to a topic from Kafka:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic") \
    .load()
Write into Cassandra:
w_df_3 = df...
write_db = w_df_3.writeStream \
    .option("checkpointLocation", '/tmp/check_point/') \
    .format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "analytics") \
    .option("table", "table") \
    .outputMode("update") \
    .start()
I executed it with the following command:
$spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,datastax:spark-cassandra-connector:2.4.0-s_2.11 Analytics.py localhost:9092 topic
I am facing the exception below when starting the write stream to Cassandra:
py4j.protocol.Py4JJavaError: An error occurred while calling o81.start.
: java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.cassandra does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:297)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Could anyone help me out on how to resolve this and proceed further? Any help will be appreciated.
Thanks in advance.

As I mentioned in the comment, if you're using DSE, you can use OSS Apache Spark with the so-called BYOS (bring your own Spark) jar - a special jar that contains DataStax's version of the Spark Cassandra Connector (SCC) with direct support for Structured Streaming.
Since SCC 2.5.0, support for Structured Streaming is also available in the open source version, so you can simply use writeStream with the Cassandra format. Release 2.5.0 also contains a lot of good things previously not available in the open source version, such as additional optimizations. There is a blog post that describes them in great detail.
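For example, once SCC 2.5.0+ is on the classpath (e.g. via --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0; the exact coordinates depend on your Scala build and are an assumption here), a direct streaming write in PySpark should look roughly like the code already in the question:
write_db = w_df_3.writeStream \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "/tmp/check_point/") \
    .option("keyspace", "analytics") \
    .option("table", "table") \
    .outputMode("update") \
    .start()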

Thanks a lot for your response.
I have implemented it using a foreachBatch sink instead of a direct sink.
w_df_3.writeStream \
    .trigger(processingTime='5 seconds') \
    .outputMode('update') \
    .foreachBatch(save_to_cassandra) \
    .start()
It's working. Thank you all.
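For reference, the body of save_to_cassandra is not shown above; a minimal sketch of what such a foreachBatch function might look like (keyspace and table names assumed from the question):
def save_to_cassandra(batch_df, batch_id):
    # Each micro-batch is written with the regular batch Cassandra connector
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "analytics") \
        .option("table", "table") \
        .mode("append") \
        .save()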

Related

Apache Spark with Kafka stream - Missing Kafka

I am trying to set up Apache Spark with Kafka and wrote a simple program locally, but it is failing and I am not able to figure out why from debugging.
build.gradle.kts
implementation ("org.jetbrains.kotlin:kotlin-stdlib:1.4.0")
implementation ("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0_2.12:1.0.0-preview1")
compileOnly("org.apache.spark:spark-sql_2.12:3.0.0")
implementation("org.apache.kafka:kafka-clients:3.0.0")
The main function code is:
val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("Ship metrics").orCreate

val shipmentDataFrame = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test")
    .option("includeHeaders", "true")
    .load()

val query = shipmentDataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query.writeStream()
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()
and I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:194)
at com.tgt.ff.axon.shipmetriics.stream.ShipmentStream.run(ShipmentStream.kt:23)
at com.tgt.ff.axon.shipmetriics.ApplicationKt.main(Application.kt:12)
21/12/25 22:22:56 INFO SparkContext: Invoking stop() from shutdown hook
The Kotlin API for Spark by JetBrains (https://github.com/Kotlin/kotlin-spark-api) has support for streaming since the 1.1.0 update.
There is also an example with a Kafka stream which might be of help to you: https://github.com/Kotlin/kotlin-spark-api/blob/spark-3.2/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/streaming/KotlinDirectKafkaWordCount.kt
It does use the Spark DStream API instead of the Spark Structured Streaming API you appear to be using.
You can, of course, also still use the Structured Streaming API if you prefer that, but then it needs to be deployed as described here.
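In practice the deployment section linked above amounts to putting the Kafka SQL connector on the classpath, for example with spark-submit (the artifact version is assumed to match the Spark 3.0.0 / Scala 2.12 build declared above; the main class and jar names are placeholders):
$spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --class <your-main-class> your-app.jar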

Databricks structured streaming with Snowflake as source?

Is it possible to use a Snowflake table as a source for spark structured streaming in Databricks? When I run the following pyspark code:
options = dict(sfUrl=our_snowflake_url,
               sfUser=user,
               sfPassword=password,
               sfDatabase=database,
               sfSchema=schema,
               sfWarehouse=warehouse)

df = spark.readStream.format("snowflake") \
    .schema(final_struct) \
    .options(**options) \
    .option("dbtable", "BASIC_DATA_TEST") \
    .load()
I get this error:
java.lang.UnsupportedOperationException: Data source snowflake does not support streamed reading
I haven't been able to find anything in the Spark Structured Streaming Docs that explicitly says Snowflake is supported as a source, but I'd like to make sure I'm not missing anything obvious.
Thanks!
The Spark Snowflake connector currently does not support using the .writeStream/.readStream calls from Spark Structured Streaming.
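For comparison, a plain batch read with the connector (which is supported) would look roughly like this, reusing the options dict from the question and assuming the short "snowflake" format name is registered as it is on Databricks:
df = spark.read.format("snowflake") \
    .options(**options) \
    .option("dbtable", "BASIC_DATA_TEST") \
    .load()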

Spark Structured Streaming Batch

I am running a batch query in Spark Structured Streaming. The snippet below throws an error saying "kafka is not a valid Spark SQL Data Source". The library version I am using is spark-sql-kafka-0-10_2.10. Your help is appreciated. Thanks.
Dataset<Row> df = spark
    .read()
    .format("kafka")
    .option("kafka.bootstrap.servers", "*****")
    .option("subscribePattern", "test.*")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load();
Exception in thread "main" org.apache.spark.sql.AnalysisException: kafka is not a valid Spark SQL Data Source.;
I had the same problem, and like me you are using read instead of readStream.
Changing spark.read() to spark.readStream worked fine for me.
Use the spark-submit mechanism and pass along --jars spark-sql-kafka-0-10_2.11-2.1.1.jar.
Adjust the Kafka, Scala, and Spark versions of that library according to your own setup.
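For example (the application class and jar names below are placeholders; adjust the Scala/Spark versions to your build):
$spark-submit --jars spark-sql-kafka-0-10_2.11-2.1.1.jar --class <your-main-class> your-app.jar
Using --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 instead of --jars also lets spark-submit resolve the connector's transitive Kafka dependencies.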

How to write streaming dataset to Kafka?

I'm trying to do some enrichment of the topic data, so I read from Kafka and sink back to Kafka using Spark Structured Streaming.
val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("group.id", groupId)
    .option("subscribe", "topicname")
    .load()

val enriched = ds.select("key", "value", "topic").as[(String, String, String)]
    .map(record => enrich(record._1, record._2, record._3))

val query = enriched.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("group.id", groupId)
    .option("topic", "desttopic")
    .start()
But I'm getting an exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source kafka does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:266)
at kafka_bridge.KafkaBridge$.main(KafkaBridge.scala:319)
at kafka_bridge.KafkaBridge.main(KafkaBridge.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Any workarounds?
As T. Gawęda mentioned, there is no kafka format to write streaming datasets to Kafka (i.e. a Kafka sink).
The currently recommended solution in Spark 2.1 is to use the foreach operator.
The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java. To use this, you will have to implement the interface ForeachWriter (Scala/Java docs), which has methods that get called whenever there is a sequence of rows generated as output after a trigger. Note the following important points.
Spark 2.1 (which is currently the latest release of Spark) doesn't have it. The next release, 2.2, will have a Kafka writer; see this commit.
Kafka Sink is the same as Kafka Writer.
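Once on Spark 2.2 or later, the Kafka sink can be used directly from writeStream; a minimal sketch in PySpark (the server, topic, and the enriched_df name are assumptions for illustration):
# The Kafka sink expects a string or binary "value" column
# (and optionally "key"/"topic"), plus a checkpoint location.
query = enriched_df.selectExpr("CAST(value AS STRING) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "desttopic") \
    .option("checkpointLocation", "/tmp/kafka_sink_checkpoint") \
    .start()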
Try this
ds.map(_.toString.getBytes).toDF("value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", topic)
    .start()
    .awaitTermination()

Consume Kafka in Spark Streaming (Spark 2.0)

I found there are two methods to consume a Kafka topic in Spark Streaming (Spark 2.0):
1) using KafkaUtils.createDirectStream to get a DStream every k seconds; please refer to this document
2) using Kafka: sqlContext.read.format("json").stream("kafka://KAFKA_HOST") to create an infinite DataFrame for Spark 2.0's new feature, Structured Streaming; the related doc is here
Method 1) works, but 2) doesn't; I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.stream(Ljava/lang/String;)Lorg/apache/spark/sql/Dataset;
...
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My questions are:
What's “kafka://KAFKA_HOST” referring to?
How should I fix this problem?
Thank you in advance!
Spark 2.0 doesn't yet support Kafka as a source of infinite DataFrames/Datasets. Support is planned to be added in 2.1.
Edit (6.12.2016):
Kafka 0.10 is now experimentally supported in Spark 2.0.2:
val ds1 = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .load()

ds1
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .as[(String, String)]
