Kafka to Spark with a Cassandra sink in Structured Streaming doesn't work in update mode - apache-spark

I am trying to build the Spark Structured Streaming job below, which reads from Kafka, performs an aggregation (a count over every one-minute window) and stores the result in Cassandra. I am getting an error in update mode:
java.lang.IllegalArgumentException: requirement failed: final_count does not support Update mode.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$.org$apache$spark$sql$execution$datasources$v2$V2Writes$$buildWriteForMicroBatch(V2Writes.scala:121)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:90)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:43)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
...
My Spark source is:
import os
from pyspark.sql.functions import col, from_json, unix_timestamp, window
from pyspark.sql.types import TimestampType

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 pyspark-shell'

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxxx:9092") \
    .option("subscribe", "yyyy") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
    .select(col("parsed_value.country"),
            col("parsed_value.city"),
            col("parsed_value.Location").alias("location"),
            col("parsed_value.TimeStamp")) \
    .withColumn('currenttimestamp', unix_timestamp(col('TimeStamp'), "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
    .withWatermark("currenttimestamp", "1 minutes")

df.printSchema()

df = df.groupBy(window(df.currenttimestamp, "1 minutes"), df.location) \
    .count()

df = df.select(col("location"), col("window.start").alias("starttime"), col("count"))

df.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", '/tmp/check_point/') \
    .option("keyspace", "cccc") \
    .option("table", "bbbb") \
    .option("spark.cassandra.connection.host", "aaaa") \
    .option("spark.cassandra.auth.username", "ffff") \
    .option("spark.cassandra.auth.password", "eee") \
    .start() \
    .awaitTermination()
The schema for the table in Cassandra is as below:
CREATE TABLE final_count (
    starttime TIMESTAMP,
    location TEXT,
    count INT,
    PRIMARY KEY (starttime, location)
);
It works in update mode when printing to the console, but fails with the error above when writing to Cassandra.
Any suggestions?

You need foreachBatch, as Cassandra is still not a standard streaming sink.
See https://docs.databricks.com/structured-streaming/examples.html#write-to-cassandra-using-foreachbatch-in-scala
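A minimal PySpark sketch of that approach, reusing the aggregated df and the connection options from the question (the helper name write_to_cassandra is just illustrative):

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch arrives as a static DataFrame, so the regular batch
    # writer of the Cassandra connector can be used inside foreachBatch.
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "cccc") \
        .option("table", "bbbb") \
        .option("spark.cassandra.connection.host", "aaaa") \
        .option("spark.cassandra.auth.username", "ffff") \
        .option("spark.cassandra.auth.password", "eee") \
        .mode("append") \
        .save()

df.writeStream \
    .outputMode("update") \
    .foreachBatch(write_to_cassandra) \
    .option("checkpointLocation", "/tmp/check_point/") \
    .start() \
    .awaitTermination()

Because Cassandra writes are upserts by primary key, appending each updated (starttime, location) row effectively overwrites the previous count for that window.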

Related

Spark Structured Streaming inconsistent output to multiple sinks

I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is as follows:
calludf = F.udf(lambda x: function_name(x))

dfraw = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
    .option('subscribe', topic_name) \
    .load()

df = dfraw.withColumn("value", F.col('value').cast('string')) \
    .withColumn('value', calludf(F.col('value')))

ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .start()

dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()

ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them failed due to some issue with the data, which is understandable. The console displays the remaining 8 processed dataframes, BUT only 6 of those 8 are written to the Kafka topic by the dsf streaming query. Even though I have added a checkpoint location to it, it is still not working.
PS: Do let me know if you have any suggestions regarding the code as well. I am new to Spark Structured Streaming, so maybe there is something wrong with the way I am doing it.

Spark Streaming Kafka - How to stop streaming after processing all existing messages (gracefully)

This is what I am trying to do:
Stream data from a Kafka topic, which keeps getting data continuously.
Run the job twice a day, to process all data existing at that point, and then stop the stream.
So I initially called stop on the query, but it was throwing a "TimeoutException".
Then I tried increasing the timeout dynamically, but now I am getting java.io.IOException: Caused by: java.lang.InterruptedException.
So, is there any way to gracefully stop the stream without getting any exceptions?
Below is my current code (in part), which is throwing the InterruptedException:
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", os.environ["KAFKA_SERVERS"])
    .option("subscribe", config.kafka.topic)
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 25000)
    .load()
)

# <do some processing and save the data>
def save_batch(batch_df, batch_id):
    pass

query = df.writeStream.foreachBatch(save_batch).start(
    outputMode="append",
    checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
)

while query.isActive:
    progress = query.lastProgress
    if progress and progress["numInputRows"] < 25000 * 0.9:
        # The last micro-batch was well below the rate limit, so the backlog
        # is (nearly) drained: raise the stop timeout and stop the query.
        timeout = sum(progress["durationMs"].values())
        timeout = min(5 * 60 * 1000, max(15000, timeout))
        spark.conf.set("spark.sql.streaming.stopTimeout", str(timeout))
        query.stop()
        break
    time.sleep(10)
Spark Version: 2.4.5
Scala Version: 2.1.1
Update: with Spark 3.3, .trigger(availableNow=True) is an option that plays nicely with .option("maxOffsetsPerTrigger", 25000); see the sketch after the example below.
I would recommend .trigger(once=True) and .awaitTermination() (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers).
Warning: this will not work with .option("maxOffsetsPerTrigger", 25000), but if maxOffsetsPerTrigger is not set it will default to pulling all offsets since the last run into one large micro-batch.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", os.environ["KAFKA_SERVERS"]) \
    .option("subscribe", config.kafka.topic) \
    .option("startingOffsets", "earliest") \
    .load()

def foreach_batch_function(df, epoch_id):
    # Transform and write batchDF
    pass

df \
    .writeStream \
    .foreachBatch(foreach_batch_function) \
    .trigger(once=True) \
    .start(
        outputMode="append",
        checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
    ) \
    .awaitTermination()
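For Spark 3.3+, a minimal sketch of the availableNow variant mentioned above, reusing the same df and foreach_batch_function; unlike once=True, this trigger honours maxOffsetsPerTrigger and drains the backlog in several rate-limited micro-batches before the query stops on its own:

# Spark 3.3+: process everything available now in rate-limited micro-batches,
# then let the query stop by itself.
df \
    .writeStream \
    .foreachBatch(foreach_batch_function) \
    .trigger(availableNow=True) \
    .start(
        outputMode="append",
        checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
    ) \
    .awaitTermination()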

Issue Streaming Delta Table in Databricks

I'm streaming from a Delta Table (source) to a Delta Table (target) in Databricks
%python
df = spark.readStream \
    .format("delta") \
    .load("path/to/source")

query = (df
    .writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .outputMode("append")
    .trigger(once=True)  # Every 30 min
    .option("checkpointLocation", "{0}/{1}/".format(checkpointsPath, key))
    .table(tableName)
)
But it seems that at some point in time the job starts to process less data than it should be processing:
Do you know if there is a maximum size for processing streaming data, or something similar?
I'm trying to debug by reading the logs, but I can't find any issue.

Pyspark Kafka structured streaming: error while writing out

I am able to read a stream from a Kafka topic and write the (transformed) data back to another Kafka topic in two different steps in PySpark. The code to do that is as follows:
# Define stream:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "instream") \
    .load()

# Transform
matchdata = df.select(from_json(F.col("value").cast("string"), schema).alias("value")) \
    .select(F.col('value').cast("string"))

# Stream the data from the Kafka topic into a Spark in-memory table
query = matchdata \
    .writeStream \
    .format("memory") \
    .queryName("PositionTable") \
    .outputMode("append") \
    .start()

query.awaitTermination(5)

# Create a new dataframe after the stream completes:
tmp_df = spark.sql("select * from PositionTable")

# Write data to a different Kafka topic
tmp_df \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "outstream") \
    .save()
The code above works as expected: the data in Kafka topic "instream" is read in PySpark, and then PySpark can write out data to Kafka topic "outstream".
However, I would like to read the stream in and write the transformed data back out immediately (the stream will be unbounded and we would like insights immediately as the data rolls in). Following the documentation, I replaced the query above with the following:
query = matchdata \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "outstream") \
    .option("checkpointLocation", "/path/to/HDFS/dir") \
    .start()
This does not appear to work.
There is no error message, so I do not know what is wrong. I've also tried windowing and aggregating within windows, but that also does not work. Any advice will be appreciated!
Ok, I found the problem. The main reason was that the subdirectory "path/to/HDFS/dir" has to exist. After creating that directory, the code ran as expected. It would have been nice if an error message had stated something along those lines.
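For reference, a minimal sketch of pre-creating that checkpoint directory before starting the query, assuming the hdfs CLI is available on the driver (the path is the placeholder from the code above):

import subprocess

# Create the checkpoint directory (and any missing parents) on HDFS up front,
# so the streaming query is not started against a non-existent location.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/path/to/HDFS/dir"], check=True)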

Structured Streaming (Spark 2.3.0) cannot write to Parquet file sink when submitted as a job

I'm consuming from Kafka and writing to Parquet in EMRFS. The code below works in spark-shell:
val filesink_query = outputdf.writeStream
  .partitionBy(<some column>)
  .format("parquet")
  .option("path", <some path in EMRFS>)
  .option("checkpointLocation", "/tmp/ingestcheckpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Append)
  .start
SBT is able to package the code without errors. When the .jar is sent to spark-submit, the job is accepted and stays in running state forever without writing data to HDFS.
There is no ERROR in the .inprogress log
Some posts suggest that a large watermark duration can cause it, but I have not set a custom watermark duration.
I can write to Parquet using PySpark; here is my code in case it is useful:
stream = self.spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", self.kafka_bootstrap_servers) \
    .option("subscribe", self.topic) \
    .option("startingOffsets", self.startingOffsets) \
    .option("max.poll.records", self.max_poll_records) \
    .option("auto.commit.interval.ms", self.auto_commit_interval_ms) \
    .option("session.timeout.ms", self.session_timeout_ms) \
    .option("key.deserializer", self.key_deserializer) \
    .option("value.deserializer", self.value_deserializer) \
    .load()

self.query = stream \
    .select(col("value")) \
    .select((self.proto_function("value")).alias("value_udf")) \
    .select(*columns,
            date_format(column_time, "yyyy").alias("date").alias("year"),
            date_format(column_time, "MM").alias("date").alias("month"),
            date_format(column_time, "dd").alias("date").alias("day"),
            date_format(column_time, "HH").alias("date").alias("hour"))

query = self.query \
    .writeStream \
    .format("parquet") \
    .option("checkpointLocation", self.path) \
    .partitionBy("year", "month", "day", "hour") \
    .option("path", self.path) \
    .start()
Also, you need to run the code like this: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 <code>
