Issue Streaming Delta Table in Databricks - apache-spark

I'm streaming from a Delta table (source) to a Delta table (target) in Databricks:
%python
df = spark.readStream \
    .format("delta") \
    .load("path/to/source")

query = (df
    .writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .outputMode("append")
    .trigger(once=True)  # run once per scheduled job (every 30 min)
    .option("checkpointLocation", "{0}/{1}/".format(checkpointsPath, key))
    .table(tableName)
)
But it seems that at some point in time the job starts to process less data than it should be processing:
Do you know if there is a max size for processing streaming data, or something similar?
I'm trying to debug by reading the logs, but I can't find any issue.
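For reference, the Delta streaming source exposes per-micro-batch rate-limit options; they control how each batch is sized rather than capping the total amount of data read. A minimal sketch, reusing the source path from above (the values are only illustrative):

%python
# Sketch only: these options limit how much each micro-batch reads from the
# Delta source; they spread work across batches rather than dropping data.
df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1000)    # upper bound on files per micro-batch
    .option("maxBytesPerTrigger", "10g")   # soft upper bound on bytes per micro-batch
    .load("path/to/source"))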

Related

Kafka to Spark and Cassandra Sink on a Spark structured streaming doesn't work on update mode

I am trying to build the below Spark Structured Streaming job that would read from Kafka, perform an aggregation (count on every 1-minute window) and store the result in Cassandra. I am getting an error in update mode:
java.lang.IllegalArgumentException: requirement failed: final_count does not support Update mode.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$.org$apache$spark$sql$execution$datasources$v2$V2Writes$$buildWriteForMicroBatch(V2Writes.scala:121)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:90)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:43)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at
My Spark source is:
import os
from pyspark.sql.functions import from_json, col, unix_timestamp, window
from pyspark.sql.types import TimestampType

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 pyspark-shell'

# `schema` (the JSON schema of the Kafka value) is defined elsewhere
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxxx:9092") \
    .option("subscribe", "yyyy") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
    .select(col("parsed_value.country"), col("parsed_value.city"), col("parsed_value.Location").alias("location"), col("parsed_value.TimeStamp")) \
    .withColumn('currenttimestamp', unix_timestamp(col('TimeStamp'), "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
    .withWatermark("currenttimestamp", "1 minutes")

df.printSchema()
df = df.groupBy(window(df.currenttimestamp, "1 minutes"), df.location) \
    .count()
df = df.select(col("location"), col("window.start").alias("starttime"), col("count"))

df.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", '/tmp/check_point/') \
    .option("keyspace", "cccc") \
    .option("table", "bbbb") \
    .option("spark.cassandra.connection.host", "aaaa") \
    .option("spark.cassandra.auth.username", "ffff") \
    .option("spark.cassandra.auth.password", "eee") \
    .start() \
    .awaitTermination()
The schema for the table in Cassandra is as below:
CREATE TABLE final_count (
    starttime TIMESTAMP,
    location TEXT,
    count INT,
    PRIMARY KEY (starttime, location)
);
It works in update mode when printing to the console, but fails with the error above when writing to Cassandra.
Any suggestions?
You need foreachBatch, as Cassandra is still not a standard sink.
See https://docs.databricks.com/structured-streaming/examples.html#write-to-cassandra-using-foreachbatch-in-scala
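The linked example is in Scala; below is a minimal PySpark sketch of the same foreachBatch idea, reusing the keyspace, connection and auth values from the question and the final_count table from the schema (a sketch, not a drop-in fix):

def writeToCassandra(batchDF, batchId):
    # Each micro-batch is a plain DataFrame, so the batch Cassandra writer applies here.
    batchDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "cccc") \
        .option("table", "final_count") \
        .option("spark.cassandra.connection.host", "aaaa") \
        .option("spark.cassandra.auth.username", "ffff") \
        .option("spark.cassandra.auth.password", "eee") \
        .mode("append") \
        .save()

df.writeStream \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .option("checkpointLocation", "/tmp/check_point/") \
    .start() \
    .awaitTermination()

With a Cassandra primary key of (starttime, location), appended rows overwrite existing ones, so the updated counts produced by update mode end up as upserts.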

Spark Structured Streaming inconsistent output to multiple sinks

I am using Spark Structured Streaming to read data from Kafka and apply a UDF to the dataset. The code is as below:
from pyspark.sql import functions as F

# function_name, KAFKA_CONSUMER_IP, topic_name, topic_name_two and
# checkpoint_location are defined elsewhere
calludf = F.udf(lambda x: function_name(x))

dfraw = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', KAFKA_CONSUMER_IP) \
    .option('subscribe', topic_name) \
    .load()

df = dfraw.withColumn('value', F.col('value').cast('string')) \
    .withColumn('value', calludf(F.col('value')))

ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .start()

dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()

ds.awaitTermination()
dsf.awaitTermination()
Now the problem is that I am getting 10 dataframes as input. 2 of them failed due to some issue with the data, which is understandable. The console displays the remaining 8 processed dataframes, BUT only 6 of those 8 are written to the Kafka topic by the dsf streaming query, even though I have added a checkpoint location to it.
PS: Do let me know if you have any suggestions regarding the code as well. I am new to Spark Structured Streaming, so maybe there is something wrong with the way I am doing it.
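Not a definitive answer, but one thing worth ruling out: each .start() creates an independent query, and only the Kafka sink above has a checkpoint. A sketch that gives the console sink its own checkpoint and waits on all queries at once (checkpoint_location_console is a hypothetical second path):

# Sketch: each query tracks its Kafka offsets in its own checkpoint directory.
ds = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format('console') \
    .option('truncate', False) \
    .option('checkpointLocation', checkpoint_location_console) \
    .start()

dsf = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_CONSUMER_IP) \
    .option("topic", topic_name_two) \
    .option("checkpointLocation", checkpoint_location) \
    .start()

# Block until any active query terminates, instead of waiting on them one by one.
spark.streams.awaitAnyTermination()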

upsert (merge) delta with spark structured streaming

I need to upsert data in real time (with Spark Structured Streaming) in Python.
The data is read in real time (CSV format) and then written to a Delta table (we want to update existing rows, which is why we use MERGE INTO from Delta).
I am using the Delta engine on Databricks.
I coded this:
from delta.tables import *
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.streaming.schemaInference", "true") \
    .appName("SparkTest") \
    .getOrCreate()

sourcedf = spark.readStream.format("csv") \
    .option("header", True) \
    .load("/mnt/user/raw/test_input")  # CSV data that we read in real time

spark.conf.set("spark.sql.shuffle.partitions", "1")

# start from an empty Delta table with the inferred schema
spark.createDataFrame([], sourcedf.schema) \
    .write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("deltaTable")

def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

sourcedf.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
    .option("path", "/mnt/user/raw/PARQUET/output") \
    .start() \
    .awaitTermination()
But nothing gets written to the output path as expected. The checkpoint path gets filled in as expected, and a display of the Delta table gives me results too:
display(table("deltaTable"))
In the Spark UI I can see the writeStream step:
sourcedf.writeStream \ .format("delta") \ ....
first at Snapshot.scala:156 +details
RDD: Delta Table State #1 - dbfs:/user/hive/warehouse/deltatable/_delta_log
Any idea how to fix this so I can upsert CSV data into Delta tables in S3 in real time with Spark?
Best regards
Apologies for a late reply, but just in case anyone else has the same problem: the below worked for me. I wonder if it is because you didn't use "cloudFiles" on your readStream to make use of Auto Loader?
%python
sourcedf = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.includeExistingFiles", "true") \
    .schema(csvSchema) \
    .load("/mnt/user/raw/test_input")
%sql
CREATE TABLE IF NOT EXISTS deltaTable (
    col1 int NOT NULL,
    col2 string NOT NULL,
    col3 bigint,
    col4 int
)
USING DELTA
LOCATION '/mnt/user/raw/PARQUET/output'
%python
def upsertToDelta(microBatchOutputDF, batchId):
    microBatchOutputDF.createOrReplaceTempView("updates")
    microBatchOutputDF._jdf.sparkSession().sql("""
        MERGE INTO deltaTable t
        USING updates s
        ON s.Id = t.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

%python
sourcedf.writeStream \
    .format("delta") \
    .foreachBatch(upsertToDelta) \
    .outputMode("update") \
    .option("checkpointLocation", "/mnt/user/raw/checkpoints/output") \
    .start("/mnt/user/raw/PARQUET/output")

Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'

At the line
if not df.head(1).isEmpty:
I got an exception:
Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
I do not know how to use if on streaming data.
When I use Jupyter to execute each line, the code works and I can get my result, but run as a .py file it fails.
My purpose is this: I want to use streaming to get data from Kafka every second, transform each batch of streaming data (one batch means the data from one second) into a pandas DataFrame, use pandas functions to do something with the data, and finally send the result to another Kafka topic.
Please help me, and forgive my poor English. Thanks a lot.
import json
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local[2]", "OdometryConsumer")
spark = SparkSession(sparkContext=sc) \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "data") \
    .load()

ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print(type(ds))

if not df.head(1).isEmpty:
    alertQuery = ds \
        .writeStream \
        .queryName("qalerts") \
        .format("memory") \
        .start()
    alerts = spark.sql("select * from qalerts")
    pdAlerts = alerts.toPandas()
    a = pdAlerts['value'].tolist()
    d = []
    for i in a:
        x = json.loads(i)
        d.append(x)

    df = pd.DataFrame(d)
    print(df)
    ds = df['jobID'].unique().tolist()
    dics = {}
    for source in ds:
        ids = df.loc[df['jobID'] == source, 'id'].tolist()
        dics[source] = ids
    print(dics)

query = ds \
    .writeStream \
    .queryName("tableName") \
    .format("console") \
    .start()

query.awaitTermination()
Remove if not df.head(1).isEmpty: and you should be fine.
The reason for the exception is simple: a streaming query is a structured query that never ends and is continually executed. It is simply not possible to look at a single element, since there is no "single element" but (possibly) thousands of elements, and it would be hard to tell when exactly you'd like to look under the covers and see just a single element.
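For the stated goal (pandas processing per micro-batch, then publishing to another Kafka topic), a minimal sketch using foreachBatch; the pandas step, the output topic name and the checkpoint path are made-up placeholders:

from pyspark.sql.functions import to_json, struct

def process_batch(batch_df, batch_id):
    # Each micro-batch is a static DataFrame, so toPandas() is allowed here.
    pdf = batch_df.toPandas()
    if pdf.empty:
        return
    # ... pandas processing on pdf goes here ...
    # Convert the result back to a Spark DataFrame and publish it to Kafka;
    # "result_topic" is a made-up name.
    out = spark.createDataFrame(pdf)
    out.select(to_json(struct("*")).alias("value")) \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "result_topic") \
        .save()

ds.writeStream \
    .foreachBatch(process_batch) \
    .trigger(processingTime="1 second") \
    .option("checkpointLocation", "/tmp/checkpoints/pandas_batch") \
    .start() \
    .awaitTermination()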

Count number of records written to Hive table in Spark Structured Streaming

I have this code:
val query = event_stream
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .select(from_json($"value", schema_simple).as("data"))
  .select("data.*")
  .writeStream
  .outputMode("append")
  .format("orc")
  .option("path", "hdfs:***********")
  //.option("path", "/tmp/orc")
  .option("checkpointLocation", "hdfs:**********/")
  .start()

println("###############" + query.isActive)
query.awaitTermination()
I want to count the number of records inserted into Hive.
What options are available, and how do I do it?
I found SparkEventListener TaskEnd. I'm not sure whether it would work for a streaming source; I tried it and it's not working as of now.
One approach I thought of was to make a hiveReader and then count the number of records in the stream.
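One way to approximate this without reading Hive back is the streaming query's own progress metrics; a minimal sketch in PySpark (the question's code is Scala, but lastProgress / recentProgress exist there too), assuming query is the handle returned by start() and a Spark 3.x-style dict result:

# Sketch: each completed micro-batch reports row counts in its progress event.
# Note: some sinks report numOutputRows as -1 when they don't track the metric.
progress = query.lastProgress          # None until the first batch completes
if progress is not None:
    print("rows read in last batch:   ", progress["numInputRows"])
    print("rows written in last batch:", progress["sink"]["numOutputRows"])

# recentProgress keeps only a rolling window of recent batches, so this is an
# approximate running total, not an exact count since the query started.
total_written = sum(p["sink"]["numOutputRows"] for p in query.recentProgress)
print("rows written over recent batches:", total_written)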
