Configuring StreamingContext in Apache Zeppelin - apache-spark

My goal is to read streaming data from a stream (in my case AWS Kinesis) and then query the data. The problem is that I want to query the last 5 minutes of data on every batch interval. I found that it is possible to keep the data in a stream for a certain period (using the StreamingContext.remember(Duration duration) method). Zeppelin's Spark interpreter creates the SparkSession automatically, and I don't know how to configure the StreamingContext. Here's what I do:
val df = spark
.readStream
.format("kinesis")
.option("streams", "test")
.option("endpointUrl", "kinesis.us-west-2.amazonaws.com")
.option("initialPositionInStream", "latest")
.option("format", "csv")
.schema(/* schema definition */)
.load
So far so good. Then as far as I can see the streaming context is started when the write stream is set and started:
df.writeStream
.format(/* output sink */)
.outputMode("complete")
.start()
But with only the SparkSession, I don't know how to query the last X minutes of data. Any suggestions?
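The code above uses Structured Streaming (readStream/writeStream) rather than DStreams, so one option that may fit is a sliding event-time window instead of StreamingContext.remember. The following is only a sketch under assumptions: the built-in rate source stands in for Kinesis, the event-time column is assumed to be called timestamp, and the 5-minute window, 1-minute slide, and console sink are illustrative choices rather than anything taken from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

object LastFiveMinutes {
  def main(args: Array[String]): Unit = {
    // In Zeppelin the session already exists; getOrCreate() then returns it.
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("last-five-minutes")
      .getOrCreate()

    // Assumption: the rate source replaces Kinesis here. It emits rows with
    // the columns `timestamp` and `value`.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Count events per 5-minute window, sliding every minute. The watermark
    // lets Spark drop state for windows that can no longer receive data.
    val counts = events
      .withWatermark("timestamp", "5 minutes")
      .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
      .count()

    val query = counts.writeStream
      .format("console")
      .outputMode("update")
      .option("truncate", "false")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```

With a 1-minute slide, each event falls into five overlapping windows, so the most recent complete window approximates "the last 5 minutes" at every trigger.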

Related

Spark structured streaming job failing after running for 3-4 days with java.io.FileNotFoundException: File does not exist

We have a Spark Structured Streaming job written in Scala running in production, which reads from a Kafka topic and writes to an HDFS sink. The trigger time is 60 seconds. The job was deployed 4 months ago, and after running well for a month, we started getting the error below, after which the job fails instantly:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/XYZ/hive/prod/landing_dir/abc/_spark_metadata/.edfac8fb-aa6d-4c0e-aa19-48f44cc29774.tmp (inode 6736258272) Holder DFSClient_NONMAPREDUCE_841796220_49 does not have any open files
Earlier this issue was not regular, i.e. it happened once every 2-3 weeks. Over the last month, this error has become very frequent, occurring every 3-4 days and failing the job. We restart this job once a week as part of regular maintenance. The Spark version is 2.3.2 and we run on the YARN cluster manager. From the error, it appears that something is going wrong within the Write Ahead Log (WAL) directory, since the path points to _spark_metadata. We would like to understand what is causing this exception and how we can handle it. Is this something we can handle in our application, or is it an environment issue that needs to be addressed at the infra level?
Below is the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
.option("subscribe",StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets",StreamerConfig.kafkaStartingOffset)
.option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()

How do we manage offsets in Spark Structured Streaming? (Issues with _spark_metadata)

Background:
I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found that, in order to support the exactly-once guarantee, Spark creates a _spark_metadata folder. When the streaming app runs for a long time, this folder grows so large that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Streaming. But I want to know how to get the offsets and other metadata so that we can manage checkpointing ourselves in Spark Structured Streaming. Do you have any sample program that implements checkpointing?
How do we manage offsets in Spark Structured Streaming?
Looking at this JIRA, https://issues-test.apache.org/jira/browse/SPARK-18258, it looks like offsets are not provided. How should we go about it?
The issue is that in 6 hours the size of the metadata increased to 45 MB, and it keeps growing until it reaches nearly 13 GB. The driver memory allocated is 5 GB. At that point the system crashes with OOM. How can we avoid the metadata growing so large? How can we make the metadata log less information?
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("startingOffsets", "earliest")
.load();
2. Use from_json API from Spark to extract your data for further transformation in a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
StreamingQuery query = flatDf.writeStream().format("parquet")
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
.select("event.metadata", "event.data", "event.connection", "event.registration_event","event.version_event"
);
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
Dataset<Row> flatDf = sqlContext
.sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");
StreamingQuery query = flatDf
.writeStream()
.outputMode("append")
.option("compression", "snappy")
.format("parquet")
.option("checkpointLocation", checkpointLocation)
.option("path", outputPath)
.partitionBy("date", "time", "id")
.trigger(Trigger.ProcessingTime(triggerProcessingTime))
.start();
query.awaitTermination();
For non-batch Spark Structured Streaming Kafka integration:
Quote:
Structured Streaming ignores the offsets commits in Apache Kafka. Instead, it relies on its own offsets management on the driver side which is responsible for distributing offsets to executors and for checkpointing them at the end of the processing round (epoch or micro-batch).
You need not worry if you follow the Spark Kafka integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
For batch the situation is different: you need to manage that yourself and store the offsets.
UPDATE
Based on the comments, I suggest the question is slightly different and advise you to look at Spark Structured Streaming Checkpoint Cleanup. In addition, given your updated comments and the fact that there is no error, I suggest you consult this article on metadata for Spark Structured Streaming: https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, it is different from my style, but I cannot see any obvious error.
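If the aim is mainly to see which offsets each micro-batch covered, one way that may help is a StreamingQueryListener, which reports per-source start and end offsets in its progress events. This is a minimal sketch in Scala, not a replacement for checkpointing: Spark still manages the offsets itself, and the names OffsetLoggingListener and register are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Hypothetical listener that prints the offsets covered by each micro-batch.
class OffsetLoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val progress = event.progress
    // startOffset and endOffset are JSON strings describing the source offsets.
    progress.sources.foreach { source =>
      println(
        s"batch=${progress.batchId} source=${source.description} " +
          s"start=${source.startOffset} end=${source.endOffset}")
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

object OffsetLogging {
  // Attach the listener to an existing session before starting the query.
  def register(spark: SparkSession): Unit =
    spark.streams.addListener(new OffsetLoggingListener)
}
```

Calling OffsetLogging.register(spark) before start() would log the offsets; storing them somewhere durable instead of printing them is left to the application.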

Spark Structured Streaming with Kafka source, change number of topic partitions while query is running

I've set up a Spark structured streaming query that reads from a Kafka topic.
If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice and data on new partitions is not consumed.
Is there a way to tell Spark to check for new partitions in the same topic, apart from stopping the query and restarting it?
EDIT:
I'm using Spark 2.4.4. I read from kafka as follows:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaURL)
.option("startingOffsets", "earliest")
.option("subscribe", topic)
.option("failOnDataLoss", value = false)
.load()
after some processing, I write to HDFS on a Delta Lake table.
Answering my own question. Kafka consumers check for new partitions/topic (in case of subscribing to topics with a pattern) every metadata.max.age.ms, whose default value is 300000 (5 minutes).
Since my test was lasting far less than that, I wouldn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option of the DataStreamReader:
.option("kafka.metadata.max.age.ms", 100)

How to validate every row of streaming batch?

I need to validate each row of a streaming DataFrame (consumed through readStream with Kafka), but I am getting this error:
Queries with streaming sources must be executed with writeStream.start()
because it does not allow validating row by row.
I have created a Spark batch job that consumes data from Kafka, validates each row against HBase data, runs another set of validations based on rowkey, and creates a DataFrame out of the result. But there I need to handle the Kafka offsets manually in the code.
To avoid the offset handling, I am trying to use Spark Structured Streaming, but there I am not able to validate row by row.
writeStream.foreach(ForeachWriter) is good for sinking to any external data source or writing to Kafka.
But in my case, I need to validate each row and create a new DataFrame based on my validation. The ForeachWriter process method does not allow collecting the data using other external classes/lists.
Errors:
I get this error when I try to access the streaming DataFrame to validate it:
Queries with streaming sources must be executed with writeStream.start();
I get "Task is not serializable" when I try to create a list out of foreach (with an object extending ForeachWriter). Will update with more details (as I am trying other options).
I am trying to get the Spark batch behavior using writeStream.trigger(Trigger.Once) with checkpointLocation.
Updating with the Spark batch and Structured Streaming code.
val rawData = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootStrap)
.option("subscribePattern", kafkaSubTopic)
.option("startingOffsets", "earliest")
//.option("endingOffsets", "latest")
.load()
rawData.collect.foreach(row => {
if (dValidate.dValidate(row)) {
validatedCandidates += (row.getString(0))
}
})
In the code above I need to handle the offsets manually for reruns, so I decided to use Structured Streaming:
val rawData = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootStrap)
.option("subscribe", kafkaSubTopic)
.option("enable.auto.commit", "true")
.option("startingOffsets","latest")
.option("minPartitions", "10")
.option("failOnDataLoss", "true")
.load()
val sInput = new SinkInput(validatedCandidates,dValidate)
rawData.writeStream
.foreach(sInput)
.outputMode(OutputMode.Append())
.option("truncate", "false")
.trigger(Trigger.Once())
.start()
I am getting a "Task not serializable" error here.
With the SinkInput class, I am trying to do the same collect operation with the external dValidate instance.
Unless I misunderstood your case, rawData is a streaming query (a streaming Dataset) and does not support collect. The following part of your code is not correct:
rawData.collect
That's not supported and hence the exception.
You should be using foreach or foreachBatch to access rows.
Do this instead:
rawData.writeStream.foreach(...)
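For the foreachBatch route (available from Spark 2.4), a rough sketch could look like the following. Inside foreachBatch each micro-batch is an ordinary DataFrame, so batch operations are allowed on it. Everything specific here is an assumption for illustration: the isValid rule stands in for the real HBase check, and the broker address, topic name, output path, and checkpoint path are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

object ValidateEachBatch {
  // Hypothetical validation rule; the real check (e.g. against HBase) goes here.
  def isValid(value: String): Boolean = value != null && value.trim.nonEmpty

  // Called once per micro-batch; `batch` is a regular, non-streaming DataFrame.
  def processBatch(batch: DataFrame, batchId: Long): Unit = {
    import batch.sparkSession.implicits._
    val validated = batch
      .select(col("value").cast("string").as("value"))
      .as[String]
      .filter(isValid _)
    // Placeholder output path; any batch sink can be used here.
    validated.write.mode("append").text("/tmp/validated")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("validate-each-batch").getOrCreate()

    val rawData = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // placeholder
      .option("subscribe", "topic1")                   // placeholder
      .option("startingOffsets", "latest")
      .load()

    // Passing a method reference avoids overload ambiguity on foreachBatch
    // that some Scala 2.12 builds report for inline lambdas.
    val query = rawData.writeStream
      .foreachBatch(processBatch _)
      .option("checkpointLocation", "/tmp/validate-checkpoint") // placeholder
      .trigger(Trigger.Once())
      .start()

    query.awaitTermination()
  }
}
```

Keeping the validation logic in a top-level object rather than capturing an external instance is one way to sidestep the "Task not serializable" error, since nothing non-serializable is pulled into the closure.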

How to stop streaming from a Kafka topic after a limited time duration or record count?

My ultimate goal is to check whether a Kafka topic is running and whether the data in it is good, and otherwise fail / throw an error.
If I could pull just 100 messages, or pull for just 60 seconds, I think I could accomplish what I want. But none of the streaming examples / questions I have found online intend to shut down the streaming connection.
Here is the best working code I have so far. It pulls data and displays it, but it keeps trying to pull more data, and if I try to access the data in the next line, it hasn't had a chance to pull anything yet. I assume I need some sort of callback. Has anyone done something similar? Is this the best way of going about this?
I am using Databricks notebooks to run my code.
import org.apache.spark.sql.functions.{explode, split}
val kafka = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<kafka server>:9092")
.option("subscribe", "<topic>")
.option("startingOffsets", "earliest")
.load()
val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
display(df.select($"word"))
The trick is you don't need streaming at all. Kafka source supports batch queries, if you replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
.read
.format("kafka")
... // Remaining options
.load()
You can find examples in the Kafka streaming documentation.
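As a rough illustration of the batch approach, the sketch below reads a bounded range from the topic, keeps at most 100 messages, and fails if nothing comes back. The server and topic placeholders are carried over from the question, the object name KafkaTopicCheck is made up, and the "at least one message" rule is only an assumed stand-in for whatever "good data" means in your case.

```scala
import org.apache.spark.sql.SparkSession

object KafkaTopicCheck {
  def main(args: Array[String]): Unit = {
    // In a notebook the session already exists; getOrCreate() returns it.
    val spark = SparkSession.builder.appName("kafka-topic-check").getOrCreate()

    // Batch read: the query ends once the bounded offset range is consumed.
    val sample = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "<kafka server>:9092")
      .option("subscribe", "<topic>")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .limit(100)
      .collect()

    // Assumed health rule: the topic must yield at least one message.
    require(sample.nonEmpty, "No messages could be read from the topic")

    // Print a few values for a manual sanity check.
    sample.take(5).foreach(row => println(row.getString(0)))
  }
}
```

Because this is a plain batch job, it returns on its own and the collected rows are available on the next line, with no callback needed.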
For streaming queries you can use the once trigger, although it might not be the best choice in this case:
df.writeStream
.trigger(Trigger.Once)
... // Handle the output, for example with foreach sink (?)
You could also use standard Kafka client to fetch some data without starting SparkSession.

Resources