What is the purpose of StreamingQuery.awaitTermination? - apache-spark

I have a Spark Structured Streaming job that reads from a Kafka topic and writes the data to an Aerospike database. I am currently making this job production ready and implementing SparkListener.
While going through the documentation I stumbled upon this example:
StreamingQuery query = wordCounts.writeStream()
    .outputMode("complete")
    .format("console")
    .start();

query.awaitTermination();
After this code is executed, the streaming computation will have
started in the background. The query object is a handle to that active
streaming query, and we have decided to wait for the termination of
the query using awaitTermination() to prevent the process from exiting
while the query is active.
I understand that it waits for the query to complete before terminating the process.
What does that mean exactly? Does it help to avoid losing data written by the query?
How is it helpful when the query is writing millions of records every day?
My code looks pretty simple though:
dataset.writeStream()
    .option("startingOffsets", "earliest")
    .outputMode(OutputMode.Append())
    .format("console")
    .foreach(sink)
    .trigger(Trigger.ProcessingTime(triggerInterval))
    .option("checkpointLocation", checkpointLocation)
    .start();

There are quite a few questions here, but answering just the one below should answer them all.
I understand that it waits for the query to complete before terminating the process. What does that mean exactly?
A streaming query runs in separate daemon threads. In Java, daemon threads do not keep the JVM alive: as soon as the last non-daemon thread finishes (dies), the JVM shuts down and the entire Spark application finishes with it.
That's why you need to keep the main non-daemon thread waiting; it gives the daemon threads running the query time to do their work.
Read up on daemon threads in What is a daemon thread in Java?
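To make that concrete, here is a minimal sketch (the topic, servers, and checkpoint path are placeholders, not taken from the question): start() returns immediately and the query runs on background daemon threads, so without the final awaitTermination() the main thread would reach the end of main, the JVM would exit, and the query would be killed with it.

import org.apache.spark.sql.SparkSession

object AwaitTerminationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("await-termination-demo").getOrCreate()

    // Hypothetical Kafka source.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .load()

    // start() only kicks off the query on daemon threads and returns a handle.
    val query = lines.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/demo-checkpoint")
      .start()

    // Blocks the main (non-daemon) thread until the query stops or fails.
    query.awaitTermination()
  }
}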

I understand that it waits for the query to complete before terminating the process.
What does it mean exactly?
Nothing more, nothing less. Since the query is started in the background, without an explicit blocking instruction your code would simply reach the end of the main function and exit immediately.
How is it helpful when the query is writing millions of records every day?
It really doesn't. It instead ensures that the query is executed at all.

Related

HBase batch loading with speed control cause of slow consumer

We need to load a large part of the data from HBase using Spark.
We then put it into Kafka, where it is read by a consumer, but the consumer is too slow.
At the same time, Kafka does not have enough memory to hold the whole scan result.
Our key contains ...yyyy.MM.dd, and right now we load 30 days in one Spark job, using a filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, which would make the overall scan too slow.
Right now we launch the Spark job with 100 threads, but we can't make it slower by using fewer threads (for example 7), because Kafka is used by third-party developers, which sometimes makes it too busy to accept any data. So we need to control the HBase scan speed, checking all the time whether Kafka has room to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan produces many small files, and it is a problem to group them by size (or is there a way? if you know one, please tell me how); storing lots of small files in HDFS is bad, and merging such files is a very expensive operation that takes so long it makes the total time too slow. One way to reduce the small-files problem is sketched right below.
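Not part of the original question, but a hedged sketch of one way to reduce that small-files problem: repartition the staged result to a target number of files before writing it as ORC (the scanned DataFrame and the HDFS path below are hypothetical).

// `scanned` is assumed to hold the HBase scan result as a DataFrame.
val targetFiles = 200  // tune so each output file ends up near the HDFS block size

scanned
  .repartition(targetFiles)
  .write
  .mode("overwrite")
  .orc("hdfs:///staging/hbase-scan/2019.01.01")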
Suggested solutions:
Maybe it is possible to have Spark store the scan result in HDFS (by setting some special flag in the filter operator) and then run 30 Spark jobs that select data from the saved result and push each result to Kafka when possible.
Maybe Spark has an existing mechanism to stop and resume launched jobs.
Maybe Spark has an existing mechanism to split the result into batches (without the ability to stop and resume loading).
Maybe Spark has an existing mechanism to split the result into batches (with the ability to stop and resume loading based on an external condition).
Maybe, when Kafka throws an exception (because there is no room to store data), Spark has some backpressure mechanism that pauses the scan for a while when such exceptions appear during execution (though I guess there is only a limited number of retries of the failing operator; is it possible to make it retry forever, if that is a real solution?). It would be better, though, to keep some free space in Kafka rather than wait until it is overloaded.
Should we use a PageFilter in HBase (though I guess that is hard to implement), or are there other solution variants? I also guess there would be too many objects in memory to use a PageFilter.
P.S.
This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use a filter.
Any ideas would be helpful.

How can a Spark structured streaming [2.2.x or 2.3.x] application signal that it is ready to consume from a Kafka topic

Framing the question
This question stems from the following problem:
I want to test a Spark Structured Streaming [2.2.x or 2.3.x] application that reads its input from Kafka (without a "from beginning" flag).
The app essentially reads like this:
val sparkSession = SparkSession.builder.getOrCreate()

val lines = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()
Once the app is started and running, it may take an arbitrary amount of time for it to start listening to the Kafka topic.
How can I post the input data to Kafka after waiting as little as possible?
Naive solution
A simple solution to the problem would be to wait a large arbitrary amount of time after starting the app:
startApplication()
Thread.sleep(10*1000)
postInputDataToKafka()
This is problematic on 2 accounts:
- Not all environments are equal, and some may take longer than you expected
- It's wasteful
Complex solution
Another option would be to use a global supervisor, that is, some process that coordinates the test.
The same process that starts the application waits to receive a signal from it that it is ready to listen; once this signal is received, it starts posting the input data.
This approach requires the application to send such a signal; my question is how to do so.
You can wait until StreamingQuery.lastProgress returns a non-null value, such as
import org.apache.spark.sql.streaming.StreamingQuery

val q: StreamingQuery = ... // start a streaming query

// lastProgress stays null until the query reports its first progress update
while (q.lastProgress == null) {
  Thread.sleep(100)
}
postInputDataToKafka

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing, and task allocation work? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning a number of task slots of your executors were reserved for data ingestion while the others were there for processing. This had an impact on how you size your executors in terms of cores.
Take for example the advice on how to launch Spark Streaming with --master local. Everyone would tell you that in the case of Spark Streaming one should use at least local[2], because one of the cores will be dedicated to running the long-lived receiver task that never ends, and the other core will do the data processing.
So if the answer is that in this case the task does both the reading and the processing at once, then the question that follows is: is that really smart? I mean, this sounds like it should be asynchronous. We want to be able to fetch while we process, so that for the next processing step the data is already there. However, if there is only one core to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would have remained somewhat the same, in the sense that a task would be launched to read but that the processing would be done in another task. That would mean that if the processing task is not done yet, we can still keep reading, up to a certain memory limit.
Can someone outline with clarity what exactly is going on here?
EDIT1
We don't even need the memory-limit control, just the ability to fetch while the processing is going on and stop right there. In other words, the two processes should be asynchronous and the limit is simply to be one step ahead. To me, if somehow this is not happening, I find it extremely strange that Spark would implement something that breaks performance like that.
Does the Spark task corresponding to a Kafka partition both read and process the data?
The relationship is very close to what you describe, if by "a task" we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
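For illustration only, a hedged sketch of a direct-stream job (topic, group id, and servers are placeholders): there is no dedicated receiver task, so the task that fetches a given Kafka partition is the same one that runs the map on its records.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("direct-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest"
)

// Each Kafka partition becomes one RDD partition; the task that reads it
// also runs the map below on it (no separate long-running receiver task).
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("test"), kafkaParams)
)

stream.map(record => record.value.length).print()

ssc.start()
ssc.awaitTermination()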

Using Spark in while loop to process log files

I have a server that generates some log files every second, and I want to process these files using Apache Spark.
I wrote a Spark application in Python, and in a while loop I process a group of log files.
I stop the SparkContext in each iteration and start it again for the next step.
My question is: what is the best approach for this kind of application, which runs indefinitely and processes batches or groups of generated files? Should I use an infinite while loop, or should I run my code in a cron job, or even a scheduling framework like Airflow?
The best possible way to solve this is to use Spark Streaming. Spark Streaming enables you to process live data streams; it currently works with Kafka, Flume, HDFS, S3, Amazon Kinesis, and Twitter. Hence, you should first insert these logs into Kafka and then write a Spark Streaming program that processes the live stream of logs. This is a cleaner solution than using infinite loops and starting and stopping the SparkContext multiple times.
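As a rough sketch of what such a job could look like, using the newer Structured Streaming API rather than DStreams (the topic, servers, and checkpoint path are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("log-stream").getOrCreate()

// Read the log lines that were produced to Kafka.
val logLines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "server-logs")
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// Replace the console sink with whatever processing/storage the job actually needs.
val query = logLines.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/log-stream-checkpoint")
  .start()

query.awaitTermination()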

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook to the job and sending it a SIGTERM.
sys.ShutdownHookThread {
  logger.info("Gracefully stopping Application...")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  logger.info("Application stopped gracefully")
}
It seems to work but does not look like the cleanest way to stop the job. Am I missing something here?
From a code perspective it may make sense, but how do you use this in a cluster environment? If we start a Spark Streaming job (we distribute jobs across all the nodes in the cluster), we have to keep track of the PID of the job and the node it is running on. Finally, when we have to stop the process, we need to know which node the job was running on and its PID. I was just hoping there would be a simpler way of job control for streaming jobs.
You can stop your streaming context in cluster mode by running the following command, without needing to send a SIGTERM. This will stop the streaming context without you having to explicitly stop it using a shutdown hook.
$SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID
- $MASTER_REST_URL is the REST URL of the Spark master, i.e. something like spark://localhost:6066
- $DRIVER_ID is something like driver-20150915145601-0000
If you want Spark to stop your app gracefully, you can try setting the following Spark property when your app is initially submitted (see http://spark.apache.org/docs/latest/submitting-applications.html on setting Spark configuration properties).
spark.streaming.stopGracefullyOnShutdown=true
This is not officially documented, and I gathered this from looking at the 1.4 source code. This flag is honored in standalone mode. I haven't tested it in clustered mode yet.
I am working with spark 1.4.*
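For illustration, a minimal sketch of setting the property programmatically, assuming the app builds its own SparkConf (it can equally be passed at submit time with --conf spark.streaming.stopGracefullyOnShutdown=true):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("graceful-shutdown-demo")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// ... define the streaming computation here ...
ssc.start()
ssc.awaitTermination()  // on shutdown (e.g. SIGTERM), the context now stops gracefully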
It depends on the use case and on how the driver can be used.
Consider the case where you want to collect some N records (tweets) from Spark Structured Streaming, store them in PostgreSQL, and stop the stream once the count crosses N records.
One way of doing this is to use an accumulator and Python threading:
Create a Python thread with the stream query object and the accumulator, and stop the query once the count is crossed.
While starting the stream query, pass the accumulator variable and update its value for each batch of the stream.
Sharing the code snippet for understanding/illustration purposes...
import threading
import time
def check_n_stop_streaming(query, acc, num_records=3500):
    while True:
        if acc.value > num_records:
            print_info(f"Number of records received so far {acc.value}")
            query.stop()
            break
        else:
            print_info(f"Number of records received so far {acc.value}")
            time.sleep(1)
...
count_acc = spark.sparkContext.accumulator(0)
...
def postgresql_all_tweets_data_dump(df,
                                    epoch_id,
                                    raw_tweet_table_name,
                                    count_acc):
    print_info("Raw Tweets...")
    df.select(["text"]).show(50, False)

    count_acc += df.count()

    mode = "append"
    url = "jdbc:postgresql://{}:{}/{}".format(self._postgresql_host,
                                              self._postgresql_port,
                                              self._postgresql_database)
    properties = {"user": self._postgresql_user,
                  "password": self._postgresql_password,
                  "driver": "org.postgresql.Driver"}
    df.write.jdbc(url=url, table=raw_tweet_table_name, mode=mode, properties=properties)
...
query = tweet_stream.writeStream.outputMode("append") \
    .foreachBatch(lambda df, id:
                  postgresql_all_tweets_data_dump(df=df,
                                                  epoch_id=id,
                                                  raw_tweet_table_name=raw_tweet_table_name,
                                                  count_acc=count_acc)).start()

# Pass the query handle and the accumulator so the arguments match
# check_n_stop_streaming(query, acc, num_records)
stop_thread = threading.Thread(target=check_n_stop_streaming, args=(query, count_acc, num_records))
stop_thread.setDaemon(True)
stop_thread.start()

query.awaitTermination()
stop_thread.join()
If all you need is to stop a running streaming application, then the simplest way is via the Spark admin UI (you can find its URL in the startup logs of the Spark master).
There is a section in the UI that shows running streaming applications, and there are tiny (kill) URL buttons next to each application ID.
It is official now; please look at the original Apache documentation here:
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
