DB writes are executed lazily with Spark Streaming - apache-spark

HBase puts are not executed while Spark Streaming is running, only when I'm shutting down Spark - it tries to perform all puts altogether
val inputRdd = FlumeUtils.createStream(ssc, "server", 44444)
inputRdd.foreachRDD({ rdd =>
rdd.foreachPartition(partitionOfRecords => {
val hbaseClient = new HBaseClient(zookeeper)
partitionOfRecords.foreach({ event =>
hbaseClient.put(parse(event))
hbaseClient.flush()

ok - I've found my answer - apparently my code was correct, the problem was I didn't leave enough threads for processing the data
from
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"""
If you are using a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
"""
using local[*] fixed the issue

Related

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from Kafka stream using microbatching (readStream), processing them and writing results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing microbatches of size 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc.) but possibly different processing algorithm), Spark finishes the previous running query and creates a new one with the same ID (so it starts from the very same offset on which the previous job "finished").
How to tell Spark that the second job is different from the first one, so there is no need to restore from checkpoint (i.e. intended behaviour is to create a completely new streaming query not connected to previous one, and keep the previous one running)?
You can achieve independence of the two streaming queries by setting the checkpointLocation option in their respective writeStream call. You should not set the checkpoint location centrally in the SparkSession.
That way, they can run independently and will not interfere from each other.

Spark Streaming Replay

I have a Spark Streaming application to analyze events incoming from a Kafka broker. I have rules like below and new rules can be generated by combining existing ones:
If this event type occurs raise an alert.
If this event type occurs more than 3 times in a 5-minute interval, raise an alert.
In parallel, I save every incoming data to Cassandra. What I like to do is run this streaming app for historic data from Cassandra. For example,
<This rule> would have generated <these> alerts for <last week>.
Is there any way to do this in Spark or is it in roadmap? For example, Apache Flink has event time processing. But migrating existing codebase to it seems hard and I'd like to solve this problem with reusing my existing code.
This is fairly straight-forward, with some caveats. First, it helps to understand how this works from the Kafka side.
Kafka manages what are called offsets -- each message in Kafka has an offset relative to its position in a partition. (Partitions are logical divisions of a topic.) The first message in a partition has an offset of 0L, second one is 1L etc. Except that, because of log rollover and possibly topic compaction, 0L isn't always the earliest offset in a partition.
The first thing you are going to have to do is to collect the offsets for all of the partitions you want to read from the beginning. Here's a function that does this:
def getOffsets(consumer: SimpleConsumer, topic: String, partition: Int) : (Long,Long) = {
val time = kafka.api.OffsetRequest.LatestTime
val reqInfo = Map[TopicAndPartition,PartitionOffsetRequestInfo](
(new TopicAndPartition(topic, partition)) -> (new PartitionOffsetRequestInfo(time, 1000))
)
val req = new kafka.javaapi.OffsetRequest(
reqInfo, kafka.api.OffsetRequest.CurrentVersion, "test"
)
val resp = consumer.getOffsetsBefore(req)
val offsets = resp.offsets(topic, partition)
(offsets(offsets.size - 1), offsets(0))
}
You would call it like this:
val (firstOffset,nextOffset) = getOffsets(consumer, "MyTopicName", 0)
For everything you ever wanted to know about retrieving offsets from Kafka, read this. It's cryptic, to say the least. (Let me know when you fully understand the second argument to PartitionOffsetRequestInfo, for example.)
Now that you have firstOffset and lastOffset of the partition you want to look at historically, you then use the fromOffset parameter of createDirectStream, which is of type: fromOffset: Map[TopicAndPartition, Long]. You would set the Long / value to the firstOffset you got from getOffsets().
As for nextOffset -- you can use that to determine in your stream when you move from handling historical data to new data. If msg.offset == nextOffset then you are processing the first non-historical record within the partition.
Now for the caveats, directly from the documentation:
Once a context has been started, no new streaming computations can be
set up or added to it.
Once a context has been stopped, it cannot be
restarted.
Only one StreamingContext can be active in a JVM at the
same time.
stop() on StreamingContext also stops the SparkContext. To
stop only the StreamingContext, set the optional parameter of stop()
called stopSparkContext to false.
A SparkContext can be re-used to
create multiple StreamingContexts, as long as the previous
StreamingContext is stopped (without stopping the SparkContext)
before the next StreamingContext is created.
It's because of these caveats that I grab nextOffset at the same time as firstOffset -- so I can keep the stream up, but change the context from historical to present-time processing.

How a Spark Streaming application be loaded and run?

hi i am new to spark and spark streaming.
from the official document i could understand how to manipulate input data and save them.
the problem is the quick example of Spark Streaming quick examplemade me confuse
i knew the the job should get data from the DStream you have setted and do something on them, but since its running 24/7. how will the application be loaded and run?
will it run every n seconds or just run once at the beginning and then enter the cycle of [read-process-loop]?
BTW, i am using python, so i checked the python code of that example, if its the latter case, how spark's executor knews the which code snipnet is the loop part ?
Spark Streaming is actually a microbatch processing. That means each interval, which you can customize, a new batch is executed.
Look at the coding of the example, which you have mentioned
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc,1)
You define a streaming context, which a micro-batch interval of 1 second.
That is the subsequent coding, which uses the streaming context
lines = ssc.socketTextStream("localhost", 9999)
...
gets executed every second.
The streaming process gets initially triggerd by this line
ssc.start() # Start the computation

How do I stop a spark streaming job?

I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations of attaching a shutdown hook in the job monitoring and sending a SIGTERM to the job.
sys.ShutdownHookThread {
logger.info("Gracefully stopping Application...")
ssc.stop(stopSparkContext = true, stopGracefully = true)
logger.info("Application stopped gracefully")
}
It seems to work but does not look like the cleanest way to stop the job. Am I missing something here?
From a code perspective it may make sense but how do you use this in a cluster environment? If we start a spark streaming job (we distribute the jobs on all the nodes in the cluster) we will have to keep track of the PID for the job and the node on which it was running. Finally when we have to stop the process, we need to keep track which node the job was running at and the PID for that. I was just hoping that there would be a simpler way of job control for streaming jobs.
You can stop your streaming context in cluster mode by running the following command without needing to sending a SIGTERM. This will stop the streaming context without you needing to explicitly stop it using a thread hook.
$SPARK_HOME_DIR/bin/spark-submit --master $MASTER_REST_URL --kill $DRIVER_ID
-$MASTER_REST_URL is the rest url of the spark driver, ie something like spark://localhost:6066
-$DRIVER_ID is something like driver-20150915145601-0000
If you want spark to stop your app gracefully, you can try setting the following system property when your spark app is initially submitted (see http://spark.apache.org/docs/latest/submitting-applications.html on setting spark configuration properties).
spark.streaming.stopGracefullyOnShutdown=true
This is not officially documented, and I gathered this from looking at the 1.4 source code. This flag is honored in standalone mode. I haven't tested it in clustered mode yet.
I am working with spark 1.4.*
Depends on the use case and how driver can be used.
Consider the case you wanted to collect some N records(tweets) from the Spark Structured Streaming, store them in Postgresql and stop the stream once the count crosses N records.
One way of doing this is to use accumulator and python threading.
Create a Python thread with stream query object and the accumulator, stop the query once the count is crossed
While starting the stream query pass the accumulator variable and update the value for each batch of the stream.
Sharing the code snippet for understanding/illustration purpose...
import threading
import time
def check_n_stop_streaming(query, acc, num_records=3500):
while (True):
if acc.value > num_records:
print_info(f"Number of records received so far {acc.value}")
query.stop()
break
else:
print_info(f"Number of records received so far {acc.value}")
time.sleep(1)
...
count_acc = spark.sparkContext.accumulator(0)
...
def postgresql_all_tweets_data_dump(df,
epoch_id,
raw_tweet_table_name,
count_acc):
print_info("Raw Tweets...")
df.select(["text"]).show(50, False)
count_acc += df.count()
mode = "append"
url = "jdbc:postgresql://{}:{}/{}".format(self._postgresql_host,
self._postgresql_port,
self._postgresql_database)
properties = {"user": self._postgresql_user,
"password": self._postgresql_password,
"driver": "org.postgresql.Driver"}
df.write.jdbc(url=url, table=raw_tweet_table_name, mode=mode, properties=properties)
...
query = tweet_stream.writeStream.outputMode("append"). \
foreachBatch(lambda df, id :
postgresql_all_tweets_data_dump(df=df,
epoch_id=id,
raw_tweet_table_name=raw_tweet_table_name,
count_acc=count_acc)).start()
stop_thread = threading.Thread(target=self.check_n_stop_streaming, args=(query, num_records, raw_tweet_table_name, ))
stop_thread.setDaemon(True)
stop_thread.start()
query.awaitTermination()
stop_thread.join()
If all you need is just stop running streaming application, then simplest way is via Spark admin UI (you can find it's URL in the startup logs of Spark master).
There is a section in the UI, that shows running streaming applications, and there are tiny (kill) url buttons near each application ID.
It is official now,please look into original apache documentation here-
http://spark.apache.org/docs/latest/configuration.html#spark-streaming

Spark Streaming not distributing task to nodes on cluster

I have two node standalone cluster for spark stream processing. below is my sample code which demonstrate process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc=new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream //batch of 500 ms as i would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair dstream
val stateStream = keyDStream .updateStateByKey //updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and meterilizing state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) //joining state wit input stream for further processing
val alertStream = withHistory.filter // decision to be taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My Problem is spark is not distributing this state RDD to multiple nodes or not distributing task to other node and causing high latency in response, my input load is around 100,000 tuples per seconds.
I have tried below things but nothing is working
1) spark.locality.wait to 1 sec
2) reduce memory allocated to executer process to check weather spark distribute RDD or task but even if it goes beyond memory limit of first node (m1) where drive is also running.
3) increased spark.streaming.concurrentJobs from 1 (default) to 3
4) I have checked in streaming ui storage that there are around 20 partitions for state dstream RDD all located on local node m1.
If I run SparkPi 100000 then spark is able to utilize another node after few seconds (30-40) so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed that even for my RDD if I set storage level MEMORY_AND_DISK_SER_2 then also in app ui storage it shows Memory Serialized 1x Replicated
Spark will not distribute stream data across the cluster automatically for it tends to make full use of data locality(to launch a task on where its data lies will be better, this is default configuration). But you can use repartition to distribute stream data and improve the parallelism. You can turn to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
If your not hitting the cluster and your jobs only run locally it most likely means your Spark Master in your SparkConf is set to the local URI not the master URI.
By default the value of spark.default.parallelism property is "Local mode" so all the tasks will be executed in the node is receiving the data.
Change this property in spark-defaults.conf file in order to increase the parallelism level.

Resources