Spark Streaming Replay - apache-spark

I have a Spark Streaming application to analyze events incoming from a Kafka broker. I have rules like below and new rules can be generated by combining existing ones:
If this event type occurs raise an alert.
If this event type occurs more than 3 times in a 5-minute interval, raise an alert.
In parallel, I save all incoming data to Cassandra. What I'd like to do is run this streaming app over the historical data in Cassandra. For example,
<This rule> would have generated <these> alerts for <last week>.
Is there any way to do this in Spark, or is it on the roadmap? For example, Apache Flink has event-time processing, but migrating the existing codebase to it seems hard, and I'd like to solve this problem by reusing my existing code.

This is fairly straightforward, with some caveats. First, it helps to understand how this works from the Kafka side.
Kafka manages what are called offsets -- each message in Kafka has an offset relative to its position within a partition. (Partitions are logical divisions of a topic.) The first message in a partition has an offset of 0L, the second 1L, and so on. Except that, because of log rollover and possibly topic compaction, 0L isn't always the earliest offset in a partition.
The first thing you are going to have to do is to collect the offsets for all of the partitions you want to read from the beginning. Here's a function that does this:
import kafka.api.PartitionOffsetRequestInfo
import kafka.common.TopicAndPartition
import kafka.javaapi.consumer.SimpleConsumer
import scala.collection.JavaConverters._

// Returns (earliest available offset, next offset to be produced) for a partition.
def getOffsets(consumer: SimpleConsumer, topic: String, partition: Int): (Long, Long) = {
  val time = kafka.api.OffsetRequest.LatestTime
  val reqInfo = Map[TopicAndPartition, PartitionOffsetRequestInfo](
    new TopicAndPartition(topic, partition) -> new PartitionOffsetRequestInfo(time, 1000)
  )
  val req = new kafka.javaapi.OffsetRequest(
    reqInfo.asJava, kafka.api.OffsetRequest.CurrentVersion, "test"
  )
  val resp = consumer.getOffsetsBefore(req)
  // Offsets are returned newest-first: the last element is the earliest retained
  // offset, the first element is the offset the next produced message will get.
  val offsets = resp.offsets(topic, partition)
  (offsets(offsets.size - 1), offsets(0))
}
You would call it like this:
val (firstOffset,nextOffset) = getOffsets(consumer, "MyTopicName", 0)
For everything you ever wanted to know about retrieving offsets from Kafka, read this. It's cryptic, to say the least. (Let me know when you fully understand the second argument to PartitionOffsetRequestInfo, for example.)
Now that you have the firstOffset and nextOffset of the partition you want to look at historically, you use the fromOffsets parameter of createDirectStream, which is of type Map[TopicAndPartition, Long]. You would set the Long value to the firstOffset you got from getOffsets().
As for nextOffset -- you can use that to determine in your stream when you move from handling historical data to new data. If msg.offset == nextOffset then you are processing the first non-historical record within the partition.
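To illustrate, here is a minimal sketch of starting the direct stream from those historical offsets, assuming the Kafka 0.8 direct API, a single partition 0, a made-up broker address, and the firstOffset / nextOffset values from getOffsets(); the message handler keeps each record's offset so the stream can tell historical records from live ones:
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("MyTopicName", 0) -> firstOffset)

// Keep the offset alongside the payload so we can detect when we cross nextOffset.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (Long, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.message))

stream.foreachRDD { rdd =>
  rdd.foreach { case (offset, msg) =>
    val isHistorical = offset < nextOffset // records before nextOffset are replayed history
    println(s"historical=$isHistorical offset=$offset msg=$msg") // apply your rules here instead
  }
}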
Now for the caveats, directly from the documentation:
Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
It's because of these caveats that I grab nextOffset at the same time as firstOffset -- so I can keep the stream up, but change the context from historical to present-time processing.

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that is then processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that is then processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink, and there can be only one sink per streaming query (I'm repeating myself to remember it better, as I seem to have been caught out by this multiple times with Spark Structured Streaming, Kafka Streams, and recently ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (so that queries don't split the data between themselves, which is what Kafka does for consumers sharing a group.id), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so that the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select('value.cast("string")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
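To make that concrete, here is a minimal, self-contained sketch of the pattern above; the broker address, the topic name events, and the two filter predicates are made up for illustration, and each writeStream.start() call launches its own streaming query:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-query").getOrCreate()
import spark.implicits._

// One source definition, reused by both downstream queries.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

val strDf = df.select($"value".cast("string").as("value"))

// Each start() runs on its own daemon thread, as described above.
val q1 = strDf.filter($"value".contains("error")).writeStream.format("console").start()
val q2 = strDf.filter($"value".contains("warn")).writeStream.format("console").start()

spark.streams.awaitAnyTermination()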

Spark Streaming appends to S3 as Parquet format, too many small partitions

I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2-minute non-overlapping window.
My approach:
Kinesis Stream -> Spark Streaming with a batch duration of about 60 seconds, using a non-overlapping window of 120s, saving the streamed data into S3 as:
val rdd1 = kinesisStream.map(rdd => /* decode the data */)

rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession...
  import spark.implicits._
  // convert rdd to df
  val df = rdd.toDF(columnNames: _*)
  df.write.parquet("s3://bucket/20161211.parquet")
}
Here is what s3://bucket/20161211.parquet looks like after a while: lots of fragmented small partitions (which is horrendous for read performance). The question is, is there any way to control the number of small partitions as I stream data into this S3 parquet file?
Thanks
What I am thinking of doing is something like this each day:
val df = spark.read.parquet("s3://bucket/20161211.parquet")
df.coalesce(4).write.parquet("s3://bucket/20161211_4partition.parquet")
where I kind of repartition the dataframe into 4 partitions and save it back....
It works, but I feel that doing this every day is not an elegant solution...
That's actually pretty close to what you want to do; each partition will get written out as an individual file in Spark. However, coalesce is a bit confusing since it can (effectively) be applied upstream of where the coalesce is called. The warning from the Scala doc is:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
In Datasets it's a bit easier to persist and count to force wide evaluation, since the default coalesce function doesn't take a shuffle flag as input (although you could construct an instance of Repartition manually).
Another option is to have a second periodic batch job (or even a second streaming job) that cleans up/merges the results, but this can be a bit complicated as it introduces a second moving part to keep track of.
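For example, a rough sketch of doing the coalesce inside the streaming write itself rather than in a daily cleanup job; it reuses rdd1 and columnNames from the question's snippet, and the 4-partition target is just an assumption:
rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val df = rdd.toDF(columnNames: _*)

  // Collapse each 2-minute micro-batch into a handful of output files.
  // coalesce(4) narrows the final stage; use repartition(4) instead if you
  // want the upstream work to stay fully parallel at the cost of a shuffle.
  df.coalesce(4)
    .write
    .mode("append") // append to the same prefix across batches
    .parquet("s3://bucket/20161211.parquet")
}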

Kafka.Utils.createRDD Vs KafkaDirectStreaming

I would like to know whether reading from a Kafka queue is faster using a batch Kafka RDD instead of KafkaDirectStream, when I want to read the whole Kafka queue.
I've observed that reading from different partitions with a batch RDD does not result in concurrent Spark jobs. Are there some Spark properties to configure in order to allow this behaviour?
Thanks.
Try running your spark consumers in different threads or as different processes. That's the approach I take. I've observed that I get the best concurrency by allocating one consumer thread (or process) per topic partition.
Regarding your questions about batch vs KafkaDirectStream, I think even KafkaDirectStream is batch oriented. The batch interval can be specified in the streaming context, like this:
private static final int INTERVAL = 5000; // 5 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(INTERVAL));
There's a good image that describes how Spark Streaming is batch oriented here:
http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#discretized-streams-dstreams
Spark is essentially a batch engine, and Spark Streaming takes batching closer to streaming by defining something called micro-batching. Micro-batching is nothing but specifying the batch interval to be very low (it can be as low as 50ms, per the advice in the official documentation). So all that matters now is how long your micro-batch interval is going to be. If you keep it low, it will feel near real-time.
On the Kafka consumer front, the Spark direct receiver runs as a separate task in each executor. So if you have as many executors as partitions, it fetches data from all partitions and creates an RDD out of them.
If you are talking about reading from multiple queues, then you would create multiple DStreams, which would again need more executors to match the total number of partitions.

Spark Streaming from Kafka Source Go Back to a Checkpoint or Rewinding

When streaming Spark DStreams as a consumer from a Kafka source, one can checkpoint the Spark context so that when the app crashes (or is affected by a kill -9), it can recover from the context checkpoint. But if the app is 'accidentally deployed with bad logic', one might want to rewind to the last good topic+partition+offset and replay events from the Kafka partition offset positions that were working fine before the 'bad logic'. How are streaming apps rewound to the last 'good spot' (topic+partition+offset) when checkpointing is in effect?
Note: In I (Heart) Logs, Jay Kreps writes about using a parallel consumer (group) process that starts at the diverging Kafka offset locations until caught up with the original and then killing the original. (What does this 2nd Spark streaming process look like with respect to the starting from certain partition/offset locations?)
Sidebar: This question may be related to Mid-Stream Changing Configuration with Check-Pointed Spark Stream as a similar mechanism may need to be deployed.
You are not going to be able to rewind a stream in a running SparkStreamingContext. It's important to keep these points in mind (straight from the docs):
Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
Instead, you are going to have to stop the current stream, and create a new one. You can start a stream from a specific set of offsets using one of the versions of createDirectStream that takes a fromOffsets parameter with the signature Map[TopicAndPartition, Long] -- it's the starting offset mapped by topic and partition.
Another theoretical possibility is to use KafkaUtils.createRDD which takes offset ranges as input. Say your "bad logic" started at offset X and then you fixed it at offset Y. For certain use cases, you might just want to do createRDD with the offsets from X to Y and process those results, instead of trying to do it as a stream.
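A rough sketch of that second option, using the Kafka 0.8 batch API; the broker address, topic name, partition count, and the X/Y offsets below are placeholders, not values from the question:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// Placeholder offsets: X is where the "bad logic" started, Y is where it was fixed.
val X = 1000L
val Y = 2000L

// One OffsetRange per partition you want to replay.
val offsetRanges = Array(
  OffsetRange("my-topic", 0, X, Y),
  OffsetRange("my-topic", 1, X, Y)
)

val replayRdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

replayRdd.foreach { case (_, value) =>
  // apply the corrected logic to each replayed record
  println(value)
}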

Spark Streaming shuffle data size on disk keeps increasing

I have a basic Spark Streaming app that reads logs from Kafka, joins them with a batch RDD, and saves the result to Cassandra:
val users = sparkStmContext.cassandraTable[User](usersKeyspace, usersTable)

KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  sparkStmContext,
  Map(("zookeeper.connect", zkConnectionString), ("group.id", groupId)),
  Map(TOPIC -> 1),
  StorageLevel.MEMORY_ONLY_SER)
  .transform(rdd => StreamTransformations.joinLogsWithUsers(rdd, users)) // simple join of the users batch RDD with the stream RDD
  .saveToCassandra("users", "user_logs")
When running this, though, I notice that some of the shuffle data in "spark.local.dir" is not being removed and the directory size keeps growing until I run out of disk space.
Spark is supposed to take care of deleting this data (see last comment from TD).
I set the "spark.cleaner.ttl" but this had no effect.
Is this happening because I create the users RDD outside of the stream and use lazy loading to reload it every streaming window (as the user table can be updated by another process), and hence it never goes out of scope?
Is there a better pattern for this use case? Any docs I have seen take the same approach as me.
