Multiple consumers exactly-once processing with Apache Spark Streaming - apache-spark

I want to process elements from a queue (Kafka or Amazon Kinesis) and perform multiple operations on each element, for example:
Write it to an HDFS cluster
Invoke a REST API
Trigger a notification on Slack
I expect exactly-once semantics for each of these operations. Is this achievable in Apache Spark, and if so, how?

You will need to manage unique keys manually, but with that approach it is possible when using KafkaUtils.createDirectStream.
From the Spark docs http://spark.apache.org/docs/latest/streaming-kafka-integration.html :
Approach 2: Direct Approach (No Receivers)
each record is received by Spark Streaming
effectively exactly once despite failures.
And here is the idempotency requirement, which you could meet by, for example, saving a unique key per message in Postgres:
In order to achieve
exactly-once semantics for output of your results, your output
operation that saves the data to an external data store must be either
idempotent, or an atomic transaction that saves results and offsets
(see Semantics of output operations in the main programming guide for
further information).
Here is an idea of the kind of code you would need to manage the unique keys (from http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ ):
stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // make sure connection pool is set up on the executor before writing
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    iter.foreach { case (key, msg) =>
      DB.autoCommit { implicit session =>
        // the unique key for idempotency is just the text of the message itself, for example purposes
        sql"insert into idem_data(msg) values (${msg})".update.apply
      }
    }
  }
}
A unique per-message ID would need to be managed.
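To make the unique-key idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for Postgres. The table layout, the msg_id column and the function names are assumptions made for illustration, not something taken from the Spark docs or the Cloudera post. The point is only that replaying the same batch (as at-least-once delivery may do) leaves the stored data unchanged.

```python
import sqlite3


def create_idem_table(conn):
    # Hypothetical table: the primary key is the per-message unique ID.
    conn.execute("CREATE TABLE idem_data (msg_id TEXT PRIMARY KEY, msg TEXT)")


def save_idempotently(conn, records):
    """Insert (msg_id, msg) pairs; a replayed msg_id is silently skipped."""
    with conn:  # commits on success, rolls back if an exception is raised
        for msg_id, msg in records:
            conn.execute(
                "INSERT OR IGNORE INTO idem_data (msg_id, msg) VALUES (?, ?)",
                (msg_id, msg),
            )


def stored_rows(conn):
    return conn.execute(
        "SELECT msg_id, msg FROM idem_data ORDER BY msg_id"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_idem_table(conn)
    batch = [("k1", "hello"), ("k2", "world")]
    save_idempotently(conn, batch)
    save_idempotently(conn, batch)  # simulated retry of the same batch
    print(stored_rows(conn))
```

In Postgres the comparable statement would be an insert with ON CONFLICT DO NOTHING on the key column. The REST call and the Slack notification would each need their own deduplication on the same ID for the approach to cover all three outputs.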

Exactly-once is a side effect of at-least-once processing semantics when the operations are idempotent. In your case, if all three operations are idempotent, you get exactly-once semantics. The other way to get exactly-once semantics is to wrap all three operations and the Kafka offset storage in one transaction, which is not feasible here because an HDFS write, a REST call and a Slack notification do not share a common transaction.
https://pkghosh.wordpress.com/2016/05/18/exactly-once-stream-processing-semantics-not-exactly/

Related

Handling Out-Of-Order Event Windowing in Apache Beam from a Multitenant Kafka Topic

I’ve been mulling over how to solve a given problem in Beam and thought I’d reach out to a larger audience for some advice. At present things seem to be working sparsely and I was curious if someone could provide a sounding-board to see if this workflow makes sense.
The primary high-level goal is to read records from Kafka that may be out of order and need to be windowed in Event Time according to another property found on the records and eventually emitting the contents of those windows and writing them out to GCS.
The current pipeline looks roughly like the following:
val partitionedEvents = pipeline
    .apply("Read Events from Kafka",
        KafkaIO
            .read<String, Log>()
            .withBootstrapServers(options.brokerUrl)
            .withTopic(options.incomingEventsTopic)
            .withKeyDeserializer(StringDeserializer::class.java)
            .withValueDeserializerAndCoder(
                SpecificAvroDeserializer<Log>()::class.java,
                AvroCoder.of(Log::class.java)
            )
            .withReadCommitted()
            .commitOffsetsInFinalize()
            // Set the watermark to use a specific field for event time
            .withTimestampPolicyFactory { _, previousWatermark -> WatermarkPolicy(previousWatermark) }
            .withConsumerConfigUpdates(
                ImmutableMap.of<String, Any?>(
                    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest",
                    ConsumerConfig.GROUP_ID_CONFIG, "log-processor-pipeline",
                    "schema.registry.url", options.schemaRegistryUrl
                )
            ).withoutMetadata()
    )
    .apply("Logging Incoming Logs", ParDo.of(Events.log()))
    .apply("Rekey Logs by Tenant", ParDo.of(Events.key()))
    .apply("Partition Logs by Source",
        // This is a custom function that will partition incoming records by a specific
        // datasource field
        Partition.of(dataSources.size, Events.partition<KV<String, Log>>(dataSources))
    )
dataSources.forEach { dataSource ->
    // Store a reference to the data source name to avoid serialization issues
    val sourceName = dataSource.name
    val tempDirectory = Directories.resolveTemporaryDirectory(options.output)
    // Grab all of the events for this specific partition and apply the source-specific windowing
    // strategies
    partitionedEvents[dataSource.partition]
        .apply(
            "Building Windows for $sourceName",
            SourceSpecificWindow.of<KV<String, Log>>(dataSource)
        )
        .apply("Group Windowed Logs by Key for $sourceName", GroupByKey.create())
        .apply("Log Events After Windowing for $sourceName", ParDo.of(Events.logAfterWindowing()))
        .apply(
            "Writing Windowed Logs to Files for $sourceName",
            FileIO.writeDynamic<String, KV<String, MutableIterable<Log>>>()
                .withNumShards(1)
                .by { row -> "${row.key}/${sourceName}" }
                .withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.fn(SerializableFunction { logs -> Files.stringify(logs.value) }), TextIO.sink())
                .to(options.output)
                .withNaming { partition -> Files.name(partition) }
                .withTempDirectory(tempDirectory)
        )
}
In a simpler, bulleted form, it might look like this:
Read records from single Kafka topic
Key all records by their tenant
Partition stream by another event property
Iterate through known partitions in previous step
Apply custom windowing rules for each partition (related to datasource, custom window rules)
Group windowed items by key (tenant)
Write tenant-key pair groupings to GCP via FileIO
The problem is that the incoming Kafka topic contains out-of-order data across multiple tenants (e.g. events for tenant1 might be streaming in now, but then a few minutes later you’ll get them for tenant2 in the same partition, etc.). This would cause the watermark to bounce back and forth in time as each incoming record would not be guaranteed to continually increase, which sounds like it would be a problem, but I'm not certain. It certainly seems that while data is flowing through, some files are simply not being emitted at all.
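To illustrate why files for some tenants might never appear, here is a small Python sketch of one possible explanation. It rests on assumptions that may not match the pipeline above: that the watermark only moves forward and tracks the largest event timestamp seen so far (what the custom WatermarkPolicy really does is not shown), and that an element is dropped once the watermark has passed the end of its window plus the allowed lateness. The Event type and classify function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    tenant: str
    event_time: int  # seconds since some epoch


def classify(events, window_size, allowed_lateness):
    """Return (tenant, event_time, 'kept' | 'dropped') for each event in arrival order."""
    watermark = float("-inf")
    results = []
    for event in events:
        # End of the fixed window that this event's timestamp falls into.
        window_end = (event.event_time // window_size + 1) * window_size
        # Simplified lateness rule: the window has expired once the
        # watermark reaches window end + allowed lateness.
        expired = watermark >= window_end + allowed_lateness
        results.append((event.tenant, event.event_time, "dropped" if expired else "kept"))
        # Assumed policy: the watermark never moves backwards.
        watermark = max(watermark, event.event_time)
    return results


if __name__ == "__main__":
    arrivals = [
        Event("tenant1", 7200),  # up-to-date tenant
        Event("tenant2", 0),     # data from two hours earlier
        Event("tenant1", 7100),  # slightly out of order, still within lateness
    ]
    for row in classify(arrivals, window_size=300, allowed_lateness=600):
        print(row)
```

Under these assumptions the watermark does not bounce back and forth. Instead, the tenant that is furthest ahead pushes it forward for everyone, and older data from other tenants arrives behind it.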
The custom windowing function is extremely simple and was aimed to emit a single window once the allowed lateness and windowing duration has elapsed:
object SourceSpecificWindow {
    fun <T> of(dataSource: DataSource): Window<T> {
        return Window.into<T>(FixedWindows.of(dataSource.windowDuration()))
            .triggering(Never.ever())
            .withAllowedLateness(dataSource.allowedLateness(), Window.ClosingBehavior.FIRE_ALWAYS)
            .discardingFiredPanes()
    }
}
However, it seemed inconsistent since we'd see logging come out after the closing of the window, but not necessarily files being written out to GCS.
Does anything seem blatantly wrong or incorrect with this approach? The data can arrive out of order within the source (i.e. right now, 2 hours ago, 5 minutes from now) and spans multiple tenants, and the aim is to ensure that one tenant that keeps up to date won't drown out tenants whose data arrives from the past.
Would we potentially need another Beam application or something to "split" this single stream of events into sub-streams that are each processed independently (so that each watermark progresses on its own)? Is that where a SplittableDoFn would come in? I'm running on the SparkRunner, which doesn't appear to support that, but it seems as though it'd be a valid use case.
Any advice would be greatly appreciated or even just another set of eyes. I'd be happy to provide any additional details that I could.
Environment
Currently running against SparkRunner
While this may not be the most helpful response, I'll be transparent as far as the end result. Eventually the logic required for this specific use case extended far beyond the built-in capabilities of Apache Beam, primarily in the area of windowing and the governance of time.
The solution that was landed on was to switch the preferred streaming technology from Apache Beam to Apache Flink, which as you might imagine was quite a leap. The stateful-centric nature of Flink allowed us to more easily handle our use cases, define custom eviction criteria (and ordering) around windowing, while losing a layer of abstraction over it.

How to access stateSnapshot in mapGroupsWithState or share the GroupState between streams?

With the DStream API it's possible to access the snapshot state of a stateful stream using MapWithStateDStream.stateSnapshots(). In the new Structured Streaming API it seems to me that only the function passed to mapGroupsWithState is able to access and update the state.
I'd like to create an in-memory distributed state of a stream based on its input events, then enrich events from another stream by joining them to the first stream's (complete) state, i.e. not to the first stream itself.
Using the DStream API I'd simply join the second stream with the first stream's stateSnapshot. Is this feature missing on the new Structured Streaming API or is there a new better/cleaner way of sharing the GroupState between two streams?
Is this feature missing on the new Structured Streaming API or is
there a new better/cleaner way of doing this?
There is no stateSnapshot provided out of the box as part of Structured Streaming (SS). I'm assuming this could be done, perhaps in a later version of SS. I'm not sure it fits their design goals, since state is completely masked away from the end user for arbitrary streams, although this could be useful for people using custom state via (flat)mapGroupsWithState.
In order to roll your own "snapshot", you can always output the intermediate state you have in GroupState[S] for every batch generated, e.g.:
def updateSessionEvents(
    id: Int,
    userEvents: Iterator[UserEvent],
    state: GroupState[UserSession]): Option[UserSession] = {
  // Do stuff
  val someState = ??? // update state
  someState
}
Then, you're exposed to the entire state all the time. What this means is that you'll now need to maintain some flag indicating whether the state is actually complete, so that you don't send incomplete state downstream where it isn't meant to be.
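The "emit the state on every batch, plus a completeness flag" idea can be sketched without Spark at all. The Python below is a plain simulation, not the Structured Streaming API: the dict named store stands in for GroupState, and treating a "logout" event as the signal that a session is complete is an assumption made purely for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserSession:
    user_id: int
    events: List[str] = field(default_factory=list)
    complete: bool = False


def update_session(user_id: int, new_events: List[str],
                   store: Dict[int, UserSession]) -> UserSession:
    """Fold one batch of events into the per-user state and return a snapshot."""
    session = store.setdefault(user_id, UserSession(user_id))
    session.events.extend(new_events)
    if "logout" in new_events:  # assumed completion rule
        session.complete = True
    # Return a copy so that later batches do not mutate what was emitted.
    return UserSession(user_id, list(session.events), session.complete)


if __name__ == "__main__":
    store: Dict[int, UserSession] = {}
    emitted = [
        update_session(1, ["login", "click"], store),
        update_session(1, ["logout"], store),
    ]
    # Downstream consumers keep only snapshots flagged as complete.
    print([s for s in emitted if s.complete])
```

Every batch produces an output row, and the consumer decides what to do with incomplete snapshots by checking the flag.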
mapGroupsWithState is like the functional map where for every input you have to give an output. That's the contract of map in its broad sense and mapGroupsWithState follows it.
If I understood you correctly, what you really need is KeyValueGroupedDataset.flatMapGroupsWithState as you seem to be building a so-called arbitrary stateful stream aggregation.
flatMapGroupsWithState[S, U](outputMode: OutputMode, timeoutConf: GroupStateTimeout)(func: (K, Iterator[V], GroupState[S]) ⇒ Iterator[U])(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U] applies the given function to each group of data, while maintaining a user-defined per-group state.
This seemingly complex flatMapGroupsWithState function gives you the most powerful abstraction for stateful aggregations and my understanding is that if it's not capable of giving you the expected result it's unlikely there's a better function (I'd love being told that I'm wrong).

Cassandra : Batch write optimisation

I get a bulk write request for, let's say, some 20 keys from a client.
I can either write them to C* in one batch or write them individually in an async way and wait on the futures to complete.
Writing in a batch does not seem to be a good option as per the documentation, as my insertion rate will be high and, if keys belong to different partitions, coordinators will have to do extra work.
Is there a way in the DataStax Java driver to group keys
that belong to the same partition, club them into small
batches, and then do individual unlogged batch writes in async? That
way I make fewer RPC calls to the server and at the same time the coordinator will
have to write locally. I will be using the token-aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so some coordination would happen on the driver side.
Then, you could group your requests by equality of partition key, that would probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is e.g. you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then when inserting several such objects, you group them by MyData.partitioningKey. That is, for all existing partitioningKey values, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
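The grouping step itself is independent of the driver, so here is a rough sketch of it in plain Python. The MyData fields mirror the shape above; the max_batch_size cap and the function name are assumptions added for illustration. Each returned list would then be wrapped in one unlogged batch by whatever driver you use.

```python
from collections import defaultdict
from typing import List, NamedTuple


class MyData(NamedTuple):
    partitioning_key: str
    clustering_key: int
    other_value: str


def group_by_partition_key(rows: List[MyData], max_batch_size: int = 50) -> List[List[MyData]]:
    """Split rows into batches that each contain a single partition key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.partitioning_key].append(row)

    batches = []
    for items in groups.values():
        # Cap the batch size so that one hot partition does not produce a huge batch.
        for start in range(0, len(items), max_batch_size):
            batches.append(items[start:start + max_batch_size])
    return batches


if __name__ == "__main__":
    rows = [
        MyData("a", 1, "x"),
        MyData("b", 1, "y"),
        MyData("a", 2, "z"),
    ]
    for batch in group_by_partition_key(rows):
        print(batch)
```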
If you wish to go further and mimic Cassandra's hashing, look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class; it has a getTokenRanges method whose result you can compare to the result of Murmur3Partitioner.getToken, or of whichever partitioner you configured in cassandra.yaml. I've never tried that myself though.
So, I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the partition key distribution. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/#foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
You can find sample code on the above site showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
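For readers who cannot open the links, the general shape of "many individual async writes with a cap on how many are in flight" looks roughly like the Python sketch below. It is not taken from the linked gists and does not use the Cassandra driver: the write_row argument is a hypothetical stand-in for whatever async execute call your driver provides, and a thread pool plays the role of the driver's futures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def write_all(rows, write_row, max_in_flight=8):
    """Submit one write per row, at most max_in_flight at a time.

    Returns (written_keys, failed_keys), each sorted for stable output.
    """
    written, failed = [], []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {pool.submit(write_row, row): row for row in rows}
        for future in as_completed(futures):
            row = futures[future]
            try:
                future.result()  # re-raises any exception from write_row
                written.append(row["key"])
            except Exception:
                failed.append(row["key"])  # candidates for a retry
    return sorted(written), sorted(failed)


if __name__ == "__main__":
    def fake_write(row):
        # Hypothetical write that fails for one key, to show the failure path.
        if row["key"] == "k3":
            raise IOError("simulated write timeout")

    rows = [{"key": f"k{i}", "value": i} for i in range(1, 6)]
    print(write_all(rows, fake_write))
```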
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The
coordinator doesn’t have any extra work (as for multi partition
writes) because everything goes into a single partition. Single
partition batches are optimized: they are applied with a single
RowMutation [10].
In a few words: single partition batches don’t put much more load on
the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this
very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That
batch log is replicated to two other nodes in case the coordinator
fails. If the coordinator fails then another replica for the batch log
will take over. [..] The coordinator has to do a lot more work than
any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate of this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6

Spark streaming with Kafka - createDirectStream vs createStream

We have been using spark streaming with kafka for a while and until now we were using the createStream method from KafkaUtils.
We just started exploring the createDirectStream and like it for two reasons:
1) Better/easier "exactly once" semantics
2) Better correlation of kafka topic partition to rdd partitions
I did notice that createDirectStream is marked as experimental. The question I have is (sorry if this is not very specific):
Should we explore the createDirectStream method if exactly-once is very important to us? It would be great if you could share your experience with it. Are we running the risk of having to deal with other issues, such as reliability?
There is a great, extensive blog post by the creator of the direct approach (Cody) here.
In general, reading the Kafka delivery semantics section, the last part says:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
This basically means "we give you at least once out of the box, if you want exactly once, that's on you". Further, the blog post talks about the guarantee of "exactly once" semantics you get from Spark with both approaches (direct and receiver based, emphasis mine):
Second, understand that Spark does not guarantee exactly-once
semantics for output actions. When the Spark streaming guide talks
about exactly-once, it’s only referring to a given item in an RDD
being included in a calculated value once, in a purely functional
sense. Any side-effecting output operations (i.e. anything you do in
foreachRDD to save the result) may be repeated, because any stage of
the process might fail and be retried.
Also, this is what the Spark documentation says about receiver based processing:
The first approach (Receiver based) uses Kafka’s high level API to store consumed
offsets in Zookeeper. This is traditionally the way to consume data
from Kafka. While this approach (in combination with write ahead logs)
can ensure zero data loss (i.e. at-least once semantics), there is a
small chance some records may get consumed twice under some failures.
This basically means that if you're using the receiver-based stream with Spark, you may still get duplicated data if the output transformation fails; the guarantee is at-least-once.
In my project I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly-once semantics, you can store the offsets along with the data in a transaction-like fashion, so that if one fails the other fails as well.
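As a rough illustration of storing offsets together with the data, here is a self-contained Python sketch that uses the standard-library sqlite3 module in place of a real output store. The table names, the function names and the fail_after parameter (which simulates a crash in the middle of a batch) are all assumptions for the example; none of it is Spark or Kafka API.

```python
import sqlite3


def create_tables(conn):
    conn.execute(
        "CREATE TABLE results (topic TEXT, part INTEGER, msg_offset INTEGER, payload TEXT)"
    )
    conn.execute(
        "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER, "
        "PRIMARY KEY (topic, part))"
    )


def committed_offset(conn, topic, part):
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?", (topic, part)
    ).fetchone()
    return row[0] if row else 0


def store_batch(conn, topic, part, from_offset, messages, fail_after=None):
    """Write the messages and the new offset in one transaction.

    Returns False without writing if the batch does not start at the
    committed offset, which is the case when a batch is replayed.
    """
    if committed_offset(conn, topic, part) != from_offset:
        return False
    with conn:  # commit if the block succeeds, roll back if it raises
        for i, payload in enumerate(messages):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated crash mid-batch")
            conn.execute(
                "INSERT INTO results VALUES (?, ?, ?, ?)",
                (topic, part, from_offset + i, payload),
            )
        conn.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
            (topic, part, from_offset + len(messages)),
        )
    return True
```

If the batch fails halfway, neither the rows nor the offset are stored, so the retry starts from the same offset. If the batch succeeded and is replayed anyway, the offset check turns the replay into a no-op.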
I recommend reading the blog post (link above) and the Delivery Semantics in the Kafka documentation page. To conclude, I definitely recommend you look into the direct stream approach.

Synchronization between Spark RDD partitions

Say that I have an RDD with 3 partitions and I want to run each executor/ worker in a sequence, such that, after partition 1 has been computed, then partition 2 can be computed, and after 2 is computed, finally, partition 3 can be computed. The reason I need this synchronization is because each partition has a dependency on some computation of a previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited for the Spark framework.
I have pondered opening a JDBC connection in each worker task node as illustrated below:
rdd.foreachPartition( partition => {
// 1. open jdbc connection
// 2. poll database for the completion of dependent partition
// 3. read dependent edge case value from computed dependent partition
// 4. compute this partition
// 5. write this edge case result to database
// 6. close connection
})
I have even pondered using accumulators, picking the acc value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really the issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get no benefit from in-memory processing. In its current form it doesn't make sense to use Spark at all.
Generally speaking if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy but you can try something like this:
1) Create the first RDD with data from the first partition. Process in parallel and optionally push results outside
2) Compute the differential buffer
3) Create the second RDD with data from the second partition. Merge with the differential buffer from step 2, process, optionally push results to the database.
4) Go back to step 2 and repeat
What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
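The loop described above can be sketched in plain Python, with lists standing in for RDDs. This is only an illustration of the control flow: the choice of a running total as the "differential buffer" and the function names are assumptions, and in Spark the work inside process_partition would be a parallel transformation over one chunk of data rather than a local loop.

```python
from itertools import accumulate


def process_partition(values, carry_in):
    """Process one chunk, using the value carried over from the previous chunk.

    Here the carried value is a running total; in a real job it would be
    whatever edge-case result the next chunk depends on.
    """
    out = [carry_in + partial for partial in accumulate(values)]
    carry_out = out[-1] if out else carry_in
    return out, carry_out


def process_in_sequence(partitions, initial=0):
    """Driver-side loop: chunks run one after another, the buffer stays in memory."""
    carry = initial
    results = []
    for values in partitions:
        out, carry = process_partition(values, carry)
        results.append(out)
    return results


if __name__ == "__main__":
    print(process_in_sequence([[1, 2], [3], [4, 5]]))
```

The buffer passed between iterations lives in the driver, so no database round trip is needed to hand a result from one chunk to the next.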
