Storm like structure in Apache Spark - apache-spark

You know how in Apache Storm you can have a Spout streaming data to multiple Bolts. Is there a way to do something similar in Apache Spark?
I basically want that there be one program to read data from a Kafka Queue and output it to 2 different programs, which can then process it in their own, different ways.
Specifically, there would be a reader program that would read data from the Kafka queue and output it to 2 programs x and y. x would process the data to calculate metrics of one kind (in my case it would calculate the user activities) whereas y would calculate metrics of another kind (in my case this would be checking activities based on different devices).
Can someone help me understand how this is possible in Spark?

Why don't you simply create two topologies?
Both topologies have a spout reading from the kafka topic (yes, you can have multiple topologies reading from the same topic; I have this running on production systems). Make sure that you use different spout configs, otherwise kafka-zookeper will see both topologies as being the same. Have a look at the documentation here.
Spoutconfig is an extension of KafkaConfig that supports additional fields with ZooKeeper connection info and for controlling behavior specific to KafkaSpout. The Zkroot will be used as root to store your consumer's offset. The id should uniquely identify your spout.
public SpoutConfig(BrokerHosts hosts, String topic, String zkRoot, String id);
Implement program x in topology x and program y in topology y.
Another option would have two graphs of bolts subscribing from the same spout, but IMHO this is not optimal, because failed tuples (which are likely to fail only in one single graph) would be replayed to both graphs, event if they fail in only one of the graphs; and therefore some kafka messages will end being processed twice, using tow separated topologies you avoid this.

Related

Handling Out-Of-Order Event Windowing in Apache Beam from a Multitenant Kafka Topic

I’ve been mulling over how to solve a given problem in Beam and thought I’d reach out to a larger audience for some advice. At present things seem to be working sparsely and I was curious if someone could provide a sounding-board to see if this workflow makes sense.
The primary high-level goal is to read records from Kafka that may be out of order and need to be windowed in Event Time according to another property found on the records and eventually emitting the contents of those windows and writing them out to GCS.
The current pipeline looks roughly like the following:
val partitionedEvents = pipeline
.apply("Read Events from Kafka",
KafkaIO
.read<String, Log>()
.withBootstrapServers(options.brokerUrl)
.withTopic(options.incomingEventsTopic)
.withKeyDeserializer(StringDeserializer::class.java)
.withValueDeserializerAndCoder(
SpecificAvroDeserializer<Log>()::class.java,
AvroCoder.of(Log::class.java)
)
.withReadCommitted()
.commitOffsetsInFinalize()
// Set the watermark to use a specific field for event time
.withTimestampPolicyFactory { _, previousWatermark -> WatermarkPolicy(previousWatermark) }
.withConsumerConfigUpdates(
ImmutableMap.of<String, Any?>(
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest",
ConsumerConfig.GROUP_ID_CONFIG, "log-processor-pipeline",
"schema.registry.url", options.schemaRegistryUrl
)
).withoutMetadata()
)
.apply("Logging Incoming Logs", ParDo.of(Events.log()))
.apply("Rekey Logs by Tenant", ParDo.of(Events.key()))
.apply("Partition Logs by Source",
// This is a custom function that will partition incoming records by a specific
// datasource field
Partition.of(dataSources.size, Events.partition<KV<String, Log>>(dataSources))
)
dataSources.forEach { dataSource ->
// Store a reference to the data source name to avoid serialization issues
val sourceName = dataSource.name
val tempDirectory = Directories.resolveTemporaryDirectory(options.output)
// Grab all of the events for this specific partition and apply the source-specific windowing
// strategies
partitionedEvents[dataSource.partition]
.apply(
"Building Windows for $sourceName",
SourceSpecificWindow.of<KV<String, Log>>(dataSource)
)
.apply("Group Windowed Logs by Key for $sourceName", GroupByKey.create())
.apply("Log Events After Windowing for $sourceName", ParDo.of(Events.logAfterWindowing()))
.apply(
"Writing Windowed Logs to Files for $sourceName",
FileIO.writeDynamic<String, KV<String, MutableIterable<Log>>>()
.withNumShards(1)
.by { row -> "${row.key}/${sourceName}" }
.withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.fn(SerializableFunction { logs -> Files.stringify(logs.value) }), TextIO.sink())
.to(options.output)
.withNaming { partition -> Files.name(partition)}
.withTempDirectory(tempDirectory)
)
}
In a simpler, bulleted form, it might look like this:
Read records from single Kafka topic
Key all records by their tenant
Partition stream by another event properly
Iterate through known partitions in previous step
Apply custom windowing rules for each partition (related to datasource, custom window rules)
Group windowed items by key (tenant)
Write tenant-key pair groupings to GCP via FileIO
The problem is that the incoming Kafka topic contains out-of-order data across multiple tenants (e.g. events for tenant1 might be streaming in now, but then a few minutes later you’ll get them for tenant2 in the same partition, etc.). This would cause the watermark to bounce back and forth in time as each incoming record would not be guaranteed to continually increase, which sounds like it would be a problem, but I'm not certain. It certainly seems that while data is flowing through, some files are simply not being emitted at all.
The custom windowing function is extremely simple and was aimed to emit a single window once the allowed lateness and windowing duration has elapsed:
object SourceSpecificWindow {
fun <T> of(dataSource: DataSource): Window<T> {
return Window.into<T>(FixedWindows.of(dataSource.windowDuration()))
.triggering(Never.ever())
.withAllowedLateness(dataSource.allowedLateness(), Window.ClosingBehavior.FIRE_ALWAYS)
.discardingFiredPanes()
}
}
However, it seemed inconsistent since we'd see logging come out after the closing of the window, but not necessarily files being written out to GCS.
Does anything seem blatantly wrong or incorrect with this approach? Since the data can come in out of order within the source (i.e. right now, 2 hours ago, 5 minutes from now) and covers data across multiple tenants, but the aim is try and ensure that one tenant that keeps up to date won't drown out tenants that might come in the past.
Would we potentially need another Beam application or something to "split" this single stream of events into sub-streams that are each processed independently (so that each watermark processes on their own)? Is that where a SplittableDoFn would come in? Since I'm running on the SparkRunner, which doesn't appear to support that - but it seems as though it'd be a valid use case.
Any advice would be greatly appreciated or even just another set of eyes. I'd be happy to provide any additional details that I could.
Environment
Currently running against SparkRunner
While this may not be the most helpful response, I'll be transparent as far as the end result. Eventually the logic required for this specific use-case extended far beyond the built-in capabilities of those in Apache Beam, primarily in the area around windowing/governance of time.
The solution that was landed on was to switch the preferred streaming technology from Apache Beam to Apache Flink, which as you might imagine was quite a leap. The stateful-centric nature of Flink allowed us to more easily handle our use cases, define custom eviction criteria (and ordering) around windowing, while losing a layer of abstraction over it.

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, Kappa architecture is some kind of simplification of Lambda architecture. Kappa doesn't need batch layer, instead speed layer have to guarantee computation precision and enough throughput (more parallelism/resources) on historical data re-computation.
Still Kappa architecture requires two serving layers in case when you need to do analytic based on historical data. For example, data that have age < 2 weeks are stored at Redis (streaming serving layer), while all older data are stored somewhere at HBase (batch serving layer).
When (due to Kappa architecture) I have to insert data to batch serving layer?
If streaming layer inserts data immidiately to both batch & stream serving layers - than how about late data arrival? Or streaming layer should backup speed serving layer to batch serving layer on regular basis?
Example: let say source of data is Kafka, data are processed by Spark Structured Streaming or Flink, sinks are Redis and HBase. When write to Redis & HBase should happen?
If we perform stream processing, we want to make sure that output data is firstly made available as a data stream. In your example that means we write to Kafka as a primary sink.
Now you have two options:
have secondary jobs that reads from that Kafka topic and writes to Redis and HBase. That is the Kafka way, in that Kafka Streams does not support writing directly to any of these systems and you set up a Kafka connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operations overhead. (That's a bit of the backup option that you mentioned).
with Spark and Flink you also have the option to have secondary sinks directly in your job. You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example in Flink, you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of 2 or 3.
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means, windowed aggregations emit results as soon as they have any valid data point and update that result while receiving new data. Thus, any late event is processed and yield an updated result. While this approach nicely lowers the burden to users, as they do not need to understand watermarks, it has some severe short-comings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges to downward users to process intermediate results as also explained in the article about Suppression. If it's not obvious if a result is temporary or "final" (in the sense that all expected events have been processed) then many applications are much harder to implement. In particular, windowing operations need to be replicated on consumer side to get the "final" value.
Another issue is that the data volume is blown up. If you'd have a strong aggregation factor, using watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement will add a constant volume factor as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag continuously or periodically. That approach simulates the watermark-based emission again on consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tweaked that a late event is a rare case though.

Order Guarantee with Sparking Streaming

I am trying to get some change event from Kafka that I would like to propagate downstream in another system. However the Change order matters. Hence I wonder what is the appropriate way to do that with some Spark transformation in the middle.
The only thing I see is to loose the parallelism and make the DStream on one partition. Maybe there is a way to do operation in parallel and bring everything back in one partition and then send it to the external system or back in Kafka and then use a Kafka Sink for the matter.
What approach can I try?
In a distributed environment, with some form of cashing/buffering at most layer, message generated from same machine may reach back-end in different order. Also the definition of order is subjective. Implementing a global definition of order will be restrictive (may not be correct) for the data as a whole.
So, Kafka is meant for keeping the data in order in the order of put but partition comes as a catch!!! Partition defines the level of parallelism per topic.
Typically, the level of abstraction at which kafka is kept, it should not bother much about order. It should be optimised for maximum throughput, where partitioning will come handy!!! Consider ordering just a side effect of supporting streaming!!!
Now, what ever logic ensures, that data is put in to kafka in order, that makes more sense in your application (spark job).

What is the most simple way to write to kafka from spark stream

I would like to write to kafka from spark stream data.
I know that I can use KafkaUtils to read from kafka.
But, KafkaUtils doesn't provide API to write to kafka.
I checked past question and sample code.
Is Above sample code the most simple way to write to kafka?
If I adopt way like above sample, I must create many classes...
Do you know more simple way or library to help to write to kafka?
Have a look here:
Basically this blog post summarise your possibilities which are written in different variations in the link you provided.
If we will look at your task straight forward, we can make several assumptions:
Your output data is divided to several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions your set of solution is pretty limited: You whether have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of Factory / Sink but the essential operation will remain the same : You'll still request a producer object for each partition and use it to send the partition records.
I'll suggest you continue with one of the examples in the provided link, the code is pretty short, and any library you'll find would most probably do the exact same thing behind the scenes.

Spark: Importing Data

I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double check that since all machines on the cluster can access the files (which is a requirement by spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing maximum amount of my work transforming and performing actions on RDDs than converting them into arrays or some other data structure and then implementing the logic like I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.

Resources