How to use DataTorrent in a Kappa architecture?

I have read a lot about Lambda and Kappa architectures, where we need to use either Apache Spark or Apache Storm. I just discovered a new tool called DataTorrent which can do both batch and real-time processing. I was wondering whether DataTorrent can serve as both the batch and speed layers of a Lambda (or Kappa) architecture at the same time?
Cheers,

Apache Apex (DataTorrent RTS) allows your team to develop, test, debug, and operate on a single processing framework.
Although there is no explicit mention of Kappa architecture in the Apache Apex documentation, IMO it can be used to serve a Kappa architecture.
Apache Apex provides built-in support for fault tolerance, checkpointing, and recovery, so you can rely on a single dataflow DAG in Apex to get reliable results with low latency. There is no need for separate batch and speed layers when you define your application as a DAG on Apex.
Note, however, that Apache Apex is an example of a stream computation engine. A complete Kappa architecture is a combination of:
log store + stream computation engine + serving-layer store.

DataTorrent can be used to serve Kappa architecture requirements. You can process your batch data and your real-time stream data at the same time.
DataTorrent uses a continuous-flow model in which batch data flows like a stream through the DAG, unlike Spark, where streaming data flows in (micro-)batches.
You may need to feed in data from different input sources through different operator ports; the in-memory computation on that data is handled by the platform's calls on those ports.
It is just like having a sink (an operator in DataTorrent) fed by two pipes (input ports).
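To illustrate the "one code path for both layers" idea outside of any particular API, here is a minimal, hypothetical Python sketch of the Kappa principle: replayed historical records and live records are pushed through the same processing function, rather than maintaining separate batch and speed implementations. The sources and field names are made up, and this is an analogy, not Apex code.

```python
from itertools import chain

# Hypothetical sources: a replayed historical log and a live feed.
historical_events = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]
live_events = [{"user": "a", "amount": 7}]

def process(event, totals):
    """Single piece of business logic applied to every event, old or new."""
    totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals

totals = {}
# Kappa style: replay the log first, then keep consuming the live stream,
# all through the same operator ("a sink fed by two pipes").
for event in chain(historical_events, live_events):
    totals = process(event, totals)

print(totals)  # {'a': 17, 'b': 5}
```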

Related

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I have had to add a bit to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
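For reference, a minimal sketch of what the stronger-delivery setup can look like with the confluent-kafka Python client and Kafka 0.11+ brokers (broker address, topic name, and file path are placeholders, and this is not the code I actually run):

```python
from confluent_kafka import Producer

# Idempotent producer: broker-side deduplication means retries cannot create
# duplicates (requires Kafka 0.11.0+). True end-to-end exactly-once still needs
# care on the consumer side as well.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,  # implies acks=all and bounded in-flight retries
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

with open("app.log") as log:
    for line in log:
        producer.produce("client-logs", value=line.encode(), on_delivery=on_delivery)

producer.flush()  # wait for outstanding deliveries and run the callbacks
```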
Overall, think of Kafka as a more low-level solution with logical message domains and queues, and (from what I skimmed) Apex as a more heavily packaged library with a lot more things to explore.
Kafka would allow you to swap in the underlying analytical system of your choosing via its consumer API.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which in the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
Batch vs. streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
What latency is required? That is, what's the maximum time it should take an update to propagate through the system? The answer to this question influences question 1.
How much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have a different "maximum altitude".
And what format? Do you have text files, or are you pulling from relational DBs?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on data size (question 3), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); a rough Spark sketch follows below.
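As a rough illustration of the Spark side of that trade-off, deduplicating by a customer ID is a one-liner, but it implies a shuffle under the hood. Column names and S3 paths here are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-dedup").getOrCreate()

customers = spark.read.json("s3://my-bucket/raw/customers/")  # hypothetical path

# Exact duplicates by ID: keeps one arbitrary row per customer_id.
# Behind the scenes Spark groups by the key, i.e. a shuffle (roughly O(n log n)),
# unlike the constant-time lookup you would get in a key-value store.
deduped = customers.dropDuplicates(["customer_id"])

deduped.write.mode("overwrite").parquet("s3://my-bucket/clean/customers/")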
So, while you ponder all these questions, if you're not sure, I'd recommend starting your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters in the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
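A minimal sketch of that pay-per-query workflow with boto3 (the bucket, database, and table names are invented for illustration):

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# The data sits in S3 as CSV/Parquet; the table was defined once in Athena's catalog.
query = """
    SELECT customer_id, count(*) AS dupes
    FROM customers_raw
    GROUP BY customer_id
    HAVING count(*) > 1
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "crm"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("query id:", response["QueryExecutionId"])
# Poll get_query_execution() until the state is SUCCEEDED, then read the
# result file from S3 or call get_query_results().
```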
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html), or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigQuery, etc.) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex comes with the Apache Malhar library, which can help you build an application quickly: load data using the JDBC input operator and then either store it to your cloud storage (maybe S3) or de-duplicate it before storing it to any sink. Malhar also provides a Dedup operator for exactly this kind of operation. But, as mentioned in the previous reply, Apex does need Hadoop underneath to function.

MQTT + Spark Streaming and DynamoDB

I am trying to design an IoT platform using the above-mentioned technologies. I would be happy if someone could comment on the architecture and whether it is good and scalable!
I get IoT sensor data through MQTT, which I will receive through Spark Streaming (there is an MQTT connector for Spark Streaming which does this). I only have to subscribe to the topics; a third-party server publishes the IoT data to the topic.
Then I parse the data and insert it into AWS DynamoDB. Yes, the whole setup will run on AWS.
I may have to process/transform the data in the future depending on the IoT use cases, so I thought Spark might be useful. Also, I have heard Spark Streaming is blazing fast.
It's a simple overview and I am not sure if it's a good architecture. Would it be overkill to use Spark Streaming? Are there other ways to directly store data received from MQTT in DynamoDB?
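For scale, one "direct" alternative with no Spark in between is a plain MQTT subscriber that writes each message to DynamoDB. A minimal sketch assuming the paho-mqtt and boto3 libraries; the broker address, topic, table, and key schema below are assumptions, not part of the question:

```python
import json
import boto3
import paho.mqtt.client as mqtt

table = boto3.resource("dynamodb", region_name="eu-west-1").Table("sensor_readings")

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)
    # Assumes the table uses device_id as partition key and timestamp as sort key.
    table.put_item(Item={
        "device_id": reading["device_id"],
        "timestamp": reading["timestamp"],
        "value": str(reading["value"]),  # store as string to avoid Decimal handling
    })

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.com", 1883)
client.subscribe("sensors/#")
client.loop_forever()
```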
I cannot state whether your components will result in a scalable architecture, since you did not elaborate on how you will scale them, on the estimated load such a system should handle, or on whether there will be peaks in load.
If you are talking about scalability in terms of performance, you should also consider scalability in terms of pricing which may be important to your project.
For instance, DynamoDB is a very scalable NoSQL database service which offers elastic performance with very efficient pricing. I do not know much about Apache Spark, and even though it has been designed to be very efficient at scale, how will you distribute incoming data? Will you host multiple instances on EC2 and use auto-scaling to manage them?
My advice would be to break your needs down into components in order to conduct a successful analysis. To summarize your statements:
You need to ingest incoming sensor telemetry at scale using MQTT.
You need to transform or enrich these data on the fly.
You need to insert these data (probably as time-series) into DynamoDB in order to build an event-sourcing system.
Since you mentioned Apache Spark, I imagine you would need to perform some analysis of these data, either in near real-time, or in batch, to build value out of your data.
My advice would be to use serverless, managed services in AWS so that you pay only for what you really use, can forget about maintenance and scalability, and can focus on your project.
AWS IoT is a platform built into AWS which will allow you to securely ingest data at any scale using MQTT.
This platform also embeds a rules engine, which will allow you to build your business rules in the cloud; for example, intercepting incoming messages, enriching them, and calling other AWS services as a result (e.g., calling a Lambda function to do some processing on the ingested data).
The rules engine has a native connector to DynamoDB, which will allow you to insert your enriched or transformed data into a table.
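As a sketch of what such a rule could look like when created with boto3 (the rule name, topic filter, table name, and role ARN are placeholders, and your IAM setup may differ):

```python
import boto3

iot = boto3.client("iot", region_name="eu-west-1")

# Rule: take every message published on sensors/<device>/telemetry and put it
# into a DynamoDB table via the dynamoDBv2 action (one attribute per JSON field).
iot.create_topic_rule(
    ruleName="SensorToDynamo",
    topicRulePayload={
        "sql": "SELECT * FROM 'sensors/+/telemetry'",
        "ruleDisabled": False,
        "actions": [{
            "dynamoDBv2": {
                "roleArn": "arn:aws:iam::123456789012:role/iot-dynamo-write",
                "putItem": {"tableName": "sensor_readings"},
            }
        }],
    },
)
```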
The rules engine also has a connector to the new Amazon Machine Learning service, if you want to get predictions on sensor data in real time.
You can then use other services such as EMR + Spark to batch-process your data once a day, week, month.
The advantage here is that you assemble your components and use them as you go, meaning that you do not need the full featured stack when you are beginning, but still have the flexibility of making changes in the future.
An overview of the AWS IoT service.

What are the benefits of Apache Beam over Spark/Flink for batch processing?

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing.
Looking at the Beam word count example, it feels very similar to the native Spark/Flink equivalents, maybe with a slightly more verbose syntax.
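For reference, the core of a Beam word count in the Python SDK looks roughly like this (file paths are placeholders, and this is a condensed sketch rather than the full official example):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText("input.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Count" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
     | "Write" >> beam.io.WriteToText("counts"))
```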
I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far:
Pro: Abstraction over different execution backends.
Con: This abstraction comes at the price of having less control over what exactly is executed in Spark/Flink.
Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance?
Note that I'm not asking for differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated due to Spark 1.X).
There are a few things that Beam adds over many of the existing engines.
Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.
APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecifying the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.
Portability across runtimes: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud.
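Concretely, switching engines is mostly a matter of pipeline options rather than code changes. A minimal sketch with the Python SDK (which runners are available, and the extra flags each needs such as a Flink master address, depend on your installation):

```python
from apache_beam import Pipeline
from apache_beam.options.pipeline_options import PipelineOptions

# Same pipeline definition; only the runner (and its flags) changes:
#   DirectRunner locally, FlinkRunner / SparkRunner on a cluster,
#   DataflowRunner on the managed cloud service.
options = PipelineOptions(["--runner=DirectRunner"])

with Pipeline(options=options) as p:
    ...  # the exact same transforms as before
```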
Designing the Beam model to be a useful abstraction over many, different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.
Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.
And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model.
This also means that the capabilities will not always be exactly the same across different Beam runners at a given point in time. That's why we're using the capability matrix to try to clearly communicate the state of things.
I have a disadvantage to report, not a benefit. We had a leaky-abstraction problem with Beam: when an issue needs to be debugged, we have to learn about the underlying runner and its API (Flink, in this case) to understand it. This doubles the learning curve, since you have to learn about Beam and Flink at the same time. We ended up switching the later-developed pipelines to Flink.
Helpful information can be found here: https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
Quoting from that post:
Beam provides a unified API for both batch and streaming scenarios.
Beam comes with native support for different programming languages, like Python or Go with all their libraries like Numpy, Pandas, Tensorflow, or TFX.
You get the power of Apache Flink like its exactly-once semantics, strong memory management and robustness.
Beam programs run on your existing Flink infrastructure or infrastructure for other supported Runners, like Spark or Google Cloud Dataflow.
You get additional features like side inputs and cross-language pipelines that are not supported natively in Flink but are only available when using Beam with Flink.

Difference between DSMS, Storm and Flink

DSMS corresponds to Data Stream Management Systems. These systems allow users to submit queries that will be continuously executed until being removed by the user.
Can systems such as Storm and Flink be seen as DSMS or are they something more generic?
Thanks
The two types of systems are largely orthogonal to each other, as they try to solve different use cases. Thus, neither subsumes, nor is a generalization of, the other.
DSMS are usually:
end-to-end solutions providing storage and computation as a unified solution
require external data to be imported into the system first
often SQL-oriented, which makes them easy to use, but often less expressive
usually able to handle only structured data (schema-based tuple format)
often do not scale
Stream Processing Frameworks (Flink, Storm, Spark):
only provide a computation layer and consume data from other storage systems
most offer a language-embedded DSL (some also offer SQL to some extent)
can handle any type of data (flat tuples, JSON, XML, flat files, text)
built to scale to large clusters (many hundreds of nodes)
good for data crunching, machine learning
Streaming Platform (Kafka):
provides a storage layer and computation
can handle any type of data as long as imported into the system (flat tuples, JSON, XML, flat files, text)
scalable and elastic
no SQL, only a Java DSL (the Confluent Platform, which is based on Kafka, offers KSQL as a developer preview)
very good for building microservices

What does "streaming" mean in Apache Spark and Apache Flink?

When I went to the Apache Spark Streaming website, I saw this sentence:
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
And on the Apache Flink website, there is this sentence:
Apache Flink is an open source platform for scalable batch and stream data processing.
What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?
Streaming data analysis (in contrast to "batch" data analysis) refers to a continuous analysis of a typically infinite stream of data items (often called events).
Characteristics of Streaming Applications
Stream data processing applications are typically characterized by the following points:
Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.
Streaming applications frequently concern themselves with the latency of results. The latency is the delay between the creation of an event and the point when the analysis application has taken that event into account.
Because streams are infinite, many computations refer not to the entire stream but to a "window" over the stream. A window is a view of a sub-sequence of the stream's events (such as the last 5 minutes). An example of a real-world window statistic is the "average stock price over the past 3 days" (see the sketch after this list).
In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
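As a small illustration of the window idea from the list above, here is a sketch of the "average price over the past 3 days" statistic using the Beam Python SDK (the in-memory list of ticks merely stands in for a real unbounded source, and the symbols, prices, and timestamps are made up):

```python
import apache_beam as beam
from apache_beam.transforms import window

# Hypothetical (symbol, price) ticks with unix event-time timestamps.
ticks = [
    ("ACME", 10.0, 1_600_000_000),
    ("ACME", 12.0, 1_600_086_400),  # one day later
    ("ACME", 11.0, 1_600_172_800),  # two days later
]

with beam.Pipeline() as p:
    (p
     | beam.Create(ticks)
     # Attach the event time, so windows are based on when the tick happened.
     | beam.Map(lambda t: window.TimestampedValue((t[0], t[1]), t[2]))
     # 3-day windows sliding forward one day at a time:
     # "average price over the past 3 days".
     | beam.WindowInto(window.SlidingWindows(size=3 * 24 * 3600, period=24 * 3600))
     | beam.CombinePerKey(beam.combiners.MeanCombineFn())
     | beam.Map(print))
```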
Examples of Streaming Applications
Typical examples of stream data processing applications are:
Fraud Detection: The application tries to figure out whether a transaction fits with the behavior that has been observed before. If it does not, the transaction may indicate an attempted misuse. This is typically a very latency-critical application.
Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.
Online Recommenders: If not much past behavioral information is available about a user visiting a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations right away.
Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is a sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.
There are many more ...
