As I went to Apache Spark Streaming Website, I saw a sentence:
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
And in Apache Flink Website, there is a sentence:
Apache Flink is an open source platform for scalable batch and stream data processing.
What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?
Streaming data analysis (in contrast to "batch" data analysis) refers to a continuous analysis of a typically infinite stream of data items (often called events).
Characteristics of Streaming Applications
Stream data processing applications are typically characterized by the following points:
Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.
Streaming applications frequently concern themselves with the latency of results. The latency is the delay between the creation of an event and the point when the analysis application has taken that event into account.
Because streams are infinite, many computations cannot refer to the entire stream, but only to a "window" over the stream. A window is a view of a sub-sequence of the stream's events (such as the last 5 minutes). An example of a real-world window statistic is the "average stock price over the past 3 days"; a minimal sketch of such a windowed computation follows this list.
In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
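To make the window idea concrete, here is a minimal Flink sketch of a sliding-window average, loosely in the spirit of the "average stock price over the past 3 days" example. The socket source, the "symbol,price" line format, and the 3-day/1-hour window are illustrative assumptions; processing-time windows are used to keep the sketch short (event-time windows would additionally need timestamps and watermarks).

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedAverageSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read lines like "ACME,123.45" from a socket and parse them into (symbol, price) pairs.
        DataStream<Tuple2<String, Double>> prices = env
                .socketTextStream("localhost", 9999)
                .map(line -> {
                    String[] parts = line.split(",");
                    return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.DOUBLE));

        prices
            .keyBy(t -> t.f0)                                                     // one window per stock symbol
            .window(SlidingProcessingTimeWindows.of(Time.days(3), Time.hours(1))) // "last 3 days", updated hourly
            .aggregate(new AggregateFunction<Tuple2<String, Double>, double[], Double>() {
                @Override public double[] createAccumulator() { return new double[]{0.0, 0.0}; } // {sum, count}
                @Override public double[] add(Tuple2<String, Double> v, double[] acc) { acc[0] += v.f1; acc[1] += 1; return acc; }
                @Override public Double getResult(double[] acc) { return acc[1] == 0 ? 0.0 : acc[0] / acc[1]; }
                @Override public double[] merge(double[] a, double[] b) { return new double[]{a[0] + b[0], a[1] + b[1]}; }
            })
            .print(); // results are emitted continuously as windows fire

        env.execute("windowed-average-sketch");
    }
}
```

The important point is only that the aggregation is always scoped to a finite window, never to the whole (infinite) stream.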
Examples of Streaming Applications
Typical examples of stream data processing applications are:
Fraud Detection: The application tries to figure out whether a transaction fits with the behavior that has been observed before. If it does not, the transaction may indicate an attempted misuse. This is typically a very latency-critical application (a small code sketch follows this list).
Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.
Online Recommenders: If not a lot of past behavior information is available on a user that visits a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations directly.
Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is a sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.
There are many more ...
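To sketch the fraud-detection idea from the list above in code: transactions are keyed by account, and a small piece of per-key state (a running average amount) is used to flag outliers event by event, with low latency. Everything here (the Transaction type, the hard-coded input, and the 10x-average rule) is an illustrative assumption, not a real fraud model.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FraudCheckSketch {

    // Illustrative event type: an account id and a transaction amount.
    public static class Transaction {
        public String accountId;
        public double amount;
        public Transaction() {}
        public Transaction(String accountId, double amount) { this.accountId = accountId; this.amount = amount; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny hard-coded stream; in a real application this would be a message-queue source.
        DataStream<Transaction> txns = env.fromElements(
                new Transaction("acct-1", 20.0),
                new Transaction("acct-1", 25.0),
                new Transaction("acct-1", 900.0),   // should be flagged against the running average
                new Transaction("acct-2", 5.0));

        txns.keyBy(t -> t.accountId)
            .process(new KeyedProcessFunction<String, Transaction, String>() {
                private transient ValueState<Double> avgAmount; // per-account running average

                @Override
                public void open(Configuration parameters) {
                    avgAmount = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("avgAmount", Double.class));
                }

                @Override
                public void processElement(Transaction txn, Context ctx, Collector<String> out) throws Exception {
                    Double avg = avgAmount.value();
                    if (avg != null && txn.amount > 10 * avg) {  // crude, illustrative outlier rule
                        out.collect("suspicious transaction of " + txn.amount + " on " + txn.accountId);
                    }
                    avgAmount.update(avg == null ? txn.amount : 0.9 * avg + 0.1 * txn.amount);
                }
            })
            .print();

        env.execute("fraud-check-sketch");
    }
}
```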
Related
I am actually planning the next version of our telemetry system architecture. I am strongly considering Pulsar as the messaging solution.
To better understand what this technology is best for, can someone share their use cases for the infinite retention of a topic, other than an audit trail?
My main goal is to see if our telemetry data could simply be stored in a Pulsar topic and queried for analytics purposes instead of using a time-series database like Apache Druid.
Thanks !
The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it heavily optimizes for slurping the log starting at some point. It is generally pretty unsuited to higher-volume queries or queries with strict latency requirements, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in Pulsar might not be that bad, especially if you'd be using Pulsar already to feed data into the time-series store (as you could then dispense with the time-series store).
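For the event-sourcing/replay case, the access pattern is exactly the one described above: open a reader at the earliest retained message and slurp the log forward. A minimal sketch with the Pulsar Java client, where the service URL and topic name are placeholders:

```java
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class ReplayTelemetry {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")          // placeholder broker URL
                .build();

        // A reader (unlike a subscription) lets you start at an arbitrary position,
        // here the very beginning of the retained log.
        Reader<byte[]> reader = client.newReader()
                .topic("persistent://public/default/telemetry") // placeholder topic
                .startMessageId(MessageId.earliest)
                .create();

        while (reader.hasMessageAvailable()) {
            Message<byte[]> msg = reader.readNext();
            // Rebuild state / feed an analytics job from each event here.
        }

        reader.close();
        client.close();
    }
}
```

Note that ad-hoc analytical queries would still amount to scanning the log this way, which is why the latency caveats above matter.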
We have a multi-tenant application where we maintain a message queue for each tenant. We have implemented a Flink job to process the stream data from the message queues. Basically, each of the message queues is a source in the Flink job. Is this the recommended way to do this? Or is it OK to run the same job (with one source) multiple times, based on the number of tenants? We expect that each of the tenants will produce data in different volumes. Will there be any scalability advantages in the multi-job approach?
Approaches
1. Single job with multiple sources
2. Run duplicates of the same job, each with one source
I think these approaches apply equally to Storm, Spark, or any other streaming platform.
Thank you
Performance-wise, approach 1 has the greatest potential: resources are better utilized across the different sources. Since they are different sources, though, the optimization potential of the query itself is limited. (A minimal sketch of this approach follows below.)
However, if we are really talking multi-tenant, I'd go with the second approach. You can assign much more fine-grained rights to the application (e.g., which Kafka topic can be consumed, which S3 bucket to write to). Since most application developers tend to develop GDPR-compliant workflows (even if the countries currently involved might not be affected), I'd go this route to stay on the safe side. This approach also has the advantage that you don't need to restart the jobs for every tenant when you add or remove one.
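For reference, a minimal sketch of approach 1 using Flink's (older) FlinkKafkaConsumer connector: one source per tenant queue, unioned into a single pipeline. Topic names, properties, and the processing step are placeholders; with many tenants you would build the sources in a loop from configuration rather than hard-coding them.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MultiTenantSingleJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder

        // One source per tenant queue (placeholder topic names)...
        DataStream<String> tenantA =
                env.addSource(new FlinkKafkaConsumer<>("tenant-a-events", new SimpleStringSchema(), props));
        DataStream<String> tenantB =
                env.addSource(new FlinkKafkaConsumer<>("tenant-b-events", new SimpleStringSchema(), props));

        // ...then a single pipeline over the union of all tenants.
        tenantA.union(tenantB)
               .map(event -> "processed: " + event) // placeholder processing
               .print();

        env.execute("multi-tenant-single-job");
    }
}
```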
Performance-wise, is there a big difference between having one Stream Analytics job with multiple queries and outputs (Azure tables in my case) vs. splitting these queries and outputs into multiple Stream Analytics jobs?
And if the difference is significant, how do I determine which scenario best suits my needs?
The doc states that:
Stream Analytics can handle up to 1 GB of incoming data per second.
but my concern is more about the processing of the data.
There are two things that will govern how far you can scale one stream analytics account:
The documented limits
How many Streaming Units you need to process your workload - see the FAQ at the bottom of this page.
Smaller windows processing lower volumes will consume fewer units. Like many of the other PaaS services, the measured unit is esoteric and your mileage may vary; it's best to build a small sample for your use case and then make a prediction of how it will scale.
I certainly would not build a separate streaming job for every case, but you may need to find a way to "partition" your requirement, maybe by use case/feature, by business domain, consuming system, etc.
What are the use-cases of Hazelcast Jet? Has anyone started using it?
Our project uses Hazelcast for a distributed map holding key-value pairs and for distributed computing on those keys, to run tasks on the node holding the key. We use the Near Cache solution as well.
I was curious to know how different Hazelcast Jet is and what problems it solves.
As of the current version (0.3), Jet's advantage over just submitting a Runnable to each partition is the ability to perform grouping by a key other than the one used in the Hazelcast map. For this to work in a distributed environment, you have to send each item to the processing unit responsible for its grouping key, and this is something that is easy to get from Jet.
Further from that, you can build a multistage cascade of groupBy operations, you can have forks in your data stream to reuse the same intermediate result in more than one way, you can build a pipeline where an I/O task distributes the processing of the data it reads across all CPU cores, etc... in short, all the advantages that a full-blown DAG computation engine offers.
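To illustrate the "group by a key other than the map key" point, here is a minimal sketch using Jet's later Pipeline API (which replaced the low-level DAG API of the 0.x releases). The map names and the Event type with its getUserId() accessor are assumptions for illustration.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class GroupByOtherKeySketch {

    // Illustrative value type stored in the source IMap (which is keyed by event id).
    public static class Event implements java.io.Serializable {
        private String userId;
        public Event() {}
        public Event(String userId) { this.userId = userId; }
        public String getUserId() { return userId; }
    }

    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        Pipeline p = Pipeline.create();
        // The IMap "events" is keyed by event id, but we group by a different key: the user id.
        // Jet repartitions the items across the cluster so each member handles its own user ids.
        p.readFrom(Sources.<String, Event>map("events"))
         .groupingKey(entry -> entry.getValue().getUserId())
         .aggregate(AggregateOperations.counting())          // events per user
         .writeTo(Sinks.map("event-counts-by-user"));

        jet.newJob(p).join();
        jet.shutdown();
    }
}
```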
By the time it reaches 1.0 Jet will also support fault-tolerant infinite stream processing, event time-based windows, and more.
2021 answer for use cases:
Change data capture streaming - Use Debezium/Hazelcast to detect changes to your database and stream to other microservices (if data is common), stream changes to a data lake, or update a search engine
Real-time analytics - Take a market data stream and perform technical analysis in real time, or analyze a Twitter stream
Async job processing - PDF conversion service
I read a lot about lambda and kappa architectures, where we need to use either Apache Spark or Apache Storm. I just discovered a new tool called DataTorrent which can do both batch and real-time processing. I was wondering if DataTorrent can serve, at the same time, as the batch and speed layers of a lambda (or kappa) architecture?
Cheers,
Apache Apex (or DataTorrent RTS) allows your team to develop, test, debug, and operate on a single processing framework.
Although there is no explicit mention of kappa architecture in the Apache Apex documentation, IMO it can be used to serve a kappa architecture.
Apache Apex provides built-in support for fault tolerance, checkpointing, and recovery. Thus, you can rely on a single dataflow DAG in Apex to get reliable results with low latencies. There is no need to have separate batch and speed layers when you define your application as a DAG on Apex.
But note that Apache Apex is an example of a stream computation engine. For a complete kappa architecture you would have a combination of
log stores + stream computation engine + serving-layer store.
DataTorrent can be used to serve kappa architecture requirements. You can process your batch data and real-time stream data at the same time.
DataTorrent uses a continuous-flow model where the batch data flows like a stream through the DAG, unlike Spark, where streaming data flows in micro-batches.
You may need to feed in your data from different input sources using different operator ports, and the in-memory computation on the data is taken care of by the platform's calls on the ports.
It is just like having a sink (operator in DT) fed by two pipes (input ports).
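For illustration, a hypothetical sketch of that shape with the Apex application API: two input operators (the "pipes") feeding one downstream operator (the "sink") in a single DAG. The operator classes and port names here are placeholders, not real Malhar operators; only dag.addOperator and dag.addStream are the actual API calls.

```java
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import org.apache.hadoop.conf.Configuration;

// Hypothetical application: BatchFileInput, LiveStreamInput, and MergeAndAggregate are
// placeholder operator classes; each would extend BaseOperator / implement InputOperator
// and expose the named ports as public fields.
public class TwoPipesOneSinkApp implements StreamingApplication {
    @Override
    public void populateDAG(DAG dag, Configuration conf) {
        // Two input operators: the two "pipes".
        BatchFileInput batchIn   = dag.addOperator("batchInput", new BatchFileInput());
        LiveStreamInput streamIn = dag.addOperator("streamInput", new LiveStreamInput());

        // One downstream operator with two input ports: the "sink".
        MergeAndAggregate merge  = dag.addOperator("merge", new MergeAndAggregate());

        // Wire the output ports of the sources to the two input ports of the merge operator.
        dag.addStream("batchFlow", batchIn.output, merge.batchPort);
        dag.addStream("liveFlow", streamIn.output, merge.streamPort);
    }
}
```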