Full join between multiple streams coming from different sources - hazelcast-jet

I am using Hazelcast Jet 0.6.1 for real-time analysis. There are multiple streams (mostly from remote journals) coming from different sources.
I would like to know whether a full join is supported between multiple streams.
If yes, could you please point me to some links/examples of a full join between multiple streams?

I think you need to elaborate a bit more on what you are trying to do. Streams are theoretically infinite, so the term "full join" has to mean something different than it does in a database.
There are several types of joins available in Jet. As Can said above, there is a merge operator, but you might be thinking more of a windowed join, where you time-bound the period of the join (a small sketch follows the links below).
Merging streams is here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
Window Concepts are here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#unbounded-stream-processing
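For reference, merging two event-journal streams with the Pipeline API looks roughly like this. This is a sketch only: the map names are placeholders, and exact method names (drawFrom vs. readFrom, mapJournal signatures) shifted between Jet releases, so check the manual for your version.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import com.hazelcast.jet.pipeline.StreamStage;

import java.util.Map.Entry;

import static com.hazelcast.jet.pipeline.JournalInitialPosition.START_FROM_OLDEST;

public class MergeTwoJournals {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Two independent event-journal sources ("ordersA"/"ordersB" are placeholder map names).
        StreamStage<Entry<String, String>> a =
                p.drawFrom(Sources.<String, String>mapJournal("ordersA", START_FROM_OLDEST));
        StreamStage<Entry<String, String>> b =
                p.drawFrom(Sources.<String, String>mapJournal("ordersB", START_FROM_OLDEST));

        // merge() interleaves the two streams into one; it is a union, not a relational join.
        a.merge(b)
         .drainTo(Sinks.map("mergedOrders"));

        JetInstance jet = Jet.newJetInstance();
        jet.newJob(p).join();
    }
}
```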

*This is in response to the comment on the first answer; it's too large for another comment, and I think the first answer is still relevant.
Is this the same data and data type, just from different nodes? Like app servers in a microservices architecture? It seems to me that you have a few options here, and they really come down to your preferred overall architecture, especially how you want to transport the events. A couple of thoughts:
You can simply merge streams from different data sources if that fits the use case:
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
If this is homogeneous data, just distributed across app servers, it might be a case where you use the Hazelcast client on each app server to put events into an IMap (shared by all the app servers) with an Event Journal on a Hazelcast cluster. Then Jet simply receives all the events from the Event Journal.
See: https://docs.hazelcast.org/docs/latest/manual/html-single/#event-journal
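A rough sketch of that option, assuming IMDG 3.x-era configuration (the map name, capacity, and event payload are placeholders; 4.x moved the journal config onto MapConfig):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.config.Config;
import com.hazelcast.config.EventJournalConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class EventJournalSketch {
    public static void main(String[] args) {
        // On the cluster side: enable the event journal for the "events" map.
        Config config = new Config();
        config.addEventJournalConfig(new EventJournalConfig()
                .setMapName("events")
                .setEnabled(true)
                .setCapacity(1_000_000));
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);

        // On each app server: a plain Hazelcast client writes into the shared map.
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> events = client.getMap("events");
        events.set("event-123", "{\"source\":\"app-server-1\",\"payload\":\"...\"}");
        // Jet then consumes everything via Sources.mapJournal("events", ...).
    }
}
```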
If you have Kafka available, perhaps you create a topic for the events from the servers and Jet receives the events from Kafka. Either way they are already merged when Jet gets them, so they are processed as one stream.
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#kafka
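And a sketch of the Kafka option using Jet's Kafka connector (the broker address and topic name are placeholders, and method names again depend on the Jet version):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

import java.util.Properties;

public class KafkaIntoJet {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("auto.offset.reset", "earliest");

        Pipeline p = Pipeline.create();
        // All app servers publish to the same topic, so Jet already sees one merged stream
        // ("app-events" is a placeholder topic name).
        p.drawFrom(KafkaSources.kafka(props, "app-events"))
         .drainTo(Sinks.logger());

        JetInstance jet = Jet.newJetInstance();
        jet.newJob(p).join();
    }
}
```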

Related

Apache Pulsar - use cases for infinite retention of a topic

I am currently planning the next version of our telemetry system architecture, and I am strongly considering Pulsar as the messaging solution.
To better understand what this technology is best for, can someone share their use cases for the infinite retention of a topic, other than an audit trail?
My main goal is to see whether our telemetry data could simply be stored in a Pulsar topic and queried for analytics purposes, instead of using a time-series database like Apache Druid.
Thanks!
The use case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you keep the events archived, the more freely you can remix your state.
With durable-log style storage, remember that it is heavily optimized for slurping the log starting at some point. That is generally a poor fit for higher-volume queries or queries with strict latency requirements, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in Pulsar might not be that bad, especially if you'd be using Pulsar to feed data into the time-series store anyway (since you could then dispense with the time-series store).
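For illustration, replaying a topic's full history with the Pulsar Java client's Reader API looks roughly like this (the service URL and topic name are placeholders):

```java
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class ReplayFromBeginning {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder broker URL
                .build();

        // A Reader positioned at the earliest retained message replays the whole
        // topic history, which is the "remix your state" case described above.
        Reader<byte[]> reader = client.newReader()
                .topic("telemetry")                      // placeholder topic name
                .startMessageId(MessageId.earliest)
                .create();

        while (reader.hasMessageAvailable()) {
            Message<byte[]> msg = reader.readNext();
            // rebuild state from each event here
        }
        reader.close();
        client.close();
    }
}
```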

Hazelcast Jet - Support for multiple clients

Can Hazelcast Jet be used to process millions of records with multiple clients accessing an event journal, where each client processes a portion of the records?
Furthermore, is it possible to accumulate the results processed by the different clients?
This is also an architectural question. To meet your aggregation need, you might, for example, have clients begin as individual streams, accumulate your aggregates there, and then join the streams for common processing.
Also, you have access to the underlying IMDG technology, which you can leverage. You have a free hand in how you build the overall architecture.
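To make the work-splitting and accumulation concrete, here is a rough sketch against the Jet Pipeline API of that era (the map and sink names are placeholders, and exact transform names such as drawFrom and rollingAggregate vary between Jet versions). The journal source and the keyed aggregation are partitioned across the cluster members, so the "each worker processes a portion of the records" split happens inside Jet rather than across separate clients.

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.util.Map.Entry;

import static com.hazelcast.jet.aggregate.AggregateOperations.counting;
import static com.hazelcast.jet.pipeline.JournalInitialPosition.START_FROM_OLDEST;

public class DistributedCount {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.<String, String>mapJournal("records", START_FROM_OLDEST))
         // Partition by key; each cluster member processes its share of the keys.
         .groupingKey(Entry::getKey)
         // Keep a continuously updated count per key (rollingAggregate emits the
         // running result, which suits an unbounded stream).
         .rollingAggregate(counting())
         .drainTo(Sinks.map("countsByKey"));

        JetInstance jet = Jet.newJetInstance();
        jet.newJob(p).join();
    }
}
```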

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data, I need to do some data analytics.
To achieve this goal, I'm searching for an open-source framework or cloud solution I can use. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs, such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I had to add a bit to get exactly-once delivery semantics (Kafka 0.11.0 should solve this; see the sketch after this answer).
Overall, think of Kafka as a lower-level solution with logical message domains and queues, and, from what I skimmed, Apex as a more heavily packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing via its consumer API.
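For reference, the Kafka 0.11+ duplicate protection mentioned above is mostly producer configuration; a minimal sketch (broker address and topic name are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class NoDuplicateRetriesProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Kafka 0.11+: the idempotent producer removes duplicates caused by retries;
        // acks=all waits for the full in-sync replica set.
        props.put("enable.idempotence", "true");
        props.put("acks", "all");
        // For atomic multi-partition writes you would also set a transactional.id
        // and use beginTransaction()/commitTransaction().

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("log-lines", "host-1", "a log line"));
        }
    }
}
```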
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which in the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
1) Batch vs. streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
2) What latency is required? That is, what is the maximum time an update should take to propagate through the system? The answer to this influences question 1).
3) How much data are we talking about? Are you in the Gbyte, Tbyte, or Pbyte range? Different tools have different 'maximum altitudes'.
4) In what format? Do you have text files, or are you pulling from relational DBs?
5) Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on the data size from question 3), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch after this answer.
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters in the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system in basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on which language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
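To make the dedup point from question 5) concrete, here is a toy Java sketch (the Customer shape with an ID and updatedAt field is an assumption about your data): keeping only the newest record per ID is a single hash-map pass with O(1) lookups, rather than a sort-based dedup.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupeById {

    record Customer(String id, String name, long updatedAt) {}

    // Keep only the newest record per ID: one pass with constant-time lookups
    // in a hash map, instead of an O(n log n) sort-based dedup.
    static Map<String, Customer> latestById(List<Customer> records) {
        Map<String, Customer> latest = new LinkedHashMap<>();
        for (Customer c : records) {
            latest.merge(c.id(), c, (a, b) -> a.updatedAt() >= b.updatedAt() ? a : b);
        }
        return latest;
    }
}
```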
Yes, you can use Apache Apex for your use case. Apache Apex is complemented by Apache Malhar, which can help you build an application quickly: load the data using the JDBC input operator and then either store it in your cloud storage (maybe S3) or de-duplicate it before storing it to any sink. Malhar also provides a Dedup operator for this kind of operation. But, as mentioned in the previous reply, Apex does need Hadoop underneath to function.

Can Amazon's Kinesis Client Library consume multiple streams?

I have a quick question: is the KCL able to consume from multiple streams? Should you ever set up multiple streams for your application, or is an individual stream supposed to be tied to an individual application? My particular use case is that I need to consume data produced by the backend and also by the frontend. One of these produces data at a much greater rate than the other, and for that reason I think they should produce into separate streams for processing. Is there a way to consume both streams from the same KCL process, or do I need to set up two? Thanks for your help!
The KCL is an open-source project that you could modify to consume events from multiple streams, but this is not recommended; it is better to keep things simple.
If you have two different event streams, you are better off with two different Kinesis streams, one for each. This allows you to scale each stream independently, as each has a different rate and possibly different peaks.
If you need to share information between the streams, you can share state between them using a DB such as DynamoDB or Redis.
Please note that if you have a set of servers sending out these events, you should expect that some of the backend events might be processed before the events from the frontend. The KCL (or Lambda) code that processes these events can have different processing rates, different failure points, and other out-of-sync behavior. Take note of such potential dependencies and exceptions.
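If you do decide to consume both streams from one process anyway, the usual pattern with the 1.x-era KCL is simply two Worker instances, each with its own application name (and therefore its own lease table). A sketch, where MyRecordProcessorFactory stands in for your own IRecordProcessorFactory implementation and the stream names are placeholders:

```java
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class TwoStreamConsumers {
    public static void main(String[] args) {
        DefaultAWSCredentialsProviderChain creds = new DefaultAWSCredentialsProviderChain();

        // One KCL application (and lease table) per stream.
        KinesisClientLibConfiguration backendCfg = new KinesisClientLibConfiguration(
                "backend-consumer", "backend-events", creds, "worker-1");
        KinesisClientLibConfiguration frontendCfg = new KinesisClientLibConfiguration(
                "frontend-consumer", "frontend-events", creds, "worker-1");

        // MyRecordProcessorFactory is a placeholder for your IRecordProcessorFactory.
        Worker backend = new Worker.Builder()
                .recordProcessorFactory(new MyRecordProcessorFactory())
                .config(backendCfg)
                .build();
        Worker frontend = new Worker.Builder()
                .recordProcessorFactory(new MyRecordProcessorFactory())
                .config(frontendCfg)
                .build();

        // Each worker runs on its own thread, so both streams are consumed from
        // one JVM but scale and checkpoint independently.
        new Thread(backend).start();
        new Thread(frontend).start();
    }
}
```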

What are the best papers for learning about algorithms for communicating updates in a distributed system?

I have a distributed system in mind (multiple nodes in a single datacenter) that I want to have the following properties:
Nodes can enter and leave the system at any time.
There is no data replication between nodes.
Which node the client makes use of is up to the client (i.e. it could be consistent hashing, it could be something else).
There is no master (i.e. no central point of failure).
Each node may receive a piece of information that needs to be forwarded to the rest of the nodes.
What algorithms (links to papers are best) are suitable for this?
(I assume some of the answers will include P2P algorithms, but most of the ones I've encountered in the past have acted more like distributed hash tables, where nodes enter and take over some part of the keyspace, etc. I also recognize that multicast with simple UDP messages might be appropriate here, but what existing work would help make the messaging reliable?)
How about trying to implement ad hoc nodes with JXTA? See the Practical JXTA II book, available online on Scribd.
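For a concrete baseline, the "simple UDP messages" idea from the question amounts to a few lines of fan-out to each known peer (unicast here rather than multicast); the gossip and epidemic-broadcast literature is essentially about adding reliability, membership, and convergence on top of something this naive. A sketch:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class NaiveFlooder {
    // Push one update to every known peer over UDP. This is the bare baseline from
    // the question: no retries, no ordering, no membership; exactly the gaps that
    // gossip/anti-entropy protocols (and the papers on them) are meant to fill.
    static void broadcast(String update, List<InetSocketAddress> peers) throws Exception {
        byte[] payload = update.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            for (InetSocketAddress peer : peers) {
                socket.send(new DatagramPacket(payload, payload.length, peer));
            }
        }
    }
}
```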
