Hazelcast Jet multiple outbound edges

I need to populate the result of aggregation to 3 separate sinks - maps where updating logic is slightly different. I tried to convert pipeline object into a DAG and add another edge to second last vertex but it threw an exception that multiple outbound edges were not allowed. Is there any way to create DAG with multiple outbound edges?

You should just be able to assign the stage you want to drain to multiple sinks to a variable and then call drainTo() on it repeatedly with different sinks.
Example:
StreamStage<TimestampedEntry<..>> stage = pipeline.drawFrom(..)
.map(..)
.groupingKey(..)
.window(..)
.aggregate(counting());
stage.drainTo(Sinks.map("map1"));
stage.drainTo(Sinks.map("map2"));
If you want to achieve the same thing using the DAG API, you need to assign the outbound edges to different ordinals using the Edge.from().to() construct. However, if you are already starting with a pipeline, this should not be necessary.
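For reference, a minimal sketch of the DAG-API equivalent, assuming the built-in SinkProcessors.writeMapP() sinks and a placeholder for your aggregation vertex:
DAG dag = new DAG();
Vertex aggregate = dag.newVertex("aggregate", ..);   // your aggregation vertex
Vertex sink1 = dag.newVertex("sink1", SinkProcessors.writeMapP("map1"));
Vertex sink2 = dag.newVertex("sink2", SinkProcessors.writeMapP("map2"));
// each outbound edge of "aggregate" gets its own ordinal via Edge.from(vertex, ordinal)
dag.edge(Edge.from(aggregate, 0).to(sink1, 0));
dag.edge(Edge.from(aggregate, 1).to(sink2, 0));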

Related

Azure Data Factory V2: workaround to not being able to have a forEach activity inside a If Condition activity

I want a ForEach activity to run only when a condition is met (inside an If Condition activity), but I get the following error:
ForEach activity ('') is not allowed under a Switch Activity.
Is there any way of looping through items only if a condition is met in ADF?
You will need to architect your solution around nesting restrictions like this one. This is typically solved by placing the conditional workloads in separate child pipelines and invoking them with the Execute Pipeline activity inside the parent pipeline. You may need several child pipelines depending on the complexity of the workloads. Use parameters to pass dependent values, and the "Wait on completion" option to control concurrency.

Azure Stream Analytics Get Previous Output Row for Join to Input

I have the following scenario:
A mobile app produces events that are sent to an Event Hub, which is the input stream source for a Stream Analytics query. From there they are passed through a sequential flow of queries that splits the stream into 2 streams based on criteria, evaluates other conditions, and decides whether or not to let the event keep flowing through the pipeline (if it doesn't, it is simply discarded). You could classify what we are doing as noise reduction/event filtering. Basically, if A just happened, don't let A happen again unless B & C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside of a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set that is further down in the chain. It appears that JOIN only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production: I have an output going to an Azure Queue, and an Azure Function triggered by events on that queue posts them to a different Event Hub that is mapped as a recirculation-feed input back into my queries, which I can join to. I'm still wiring all of that up, so I'm not 100% sure it will work, but I'm thinking there has to be a better option for this relatively common pattern?
The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, which returns the previous event in a data stream.
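For example, a minimal sketch combining the two, with hypothetical names (InputHub, deviceId, eventType, isChosen, eventTime) and a made-up look-back window; the WHEN clause makes LAG return the previous event that satisfied the "chosen" predicate rather than simply the previous input event:
WITH Tagged AS (
    SELECT
        deviceId,
        eventType,
        LAG(eventType) OVER (
            PARTITION BY deviceId
            LIMIT DURATION(minute, 30)
            WHEN isChosen = 1
        ) AS previousChosenEventType
    FROM InputHub TIMESTAMP BY eventTime
)
SELECT deviceId, eventType
INTO ChosenOutput
FROM Tagged
WHERE previousChosenEventType IS NULL
   OR previousChosenEventType <> eventType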
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics
AFAIK, a Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate with your data stream (a minimal join sketch follows the links below). For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)
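A minimal sketch of such a reference data join, with hypothetical names (InputStream is the streaming input, DeviceReference the reference data input); unlike a stream-to-stream join, a join against reference data does not need a DATEDIFF time bound:
SELECT
    s.deviceId,
    s.eventType,
    r.threshold
INTO OutputSink
FROM InputStream s
JOIN DeviceReference r
    ON s.deviceId = r.deviceId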

Google Cloud DataFlow: Synchronize/ merge multiple pipeline into one

I have two Google Pub/Sub topics that my pipelines are streaming from, with windowing. Technically I have two pipelines, one for each of the topics, and I need to merge the windows of these two pipelines into a single one to do some aggregation that requires the combined events within the same window.
Say we have Event1 and Event2, published to two separate topics, say Topic1 and Topic2. I have Pipeline1 and Pipeline2, which individually stream from those topics. I need to somehow get access to the Event1 and Event2 that fall within the same window and produce some output. Is this possible?
You can read from multiple Pubsub topics in the same pipeline like so:
Pipeline p = ...;
PCollection<A> collection1 = p.apply(PubsubIO.Read.topic(topic1));
PCollection<B> collection2 = p.apply(PubsubIO.Read.topic(topic2));
Now, how you want to combine these two PCollections depends on your application. You will probably want to read Handling Multiple PCollections. Here is a quick mention of three possibilities:
Flatten: if you just want to merge the contents of the two collections on a per-window basis, this will do it.
ParDo with side inputs: if windows of one collection are fairly small, then reading this as a side input of a ParDo over the larger collection may be reasonable.
Joins with CoGroupByKey: you can implement many sorts of joins between the two collections by keying them on some common key and using CoGroupByKey.
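For the last option, a minimal sketch with hypothetical names (Event1/Event2 stand in for your element types), assuming both collections have already been windowed identically and keyed by a shared correlation key:
final TupleTag<Event1> tag1 = new TupleTag<>();
final TupleTag<Event2> tag2 = new TupleTag<>();

// keyed1/keyed2 are collection1/collection2 keyed by the correlation id
// (e.g. via WithKeys or a simple ParDo), with the same windowing applied to both
PCollection<KV<String, Event1>> keyed1 = ..;
PCollection<KV<String, Event2>> keyed2 = ..;

PCollection<KV<String, CoGbkResult>> grouped = KeyedPCollectionTuple
    .of(tag1, keyed1)
    .and(tag2, keyed2)
    .apply(CoGroupByKey.<String>create());
For each element of grouped, element.getValue().getAll(tag1) and getAll(tag2) return the Event1s and Event2s that landed in the same window for that key, so a downstream ParDo can combine them.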

Spark SQL - READ and WRITE in sequence or pipeline?

I am working on a cost function for Spark SQL.
While modelling the TABLE SCAN behaviour, I cannot work out whether READ and WRITE are carried out in a pipeline or in sequence.
Let us consider the following SQL query:
SELECT * FROM table1 WHERE columnA = 'xyz';
Each task:
1. Reads a data block (either locally or from a remote node)
2. Filters out the tuples that do not satisfy the predicate
3. Writes the remaining tuples to disk
Are (1), (2) and (3) carried out in sequence or in a pipeline? In other words, is the data block completely read first (all the disk pages composing it), then filtered, and then rewritten to disk, or are these activities carried out in a pipeline (i.e. while the (n+1)-th tuple is being read, the n-th tuple can be processed and written)?
Thanks in advance.
Whenever you submit a job, the first thing Spark does is create a DAG (directed acyclic graph) for it.
After creating the DAG, Spark knows which tasks it can run in parallel, which tasks depend on the output of a previous step, and so on.
So, in your case:
Spark will read your data in parallel (which you can see in the partitions) and filter it out (in each partition).
Since saving requires the filtering, it will wait for the filtering to finish for at least one partition, then start to save it.
After some more digging I found out that Spark SQL uses a so-called "volcano-style pull model".
According to this model, a simple scan-filter-write query would be executed in a pipeline and is fully distributed.
In other words, while a partition (HDFS block) is being read, filtering can already be executed on the rows read so far; there is no need to read the whole block to kick off the filtering. Writing is performed accordingly.
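A small way to check this yourself (a sketch, with spark being an existing SparkSession and a hypothetical output path): the physical plan printed by explain() shows the table scan and the filter grouped into the same whole-stage code generation block, i.e. rows flow from scan to filter without being materialized in between.
import static org.apache.spark.sql.functions.col;

Dataset<Row> filtered = spark.table("table1")
        .filter(col("columnA").equalTo("xyz"));
filtered.explain();                                  // scan and filter share one pipelined stage
filtered.write().parquet("/tmp/table1_filtered");    // hypothetical output path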

Partition data mid-job on Spring Batch

I want to create a job in Spring Batch which should consist of two steps:
Step 1 - First step reads certain transactions from database and produces a list of record Ids that will be sent to step 2 via jobContext attribute.
Step 2 - This should be a partition step: The slave steps should be partitioned based on the list obtained from step 1 (each thread gets a different Id from the list) and perform their read/process/write operations without interfering with each other.
My problem is that even though I want to partition data based on the list produced by step 1, Spring configures step 2 (and thus calls the partitioner's partition() method) before step 1 even starts, so I cannot inject the partitioning criteria in time. I tried using @StepScope on the partitioner bean, but it still attempts to create the partitions before the job starts.
Is there a way to dynamically create the step partitions during runtime, or an alternative way to divide a step into threads based on the list provided by step 1?
Some background:
I am working on a batch job using spring batch which has to process Transactions stored in a database. Every transaction is tied to an Account (in a different table), which has an accountBalance that also needs to be updated whenever the transaction is processed.
Since I want to perform these operations using multi-threading, I thought a good way to avoid collisions would be to group transactions based on their accountId, and have each thread process only the transactions that belong to that specific accountId. This way, no two threads will attempt to modify the same Account at the same time, as their Transactions will always belong to different Accounts.
However, I cannot know which accountIds need to be processed until I get the list of transactions to process and extract the accountIds from there, so I need to be able to provide the partitioning list at runtime. That's why I thought I could generate that list in a previous step, and then have the next step partition and process the data accordingly.
Is the approach I am taking plausible with this setup? Or should I just look for a different solution?
I couldn't find a way to partition the data mid-job like I wanted, so I had to use this workaround:
Instead of dividing the job into two steps, I moved the logic from step 1 (the "setup step") into a service method that returns the list of transactions to process, and added a call to that method inside the partition() method of my partitioner, allowing me to create the partitions based on the returned list.
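Roughly, the workaround looks like this (a sketch with hypothetical names; TransactionService stands in for the service method described above):
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class AccountIdPartitioner implements Partitioner {

    private final TransactionService transactionService;

    public AccountIdPartitioner(TransactionService transactionService) {
        this.transactionService = transactionService;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // the former "setup step" logic now runs here, at partitioning time
        List<Long> accountIds = transactionService.findAccountIdsToProcess();
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (Long accountId : accountIds) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("accountId", accountId);   // picked up by the slave step
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}
Each slave step can then be declared with @StepScope and read its accountId from the step execution context (e.g. #{stepExecutionContext['accountId']}), so it only processes that account's transactions.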
This achieves the same result in my case, although I'm still interested in knowing if it is possible to configure the partitions mid-job, since this solution would not work if I had to perform more complex processing or writing in the setup step and wanted to configure exception handling policies and such. It probably would not work either if the setup step was placed in the middle of a step chain instead of at the start.
