Google Cloud Dataflow: Synchronize/merge multiple pipelines into one

I have two Google Pub/Sub topics that my pipelines stream from, with windowing applied. Technically I have two separate pipelines, one per topic, and I need to merge their windows into a single one to do some aggregation that requires the combined events within the same window.
Say we have Event1 and Event2, published to two separate topics, say Topic1 and Topic2. I have Pipeline1 and Pipeline2, which individually stream from those topics. I need to somehow get access to the Event1 and Event2 that fall within the same window and produce some output. Is this possible?

You can read from multiple Pubsub topics in the same pipeline like so:
Pipeline p = ...;
PCollection<String> collection1 = p.apply(PubsubIO.Read.topic(topic1)); // raw message payloads
PCollection<String> collection2 = p.apply(PubsubIO.Read.topic(topic2));
Now, how you want to combine these two PCollections depends on your application. You will probably want to read Handling Multiple PCollections. Here is a quick mention of three possibilities (a rough sketch of the first and third follows the list):
Flatten: if you just want to merge the contents of the two collections on a per-window basis, this will do it.
ParDo with side inputs: if windows of one collection are fairly small, then reading this as a side input of a ParDo over the larger collection may be reasonable.
Joins with CoGroupByKey: you can implement many sorts of joins between the two collections by keying them on some common key and using CoGroupByKey.
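Building on the snippet above, here is a rough sketch of the Flatten and CoGroupByKey options, assuming the Beam/Dataflow Java SDK. The one-minute window, the Event1/Event2 types and the keyedEvents1/keyedEvents2 collections are placeholders for whatever your pipeline actually produces:
// (Transforms below come from the Beam/Dataflow Java SDK: Window, FixedWindows,
//  Flatten, PCollectionList, TupleTag, KeyedPCollectionTuple, CoGroupByKey.)

// Apply the same windowing to both streams so their elements line up per window.
PCollection<String> windowed1 =
    collection1.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));
PCollection<String> windowed2 =
    collection2.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

// Option 1: Flatten -- a simple per-window union of both streams.
PCollection<String> merged =
    PCollectionList.of(windowed1).and(windowed2).apply(Flatten.pCollections());

// Option 3: CoGroupByKey -- join events that share a key within the same window.
// keyedEvents1/keyedEvents2 are assumed to be PCollection<KV<String, Event1>> and
// PCollection<KV<String, Event2>> that you produced by parsing and keying the messages.
final TupleTag<Event1> tag1 = new TupleTag<>();
final TupleTag<Event2> tag2 = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(tag1, keyedEvents1)
        .and(tag2, keyedEvents2)
        .apply(CoGroupByKey.<String>create());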

Related

Delta Live Tables in Databricks can take only one target

If I need to publish two tables into two different databases in the metastore, do I need to create two different DLT pipelines? I am asking this because I saw that in the pipeline settings I can only specify one target.
Right now - yes, DLT only supports one target database, so if you need to push into different databases you may need two DLT pipelines.
Theoretically you can have one pipeline that publishes both tables into a single database, and then use create table ... using delta location '<dlt_storage>/tables/<table_name>' to refer to it from the other database, but it won't work well with schema evolution, etc.

Hazelcast Jet multiple outbound edges

I need to populate the result of an aggregation into 3 separate sinks - maps whose update logic is slightly different. I tried to convert the pipeline object into a DAG and add another edge to the second-last vertex, but it threw an exception saying multiple outbound edges were not allowed. Is there any way to create a DAG with multiple outbound edges?
You should just be able to assign the stage you want to drain to multiple sinks to a variable and then repeatedly call drainTo() on it with different sinks.
Example:
StreamStage<TimestampedEntry<..>> stage = pipeline.drawFrom(..)
    .map(..)
    .groupingKey(..)
    .window(..)
    .aggregate(counting());
stage.drainTo(Sinks.map("map1"));
stage.drainTo(Sinks.map("map2"));
If you want to achieve the same using the DAG API, then you need to assign the sinks to different ordinals using the Edge.from().to() construct (a quick sketch follows). However, if you are already starting with a pipeline this should not be necessary.
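For completeness, a minimal sketch of what those multiple outbound edges look like with the core DAG API. It assumes dag is a com.hazelcast.jet.core.DAG and that aggregateVertex, sinkVertex1 and sinkVertex2 are hypothetical vertices you created earlier with dag.newVertex(..):
// Two outbound edges from the same vertex must leave from different ordinals.
dag.edge(Edge.from(aggregateVertex, 0).to(sinkVertex1));
dag.edge(Edge.from(aggregateVertex, 1).to(sinkVertex2));
That said, sticking with the Pipeline API and calling drainTo() twice, as shown above, is the simpler route.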

Kafka Connect multiple topics in sink connector properties

I am trying to read 2 Kafka topics using the Cassandra sink connector and insert into 2 Cassandra tables. How can I go about doing this?
This is my connector.properties file:
name=cassandra-sink-orders
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
tasks.max=1
topics=topic1,topic2
connect.cassandra.kcql=INSERT INTO ks.table1 SELECT * FROM topic1;INSERT INTO ks.table2 SELECT * FROM topic2
connect.cassandra.contact.points=localhost
connect.cassandra.port=9042
connect.cassandra.key.space=ks
connect.cassandra.contact.points=localhost
connect.cassandra.username=cassandra
connect.cassandra.password=cassandra
Am I doing everything right? Is this the best way of doing this or should I create two separate connectors?
There's one issue with your config. You need one task per topic-partition. So if your topics have one partition, you need tasks.max set to at least 2.
I don't see it documented in Connect's docs, which is a shame
If you want to consume those two topics in one consumer, that's fine and it's a correct setup. Whether it is the best way of doing this depends on whether those messages should be consumed by one or two consumers, so it depends on your business logic.
Anyway, if you want to consume two topics via one consumer, that should work fine, since a consumer can subscribe to multiple topics. Did you try running this consumer? Is it working?
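Not specific to the Cassandra connector, but to illustrate that last point about one consumer subscribing to several topics, here is a minimal sketch with the plain Java client; the group id is a placeholder and the broker is assumed to be at localhost:9092:
// Uses org.apache.kafka.clients.consumer.*, java.util.Properties, java.util.Arrays, java.time.Duration
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "example-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("topic1", "topic2")); // one consumer, two topics
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("%s -> %s%n", record.topic(), record.value());
    }
}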

Azure Stream Analytics Get Previous Output Row for Join to Input

I have the following scenario:
A mobile app produces events that are sent to Event Hub, which is the input stream source to a Stream Analytics query. From there they are passed through a sequential flow of queries that splits the stream into 2 streams based on criteria, evaluates other conditions, and decides whether or not to let the event keep flowing through the pipeline (if it doesn't, it is simply discarded). You could classify what we are doing as noise reduction/event filtering. Basically, if A just happened, don't let A happen again unless B & C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside of a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set further down the chain. It appears that JOIN only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production: I have an output defined going to an Azure Queue, then an Azure Function triggered by events on that queue wakes up and posts them to a different Event Hub that is mapped as a recirculation feed input back into my queries, which I can join to. I'm still wiring all of that up, so I'm not 100% sure it will work, but I'm thinking there has to be a better option for this relatively common pattern?
The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, which gets the previous event in a data stream.
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics
AFAIK, a Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate with your data stream. For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)

Azure Data Factory Data Migration

I'm not really sure if this is an explicit question or just a request for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one not so often.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations and combine it based on the common attributes, resulting in a new dataset. Then, from this dataset, push the data to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a copy activity, but it only works on a single dataset input, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I am trying to do.
I did find that there is a special activity called a Custom Activity that is in effect a user-defined definition that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if it is the most optimal solution.
On top of that, I am also unclear about how the merging of the 3 data sources would work when data has to be joined across them, given that the datasets are just snapshots of the originating source data, which makes me think data could be missed. I'm not sure if publishing some of the data somewhere would be required, but that seems like it would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common in either of these integration tools.
Your ADF diagram should look something like this:
1. You define your 3 data sources as ADF Datasets on top of a corresponding Linked Service.
2. Then you build a pipeline that brings the information from SQL Server into a temporary Data Source (an Azure Table, for example).
3. Next you build 2 pipelines that will each take one of your NoSQL Datasets and run a function to update the temporary Data Source, which is the output.
4. Finally you build a pipeline that brings all your data from the temporary Data Source into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
