Delta Logic implementation using SnapLogic

Is there any snap available in SnapLogic to do the following?
1. Connect to Snowflake and get data with SELECT * FROM VIEW.
2. Connect to Azure Blob Storage and get the data from a CSV file: FILENAME_YYYYMMDD.csv.
3. Take only those records which are available in 1 but NOT available in 2, and write this delta back to Azure Blob Storage as FILENAME_YYYYMMDD.csv.
Is the In-Memory Lookup snap useful for this?

No. The In-Memory Lookup snap is used for cases where you need to look up the value corresponding to the value in a certain field of the incoming records; for example, say you want to look up a country name against its ISO country code. The snap generally fetches the lookup table once and stores it in memory, then uses that stored table to provide the data corresponding to each incoming record.
In your case, you have to use the Join snap instead: join the two streams on the key and keep only the records from the Snowflake view that have no match in the CSV (a left outer join followed by a filter on the unmatched rows).
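Outside SnapLogic, the delta the pipeline has to produce is essentially an anti-join: rows present in the Snowflake result but absent from the CSV. A minimal sketch of that logic in pandas, assuming a hypothetical key column KEY_ID and local copies of both datasets:

import pandas as pd

# Hypothetical inputs: an export of the Snowflake view and the blob CSV, both keyed on KEY_ID.
view_df = pd.read_csv("snowflake_view_export.csv")
blob_df = pd.read_csv("FILENAME_YYYYMMDD.csv")

# Left anti-join: keep only view rows whose KEY_ID does not appear in the blob file.
merged = view_df.merge(blob_df[["KEY_ID"]], on="KEY_ID", how="left", indicator=True)
delta = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# In the pipeline this delta would be written back to Azure Blob Storage.
delta.to_csv("FILENAME_YYYYMMDD_delta.csv", index=False)

The Join snap plus a downstream filter reproduces exactly this "left_only" selection.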

Related

I can't Delete or Truncate Table Storage

I created a table in a storage account and can insert records into it with no problem.
I decided to use this type of table so I could share its URI (HTTPS) for another system to consume, and since it is a NoSQL table, it gives me the flexibility to adapt what I need to store in it.
The inconvenience is that I must truncate this table every time information is processed, and the insert mode option in Data Factory (Replace or Merge) does not work: it always performs an append.
I tried to do it from Databricks, but I don't know how to reference that table, since it is outside of Blob Storage and cannot be mounted as such. Any ideas?
Azure Table Storage
Azure Data Factory
[Screenshot: table data]
[Screenshot: Data Factory configuration with "Tipo de Inserción" (Insert Type) set to "Reemplazar" (Replace)]
How can I configure this so that I can delete the data?
Thanks a lot.
Greetings.
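For reference, Table Storage has no truncate operation: you either delete the entities or delete and recreate the table, through the data-plane SDK rather than by mounting anything. A minimal sketch with the azure-data-tables Python package (usable from a Databricks notebook as well); the connection string and table name are placeholders:

from azure.data.tables import TableClient

# Placeholders: the storage account connection string and the table to empty.
table = TableClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    table_name="MyTable",
)

# No TRUNCATE in Table Storage: delete the entities one by one.
for entity in list(table.list_entities()):
    table.delete_entity(
        partition_key=entity["PartitionKey"],
        row_key=entity["RowKey"],
    )

Deleting and recreating the whole table (TableServiceClient.delete_table / create_table) is the faster "truncate", at the cost of the table being briefly unavailable after deletion.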

Can you make an Azure Data Factory data flow for updating data using a foreign key?

I've tried this a few ways and seem to be blocked.
This is nothing more than a daily ETL process. What I'm trying to do is use ADF and pull in a CSV as one of my datasets. With that data I need to update docs in a Cosmos DB container, which is the other dataset in this flow. My data is really simple:
ForeignId string
Value1 int
Value2 int
Value3 int
The Cosmos docs all have these data items and more. ForeignId is unique in the container and is the partition key. The docs are a composite dataset that actually has 3 other id fields that would be considered the PK in the system of origin.
When you try to use a data flow UPDATE with this data, the validation complains that you have to map "Id" to use UPDATE. I have an Id in my document, but it only relates to my collection, not to the old, external systems. I have no choice but to use the ForeignId. I have it flowing using UPSERT but, even though I have the ForeignId mapped between the datasets, I get inserts instead of updates.
Is there something I'm missing, or is ADF not set up to sync data based on anything other than a data item named "id"? Is there another option in ADF aside from the straightforward approach? I've read that you can drop updates into Lookup tasks, but that seems like a hack.
The row id is needed by Cosmos DB to know which row to update; it has nothing to do with ADF.
To make this work in ADF, add an Exists transformation in your data flow to see whether the row already exists in your collection, checking the foreign key column in your incoming source data against the existing collection.
If a row is found with that foreign key, you can then add the corresponding id to your metadata, allowing you to include it in your sink.
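Outside ADF, the same lookup-then-update idea looks roughly like this with the azure-cosmos Python SDK; the endpoint, key, database, and container names are placeholders, and ForeignId is assumed to be the partition key as described above:

from azure.cosmos import CosmosClient

client = CosmosClient("<account-endpoint>", credential="<account-key>")
container = client.get_database_client("<db>").get_container_client("<container>")

def update_by_foreign_id(row):
    # Look up the existing document by its foreign key (also the partition key here).
    matches = list(container.query_items(
        query="SELECT * FROM c WHERE c.ForeignId = @fid",
        parameters=[{"name": "@fid", "value": row["ForeignId"]}],
        partition_key=row["ForeignId"],
    ))
    if not matches:
        return  # no existing doc; skip, or insert, depending on the desired behavior
    doc = matches[0]
    # Carry the document's own id forward so the write is an update, not a new insert.
    doc.update({"Value1": row["Value1"], "Value2": row["Value2"], "Value3": row["Value3"]})
    container.upsert_item(doc)

This mirrors what the Exists transformation and the id mapping do inside the data flow.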

How do I store run-time data in Azure Data Factory between pipeline executions?

I have been following Microsoft's tutorial to incrementally/delta load data from an SQL Server database.
It uses a watermark (timestamp) to keep track of changed rows since last time. The tutorial stores the watermark to an Azure SQL database using the "Stored Procedure" activity in the pipeline so it can be reused in the next execution.
It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only, by the way). I'd rather store it somewhere else in Azure, maybe in blob storage or wherever.
In short: Is there an easy way of keeping track of this type of data or are we limited to using stored procs (or Azure Functions et al) for this?
I had come across a very similar scenario, and from what I found you can't store any watermark information in ADF - at least not in a way that you can easily access.
In the end I just created a basic tier Azure SQL database to store my watermark / config information on a SQL server that I was already using in my pipelines.
The nice thing about this is that when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them by simply adding a column that tracks which BU a specific watermark row belongs to.
Blob storage is indeed a cheaper option but I've found it to require a little more effort than just using an additional database / table in an existing database.
I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!
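For reference, the blob storage variant is roughly this much code with the azure-storage-blob Python SDK (connection string, container, and blob name are placeholders); an ADF Lookup activity can read the same tiny blob at the start of the pipeline:

from azure.storage.blob import BlobClient

# Placeholders: a tiny blob that holds nothing but the watermark value.
blob = BlobClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="pipeline-config",
    blob_name="watermark.txt",
)

def read_watermark(default="1900-01-01T00:00:00"):
    # On the very first run the blob may not exist yet.
    if not blob.exists():
        return default
    return blob.download_blob().readall().decode("utf-8").strip()

def write_watermark(value: str) -> None:
    blob.upload_blob(value, overwrite=True)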
There is a way to achieve this by using a Copy activity, but it is complicated to get the latest watermark back out in 'LookupOldWaterMarkActivity'; the following is just for reference.
Dataset setting: [screenshot]
Copy activity setting: [screenshot]
The source and sink dataset are the same one. Change the expression in the additional columns to @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}.
Through this, you can save the watermark as a column in a .txt file. But it is difficult to get the latest watermark back with a Lookup activity, because the output of 'LookupOldWaterMarkActivity' will look like this:
{
    "count": 1,
    "value": [
        {
            "Prop_0": "11/24/2020 02:39:14",
            "Prop_1": "11/24/2020 08:31:42"
        }
    ]
}
The key names are generated by ADF. If you want to get "11/24/2020 08:31:42", you need to get the column count and then use an expression of the form @activity('LookupOldWaterMarkActivity').output.value[0][Prop_(column count - 1)].
How to get the latest watermark:
1. Use a Get Metadata activity to get the columnCount.
2. Use this expression: @activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_', string(sub(activity('Get Metadata1').output.columnCount, 1)))]
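The Get Metadata + expression step is only doing "take the last column of the row"; outside ADF the equivalent is a one-liner (assuming the watermark .txt is comma-delimited as above):

# Each Copy run appends the new watermark as one more column,
# so the newest value is the last field of the single data row.
with open("watermark.txt") as f:
    latest_watermark = f.readline().strip().split(",")[-1]
print(latest_watermark)  # e.g. 11/24/2020 08:31:42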

Can you use dynamic/run-time outputs with Azure Stream Analytics?

I am trying to get aggregate data sent to different table storage outputs based on a column name in the select query. I am not sure if this is possible with Stream Analytics.
I've looked through the Stream Analytics docs and various forums; so far I haven't found any leads. I am looking for something like:
Select tableName,count(distinct records)
into tableName
from inputStream
I hope this makes it clear what I'm trying to achieve: I am trying to insert aggregate data into table storage (defined as outputs), and I want to take the output/table storage name from the select query. Any idea how that could be done?
I am trying to get aggregate data sent to different table storage
outputs based on a column name in select query.
If I don't misunderstand your requirement, you want a case...when... or if...else... structure in the ASA SQL so that you can send data to different table outputs based on some conditions. If so, I'm afraid that it cannot be done so far. Every destination in ASA has to be specific; dynamic output is not supported in ASA.
However, as a workaround, you could use an Azure Function as the output. You can pass the columns into the Azure Function, then do the switching in code inside the function to save the data into different table storage destinations. For more details, please refer to this official doc: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-with-azure-functions
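A rough sketch of such a function (a Python HTTP trigger, which is what the ASA Azure Functions output calls) that routes each event to a table named by one of its columns; the tableName/windowEnd/count column names and the connection string are assumptions:

import azure.functions as func
from azure.data.tables import TableServiceClient

# Assumption: the storage connection string would normally come from app settings.
service = TableServiceClient.from_connection_string("<storage-account-connection-string>")

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Stream Analytics posts a JSON array of events to the function.
    events = req.get_json()
    for event in events:
        # Route on the column carried in the SELECT (assumed to be called "tableName").
        table = service.create_table_if_not_exists(event["tableName"])
        table.upsert_entity({
            "PartitionKey": event["tableName"],
            "RowKey": str(event.get("windowEnd", "")),  # assumed window column
            "recordCount": event.get("count", 0),       # assumed aggregate column
        })
    return func.HttpResponse(status_code=200)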

Peculiar Issue with Azure Table Storage Select Query

I came across weird behavior of an Azure Table Storage query. I used the following code to get a list of entities from Azure Table Storage:
query = context.CreateQuery<DomainData.Employee>(DomainData.Employee.TABLE_NAME).Where(strPredicate).Select(selectQuery);
where context is a TableServiceContext. I was trying to pull the Employee entity from Azure Table Storage, and my requirement is to construct the predicate and the projection dynamically.
So strPredicate is a string containing the dynamically constructed predicate,
and selectQuery is the projection string, constructed dynamically based on the user-selected properties.
When the user selects all the properties of the Employee object (it has over 200 properties), the system constructs a dynamic projection string over all of them, and it takes 45 minutes to retrieve 60,000 records from Azure Table Storage.
Whereas when I query the object directly, without any projection, i.e. like below:
query = context.CreateQuery<DomainData.Employee>(DomainData.Employee.TABLE_NAME).Where(strPredicate);
the query takes only 5 minutes to retrieve the same 60,000 records. Why this peculiar behavior? The two queries are the same except that one projects a set of columns/properties and the other has no projection, yet Azure Table Storage returns the same number of entities, with the same properties and the same size per entity. Why is Azure Table Storage so much slower for the first query and faster for the second? Please let me know.
The standard advice when dealing with perceived anomalies in Windows Azure Storage is to use Fiddler to identify the actual storage operation invoked. This will quickly let you see what the actual differences between the two operations are.
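If Fiddler is not at hand, the same information (the exact REST call, including any $filter/$select on the URL) can also be surfaced by turning on wire logging in a storage SDK; for example, with the Python azure-data-tables client, where the table, filter, and property names are placeholders:

import logging
from azure.data.tables import TableClient

# Route the SDK's HTTP logging to the console so the request URL (with $filter/$select) is visible.
logging.basicConfig(level=logging.DEBUG)

table = TableClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    table_name="Employee",
    logging_enable=True,  # log full request/response details
)

# Server-side projection: only the selected properties travel over the wire.
for row in table.query_entities("PartitionKey eq 'HR'", select=["Name", "Department"]):
    print(row)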
