Data Masking in Azure Data Factory

We are using Azure Data Factory to move data from sources like Azure SQL and Azure PostgreSQL to Azure Data Lake as the destination. There is some sensitive data which needs to be masked.
Is it possible to perform data masking in Azure Data Factory during the transformation phase only?
Thanks in advance!

You can leverage the cryptographic functions of the source DB and use them in the SELECT statement to get encrypted data into the data lake. If you use a reversible function, you can decrypt later on.
You can also mask the data using a SQL function (e.g. selecting only a substring of a sensitive column), but then it won't be reversible (the same applies if you leverage Data Masking on the Azure SQL DB).
Here are the cryptographic functions for Azure SQL DB.
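For instance, a minimal sketch of what such a source query could look like (table, column, and passphrase values are hypothetical; in practice the passphrase or key would come from a secret store such as Key Vault):

    -- Reversible: encrypt with a passphrase so the value can be decrypted later
    SELECT
        CustomerId,
        ENCRYPTBYPASSPHRASE('MyS3cretPassphrase', Ssn) AS SsnEncrypted,  -- varbinary output
        LEFT(Email, 3) + '*****'                       AS EmailMasked    -- simple irreversible mask
    FROM dbo.Customer;

    -- Later, only the reversible column can be recovered:
    -- SELECT CONVERT(varchar(20), DECRYPTBYPASSPHRASE('MyS3cretPassphrase', SsnEncrypted)) ...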

Hi, there is an option for dynamic data masking in the portal where you have deployed the database. You can go there and select the table and column to mask your data:
https://learn.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview?view=azuresql
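The same masks can also be defined in T-SQL; a minimal sketch with hypothetical table and column names:

    -- Show only the first two characters of LastName to non-privileged users
    ALTER TABLE dbo.Customer
        ALTER COLUMN LastName ADD MASKED WITH (FUNCTION = 'partial(2, "XXXX", 0)');

    -- Built-in mask for e-mail addresses
    ALTER TABLE dbo.Customer
        ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

Keep in mind that dynamic data masking is applied at query time for non-privileged users, so whether Data Factory reads masked or clear values depends on the permissions of the account it connects with.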

Related

How to access a Redshift DB through VPN to extract data and load into own Azure environment?

I'm pretty new to the Azure environment, and so far my search for information wasn't very successful.
The problem is as follows:
We want to access a Redshift DB which you can only connect to if you are connected to a specific VPN beforehand; this is the main problem.
We then want to build an automated data pipeline which extracts daily updated data from the Redshift DB and creates our own analytics solution from it.
How can that be set up as a fully automated workflow, and in the simplest, most efficient way, with the tools available on the Azure platform?
Thanks for the help.
If the VPN is not the challenge and you just need to extract the data from the Redshift DB and store it in an Azure service like Blob Storage or Azure Synapse Analytics, then the best way is to use Azure Data Factory. Azure Data Factory is a fully managed, serverless data integration service.
You can copy data from Amazon Redshift to any supported sink data store using the Copy activity. For a list of data stores that are supported as sources/sinks by the Copy activity, see the Supported data stores table.
Specifically, the Amazon Redshift connector supports retrieving data from Redshift using a query or the built-in Redshift UNLOAD support.
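As an illustration of the UNLOAD path, Redshift's UNLOAD command writes a query result to S3, from where the copy then proceeds; a minimal sketch with placeholder bucket, role, and query (the connector issues something along these lines for you when S3 staging is configured):

    UNLOAD ('SELECT order_id, order_date, amount FROM public.orders')
    TO 's3://my-staging-bucket/orders/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    DELIMITER ',' GZIP ALLOWOVERWRITE;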
Note: When copying data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address and SQL ranges used by the Azure data centers.
In case you need to import data into Azure SQL Database from AWS Redshift, follow the link.

How to decide between Azure Data Lake vs Azure SQL vs Azure Data Lake Analytics vs Azure SQL VM?

I am new to Azure and hence trying to understand what services to use when and how.
At the moment, I have one Excel file with a couple of tabs that require some transformation to create one new tab inside the source file itself (say, tab "x"). The final tab "x" is then used to create one final Excel file that is shared with various teams.
At present, everything is done manually.
This needs to change, and producing the Excel file shared with the teams has to be automated. The source is the Excel file with the various tabs (excluding tab "x"), the reporting tool will be SSRS, and the Excel data will be stored in the cloud.
Keeping this scenario in mind, what is the best way to store the Excel data in the cloud? The data will be loaded to the cloud on a monthly basis. I am confused as to whether to store the data in Azure SQL, Azure Data Lake Gen2, Azure Data Lake Analytics, or an Azure SQL VM.
Every month, data can be fetched from the Excel file and loaded into Azure using Azure Data Factory. But I am not sure what the best way to store the data in the cloud is, considering that some ETL process is needed to generate data in a format similar to tab "x".
I think you can consider using Azure SQL Database.
Azure SQL Database and SQL Server support importing data from Excel (or CSV) files. For more details and limits, please see: Import data from Excel to SQL Server or Azure SQL Database.
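As an illustration, one simple path is to save the relevant tab as CSV and bulk load it; a minimal sketch for SQL Server with hypothetical table and file names (for Azure SQL Database the file would be read from Blob Storage via an external data source rather than a local path):

    -- Load a CSV exported from the Excel tab into a staging table
    BULK INSERT dbo.MonthlyData
    FROM 'C:\data\monthly_export.csv'
    WITH (FORMAT = 'CSV', FIRSTROW = 2);   -- FIRSTROW = 2 skips the header row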
Once your data is stored in Azure SQL Database, you can also use Excel to get the data back out of it:
Connect Excel to a single database in Azure SQL Database and import data and create tables and charts based on values in the database. In this tutorial you will set up the connection between Excel and a database table, save the file that stores data and the connection information for Excel, and then create a pivot chart from the database values.
Reference: Import data from Excel to SQL Server or Azure SQL Database.
I think you don't need to store these Excel files in Azure Data Lake. Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage. It's still just storage.
The more Azure resources you use, the more you have to pay.
If your Excel file is stored on your local computer, you can use Azure Data Factory with a self-hosted integration runtime to access these local files.
Please reference: Copy data to or from a file system by using Azure Data Factory.
Hope this helps.
Your storage requirements are very minimal, so I would select Data Lake to store your documents. The alternative is Blob Storage, but I always prefer Data Lake because it works with Azure Active Directory.
In your scenario, drop it in the ADL, and use the ADL as the source in Azure Data Factory.
Edit:
Honestly, your original post is a little confusing. You have a RAW Excel document, and you do some transformations on the RAW document to generate an Excel source document. This source document holds the final dataset that the dev team will use to build out SSRS reports. You need to make this dataset available to the teams so that they can connect to it to build the reports? My suggestion is to keep it simple: drop the final source dataset, in Excel format, into Blob or Data Lake storage and ask the dev team to pick it up from that location. If you go the route of designing and maintaining a data pipeline (Blob > Data Factory > SQL, or CSV/TSV), you are introducing unnecessary complications.

Is it possible to update row values in tables in Azure Data Factory?

I have a dataset in Data Factory, and I would like to know if it is possible to update row values using only Data Factory activities, without data flows, stored procedures, queries...
There is a way to do an UPDATE (and probably any other SQL statement) from Data Factory, though it's a bit hacky.
The Lookup activity can execute a set of statements in Query mode, e.g.:
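A minimal sketch of such a query (table and column names are hypothetical):

    -- Run the update, then return a dummy row so the Lookup activity gets a result set
    UPDATE dbo.MyTable
    SET    Status = 'Processed'
    WHERE  LoadDate = '2021-01-01';

    SELECT 1 AS DummyResult;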
The only condition is that the batch must end with a SELECT, otherwise the Lookup activity throws an error.
This works for Azure SQL, PostgreSQL, and most likely for any other DB Data Factory can connect to.
Concepts:
Datasets:
A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data.
Currently, in my experience, it's impossible to update row values using only Data Factory activities; Azure Data Factory doesn't support this today.
For more details, please reference:
Datasets
Datasets and linked services in Azure Data Factory.
For example, when I use the Copy activity, Data Factory doesn't provide me with any way to update the rows.
Hope this helps.
This is now possible in Azure Data Factory: your data flow should have an Alter Row transformation, and the sink has a drop-down where you can select the key column for performing updates.
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-alter-row
As mentioned in the above comment regarding ADF data flows, a data flow does not support an on-premises sink or source; the sink and source should reside in Azure SQL, Azure Data Lake, or other Azure data services.

Best way to extract data from Azure Data Lake to SQL Server

I am looking for the best programmatic way to extract data from Azure Data Lake to a MSSQL database installed on a VM within Azure.
Currently I am considering following options:
Azure Data Factory
SSIS (Using Azure Data Lake Store Connection Manager)
User-Defined Outputter Example1, Example2
Custom C# code that reads Azure Data Lake data and inserts it into SQL Server DB
Any other good ways I am missing?
Data Factory v2 (currently in public preview) also supports hosting SSIS, giving you both a Data Factory AND an SSIS option.
It is not necessarily a good idea for many scenarios, but Azure Logic Apps has both a Data Lake Store connector and a SQL Server connector, which could be useful in scenarios such as writing lots of small files on a schedule or trigger.
You also may not need to go full-on C#; you could use PowerShell instead, as there are PowerShell modules for both Data Lake Store and SQL Server.

How to move on-premises database table identity column data to Azure SQL table

I need to move an on-premises database to Azure SQL. The structure of the database is exactly the same. In my database there are tables with the Identity property on columns. I need to insert into these columns from my on-premises database, and I need to do it through an ADF copy activity on a predefined schedule (say, every day). I tried various options like those mentioned in the following links:
Data migration to Azure with foreign key referencing an identity column and Azure Data Factory Copy Identity Column With Gaps. However, I didn't find an option there to copy identity column data.
Is it possible in Azure? That's my question now. And if yes, then how?
I can't speak with authority on the ADF copy activity, but you can write into an identity column in SQL Server/SQL Azure using the following setting:
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-identity-insert-transact-sql
This would copy the values from the SQL Server table to the SQL Azure table exactly (overriding the identity semantics).
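For example, a minimal sketch of the T-SQL on the Azure SQL side, with hypothetical table and column names (in ADF this could run as a stored procedure step against data staged by the copy activity):

    -- Allow explicit values to be written into the identity column Id
    SET IDENTITY_INSERT dbo.Customer ON;

    INSERT INTO dbo.Customer (Id, Name, City)
    SELECT Id, Name, City
    FROM   staging.Customer;          -- rows landed here by the ADF copy activity

    SET IDENTITY_INSERT dbo.Customer OFF;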
Note that it does not make sense to allow inserts on both SQL tables while maintaining this property (assuming you are using this as a surrogate key).
If all you want to do is publish to a SQL Azure table that is "read-only" (other than for your publishing step), please note that you can also use SQL Azure as a subscriber to a transactional replication publisher attached to the on-premises SQL Server. This may also solve your problem.
Hope that helps,
Conor Cunningham
Architect, SQL Team
