Using existing datasets within Azure Data Factory in a new data flow - azure

I'm having trouble using existing datasets within Azure Data Factory when I want to create a new data flow. The dataset combo box in Data Factory doesn't load the existing datasets.
Also, when I try to create a new dataset, most of the data sources, such as SQL Server, are disabled.
Any ideas?
Please see these screenshots:
I cannot select the datasets that were built before.
When I click New dataset, SQL Server is disabled and I cannot select it.
SQL Server in this list is disabled and cannot be selected.

Data Flow supports only seven source types at the moment, and SQL Server is not one of them. If your existing datasets aren't of these types, you can't choose them. For more details, please refer to this documentation.

Since you can't use SQL Server within a Data Flow, you first need to run a Copy activity to stage your data from SQL Server into a location the subsequent Data Flow can reach. The easiest option is a delimited text file (CSV) in Azure Blob Storage.
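For illustration, a minimal sketch of such a staging Copy activity in pipeline JSON might look like the following; the dataset names and the reader query are hypothetical placeholders, not values from the original post:

{
    "name": "StageSqlServerToBlobCsv",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT * FROM dbo.SourceTable"
        },
        "sink": {
            "type": "DelimitedTextSink",
            "storeSettings": { "type": "AzureBlobStorageWriteSettings" },
            "formatSettings": { "type": "DelimitedTextWriteSettings", "fileExtension": ".csv" }
        }
    },
    "inputs": [ { "referenceName": "SqlServerSourceDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "BlobCsvStagingDataset", "type": "DatasetReference" } ]
}

The Data Flow can then use the CSV staging dataset as its source.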

Related

What to do when my Data source is not supported by Azure Synapse's Data Flow?

I am trying to transform data from Salesforce before loading it to dedicated SQL pool.
When I try to create a dataset from Synapse's Data Flow, I am not able to choose Salesforce as a data store:
Can anyone suggest how to transform data from Salesforce or any other data source that is not supported by Data Flow?
As per the official documentation, Data Flows currently do not support Salesforce data as a source or sink.
If you want, you can raise a feature request in the Synapse portal.
As an alternative, you can use a Copy activity in Azure Data Factory to copy the data from Salesforce to the dedicated SQL pool, and then transform it with a Data Flow in Synapse from the dedicated SQL DB back into the dedicated SQL DB.
Follow the below steps to achieve your requirement:
First, create a Data Factory workspace.
Select the Author hub and create a pipeline. Now, drag the Copy activity onto the canvas and select the source. You can see that Salesforce is supported when you select a new source dataset. Select it and create a linked service for it.
Now, select the sink dataset and click on Azure Synapse Analytics.
Create a linked service for the dedicated SQL database and select it.
Then, select the target table in the dedicated SQL pool and copy your data by running the pipeline.
After this copy, go to the Synapse workspace and click on the source of the Data Flow.
Select Azure Synapse Analytics as the source and click Continue.
Now, click New to create a linked service for the SQL DB. Provide the subscription and server name and authenticate with your database.
After the linked service is created, select it and point it at the table produced by the copy in the DB.
Now, go to the sink, select Azure Synapse Analytics, create another linked service for it in the same way as above, and select the resulting table in the DB that you want after the transform.
By following the above process, we can transform Salesforce data and load it into the dedicated SQL DB, as sketched below.
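As a rough sketch, the staging Copy activity from the steps above could be expressed in pipeline JSON along these lines; the activity name, SOQL query, and dataset names are hypothetical placeholders:

{
    "name": "CopySalesforceToDedicatedSqlPool",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "SalesforceSource",
            "query": "SELECT Id, Name FROM Account"
        },
        "sink": {
            "type": "SqlDWSink",
            "tableOption": "autoCreate"
        }
    },
    "inputs": [ { "referenceName": "SalesforceAccountDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "DedicatedSqlPoolStagingTable", "type": "DatasetReference" } ]
}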
Can anyone suggest how to transform data from Salesforce or any other data source that is not supported by Data Flow?
You can try this approach for data stores that are not supported by Data Flows, and please refer to this list of data stores supported by the Copy activity before applying the same process to other data stores.

File is not read completely by Copy Data in Azure Data Factory

I'm developing a pipeline that inserts data from a .txt file located in Blob Storage into a table in a SQL database.
Problem: Somehow the activity configuration is not working properly, because it is not reading all the records in the file and consequently is not loading all the data into the database. (I realized this when I opened the file and compared the number of records in the .txt file against the SQL table. Also, when I searched the SQL table for records from the last month, I didn't find them.)
Note: I checked the character size limit of the columns in the SQL table, and that isn't the problem.
I'd like to share with you the Data Copy activity and Source Data Set configuration as well:
Sink Dataset:
Do you know what I'm doing wrong here? Hope you can help me. Best regards.
P.S. Here's the Source Dataset
As discussed in the comments, when using the Copy activity you have to make sure the schema is set before running the activity. By design, the schema mapping is left empty and has to be configured by the user, either manually or by asking ADF to import the schema from the dataset.
Note: The Auto create table option in the sink automatically creates the sink table (if it doesn't exist) based on the source schema, but it is not supported when a stored procedure is specified on the sink side or when staging is enabled.
When using the COPY statement to load data into Azure Synapse Analytics as the sink, the connector supports automatically creating the destination table with DISTRIBUTION = ROUND_ROBIN, if it doesn't exist, based on the source schema.
Refer to the official doc: Copy and transform data in Azure Synapse Analytics by using Azure Data Factory or Synapse pipelines
Source...
Sink...
So Azure Synapse will be used as the sink. Additionally, an Azure Synapse table has to be created that matches the column names, column order, and column data types of the source.
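For reference, a hedged sketch of what the Azure Synapse Analytics sink settings can look like in the copy activity JSON, assuming the COPY statement and the auto create table option are wanted (the values shown are illustrative, not taken from the question):

"sink": {
    "type": "SqlDWSink",
    "allowCopyCommand": true,
    "tableOption": "autoCreate"
}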
For dynamic mapping
If you view the pipeline code, you can see, in the translator section, the JSON equivalent of the mapping section from the UI.
You can reuse this as a base for dynamic mapping, so that similar files can keep being copied without having to configure the schema manually.
Copy the JSON under mappings in the translator.
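As a rough illustration, the JSON copied from the translator typically has a TabularTranslator shape like the one below; the column names and types here are hypothetical placeholders:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Prop_0", "type": "String" }, "sink": { "name": "CustomerId", "type": "Int32" } },
        { "source": { "name": "Prop_1", "type": "String" }, "sink": { "name": "CustomerName", "type": "String" } }
    ]
}

You can then parameterize or template this block when setting up dynamic mapping for similar files.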

How to decide between Azure Data Lake vs Azure SQL vs Azure Data Lake Analytics vs Azure SQL VM?

I am new to Azure and hence trying to understand what services to use when and how.
At the moment, I have one Excel file with a couple of tabs that require some transformation to create one Excel tab (inside the source file itself, say tab "x"). The final tab "x" is then used to create one final Excel file that is shared with various teams.
At present, everything is done manually.
This needs to change, and producing the Excel file shared with the team has to be automated. The source is the Excel file with its various tabs (excluding tab "x"), the reporting tool will be SSRS, and the Excel data will be stored in the cloud.
Keeping this scenario in mind, what is the best way to store the Excel data in the cloud? The data will be loaded into the cloud on a monthly basis. I am confused about whether to store the data in Azure SQL, Azure Data Lake Gen2, Azure Data Lake Analytics, or an Azure SQL VM.
Every month the data can be fetched from the Excel file and loaded into Azure using Azure Data Factory, but I am not sure what the best way to store the data in the cloud is, considering that some ETL processing is needed to generate data in a format similar to tab "x".
I think you can consider using Azure SQL Database.
Azure SQL Database and SQL Server support importing data from Excel (or CSV) files. For more details and limits, please see: Import data from Excel to SQL Server or Azure SQL Database.
Once your data is stored in Azure SQL Database, you can also use Excel to get the data back out of it:
Connect Excel to a single database in Azure SQL Database and import data and create tables and charts based on values in the database. In this tutorial you will set up the connection between Excel and a database table, save the file that stores data and the connection information for Excel, and then create a pivot chart from the database values.
Reference: Import data from Excel to SQL Server or Azure SQL Database.
I don't think you need to store these Excel files in Azure Data Lake. Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage; it's still just storage.
The more Azure resources you use, the more you pay.
If your Excel file is stored on your local computer, you can use Azure Data Factory with a self-hosted integration runtime to access these local files.
Please reference: Copy data to or from a file system by using Azure Data Factory.
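For illustration, a minimal sketch of a file system linked service that goes through a self-hosted integration runtime might look like this; the server path, user, and runtime name are hypothetical placeholders:

{
    "name": "OnPremFileSystemLinkedService",
    "properties": {
        "type": "FileServer",
        "typeProperties": {
            "host": "\\\\fileserver01\\excelshare",
            "userId": "mydomain\\myuser",
            "password": { "type": "SecureString", "value": "<password>" }
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}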
Hope this helps.
Your storage requirements are very minimal, so I would select Data Lake to store your documents. The alternative is Blob Storage, but I always prefer Data Lake because it works with Azure Active Directory.
In your scenario, drop it in the ADL, and use the ADL as the source in Azure Data Factory.
Edit:
Honestly, your original post is a little confusing. You have a raw Excel document, and you do some transformations on it to generate an Excel source document. This source document holds the final dataset that the dev team will use to build out SSRS reports. You need to make this dataset available to the teams so that they can connect to it to build the reports? My suggestion is to keep it simple: drop the final source dataset, in Excel format, into Blob or Data Lake storage and ask the dev team to pick it up from that location. If you go the route of designing and maintaining a data pipeline (Blob > Data Factory > SQL, or CSV, TSV), you are introducing unnecessary complications.

Is it possible to update row values in tables from Azure Data Factory?

I have a dataset in Data Factory, and I would like to know if it is possible to update row values using only Data Factory activities, without Data Flows, stored procedures, queries...
There is a way to run an UPDATE (and probably any other SQL statement) from Data Factory, though it's a bit hacky.
The Lookup activity can execute a set of statements in Query mode, i.e.:
The only condition is to end it with a SELECT; otherwise, the Lookup activity throws an error.
This works for Azure SQL, PostgreSQL, and most likely for any other DB Data Factory can connect to.
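For example, a hedged sketch of such a Lookup activity, with an UPDATE followed by a trailing SELECT so the activity has a row to return (the table, column, and dataset names are hypothetical placeholders):

{
    "name": "UpdateRowsViaLookup",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "UPDATE dbo.MyTable SET Status = 'Processed' WHERE Status = 'Pending'; SELECT 1 AS Done;"
        },
        "dataset": { "referenceName": "AzureSqlTableDataset", "type": "DatasetReference" },
        "firstRowOnly": true
    }
}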
Concepts:
Datasets:
Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
Now, a dataset is a named view of data that simply points or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data.
Currently, in my experience, it is impossible to update row values using only Data Factory activities; Azure Data Factory doesn't support this yet.
For more details, please see:
Datasets
Datasets and linked services in Azure Data Factory.
For example, when I use the Copy activity, Data Factory doesn't provide me any way to update the rows:
Hope this helps.
This is now possible in Azure Data Factory: your Data Flow should have an Alter Row stage, and the sink has a drop-down where you can select the key column for doing updates.
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-alter-row
As mentioned in the above comment regarding ADF Data Flows, Data Flows do not support on-premises sinks or sources; the sink and source should reside in Azure SQL, Azure Data Lake, or another Azure data service.

Can I populate different SQL tables at once inside Azure Data Factory when the source dataset is Blob Storage?

I want to copy data from Azure Blob Storage to Azure SQL Database. The destination database is divided among different tables.
So is there any way in which I can send the blob data directly to different SQL tables using a single pipeline with one copy activity?
Since this is meant to be a trigger-based pipeline running continuously (I created a trigger for every hour), right now I can only send the blob data to one table and then split it into the different tables by invoking another pipeline in which the source and sink datasets are both the SQL database.
I'm looking for a solution to this.
You could use a stored procedure in your database as a sink in the copy activity. This way, you can define the logic in the stored procedure to write the data to your destination tables. You can find the description of the stored procedure sink here.
You'll have to use a user-defined table type for this solution; maintaining them can be difficult. If you run into issues, you can have a look at my and BioEcoSS' answers in this thread.
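As a rough sketch, the copy activity's sink settings for the stored procedure approach might look like this; the procedure, table type, and parameter names are hypothetical placeholders, and the stored procedure itself would contain the INSERTs into the individual destination tables:

"sink": {
    "type": "AzureSqlSink",
    "sqlWriterStoredProcedureName": "spDistributeBlobRows",
    "sqlWriterTableType": "BlobRowTableType",
    "storedProcedureTableTypeParameterName": "BlobRows"
}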
In my experience, and according to the Azure Data Factory documentation, we cannot directly send the blob data to different SQL tables using a single pipeline with one copy activity.
This is because, in the table mapping settings, one Copy Data activity only allows us to select one corresponding table in the destination data store, or to specify one stored procedure to run at the destination.
You don't need to create a new pipeline, though; just add a new Copy Data activity for each target, with each copy activity calling a different stored procedure.
Hope this helps.
