I'm new to both ADF (Azure Data Factory) and ADX (Azure Data Explorer).
I have multiple JSON files in ADLS at different folder levels, and I need to ingest all of them into ADX.
For example: UserData/Overground/UsersFolder/project1/main/data/json/demo-02/2021/01/28/03/demo-02-2021-01-28-03-30.json
UserData/Overground/UsersFolder/project1/main/data/json/demo-02/2021/01/28/04/demo-02-2021-01-28-03-30.json
UserData/Overground/UsersFolder/project1/main/data/json/demo-02/2021/01/29/03/demo-02-2021-01-28-03-30.json
UserData/Overground/UsersFolder/project1/main/data/json/demo-02/2021/02/23/03/demo-02-2021-01-28-03-30.json
I'm wondering whether I need to create as many tables in ADX as there are JSON files in ADLS. If I have 1000 JSON files in ADLS, should I create 1000 tables in ADX to copy the data from ADLS to ADX?
And how can I copy the data from ADLS to ADX in ADF?
I appreciate your help in advance.
To copy from multiple folders, you can use the Additional settings of the Copy activity source. For more information, follow the official documentation. You may need to use a wildcard to match multiple files.
Additional settings:
recursive: Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink.
Allowed values are true (default) and false.
This property doesn't apply when you configure fileListPath.
Also refer to Azure Data Explorer as Sink.
Azure Data Explorer is supported as a source, where data is copied from Azure Data Explorer to any supported data store, and as a sink, where data is copied from any supported data store to Azure Data Explorer. See: Integrate Azure Data Explorer with Azure Data Factory.
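The settings above could be combined roughly as follows. This is a minimal sketch, not a definitive configuration: it assumes ADLS Gen2, that all the JSON files share one schema and can therefore land in a single ADX table (so one table per file should not be needed), and that dataset references are defined elsewhere. The helper name, activity name, and ingestion mapping name are hypothetical. The Python dict mirrors the JSON shown in the Copy activity's code view.

```python
import json


def build_copy_activity(folder_pattern, file_pattern, mapping_name):
    """Return a Copy activity definition (as a dict) that reads many JSON
    files from ADLS and writes them to one Azure Data Explorer table."""
    return {
        "name": "CopyJsonToAdx",  # hypothetical activity name
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "JsonSource",
                "storeSettings": {
                    # Assumption: ADLS Gen2. Gen1 uses a different
                    # read-settings type.
                    "type": "AzureBlobFSReadSettings",
                    # Walk the year/month/day/hour subfolders.
                    "recursive": True,
                    "wildcardFolderPath": folder_pattern,
                    "wildcardFileName": file_pattern,
                },
            },
            "sink": {
                "type": "AzureDataExplorerSink",
                # Hypothetical mapping created beforehand on the ADX table.
                "ingestionMappingName": mapping_name,
            },
        },
    }


activity = build_copy_activity(
    "UserData/Overground/UsersFolder/project1/main/data/json/demo-02/*",
    "*.json",
    "demo02_json_mapping",
)
print(json.dumps(activity, indent=2))
```

With recursive enabled and a file-name wildcard, one Copy activity covers every dated subfolder, so new folders do not require pipeline changes.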
Related
Use case: I have data files of varying size that are copied to a specific SFTP folder periodically (daily/weekly). All these files need to be validated and processed, then written to the related tables in Azure SQL. The files are in CSV format; each is a flat text file that corresponds directly to a specific table in Azure SQL.
Implementation:
I'm planning to use Azure Data Factory. So far, from my reading, I can see that I can have a Copy pipeline in order to copy the data from on-prem SFTP to Azure Blob storage. We can also have an SSIS pipeline to copy data from on-premises SQL Server to Azure SQL.
But I don't see an existing solution that achieves what I am looking for. Can someone provide some insight on how I can achieve this?
I would try to use Data Factory with a Data Flow to validate/process the files (if possible for your case). If the validation is too complex/depends on other components, then I would use functions and put the resulting files to blob. The copy activity is also able to import the resulting CSV files to SQL server.
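If the validation ends up in a function rather than a Data Flow, the core check might look like the sketch below. It is only an illustration: the column names and the rule that every field must be non-empty are assumptions, not something stated in the question.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "name", "amount"]  # hypothetical table columns


def validate_csv(text, expected_columns=EXPECTED_COLUMNS):
    """Split CSV text into valid rows and (line_number, reason) errors."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != expected_columns:
        return [], [(1, f"unexpected header: {reader.fieldnames}")]
    valid, errors = [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        # DictReader stores surplus fields under the None key and fills
        # missing fields with None, so both cases are caught here.
        if None in row or any(value in (None, "") for value in row.values()):
            errors.append((line_number, "missing or extra field"))
        else:
            valid.append(row)
    return valid, errors


sample = "id,name,amount\n1,Alice,10.5\n2,,7\n3,Bob,4,extra\n"
valid_rows, bad_rows = validate_csv(sample)
print(len(valid_rows), "valid row(s); errors:", bad_rows)
```

The valid rows could then be written back to Blob storage as a clean CSV for the Copy activity to load.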
You can create a pipeline that does the following:
Step 1: Copy data: copy the files from SFTP to Blob Storage.
Step 2: Data Flow: process and validate the data.
Step 3: Data Flow sink: write the results directly to the SQL table.
Of course, if the on-prem server is not publicly accessible, you need an integration runtime that can reach it, either through VNet integration or a self-hosted IR.
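The pipeline described above could be laid out roughly as below. This is a sketch under assumptions: the pipeline, activity, and data flow names are hypothetical, and dataset and linked-service references are left out for brevity. The dict mirrors the pipeline JSON, with the Data Flow step depending on the Copy step.

```python
import json


def build_pipeline():
    """Return a two-step pipeline: copy SFTP files to Blob storage, then
    run a Data Flow that validates them and writes to the SQL table."""
    copy_step = {
        "name": "CopySftpToBlob",  # hypothetical name
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {"type": "SftpReadSettings", "recursive": True},
            },
            "sink": {
                "type": "DelimitedTextSink",
                "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
            },
        },
    }
    data_flow_step = {
        "name": "ValidateAndLoadToSql",  # hypothetical name
        "type": "ExecuteDataFlow",
        # Run only after the copy has finished successfully.
        "dependsOn": [
            {"activity": copy_step["name"], "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {
            "dataFlow": {
                "referenceName": "ValidateCsvFlow",  # hypothetical data flow
                "type": "DataFlowReference",
            }
        },
    }
    return {
        "name": "SftpToSqlPipeline",
        "properties": {"activities": [copy_step, data_flow_step]},
    }


pipeline = build_pipeline()
print(json.dumps(pipeline, indent=2))
```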
I'm trying to move data from Teradata to Snowflake. I have created a process that runs TPT scripts to generate files for each table.
The files are also split to achieve concurrency when running COPY INTO in Snowflake.
I need to understand the best way to move those files from an on-prem Linux machine to Azure ADLS, considering that the files are terabytes in size.
Does Azure provide any mechanism to move these files, or can we create the files on ADLS directly from Teradata?
If you have Azure Blob Storage or ADLS Gen2, the best approach is to load the data into Snowflake via an external table: load the data to Blob storage, create an external table over it, and then load the data into Snowflake.
I am going through some Microsoft documentation and doing hands-on exercises for data engineering.
I have a couple of questions about one scenario: copying CSV file(s) from Blob storage to Synapse Analytics (stage table(s)).
I read that we can pull data directly into Synapse by creating external tables. (https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/load-data-wideworldimportersdw)
If that is possible, in what cases do we use the Azure Data Factory Copy activity or a data flow?
When working with Azure Data Factory, is it a good idea to use PolyBase, given that it would use Blob storage again as staging in this scenario (i.e. I am copying the file from Blob and using Blob again for staging)?
I searched for answers to my questions but haven't found a satisfactory one yet.
If you're just loading data straight from CSV into the DW, use Copy. PolyBase is recommended, but not always needed for small files.
If you need to transform that data or perform updates, then use data flows.
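The Copy-with-PolyBase option could be expressed roughly as below. This is a sketch, not a definitive configuration: the helper and activity names are hypothetical, dataset references are omitted, and it assumes the CSV files already meet PolyBase's format requirements. In that case, as far as I understand, the Copy activity can read them from Blob directly, so a second staging copy should not be needed.

```python
import json


def build_synapse_copy(use_polybase):
    """Return a Copy activity definition that loads CSV files from Blob
    storage into a Synapse stage table, optionally through PolyBase."""
    sink = {"type": "SqlDWSink", "allowPolyBase": use_polybase}
    if use_polybase:
        # Assumed settings: fail the load on the first rejected row.
        sink["polyBaseSettings"] = {
            "rejectType": "value",
            "rejectValue": 0,
            "useTypeDefault": True,
        }
    return {
        "name": "CopyCsvToStageTable",  # hypothetical name
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": True,
                },
            },
            "sink": sink,
            # The source is already in Blob storage, so no interim
            # staging copy is requested here (assumption, see above).
            "enableStaging": False,
        },
    }


print(json.dumps(build_synapse_copy(use_polybase=True), indent=2))
```

For small files, calling the helper with use_polybase=False gives a plain bulk-insert style copy, in line with the advice above.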
I have 20 Excel/PDF files located on different HTTPS servers. I need to validate these files and load them into Azure Storage using Data Factory. I also need to apply some business logic to this data and load it into Azure SQL Database. I need to know whether we have to create a pipeline that stores this data in Azure Blob storage and then loads it into Azure SQL Database.
I have tried creating a Copy Data activity in Data Factory.
My idea is as follows:
No.1:
Step 1: Use Copy Activity to transfer data from http connector source into blob storage connector sink.
Step 2: Meanwhile, configure a blob storage trigger to execute your logic code so that the blob data will be processed as soon as it's collected into blob storage.
Step 3: Use Copy Activity to transfer data from blob storage connector source into SQL database connector sink.
No.2:
Step 1: Use Copy Activity to transfer data from http connector source into SQL database connector sink.
Step 2: Meanwhile, you could configure a stored procedure containing your logic steps. The logic will be executed before the data is inserted into the table.
I think both methods are feasible. With No.1, the business logic is freer and more flexible. No.2 is more convenient, but it is limited by the syntax of stored procedures. Pick whichever solution suits you.
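Option No.2 might be configured along these lines. This is a sketch under assumptions: it presumes the HTTP source serves delimited text that the Copy activity can parse, and the stored procedure, table type, and parameter names are hypothetical objects you would have to create in the database first.

```python
import json


def build_http_to_sql_copy(proc_name, table_type, param_name):
    """Return a Copy activity definition that reads delimited text over
    HTTP and hands each batch of rows to a stored procedure in Azure SQL."""
    return {
        "name": "CopyHttpToSql",  # hypothetical name
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "HttpReadSettings",
                    "requestMethod": "GET",
                },
            },
            "sink": {
                "type": "AzureSqlSink",
                # The procedure receives the copied rows as a table-valued
                # parameter and applies the business logic before inserting.
                "sqlWriterStoredProcedureName": proc_name,
                "sqlWriterTableType": table_type,
                "storedProcedureTableTypeParameterName": param_name,
            },
        },
    }


activity = build_http_to_sql_copy(
    "spApplyBusinessLogic",  # hypothetical stored procedure
    "SourceRowType",         # hypothetical table type
    "rows",                  # hypothetical parameter name
)
print(json.dumps(activity, indent=2))
```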
Excel and PDF are not supported yet. Based on the link, only the formats listed there are supported by ADF directly.
I tested it as a CSV file and got random characters.
You could refer to this case to read Excel files in ADF: How to read files with .xlsx and .xls extension in Azure data factory?
I have a large dataset on Azure Data Lake Store, and a few files might be added or updated there daily. How can I process these new files without reading the entire dataset each time?
I need to copy these new files to SQL Server using Data Factory V1.
If you can use ADF V2, you could use the Get Metadata activity to get the lastModified property of each file and then copy only the new files. You can refer to this doc: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
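The selection logic behind this approach can be sketched in plain Python. This only illustrates the comparison, not ADF itself: the file list stands in for what Get Metadata would return per file, and the file names, timestamps, and stored "last run" watermark are made up for the example.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse a UTC timestamp such as 2021-01-28T03:30:00Z."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def files_modified_since(files, watermark):
    """Return the names of files changed after the previous run."""
    return [f["name"] for f in files if parse_utc(f["lastModified"]) > watermark]


# Hypothetical metadata for two files in the lake.
files = [
    {"name": "a.csv", "lastModified": "2021-01-27T23:59:00Z"},
    {"name": "b.csv", "lastModified": "2021-01-28T03:30:00Z"},
]
last_run = parse_utc("2021-01-28T00:00:00Z")
new_files = files_modified_since(files, last_run)
print(new_files)  # ['b.csv']
```

After a successful copy, the watermark would be advanced to the time of the current run so that the next run skips the files already loaded.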