Conditional statements importing JSON to SQL Azure

I have a pipeline in Azure Data Factory which imports JSON to SQL Azure. This works fine, except some JSON files have multiple structures.
It would be fine if every line in the file were the same. I can take two runs at the files in Data Lake Gen 2: I don't mind one pipeline ignoring the lines with rc, and another pipeline which ignores the rows with marketDefinition and just processes the others, getting both into separate tables.
I'm not sure what the best solution here is.

For now, Data Factory doesn't work well with multiple files which have different schemas.
The pre-copy script is an operation that runs directly against the SQL database; even if you pass the source file path to the script, it still won't filter the source dataset. It's an independent command.
So I'm afraid there isn't a best solution for your scenario.
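If you do take the two-pass approach you describe, one workaround is to pre-split each file outside of ADF before the copies run, so that each pipeline only ever sees one schema. A minimal sketch in Python, assuming the files are line-delimited JSON and using placeholder paths you would point at your Data Lake Gen 2 files:

```python
# Route each line of a mixed line-delimited JSON file into one of two files,
# so each output has a single schema that its own ADF pipeline can load
# into its own table. Paths are placeholders.
source_path = "mixed_stream.jsonl"

with open(source_path) as src, \
        open("market_definition_rows.jsonl", "w") as md_out, \
        open("other_rows.jsonl", "w") as other_out:
    for line in src:
        if not line.strip():
            continue
        # Simple substring routing, matching the "ignore the lines with rc /
        # ignore the rows with marketDefinition" idea from the question.
        if '"marketDefinition"' in line:
            md_out.write(line)
        else:
            other_out.write(line)
```

You could run something like this from an Azure Function or a Custom activity before the two copy pipelines; it is only a sketch of the splitting step, not a full solution.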

Related

Azure Data Factory - Copy files using a CSV with filepaths

I am trying to create an ADF pipeline that does the following:
Takes in a CSV with 2 columns, e.g.:
Source, Destination
test_container/test.txt, test_container/test_subfolder/test.txt
Essentially I want to copy/move each file from the Source path to the Destination path (both these paths are in Azure Blob Storage).
I think there is a way to do this using lookups, but lookups are limited to 5000 rows and my CSV will be larger than that. Any suggestions on how this can be accomplished?
Thanks in advance,
This is a complex scenario for Azure Data Factory. Also, as you mentioned, there are more than 5000 file path records in your CSV file, which means the same number of Source and Destination paths. If you build this architecture in ADF, it will go like this:
You will use the Lookup activity to read the Source and Destination paths. Even there you can't read all the paths in one go because of the Lookup activity's row limit.
Then you will iterate over the records using a ForEach activity.
You also need to split each path so that you get the container, directory and file names separately to pass to the Datasets created for the Source and Destination locations. Once you split the paths, you need to use Set variable activities to store the Source and Destination container, directory and file names. These variables are then passed to the Datasets dynamically. This is the tricky part: if even a single record fails to split properly, your pipeline will fail.
If the above step completes successfully, you don't need to worry about the Copy activity. If all the parameters get the expected values under the Source and Sink tabs of the Copy activity, it will work properly.
My suggestion is to use a programmatic approach for this. Use Python, for example, to read the CSV file with the pandas module, iterate over each path and copy the files. This will work fine even if you have 5000+ records.
You can refer to this SO thread, which will help you to implement the same programmatically.
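As a rough illustration of that programmatic approach (not the exact code from the linked thread), a sketch using pandas and the azure-storage-blob SDK could look like this; the connection string, CSV path and the container/blob layout are assumptions to adapt:

```python
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Assumed inputs: a CSV with Source/Destination columns such as
#   test_container/test.txt, test_container/test_subfolder/test.txt
# and a storage connection string that can reach both paths.
CONNECTION_STRING = "<storage-connection-string>"
CSV_PATH = "filepaths.csv"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
df = pd.read_csv(CSV_PATH)  # columns: Source, Destination

for _, row in df.iterrows():
    # Split "container/dir/file" into the container and the blob path.
    src_container, _, src_blob = row["Source"].strip().partition("/")
    dst_container, _, dst_blob = row["Destination"].strip().partition("/")

    src_client = service.get_blob_client(src_container, src_blob)
    dst_client = service.get_blob_client(dst_container, dst_blob)

    # Server-side copy; for private blobs the source URL usually needs a SAS
    # token appended. There is no 5000-row limit since we iterate ourselves.
    dst_client.start_copy_from_url(src_client.url)
```

This is only a sketch; in practice you would add retries, logging and, if you are moving rather than copying, a delete of the source blob once the copy reports success.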
First, if you want to maintain a hierarchical pattern in your data, I recommend using ADLS (Azure Data Lake Storage); this will guarantee a certain structure for your data.
Second, if you have a folder in Blob Storage and you would like to copy files to it, use the Copy activity; you should define 2 datasets, one for the source and one for the sink.
Check this link: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview

Copy one file at a time from one container to another using azure data factory v2

I am trying to copy one file from one container to another in a storage account. The scenario I implemented works fine for a single file, but for multiple files it copies all of them in one copy activity. I want the files to be moved one at a time, with a delay of 1 min after each copy before proceeding with the next file.
I created a pipeline with the Move Files template, but it did not work for multiple files.
I have taken the source and sink datasets as CSV datasets, not binary. I will not be aware of the pattern or the names of the files.
When a user inputs, say, about 10 files, I want to copy them one at a time and also provide a delay between each copy. This has to happen between 2 storage account containers.
I have tried the Move Files template too, but it did not work for multiple files. Please help me.
Sanaa, to force sequential processing, check the "Sequential" checkbox on the ForEach activity.
The time delay can be achieved by adding a "Wait" activity inside the ForEach.
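For reference, a rough sketch of what that ForEach-plus-Wait combination looks like when the pipeline is defined through the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory and parameter names are placeholders, and the per-file Copy activity (parameterized by @item()) is omitted to keep the sketch short:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    Expression,
    ForEachActivity,
    ParameterSpecification,
    PipelineResource,
    WaitActivity,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# is_sequential=True is the "Sequential" checkbox; the Wait activity inside
# the ForEach provides the 1-minute delay between iterations. A Copy activity
# for the current @item() would sit alongside the Wait.
foreach = ForEachActivity(
    name="CopyFilesOneByOne",
    is_sequential=True,
    items=Expression(value="@pipeline().parameters.fileList"),
    activities=[
        WaitActivity(name="DelayOneMinute", wait_time_in_seconds=60),
    ],
)

pipeline = PipelineResource(
    parameters={"fileList": ParameterSpecification(type="Array")},
    activities=[foreach],
)

client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "CopyWithDelay", pipeline
)
```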

Azure Data Factory Data Migration

Not really sure if this is an explicit question or just a query for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the No SQL DB Collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one on a not so often basis.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations and combine them based on the common attributes, which will result in a new dataset. Then, from this dataset, push the data to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a copy activity, but it only works on a single input dataset, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I am trying to do.
I did find that there is a special activity called a Custom Activity that is in effect a user defined definition that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if this is the most optimal solution.
On top of that, I am also unclear about how the merging of the 3 data sources would work when data needs to be joined across the 3 different sources, given that the datasets are just snapshots of the originating source data; this leads me to think that data could be missed. I'm not sure if some of the data would need to be published somewhere, but that seems like it would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common for either of these integration tools (SSIS or ADF).
Your ADF diagram should look something like this:
1. You define your 3 data sources as ADF Datasets on top of a corresponding Linked Service.
2. Then you build a pipeline that brings the information from SQL Server into a temporary data source (an Azure Table, for example).
3. Next you build 2 pipelines that will each take one of your NoSQL datasets and run a function to update the temporary data source, which is the output.
4. Finally you build a pipeline that brings all your data from the temporary data source into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
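To make the "combine them based on the common attributes" step concrete, here is a minimal sketch of the join itself once the three snapshots have been landed somewhere readable; the file names and key columns below are hypothetical stand-ins for your common property/attribute:

```python
import pandas as pd

# Hypothetical snapshot exports of the three sources.
collection_a = pd.read_json("nosql_collection_a.json")   # NoSQL collection 1
collection_b = pd.read_json("nosql_collection_b.json")   # NoSQL collection 2
sql_rows = pd.read_csv("sql_server_extract.csv")         # SQL Server extract

# Join the two NoSQL collections on their shared property, then join the
# result to the SQL Server data on the common attribute/column.
combined = (
    collection_a.merge(collection_b, on="commonProperty", how="inner")
                .merge(sql_rows, on="commonAttribute", how="left")
)

# The combined dataset can then be pushed to the destination SQL Server DB.
combined.to_csv("combined_output.csv", index=False)
```

In ADF itself this joining logic would live in the transformation step (for example the function or custom activity working on the temporary data source); the snippet just shows the shape of the merge.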

Can I upload data from multiple data sources to Azure DW at the same time

Can I load data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes you can, but you have to define a copy activity for each of the data sources being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there would be 2 ways to do this. Regardless of the method, the data source does not matter to ADF since your data movement happens via the copy activity which looks at the dataset and takes care of firing the query on the related datasource.
Method 1:
If all your transformation for a table can be done with a SELECT query on the source systems, you can have a set of copy activities specifying SELECT statements. This is the simple approach.
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy over the raw data from the source systems to staging tables in the SQLDW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets that are the output of Step 1 will be the input datasets to Step 2 in order to maintain consistency.
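If you go with Method 2 and want to see what the Step 2 call looks like outside of ADF (inside ADF, the Stored Procedure activity is the usual way to chain it after the copies), a minimal Python/pyodbc sketch with hypothetical server and procedure names would be:

```python
import pyodbc

# Hypothetical connection details for the SQL DW (Synapse) instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=yourdw;UID=loader;PWD=<password>"
)
cursor = conn.cursor()

# Step 2: run the transformation stored procedures against the staging
# tables that the Step 1 copy activities populated. Names are placeholders.
for proc in ("stg.usp_LoadDimCustomer", "stg.usp_LoadFactSales"):
    cursor.execute("EXEC " + proc)
    conn.commit()

conn.close()
```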

Best way to handle Flat File Import to SQL Server using C#.Net

I wrote a console application that reads a list of flat files, parses the data type on a row basis, and inserts records one after the other into the respective tables.
There are a few flat files which contain about 63k records (rows). For such files, my program takes about 6 hours to complete a single 63k-record file.
This is a test data file; in production I have to deal with 100 times the load.
I am wondering whether I can do this any better to speed it up.
Can anyone suggest a better way to handle this job?
The workflow is as below:
Read the flat file from the local machine using File.ReadAllLines("location").
Create a record entity object after parsing each field of the row.
Insert the current row into the table using the entity.
The purpose of making this a console application is that it should be run (as a scheduled application) on a weekly basis, and there is conditional logic in it: based on some variable it will either do a full table replace, update an existing table, or delete records in a table.
You can try to use a 'bulk insert' operation for inserting a large amount of data into the database.
