If I need to publish two tables in two different databases in the metastore, do I need to create two different DLT pipelines? I am asking because I saw that in the pipeline settings I can only specify one target.
Right now, yes - DLT only supports one target database. So if you need to publish into different databases, you will need two DLT pipelines.
Theoretically you can have one pipeline publish both tables into a single database and then use create table ... using delta location '<dlt_storage>/tables/<table_name>' to expose a table in the second database, but this won't work well with schema evolution, etc.
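If you want to try that workaround, a minimal Spark SQL sketch would look like the following (the second database name and table name are placeholders, and '<dlt_storage>' stands for the pipeline's storage location):

CREATE DATABASE IF NOT EXISTS second_db;

-- Register an external table in the second database that points at the
-- Delta files the DLT pipeline writes; names here are illustrative only.
CREATE TABLE IF NOT EXISTS second_db.my_table
USING DELTA
LOCATION '<dlt_storage>/tables/my_table';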
I would like to merge data from different data sources (an ERP system, Excel files) with ADF and make it available in an Azure SQL DB for further analysis.
I'm not sure where and when I should do the transformations and joins between the tables. Can I run all of this directly in the pipeline and then load the data into the Azure SQL DB, or do I need to stage the data first?
My understanding is to load the data into ADF using copy activities and datasets, transform and merge the datasets there with Mapping Data Flows or similar activities, and then load them into the Azure SQL DB.
Your question is fully requirement-based; you can go for either an ETL or an ELT process. Since your sink is Azure SQL DB, I would suggest going with ELT, as you can handle a lot of the transformations in SQL itself by creating views on the landing tables.
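As a minimal sketch of that ELT approach, a view on top of the landing tables could look like this (all table and column names are made up for illustration):

-- Join and transform the landed ERP and Excel data in SQL instead of in ADF.
CREATE VIEW dbo.vw_sales_enriched
AS
SELECT  e.order_id,
        e.order_date,
        c.customer_name,
        e.amount
FROM    landing.erp_orders      AS e
JOIN    landing.excel_customers AS c
        ON c.customer_id = e.customer_id;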
If you have complex transformations to handle, then go with ETL and use a Data Flow instead.
Also, regarding staging tables: if your requirement is to perform a daily incremental load after the first full load, then you should opt for a staging table.
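A rough T-SQL sketch of that staging pattern, assuming the copy activity lands each day's delta in a staging table first (object names are invented):

-- Merge the staged rows into the target table as the daily incremental load.
MERGE dbo.orders AS tgt
USING stg.orders AS src
    ON tgt.order_id = src.order_id
WHEN MATCHED THEN
    UPDATE SET tgt.order_date = src.order_date,
               tgt.amount     = src.amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (order_id, order_date, amount)
    VALUES (src.order_id, src.order_date, src.amount);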
I am setting up a new data lake and have been tasked with creating the master data tables in the Databricks Delta Lake component. I'm trying to do this in a use-case-agnostic way (or as agnostic as possible), and I need to automate the process where possible. I have researched AWS Glue crawlers, and they seem like a good way to automatically create a schema and catalog for the data.
However, I'm not sure how to proceed. I'm assuming that creating the master data means identifying common fields across all the data sources and creating a schema for all the data using a single crawler, and then dividing this schema into facts and dimensions. After that I could use Spark jobs on Databricks to extract what I need from the raw data and populate the master data tables, while checking for duplicates and doing whatever other transformations need to be done.
This plan seems to require a lot of manual labor, though, and it isn't use-case agnostic in any way. Does anyone know how it could be automated further?
Any help would be much appreciated.
I want to copy some tables from a production system to a test system on a regular basis. Both systems run a PostgreSQL server. I want to copy only specific tables from production to test.
I've already set up a ForEach that iterates over the table names I want to copy. The problem is that the table structures may change during the development process, and the copy job might then fail.
So is there a way to use some kind of "automatic mapping"? The tables in both systems always have exactly the same structure. Or is there some kind of "copy table" procedure?
You could remove the mapping and structure in your pipeline. It will then use the default mapping behavior. Given that your tables always have the same schema, both mapping by name and mapping by order should work.
I'm not really sure whether this is an explicit question or just a request for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one not so often.
What I want to do is prepare a Data Factory pipeline that grabs the data from all 3 DB locations and combines it based on the common attributes, resulting in a new dataset. Then, from this dataset, push the data to another SQL Server DB.
I'm a bit unclear on how this is done within Data Factory. There is a copy activity, but it only works on a single input dataset, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are meant for massaging input datasets to produce new datasets, but I'm not clear on which of them would be relevant to what I want to do.
I did find that there is a special activity called a Custom Activity, which is in effect a user-defined activity that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure whether it is the most optimal solution.
On top of that, I am also unclear about how the merging of the 3 data sources would work when data from the 3 different sources has to be joined, given that the datasets are just snapshots of the originating source data, which leads me to think data could be missed. I'm not sure whether some of the data would need to be published somewhere first, but that seems like it would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common for either of these integration tools.
Your ADF design should look something like this:
1. Define your 3 data sources as ADF Datasets on top of a corresponding Linked Service.
2. Build a pipeline that brings the information from SQL Server into a temporary data store (an Azure Table, for example).
3. Build 2 pipelines that each take one of your NoSQL datasets and run a function to update the temporary data store, which is the output.
4. Finally, build a pipeline that brings all your data from the temporary data store into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
Can I retrieve data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
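As a rough sketch of the PolyBase route (the external data source, file format, location and table names below are placeholders, not a tested setup):

-- 'my_blob_storage' is assumed to be an external data source created beforehand.
CREATE EXTERNAL FILE FORMAT csv_format
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE ext.sales_raw
(
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = my_blob_storage,
      FILE_FORMAT = csv_format);

-- Load into an internal table; several loads like this can run concurrently.
CREATE TABLE dbo.sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.sales_raw;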
Yes you can, but you have to define a copy activity for each of the data sources being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there are 2 ways to do this. Regardless of the method, the data source does not matter to ADF, since the data movement happens via the copy activity, which looks at the dataset and takes care of firing the query against the related data source.
Method 1:
If all the transformation for a table can be done in a SELECT query on the source system, you can have a set of copy activities, each specifying a SELECT statement. This is the simple approach.
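For instance, the source query of one such copy activity could already do the join and aggregation, assuming a SQL source (tables and columns are made up):

-- The copy activity runs this against the source and writes the result to SQL DW.
SELECT  c.customer_id,
        c.customer_name,
        SUM(o.amount) AS total_amount
FROM    dbo.customers AS c
JOIN    dbo.orders    AS o
        ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;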
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy the raw data from the source systems into staging tables in the SQL DW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets that are the output of Step 1 will be the input datasets to Step 2, in order to maintain consistency.
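A hypothetical sketch of one of the Step 2 stored procedures (the staging and target table names are invented):

-- Transform the raw rows copied into the staging table and load the target table.
CREATE PROCEDURE dbo.usp_load_sales
AS
BEGIN
    TRUNCATE TABLE dbo.sales;

    INSERT INTO dbo.sales (customer_id, sale_date, amount)
    SELECT  customer_id,
            CAST(sale_date AS DATE),
            SUM(amount)
    FROM    stg.sales_raw
    GROUP BY customer_id, CAST(sale_date AS DATE);
END;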