Transfer, massage and import data with Azure Data Factory

I am working with Azure and have a number of SQL databases hosted there. I am looking to transfer data between those databases. From my research, Azure Data Factory appears to be one way to achieve this, but I have found it difficult to find clear information on how.
Could someone point me in the direction of using Data Factory to take data from db1, transform and massage it, and then insert it into db2?

If you simply need to COPY data from source A to destination B, ADF is a good option; there is a rich set of supported sources and destinations.
To try it out quickly, you can use the Copy Wizard, which is code-free: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-wizard
For more details about the Copy activity, have a look at this: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities
When you say "massage data", I don't know exactly what that will involve, but the ADF Custom Activity and Stored Procedure activity should cover most needs.
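To make that concrete, one common pattern for the "massage" step is a Copy activity whose Azure SQL sink calls a stored procedure in db2, so the reshaping logic lives in T-SQL on the destination side. A rough sketch of the activity JSON (current ADF v2 format), where the dataset names, the dbo.usp_UpsertOrders procedure and the dbo.OrderType table type are all hypothetical objects you would create yourself:

{
  "name": "CopyAndMassageOrders",
  "type": "Copy",
  "inputs":  [ { "referenceName": "Db1OrdersDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "Db2OrdersDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT OrderId, CustomerId, Amount FROM dbo.Orders"
    },
    "sink": {
      "type": "AzureSqlSink",
      "sqlWriterStoredProcedureName": "dbo.usp_UpsertOrders",
      "sqlWriterTableType": "dbo.OrderType",
      "storedProcedureTableTypeParameterName": "Orders"
    }
  }
}

The stored procedure receives the copied rows as a table-valued parameter and can transform them however it likes before inserting into db2; for a straight copy, drop the three stored-procedure properties and ADF inserts directly into the sink table.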

Related

Structure in Get Metadata activity for CSV file dataset shows string data types for integer columns in Azure Data Factory

I want to perform validation as the first step, before proceeding further in the pipeline execution.
I run a Get Metadata activity on my dataset and then check the result against a predefined schema in an If Condition.
The metadata for CSV files reports the column type as string even for integer columns, which breaks the validation.
Get Metadata doesn't support this; for CSV files every column's data type is reported as string.
You posted the same question on the Microsoft Q&A forum here: https://learn.microsoft.com/en-us/answers/questions/44635/structure-in-getmetadata-activity-for-csv-file-dat.html, and Microsoft confirmed that using Get Metadata on a CSV file will return all columns as strings.
The link provided there doesn't help with the column types either.
I think this is a by-design limitation with no workaround at the moment. In my experience, the structure field only works well for database sources.
The best option is to ask Azure Support for more details, or to post new Data Factory feedback here: https://feedback.azure.com/forums/270578-data-factory. Hopefully the Data Factory product team will see it and give us some guidance.
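If the goal is just to confirm the file's column names and order (since the types will always come back as String for delimited text), one pragmatic workaround is to declare the expected structure with every type set to String and compare the serialized arrays in the If Condition. A rough sketch, assuming a Get Metadata activity named Get Metadata1 (with Structure in its field list) and a pipeline variable expectedStructure holding the expected JSON as a string:

{
  "name": "ValidateCsvSchema",
  "type": "IfCondition",
  "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "expression": {
      "value": "@equals(string(activity('Get Metadata1').output.structure), variables('expectedStructure'))",
      "type": "Expression"
    },
    "ifTrueActivities": [],
    "ifFalseActivities": []
  }
}

Here expectedStructure would be set to something like [{"name":"Id","type":"String"},{"name":"Amount","type":"String"}]. This catches missing, extra or reordered columns, but, as noted above, it cannot verify that a column actually contains integers.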

Issues while building Tabular Data Model from Vertica

I have been assigned a new project where I need to prepare a Power BI report using Azure Analysis Services (data mart). The flow is: data from a Vertica DW -> Azure Analysis Services (via a tabular model) -> Power BI. I am fairly new to both tabular models and Vertica.
Scenario:
1) The DW is on the Vertica platform, hosted online.
2) I am trying to build a data model using an Analysis Services Tabular project in VS 2019.
3) This model will be deployed to Azure and will act as the data source for Power BI.
4) I cannot select individual tables directly (from Vertica) while performing "Import from Data Source"; I have to use a view here.
5) I have been given a single big table with around 30 columns as the source from Vertica.
Concerns:
1) While importing data from Vertica, there is no option to "Transform" it as there is in the Power BI Query Editor. However, when I tried importing a local file, the option was available.
2) With reference to Scenario #5, how can I split the big table into various dimensions in Model.bim? Currently I am adding them as calculated tables. Is this the optimal way, or can you suggest something better?
Also, is there any good online material where I can get my hands dirty with modeling in an Analysis Services Tabular project (I can do it very well in Power BI)?
Thanks in advance
Regards
My personal suggestion is to avoid Visual Studio for this like the plague. Unfortunately, it is not only unhelpful here but can actively get in your way.
Instead, use Tabular Editor. From there you can easily work with the tabular model.
I would also suggest avoiding calculated tables as dimensions; instead, create several tables in Tabular Editor and simply modify each one's source query / fields.
Regarding the 1st question, I believe there is a bug in the Vertica connection with Power BI; it works fine everywhere else except for this combination.
For #2, you can choose "Import new tables" from the connected data source; it can be found under the Tabular Editor view.
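To make the "several tables with their own source queries" suggestion concrete: in a 1200+ compatibility-level model (Model.bim, or the same thing viewed in Tabular Editor), each dimension can be an ordinary table whose partition query selects the distinct attributes out of the one big Vertica table. A rough sketch of such a table definition, assuming a legacy/provider data source named "Vertica" and made-up column names:

{
  "name": "DimStore",
  "columns": [
    { "name": "StoreKey", "dataType": "int64", "sourceColumn": "store_key" },
    { "name": "StoreName", "dataType": "string", "sourceColumn": "store_name" }
  ],
  "partitions": [
    {
      "name": "DimStore",
      "source": {
        "type": "query",
        "dataSource": "Vertica",
        "query": "SELECT DISTINCT store_key, store_name FROM public.big_sales_table"
      }
    }
  ]
}

Compared with calculated tables, the DISTINCT work is pushed to the source and each dimension can be processed on its own, without a DAX dependency on the big imported table.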

Why are ADF datasets important?

In Azure Data Factory v2 I've created a number of pipelines. I noticed that for each pipeline I create, a source and a destination dataset are created.
According to the ADF documentation: A dataset is a named view of data that simply points or references the data you want to use in your activities as inputs and outputs.
These datasets are visible within my data factory. I'm curious why I would care about these? They almost seem like 'under the hood' objects that ADF creates to move data around. What value do they have for me, and why would I care about them?
These datasets are entities that can be reused. For example, dataset A can be referenced by many pipelines if those pipelines need the same data (same table or same file).
Linked services can be reused too. I think that's why ADF has these concepts.
You may be seeing those show up in your factory if you create pipelines via the Copy Wizard tool, which creates datasets for your source and sink. The Copy activity is the primary consumer of datasets in ADF pipelines.
If you are using ADF v2 to transform data, no dataset is required. But if you are using the ADF Copy activity to copy data, a dataset is what tells ADF the path and name of the object to copy from/to. Once you have created a dataset, it can be used in many pipelines. Could you help me understand why creating a dataset is a point of friction in your projects?
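The reuse point is easiest to see with a parameterized dataset: one definition can stand in for any table in a given database, and every Copy activity that references it just supplies the schema and table names. A minimal sketch (all names here are illustrative), assuming a linked service called AzureSqlDb:

{
  "name": "AzureSqlTableGeneric",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
    "parameters": {
      "schemaName": { "type": "string" },
      "tableName": { "type": "string" }
    },
    "typeProperties": {
      "schema": { "value": "@dataset().schemaName", "type": "Expression" },
      "table": { "value": "@dataset().tableName", "type": "Expression" }
    }
  }
}

Pipelines can then reference this single dataset with different parameter values rather than accumulating one source/sink dataset pair per copy, which is what the Copy Wizard produces by default.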

What is the point of a table in a data lake? [closed]

Closed. This question is opinion-based and is not currently accepting answers. Closed 2 years ago.
I thought the whole point of using a data lake versus a data warehouse was to invert the ETL (Extract, Transform, Load) process to LET (Load, Extract, Transform). Doesn't extracting this data, transforming it and loading it into a table get us right back where we started?
IMHO the point of a data lake is to store all types of data: unstructured, semi-structured and structured. The Azure version of this is Azure Data Lake Store (ADLS) and its primary function is scalable, high-volume storage.
Separately, there is a product called Azure Data Lake Analytics (ADLA). This analytics product can interact with ADLS, but also with blob storage, SQL Server on a VM (IaaS), the two PaaS database offerings (SQL Database and SQL Data Warehouse) and HDInsight. It has a powerful batch language called U-SQL, a combination of SQL and .NET, for interrogating and manipulating these data stores. It also has a database option which, where appropriate, allows you to store data you have processed in table format.
One example might be where you have some unstructured data in your lake, you run your batch job over it and want to store the structured intermediate output. This is where you might store the output in an ADLA database table. I tend to use them where I can prove I can get a performance improvement out of them and/or want to take advantage of the different indexing options.
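As a rough illustration of that pattern (the database, paths and column names here are all made up), a U-SQL script can extract the raw files, shape them, and persist the result as an ADLA catalog table with a clustered index and a hash distribution:

// Shape raw files from the lake and persist the result as an ADLA catalog table.
CREATE DATABASE IF NOT EXISTS LakeDb;
USE DATABASE LakeDb;

@raw =
    EXTRACT EventId long,
            EventDate DateTime,
            Payload string,
            FileName string         // virtual column bound to the {FileName} token in the path
    FROM "/raw/events/{FileName}.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

DROP TABLE IF EXISTS dbo.CleanedEvents;
CREATE TABLE dbo.CleanedEvents
(
    INDEX idx_CleanedEvents CLUSTERED (EventDate ASC)
    DISTRIBUTED BY HASH (EventId)
) AS
SELECT EventId, EventDate, Payload
FROM @raw
WHERE Payload != null;

The indexing and distribution choices are where the performance gains mentioned above tend to come from.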
I do not tend to think of these as warehouse tables because they don't interact well with other products yet, i.e. they don't yet have endpoints / aren't visible; for example, Azure Data Factory can't move tables from there yet.
Finally I tend to think of ADLS as analogous to HDFS and U-SQL/ADLA as analogous to Spark.
HTH
By definition, a data lake is a huge repository storing raw data in its native format until needed. Lakes use a flat architecture rather than a nested one (http://searchaws.techtarget.com/definition/data-lake). Each data element in the lake has a unique ID and metadata tags, which are used in queries.
So data lakes can store structured, semi-structured and unstructured data. Structured data would include SQL database type data in tables with rows and columns. Semi-structured would be CSV files and the like. And unstructured data is anything and everything -- emails, PDFs, video, binary. It's that ID and the metadata tags that help users find data inside the lake.
To keep a data lake manageable, successful implementers rotate, archive or purge data from the lake on a regular basis. Otherwise it becomes what some have called a "data swamp", basically a graveyard of data.
The traditional ETL process is better suited to data warehouses because they are more structured and data in a warehouse is there for a purpose. Data lakes, being less structured, are better suited to other approaches such as ELT (Extract, Load, Transform), because they store raw data that is only categorized by each query. (See this article by Panopoly for a discussion of ELT vs ETL.) For example, say you want to see customer data from 2010. When you query a data lake for that, you will get everything from accounting data to CRM records and even emails from 2010. You cannot analyze that data until it has been transformed into usable formats where the common denominators are customers + 2010.
To me, the answer is "money" and "resources"
(and probably correlated to use of Excel to consume data :) )
I've been through a few migrations from RDBMS to Hadoop/Azure platforms and it comes down to the cost/budget and use-cases:
1) Port legacy reporting systems to new architectures
2) Skillset of end-users who will consume the data to drive business value
3) The type of data being processed by the end user
4) Skillset of support staff who will support the end users
5) Whether the purpose of migration is to reduce infrastructure support costs, or enable new capabilities.
Some more details for a few of the above:
Legacy reporting systems are often based either on some analytics software or on a homegrown system that, over time, has developed a deeply embedded expectation of clean, governed, structured, strongly-typed data. Switching out the backend system often requires publishing exactly the same structures to avoid replacing the entire analytics solution and code base.
Skillsets are a primary concern as well, because you're often talking about hundreds to thousands of people who are used to using Excel, with some knowing SQL. In my experience, few end users and few analysts I've worked with know how to program. Statisticians and data engineers tend towards R/Python, and developers with Java/C# experience tend towards Scala/Python.
Data types are a clincher for which tool is right for the job... but here you have a big conflict, because there are folks who understand how to work with "data rectangles" (e.g. dataframes/tabular data), and those who know how to work with other formats. However, I still find folks consistently turning semi-structured/binary/unstructured data into a table as soon as they need to operationalize a result... because support for Spark is hard to find.

Azure POS and weather data analysis strategy

I have a question about the approach to a solution in Azure: how do I decide which technologies to use and find the best combination of them?
Let's suppose I have two data sets, both growing daily:
I have a CSV file which arrives daily in my ADL store; it contains weather data for all possible latitude and longitude combinations and their zip codes, together with 50 different weather variables.
I have another data set with POS (point of sale) data, which also arrives as a daily CSV file in my ADL store. It contains sales data for all retail locations.
The desired output is to have the files "shredded" so that the data is prepared for Azure ML forecasting of sales based on weather, with the forecast produced per retail location and delivered to each of them via a Power BI dashboard. A requirement is that no location may see the forecasts of any other location.
My questions are:
How do I choose the right set of technologies?
How do I append the incoming daily data?
How do I create separate ML forecasting results for each location?
Any general guidance on the architecture is appreciated, as are more specific comparisons of suitable solutions.
This is way too broad a question.
I will only answer your ADL-specific question #2 and give you a hint on #3 that is not related to Azure ML (since I don't know what that format is):
If you just use files, add date/time information to your file path names (either in the folder or the file name). Then use U-SQL file sets to query the ranges you are interested in. If you use U-SQL tables, use PARTITIONED BY. For more details, look in the U-SQL Reference documentation.
If you need to create more than one file as output, you have two options:
a. If you know all the file names, write an OUTPUT statement for each file, selecting only the data relevant to it (see the sketch after this list).
b. Otherwise, you have to dynamically generate a script and then execute it, similar to this.
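A rough U-SQL sketch of both points, using made-up paths, columns and a fixed cutoff date. The {SliceDate:...} tokens in the input path form a file set, so the script only reads the daily files it needs, and each known location gets its own OUTPUT statement (option a):

DECLARE @cutoff DateTime = DateTime.Parse("2017-06-01");

// File set: one POS file per day under /pos/yyyy/MM/dd/, exposed through the SliceDate virtual column.
@sales =
    EXTRACT StoreId int,
            SaleDate DateTime,
            Amount decimal,
            SliceDate DateTime      // virtual column bound to the date tokens in the path
    FROM "/pos/{SliceDate:yyyy}/{SliceDate:MM}/{SliceDate:dd}/sales.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// Only slices on or after the cutoff are read.
@recent =
    SELECT StoreId, SaleDate, Amount
    FROM @sales
    WHERE SliceDate >= @cutoff;

// Option (a): one OUTPUT per known location, selecting only that location's rows.
@store1 = SELECT * FROM @recent WHERE StoreId == 1;
OUTPUT @store1
TO "/output/store1/sales.csv"
USING Outputters.Csv(outputHeader : true);

@store2 = SELECT * FROM @recent WHERE StoreId == 2;
OUTPUT @store2
TO "/output/store2/sales.csv"
USING Outputters.Csv(outputHeader : true);

If the set of locations is open-ended, that is where option (b) comes in: generate the per-location OUTPUT statements (or the whole script) dynamically and submit it.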
