In Azure Data Factory v2 I've created a number of pipelines. I noticed that for each pipeline I create, a source and a destination dataset are created.
According to the ADF documentation: A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs.
These datasets are visible within my data factory. I'm curious why I should care about them. They almost seem like 'under the hood' objects ADF creates to move data around. What value are they to me, and why would I care about them?
These datasets are entities that can be reused. For example, dataset A can be referenced by many pipelines if those pipelines need the same data (same table or same file).
Linked services can be reused too. I think that's why ADF has these concepts.
You may be seeing those show up in your Factory if you create pipelines via the Copy Wizard Tool. That will create Datasets for your Source & Sink. The Copy Activity is the primary consumer of Datasets in ADF Pipelines.
If you are using ADF v2 to transform data, no dataset is required. But if you are using the ADF copy activity to copy data, a dataset is used to tell ADF the path and name of the object to copy from/to. Once you have created a dataset, it can be used in many pipelines. Could you please help me understand why creating a dataset is a point of friction for you in your projects?
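To make the reuse point concrete, here is a rough sketch using the azure-mgmt-datafactory Python SDK, in which one blob dataset is defined once and then referenced as the source of two different copy pipelines. All names (subscription, resource group, factory, linked service, dataset and pipeline names) are placeholders, the sink datasets are assumed to already exist, and the exact model classes and constructor arguments vary between SDK versions, so treat this as an illustration of the concept rather than a drop-in sample.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureBlobDataset, BlobSink, BlobSource, CopyActivity,
        DatasetReference, DatasetResource, LinkedServiceReference, PipelineResource,
    )

    # Placeholder names -- substitute your own.
    subscription_id = "<subscription-id>"
    rg_name = "myResourceGroup"
    df_name = "myDataFactory"

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Define the dataset once: it is only a named pointer to data reachable
    # through a linked service, not the data itself.
    sales_ds = DatasetResource(properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="StorageLinkedService"),
        folder_path="input/sales",
        file_name="sales.csv"))
    adf_client.datasets.create_or_update(rg_name, df_name, "SalesBlobDataset", sales_ds)

    # Reuse the same dataset as the source of two different pipelines.
    source_ref = DatasetReference(type="DatasetReference", reference_name="SalesBlobDataset")
    for pipeline_name, sink_dataset in [("CopySalesToStaging", "StagingBlobDataset"),
                                        ("CopySalesToArchive", "ArchiveBlobDataset")]:
        copy = CopyActivity(
            name="CopySales",
            inputs=[source_ref],
            outputs=[DatasetReference(type="DatasetReference", reference_name=sink_dataset)],
            source=BlobSource(),
            sink=BlobSink())
        adf_client.pipelines.create_or_update(
            rg_name, df_name, pipeline_name, PipelineResource(activities=[copy]))

If the copy wizard created the datasets for you, this is essentially what it generated on your behalf; the value is that you can point any number of future pipelines at the same definition instead of redefining the source each time.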
I am reading an Excel file and applying some transformations.
I am not able to find any data from the Select transformer in Data Preview.
But the previous "FromNumericLen" transformer outputs data, and I can see its data in Data Preview.
Thanks.
Per the comment thread, the fix is to enable Allow Schema Drift in the Source transform.
I had been wondering whether it was possible to apply "data preparation" (.dprep) files to incoming data in score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming it was possible, but with little explanation of how to actually do it. The response was:
in your score.py file, you can invoke the dprep package from Python
SDK to apply the same transformation to the incoming scoring data.
make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it: run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) ?
How do I bundle it into the image when creating a web-service from the CLI?
Is there a switch to -f for score files?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature engineer this (.dsource) data set using the Workbench resulting in a .dprep file which may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database which once again will be of the original format (10 columns).
Obviously this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10 column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in workbench?
I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file, which has the following code to run data preparation:
from azureml.dataprep import package
df = package.run('Data analysis.dprep', dataflow_idx=0)
Hope that helps!
To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.
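Putting the two answers together, a score.py along the following lines should work. This is only a sketch under several assumptions: the .dsource/.dprep file names, the path used as the user_config key, the request payload shape, and the way the model is loaded are all made up, and you still need to bundle the .dprep (and .dsource) files into the image, as the MSDN response notes.

    import json
    import pickle

    from azureml.dataprep import package

    def init():
        # Load the trained model that was bundled with the image.
        # The file name and serialization format are assumptions.
        global model
        with open("model.pkl", "rb") as f:
            model = pickle.load(f)

    def run(raw_json):
        # Incoming scoring data arrives in the original 10-column shape,
        # here assumed to be posted as {"data": [[...], [...], ...]}.
        rows = json.loads(raw_json)["data"]

        # Map the absolute path of the .dsource used at training time to the
        # in-memory rows (a list of lists), as run_on_data expects.
        user_config = {"/azureml-share/live_db.dsource": rows}  # path is an assumption

        # Replay the Workbench data preparation on the incoming rows; the result
        # is a dataframe with the same 13 engineered columns used in training.
        df = package.run_on_data(user_config, "live_db.dprep", dataflow_idx=0)

        predictions = model.predict(df)
        return json.dumps(predictions.tolist())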
I have a question about the approach to a solution in Azure. The question is how to decide which technologies to use and how to find the best combination of them.
Let's suppose I have two data sets, which are growing daily:
I have a CSV file which arrives daily in my ADL store and contains weather data for all possible latitude and longitude combinations and their zip codes, together with 50 different weather variables.
I have another data set with POS (point of sale) data, which also arrives as a daily CSV file in my ADL storage. It contains sales data for all retail locations.
The desired output is to have the files "shredded" in a way that prepares the data for Azure ML forecasting of sales based on weather, with the forecasting done per retail location and delivered to each location via a Power BI dashboard. A requirement is that no location may see the forecasts for any other location.
My questions are:
How do I choose the right set of technologies?
How do I append the incoming daily data?
How do I create separate ML forecasting results for each location?
Any general guidance on the architecture topic is appreciated, and any more specific ideas on comparison of different suitable solutions is also appreciated.
This is way too broad a question.
I will only answer your ADL-specific question #2 and give you a hint on #3 that is not related to Azure ML (since I don't know what that format is):
If you just use files, add date/time information to your file path (either in a folder name or in the file name). Then use U-SQL file sets to query the ranges you are interested in. If you use U-SQL tables, use PARTITIONED BY. For more details, look in the U-SQL reference documentation.
If you need to create more than one file as output, you have two options:
a. If you know all the file names, write an OUTPUT statement for each file, selecting only the data relevant to it.
b. Otherwise you have to dynamically generate a script and then execute it, similar to this (see the sketch below).
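For option (b), one way to do it is to generate the U-SQL text from the list of locations and then submit the generated script as a regular ADLA job (for example from the Azure CLI or the portal). This is a minimal sketch; the locations, paths, and schema below are placeholders for illustration only.

    # Sketch: build a U-SQL script with one OUTPUT statement per retail location.
    locations = ["store_001", "store_002", "store_003"]   # hypothetical location IDs

    script = [
        "@pos =",
        "    EXTRACT Location string,",
        "            Amount   decimal,",
        "            FileDate DateTime   // virtual column taken from the file path",
        '    FROM "/pos/{FileDate:yyyy}/{FileDate:MM}/{FileDate:dd}/pos.csv"',
        "    USING Extractors.Csv(skipFirstNRows: 1);",
        "",
    ]

    for loc in locations:
        script += [
            f'@r_{loc} = SELECT * FROM @pos WHERE Location == "{loc}";',
            f'OUTPUT @r_{loc} TO "/output/{loc}/pos.csv" USING Outputters.Csv(outputHeader: true);',
            "",
        ]

    # Write the generated script to disk, ready to be submitted as an ADLA job.
    with open("split_by_location.usql", "w") as f:
        f.write("\n".join(script))

Writing one output folder per location also lines up with the isolation requirement: each location's Power BI dashboard reads only its own folder.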
I am working with Azure and have a number of SQL databases there.
I am looking to transfer data between these databases. I have been doing some research and found that Azure Data Factory can be used to achieve this. However, I found it difficult to find information on it.
Could someone point me in the direction of using Data Factory to take data from db1, transform and massage it, and then insert it into db2?
If you simply COPY data from source A to source B, ADF is a good option for you. There is a rich set of supported sources and destinations.
To quickly try it out, you can use the Copy wizard, which is code-free. https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-wizard
For more details about the Copy Activity, you may look at this: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-movement-activities
When you say "massage data", I don't know exactly what that will involve, but the ADF Custom Activity and Stored Procedure Activity should mostly meet your needs.
Do Azure Data Lake Analytics and U-SQL support the notion of cursors in scripts?
I have a data set that contains paths to further data sets I would like to extract, and I want to output the results to separate files.
At the moment I can't seem to find a solution for dynamically extracting and outputting data based on values inside data sets.
U-SQL currently expects that files are known at compile time. Thus, you cannot do extraction or outputting based on locations provided inside a file.
You can specify file sets in the EXTRACT statement, which are somewhat data-driven. We are currently working on adding the ability to use file sets on OUTPUT as well.
You can file feature requests at http://aka.ms/adlfeedback.
Cheers
Michael
You might be able to write a Processor to iterate over the rows in the primary dataset. However, you might not be able to access the additional datasets in the Processor.
Another workaround might be to concatenate all the additional datasets and perform a join with the primary dataset.