How to apply a dprep package to incoming data in score.py (Azure ML Workbench)

I had been wondering whether it is possible to apply "data preparation" (.dprep) files to incoming data in score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming that it is possible, but with little explanation of how to actually do it. The response was:
In your score.py file, you can invoke the dprep package from the Python SDK to apply the same transformation to the incoming scoring data. Make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None)?
How do I bundle it into the image when creating a web service from the CLI?
Is there a switch for it, similar to -f for the score file?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature-engineer this (.dsource) data set using the Workbench, resulting in a .dprep file which may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database, which once again will be in the original format (10 columns).
Obviously this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10-column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in workbench?

I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file, which uses the following to run data preparation:
from azureml.dataprep import package
df = package.run('Data analysis.dprep', dataflow_idx=0)
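As in the scenario described in the question, the returned DataFrame can then be used to train and test the model. A rough sketch of that step is below; the "label" column name and the choice of estimator are placeholders, not something from the original post:
import pandas as pd
from sklearn.linear_model import LogisticRegression
# df is the DataFrame returned by package.run(...) above
X = df.drop(columns=["label"])   # placeholder: all feature columns
y = df["label"]                  # placeholder: the target column
model = LogisticRegression().fit(X, y)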
Hope that helps!

To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.
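Putting that together with the scenario in the question, a minimal sketch of a score.py might look like the following. This is only an illustration based on the signature quoted above: the file paths, the model file name, and the model-loading details are placeholders, and the call has not been verified against a real deployment.
import pickle
from azureml.dataprep import package

DSOURCE_PATH = '/path/to/mydata.dsource'  # placeholder: absolute path of the .dsource referenced by the package
DPREP_PATH = 'mydata.dprep'               # placeholder: the .dprep file bundled into the image alongside score.py

def init():
    global model
    with open('model.pkl', 'rb') as f:    # placeholder: however you persisted the trained model
        model = pickle.load(f)

def run(raw_rows):
    # raw_rows: the incoming 10-column records as a list of lists
    user_config = {DSOURCE_PATH: raw_rows}
    # re-apply the transformations defined in the Workbench, yielding the 13-column frame
    df = package.run_on_data(user_config, DPREP_PATH, dataflow_idx=0)
    return model.predict(df).tolist()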

Related

Azure ML Dataset Versioning: What is Different if it Points to the Same Data?

Context
In AzureML, we are facing an error when running a pipeline. It fails on to_pandas_dataframe because a particular dataset "could not be read beyond end of stream". On its own, this seems to be an issue with the parquet file that is being registered, maybe special characters being misinterpreted.
However, when we explicitly load a previous "version" of this Dataset (which points to the exact same location of data), it works as expected. In the documentation (here), Azure says that "when you load data from a dataset, the current data content referenced by the dataset is always loaded." This makes me think that a new version of the dataset with the same schema will be, well, the same.
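For reference, explicitly loading a pinned prior version looks roughly like this (a sketch using the azureml.core SDK; the dataset name and version number are placeholders):
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
# "my_dataset" and version=3 are placeholders for the registered name and the earlier version
latest = Dataset.get_by_name(ws, name="my_dataset")              # resolves to the latest version
pinned = Dataset.get_by_name(ws, name="my_dataset", version=3)   # explicitly load a previous version
df = pinned.to_pandas_dataframe()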
Questions
What makes a Dataset version different from another version when both point to the same location? Is it only the schema definition?
Based on these differences, is there a way to figure out why one version would be succeeding and another failing?
Attempts
The schemas of the two versions are identical. We can profile both in AzureML, and all the fields have the same profile information.
As rightly suggested by Anand Sowmithiran in the comments, this looks more like a bug in the SDK.
You can raise an Azure support ticket.

Create Data Catalog column tags by inspecting BigQuery data with Cloud Data Loss Prevention

I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.
I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.
I also found this similar tutorial, which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP.
So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.
UPDATE: I've made some progress and changed my approach as a result.
DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.
inspectContent() cannot be run on a BigQuery table, but it can inspect structured text, which is what the Java script I linked above is utilising: that script queries the BigQuery table, constructs a Table from the results, and then passes that Table into inspectContent(), which returns a Findings object containing field names. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the format of a Table, as NodeJS doesn't appear to have a constructor for that type like Java does.
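For what it's worth, the shape of such an inspect request with a table item looks roughly like the sketch below. It uses the Python DLP client rather than NodeJS (the request structure is the same JSON shape in both clients), and the project ID, column names, and sample values are placeholders:
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
project_id = "my-project"  # placeholder

# A DLP Table built from rows you queried out of BigQuery yourself
item = {
    "table": {
        "headers": [{"name": "email"}, {"name": "phone"}],
        "rows": [
            {"values": [{"string_value": "jane@example.com"}, {"string_value": "555-0100"}]},
        ],
    }
}
inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
    "include_quote": True,
}

response = dlp.inspect_content(
    request={"parent": f"projects/{project_id}", "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    # for table items, the column name is reported in the record location of the finding
    field = finding.location.content_locations[0].record_location.field_id.name
    print(field, finding.info_type.name)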
I was unable to find Node.js documentation implementing column-level tags.
However, you might find the official Policy Tags documentation helpful to point you in the right direction. In particular, you might be missing some roles needed to manage column-level tags.

How can I load my own dataset for person re-identification?

How can I load a dataset for person re-identification? In my dataset there are two folders, train and test.
I wish I could provide comments, but I cannot yet. Therefore, I will "answer" your question to the best of my ability.
First, you should provide a general format or example content of the dataset. This would help me provide a less nebulous answer.
Second, from the nature of your question I am assuming that you are fairly new to Python in general; forgive me if I'm wrong in that assumption. With that assumption, depending on what kind of data you are trying to load (i.e. text, numbers, or a mixture of text and numbers), there are various ways to load it, some easier than others. If you are strictly loading numbers, I suggest using numpy.loadtxt(<file name>). If you are working with text, you could use the pandas package, or, if it's in a CSV file, you could use Python's built-in csv module. Alternatively, if it's in a format that TensorFlow can read, you could use the provided load-data functions.
Once you have loaded your data you will need to separate the data into the input and output values. Considering that Tensorflow models accept either lists or numpy arrays, you should be able to use these in your training and testing steps.
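For example, assuming the train and test folders each contain a CSV file of numeric features with the label in the last column (the file names and layout here are placeholders, since the actual dataset format wasn't described), a minimal sketch might look like:
import pandas as pd

# placeholder paths; adjust to your actual folder layout
train_df = pd.read_csv("train/train.csv")
test_df = pd.read_csv("test/test.csv")

# split into inputs (all but the last column) and outputs (the last column)
X_train, y_train = train_df.iloc[:, :-1].to_numpy(), train_df.iloc[:, -1].to_numpy()
X_test, y_test = test_df.iloc[:, :-1].to_numpy(), test_df.iloc[:, -1].to_numpy()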
Check out the csv module (import csv) or load your dataset via open(filename, "r") or similar. It would be easiest if you provided more context/info.

Structure in Get Metadata activity for a CSV file dataset shows string data types for integer columns in Azure Data Factory

I want to do validation as a first step before proceeding further in the pipeline execution.
I am running the Get Metadata activity on my dataset and then checking the result against a predefined schema in an If Condition.
The metadata for CSV files shows column type string even for integer columns, which breaks the validation.
Get Metadata doesn't support this; all data types in CSV files are treated as string.
You posted a question on the Microsoft forums here: https://learn.microsoft.com/en-us/answers/questions/44635/structure-in-getmetadata-activity-for-csv-file-dat.html, and Microsoft (MSFT) confirmed that using Get Metadata on a CSV file will give all strings.
The link provided there doesn't work for the column type.
I think that's a by-design limitation with no workaround at the moment. In my experience, the structure property only works well for database datasets.
The best option is to ask Azure Support for more details, or to post new Data Factory feedback here: https://feedback.azure.com/forums/270578-data-factory. Hopefully the Data Factory product team will see it and give us some guidance.

Azure POS and weather data analysis strategy

I have a question about an approach to a solution in Azure: how to decide which technologies to use and how to find the best combination of them.
Let's suppose I have two data sets, both growing daily:
I have a CSV file which arrives daily in my ADL store and contains weather data for all possible latitude and longitude combinations and their zip codes, together with 50 different weather variables.
I have another data set with POS (point of sale) data, which also arrives as a daily CSV file in my ADL store. It contains sales data for all retail locations.
The desired output is to have the files "shredded" so the data is prepared for Azure ML forecasting of sales based on weather, with the forecasting done per retail location and delivered to each one via a Power BI dashboard. A requirement is that no location can see the forecasts for any other location.
My questions are:
How do I choose the right set of technologies?
How do I append the incoming daily data?
How do I create separate ML forecasting results for each location?
Any general guidance on the architecture topic is appreciated, and any more specific ideas on comparison of different suitable solutions is also appreciated.
This is way too broad a question.
I will only answer your ADL-specific question #2 and give you a hint on #3 that is not related to Azure ML (since I don't know what that format is):
If you just use files, add date/time information to your file path name (either in the folder or the filename). Then use U-SQL file sets to query the ranges you are interested in. If you use U-SQL tables, use PARTITIONED BY. For more details, look in the U-SQL reference documentation.
If you need to create more than one file as output, you have two options:
a. If you know all file names, write an OUTPUT statement for each file, selecting only the relevant data for it.
b. Otherwise, you have to dynamically generate a script and then execute it, similar to this.
