AzureML data sharing between pipelines - azure-machine-learning-service

Taking my first steps in AML...
I am trying to create several pipelines; the idea is that some of the data generated by one pipeline will eventually be used by other pipelines. The way I am doing this is as follows:
In the first pipeline, I am registering the data that I want to use later on as datasets with:
dir = OutputFileDatasetConfig(<<name>>).read_delimited_files().register_on_complete(<<ds_name>>)
I am saving data normally (data is numpy arrays)
np.savetxt(os.path.join(<<dir>>, <<file>>), X_test, delimiter=",")
In the second pipeline I am reading the location of the data
dir = Run.get_context().input_datasets[<<ds_name>>].download()
and then loading it in numpy
a = np.loadtxt(dir[0])
Not sure if there are better ways to achieve this. Any ideas, please?
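
For reference, here is a rough sketch of how this pattern can be wired up end to end, assuming the v1 azureml-core SDK; the dataset, script, compute, and file names below are placeholders rather than anything from the question:

# Rough sketch of the pattern described above, assuming the v1 azureml-core SDK.
# Dataset, script, compute, and file names are placeholders.
from azureml.core import Workspace, Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Producing pipeline: the step script writes CSV files (np.savetxt) into the
# folder passed as --out_dir, and that folder is registered as a file dataset
# when the step completes.
features_out = (OutputFileDatasetConfig(name="features")
                .register_on_complete(name="shared_features_ds"))
produce_step = PythonScriptStep(
    name="produce_features",
    script_name="produce.py",
    source_directory=".",
    arguments=["--out_dir", features_out],  # resolved to a writable path at runtime
    compute_target="cpu-cluster",
)
producer = Pipeline(workspace=ws, steps=[produce_step])

# Consuming pipeline: look up the registered dataset by name and pass it as a
# named input; inside consume.py it can be fetched with
# Run.get_context().input_datasets["features_in"].download() and read with np.loadtxt.
shared_ds = Dataset.get_by_name(ws, name="shared_features_ds")
consume_step = PythonScriptStep(
    name="consume_features",
    script_name="consume.py",
    source_directory=".",
    inputs=[shared_ds.as_named_input("features_in")],
    compute_target="cpu-cluster",
)
consumer = Pipeline(workspace=ws, steps=[consume_step])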

Related

Azure Data Factory ETL Process

I would like to merge data from different data sources (an ERP system, Excel files) with ADF and make it available in an Azure SQL DB for further analysis.
I'm not sure where and when I should do the transformations and joins between the tables. Can I run all of this directly in the pipeline and then load the data into the Azure SQL DB, or do I need to stage the data first?
My understanding is to load the data into ADF using Copy activities and Datasets, transform and merge the datasets there with Mapping Data Flows or similar activities, and then load them into the Azure SQL DB.
Your question is entirely requirement-based: you can go for either an ETL or an ELT process. Since your sink is Azure SQL DB, I would suggest going with ELT, as you can handle a lot of the transformations in SQL itself by creating views on the landing tables.
If you have complex transformations to handle, then go with ETL and use Data Flows instead.
Also, regarding staging tables: if your requirement is to perform a daily incremental load after the first full load, then you should opt for a staging table.
Check out this video for full load and incremental load.
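
To make the ELT suggestion concrete, here is a hedged sketch of the idea: ADF copy activities land the raw data, and the transformation lives in a SQL view created from Python via pyodbc. The table, view, and column names and the connection string are all placeholders:

# Hedged sketch of the ELT idea: ADF copies raw data into landing tables, and
# the transformation lives in a SQL view. All names and the connection string
# are placeholders, not anything from the question.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
conn.autocommit = True

create_view = """
CREATE OR ALTER VIEW dbo.vw_sales_enriched AS
SELECT e.order_id,
       e.order_date,
       e.amount,
       x.region
FROM dbo.landing_erp_orders AS e          -- loaded by one ADF copy activity
JOIN dbo.landing_excel_regions AS x       -- loaded by another copy activity
  ON e.customer_id = x.customer_id;
"""
conn.cursor().execute(create_view)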

Is there a way to stop Azure ML throwing an error when exporting zero lines of data?

I am currently developing an Azure ML pipeline, one of whose outputs is a SQL table holding all of the unique items that have been fed into it. There is no way to know in advance whether the data fed into the pipeline contains new unique items or repeats of previous items, so before updating the table it maintains, the pipeline pulls the data already in that table and drops any incoming items that already appear.
However, due to this there are cases where this self-reference results in zero new items being found, and as such there is nothing to export to the SQL table. When this happens Azure ML throws an error, as it is considered an error for there to be zero lines of data to export. In my case, however, this is expected behaviour, and as such absolutely fine.
Is there any way for me to suppress this error, so that when it has zero lines of data to export it just skips the export module and moves on?
It sounds as if you are struggling to orchestrate a data pipeline because the orchestration is happening in two places. My advice would be to either move more of the orchestration into Azure ML, or make the separation between the two greater. One way to do this would be to have a regular export to blob of the table you want to use for training. Then you can use a Logic App to trigger a pipeline whenever a non-empty blob lands in that location.
This issue has been resolved by an update to Azure Machine Learning: you can now run pipelines with the "continue on step failure" flag set, which means that steps following the failed data export will continue to run.
This does mean you will need to design your pipeline so that its downstream modules can handle upstream failures; this must be done carefully.
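
For example, with the v1 azureml-core SDK the flag can be set when submitting the pipeline; the steps, scripts, and experiment name below are placeholders for the asker's own pipeline:

# Minimal sketch, assuming the v1 azureml-core SDK; the steps, scripts, and
# experiment name are placeholders for the asker's own pipeline.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Placeholder steps standing in for the real export and downstream steps.
export_step = PythonScriptStep(name="export_to_sql", script_name="export.py",
                               source_directory=".", compute_target="cpu-cluster")
downstream_step = PythonScriptStep(name="post_export", script_name="post_export.py",
                                   source_directory=".", compute_target="cpu-cluster")
downstream_step.run_after(export_step)

pipeline = Pipeline(workspace=ws, steps=[export_step, downstream_step])

# continue_on_step_failure lets the remaining steps run even when the export
# step fails because there were zero rows to write.
run = pipeline.submit(experiment_name="maintain-unique-items",
                      continue_on_step_failure=True)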

Conditional statements importing JSON to SQL Azure

I have a pipeline in Azure Data Factory which imports JSON to SQL Azure. This works fine, except some JSON files have multiple structures.
It would be fine if every line in the file were the same. I can take two runs at the files in Data Lake Gen2: I don't mind one pipeline ignoring the lines with rc, and another pipeline which ignores the rows with marketDefinition and just processes the others, getting both into separate tables.
Not sure what the best solution here is.
For now, Data Factory doesn't work well with multiple files that have different schemas.
The pre-copy script is an operation run directly against the SQL database; even if you pass the source file path to the script, it still won't filter the source dataset. It's an independent command.
So I'm afraid to say there isn't an ideal solution for your scenario.
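
One workaround outside of Data Factory is to split the mixed-schema file before loading it, so that each copy activity only ever sees a single schema. A hedged sketch, where the file names and the exact key checks are assumptions based on the question:

# Split a JSON-lines file into two files, one per structure, so each downstream
# copy activity sees a single schema. File names and key checks are assumptions.
import json

with open("input.jsonl") as src, \
        open("marketDefinition.jsonl", "w") as md_out, \
        open("rc.jsonl", "w") as rc_out:
    for line in src:
        record = json.loads(line)
        # Route each line by which structure it carries.
        if "marketDefinition" in record:
            md_out.write(line)
        elif "rc" in record:
            rc_out.write(line)
        # Lines matching neither structure are silently dropped; adjust as needed.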

Caching preprocessed data for ML in spark/pyspark

I would like a ML pipeline like this:
raw_data = spark.read....()
data = time_consuming_data_transformation(raw_data, preprocessing_params)
model = fit_model(data)
evaluate(model, data)
Can I cache/persist data somehow after step 2, so when I run my spark app again, the data won't have to be transformed again? Ideally, I would like the cache to be automatically invalidated when the original data or transformation code (computing graph, preprocessing_params) change.
Can I cache/persist data somehow after step 2, so when I run my spark app again, the data won't have to be transformed again?
You can of course:
data = time_consuming_data_transformation(raw_data, preprocessing_params).cache()
but if your data is non-static, it is always better to write the data to persistent storage:
time_consuming_data_transformation(raw_data, preprocessing_params).write.save(...)
data = spark.read.load(...)
It is more expensive than caching, but it prevents hard-to-detect inconsistencies when the data changes.
Ideally, I would like the cache to be automatically invalidated when the original data
No. Unless it is a streaming program (and learning on streams is not so trivial) Spark doesn't monitor changes in the source.
or transformation code (computing graph, preprocessing_params) change.
It is not clear to me how things change, but it is probably not something that Spark will solve for you. You might need some event-driven or reactive components.
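
One manual approximation of that invalidation is to key the persisted location on the preprocessing parameters, so a new parameter set writes to (and reads from) a fresh path. A sketch, where the transformation function is passed in and the hashing scheme and paths are assumptions:

# Persist the transformed data under a path derived from the preprocessing
# params; rerunning with the same params reuses the stored result, while new
# params write somewhere fresh.
import hashlib
import json

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

def cached_transform(raw_data, transform_fn, preprocessing_params,
                     base_path="/tmp/preprocessed"):
    # Deterministic location per parameter set (assumes the params are JSON-serializable).
    key = hashlib.sha256(
        json.dumps(preprocessing_params, sort_keys=True).encode()
    ).hexdigest()[:16]
    path = f"{base_path}/{key}"
    try:
        # Reuse the previously written result if it exists.
        return spark.read.parquet(path)
    except AnalysisException:
        data = transform_fn(raw_data, preprocessing_params)
        data.write.mode("overwrite").parquet(path)
        return spark.read.parquet(path)

# Usage with the question's function:
# data = cached_transform(raw_data, time_consuming_data_transformation, preprocessing_params)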

Azure Data Factory Data Migration

I'm not really sure if this is an explicit question or just a request for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one on a not so often basis.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations and combine it based on the common attributes, which will result in a new dataset. Then, from this dataset, push the data to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a Copy activity, but it only works on a single dataset input, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I am trying to do.
I did find that there is a special activity called a Custom Activity that is in effect a user-defined activity that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure it is the most optimal solution.
On top of that, I am also unclear about how the merging of the 3 data sources would work when data needs to be joined across them, given that the datasets are just snapshots of the originating source data, which leads me to think data could end up missing. I'm not sure if publishing some of the data someplace would be required, but it seems like that would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common for either of these integration tools.
Your ADF design should look something like this:
1. You define your 3 data sources as ADF Datasets on top of a corresponding Linked Service.
2. Then you build a pipeline that brings the information from SQL Server into a temporary data source (an Azure Table, for example).
3. Next you build 2 pipelines that will each take one of your NoSQL datasets and run a function to update the temporary data source, which is the output.
4. Finally you build a pipeline that brings all your data from the temporary data source into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
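
Conceptually, the combine step is just a join on the common attribute followed by a load into the target database. Here is a hedged pandas sketch of that stage (all column names, sample values, and the connection string are placeholders); in ADF itself it would be implemented as a data flow or custom activity:

# Conceptual sketch of the merge-and-push stage using pandas. The three
# DataFrames stand in for snapshots of the two NoSQL collections and the SQL
# Server table; all names and the connection string are placeholders.
import pandas as pd
from sqlalchemy import create_engine

collection_a = pd.DataFrame({"common_id": [1, 2], "a_value": ["x", "y"]})
collection_b = pd.DataFrame({"common_id": [1, 2], "b_value": [10, 20]})
sql_table = pd.DataFrame({"common_id": [1, 2], "sql_value": ["p", "q"]})

# Join the three snapshots on the shared attribute.
combined = (collection_a
            .merge(collection_b, on="common_id", how="inner")
            .merge(sql_table, on="common_id", how="inner"))

# Push the combined dataset to the target SQL Server DB.
engine = create_engine("mssql+pyodbc://user:password@target-dsn")
combined.to_sql("combined_dataset", engine, if_exists="replace", index=False)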
