AzureML: pass data between pipeline steps without saving it

I have made two scripts using PythonScriptStep: data_prep.py prepares a dataset by doing some data transformation, and the result is then sent to train.py for training an ML model in AzureML.
It is possible to pass data between pipeline steps using PipelineData and OutputFileDatasetConfig; however, both appear to save the data to Azure Blob storage.
Q: How can I send the data between the steps without saving it anywhere?

The data has to be passed somehow.
You can influence which storage account is used by changing the output datastore. If the data is just a collection of numbers, you can pass "dummy" data (e.g., an empty text file) between the scripts, have the upstream step log those numbers as metrics using Run.get_context().log(*) or MLflow, and have the downstream step load those values.
Fundamentally, there's no way to pass information between steps without it being stored somewhere, whether that's the "default blob store", another storage account, or metrics in the workspace.
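For example, here is a minimal sketch (SDK v1) of pointing the intermediate output at a datastore of your choosing rather than the default blob store; the datastore name "my_other_datastore" and compute target "cpu-cluster" are hypothetical placeholders:

```python
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Route the intermediate output to a registered datastore of your choice
# instead of the workspace's default blob store.
other_store = Datastore.get(ws, "my_other_datastore")
prepped = OutputFileDatasetConfig(name="prepped",
                                  destination=(other_store, "prepped/{run-id}"))

prep_step = PythonScriptStep(name="data_prep",
                             script_name="data_prep.py",
                             arguments=["--output", prepped],
                             compute_target="cpu-cluster",
                             source_directory=".")

train_step = PythonScriptStep(name="train",
                              script_name="train.py",
                              arguments=["--input", prepped.as_input(name="prepped")],
                              compute_target="cpu-cluster",
                              source_directory=".")

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```

The data still ends up in storage, just in the account and container that the chosen datastore points to.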

Related

I can't connect to the input container, but the container is accessible and the file is there

I am learning Azure, specifically Data Factory, by working through a basic exercise.
1 - I should create an input container and an output container (using Azure Storage v2).
2 - After that, I created the datasets for input and output.
3 - Finally, I should connect the data flow to my input dataset.
But:
I can test the connections on the datasets to prove that I created them without problems, but I can't test the connection from my data flow to the input dataset.
I tried:
recreating it with different names,
keeping only the needed file in the storage,
using a different input file (I am using a sample similar to the "movies.csv" expected by the exercise).
I created an Azure blob container and uploaded a file.
I created a linked service with the Azure storage account.
I created a dataset on top of that linked service, following the procedure below:
I tested the connection, and it connected successfully.
I didn't get any error. The error you mentioned above is related to dynamic content. If you assign any parameters in the dataset, provide the parameter values correctly. I added parameters in the dataset as below.
When I tried to test the dataset, I got an error:
I added values for the parameters in the debug settings.
I tested the connection, and it connected successfully.
Otherwise, add the sink to the data flow and try to debug it; it may work.
I think I found the solution.
When I am working with debug on and, for some reason, I create another data flow, I can't connect to the new datasets.
But if I restart the debug session (turn it off and on again), the connections start working again.

Any way to exceed 10mb payload size limit in Custom Text Classification API?

I am training an Azure Custom Text Classification model. A training set of 500k text documents has been uploaded to Blob storage, so the only thing left is to use the REST API to create a training project.
The issue I am facing is that the payload for the project-creation API is limited to 10 MB. My training set would require a payload of about 80 MB.
This would be fine if I could create a project and then append labeled documents to it in multiple batches, but from what I can see, the only way to add this data in the Custom Text Classification API is to do it once during project creation, or via an update afterwards that overwrites the initially uploaded data. This means the available training dataset for this service is hard-limited to whatever I can fit into a 10 MB payload.
Does this make sense? I'd imagine there has to be a way to add more data labels to a project than fit into a 10 MB payload?
PS: I tried to upload a JSON file to blob storage and create the project that way, but it looks like this approach uses the same API and is limited by the same 10 MB payload restriction. I also tried to create a project and then substitute the project JSON in blob storage, but that fails, complaining that the file was manually changed.

How to read blobs that are being uploaded to separate folders in a container into a Stream Analytics job

I have a few IoT devices in the central application, and they are sending telemetry to an Azure blob container.
For each blob, a separate folder is created in the container (based on the upload time). The following snapshot shows the directories; in a similar way, multiple directories/subdirectories are being created to store the blobs.
How can I read this data into my Stream Analytics job?
I have a Stream Analytics job with the blob container as input; even though the container is continuously receiving data, it isn't showing any data when I run the select * query.
Please let me know how I am supposed to get blob input into Stream Analytics when each blob is stored in a separate folder in the container.
Usually, if we have a large amount of data, pulling it all with a query will take time.
Try to get the data as below:
SELECT
BlobName,
EventProcessedUtcTime,
BlobLastModifiedUtcTime
FROM Input
You can also specify tokens such as {date} and {time} in the path prefix pattern to help guide Stream Analytics to the files it should read.
When the job had been running for long enough, some output did appear, and in those output records we can notice a big delay between my custom timestamp field and the general timestamp field.
For a detailed understanding of how to configure streaming inputs, refer to the blog.
Also, if you want to read blobs from the root of the container, do not set a path pattern. Within the path, you can specify one or more instances of the following three variables: {date}, {time}, or {partition}
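For example, assuming the devices write blobs under a time-based folder layout such as telemetry/2023/01/15/10/device1.json (a hypothetical layout), the blob input could be configured as:

```
Path pattern:  telemetry/{date}/{time}
Date format:   YYYY/MM/DD
Time format:   HH
```

Stream Analytics then only scans blobs whose paths match that pattern instead of the whole container.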

Azure DF: Get metadata of millions of files located in a VM and call a stored procedure to update file details in a DB

I have created a Get Metadata activity in an Azure pipeline to fetch details of the files located in a VM, and I am iterating over the output of the Get Metadata activity with a ForEach loop.
In the ForEach loop, I am calling a stored procedure to update the file details in the database.
If I have 2K files in the VM, the stored procedure is called 2K times, which I feel is not good practice.
Is there any method to update all the file details in one shot?
To my knowledge, you could use the Get Metadata activity to get the output and then pass it into an Azure Function activity.
Inside the Azure Function, you could loop over the output and use an SDK (such as a Java SQL library) to update the tables as you want, in one batch.
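As a rough sketch of that idea in Python rather than Java (pyodbc stands in for the SQL library; the connection-string setting, table, and column names are hypothetical), an HTTP-triggered function could take the childItems array from Get Metadata and write the whole batch in a single round trip:

```python
import os

import azure.functions as func
import pyodbc  # assumes the pyodbc driver is available in the function app


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Expect the Get Metadata output forwarded by the Azure Function activity,
    # e.g. {"childItems": [{"name": "file1.csv", "type": "File"}, ...]}
    payload = req.get_json()
    items = payload.get("childItems", [])

    conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])  # hypothetical app setting
    cursor = conn.cursor()
    cursor.fast_executemany = True

    # One batched call instead of one stored-procedure call per file.
    rows = [(item["name"], item.get("type", "File")) for item in items]
    cursor.executemany(
        "INSERT INTO dbo.FileDetails (FileName, FileType) VALUES (?, ?)",  # hypothetical table
        rows,
    )
    conn.commit()

    return func.HttpResponse(f"Updated {len(rows)} file records.", status_code=200)
```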

Bringing incremental data in from REST APIs into SQL azure

My needs are the following:
- Fetch data from a 3rd-party API into SQL Azure.
The APIs will be queried every day for incremental data and may require pagination, since by default any API response returns only the top N records.
The API also needs an auth token to work, which is the first call made before we start downloading data from the endpoints.
Because of the last two points, I've opted for a Function App triggered daily rather than Data Factory, which can also query web APIs.
Is there a better way to do this?
I am also thinking of pushing all the JSON into blob storage and then parsing the data from the JSON into SQL Azure. Any recommendations?
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the JSON data directly into a SQL database.
Azure Function
Azure Functions are very cost effective. The first million executions are free. If it takes longer than ten minutes, then have a look at Durable Functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar: get the total number of pages from the API, set a variable to that value, and loop over the pages, getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Pro tip: make sure to specify an upper bound for those loops. Also, if your API is flaky or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
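As a rough Python/Requests sketch of that pattern (the endpoint URL, paging parameters, and response fields are assumptions, not any specific API):

```python
import time

import requests

API_URL = "https://api.example.com/records"   # hypothetical endpoint
MAX_PAGES = 1000                              # upper bound so the loop cannot run forever
MAX_RETRIES = 5


def fetch_page(page: int) -> dict:
    """Fetch one page, retrying with exponential backoff on transient failures."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.get(API_URL, params={"page": page}, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...


def fetch_all() -> list:
    records = []
    for page in range(1, MAX_PAGES + 1):
        data = fetch_page(page)
        records.extend(data.get("items", []))
        if not data.get("hasMore"):   # assumed "more pages" flag in the response
            break
    return records
```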
Azure SQL Json Indexed Calculated Columns
You mentioned storing your data as JSON files in a storage container. Are you sure you need that? If so, you could create an external table link between the storage container and the database. That has the advantage that the data does not take up any space in the database. However, if the JSON will fit in the database, I would highly recommend dropping that JSON right into the SQL database and leveraging indexed calculated columns to make querying the JSON extremely quick.
Using this pairing should provide incredible performance per penny value! Let us know what you end up using.
Maybe you can create a scheduled task with SQL Server Agent.
SQL Server Agent -- New Job -- Steps -- New Step:
In the Command box, put your "Import JSON documents from Azure Blob Storage" SQL statements, for example.
Schedules -- New Schedule:
Set the execution time.
But I think an Azure Function is better for you to do this. Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you can set the default top N value in your API, then you could use a Web activity in Azure Data Factory to call your REST API and get the response data. Then configure the response data as the input of a Copy activity (@activity('ActivityName').output) and the SQL database as the output. Please see this thread: Use output from Web Activity call as variable.
The Web activity supports authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?
Well, if you could dump the data into blob storage, then Azure Stream Analytics is the perfect choice for you.
You could run a daily job to select or parse the JSON data with ASA SQL, then dump the results into the SQL database. Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing. This helps if there is no ordering requirement, or if processing all records would take longer than the 10-minute function timeout, or if you want to do some tweaking/transformation of the data in flight, or if you have different destinations for different types of data. It also helps if you want to be insulated from a failure (e.g., your function fails halfway through processing and you don't want to re-query the API), or if you get data a different way and want to start processing at a specific step rather than running from the entry point. All sorts of reasons.
I'll caveat here to say that the best trade-off between degree of parallelism and complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' case of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split out specific steps, and they can be combined into a single one. Durable Functions can also make orchestrating this easier.
- A timer-driven function queries the API to understand the number of pages required, or queues up the additional pages for a second function that actually makes the paged API calls.
- That function then queries the API and writes to a scratch area (like Blob), or drops each row into a queue to be written/processed (e.g., a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested, as in pub/sub).
- If writing to a scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, since a storage queue would be cheap and fast for something like this).
- Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.
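As an illustration of the first step in that decomposition (the endpoint, response shape, and queue name are assumptions, and the queue is assumed to already exist), a timer-triggered Python function could discover the page count and fan the pages out onto a storage queue for the downstream functions to pick up:

```python
import json
import os

import azure.functions as func
import requests
from azure.storage.queue import QueueClient

API_URL = "https://api.example.com/records"   # hypothetical endpoint


def main(timer: func.TimerRequest) -> None:
    # Ask the API how many pages there are (assumed response shape),
    # then enqueue one message per page for the paged-fetch function.
    total_pages = requests.get(f"{API_URL}/pages", timeout=30).json()["totalPages"]

    queue = QueueClient.from_connection_string(
        os.environ["AzureWebJobsStorage"],  # the Function App's storage account
        queue_name="api-pages",             # hypothetical queue name
    )
    for page in range(1, total_pages + 1):
        queue.send_message(json.dumps({"page": page}))
```

Each downstream function then only needs a correctly formatted message to do its part, which is what makes restarting from any step (or remediating poison-queue messages) straightforward.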
