Issues accessing a FileDataset created from HTTP URIs in a PythonScriptStep - azure-machine-learning-service

I’m having some issues trying to access a FileDataset created from two HTTP URIs in an Azure ML Pipeline PythonScriptStep.
In the step, doing an os.listdir() on my mount point gives me only a single entry named ['https%3A'], whereas I would have expected two files with their actual names. This happens both when sending the dataset as_download and as_mount, and it even happens when I send the dataset reference to the pipeline step and mount it directly from within the step.
The dataset is registered in the same notebook that creates and invokes the pipeline, as seen below:
tempFileData = Dataset.File.from_files(
    ['https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg',
     'https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg'])
tempFileData.register(ws, name='FileData', create_new_version=True)

#...

read_datasets_step = PythonScriptStep(
    name='The Dataset Reader',
    script_name='read-datasets.py',
    inputs=[fileData.as_named_input('Files'),
            fileData.as_named_input('Files_mount').as_mount(),
            fileData.as_named_input('Files_download').as_download()],
    compute_target=compute_target,
    source_directory='./dataset-reader',
    allow_reuse=False,
)
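For reference, read-datasets.py does little more than list its inputs; a trimmed-down sketch of it looks roughly like this (assuming the inputs are resolved through Run.get_context().input_datasets):

# read-datasets.py -- trimmed-down sketch
import os

from azureml.core import Run

run = Run.get_context()

# 'Files_mount' matches as_named_input('Files_mount').as_mount() in the pipeline
# definition (assumed resolution mechanism)
mount_point = run.input_datasets['Files_mount']

print('Mount point:', mount_point)
print('Contents:', os.listdir(mount_point))  # prints ['https%3A'] instead of the two file names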
The FileDataset seems to be registered properly; examining it within the notebook gives the following result:
{
    "source": [
        "https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg",
        "https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg"
    ],
    "definition": [
        "GetFiles"
    ],
    "registration": {
        "id": "...",
        "name": "FileData",
        "version": 4,
        "workspace": "Workspace.create(...)"
    }
}
For reference, the machine running the notebook is using AML SDK v1.24, whereas the node running the pipeline steps is running v1.25.
Has anybody encountered anything like this? Is there a way to make it work?
Note that I'm specifically looking at file datasets created from web URIs; I'm not necessarily interested in getting a FileDataset to work with blob storage or similar.

The files should have been mounted at the paths "https%3A/vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg" and "https%3A/vladiliescu.net/images/reverse-engineering-automated-ml.jpg".
We retain the directory structure that follows the URL structure in order to avoid potential naming conflicts.
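If you walk the mount instead of listing only its root, both files should show up at those nested paths. A quick sketch, assuming the mount path is resolved through Run.get_context().input_datasets as in the step script above:

import os

from azureml.core import Run

# Resolve the mounted input ('Files_mount' is the named input from the question)
mount_point = Run.get_context().input_datasets['Files_mount']

# The images sit a few levels down, under https%3A/vladiliescu.net/images/
for root, dirs, files in os.walk(mount_point):
    for name in files:
        print(os.path.join(root, name))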

Related

'No storage connection string found.' when trying to list durable function instances from azure functions core tools

I am currently trying to list all instances of an activity function and the orchestrator function using Azure Functions Core Tools. The application synchronizes data from different sources into a centralized location.
The setup is as follows:
TimerTrigger -> Durable Orchestrator -> Multiple Activity Functions
In my concrete example, it is like this:
Start Synchronization -> Orchestrate Synchronizations -> Synchronize Source
So we start the synchronization process, which starts the orchestrator. The orchestrator then starts multiple different synchronizations, one for each source. The problem, though, is that I cannot seem to get Azure Functions Core Tools to list all instances of the functions I am interested in.
Unfortunately, I would really prefer not to have to use the REST API to query for this information. The setup really complicates things with IP restrictions and managed identity authentication. I could probably adjust the setup so that it works from my network and user, if really needed, but I think that would take far longer than it should.
I have tried running the following command:
func durable get-instances
in a directory with a file called host.json with the following contents:
{
    "version": "2.0",
    "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;AccountName=Name;AccountKey=Key;EndpointSuffix=core.windows.net"
}
I have also tried where the contents of the file are as follows:
{
    "version": "2.0",
    "extensions": {
        "durableTask": {
            "storageProvider": {
                "connectionStringName": "DefaultEndpointsProtocol=https;AccountName=Name;AccountKey=Key;EndpointSuffix=core.windows.net"
            }
        }
    }
}
I have tried calling func durable get-instances with and without the --connection-string-setting parameter, using the values 'AzureWebJobsStorage' and 'extensions:durableTask:storageProvider:connectionStringName', but nothing seems to work. I keep getting the error: No storage connection string found.
I know that the connection string is correct; I have pulled it directly from the storage account's 'Access keys' blade.
Is there anything I am missing? What am I doing wrong?
Thanks to @juunas, I got it to work. I edited the host.json file to have the following content:
{
    "version": "2.0"
}
and created another file called local.settings.json with the following contents:
{
    "IsEncrypted": false,
    "Values": {
        "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;AccountName=Name;AccountKey=Key;EndpointSuffix=core.windows.net"
    }
}
Running func durable get-instances now works and returns a continuation token, but an empty list. I was not expecting that, but I can now start exploring from here to understand what is going on.

ARM Deployment Error - The request content was invalid and could not be deserialized: 'Cannot deserialize the current JSON array

I have gone through previous similar posts and was not able to find any solution for my situation, so I am asking again.
I am trying to deploy an Azure Policy using ARM templates. So I have created:
1- A policy definition file
2- A policy parameter file
3- A PowerShell script that runs with both the policy and parameter files as input
But when I try to deploy, I get the error shown in the attached screenshot. The "policyParameters" are being passed as an Object type, and the problem seems to reside there. It would be great if you could look at the attached screenshot and advise.
Also, the PowerShell script output shows the values I expect, but "ProvisioningState : Failed".
Thanks,
(Attachments: PolicyFile, Error Output, Parameter File, JSON-part1, JSON-Part2)
You have to create a variable for policyParameters:
"variables": {
"policyParameters": {
"policyDefinitionId": {
"defaultValue": "[parameters('policyDefinitionId')]",
"type": "String"
},
...
This variable has to be passed to your parameters:
"parameters": "[variables('policyParameters')]",
You can find a sample here:
Configure Azure Diagnostic Settings with Azure Policies

Google Cloud Compute Engine API: createVM directly with setMetadata

I use @google-cloud/compute to create VM instances automatically.
I also use startup scripts in those instances.
So, first I call Zone.createVM and then VM.setMetadata.
But in some regions the startup script is not running, and it does run after a VM reset, so it looks like my VM.setMetadata call is simply too late.
In the web interface we can create a VM directly with metadata, but I do not see this ability in the API.
Can it be done with the API?
To set up a startup script during instance deployment you can provide it as part of the metadata property in the API call:
POST https://www.googleapis.com/compute/v1/projects/myproject/zones/us-central1-a/instances
{
    ...
    "metadata": {
        "items": [
            {
                "key": "startup-script",
                "value": "#! /bin/bash\n\n# Installs apache and a custom homepage\napt-get update\napt-get install -y apache2\ncat <<EOF > /var/www/html/index.html\n<html><body><h1>Hello World</h1>\n<p>This page was created from a simple start up script!</p>\n</body></html>\nEOF"
            }
        ]
    }
    ...
}
See the full reference for the resource "compute.instances" of the Compute Engine API here.
Basically, if you are using the Node.js library to create the instance you are already calling this API, so you only need to add the metadata keys as documented.
Also, if you are doing this frequently, I guess it would be more practical to store the script in a bucket in GCP and simply add its URI to the metadata, like this:
"metadata": {
"items": [
{
"key": "startup-script-url",
"value": "gs://bucket/myfile"
}
]
},

Can you clone a pipeline by using already existing JSON?

I'm new to Data Factories, but from reading over the basics it looks like the solution to my problem is very simple -- too good to be true.
The existing pipeline successfully transforms data in a test environment into tables in SQL Azure. There are 4 blob objects whose data will end up in one table in SQL Azure.
The database is for a DNN site, so it will be copied now to Dev and Test, possibly also UAT, but ultimately to Production.
It looks as simple as adding new pipelines to the existing Data Factory and just altering the database name in the connection strings. In Production I'll set up a new user account so that's unique and no one can easily hack it. That's simple enough.
The object names in the databases remain the same. There are just 3 sites (Dev, Test, Production).
So it should just be that easy, right? Create a new pipeline, copy and paste the JSON, alter the database connection strings in the pipeline JSON, and call it a day, right?
Thanks!
Instead of cloning the pipeline JSON and altering the database connection strings, you should try to automate things; that will help you a lot. Manual deployment is always error prone.
You can follow the steps below:
1- Import your ADF into Visual Studio, using the VS plugin here.
2- Use configuration files in Visual Studio to configure properties for linked services/tables/pipelines differently for each environment (Dev, Test, UAT/Production).
I would recommend storing the database credentials in an Azure Key Vault. You can reference them as a parameter:
{
    "parameters": {
        "azureSqlReportingDbPassword": {
            "reference": {
                "keyVault": {
                    "id": "/subscriptions/<subId>/resourceGroups/<resourcegroupId>/providers/Microsoft.KeyVault/vaults/<vault-name>"
                },
                "secretName": "<secret-name>"
            }
        }
    }
}
See also the documentation for more details, and the blog post.

Issues deploying dscExtension to Azure VMSS

I've been having some issues deploying a dscExtension to an Azure virtual machine scale set (VMSS) using a deployment template.
Here's how I've added it to my template:
{
    "name": "dscExtension",
    "properties": {
        "publisher": "Microsoft.Powershell",
        "type": "DSC",
        "typeHandlerVersion": "2.9",
        "autoUpgradeMinorVersion": true,
        "settings": {
            "ModulesUrl": "[concat(parameters('_artifactsLocation'), '/', 'MyDscPackage.zip', parameters('_artifactsLocationSasToken'))]",
            "ConfigurationFunction": "CmvmProcessor.ps1\\CmvmProcessor",
            "Properties": [
                {
                    "Name": "ServiceCredentials",
                    "Value": {
                        "UserName": "parameters('administratorLogin')",
                        "Password": "parameters('administratorLoginPassword')"
                    },
                    "TypeName": "System.Management.Automation.PSCredential"
                }
            ]
        }
    }
}
The VMSS itself is successfully deploying, but when I browse the InstanceView of the individual VMs, the dscExtension shows the failed status with an error message.
The problems I'm having are as follows:
The ARM deployment does not try to update the dscExtension upon redeploy. I am used to MSDeploy web app extensions where the artifacts are updated and the code is redeployed on each new deployment. I do not know how to force it to update the dscExtension with new binaries. In fact it only seems to give an error on the first deploy of the VMSS, then it won't even try again.
The error I'm getting is for old code that doesn't exist anymore.
I previously had a bug in a custom DSC PowerShell script where I tried to use the -replace operator, which I thought would create a $Matches variable, but it was saying $Matches didn't exist.
In any case, I've since refactored the code, deleted the entire resource group, and redeployed. The dscExtension is still giving the same error. I've verified that the blob storage account where my DSC .zip is located no longer contains the code capable of producing this error message. Azure must be caching the dscExtension somewhere; I can't get it to use the new blob .zip that I upload before each deployment.
Any insight into the DSC Extension and how to force it to update on deploy?
It sounds like you may be running into multiple things here, so let's try the simple one first. In order to get a VM extension to run on a subsequent deployment you have to "seed" it (and you're right, this is different from the rest of AzureRM). Take a look at this template:
https://github.com/bmoore-msft/AzureRM-Samples/blob/master/VMDSCInstallFile/azuredeploy.json
There is a property on the DSC extension called:
"forceUpdateTag" : "changeThisToEnsureScriptRuns-maxlength=50",
The property value must be different if you ever want the extension to run again. So, for example, if you wanted it to run every time, you'd seed it with a random number or a GUID. You could also use version numbers if you wanted to version it somehow. The point is, if the value is the same as the one from the previous deployment, the extension won't run again.
That sample uses a VM, but the VMSS syntax should be the same. That property also applies to other extensions (e.g. custom script).
The part that seems odd is that you said you deleted the entire RG and couldn't get it to accept the new package... That sounds bad (i.e. like a bug). If the above doesn't fix it, we may need to dig deeper into the template and script. LMK...
