When submitting an aml compute job, I want to access the nodes where the compute happens for debugging purposes. The portal gives me the IP address, the port and the nodeID, but no password seems to exist within the portal.
How am I supposed to connect to the machine?
I am running on NC6 machines for a single node training. I have already tried to run the command given through the portal, but the node is (hopefully so) protected by a password.
In order to be able to log in to the nodes of an AML Compute cluster, you have to provide a username and password and ssh key (ssh key is optional), when you create the cluster. You can only do that at creation time.
When you are setting up your cluster, you can define a username/password/ssh key that can be used to login to the cluster. If you did not define these at the time of creation, you would need to recreate the cluster unfortunately.
We are working on better documenting this, and also update our messaging a bit.
Related
Given that I have a 24x7 AKS Cluster on AZURE, for which afaik Kubernetes cannot stop/pause a pod and then resume it standardly,
with, in my case, a small Container in a Pod, and for that Pod it can be sidelined via --replicas=0,
then, how can I, on-demand, best kick off a LINIX script packaged in that Pod/Container which may be not running,
from an AZURE Web App?
I thought using ssh should work, after first upscaling the pod to 1 replica. Is this correct?
I am curious if there are simple http calls in AZURE to do this. I see CLI and Powershell to start/stop AKS cluster, but that is different of course.
You can interact remotely with AKS by different methods. The key here is to use the control plane API to deploy your kubernetes resource programmatically (https://kubernetes.io/docs/concepts/overview/kubernetes-api/) .
In order to do that, you should use client libraries that enable that kind of access. Many examples can be found here for different programming languages:
https://github.com/kubernetes-client
ssh is not really recommended since that is sort of a god access to the cluster and its usage is not meant for your purpose.
I would like to tweak some settings in AKS node group with something like userdata in AWS. Is it possible to do in AKS?
how abt using
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/virtual_machine_scale_set_extension
The underlying Virtual Machine Scale Set (VMSS) is an implementation detail, and one that you do not get to adjust outside of SKU and disk choice. Just like you cannot pick the image that goes on the VMSS; you also cannot use VM Extensions on that scale set, without being out of support. Any direct manipulation of those VMSSs (from an Azure resource provider perspective) behind your nodepools puts you out of support. The only supported affordance to perform host (node)-level actions is via deploying your custom script work in a DaemonSet to the cluster. This is fully supported, and will give you the ability to run (almost) anything you need at the host level. Examples being installing/executing custom security agents, FIM solutions, anti-virus.
From the support FAQ:
Any modification done directly to the agent nodes using any of the IaaS APIs renders the cluster unsupportable. Any modification done to the agent nodes must be done using kubernetes-native mechanisms such as Daemon Sets.
I create one compute instance 'yhd-notebook' in Azure Machine Learning compute with user1. When I login with user2, and try to open the JupyterLab of this compute instance, it shows an error message like below.
User user2 does not have access to compute instance yhd-notebook.
Only the creator can access a compute instance.
Click here to sign out and sign in again with a different account.
Is it possible to share compute instance with another user? BTW, both user1 and user2 have Owner role with the Azure subscription.
According to MS, all users in the workspace contributor and owner role can create, delete, start, stop, and restart compute instances across the workspace. However, only the creator of a specific compute instance is allowed to access Jupyter, JupyterLab, and RStudio on that compute instance. The creator of the compute instance has the compute instance dedicated to them, have root access, and can terminal in through Jupyter. Compute instance will have single-user login of creator user and all actions will use that user’s identity for RBAC and attribution of experiment runs. SSH access is controlled through public/private key mechanism.
Expanding a little on #hui chen's useful answer.
Although you can't share computer instances from azure's web interface you can access the compute instance's jupyter directly by sshing into them (note that ssh runs on port 50000). jupyter runs on port 8888.
I haven't experimented with the filesystem synchronization yet... so be careful. Also none of this is guaranteed.
We are working on an application that processes excel files and spits off output. Availability is not a big requirement.
Can we turn the VM sets off during night and turn them on again in the morning? Will this kind of setup work with service fabric? If so, is there a way to schedule it?
Thank you all for replying. I've got a chance to talk to a Microsoft Azure rep and documented the conversation in here for community sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss, however this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than a half hour, and this can be automated by using Powershell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to setup the ARM template and deploy using Powershell. You can additionally use a fixed domain name or static IP address so that clients don’t have to be reconfigured to connect to the cluster. If you have need to maintain other resources such as the storage account then you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
Q)Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a ‘Worker’ that is a larger size – and set placement constraints on your application to only deploy to those larger size VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem that once you lose quorum (below 3 replicas/nodes) of your worker VM then there is no guarantee that your SF service itself will come back with all of the state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service’s state may be in an unknown replication state.
I think you have a few options:
Instead of storing state within Service Fabric’s reliable collections, instead store your state externally into something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric’s reliable collections in order to maintain a faster read-cache, just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time you want.
Use the Service Fabric backup/restore in order to maintain your state, and delete the entire resource group or cluster overnight and then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster and recreate the cluster and deploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However you can scale down the number of VM's in the cluster.
During the day you would run the number of VM's required. At night you can scale down to the minimum of 5. Check this page on how to scale VM sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to start and stop SF clusters on Azure by starting and stopping the VM scale sets associated with these clusters. But upon restart all your applications (and with them their state) are gone and must be redeployed.
I have been able to successfully create a Google Container Cluster in the developers console and have deployed my app to it. This all starts up fine, however I find that I can't connect to Cloud SQL, I get;
"Error: Handshake inactivity timeout"
After a bit of digging, I hadn't had any trouble connecting to the Database from App Engine or my local machine so I thought this was a little strange. It was then I noticed the cluster permissions...
When I select my cluster I see the following;
Permissions
User info Disabled
Compute Read Write
Storage Read Only
Task queue Disabled
BigQuery Disabled
Cloud SQL Disabled
Cloud Datastore Disabled
Cloud Logging Write Only
Cloud Platform Disabled
I was really hoping to use both Cloud Storage and Cloud SQL in my Container Engine Nodes. I have allowed access to each of these API's in my project settings and my Cloud SQL instance is accepting connections from any IP (I've been running Node in a Managed VM on App Engine previously), so my thinking is that Google is Explicitly disabling these API's.
So my two part question is;
Is there any way that I can modify these permissions?
Is there any good reason why these API's are disabled? (I assume there must be)
Any help much appreciated!
With Node Pools, you can sort of add scopes to a running cluster by creating a new node pool with the scopes you want (and then deleting the old one):
gcloud container node-pools create np1 --cluster $CLUSTER --scopes $SCOPES
gcloud container node-pools delete default-pool --cluster $CLUSTER
The permissions are defined by the service accounts attached to your node VMs during cluster creation (service accounts can't be changed after a VM is instantiated, so this the only time you can pick the permissions).
If you use the cloud console, click the "More" link on the create cluster page and you will see a list of permissions that you can add to the nodes in your cluster (all defaulting to off). Toggle any on that you'd like and you should see the appropriate permissions after your cluster is created.
If you use the command line to create your cluster, pass the --scopes command to gcloud container clusters create to set the appropriate service account scopes on your node VMs.
Hmm, I've found a couple of things, that maybe would be interested:
Permissions belong to a service account (so-called Compute Engine default service account, looks like 12345566788-compute#developer.gserviceaccount.com)
Any VM by default works using this service account. And its permissions do not let us Cloud SQL, buckets and so on. But...
But you can change this behavior using another service account with the right perms. Just create it manually and set only needed perms. Switch it out using gcloud auth activate-service-account --key-file=/new-service-account-cred.json
That's it.
For the cloudsql there's the possibility to connect from containers specifying a proxy as explained here https://cloud.google.com/sql/docs/postgres/connect-container-engine