We have a release pipeline in Azure DevOps to deploy a microservice to AKS and send the log of the microservice once it is deployed.
We are using the Kubectl task template to deploy the deployment, with the command arguments set to "-f /home/admin/builds/$(build.buildnumber)/Myservice_Deployment.yml --record".
We noticed that the task does not wait for the existing pod to terminate and the new pod to be created; it just continues and finishes the job.
Our expected scenario:
1) Deploy the microservice using kubectl apply -f /home/admin/builds/$(build.buildnumber)/Myservice_Deployment.yml --record
2) Wait for the existing pod to terminate and ensure that the new pod is in Running status.
3) Once the new pod is in Running status, collect the log of the pod with the kubectl logs command and send it to the team.
4) If the pod is not in Running state, roll back to the previous stable state.
I tried different shell scripts to achieve this in Azure DevOps, but didn't succeed. For example:
ATTEMPTS=0
ROLLOUT_STATUS_CMD="kubectl --kubeconfig /home/admin/kubernetes/Dev-kubeconfig rollout status deployment/My-service"
# retry the rollout status check up to 60 times, 10 seconds apart
until $ROLLOUT_STATUS_CMD || [ $ATTEMPTS -eq 60 ]; do
  $ROLLOUT_STATUS_CMD
  ATTEMPTS=$((ATTEMPTS + 1))
  sleep 10
done
I also need to get the log of the microservice using the kubectl logs command; the log file name should include the date, and the file needs to be shared by mail.
You have several questions mixed up in a single post, but for your desired behaviour you'd need to configure your deployment with a liveness probe.
Reading: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command
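For the four steps in the question, something along these lines could run in a Bash task of the release pipeline. This is only a minimal sketch: the kubeconfig path, deployment name and manifest path are taken from the question, the timeout and the mail command are assumptions, and kubectl rollout undo is used for the rollback step.

#!/usr/bin/env bash
set -euo pipefail

KCFG=/home/admin/kubernetes/Dev-kubeconfig
DEPLOY=My-service
MANIFEST="/home/admin/builds/$(build.buildnumber)/Myservice_Deployment.yml"   # $(build.buildnumber) is expanded by Azure DevOps before the script runs

# 1) deploy
kubectl --kubeconfig "$KCFG" apply -f "$MANIFEST" --record

# 2) wait (up to 10 minutes) for the old pods to be replaced and the new ones to become ready
if kubectl --kubeconfig "$KCFG" rollout status deployment/"$DEPLOY" --timeout=600s; then
  # 3) collect the logs of the new pod(s) into a date-stamped file
  LOGFILE="${DEPLOY}-$(date +%Y-%m-%d_%H-%M-%S).log"
  kubectl --kubeconfig "$KCFG" logs deployment/"$DEPLOY" --all-containers=true > "$LOGFILE"
  # send the file with whatever mailer is available on the agent, e.g.
  # mailx -s "Deployment log $LOGFILE" team@example.com < "$LOGFILE"
else
  # 4) roll back to the previous revision and fail the task
  kubectl --kubeconfig "$KCFG" rollout undo deployment/"$DEPLOY"
  exit 1
fi

kubectl rollout status exits non-zero when the rollout does not finish within the timeout, which is what drives the rollback branch.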
Related
I'm running a rather hefty build in my Azure pipeline, which involves processing a large amount of data and hence requires too much memory for my build agent to handle. My approach is therefore to start up a Linux VM, run the build there, and push the resulting Docker image to my container registry.
To achieve this, I'm using the Azure CLI task to issue commands to the VM (e.g. az vm start, az vm run-command ... etc).
The problem I am facing is that az vm run-command "succeeds" even if the script that you run on the VM returns a nonzero status code. For example, this "bad" vm script:
az vm run-command invoke -g <group> -n <vmName> --command-id RunShellScript --scripts "cd /nonexistent/path"
returns the following response:
{
  "value": [
    {
      "code": "ProvisioningState/succeeded",
      "displayStatus": "Provisioning succeeded",
      "level": "Info",
      "message": "Enable succeeded: \n[stdout]\n\n[stderr]\n/var/lib/waagent/run-command/download/87/script.sh: 1: cd: can't cd to /nonexistent/path\n",
      "time": null
    }
  ]
}
So the command succeeds, presumably because it succeeded in executing the script on the VM. The fact that the script actually failed on the VM is buried in the response "message".
I would like my Azure pipeline task to fail if the script on the VM returns a nonzero status code. How would I achieve that?
One idea would be to parse the response (somehow) and search the text under stderr - but that sounds like a real hassle, and I'm not even sure how to "access" the response within the task.
Have you enabled the option "Fail on Standard Error" on the Azure CLI task? If not, you can try enabling it and running the pipeline again to see whether the error "cd: can't cd to /nonexistent/path" makes the task fail.
If the task still passes, the error "cd: can't cd to /nonexistent/path" is probably not being written to standard error. In that case you may need to add more commands to your script to inspect the output of the az command: if any output message shows an error, execute "exit 1" to exit the script and make the task fail.
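One way to do that check (a sketch, assuming jq is available on the agent; the resource group and VM name are placeholders as in the question) is to capture the JSON response and fail when the [stderr] section of the message is non-empty:

# run the script and capture the full JSON response
RESPONSE=$(az vm run-command invoke -g <group> -n <vmName> \
  --command-id RunShellScript --scripts "cd /nonexistent/path" -o json)

# pull out everything after the [stderr] marker in the message field
STDERR=$(echo "$RESPONSE" | jq -r '.value[0].message' | sed -n '/\[stderr\]/,$p' | tail -n +2)

if [ -n "$STDERR" ]; then
  echo "Script on the VM reported errors:"
  echo "$STDERR"
  exit 1
fi

Note that this only detects output written to stderr; a script that exits non-zero without printing anything would still pass, so it is a workaround rather than a real status check.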
I solved this by using the SSH pipeline task - this allowed me to connect to the VM via SSH, and run the given script on the machine "directly" via SSH.
This means from the context of the task, you get the status code from the script itself running on the VM. You also see any console output inside the task logs, which was obscured when using az vm run-command.
Here's an example:
- task: SSH@0
  displayName: My VM script
  timeoutInMinutes: 10
  inputs:
    sshEndpoint: <sshConnectionName>
    runOptions: inline
    inline: |
      echo "Write your script here"
Note that the SSH connection needs to be set up as a service connection using the Azure Pipelines UI. You then reference the name of that service connection in the YAML.
I have an AKS cluster with Kubernetes version 1.14.7.
I have built CI/CD pipelines to deploy newly created images to the cluster.
I am using kubectl apply to update a specific deployment with the new image. Sometimes, for various reasons, the deployment fails, for example with ImagePullBackOff.
Is there a command to run after the kubectl apply command to check whether the pod creation and deployment were successful?
For this purpose Kubernetes has kubectl rollout, and you should use the status option.
By default 'rollout status' will watch the status of the latest rollout until it's done. If you don't want to wait for the rollout to finish then you can use --watch=false. Note that if a new rollout starts in-between, then 'rollout status' will continue watching the latest revision. If you want to pin to a specific revision and abort if it is rolled over by another revision, use --revision=N where N is the revision you need to watch for.
You can read the full description here
If you use kubectl apply -f myapp.yaml and check the rollout status, you will see:
$ kubectl rollout status deployment myapp
Waiting for deployment "myapp" rollout to finish: 0 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 1 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 2 of 3 updated replicas are available…
deployment "myapp" successfully rolled out
There is another way to wait for the deployment to become available, with a configured timeout:
kubectl wait --for=condition=available --timeout=60s deploy/myapp
Otherwise kubectl rollout status can be used, but in some rare cases it may get stuck forever and will then require manual cancellation of the pipeline.
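In a pipeline you can use that timeout to fail the job (and, if you want, roll back) when the rollout does not complete in time, for example (a small sketch, assuming the deployment is called myapp):

# fail the job and roll back if the rollout does not complete within 2 minutes
if ! kubectl rollout status deployment/myapp --timeout=120s; then
  kubectl rollout undo deployment/myapp
  exit 1
fi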
You can parse the output through jq:
kubectl get pod -o=json | jq '.items[]|select(any( .status.containerStatuses[]; .state.waiting.reason=="ImagePullBackOff"))|.metadata.name'
It looks like the kubediff tool is a perfect match for your task:
Kubediff is a tool for Kubernetes to show you the differences between your running configuration and your version controlled configuration.
The tool can be used from the command line and as a Pod in the cluster that continuously compares YAML files in the configured repository with the current state of the cluster.
$ ./kubediff
Usage: kubediff [options] <dir/file>...
Compare yaml files in <dir> to running state in kubernetes and print the
differences. This is useful to ensure you have applied all your changes to the
appropriate environment. This tools runs kubectl, so unless your
~/.kube/config is configured for the correct environment, you will need to
supply the kubeconfig for the appropriate environment.
kubediff prints the status to stdout and returns a non-zero exit code when a difference is found. You can change this behavior using command line arguments.
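Because of that exit code it is easy to wire into a pipeline step, for example (a small sketch, assuming the manifests live in a manifests/ directory):

# a non-zero exit code means the cluster has drifted from the files in manifests/
if ! ./kubediff manifests/; then
  echo "Cluster state differs from the version-controlled manifests"
  exit 1
fi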
You may also want to check the good article about validating YAML files:
Validating Kubernetes Deployment YAMLs
I have one GitLab runner configured for a single project. The issue I am seeing is that the runner does not wait until the prior job has finished; instead it does a checkout in the same directory as the prior job and stomps over everything. I have one job already running, and then another developer commits, so another job is started. How can I configure the pipeline so that it doesn't corrupt the already running workspace?
Here is the log from both of the jobs (the only difference is the timestamp):
Running with gitlab-runner 12.6.0 (ac8e767a)
  on gitlab.xxxx.com rz8RmGp4
section_start:1578357551:prepare_executor
Using Docker executor with image my-image-build ...
Using locally found image version due to if-not-present pull policy
Using docker image sha256:xxxxxxxxxx for my-image-build ...
section_end:1578357553:prepare_executor
section_start:1578357553:prepare_script
Running on runner-rz8RmGp4-project-23-concurrent-0 via gitlab.xxxx.com...
section_end:1578357554:prepare_script
section_start:1578357554:get_sources
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/my-project/.git/
<proceeds to checkout and stomp over the already running runner>
The main issue I see is that they both check out to the same directory (Initialized empty Git repository in /builds/my-project/.git/), which causes the problem.
You can use resource_group to keep jobs from running in parallel.
e.g.
Job 1:
  stage: My Stage
  resource_group: stage-wedge
  ...

Job 2:
  stage: My Stage
  resource_group: stage-wedge
  ...
In the above example Job 2 will run after Job 1 is finished.
Jobs of the same stage are executed in parallel. If you need them to run sequentially, you may add a stage for each of those jobs.
See the docs: https://docs.gitlab.com/ee/ci/yaml/#stages
In case multiple pipelines are running, you may want to configure your gitlab-runner options limit / concurrent:
https://docs.gitlab.com/runner/configuration/advanced-configuration.html
I am creating CI & CD pipelines for a Node.js application using Azure DevOps.
I deployed the build code to an Azure Linux VM using an Azure release pipeline, where I configured a deployment group job.
In the deployment group I used the Extract Files task to unzip the build files.
Unzipping works fine and my code is deployed to this path: $(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/ *.zip
After that I would like to run the pm2 command using the Azure release pipeline. For this I added a Bash task to the deployment group job with the following commands:
cd $(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip
cd coreservices
pm2 start server.js
But the Bash task is not executed; it gives exit code 2.
This error is caused by the parentheses ( in the path on your first command line. In the shell, parentheses are used for grouping, so they cannot be used as normal characters on the command line without escaping.
To solve it, you need to escape the parentheses with \ so they are treated as normal characters:
cd $(System.DefaultWorkingDirectory)/LearnCab-Manage\(V1.5\)-CI \(1\)/coreservices/*.zip
Now \(V1.5\) and \(1\) are interpreted as the literal (V1.5) and (1).
Alternatively, you can use single or double quotes around the path:
cd "$(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip"
Or
cd '$(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip'
Use Case
Trying to provision a (Docker Swarm or Consul) cluster where initializing the cluster first occurs on one node, which generates some token, which then needs to be used by other nodes joining the cluster. Key thing being that nodes 1 and 2 shouldn't attempt to join the cluster until the join key has been generated by node 0.
E.g. on node 0, running docker swarm init ... will return a join token. Then on nodes 1 and 2, you'd need to pass that token to the join command, like docker swarm join --token ${JOIN_TOKEN} ${NODE_0_IP_ADDRESS}:${SOME_PORT}. And magic, you've got a neat little cluster...
Attempts So Far
Tried initializing all nodes with the AWS SDK installed, storing the join key from node 0 on S3, and then fetching that join key on the other nodes. This is done via a null_resource with 'remote-exec' provisioners. Due to the way Terraform executes things in parallel, there are race conditions, and predictably nodes 1 and 2 frequently attempt to fetch a key from S3 that isn't there yet (e.g. node 0 hasn't finished its stuff yet).
Tried using the 'local-exec' provisioner to SSH into node 0 and capture its join key output. This hasn't worked well or I sucked at doing it.
I've read the docs. And stack overflow. And Github issues, like this really long outstanding one. Thoroughly. If this has been solved elsewhere though, links appreciated!
PS - this is directly related to and is a smaller subset of this question, but wanted to re-ask it in order to focus the scope of the problem.
You can redirect the outputs to a file:
resource "null_resource" "shell" {
provisioner "local-exec" {
command = "uptime 2>stderr >stdout; echo $? >exitstatus"
}
}
and then read the stdout, stderr and exitstatus files with local_file
The problem is that if the files disappear, then terraform apply will fail.
In Terraform 0.11 I made a workaround by reading the file with an external data source and storing the results in null_resource triggers (!)
resource "null_resource" "contents" {
triggers = {
stdout = "${data.external.read.result["stdout"]}"
stderr = "${data.external.read.result["stderr"]}"
exitstatus = "${data.external.read.result["exitstatus"]}"
}
lifecycle {
ignore_changes = [
"triggers",
]
}
}
But in 0.12 this can be replaced with file()
and then finally I can use / output those with:
output "stdout" {
value = "${chomp(null_resource.contents.triggers["stdout"])}"
}
See the module https://github.com/matti/terraform-shell-resource for full implementation
You can use external data:
data "external" "docker_token" {
program = ["/bin/bash", "-c" "echo \"{\\\"token\\\":\\\"$(docker swarm init...)\\\"}\""]
}
Then the token will be available as data.external.docker_token.result.token.
If you need to pass arguments in, you can use a script (e.g. relative to path.module). See https://www.terraform.io/docs/providers/external/data_source.html for details.
When I asked myself the same question, "Can I use output from a provisioner to feed into another resource's variables?", I went to the source for answers.
At this moment in time, provisioner results are simply streamed to terraform's standard out and never captured.
Given that you are running remote provisioners on both nodes, and you are trying to access values from S3 - I agree with this approach by the way, I would do the same - what you probably need to do is handle the race condition in your script with a sleep command, or by scheduling a script to run later with the at or cron or similar scheduling systems.
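A rough sketch of what that fetch-with-retry could look like on nodes 1 and 2 (the bucket name, object key and NODE_0_IP are placeholders, and 2377 is the default swarm port):

# poll S3 until node 0 has uploaded the join token
TOKEN=""
for i in $(seq 1 30); do
  if TOKEN=$(aws s3 cp s3://my-bucket/swarm-join-token -); then
    break
  fi
  echo "token not available yet, retry $i"
  sleep 10
done

[ -n "$TOKEN" ] || { echo "never received the join token"; exit 1; }
docker swarm join --token "$TOKEN" "${NODE_0_IP}:2377"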
In general, Terraform wants to access all variables either up front, or as the result of a provider. Provisioners are not necessarily treated as first-class in Terraform. I'm not on the core team so I can't say why, but my speculation is that it reduces complexity to ignore provisioner results beyond success or failure, since provisioners are just scripts so their results are generally unstructured.
If you need more enhanced capabilities for setting up your instances, I suggest a dedicated tool for that purpose like Ansible, Chef, Puppet, etc. Terraform's focus is really on Infrastructure, rather than software components.
A simpler solution would be to provide the token yourself.
When creating the ACL token, simply pass in the ID value and Consul will use that instead of generating one at random.
You could effectively run the docker swarm init step for node 0 as a Terraform External Data Source, and have it return JSON. Make the provisioning of the remaining nodes depend on this step and refer to the join token generated by the external data source.
https://www.terraform.io/docs/providers/external/data_source.html
With resource dependencies you can ensure that a resource is created before another.
Here's an incomplete example of how I create my consul cluster, just to give you an idea.
resource "aws_instance" "consul_1" {
user_data = <<EOF
#cloud-config
runcmd:
- 'docker pull consul:0.7.5'
- 'docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:0.7.5 agent -server -advertise=${self.private_ip} -bootstrap-expect=2 -datacenter=wordpress -log-level=info -data-dir=/consul/data'
EOF
}
resource "aws_instance" "consul_2" {
depends_on = ["aws_instance.consul_1"]
user_data = <<EOF
#cloud-config
runcmd:
- 'docker pull consul:0.7.5'
- 'docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:0.7.5 agent -server -advertise=${self.private_ip} -retry-join=${aws_instance.consul_1.private_ip} -datacenter=wordpress -log-level=info -data-dir=/consul/data'
EOF
}
For the Docker Swarm setup, I think it's out of Terraform's scope, and I think it should be, because the token isn't an attribute of the infrastructure you are creating. So I agree with nbering: you could try to achieve that setup with a tool like Ansible or Chef.
But anyway, if the example helps you set up your Consul cluster, I think you just need to configure Consul as your Docker Swarm backend.
Sparrowform is a lightweight provisioner for Terraform-based infrastructure that can handle your case. Here is an example for AWS EC2 instances.
Assume we have 3 EC2 instances for the Consul cluster: node0, node1 and node2. The first one (node0) is where we fetch the token and store it in an S3 bucket. The other two load the token from S3 later.
$ nano aws_instance.node0.sparrowfile
#!/usr/bin/env perl6
# have not checked this command, but that's the idea ...
bash "docker swarm init | aws s3 cp - s3://alexey-bucket/stream.txt"
$ nano aws_instance.node1.sparrowfile
#!/usr/bin/env perl6

my $i = 0;
my $token;

try {
  while True {
    my $s3-token = run 'aws', 's3', 'cp', 's3://alexey-bucket/stream.txt', '-', :out;
    $token = $s3-token.out.lines[0];
    $s3-token.out.close;
    last if $i++ > 8 or $token;
    say "retry num $i ...";
    sleep 2*$i;
  }
  CATCH { { .resume } }
}

die "we have not succeeded in fetching the token" unless $token;

bash "docker swarm init $token";
$ nano aws_instance.node2.sparrowfile - the same setup as for node1
$ terraform apply # bootstrap infrastructure
$ sparrowform --ssh_private_key=~/.ssh/aws.pub --ssh_user=ec2-user # run provisioning on node0, node1, node2
PS disclosure, I am the tool author.