Related
Ok, it is very strange. I have some init scripts that I would like to run when a cluster starts
cluster has the init script , which is in a file (in dbfs)
basically this
dbfs:/databricks/init-scripts/custom-cert.sh
Now , when I make the init script like this, it works (no ssl errors for my endpoints. Also, the event logs for the cluster shows the duration as 1 second for the init script
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in an bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes , as per the event log but the execution duration is 0 sec.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/ , it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and it least to the naked eye, I can't figure out any difference
i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both the scenarios. What is the problem with the 2nd case?
EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is it that update-ca-certificates found in the first case but not when I put the same script in a sh script and upload it to dbfs (instead of executing the dbutils.fs.put within a notebook) ?
EDIT 2: In response to the first answer. After running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh and then I restart the cluster with the init script location as dbfs:/databricks/init-scripts/custom-cert.sh and then it works. So, it is essentially the same content that the init script is reading (which is the generated sh script). Why can't it read it if I do not use dbfs put but just put the contents in bash file and upload it during the CI/CD process?
As we aware, An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM start. case-2 When you run bash
command by using of %sh magic command means you are trying to execute this command in Local driver node. So that workers nodes is not able to access . But based on
case-1 , By using of %fs magic command you are trying run copy command (dbutils.fs.put )from root . So that along with driver node , other workers node also can access path .
Ref : https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
It seems that my observations I made in the comments section of my question is the way to go.
I now create the init script using a databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook), the cluster is a job cluster which is just temporary . Of course , in my case , even this job creation is automated using a powershell script.
I then call this Databricks job in the release pipeline using again a Powershell script.
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I do not know (or still understand), why can't the same script file be just part of a repo and uploaded during the release process (instead of it being this Databricks job calling a notebook). I would love to know the reason . The other answer on this question does not hold true as you can see, that the cluster script is created by a job cluster and then accessed from another cluster as part of its init script.
It simply boils down to how the init script gets created.
But I get my job done. Just if it helps someone get their job done too.
I have raised a support case though to understand the reason.
What I'm doing
I am using AWS batch to run a docker container for a large compute job. I have configured the ECR/ECS successfully to the best of my knowledge but am having issues running the required commands for reasons that are beyond my level of understanding with docker ( newbie )
What I need to do is pass the below commands into my application and start my application to perform some heavy computing tasks; all commands listed below must be present.
The Issue(s)
The issue arises when I send the submit job to AWS batch; this service pulls the image from the ACR ( amazon container repository ) and spins up a compute environment. The issue comes from when I try to run the command I pass in, below I will go throgh it.
"command": [
"mkdir -p logging",
"chmod 777 logging/",
"docker run -t -i -e my-application", # container name
"-e APIKEY",
"-e BASEURI",
"-e APIUSER",
"-v WORKSPACE /logging:/src/log",
"DOCKERIMAGE",
"python my_app.py",
"-t APP_USER",
"-e APP_ENVIRONMENT",
"-u APP_USERNAME",
"-p APP_PASSWORD",
"-i IN_PATH",
"-o OUT_PATH",
"-b tmp/"
]
The command above generates the following error(s)
container_linux.go:370: starting container process caused: exec: "mkdir -p log": executable file not found in $PATH
I tried to pass in the command to echo the env var $PATH but was unsuccesfull getting a response and resulted in a similar error.
I have ran successfully "ls" and was able to see the directory contents of my application inside.
I am not however able to run any of these commands that I have included in the command [] section. I have tried just running python and such in hopes of getting a more detailed error but was unsuccessful.
Logic in plain English
Create a path called logging if it doesnt exist
set the permissions for logging
run the docker container and pass in the environment variables while doing so
Tell docker to run the python file my_app.py and pass in the expected runtime args
Execute and perform the required logic deligated in the python3 application
Questions
Why can I not create a directory here called "logging" where am I ?
Am I running these properly as defined by AWS batch? or docker
What am I missing or where am I going wrong?
AWS Batch high level doc
AWS Batch link specific to what i'm doing
Assuming that you're following the syntax described in the Container
Properties
section of the AWS docs, you have several problems with the syntax of
your command directive.
First
The command directive can only run a single command. You can't mash together a bunch of commands as you're trying to do in your example. If you need to run multiple commands you would need to embed them as an argument to a shell. For example, something like:
command: ["/bin/sh", "-c", "mkdir -p logging; chmod 777 logging; ..."]
Second
You must properly tokenize your
command lines -- that is, when you type mkdir -p logging at the
command prompt, the shell splits this into three parts (or "tokens"): ['mkdir', '-p', 'logging']. You need to do the same thing when building up the
list of arguments to command.
This is invalid:
command: ["mkdir -p logging"]
That would looking for a command named mkdir -p logging, and of course no such command exists. That would properly be written as:
command: ["mkdir", "-p", "logging"]
Third
I'm not very familiar with the AWS batch environment, but it's unlikely you can run a docker command inside a docker` container as you're trying to do. It's unclear why you're doing this, though: why not just configure your AWS batch job with the appropriate image, environment variables, etc?
Take a look at some of these example job definitions.
When I run the function locally on NodeJS 11.7.0 it works, when I run it in AWS Lambda NodeJS 8.10 it works, but I've recently tried to run it in AWS Lambda NodeJS 10.x and get this response and this error in Cloud Watch.
Any thoughts on how to correct this?
Response
{
"success": false,
"error": "Error: Could not find openssl on your system on this path: openssl"
}
Cloudwatch Error
ERROR (node:8) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
Function
...
const util = require('util');
const pem = require('pem');
...
return new Promise((fulfill) => {
require('./certs').get(req, res, () => {
return fulfill();
});
}).then(() => {
const createCSR = util.promisify(pem.createCSR);
//This seems to be where the issue is coming from
return createCSR({
keyBitsize: 1024,
hash: HASH,
commonName: id.toString(),
country: 'US',
state: 'Maryland',
organization: 'ABC', //Obfuscated
organizationUnit: 'XYZ', //Obfuscated
});
}).then(({ csr, clientKey }) => {
...
}).then(async ({ certificate, clientKey }) => {
...
}, (err) => {
return res.status(404).json({
success: false,
error: err,
});
});
...
I've tried with
"pem": "^1.14.3", and "pem": "^1.14.2",
I tried the answer documented by #Kris White, but I was not able to get it to work. Each execution resulted in the error Could not find openssl on your system on this path: /opt/openssl. I tried several different paths and approaches, but none worked well. It's entirely possible that I simply didn't copy the OpenSSL executable correctly.
Since I needed a working solution, I used the answer provided by #Wilfred Dittmer. I modified it slightly since I wasn't using Docker. I launched an Amazon Linux 2 server, built OpenSSL on it, transferred the package to my local machine, and deployed it via Serverless.
Create a file named create-openssl-zip.sh with the following contents. The script will create the Lambda Layer OpenSSL package.
#!/bin/bash -x
# This file should be copied to and run inside the /tmp folder
yum update -y
yum install autoconf bison gcc gcc-c++ libcurl-devel libxml2-devel -y
curl -sL http://www.openssl.org/source/openssl-1.1.1d.tar.gz | tar -xvz
cd openssl-1.1.1d
./config --prefix=/tmp/nodejs/openssl --openssldir=/tmp/nodejs/openssl && make && make install
cd /tmp
rm -rf nodejs/openssl/share nodejs/openssl/include
zip -r lambda-layer-openssl.zip nodejs
rm -rf nodejs openssl-1.1.1d
Then, follow these steps:
Open a terminal session in this project's root folder.
Run the following command to upload the Linux bash script.
curl -F "file=#create-openssl-zip.sh" https://file.io
Note: The command above uses the popular tool File.io to copy the script to the cloud temporarily so it can be securely retrieved from the build server.
Note: If curl is not installed on your dev machine, you can also upload the script manually using the File.io website.
Copy the URL for the uploaded file from either the terminal session or the File.io website.
Note: The url will look similar to this example: https://file.io/a1B2c3
Open the AWS Console to the EC2 Instances list.
Launch a new instance with these attributes:
AMI: Amazon Linux 2 AMI (HVM), SSD Volume Type (id: ami-0a887e401f7654935)
Instance Type: t2.micro
Instance Details: (use all defaults)
Storage: (use all defaults)
Tags: Name - 'build-lambda-layer-openssl'
Security Group: 'Create new security group' (use all defaults to ensure Instance will be publicly accessible via SSH over the internet)
When launching the instance and selecting a key pair, be sure to choose a Key Pair from the list to which you have access.
Launch the instance and wait for it to be accessible.
Once the instance is running, use an SSH Client to connect to the instance.
More details on how to open an SSH connection can be found here.
In the SSH terminal session, navigate to the tmp directory by running cd /tmp.
Download the bash script uploaded earlier by running curl {FILE_IO_URL} --output create-openssl-zip.sh.
Note: In the script above, replace FILE_IO_URL with the URL returned from File.io and copied in step 3.
Execute the bash script by running sudo bash ./create-openssl-zip.sh. The script may take a while to complete. You may need to confirm one or more package install prompts.
When the script completes, run the following command to upload the package to File.io: curl -F "file=#lambda-layer-openssl.zip" https://file.io.
Copy the URL for the uploaded file from the terminal session.
In the terminal session on the local development machine, run the following command to download the file: curl {FILE_IO_URL} --output lambda-layer-openssl.zip.
Note: In the script above, replace FILE_IO_URL with the URL returned from File.io and copied in step 13.
Note: If curl is not installed on your dev machine, you can also download the file manually by pasting the copied URL in the address bar of your favorite browser.
Close the SSH session.
In the EC2 Instances list, terminate the build-lambda-layer-openssl EC2 instance since it is not needed any longer.
The OpenSSL Lambda Layer is now ready to be deployed.
For completeness, here is a portion of my serverless.yml file:
functions:
functionName:
# ...
layers:
- { Ref: OpensslLambdaLayer }
layers:
openssl:
name: ${self:provider.stage}-openssl
description: Contains openssl command line utility for lambdas that need it
package:
artifact: 'path\to\lambda-layer-openssl.zip'
compatibleRuntimes:
- nodejs10.x
- nodejs12.x
retain: false
...and here is how I configured PEM in the code file:
import * as pem from 'pem';
process.env.LD_LIBRARY_PATH = '/opt/nodejs/openssl/lib';
pem.config({
pathOpenSSL: '/opt/nodejs/openssl/bin/openssl',
});
// other code...
I contacted AWS Support about this and it turns out that the openssl library is still on the Node10x image, just not the command line utility. However, it's pretty easy to just grab it off a standard AMI and use it as a Lambda layer.
Steps:
Launch an Amazon Linux 2 AMI as an EC2
SSH into the box, or use an SFTP utility to connect to the box
Copy the command line utility for openssl at /usr/bin/openssl somewhere you can work with it locally. In my case I downloaded it to my Mac even though it is a Linux file.
Verify that it's still marked as executable (chmod a+x openssl if necessary if you've downloaded it elsewhere)
Zip up the file
Optional: Upload it to an S3 bucket you can get to
Go to Lambda Layers in the AWS console
Create a new lambda layer. I named mine openssl and used the S3 pointer to the file on S3. You can also upload the zip directly if you have it on a local file system.
Attach the arn provided for the layer to your Lambda function. I use serverless so it was defined in the function setup per their documentation.
In your code, reference openssl as /opt/openssl or you can avoid pathing it in your code (or may not have an option if it's a package you don't control) by adding /opt to you path, i.e.
process.env['PATH'] = process.env['PATH'] + ':' + process.env['LAMBDA_TASK_ROOT'] + ':/opt';
The layer will have been unzipped for you and because you set it to be executable beforehand, it should just work. The underlying openssl libraries are there, so just copying the cli works just fine.
What you can do is to create a lambda layer with the openssl library.
Using the lambdaci/lambda:build-nodejs10.x you can compile the openssl library and create a zip file from the install. The zip file you can then use as a layer for your lambda.
Create a file called create-openssl-zip.sh and make sure to chmod u+x it.
#!/bin/bash -x
# This file should be run inside the lambci/lambda:build-nodejs10.x container
yum update -y
yum install autoconf bison gcc gcc-c++ libcurl-devel libxml2-devel -y
curl -sL http://www.openssl.org/source/openssl-1.1.1d.tar.gz | tar -xvz
cd openssl-1.1.1d
./config --prefix=/var/task/nodejs/openssl --openssldir=/var/task/nodejs/openssl && make && make install
cd /var/task/
rm -rf nodejs/openssl/share
rm -rf nodejs/openssl/include
zip -r lambda-openssl-layer.zip nodejs
cp lambda-openssl-layer.zip /opt/layer/
Then run:
docker run -it -v `pwd`:/opt/layer lambci/lambda:build-nodejs10.x /opt/layer/create-openssl-zip.sh
This will run the script inside the docker container and when it is done you have a file called lambda-openssl-layer.zip in your current directory.
Upload this lambda to an s3 bucket and create a lambda layer.
On your original lambda, add this layer and modify your code so that the PEM library knows where to look for the OpenSSL library as follows:
PEM.config({
pathOpenSSL: '/opt/nodejs/openssl/bin/openssl'
})
And finally add an extra environment variable to your lambda called LD_LIBRARY_PATH with value /opt/nodejs/openssl/lib
Otherwise it will fail with:
/opt/nodejs/openssl/bin/openssl: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory
PEM NPM docs says:
Setting openssl location
In some systems the openssl executable might not be available by the default name or it is not included in $PATH. In this case you can define the location of the executable yourself as a one time action after you have loaded the pem module:
So I think it is not able to find OpenSSL path in system you can try configuring it programmatically :
var pem = require('pem')
pem.config({
pathOpenSSL: '/usr/local/bin/openssl'
})
As you are using AWS Lambda so just try printing process.env.path you will get idea of whether OpenSSL is included in path env variable or not.
You can also check 'OpenSSL' by running below code
const exec = require('child_process').exec;
exec('which openssl',function(err,stdopt,stderr){
console.log(err ? err : stdopt);
})
UPDATE
As #hoangdv mentioned in his answer openssl is seems to be removed for node10.x runtime and I think he is right. Also, we have read-only access to file system so we can't do much.
#Seth McClaine, you can give try for node-forge npm module. One of the module built on top of this is 'https://github.com/jfromaniello/selfsigned' which will make your task easier
https://github.com/lambci/git-lambda-layer/issues/13#issue-444697784 (announcement email)
It seem openssl has been removed in nodejs10.x runtime.
I have checked again on lambci/lambda:build-nodejs10.x docker image and confirmed that. Maybe, you need to change your runtime version or find another way to createCSR.
which: no openssl in (/var/lang/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin)
Use Case
Trying to provision a (Docker Swarm or Consul) cluster where initializing the cluster first occurs on one node, which generates some token, which then needs to be used by other nodes joining the cluster. Key thing being that nodes 1 and 2 shouldn't attempt to join the cluster until the join key has been generated by node 0.
Eg. on node 0, running docker swarm init ... will return a join token. Then on nodes 1 and 2, you'd need to pass that token to the same command, like docker swarm init ${JOIN_TOKEN} ${NODE_0_IP_ADDRESS}:{SOME_PORT}. And magic, you've got a neat little cluster...
Attempts So Far
Tried initializing all nodes with the AWS SDK installed, and storing the join key from node 0 on S3, then fetching that join key on other nodes. This is done via a null_resource with 'remote-exec' provisioners. Due to the way Terraform executes things in parallel, there are racy type conditions and predictably nodes 1 and 2 frequently attempt to fetch a key from S3 thats not there yet (eg. node 0 hasn't finished its stuff yet).
Tried using the 'local-exec' provisioner to SSH into node 0 and capture its join key output. This hasn't worked well or I sucked at doing it.
I've read the docs. And stack overflow. And Github issues, like this really long outstanding one. Thoroughly. If this has been solved elsewhere though, links appreciated!
PS - this is directly related to and is a smaller subset of this question, but wanted to re-ask it in order to focus the scope of the problem.
You can redirect the outputs to a file:
resource "null_resource" "shell" {
provisioner "local-exec" {
command = "uptime 2>stderr >stdout; echo $? >exitstatus"
}
}
and then read the stdout, stderr and exitstatus files with local_file
The problem is that if the files disappear, then terraform apply will fail.
In terraform 0.11 I made a workaround by reading the file with external data source and storing the results in a null_resource triggers (!)
resource "null_resource" "contents" {
triggers = {
stdout = "${data.external.read.result["stdout"]}"
stderr = "${data.external.read.result["stderr"]}"
exitstatus = "${data.external.read.result["exitstatus"]}"
}
lifecycle {
ignore_changes = [
"triggers",
]
}
}
But in 0.12 this can be replaced with file()
and then finally I can use / output those with:
output "stdout" {
value = "${chomp(null_resource.contents.triggers["stdout"])}"
}
See the module https://github.com/matti/terraform-shell-resource for full implementation
You can use external data:
data "external" "docker_token" {
program = ["/bin/bash", "-c" "echo \"{\\\"token\\\":\\\"$(docker swarm init...)\\\"}\""]
}
Then the token will be available as data.external.docker_token.result.token.
If you need to pass arguments in, you can use a script (e.g. relative to path.module). See https://www.terraform.io/docs/providers/external/data_source.html for details.
When I asked myself the same question, "Can I use output from a provisioner to feed into another resource's variables?", I went to the source for answers.
At this moment in time, provisioner results are simply streamed to terraform's standard out and never captured.
Given that you are running remote provisioners on both nodes, and you are trying to access values from S3 - I agree with this approach by the way, I would do the same - what you probably need to do is handle the race condition in your script with a sleep command, or by scheduling a script to run later with the at or cron or similar scheduling systems.
In general, Terraform wants to access all variables either up front, or as the result of a provider. Provisioners are not necessarily treated as first-class in Terraform. I'm not on the core team so I can't say why, but my speculation is that it reduces complexity to ignore provisioner results beyond success or failure, since provisioners are just scripts so their results are generally unstructured.
If you need more enhanced capabilities for setting up your instances, I suggest a dedicated tool for that purpose like Ansible, Chef, Puppet, etc. Terraform's focus is really on Infrastructure, rather than software components.
Simpler solution would be to provide the token yourself.
When creating the ACL token, simply pass in the ID value and consul will use that instead of generating one at random.
You could effectively run the docker swarm init step for node 0 as a Terraform External Data Source, and have it return JSON. Make the provisioning of the remaining nodes depend on this step and refer to the join token generated by the external data source.
https://www.terraform.io/docs/providers/external/data_source.html
With resource dependencies you can ensure that a resource is created before another.
Here's an incomplete example of how I create my consul cluster, just to give you an idea.
resource "aws_instance" "consul_1" {
user_data = <<EOF
#cloud-config
runcmd:
- 'docker pull consul:0.7.5'
- 'docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:0.7.5 agent -server -advertise=${self.private_ip} -bootstrap-expect=2 -datacenter=wordpress -log-level=info -data-dir=/consul/data'
EOF
}
resource "aws_instance" "consul_2" {
depends_on = ["aws_instance.consul_1"]
user_data = <<EOF
#cloud-config
runcmd:
- 'docker pull consul:0.7.5'
- 'docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:0.7.5 agent -server -advertise=${self.private_ip} -retry-join=${aws_instance.consul_1.private_ip} -datacenter=wordpress -log-level=info -data-dir=/consul/data'
EOF
}
For the docker swarm setup I think it's out of Terraform scope and I think it should because the token isn't an attribute of the infrastructure you are creating. So I agree with nbering, you could try to achieve that setup with a tool like Ansible or Chef.
But anyways, if the example helps you to setup your consul cluster I think you just need to configure consul as your docker swarm backend.
Sparrowform - is a lightweight provisioner for Terraform based infrastructure can handle your case. Here is example for aws ec2 instances.
Assuming we have 3 ec2 instances for consul cluster: node0, node1 and node2. The first one (node0) is where we fetch token from and keep it in S3 bucket. The other two ones load token later from S3.
$ nano aws_instance.node0.sparrowfile
#!/usr/bin/env perl6
# have not checked this command, but that's the idea ...
bash "docker swarm init | aws s3 cp - s3://alexey-bucket/stream.txt"
$ nano aws_instance.node1.sparrowfile
#!/usr/bin/env perl6
my $i=0;
my $token;
try {
while True {
my $s3-token = run 'aws', 's3', 'cp', 's3://alexey-bucket/stream.txt', '-', :out;
$token = $s3-token.out.lines[0];
$s3-token.out.close;
last if $i++ > 8 or $token;
say "retry num $i ...";
sleep 2*$i;
}
CATCH { { .resume } }
}
die "we have not succeed in fetching token" unless $token;
bash "docker swarm init $token";
$ nano aws_instance.node2.sparrowfile - the same setup as for node1
$ terrafrom apply # bootstrap infrastructure
$ sparrowform --ssh_private_key=~/.ssh/aws.pub --ssh_user=ec2-user # run provisioning on node0, node1, node2
PS disclosure, I am the tool author.
I'm using Azure batch python API. When I'm creating a new job, I see exit code 128 (image attached). How can I know what is the reason for that?
I'm creating a new job using this code :
def wrap_commands_in_shell(commands):
return "/bin/bash -c 'set -e; set -o pipefail; {}; wait'".format(';'.join(commands))
job_tasks = ['cd /mnt/batch/tasks/shared/ && git clone https://github.com/cryptobiu/OSPSI.git',
'cd /mnt/batch/tasks/shared/OSPSI && git checkout cloud',
'cd /mnt/batch/tasks/shared/OSPSI && cmake CMake',
'cd /mnt/batch/tasks/shared/OSPSI && mkdir -p assets'
]
job_creation_information = batch.models.JobAddParameter(job_id, batch.models.PoolInformation(pool_id=pool_id),
job_preparation_task=batch.models.JobPreparationTask(
command_line=wrap_commands_in_shell(
job_tasks),
run_elevated=True,
wait_for_success=True
)
)
To diagnose, you can look at the stderr.txt and stdout.txt for the Job Preparation task that has failed in the Azure Portal, using Azure Batch Explorer, or using an SDK via code. If you look at which node ran the job prep task, navigate to that node, then the job directory. Under the job directory, you should see a jobpreparation directory. In that directory will have the stderr.txt and stdout.txt.
With regard to the exit code, there are a few potential problems that could cause this:
Did you install git, cmake and any other dependencies as part of a start task?
I get a 404 when I try to navigate to: https://github.com/cryptobiu/OSPSI. Does this repo exist? If it's a private repository, are you providing the correct credentials?
A few notes about your job_tasks array:
You should not hardcode the paths /mnt/batch/tasks/shared. This path to the "shared" directory may not be the same between Linux distributions. You should use the environment variable $AZ_BATCH_NODE_SHARED_DIR instead. You can view a full list of Azure Batch pre-filled environment variables here.
You do not need to cd into the directory for each command, you only need to do it once. You can rewrite job_tasks as:
['cd $AZ_BATCH_NODE_SHARED_DIR',
'TODO: INSERT YOUR COMMANDS TO SETUP AUTH WITH GITHUB FOR PRIVATE REPO',
'git clone https://github.com/cryptobiu/OSPSI.git',
'cd OSPSI',
'cmake CMake',
'mkdir -p assets']