Databricks init scripts not working sometimes - databricks

OK, this is very strange. I have an init script that I would like to run when a cluster starts.
The cluster is configured with the init script, which is in a file in DBFS, basically this:
dbfs:/databricks/init-scripts/custom-cert.sh
Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the cluster's event log shows a duration of 1 second for the init script:
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I instead put the same contents in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, according to the event log, but the execution duration is 0 seconds.
I have the shell script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/, it does not contain the copied orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't see any difference, i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both scenarios. What is the problem in the second case?
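A byte-level check of the file in each scenario could be done like this (a minimal %sh sketch; comparing the checksums and the cat -A output across the two scenarios would show any difference that is invisible to the naked eye):
%sh
# checksum of the init script as the cluster sees it
md5sum /dbfs/databricks/init-scripts/custom-cert.sh
# dump the file with non-printing characters visible (a carriage return shows as ^M at the end of a line)
cat -A /dbfs/databricks/init-scripts/custom-cert.sh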
EDIT: I read a bit more about init scripts and found that their logs are written here:
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is update-ca-certificates found in the first case but not when I put the same commands in a shell script and upload it to DBFS (instead of executing dbutils.fs.put within a notebook)?
EDIT 2: In response to the first answer: after running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh. I then restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh, and it works. So the init script is reading essentially the same content (the generated shell script). Why can't it read the file if I don't use dbutils.fs.put but instead just put the contents in a bash file and upload it during the CI/CD process?
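For context, the pipeline upload is essentially the equivalent of the following (a sketch using the Databricks CLI; the actual pipeline task I use is not shown here and may differ):
# copy the script from the repo checkout to the same DBFS location used above
databricks fs cp custom-cert.sh dbfs:/databricks/init-scripts/custom-cert.sh --overwrite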

As we know, an init script is a shell script that runs during the startup of each cluster node, before the Apache Spark driver or worker JVM starts. In case 2, when you run a bash command using the %sh magic command, you are executing it only on the local driver node, so the worker nodes are not able to access the result. In case 1, by using dbutils.fs.put (the %fs equivalent) you run the copy against the DBFS root, so the worker nodes, along with the driver node, can also access the path.
Ref : https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram

It seems that the observations I made in the comments section of my question are the way to go.
I now create the init script using a Databricks job that I run from Azure DevOps during the CI/CD pipeline.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job pointing to this notebook; the cluster is a job cluster, which is only temporary. Of course, in my case, even this job creation is automated using a PowerShell script.
I then call this Databricks job in the release pipeline, again using a PowerShell script.
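For illustration, triggering the job from the release pipeline boils down to a call to the Databricks Jobs REST API; a rough sketch with curl (the workspace URL, token variable, and job ID are placeholders, and my actual script uses PowerShell):
curl -s -X POST "https://<databricks-instance>/api/2.1/jobs/run-now" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"job_id": 12345}'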
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I still do not know or understand why the same script file can't simply be part of a repo and be uploaded during the release process (instead of this Databricks job calling a notebook). I would love to know the reason. The other answer on this question does not hold true: as you can see, the script is created by a job cluster and then used by another cluster as part of its init script.
It simply boils down to how the init script gets created.
But it gets my job done; I'm sharing it in case it helps someone get their job done too.
I have raised a support case, though, to understand the reason.

Related

Docker - unable to run script

What I'm doing
I am using AWS Batch to run a Docker container for a large compute job. I have configured ECR/ECS successfully, to the best of my knowledge, but I am having issues running the required commands, for reasons that are beyond my level of understanding of Docker (newbie).
What I need to do is pass the commands below to my application and start it to perform some heavy computing tasks; all of the commands listed below must be present.
The Issue(s)
The issue arises when I submit the job to AWS Batch; the service pulls the image from ECR (the Amazon container registry) and spins up a compute environment. The problem comes when I try to run the command I pass in; I will go through it below.
"command": [
"mkdir -p logging",
"chmod 777 logging/",
"docker run -t -i -e my-application", # container name
"-e APIKEY",
"-e BASEURI",
"-e APIUSER",
"-v WORKSPACE /logging:/src/log",
"DOCKERIMAGE",
"python my_app.py",
"-t APP_USER",
"-e APP_ENVIRONMENT",
"-u APP_USERNAME",
"-p APP_PASSWORD",
"-i IN_PATH",
"-o OUT_PATH",
"-b tmp/"
]
The command above generates the following error(s)
container_linux.go:370: starting container process caused: exec: "mkdir -p log": executable file not found in $PATH
I tried passing in a command to echo the env var $PATH but was unsuccessful in getting a response, and it resulted in a similar error.
I have successfully run "ls" and was able to see the directory contents of my application.
I am not, however, able to run any of the commands that I have included in the command [] section. I have tried just running python and the like in hopes of getting a more detailed error, but was unsuccessful.
Logic in plain English
Create a directory called logging if it doesn't exist
Set the permissions for logging
Run the docker container and pass in the environment variables while doing so
Tell docker to run the python file my_app.py and pass in the expected runtime args
Execute and perform the required logic delegated in the python3 application
Questions
Why can I not create a directory called "logging" here? Where am I?
Am I running these commands properly, as defined by AWS Batch or Docker?
What am I missing or where am I going wrong?
AWS Batch high level doc
AWS Batch link specific to what I'm doing
Assuming that you're following the syntax described in the Container Properties section of the AWS docs, you have several problems with the syntax of your command directive.
First
The command directive can only run a single command. You can't mash together a bunch of commands as you're trying to do in your example. If you need to run multiple commands you would need to embed them as an argument to a shell. For example, something like:
command: ["/bin/sh", "-c", "mkdir -p logging; chmod 777 logging; ..."]
Second
You must properly tokenize your command lines -- that is, when you type mkdir -p logging at the command prompt, the shell splits this into three parts (or "tokens"): ['mkdir', '-p', 'logging']. You need to do the same thing when building up the list of arguments to command.
This is invalid:
command: ["mkdir -p logging"]
That would look for a command named mkdir -p logging, and of course no such command exists. It would properly be written as:
command: ["mkdir", "-p", "logging"]
Third
I'm not very familiar with the AWS Batch environment, but it's unlikely you can run a docker command inside a docker container as you're trying to do. It's unclear why you're doing this, though: why not just configure your AWS Batch job with the appropriate image, environment variables, etc.?
Take a look at some of these example job definitions.
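For example, a job definition along these lines keeps command as a single, properly tokenized shell invocation and supplies the variables through environment (a sketch only; the image name, resource sizes, and values are placeholders, and only a subset of the original arguments is shown):
cat > jobdef.json <<'EOF'
{
  "jobDefinitionName": "my-application",
  "type": "container",
  "containerProperties": {
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/my-application:latest",
    "vcpus": 4,
    "memory": 8192,
    "command": ["/bin/sh", "-c",
                "mkdir -p logging && chmod 777 logging && python my_app.py -i \"$IN_PATH\" -o \"$OUT_PATH\" -b tmp/"],
    "environment": [
      {"name": "APIKEY",   "value": "<apikey>"},
      {"name": "BASEURI",  "value": "<base uri>"},
      {"name": "IN_PATH",  "value": "<input path>"},
      {"name": "OUT_PATH", "value": "<output path>"}
    ]
  }
}
EOF
aws batch register-job-definition --cli-input-json file://jobdef.json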

Deploy Node.js application on Azure Linux VM using Azure release pipeline

I am creating a CI & CD pipeline for a Node.js application using Azure DevOps.
I deployed the build code to an Azure Linux VM using an Azure release pipeline, where I configured a deployment group job.
In the deployment group I used the Extract Files task to unzip the build files.
The unzip works fine and my code is deployed to this path: $(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip
After that I would like to run the pm2 command using the Azure release pipeline; for this task I use a Bash task in the deployment group job and write the commands:
cd $(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip
cd coreservices
pm2 start server.js
But bash does not execute it; it gives exit code 2.
This error is caused by the parentheses ( and ) in the path on the first line of your command. In bash, parentheses are used for grouping, so they cannot be used as ordinary characters on the command line.
To solve it, you need to escape the parentheses with \ so they are treated as normal characters:
cd $(System.DefaultWorkingDirectory)/LearnCab-Manage\(V1.5\)-CI \(1\)/coreservices/*.zip
And now \(V1.5\) and \(1\) are interpreted as the literal (V1.5) and (1).
Alternatively, you can wrap the path in single or double quotes:
cd "$(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip"
Or
cd '$(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices/*.zip'
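Putting it together, the inline script for the Bash task could look roughly like this (a sketch; it assumes the Extract Files task unzipped the archive into the coreservices folder shown in the question):
# quote the path so the space and parentheses are taken literally
cd "$(System.DefaultWorkingDirectory)/LearnCab-Manage(V1.5)-CI (1)/coreservices"
pm2 start server.js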

How to run two shell scripts at startup?

I am working with Ubuntu 16.04 and I have two shell scripts:
run_roscore.sh: This one fires up a roscore in one terminal.
run_detection_node.sh: This one starts an object detection node in another terminal and should start up once run_roscore.sh has initialized the roscore.
I need both the scripts to execute as soon as the system boots up.
I made both scripts executable and then added the following command to cron:
@reboot /path/to/run_roscore.sh; /path/to/run_detection_node.sh, but it is not running.
I have also tried adding both scripts to Startup Applications, using this command for roscore: sh /path/to/run_roscore.sh and this command for the detection node: sh /path/to/run_detection_node.sh. It still does not work.
How do I get these scripts to run?
EDIT: I used the following command to see the system log for the CRON process: grep CRON /var/log/syslog and got the following output:
CRON[570]: (CRON) info (No MTA installed, discarding output).
So I installed an MTA, and then the system log shows:
CRON[597]: (nvidia) CMD (/path/to/run_roscore.sh; /path/to/run_detection_node.sh)
I am still not able to see the output (which is supposed to be a camera stream with detections, as I see it when I run the scripts directly in a terminal). How should I proceed?
Since I eventually got this working, I am going to answer my own question here.
I did the following steps to get the script running from startup:
Changed the type of the script from shell to bash (extension .bash).
Changed the shebang statement to be #!/bin/bash.
In Startup Applications, give the command bash path/to/script to run the script.
Basically, once I changed the shell type from sh to bash, the scripts started running as soon as the system boots up.
Note, in case this helps someone: my intention in having run_roscore.bash as a separate script was to run roscore as a background process. One can run it directly from a single script (which also runs the detection node) by putting roscore & as a command before the ROS node starts; this fires up the master as a background process and leaves the same terminal open for the following commands to execute.
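As a sketch, a single combined script along those lines could look like this (the package and node names are placeholders for whatever the detection node actually is):
#!/bin/bash
# start the ROS master in the background
roscore &
# crude wait for the master to come up; polling with rostopic list in a loop would be more robust
sleep 5
# start the detection node from the same script
rosrun my_detection_package detection_node.py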
If you can install immortal, you can use the require option to start your services in sequence. For example, this could be the run config for /etc/immortal/script1.yml:
cmd: /path/to/script1
log:
    file: /var/log/script1.log
wait: 1
require:
    - script2
And for /etc/immortal/script2.yml:
cmd: /path/to/script2
log:
    file: /var/log/script2.log
What this does is try to start both scripts at boot time; the first one, script1, will wait 1 second before starting and will also wait for script2 to be up and running. See more about the wait and require options here: https://immortal.run/post/immortal/
Depending on your operating system, you will need to configure/set up immortaldir; here is how to do it for Linux: https://immortal.run/post/how-to-install/
Going deeper into the topic of supervisors, there are more alternatives; you can find some here: https://en.wikipedia.org/wiki/Process_supervision
If you want to make sure that "roscore" (whatever it is) gets started when your Ubuntu starts up, then you should start it as a service (not via cron).
See this question/answer.
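For example, a minimal systemd unit for the roscore script could look like this on Ubuntu 16.04 (a sketch; the unit name and script path are placeholders):
sudo tee /etc/systemd/system/roscore.service >/dev/null <<'EOF'
[Unit]
Description=Start roscore at boot
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash /path/to/run_roscore.bash
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now roscore.service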

Azure batch job start tasks failed

I'm using the Azure Batch Python API. When I create a new job, I see exit code 128 (image attached). How can I find out the reason for that?
I'm creating a new job using this code :
def wrap_commands_in_shell(commands):
    return "/bin/bash -c 'set -e; set -o pipefail; {}; wait'".format(';'.join(commands))
job_tasks = ['cd /mnt/batch/tasks/shared/ && git clone https://github.com/cryptobiu/OSPSI.git',
'cd /mnt/batch/tasks/shared/OSPSI && git checkout cloud',
'cd /mnt/batch/tasks/shared/OSPSI && cmake CMake',
'cd /mnt/batch/tasks/shared/OSPSI && mkdir -p assets'
]
job_creation_information = batch.models.JobAddParameter(
    job_id,
    batch.models.PoolInformation(pool_id=pool_id),
    job_preparation_task=batch.models.JobPreparationTask(
        command_line=wrap_commands_in_shell(job_tasks),
        run_elevated=True,
        wait_for_success=True
    )
)
To diagnose, you can look at the stderr.txt and stdout.txt for the Job Preparation task that failed, either in the Azure Portal, using Azure Batch Explorer, or via an SDK in code. Look at which node ran the job prep task, navigate to that node, and then to the job directory. Under the job directory you should see a jobpreparation directory; that directory will have the stderr.txt and stdout.txt.
With regard to the exit code, there are a few potential problems that could cause this:
Did you install git, cmake and any other dependencies as part of a start task?
I get a 404 when I try to navigate to: https://github.com/cryptobiu/OSPSI. Does this repo exist? If it's a private repository, are you providing the correct credentials?
A few notes about your job_tasks array:
You should not hardcode the path /mnt/batch/tasks/shared. The path to the "shared" directory may not be the same between Linux distributions. Use the environment variable $AZ_BATCH_NODE_SHARED_DIR instead. You can view a full list of Azure Batch pre-filled environment variables here.
You do not need to cd into the directory for each command; you only need to do it once. You can rewrite job_tasks as:
['cd $AZ_BATCH_NODE_SHARED_DIR',
'TODO: INSERT YOUR COMMANDS TO SETUP AUTH WITH GITHUB FOR PRIVATE REPO',
'git clone https://github.com/cryptobiu/OSPSI.git',
'cd OSPSI',
'cmake CMake',
'mkdir -p assets']
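For illustration, wrap_commands_in_shell() from the question would join that rewritten list into a single command line roughly like this (the placeholder auth step is omitted, and the real string is a single line):
/bin/bash -c 'set -e; set -o pipefail; cd $AZ_BATCH_NODE_SHARED_DIR; git clone https://github.com/cryptobiu/OSPSI.git; cd OSPSI; cmake CMake; mkdir -p assets; wait'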

Runtime.exec() in Hadoop on Azure environment

This question is related to Hadoop on Azure environment.
I am trying to use Runtime.exec() to execute a batch script in the reduce function. I could not get this running in the Hadoop on Azure environment, while it runs fine in Hadoop on Linux. I tested the Runtime.exec() code snippet in my desktop (Windows 7) environment and it runs fine there. I have made sure that I consume the output and error streams of the sub-process after Runtime.exec().
The batch script contains the following (a single command):
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\tool.exe
-f c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_000001.out
-i c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\input.txt
I distribute the tool.exe and input.txt files using the distributed cache, and it creates symlinks in the working directory; tool.exe and input.txt point to the actual files in the jobcache directory.
2012-07-16 04:31:51,613 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /hdfs/mapred/local/taskTracker/distcache/-978619214658189372_-1497645545_209290723/10.73.50.78tool.exe <- \hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\tool.exe
2012-07-16 04:31:51,644 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /hdfs/mapred/local/taskTracker/distcache/-4944695173898834237_1545037473_2085004342/10.73.50.78input.txt <- \hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\input.txt
The reducer gives the following error when it runs:
Command Execution Error: Cannot run program
"cmd /q /c c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_0000011513543720767963399.bat":
CreateProcess error=2, The system cannot find the file specified
In another case, I tried running the same thing but without using absolute paths. The output stream from the sub-process is shown below:
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0022\attempt_201207121317_0022_r_000000_0\work>tool.exe -f /hdfs/mapred/local/taskTracker/nabeel/jobcache/job_201207121317_0022/work/1_task_201207121317_0022_r_000000.out
-i input.txt
I do not know how the job working directory paths and the distributed cache work in the Hadoop on Azure environment. Could you please let me know if I am missing something here, or whether there is something I need to take care of while using Runtime.exec() in the Hadoop on Azure environment?
Thanks,
I am not familiar with Hadoop, but the error message seems obvious. It would be better if you check whether this file exists:
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_0000011513543720767963399.bat
Best Regards,
Ming Xu
