How to get `run_id` when using MLflow Project - mlflow

When using MLflow Projects (via an MLproject file) I get this message at starting time:
INFO mlflow.projects.backend.local:
=== Running command 'source /anaconda3/bin/../etc/profile.d/conda.sh &&
conda activate mlflow-4736797b8261ec1b3ab764c5060cae268b4c8ffa 1>&2 &&
python3 main.py' in run with ID 'e2f0e8c670114c5887963cd6a1ac30f9' ===
I want to access the run_id shown above (e2f0e8c670114c5887963cd6a1ac30f9) from inside the main script.
I expected a run to be active but:
mlflow.active_run()
> None
Starting a run inside the main script does give me access to the correct run_id, although any subsequent runs have a different run_id.
# first run inside the script - correct run_id
with mlflow.start_run():
    print(mlflow.active_run().info.run_id)
> e2f0e8c670114c5887963cd6a1ac30f9
# second run inside the script - wrong run_id
with mlflow.start_run():
    print(mlflow.active_run().info.run_id)
> 417065241f1946b98a4abfdd920239b1
Seems like a strange behavior, and I was wondering if there's another way to access the run_id assigned at the beginning of the MLproject run?

with mlflow.start_run() as run:
    print(run.info.run_id)
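When the script is launched through `mlflow run`, the launcher hands the run ID to the child process, which is why the first `mlflow.start_run()` picks up that run. A minimal sketch of reading the ID directly, assuming it is exported in the `MLFLOW_RUN_ID` environment variable (the helper name `project_run_id` is hypothetical, not part of MLflow):

```python
import os


def project_run_id(env=None):
    """Return the run ID handed over by `mlflow run`, or None outside a project run.

    Assumption: the project launcher exports the ID as MLFLOW_RUN_ID.
    """
    env = os.environ if env is None else env
    return env.get("MLFLOW_RUN_ID")


run_id = project_run_id()
if run_id is None:
    print("MLFLOW_RUN_ID is not set; probably not launched via `mlflow run`")
else:
    print(run_id)
```

If the value is present, passing it as `mlflow.start_run(run_id=run_id)` should resume the project run instead of creating a new one.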

Related

How to get immediate output from a job run within gitlab-runner?

The command gitlab-runner lets you "test" a gitlab job locally. However, the local run of a job seems to have the same problem as a gitlab job run in gitlab CI: The output is not immediate!
What I mean: Even if your code/test/whatever produces printed output, it is not shown immediately in your log or console.
Here is how you can reproduce this behavior (on Linux):
Create a new git repository
mkdir testrepo
cd testrepo
git init
Create file .gitlab-ci.yml with the following content
job_test:
  image: python:3.8-buster
  script:
    - python tester.py
Create a file tester.py with the following content:
import time
for index in range(10):
    print(f"{time.time()} test output")
    time.sleep(1)
Run this code locally
python tester.py
which produces the output
1648130393.143866 test output
1648130394.1441162 test output
1648130395.14529 test output
1648130396.1466148 test output
1648130397.147796 test output
1648130398.148115 test output
1648130399.148294 test output
1648130400.1494567 test output
1648130401.1506176 test output
1648130402.1508648 test output
with each line appearing on the console every second.
You commit the changes
git add tester.py
git add .gitlab-ci.yml
git commit -m "just a test"
You start the job within a gitlab runner
gitlab-runner exec docker job_test
....
1648130501.9057398 test output
1648130502.9068272 test output
1648130503.9079702 test output
1648130504.9090931 test output
1648130505.910158 test output
1648130506.9112566 test output
1648130507.9120533 test output
1648130508.9131665 test output
1648130509.9142723 test output
1648130510.9154003 test output
Job succeeded
Here you get essentially the same output, but you have to wait for 10 seconds and then you get the complete output at once!
What I want is to see the output as it happens. So like one line every second.
How can I do that for both, the local gitlab-runner and the gitlab CI?
In the source code, this is controlled mostly by the clientJobTrace's updateInterval and forceSendInterval properties.
These properties are not user-configurable. In order to change this functionality, you would have to patch the source code for the GitLab Runner and compile it yourself.
The parameters for the job trace are passed from the newJobTrace function and their defaults (where you would need to alter the source) are defined here.
Also note that the UI for GitLab may not necessarily get the trace in realtime, either. So, even if the runner has sent the trace to GitLab, the javascript responsible for updating the UI only polls for trace data every ~4 or 5 seconds.
You can poll gitlab web for new log lines as fast as you can:
For a running job, use a URL like: https://gitlab.example.sk/grpup/project/-/jobs/42006/trace It will send you a JSON structure with the lines of the log file, offset, size and so on. You can have a look at the documentation here: https://docs.gitlab.com/ee/api/jobs.html#get-a-log-file
Sidenote: you can use the undocumented “state” parameter from the response in a subsequent request to get only new lines (if any). This is handy.
Though, this does not affect the latency with which new lines travel from the running job on the runner to the GitLab web backend. See sytech's answer to this question.
This answer should help when a Redis cache and the incremental logging architecture are configured and someone wants to get logs from a currently running job in "realtime". Polling is still needed, though.
Some notes can be found also on forum: https://forum.gitlab.com/t/is-there-an-api-for-getting-live-log-from-running-job/73072
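Independent of the runner-side intervals discussed above, it may be worth ruling out buffering in the script itself: Python block-buffers stdout when it is not attached to a terminal, so output can be held back before the runner ever sees it. The sketch below is a variant of tester.py that flushes after every line; whether this changes what gitlab-runner displays is an assumption that the answers here do not confirm, and the function name `emit` is made up for illustration.

```python
import sys
import time


def emit(lines, delay=1.0, stream=None):
    """Print one timestamped line per iteration and flush it immediately."""
    stream = sys.stdout if stream is None else stream
    for _ in range(lines):
        # flush=True pushes the line out instead of leaving it in Python's buffer
        print(f"{time.time()} test output", file=stream, flush=True)
        time.sleep(delay)


emit(3, delay=0.0)
```

Running the unmodified script as `python -u tester.py` has the same effect without touching the code.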

Databricks init scripts not working sometimes

OK, this is very strange. I have some init scripts that I would like to run when a cluster starts.
The cluster has an init script, which is in a file (in DBFS),
basically this:
dbfs:/databricks/init-scripts/custom-cert.sh
Now, when I create the init script like this, it works (no SSL errors for my endpoints). Also, the event log for the cluster shows a duration of 1 second for the init script.
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
However, if I just put the init script in a bash script and upload it to DBFS through a pipeline, the init script does not do anything. It executes, as per the event log, but the execution duration is 0 seconds.
I have the sh script in a file named
custom-cert.sh
with the same contents as above, i.e.
#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt"
but when I check /usr/local/share/ca-certificates/, it does not contain /dbfs/orgcertificates/orgcerts.crt, even though the cluster init script has run.
Also, I have compared the contents of the init script in both cases and, at least to the naked eye, I can't see any difference
i.e.
%sh
cat /dbfs/databricks/init-scripts/custom-cert.sh
shows the same contents in both the scenarios. What is the problem with the 2nd case?
EDIT: I read a bit more about init scripts and found that the logs of init scripts are written here
%sh
ls /databricks/init_scripts/
Looking at the err file in that location, it seems there is an error
sudo: update-ca-certificates
: command not found
Why is update-ca-certificates found in the first case, but not when I put the same script in an sh file and upload it to DBFS (instead of executing dbutils.fs.put within a notebook)?
EDIT 2: In response to the first answer. After running the command
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/orgcertificates/orgcerts.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
the output is the file custom-cert.sh; I then restart the cluster with the init script location set to dbfs:/databricks/init-scripts/custom-cert.sh, and it works. So the init script is reading essentially the same content (the generated sh script). Why can't it read it if I do not use dbutils.fs.put but just put the contents in a bash file and upload it during the CI/CD process?
As we know, an init script is a shell script that runs during startup of each cluster node, before the Apache Spark driver or worker JVM starts. In case 2, when you run a bash command using the %sh magic command, you are executing it on the local driver node only, so the worker nodes cannot access the result. In case 1, by using the %fs magic command (dbutils.fs.put) you are running the copy command from the root, so the worker nodes can access the path as well as the driver node.
Ref : https://docs.databricks.com/data/databricks-file-system.html#summary-table-and-diagram
It seems that the observations I made in the comments section of my question are the way to go.
I now create the init script using a databricks job that I run during the CI/CD pipeline from Azure DevOps.
The notebook has the commands
dbutils.fs.rm("/databricks/init-scripts/custom-cert.sh")
dbutils.fs.put("/databricks/init-scripts/custom-cert.sh", """#!/bin/bash
cp /dbfs/internal-certificates/certs.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
""")
I then create a Databricks job (pointing to this notebook); the cluster is a job cluster, which is just temporary. Of course, in my case, even this job creation is automated using a PowerShell script.
I then call this Databricks job in the release pipeline, again using a PowerShell script.
This creates the file
/databricks/init-scripts/custom-cert.sh
I then use this file in any other cluster that accesses my org's endpoints (without certificate errors).
I do not know (and still do not understand) why the same script file can't just be part of a repo and uploaded during the release process (instead of this Databricks job calling a notebook). I would love to know the reason. The other answer on this question does not hold true: as you can see, the init script is created by a job cluster and then accessed from another cluster as its init script.
It simply boils down to how the init script gets created.
But this gets my job done. I am posting it in case it helps someone else get theirs done too.
I have raised a support case though to understand the reason.
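The error log quoted in the question breaks the line just before ": command not found", which is what a shell typically prints when the command name ends in a carriage return. That suggests, though nothing in this thread confirms it, that the file uploaded by the pipeline has Windows (CRLF) line endings while the one written by dbutils.fs.put does not. A minimal sketch for checking and normalizing a script before upload (the helper names are hypothetical):

```python
def has_crlf(data: bytes) -> bool:
    """Report whether the script contains Windows-style line endings."""
    return b"\r\n" in data


def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF line endings to LF so bash sees clean command names."""
    return data.replace(b"\r\n", b"\n")


# Sample content standing in for a custom-cert.sh saved with CRLF endings
script = b"#!/bin/bash\r\nsudo update-ca-certificates\r\n"
print(has_crlf(script))
print(has_crlf(to_unix_newlines(script)))
```

Comparing the two files byte for byte (for example their sizes) rather than with cat would show the difference, since carriage returns are invisible in terminal output.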

Docker - unable to run script

What I'm doing
I am using AWS Batch to run a docker container for a large compute job. I have configured ECR/ECS successfully to the best of my knowledge, but I am having issues running the required commands for reasons that are beyond my level of understanding of docker (I am a newbie).
What I need to do is pass the commands below into my application and start my application to perform some heavy computing tasks; all commands listed below must be present.
The Issue(s)
The issue arises when I submit the job to AWS Batch; this service pulls the image from ECR (Amazon Elastic Container Registry) and spins up a compute environment. The problem occurs when I try to run the command I pass in; I will go through it below.
"command": [
"mkdir -p logging",
"chmod 777 logging/",
"docker run -t -i -e my-application", # container name
"-e APIKEY",
"-e BASEURI",
"-e APIUSER",
"-v WORKSPACE /logging:/src/log",
"DOCKERIMAGE",
"python my_app.py",
"-t APP_USER",
"-e APP_ENVIRONMENT",
"-u APP_USERNAME",
"-p APP_PASSWORD",
"-i IN_PATH",
"-o OUT_PATH",
"-b tmp/"
]
The command above generates the following error(s)
container_linux.go:370: starting container process caused: exec: "mkdir -p log": executable file not found in $PATH
I tried to pass in a command to echo the env var $PATH but was unsuccessful in getting a response; it resulted in a similar error.
I have successfully run "ls" and was able to see the directory contents of my application inside.
However, I am not able to run any of the commands that I have included in the command [] section. I have tried just running python and the like in hopes of getting a more detailed error, but was unsuccessful.
Logic in plain English
Create a path called logging if it doesn't exist
Set the permissions for logging
Run the docker container and pass in the environment variables while doing so
Tell docker to run the python file my_app.py and pass in the expected runtime args
Execute and perform the required logic delegated to the python3 application
Questions
Why can I not create a directory here called "logging"? Where am I?
Am I running these properly as defined by AWS Batch or by docker?
What am I missing or where am I going wrong?
AWS Batch high level doc
AWS Batch link specific to what i'm doing
Assuming that you're following the syntax described in the Container Properties section of the AWS docs, you have several problems with the syntax of your command directive.
First
The command directive can only run a single command. You can't mash together a bunch of commands as you're trying to do in your example. If you need to run multiple commands you would need to embed them as an argument to a shell. For example, something like:
command: ["/bin/sh", "-c", "mkdir -p logging; chmod 777 logging; ..."]
Second
You must properly tokenize your command lines -- that is, when you type mkdir -p logging at the command prompt, the shell splits this into three parts (or "tokens"): ['mkdir', '-p', 'logging']. You need to do the same thing when building up the list of arguments to command.
This is invalid:
command: ["mkdir -p logging"]
That would look for a command named mkdir -p logging, and of course no such command exists. It would properly be written as:
command: ["mkdir", "-p", "logging"]
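The first two points can be sketched in Python, assuming the job definition is being assembled programmatically. `shlex.split` performs the same tokenization the shell would, and the second helper wraps several command lines in a single shell invocation. Both helper names are hypothetical, and joining with `&&` (stop at the first failure) rather than `;` is a choice made here, not something the answer prescribes.

```python
import shlex


def tokenize(command_line):
    """Split one command line into the token list expected by "command"."""
    return shlex.split(command_line)


def as_shell_command(*command_lines):
    """Embed several command lines as one argument to /bin/sh -c."""
    return ["/bin/sh", "-c", " && ".join(command_lines)]


print(tokenize("mkdir -p logging"))
print(as_shell_command("mkdir -p logging", "chmod 777 logging"))
```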
Third
I'm not very familiar with the AWS Batch environment, but it's unlikely you can run a docker command inside a docker container as you're trying to do. It's unclear why you're doing this, though: why not just configure your AWS Batch job with the appropriate image, environment variables, etc.?
Take a look at some of these example job definitions.

How to get logs from inside the container executed using DockerOperator?(Airflow)

I'm facing logging issues with DockerOperator.
I'm running a python script inside the docker container using DockerOperator and I need Airflow to output the logs from the python script running inside the container. Airflow is marking the job as a success, but the script inside the container is failing and I have no clue what is going on as I cannot see the logs properly. Is there a way to set up logging for DockerOperator apart from setting the tty option to True as suggested in the docs?
It looks like you can have logs pushed to XComs, but it's off by default. First, you need to pass xcom_push=True for it to at least start sending the last line of output to XCom. Then additionally, you can pass xcom_all=True to send all output to XCom, not just the last line.
Perhaps not the most convenient place to put debug information, but it's pretty accessible in the UI at least either in the XCom tab when you click into a task or there's a page you can list and filter XComs (under Browse).
Source: https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L112-L117 and https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L248-L250
Instead of DockerOperator you can use client.containers.run and then do the following:
import docker
from airflow import DAG
from airflow.decorators import task

# default_args is assumed to be defined elsewhere in the DAG file
with DAG(dag_id='dag_1',
         default_args=default_args,
         schedule_interval=None,
         tags=['my_dags']) as dag:

    @task(task_id='task_1')
    def start_task(**kwargs):
        # get the docker params from the environment
        client = docker.from_env()
        # run the container
        response = client.containers.run(
            # The container you wish to call
            image='__container__:latest',
            # The command to run inside the container
            command="python test.py",
            version='auto',
            auto_remove=True,
            stdout=True,
            stderr=True,
            tty=True,
            detach=True,
            remove=True,
            ipc_mode='host',
            network_mode='bridge',
            # Passing the GPU access
            device_requests=[
                docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
            ],
            # Give the proper system volume mount point
            volumes=[
                'src:/src',
            ],
            working_dir='/src'
        )
        # stream the container output so it lands in the Airflow task log
        output = response.attach(stdout=True, stream=True, logs=True)
        for line in output:
            print(line.decode())
        return str(response)

    test = start_task()
Then in your test.py script (in the docker container) you have to do the logging using the standard Python logging module:
import logging
logger = logging.getLogger("airflow.task")
logger.info("Log something.")
Reference: here

How can I set run_name in mlflow command line?

MLFlow version: 1.4.0
Python version: 3.7.4
I'm running the UI as mlflow server... with all the required command line options.
I am logging to MLFlow as an MLFlow project, with the appropriate MLproject.yaml file. The project is being run on a Docker container, so the CMD looks like this:
mlflow run . -P document_ids=${D2V_DOC_IDS} -P corpus_path=... --no-conda --experiment-name=${EXPERIMENT_NAME}
Running the experiment like this results in a blank run_name. I know there's a run_id but I'd also like to see the run_name and set it in my code -- either in the command line, or in my code as mlflow.log.....
I've looked at Is it possible to set/change mlflow run name after run initial creation? but I want to programmatically set the run name instead of changing it manually on the UI.
One of the parameters to mlflow.start_run() is run_name. This would give you programmatic access to set the run name with each iteration. See the docs here.
Here's an example:
from datetime import datetime
import mlflow
## Define the name of our run
name = "this run is gonna be bananas " + datetime.now().isoformat()
## Start a new mlflow run and set the run name
with mlflow.start_run(run_name=name):
    ## ...train model, log metrics/params/model...
    pass
## The with block ends the run on exit, so no explicit mlflow.end_run() is needed
If you want to set the name as part of an MLflow Project, you'll have to specify it as a parameter in the entry points of the project, which are defined in the MLproject file. Then you can pass the value from the command line into the mlflow.start_run() function.
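A minimal sketch of the script side of that approach, assuming the MLproject entry point forwards a parameter as a `--run-name` command-line argument (the argument name and both function names are hypothetical). Note that `mlflow.start_run` documents `run_name` as applying to newly created runs, so whether it takes effect on a run that `mlflow run` has already created is not confirmed here.

```python
import argparse


def parse_args(argv=None):
    """Read the run name forwarded by the project entry point."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run-name", default=None)
    return parser.parse_args(argv)


def main(argv=None):
    # Imported here so the argument parsing can be exercised without MLflow installed
    import mlflow

    args = parse_args(argv)
    with mlflow.start_run(run_name=args.run_name) as run:
        print(run.info.run_id)
```

The main script would call `main()` at the bottom, and the project would then be started with something like `mlflow run . -P run_name=my-run`, assuming `run_name` is declared as a parameter in the entry point.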
For the CLI, this now seems to be available:
--run-name <runname>
https://mlflow.org/docs/latest/cli.html#cmdoption-mlflow-run-run-name
