Job submission fails with: CondaHTTPError: HTTP 000 CONNECTION FAILED - azure-machine-learning-service

I'm trying to use the repo https://github.com/microsoft/MLAKSDeployAML/ to deploy an AML service with AKS.
I created this on an NC6_v2 DSVM and, after struggling to get conda working at all, I finally got my environment set up and started running the notebooks.
I submit the experiment, then wait on run.wait_for_completion(show_output=True), and it bombs with the HTTP error. The full control log is attached below.
Is this something to do with being a GPU machine, perhaps, or is there something else going on with the service?
Streaming log file azureml-logs/60_control_log.txt
Starting the daemon thread to refresh tokens in background for process with pid = 13317
nvidia-docker is installed on the target. Using nvidia-docker for docker operations.
Running: ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_checker.sh']
Materialized image not found on target: azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3
Logging experiment preparation status in history service.
Running: ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_builder.sh']
Running: ['nvidia-docker', 'build', '-f', 'azureml-environment-setup/Dockerfile', '-t', 'azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3', '.']
Sending build context to Docker daemon 410.1kB
Step 1/15 : FROM continuumio/miniconda3@sha256:54eb3dd4003f11f6a651b55fc2074a0ed6d9eeaa642f1c4c9a7cf8b148a30ceb
---> 4a51de2367be
Step 2/15 : USER root
---> Using cache
---> 42491a367cef
Step 3/15 : RUN mkdir -p $HOME/.cache
---> Using cache
---> 0771da9ffb76
Step 4/15 : WORKDIR /
---> Using cache
---> a8db57273ffb
Step 5/15 : COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
---> Using cache
---> b2a669b740ca
Step 6/15 : RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
---> Using cache
---> 1e430aeb68b0
Step 7/15 : COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
---> Using cache
---> 0c6a9fafa84b
Step 8/15 : RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_6303d702d8163bbfc0017533e979d4a3 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
---> Running in a579672607b3
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/repodata.json>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='conda.anaconda.org', port=443): Max retries exceeded with url: /conda-forge/linux-64/repodata.json (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbb8c38cda0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))"))
The command '/bin/sh -c ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_6303d702d8163bbfc0017533e979d4a3 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig' returned a non-zero code: 1
CalledProcessError(1, ['nvidia-docker', 'build', '-f', 'azureml-environment-setup/Dockerfile', '-t', 'azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3', '.'])
Building docker image failed with exit code: 1
Logging error in history service: Failed to run ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_builder.sh']
Exit code 1
Details can be found in azureml-logs/60_control_log.txt log file.
Uploading control log...
Sending final run history status...
Logging experiment failed status in history service.
Control script execution completed

That's a transient network issue. Please retry.
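If it keeps failing rather than clearing on a retry, it's worth confirming that DNS resolution works both on the DSVM and inside a container, since the underlying error is a name-resolution failure during the docker build. A rough check (the busybox image is just a convenient choice for the container-side lookup):
# 1) check DNS on the DSVM host itself
nslookup conda.anaconda.org
# 2) check DNS from inside a container, which is where conda actually runs during the image build
docker run --rm busybox nslookup conda.anaconda.org
# 3) if both resolve, resubmit the experiment and let the image build retry
If the lookup works on the host but fails in the container, the Docker daemon's DNS configuration on the DSVM is a more likely culprit than the AML service itself.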

Related

Docker Chrome Memory Leak When Using --disable-dev-shm-usage

This is a follow-up to my previous question:
I am creating a NodeJS-based image on which I install the latest Chrome and ChromeDriver, then run a NodeJS-based cron job that uses Selenium WebDriver for testing on a one-minute interval.
This runs in an Azure Container Instance, which is the simplest way to run containers in Azure.
My challenge is that Docker containers in ACI run with 64 MB of /dev/shm by default, which causes Chrome failures due to the relatively low amount of memory. Chrome provides a --disable-dev-shm-usage flag, but running with it creates a memory leak that I can't seem to figure out how to prevent. How can I best address this for my container in ACI?
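For reference, the 64 MB default and the effect of a larger shared-memory segment are easy to see with plain Docker outside ACI; I'm not assuming ACI exposes an equivalent --shm-size setting, so this is only a local sanity check:
# default /dev/shm inside a container is 64 MB
docker run --rm node:latest df -h /dev/shm
# the same image started with a larger shared-memory segment
docker run --rm --shm-size=1g node:latest df -h /dev/shm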
[Chart: Azure Container Instance container memory consumption]
Dockerfile
# 1) Build from this Dockerfile's directory:
# docker build -t "<some tag>" -f Dockerfile .
# 2) Start the image (e.g. in Docker)
# 3) Observe that the button's value is printed.
# ---------------------------------------------------------------------------------------------
# 1) Use the official NodeJS base image (Debian-based, so apt-get is available)
FROM node:latest
# 2) Install latest stable Chrome
# https://gerg.dev/2021/06/making-chromedriver-and-chrome-versions-match-in-a-docker-image/
RUN echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | \
    tee -a /etc/apt/sources.list.d/google.list && \
    wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | \
    apt-key add - && \
    apt-get update && \
    apt-get install -y google-chrome-stable libxss1
# 3) Install the Chromedriver version that corresponds to the installed major Chrome version
# https://blogs.sap.com/2020/12/01/ui5-testing-how-to-handle-chromedriver-update-in-docker-image/
RUN google-chrome --version | grep -oE "[0-9]{1,10}.[0-9]{1,10}.[0-9]{1,10}" > /tmp/chromebrowser-main-version.txt
RUN wget --no-verbose -O /tmp/latest_chromedriver_version.txt https://chromedriver.storage.googleapis.com/LATEST_RELEASE_$(cat /tmp/chromebrowser-main-version.txt)
RUN wget --no-verbose -O /tmp/chromedriver_linux64.zip https://chromedriver.storage.googleapis.com/$(cat /tmp/latest_chromedriver_version.txt)/chromedriver_linux64.zip && rm -rf /opt/selenium/chromedriver && unzip /tmp/chromedriver_linux64.zip -d /opt/selenium && rm /tmp/chromedriver_linux64.zip && mv /opt/selenium/chromedriver /opt/selenium/chromedriver-$(cat /tmp/latest_chromedriver_version.txt) && chmod 755 /opt/selenium/chromedriver-$(cat /tmp/latest_chromedriver_version.txt) && ln -fs /opt/selenium/chromedriver-$(cat /tmp/latest_chromedriver_version.txt) /usr/bin/chromedriver
# 4) Set the variable for the container working directory, create and set the working directory
ARG WORK_DIRECTORY=/program
RUN mkdir -p $WORK_DIRECTORY
WORKDIR $WORK_DIRECTORY
# 5) Install npm packages (do this AFTER setting the working directory)
COPY package.json .
RUN npm config set unsafe-perm true
RUN npm i
ENV NODE_ENV=development NODE_PATH=$WORK_DIRECTORY
# 6) Copy script to execute to working directory
COPY runtest.js .
EXPOSE 8080
# 7) Execute the script in NodeJS
CMD ["node", "runtest.js"]
runtest.js
const { Builder, By } = require('selenium-webdriver');
const { Options } = require('selenium-webdriver/chrome');
const cron = require('node-cron');
cron.schedule('*/1 * * * *', async () => await main());
async function main() {
    let driver;
    try {
        // Browser Setup
        let options = new Options()
            .headless() // run headless Chrome
            .excludeSwitches(['enable-logging']) // disable 'DevTools listening on...'
            .addArguments([
                // no-sandbox is not an advised flag due to security but eliminates "DevToolsActivePort file doesn't exist" error
                'no-sandbox',
                // Docker containers run with 64 MB of dev/shm by default, which causes Chrome failures
                // Disabling dev/shm uses tmp, which solves the problem but appears to result in memory leaks
                'disable-dev-shm-usage'
            ]);
        driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
        // Navigate to Google and get the "Google Search" button text.
        await driver.get('https://www.google.com');
        let btnText = await driver.findElement(By.name('btnK')).getAttribute('value');
        log(`Google button text: ${btnText}`);
    } catch (e) {
        log(e);
    } finally {
        if (driver) {
            await driver.close(); // helps chromedriver shut down cleanly and delete the "scoped_dir" temp directories that eventually fill up the harddrive.
            await driver.quit();
            driver = null;
            log(' Closed and quit the driver, then set to null.');
        } else {
            log(' *** No driver to close and quit ***');
        }
    }
}
function log(msg) {
    console.log(`${new Date()}: ${msg}`);
}
UPDATE
Interestingly, memory consumption seems to stabilize once it reaches a certain level. The container is allocated 2 GB of memory. I don't see crashes in my app logs, so this seems functional overall.

Fabric chaincode container (nodejs) cannot access npm

I appreciate help in this matter.
I have the latest images (2.2.0, CA 1.4.8), but I'm getting this error when installing chaincode on the first peer:
failed to invoke chaincode lifecycle, error: timeout expired while executing transaction
I'm working behind a proxy, using a VPN.
I tried to increase the timeouts in the docker config for all peers:
CORE_CHAINCODE_DEPLOYTIMEOUT=300s
CORE_CHAINCODE_STARTUPTIMEOUT=300s
The process works perfectly up to that point (channel created, peers joined the channel). The chaincode can be installed manually with npm install.
I couldn't find an answer to this anywhere. Can someone provide guidance?
UPDATE: It seems that the chaincode container gets bootstrapped (and is even assigned a random name), but gets stuck at:
+ INPUT_DIR=/chaincode/input
+ OUTPUT_DIR=/chaincode/output
+ cp -R /chaincode/input/src/. /chaincode/output
+ cd /chaincode/output
+ '[' -f package-lock.json -o -f npm-shrinkwrap.json ]
+ npm install --production
I believe it is the proxy blocking npm.
I tried to solve this with:
npm config set proxy proxy
npm config set https-proxy proxy
npm set maxsockets 3
After days of struggling, I've found a solution:
I had to build a custom fabric-nodeenv image that contains the environment variables to set up the npm proxy, as in "node chaincode instantiate behind proxy" (a sketch of such an image follows after the env vars below). After that, I set the following env vars in docker.yaml:
- CORE_CHAINCODE_NODE_RUNTIME=my_custom_image
- CORE_CHAINCODE_PULL=true
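A rough sketch of what such a custom fabric-nodeenv image could look like, built here from an inline Dockerfile; the base image tag and the proxy URL are placeholders to replace with your own values:
docker build -t my_custom_image - <<'EOF'
FROM hyperledger/fabric-nodeenv:2.2
# placeholder proxy address - replace with your real proxy
ENV HTTP_PROXY=http://proxy.example.com:8080
ENV HTTPS_PROXY=http://proxy.example.com:8080
ENV NO_PROXY=localhost,127.0.0.1
# also point npm itself at the proxy, since the chaincode builder runs npm install
RUN npm config set proxy http://proxy.example.com:8080 -g && \
    npm config set https-proxy http://proxy.example.com:8080 -g
EOF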
Useful! For users in China, npmjs may sometimes be unreachable, so they can build their own image.
For example, for version 2.3.1, download https://github.com/hyperledger/fabric-chaincode-node/tree/v.2.3.1. Then enter docker/fabric-nodeenv, modify the Dockerfile, and add a line that changes the npm registry:
RUN npm config set registry http://registry.npm.taobao.org
The entire file is as follows:
ARG NODE_VER=12.16.1
FROM node:${NODE_VER}-alpine
RUN npm config set registry http://registry.npm.taobao.org
RUN apk add --no-cache \
    make \
    python \
    g++;
RUN mkdir -p /chaincode/input \
    && mkdir -p /chaincode/output \
    && mkdir -p /usr/local/src;
ADD build.sh start.sh /chaincode/
Then build a docker image using the command
docker image build -t whatever/fabric-nodeenv:2.3 .
Wait a few minutes and it will build the image; then, with docker images, you'll see the newly created image.
Finally, in the peer's configuration file, add to the peer environment:
- CORE_CHAINCODE_NODE_RUNTIME=whatever/fabric-nodeenv:2.3
Hope it can help others!

lookup google.com on 192.168.65.1:53: cannot unmarshal DNS message

I have been struggling with getting my docker image to issue SRV record queries. It seems the Go maintainers broke the existing behavior by disregarding malformed records. I heard there was a fix, but I keep trying newer versions of Ubuntu/Alpine Linux and nothing seems to make a difference. I cannot downgrade to Go 1.10. Is there something I'm doing wrong here, like messing up my Dockerfile? How can I make this code actually work in my container?
My code:
package main

import (
	"fmt"
	"net"
)

func main() {
	net.DefaultResolver.PreferGo = true
	cname, srvs, err := net.LookupSRV("xmpp-server", "tcp", "google.com")
	if err != nil {
		panic(err)
	}
	fmt.Printf("\ncname: %s \n\n", cname)
	for _, srv := range srvs {
		fmt.Printf("%v:%v:%d:%d\n", srv.Target, srv.Port, srv.Priority, srv.Weight)
	}
	// cname: _xmpp-server._tcp.google.com.
	//
	// xmpp-server.l.google.com.:5269:5:0
	// alt2.xmpp-server.l.google.com.:5269:20:0
	// alt1.xmpp-server.l.google.com.:5269:20:0
	// alt4.xmpp-server.l.google.com.:5269:20:0
	// alt3.xmpp-server.l.google.com.:5269:20:0
}
My error:
panic: lookup google.com on 192.168.65.1:53: cannot unmarshal DNS message
goroutine 1 [running]:
main.main()
/app/run_stuff.go:12 +0x322
exit status 2
my docker file:
FROM golang:1.12
RUN mkdir /app
RUN uname -a
RUN go version
WORKDIR /app
COPY . /app/
CMD ["go","run","run_stuff.go"]
It's not really a Go issue. Running go run run_stuff.go on my Mac yields the result you're expecting:
$ go run run_stuff.go
cname: _xmpp-server._tcp.google.com.
xmpp-server.l.google.com.:5269:5:0
alt2.xmpp-server.l.google.com.:5269:20:0
alt4.xmpp-server.l.google.com.:5269:20:0
alt1.xmpp-server.l.google.com.:5269:20:0
alt3.xmpp-server.l.google.com.:5269:20:0
The issue likely has to do with your DNS settings in Docker. Using the exact same code and Dockerfile as you posted above, I ran the command docker build -t test . && docker run --rm -it --dns 8.8.8.8 test which builds and runs the container. The difference being that I set the --dns flag (see the Docker docs for details). The result being:
$ docker build -t test . && docker run --rm -it --dns 8.8.8.8 test
Sending build context to Docker daemon 27.14kB
Step 1/7 : FROM golang:1.12.4
---> b860ab44e93e
Step 2/7 : RUN mkdir /app
---> Using cache
---> 2a339a5e5fde
Step 3/7 : RUN uname -a
---> Using cache
---> dac4362453e6
Step 4/7 : RUN go version
---> Using cache
---> ae654c1c4aa6
Step 5/7 : WORKDIR /app
---> Using cache
---> db3c82038173
Step 6/7 : COPY . /app/
---> 9dba317a267d
Step 7/7 : CMD ["go","run","run_stuff.go"]
---> Running in 2ea6b38869f1
Removing intermediate container 2ea6b38869f1
---> 0a0f817b51bb
Successfully built 0a0f817b51bb
Successfully tagged test:latest
cname: _xmpp-server._tcp.google.com.
xmpp-server.l.google.com.:5269:5:0
alt4.xmpp-server.l.google.com.:5269:20:0
alt3.xmpp-server.l.google.com.:5269:20:0
alt1.xmpp-server.l.google.com.:5269:20:0
alt2.xmpp-server.l.google.com.:5269:20:0
The default DNS server (192.168.65.1, usually set in /etc/resolv.conf) isn't capable of resolving the query. You could update the DNS settings for your host system, or add the --dns flag to make your code work properly in a Docker container.
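If you prefer to fix this once at the daemon level rather than per docker run, the DNS servers can also be set in the daemon configuration; a sketch for a Linux host (note that this overwrites any existing /etc/docker/daemon.json, and Docker Desktop manages the same setting through its settings UI instead):
sudo sh -c 'cat > /etc/docker/daemon.json' <<'EOF'
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
sudo systemctl restart docker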

byfn.sh fails with binaries and docker images out of sync

Following the Build Your First Network tutorial in the Hyperledger docs, when I type
$ ./byfn.sh up
I get the following warning:
LOCAL_VERSION=1.2.0
DOCKER_IMAGE_VERSION=1.2.1
=================== WARNING ===================
Local fabric binaries and docker images are
out of sync. This may cause problems.
===============================================
and finally the script fails with this error
Error: failed to create deliver client: orderer client failed to connect to orderer.example.com:7050: failed to create new connection: context deadline exceeded
!!!!!!!!!!!!!!! Channel creation failed !!!!!!!!!!!!!!!!
Restart the bootstrap process by cd-ing into the fabric-samples/scripts folder and running bootstrap.sh again, like this:
$ ./bootstrap.sh
This got my images in sync, as shown in the output:
LOCAL_VERSION=1.3.0
DOCKER_IMAGE_VERSION=1.3.0
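If you want to double-check the two sides independently, the local binary version and the pulled image tags can be compared directly; a quick check, assuming you run it from the fabric-samples folder:
# version reported by the locally downloaded binaries
./bin/peer version | grep Version
# tags of the Fabric peer images byfn.sh will use
docker images hyperledger/fabric-peer --format '{{.Repository}}:{{.Tag}}'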
You need to first delete the existing binaries and then download the newer version of Hyperledger Fabric. You can run the following commands to do this:
cd fabric-samples
rm -Rf bin
curl -sSL https://raw.githubusercontent.com/hyperledger/fabric/master/scripts/bootstrap.sh | bash -s 1.3.0
./scripts/bootstrap.sh

docker stop doesn't work for node process

I want to be able to run node inside a docker container, and then be able to run docker stop <container>. This should stop the container on SIGTERM rather than timing out and doing a SIGKILL. Unfortunately, I seem to be missing something, and the information I have found seems to contradict other bits.
Here is a test Dockerfile:
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y curl
RUN curl -sSL http://nodejs.org/dist/v0.11.14/node-v0.11.14-linux-x64.tar.gz | tar -xzf -
ADD test.js /
ENTRYPOINT ["/node-v0.11.14-linux-x64/bin/node", "/test.js"]
Here is the test.js referred to in the Dockerfile:
var http = require('http');
var server = http.createServer(function (req, res) {
console.log('exiting');
process.exit(0);
}).listen(3333, function (err) {
console.log('pid is ' + process.pid)
});
I build it like so:
$ docker build -t test .
I run it like so:
$ docker run --name test -p 3333:3333 -d test
Then I run:
$ docker stop test
Whereupon the SIGTERM apparently doesn't work, causing it to time out 10 seconds later and then die.
I've found that if I start the node task through sh -c then I can kill it with ^C from an interactive (-it) container, but I still can't get docker stop to work. This is contradictory to comments I've read saying sh doesn't pass on the signal, but might agree with other comments I've read saying that PID 1 doesn't get SIGTERM (since it's started via sh, it'll be PID 2).
The end goal is to be able to run docker start -a ... in an upstart job, and to be able to stop the service and have it actually exit the container.
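One way I can narrow it down is to look at what Docker thinks PID 1 is inside the container and send SIGTERM by hand; a quick sanity check using the container started above:
# list the container's processes as Docker sees them; if PID 1 is a shell rather than node,
# the TERM signal from docker stop never reaches the node process
docker top test
# send SIGTERM explicitly and see whether the container actually exits
docker kill --signal=SIGTERM test
docker ps -a --filter name=test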
My way to do this is to catch SIGINT (interrupt signal) in my JavaScript.
process.on('SIGINT', () => {
    console.info("Interrupted");
    process.exit(0);
})
This should do the trick when you press Ctrl+C.
Ok, I figured out a workaround myself, which I'll venture as an answer in the hope it helps others. It doesn't completely answer why the signals weren't working before, but it does give me the behaviour I want.
Using baseimage-docker seems to solve the issue. Here's what I did to get this working with the minimal test example above:
Keep test.js as is.
Modify the Dockerfile to look like the following:
FROM phusion/baseimage:0.9.15
# disable SSH
RUN rm -rf /etc/service/sshd /etc/my_init.d/00_regen_ssh_host_keys.sh
# install curl and node as before
RUN apt-get update && apt-get install -y curl
RUN curl -sSL http://nodejs.org/dist/v0.11.14/node-v0.11.14-linux-x64.tar.gz | tar -xzf -
# the baseimage init process
CMD ["/sbin/my_init"]
# create a directory for the runit script and add it
RUN mkdir /etc/service/app
ADD run.sh /etc/service/app/run
# install the application
ADD test.js /
baseimage-docker includes an init process (/sbin/my_init) which handles starting other processes and dealing with zombie processes. It uses runit for service supervision. The Dockerfile therefore sets the my_init process as the command to run on boot, and adds a script under /etc/service for runit to pick up.
The run.sh script is simple:
#!/bin/sh
exec /node-v0.11.14-linux-x64/bin/node /test.js
Don't forget to chmod +x run.sh!
By default, runit will automatically restart the service if it goes down.
Following these steps (and building, running, and stopping as before), the container responds promptly and properly to requests to shut down.
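As a quick check that the new image behaves as intended, timing docker stop should show it returning almost immediately instead of hanging for the ten-second SIGKILL timeout (container name and tag reused from the example above):
docker build -t test .
docker run --name test -p 3333:3333 -d test
time docker stop test   # expect roughly a second, not the ten-second timeout
docker rm test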
