Bitbucket self-hosted runner failing during step setup - bitbucket-pipelines

I tried posting this in the Atlassian/Bitbucket community support forum, but my posts aren't going through over there. They aren't even showing up as pending moderation.
I can't figure out why a step is failing during setup. I'm running a self-hosted runner (and giving 6 GB of memory to the runner container, so I don't think this is an OOM error), and I'm testing a fairly simple pipeline:
Step 1 - bump version as needed - super lightweight - passes fine.
Step 2 - build a Go executable into a Docker image, publish it to gcr.io - still pretty lightweight - fails during setup
Here is my bitbucket-pipelines.yml:
image:
  name: gcr.io/novo00/bitbucket-pipelines:latest
  username: _json_key
  password: '$NOVO_GCR_IO_JSON_KEY'
pipelines:
  branches:
    master:
      - step:
          runs-on:
            - self.hosted
          script:
            - build/bitbucket_pipelines/version.sh
      - step:
          runs-on:
            - self.hosted
          script:
            - build/bitbucket_pipelines/build.sh
          services:
            - docker
Here is the output of "Build setup" in the pipeline dashboard:
+ umask 000
+ GIT_LFS_SKIP_SMUDGE=1 retry 6 git clone --branch="master" --depth 50 https://x-token-auth:$REPOSITORY_OAUTH_ACCESS_TOKEN@bitbucket.org/$BITBUCKET_REPO_FULL_NAME.git $BUILD_DIR
Cloning into '/opt/atlassian/pipelines/agent/build'...
Nothing after that, then it goes into "Build teardown".
Here is a snippet of my runner's stdout:
[2021-07-27 21:52:30,706] Starting container.
[2021-07-27 21:52:30,739] Adding container log: /var/lib/docker/containers/02152994b7d0485fa65a5dd8a4dd5254ea1c7a41b9b00f98ed5478245e3255da/02152994b7d0485fa65a5dd8a4dd5254ea1c7a41b9b00f98ed5478245e3255da-json.log
[2021-07-27 21:52:30,740] Waiting on container to exit.
[2021-07-27 21:52:30,741] Creating exec into container.
[2021-07-27 21:52:30,749] Starting exec into container and waiting for exec to exit.
[2021-07-27 21:52:30,949] Adding container log: /var/lib/docker/containers/1467fcbb959c9f17eacd2bc04967e4fe8042ba1fbc3bcb0a345d6da6db1717d0/1467fcbb959c9f17eacd2bc04967e4fe8042ba1fbc3bcb0a345d6da6db1717d0-json.log
[2021-07-27 21:52:30,949] Waiting on container to exit.
[2021-07-27 21:52:30,981] Adding container log: /var/lib/docker/containers/a62ece4849a5ee784f4f483d58eb87c3684f52ebca35308e6d485e014a17ac99/a62ece4849a5ee784f4f483d58eb87c3684f52ebca35308e6d485e014a17ac99-json.log
[2021-07-27 21:52:30,981] Waiting on container to exit.
[2021-07-27 21:52:31,051] Container has state (exitCode: Some(4), OOMKilled Some(false))
[2021-07-27 21:52:31,054] Removing container build
[2021-07-27 21:52:31,065] Not uploading caches. (numberOfCaches: 0, resultOrError: FAILED)
[2021-07-27 21:52:31,066] Not uploading artifacts. (numberOfArtifacts: 0, resultOrError: FAILED)
[2021-07-27 21:52:31,066] Updating step progress to PARSING_TEST_RESULTS.
[2021-07-27 21:52:31,401] Test report processing complete.
[2021-07-27 21:52:31,401] Removing container clone
[2021-07-27 21:52:31,562] Removing container clone
[2021-07-27 21:52:31,565] Removing container build
[2021-07-27 21:52:31,568] Removing container system-docker
[2021-07-27 21:52:31,583] Removing container system-auth-proxy
[2021-07-27 21:52:31,698] Removing container pause
[2021-07-27 21:52:31,749] Appending log line to log: {bc88f95e-9f7b-425a-adaf-e3060f3a53ee}.
[2021-07-27 21:52:31,754] Appending log line to main log.
[2021-07-27 21:52:31,861] Updating step progress to COMPLETING_LOGS.
[2021-07-27 21:52:31,972] Appending log line to log: {71708781-63f3-46de-8042-730f079fe9d6}.
[2021-07-27 21:52:32,009] Shutting down log uploader.
[2021-07-27 21:52:32,224] Tearing down directories.
[2021-07-27 21:52:32,227] Cancelling timeout
[2021-07-27 21:52:32,228] Completing step with result Result{status=FAILED, error=None}.
Notice the Container has state (exitCode: Some(4), OOMKilled Some(false)). Any help? Any other data I could provide? I tried looking at some of the logs mentioned in the output, but they were either no longer there or didn't contain much useful information.
UPDATE: removing services: - docker from step 2 fixes the current issue, but I need the docker service available to that step. Looking for a fix now.
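One thing that might be worth trying (this is an assumption on my part, not a confirmed fix) is keeping the docker service but explicitly allocating it more memory via the definitions section of bitbucket-pipelines.yml, something like:
definitions:
  services:
    docker:
      memory: 3072  # MB; 3072 is an example value, assuming the default allocation is too small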

I've run into a couple of problems while using self-hosted runners. After two days of digging through the internet, here is what I know:
Self-hosted Docker runners work well in most cases.
They work well except in the Docker-in-Docker case. If you need to run Docker inside your pipeline, you have to use specific Linux distributions and versions. I've tested Debian 9, Debian 11, Ubuntu 20.04 LTS, CentOS 8, and CentOS 7.9.2009; only the last one works correctly.
Even if you use a compatible Linux installation, you can still run into various problems during your docker build process. Thanks to this post, I've fixed them.
If you use WSL2 and Docker Desktop, you won't have any logs from the containers. That's because Docker Desktop stores its log files in a different location (and they're not available from the executing WSL machines). If you need logs (e.g. for debugging) and you use Windows, just install VirtualBox with Linux and set up Docker and an additional runner inside it.

Packer failed when executed on Gitlab-runner

I have a Packer file to deploy CentOS 7 using the vsphere-iso builder that works OK when executed directly on a Linux server, but when I try to run the same Packer file using a gitlab-runner it fails, as it does not wait until the OS is installed. It fails after waiting for 1 minute, but if I run the packer command with -on-error=run-cleanup-provisioner the OS install finishes successfully, so clearly the issue is that Packer is just not waiting.
2021/07/20 12:02:40 packer.io plugin: [INFO] Waiting for IP, up to total timeout: 30m0s, settle timeout: 5m0s
==> vsphere-iso.autogenerated_1: Waiting for IP...
==> vsphere-iso.autogenerated_1: Clear boot order...
==> vsphere-iso.autogenerated_1: Power off VM...
==> vsphere-iso.autogenerated_1: Destroying VM...
2021/07/20 12:03:12 [INFO] (telemetry) ending
==> Wait completed after 1 minute 2 seconds
2021/07/20 12:03:12 machine readable: error-count []string{"1"}
==> Some builds didn't complete successfully and had errors:
My boot command is the following as I do not use DHCP.
boot_command = ["<up><tab> text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/vmware-ks.cfg ip=10.118.12.117::10.118.12.1:255.255.255.0:{{ .Name }}.localhost:ens192:none<enter><wait>"]
I have tested using options like ssh_host, ip_wait_address, ip_settle_timeout, ssh_wait_timeout, and pause_before_connecting, but nothing seems to work.
As I said, the same Packer pkr.hcl file works OK if I run it manually on a regular Linux server, but not on my gitlab-runner, which is installed directly on my GitLab server (yes, I know that's not best practice, but I only use the runner for this task).
Packer versions 1.7.2 and 1.7.3 tested, gitlab-runner 14.0.0 and 14.0.1 tested.
Managed to make it work by changing the last <wait> in my boot command to <wait5m>. This gives the OS enough time to get installed and the VM rebooted.
New boot command boot_command = ["<up><tab> text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/vmware-ks.cfg ip=10.118.12.117::10.118.12.1:255.255.255.0:{{ .Name }}.localhost:ens192:none<enter><wait5m>"]
All the other wait options from packer are no longer needed with this boot command.
Doing some tests, I also managed to make it work by creating a firewall drop rule for the VM just after the kickstart file was loaded and removing the rule once the OS was installed. Definitely, Packer is just ignoring all of its native wait mechanisms when running on the gitlab-runner.
EDIT: After having the same issue with my Windows templates, I tested using a different gitlab-runner installed on a different server instead of the one on the GitLab server itself, and it worked perfectly with my initial configuration for both Windows and CentOS.

Gitlab runner starting another job before the previous one finishes

I have one gitlab runner configured for a single project. The issue I am seeing is that the runner will not wait until the prior job finishes; instead it does a checkout in the same directory as the prior job and stomps over everything. I have one job already running, and then another developer commits, so another job is started. How can I configure the pipeline so that it doesn't corrupt the already running workspace?
Here is the log from both of the jobs (the only difference is the timestamp):
Running with gitlab-runner 12.6.0 (ac8e767a)
  on gitlab.xxxx.com rz8RmGp4
section_start:1578357551:prepare_executor
Using Docker executor with image my-image-build ...
Using locally found image version due to if-not-present pull policy
Using docker image sha256:xxxxxxxxxx for my-image-build ...
section_end:1578357553:prepare_executor
section_start:1578357553:prepare_script
Running on runner-rz8RmGp4-project-23-concurrent-0 via gitlab.xxxx.com...
section_end:1578357554:prepare_script
section_start:1578357554:get_sources
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/my-project/.git/
<proceeds to checkout and stomp over the already running runner>
The main issue I see is that they both check out to the same directory (Initialized empty Git repository in /builds/my-project/.git/), which causes the problem.
You can use resource_group to keep jobs from running in parallel.
e.g.
Job 1:
  stage: My Stage
  resource_group: stage-wedge
  ...

Job 2:
  stage: My Stage
  resource_group: stage-wedge
  ...
In the above example Job 2 will run after Job 1 is finished.
Jobs of the same stage are executed in parallel.
If you need them to run sequentially, you can add a separate stage for each of those jobs, as sketched below.
See the docs:
https://docs.gitlab.com/ee/ci/yaml/#stages
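A minimal .gitlab-ci.yml sketch of that approach (the job names and script paths here are made up for illustration):
stages:
  - stage-one
  - stage-two

job_one:
  stage: stage-one
  script:
    - ./build.sh   # hypothetical script

job_two:
  stage: stage-two
  script:
    - ./deploy.sh  # hypothetical script; runs only after all stage-one jobs finish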
In case of multiple pipelines running, you may want to configure the gitlab-runner limit / concurrent options:
https://docs.gitlab.com/runner/configuration/advanced-configuration.html
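For example, a sketch of the relevant parts of config.toml (the runner details are placeholders):
# /etc/gitlab-runner/config.toml
concurrent = 1          # max jobs this gitlab-runner process runs at once, across all runners

[[runners]]
  name = "my-runner"
  url = "https://gitlab.example.com/"
  token = "RUNNER_TOKEN"
  executor = "docker"
  limit = 1             # max jobs this particular runner runs at once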

gitlab-runner: Pipeline is pending infinitely

I installed a specific runner, and its status is active.
My .gitlab-ci.yml file:
stages:
  - build

build_maven:
  stage: build
  only:
    - master
  script:
    - echo "hello CI/CD"
  tags:
    - vue-dev-pub
When I push to the master branch, the gitlab-runner is running, but the pipeline is pending infinitely.
The job page shows:
This job has not started yet
This job is in pending state and is waiting to be picked by a runner
If I execute the runner manually, the job passes.
The command gitlab-runner verify shows:
Runtime platform arch=amd64 os=linux pid=24616 revision=d0b76032 version=12.0.2
WARNING: Running in user-mode.
WARNING: The user-mode requires you to manually start builds processing:
WARNING: $ gitlab-runner run
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...
Verifying runner... is alive runner=T4iKvsT3
I am waiting for your response, thanks!
If you run the runner manually in debug mode (gitlab-runner --debug run), you may see the actual error message. In my case it was:
WARNING: Failed to process runner builds=0 error=failed to update executor: missing Machine options executor=docker+machine runner=pSUsX4yR
That's because on runner creation I selected the docker+machine executor rather than docker.
After amending /etc/gitlab-runner/config.toml to use the docker executor and running gitlab-runner restart followed by gitlab-runner verify, the pipeline started running again.
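For reference, a sketch of what the relevant part of config.toml might look like after the change (everything except the executor line is a placeholder):
[[runners]]
  name = "my-docker-runner"
  url = "https://gitlab.example.com/"
  token = "RUNNER_TOKEN"
  executor = "docker"     # was "docker+machine"
  [runners.docker]
    image = "alpine:latest"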
I had a similar problem with my (shell) runners on Linux. It would work fine for runners installed and registered on one of my computers but not another (even though the tags matched correctly between runner and job).
After gitlab-runner register I would get:
New runner. Has not connected yet
After gitlab-runner verify that error would go away, but I would still get This job is in pending state and is waiting to be picked by a runner.
After gitlab-runner restart it would all work:
gitlab-runner status
gitlab-runner: Service is running!
Maybe you have tagged your runner but your job has no tags. Refer to: how to run untagged jobs
https://stackoverflow.com/a/53371027/10570524
The tags section in your .gitlab-ci.yml file specifies this job has to be picked by a runner that has the same tags (reference).
tags:
  - vue-dev-pub
So unless there is actually a runner available for your project that has the vue-dev-pub tag it will keep waiting for one to become available.
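If the runner really is missing the tag, one option (a sketch; the URL and token are placeholders) is to re-register it with a matching tag, and optionally allow it to pick up untagged jobs as well:
gitlab-runner register \
  --url https://gitlab.example.com/ \
  --registration-token <project-registration-token> \
  --executor shell \
  --tag-list vue-dev-pub \
  --run-untagged="true"   # optional: also run jobs that have no tags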
First, remove the old systemd unit:
rm /etc/systemd/system/gitlab-runner.service
Then install gitlab-runner with the gitlab-runner user:
gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
Root installations fail.

Kubernetes fails to deploy valid container image

I have a Docker image containing a NodeJS app. The Dockerfile is:
FROM node:8
WORKDIR /app
ADD . /app
RUN npm install
EXPOSE 80
ENTRYPOINT [ "/bin/sh", "./start.sh" ]
The start.sh script is:
#!/bin/bash
...
echo "Starting application"
npm start
I'm able to launch and test the image manually:
$ gcloud docker -- run -it --rm my-container
...
Starting application
...
> node index.js
...
The same container is used by a kubernetes deployment:
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - image: my-container
          ...
The container starts, the start.sh script is correctly executed but it terminates and the container goes into a CrashLoopBackOff loop.
After inspecting the pod manually:
kubectl exec -ti my-pod -- bash
I have no name!#my-pod:/app# cat /etc/passwd
... empty response
-> It appears that somehow there are no system users on the container, which makes most commands (like npm) fail silently and terminate the container
I have also tried, without success:
deleting the pod
deleting and re-creating the deployment
running the node image with the node user -> unable to find user node: no matching entries in passwd file
Last note: I actually have many deployments (using the same template with just a different name) which are running fine with an image that was built a few days ago with the same source code.
For some deployments, it actually worked after manually deleting the pod and letting kubernetes recreate it.
Any ideas?
Edit 18/01/2018: I have tried rebuilding an image with the same source code that the old working images use, without success. I have also tried a simpler Dockerfile:
FROM node:8
USER node
But I still get an error related to the fact that no users seem to be there:
Error response from daemon: {"message":"linux spec user: unable to find user node: no matching entries in passwd file"}
I have checked with the docker-node maintainers; the image hasn't changed recently. Could it be related to Kubernetes changes? Keep in mind that my images do run when I run them manually with the docker command.
I tried to reproduce your issue, but didn't get it to fail in anything like the same fashion. I made a dummy express app that matches your example above and stuck it on GitHub, then deployed it to a local minikube instance I had. The base image size is reasonably large, but it started up just fine.
I had to interpret what was happening within npm start for your example since you didn't specify, but you can see my package.json, which I suspect is pretty close to what you're doing based on the description.
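For context, npm start just runs whatever the "start" script in package.json says, so based on the log output below it's presumably something close to this (the express dependency and its version are guesses on my part):
{
  "name": "blah",
  "version": "1.0.0",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "express": "^4.16.0"
  }
}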
When I fire this up:
git clone https://github.com/heckj/dummyexpress
cd dummyexpress
kubectl apply -f deploy/
Then I got a running instance right off the bat:
NAME READY STATUS RESTARTS AGE
dummynodeapp-7788b95497-tkw2s 1/1 Running 0 1d
And the logs show pretty much what you'd expect:
kubectl log dummynodeapp-7788b95497-tkw2s
W0117 19:41:00.986498 20648 cmd.go:353] log is DEPRECATED and will be removed in a future version. Use logs instead.
Starting application
> blah@1.0.0 start /app
> node index.js
Example app listening on port 3000!
My guess is that you've got something going awry within your npm start execution, so I'd recommend fiddling with that aspect of your deployment and see if you can't resolve it that way.
Well, as @heckj pointed out, it was a Docker issue on my Kubernetes cluster. I updated the cluster from 1.6.13-gke.1 to v1.7.12-gke.0 and the pods worked fine again. I'm not sure what Docker version was used, since there's another Kubernetes bug that is preventing me from seeing it.

Does 'docker run' modify image state?

I have a Dockerfile that uses an Ubuntu base image and installs a bunch of dependencies with apt-get and dpkg. Then it copies some JavaScript files and runs a Node app. The Node app spawns a child process and executes xvfb-run selenium-standalone start.
If I build the docker image with --no-cache and run it using docker run -i -t <image id> my app starts and connects to the selenium server immediately. If I kill the container using CTRL-C or docker stop <container id> and then run the exact same docker run command as above, my app starts as normal, but cannot connect to the selenium server. If I leave it alone, five minutes later, it will connect properly on its own. It behaves this way every time I run docker run until I do a clean image build.
Changing a node source file and rebuilding mostly from cache does not alter this behavior. I've repeated the process several times and it's always the same.
I can't figure out how the behavior can change from one docker run to the next, if the same image is used. Where is the shared state?
Log when working:
gulp run
22:42:31.541 INFO - Launching a standalone Selenium Server
Setting system property webdriver.chrome.driver to /usr/lib/node_modules/selenium-standalone/.selenium/chromedriver/2.16-x64-chromedriver
22:42:31.579 INFO - Java: Oracle Corporation 24.79-b02
22:42:31.579 INFO - OS: Linux 3.18.5-tinycore64 amd64
22:42:31.594 INFO - v2.46.0, with Core v2.46.0. Built from revision 87c69e2
22:42:31.676 INFO - Driver provider org.openqa.selenium.ie.InternetExplorerDriver registration is skipped:
registration capabilities Capabilities [{platform=WINDOWS, ensureCleanSession=true, browserName=internet explorer, version=}] does not match the current platform LINUX
22:42:31.676 INFO - Driver class not found: com.opera.core.systems.OperaDriver
22:42:31.677 INFO - Driver provider com.opera.core.systems.OperaDriver is not registered
[22:42:31] Using gulpfile /opt/app/gulpfile.js
[22:42:31] Starting 'run'...
[22:42:31] Finished 'run' after 1.29 ms
Started App.
22:42:31.764 INFO - RemoteWebDriver instances should connect to: http://127.0.0.1:4444/wd/hub
22:42:31.764 INFO - Selenium Server is up and running
Selenium started
2015-08-19T22:42:32.445Z Starting app on port: 8000
Logs when not working are exactly the same except missing the RemoteWebDriver, 'Selenium Server is up and running', and 'Selenium started.' lines.
Try removing the container instead of just stopping it:
docker stop <container id>
docker rm <container id>
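Relatedly, if you never need the stopped container afterwards, you can have Docker remove it automatically on exit by adding the --rm flag to the original command:
docker run -i -t --rm <image id>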
