How do I solve this error when running Apache Spark 2.3.0 on Kubernetes with a jar from a remote source?

Following the instructions here I have been trying to submit a spark job to minikube, using a remote URL:
minikube start
bin/spark-submit --master k8s://https://192.168.99.100:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=<default-spark-k8s-image-build> --conf spark.kubernetes.namespace=spark <https://remote-location-with-spark-example-jar>
The pod fails and when I describe it I get the error configmaps "spark-pi-ad386ea0f7e4333dbd2a0ad705e94d66-init-config" not found:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 50s default-scheduler Successfully assigned spark-pi-ad386ea0f7e4333dbd2a0ad705e94d66-driver to minikube
Warning FailedMount 49s kubelet, minikube MountVolume.SetUp failed for volume "spark-init-properties" : configmaps "spark-pi-ad386ea0f7e4333dbd2a0ad705e94d66-init-config" not found
Normal SuccessfulMountVolume 49s kubelet, minikube MountVolume.SetUp succeeded for volume "download-jars-volume"
Normal SuccessfulMountVolume 49s kubelet, minikube MountVolume.SetUp succeeded for volume "download-files-volume"
Normal SuccessfulMountVolume 49s kubelet, minikube MountVolume.SetUp succeeded for volume "default-token-4ghj8"
Normal SuccessfulMountVolume 49s kubelet, minikube MountVolume.SetUp succeeded for volume "spark-init-properties"
Normal Pulled 49s kubelet, minikube Container image "timg-spark/spark:latest" already present on machine
Normal Created 49s kubelet, minikube Created container
Normal Started 48s kubelet, minikube Started container
Normal Pulled 43s kubelet, minikube Container image "timmeh/spark:latest" already present on machine
Normal Created 43s kubelet, minikube Created container
Normal Started 43s kubelet, minikube Started container
However, the docs do not mention creating any configmaps, and because the name of the configmap isn't known until you run spark-submit, I can't create one in advance to get more information.
For now my plan is to work around this by baking the jar files into the Spark Docker image, but if anyone knows more about why this is failing, that'd be great!

Related

AKS cannot pull Docker image from private registry with Let's Encrypt certificate

I am getting an x509 certificate error when AKS tries to pull a Docker image from my private registry, which is secured with a Let's Encrypt certificate. How can I manage the certificate store in AKS so that I can add the CA of my certificate?
Normal Scheduled 8m8s default-scheduler Successfully assigned default/proxy-deployment-568646f8d4-7gnnt to aks-default-26787434-vmss000000
Normal Pulling 6m34s (x4 over 8m7s) kubelet Pulling image "my registry/my-image:lts"
Warning Failed 6m34s (x4 over 8m7s) kubelet Failed to pull image "my registry/my-image:lts": rpc error: code = Unknown desc = Error response from daemon: Get https://my registry/v2/: x509: certificate signed by unknown authority
Warning Failed 6m34s (x4 over 8m7s) kubelet Error: ErrImagePull
Normal BackOff 6m18s (x6 over 8m7s) kubelet Back-off pulling image "my registry/my-image:lts"
Warning Failed 3m5s (x19 over 8m7s) kubelet Error: ImagePullBackOff

Spark submit on Kubernetes cloud engine using only one node, one cpu requested

I have set up a cluster with 4 nodes, each with 2 CPUs, so 8 CPUs in total.
My code ends up running on only one CPU. No matter the settings, the requested CPU count in the pod description is always 1 and the execution time stays the same. I tried the Spark examples and the same thing happens.
The spark-submit script I use:
./bin/spark-submit \
--master k8s://https://34.64.87.144 \
--deploy-mode cluster \
--name spark-counter \
--class DataCounter \
--driver-java-options "-Dlog4j.configuration=file:////opt/spark/data/log4j.properties" \
--conf spark.executor.cores=4 \
--conf spark.kubernetes.executor.request.cores=3.6 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=spark-counter \
--conf spark.kubernetes.container.image=asia.gcr.io/profound-media-298808/spark-base:latest \
local:///opt/spark/data/spark_counter-1.0.jar /opt/spark/data/input1
and the description of the pod
spark-role=driver
Annotations: <none>
Status: Succeeded
IP: 10.32.4.37
IPs:
IP: 10.32.4.37
Containers:
spark-kubernetes-driver:
Container ID: containerd://bd31b8112159145169ab1b6397af8bc2f10cee5429b11c8025f2359ab5194882
Image: asia.gcr.io/profound-media-298808/spark-base:latest
Image ID: asia.gcr.io/profound-media-298808/spark-base@sha256:6aaf817da5606a39bf2aeea769c4ec2d62c7986d06109cb4a38f4f7157702ff1
Ports: 7078/TCP, 7079/TCP, 4040/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
driver
--properties-file
/opt/spark/conf/spark.properties
--class
DataCounter
spark-internal
/opt/spark/data/input1
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 18 Dec 2020 07:13:45 +0000
Finished: Fri, 18 Dec 2020 07:14:28 +0000
Ready: False
Restart Count: 0
Limits:
memory: 1408Mi
Requests:
cpu: 1
memory: 1408Mi
Environment:
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
SPARK_LOCAL_DIRS: /var/data/spark-f4832458-7450-4d96-b7ed-c672d5ec0eda
SPARK_CONF_DIR: /opt/spark/conf
Mounts:
/opt/spark/conf from spark-conf-volume (rw)
/var/data/spark-f4832458-7450-4d96-b7ed-c672d5ec0eda from spark-local-dir-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from spark-token-5zk2c (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
spark-local-dir-1:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-conf-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: spark-counter-1608275621712-driver-conf-map
Optional: false
spark-token-5zk2c:
Type: Secret (a volume populated by a Secret)
SecretName: spark-token-5zk2c
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m44s default-scheduler Successfully assigned default/spark-counter to gke-cluster-1-default-pool-f51b7df5-g048
Warning FailedMount 7m43s (x2 over 7m43s) kubelet MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "spark-counter-1608275621712-driver-conf-map" not found
Normal Pulled 7m42s kubelet Container image "asia.gcr.io/profound-media-298808/spark-base:latest" already present on machine
Normal Created 7m42s kubelet Created container spark-kubernetes-driver
Normal Started 7m42s kubelet Started container spark-kubernetes-driver
I have no additional configuration files set up. I used the Spark image builder script as a base to build the image, adding only my own jar and data to it. Should I have something more?
How do I set up my cluster to utilize all nodes?

Why does the pod terminate itself?

I am trying to install Fluentd with Elasticsearch and Kibana using the Bitnami Helm chart.
I am following the article mentioned below:
Integrate Logging Kubernetes Kibana ElasticSearch Fluentd
But when I deploy Elasticsearch, its pod goes into a Terminating or Back-off state.
I have been stuck on this for 3 days; any help is appreciated.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 41m (x2 over 41m) default-scheduler error while running "VolumeBinding" filter plugin for pod "elasticsearch-master-0": pod has unbound immediate PersistentVolumeClaims
Normal Scheduled 41m default-scheduler Successfully assigned default/elasticsearch-master-0 to minikube
Normal Pulling 41m kubelet, minikube Pulling image "busybox:latest"
Normal Pulled 41m kubelet, minikube Successfully pulled image "busybox:latest"
Normal Created 41m kubelet, minikube Created container sysctl
Normal Started 41m kubelet, minikube Started container sysctl
Normal Pulling 41m kubelet, minikube Pulling image "docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.6"
Normal Pulled 39m kubelet, minikube Successfully pulled image "docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.6"
Normal Created 39m kubelet, minikube Created container chown
Normal Started 39m kubelet, minikube Started container chown
Normal Created 38m kubelet, minikube Created container elasticsearch
Normal Started 38m kubelet, minikube Started container elasticsearch
Warning Unhealthy 38m kubelet, minikube Readiness probe failed: Get http://172.17.0.7:9200/_cluster/health?local=true: dial tcp 172.17.0.7:9200: connect: connection refused
Normal Pulled 38m (x2 over 38m) kubelet, minikube Container image "docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.6" already present on machine
Warning FailedMount 32m kubelet, minikube MountVolume.SetUp failed for volume "config" : failed to sync configmap cache: timed out waiting for the condition
Normal SandboxChanged 32m kubelet, minikube Pod sandbox changed, it will be killed and re-created.
Normal Pulling 32m kubelet, minikube Pulling image "busybox:latest"
Normal Pulled 32m kubelet, minikube Successfully pulled image "busybox:latest"
Normal Created 32m kubelet, minikube Created container sysctl
Normal Started 32m kubelet, minikube Started container sysctl
Normal Pulled 32m kubelet, minikube Container image "docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.6" already present on machine
Normal Created 32m kubelet, minikube Created container chown
Normal Started 32m kubelet, minikube Started container chown
Normal Pulled 32m (x2 over 32m) kubelet, minikube Container image "docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.6" already present on machine
Normal Created 32m (x2 over 32m) kubelet, minikube Created container elasticsearch
Normal Started 32m (x2 over 32m) kubelet, minikube Started container elasticsearch
Warning Unhealthy 32m kubelet, minikube Readiness probe failed: Get http://172.17.0.6:9200/_cluster/health?local=true: dial tcp 172.17.0.6:9200: connect: connection refused
Warning BackOff 32m (x2 over 32m) kubelet, minikube Back-off restarting failed container
The issue here is that the pod has unbound immediate PersistentVolumeClaims. You can set master.persistence.enabled to false when deploying with Helm. Alternatively, check whether a default storage class exists in the cluster; if it doesn't, create a storage class and make it the default.
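The storage class check can be sketched in Python as follows. This is only an illustration, assuming you have saved the output of `kubectl get storageclass -o json`; the function name and the sample data are hypothetical. It relies on the `storageclass.kubernetes.io/is-default-class` annotation that Kubernetes uses to mark a default storage class.

```python
import json

# Annotation Kubernetes uses to mark a StorageClass as the cluster default.
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storageclass_list):
    """Return the names of storage classes annotated as the default."""
    names = []
    for item in storageclass_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(metadata.get("name"))
    return names


# Hypothetical sample shaped like `kubectl get storageclass -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "standard",
                  "annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}},
    {"metadata": {"name": "slow"}}
  ]
}
""")

print(default_storage_classes(sample))
```

If the returned list is empty, a PersistentVolumeClaim that names no storage class has nothing to bind to, which would be consistent with the "unbound immediate PersistentVolumeClaims" event.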
Short answer: it crashed. You can check the Pod status object for details such as the exit status and whether the container was OOM-killed, and then look at the container logs to see if they show anything.
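As a rough sketch of what "check the Pod status object" can look like, the Python below summarizes the container statuses of a Pod loaded from `kubectl get pod <name> -o json`. The field names (`containerStatuses`, `lastState.terminated.exitCode`, `reason`, `restartCount`) are those of the Kubernetes Pod API; the helper name and the sample values are hypothetical and are not taken from the question.

```python
def summarize_container_statuses(pod):
    """Summarize restart-related fields for each container of a Pod dict."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        terminated = cs.get("lastState", {}).get("terminated") or {}
        summaries.append({
            "name": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "last_exit_code": terminated.get("exitCode"),
            "last_reason": terminated.get("reason"),
            # The kubelet reports reason "OOMKilled" for memory-limit kills.
            "oom_killed": terminated.get("reason") == "OOMKilled",
        })
    return summaries


# Hypothetical Pod status, for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [{
            "name": "elasticsearch",
            "restartCount": 2,
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
        }]
    }
}

for summary in summarize_container_statuses(sample_pod):
    print(summary)
```

A non-zero exit code or an `OOMKilled` reason narrows the search before you turn to the container logs.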

NodeJs api container crashing in kubernetes

As part of the CI/CD pipeline I deploy my web API to Kubernetes, but the most recent branch I'm working on keeps crashing.
I have made sure the app runs locally for all the configurations, and the CI/CD pipeline on the master branch succeeds. I'm assuming some change I introduced is making the app fail, but I can't see any problem in the logs.
This is my Dockerfile:
FROM node:12
WORKDIR /usr/src/app
ARG NODE_ENV
ENV NODE_ENV $NODE_ENV
COPY package.json /usr/src/app/
RUN npm install
COPY . /usr/src/app
ENV PORT 5000
EXPOSE $PORT
CMD [ "npm", "start" ]
This is what I get when I run kubectl describe on the corresponding pod:
Controlled By: ReplicaSet/review-refactor-e-0jmik1-7f75c45779
Containers:
auto-deploy-app:
Container ID: docker://8d6035b8ee0938262ea50e2f74d3ab627761fdf5b1811460b24f94a74f880810
Image: registry.gitlab.com/hidden-fox/metadata-service/refactor-endpoints:5e986c65d41743d9d6e6ede441a1cae316b3e751
Image ID: docker-pullable://registry.gitlab.com/hidden-fox/metadata-service/refactor-endpoints@sha256:de1e4478867f54a76f1c82374dcebb1d40b3eb0cde24caf936a21a4d16471312
Port: 5000/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 27 Jul 2019 19:18:07 +0100
Finished: Sat, 27 Jul 2019 19:18:49 +0100
Ready: False
Restart Count: 7
Liveness: http-get http://:5000/ delay=15s timeout=15s period=10s #success=1 #failure=3
Readiness: http-get http://:5000/ delay=5s timeout=3s period=10s #success=1 #failure=3
Environment Variables from:
review-refactor-e-0jmik1-secret Secret Optional: false
Environment:
DATABASE_URL: postgres://:#review-refactor-e-0jmik1-postgres:5432/
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mvvfv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-mvvfv:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mvvfv
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m52s default-scheduler Successfully assigned metadata-service-13359548/review-refactor-e-0jmik1-7f75c45779-jfw22 to gke-qa2-default-pool-4dc045be-g8d9
Normal Pulling 9m51s kubelet, gke-qa2-default-pool-4dc045be-g8d9 pulling image "registry.gitlab.com/hidden-fox/metadata-service/refactor-endpoints:5e986c65d41743d9d6e6ede441a1cae316b3e751"
Normal Pulled 9m45s kubelet, gke-qa2-default-pool-4dc045be-g8d9 Successfully pulled image "registry.gitlab.com/hidden-fox/metadata-service/refactor-endpoints:5e986c65d41743d9d6e6ede441a1cae316b3e751"
Warning Unhealthy 8m58s kubelet, gke-qa2-default-pool-4dc045be-g8d9 Readiness probe failed: Get http://10.48.1.34:5000/: dial tcp 10.48.1.34:5000: connect: connection refused
Warning Unhealthy 8m28s (x6 over 9m28s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Readiness probe failed: HTTP probe failed with statuscode: 404
Normal Started 8m23s (x3 over 9m42s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Started container
Warning Unhealthy 8m23s (x6 over 9m23s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Liveness probe failed: HTTP probe failed with statuscode: 404
Normal Killing 8m23s (x2 over 9m3s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Killing container with id docker://auto-deploy-app:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 8m23s (x2 over 9m3s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Container image "registry.gitlab.com/hidden-fox/metadata-service/refactor-endpoints:5e986c65d41743d9d6e6ede441a1cae316b3e751" already present on machine
Normal Created 8m23s (x3 over 9m43s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Created container
Warning BackOff 4m42s (x7 over 5m43s) kubelet, gke-qa2-default-pool-4dc045be-g8d9 Back-off restarting failed container
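I expect the app to be deployed to Kubernetes, but instead I see a CrashLoopBackOff error.
I also don't see any application-specific errors in the logs.
I figured it out. I had to add an endpoint mapped to the root URL. Apparently, as part of the CD, the root URL gets pinged, and if there is no response the job fails. This is consistent with the describe output above: the liveness and readiness probes issue an HTTP GET on / at port 5000, and the events show both failing with status code 404.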
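To illustrate the idea, here is a minimal sketch in Python (not the Node.js app from the question) of a server that answers the root URL with 200 and every other path with 404, together with a hypothetical `probe` helper that stands in for an HTTP probe. Kubernetes HTTP probes treat any status code from 200 up to, but not including, 400 as success, so the 404 the app originally returned on / counts as a failure. The handler, helper name, and ephemeral port are assumptions made for the example.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the root path is mapped; everything else is a 404.
        if self.path == "/":
            status, body = 200, b"ok"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def probe(url):
    """Return the HTTP status code for a GET request, like a simple probe."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Port 0 asks the OS for a free port; a real deployment would use a fixed one.
server = HTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

root_status = probe(f"http://127.0.0.1:{port}/")
missing_status = probe(f"http://127.0.0.1:{port}/missing")

server.shutdown()
server.server_close()
print(root_status, missing_status)
```

The same effect can be had by pointing the probes at an existing health endpoint instead of adding a root route, if the deployment tooling allows the probe path to be configured.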

test image from azure container registry

I created a simple Docker image from a "Hello World" java application.
This is my Dockerfile
FROM java:8
COPY . /var/www/java
WORKDIR /var/www/java
RUN javac HelloWorld.java
CMD ["java", "HelloWorld"]
I pushed the image (java-app) to Azure Container Registry.
$ az acr repository list --name AContainerRegistry --output table
Result
----------------
java-app
I want to deploy it
amhg$ kubectl run dockerproject --image=acontainerregistry.azurecr.io/java-app:v1
deployment.apps "dockerproject" created
amhg$ kubectl expose deployments dockerproject --port=80 --type=LoadBalancer
service "dockerproject" exposed
When I list the pods, the pod has crashed:
amhg$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dockerproject-b6799d879-pt5rx 0/1 CrashLoopBackOff 8 19m
Is there a way to "test" or run the image from the central registry? Why does it crash?
Here is the output of describe pod:
amhg$ kubectl describe pod dockerproject-64fbf7649-spc7h
Name: dockerproject-64fbf7649-spc7h
Namespace: default
Node: aks-nodepool1-39744669-0/10.240.0.4
Start Time: Thu, 19 Apr 2018 11:53:58 +0200
Labels: pod-template-hash=209693205
run=dockerproject
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"dockerproject-64fbf7649","uid":"946610e4-43b7-11e8-9537-0a58ac1...
Status: Running
IP: 10.244.0.38
Controlled By: ReplicaSet/dockerproject-64fbf7649
Containers:
dockerproject:
Container ID: docker://1f2a7a6870a37e4d6b53fc834b0d4d3b681e9faaacc3772177a918e66357404e
Image: acontainerregistry.azurecr.io/java-app:v1
Image ID: docker-pullable://acontainerregistry.azurecr.io/java-app@sha256:eaf6fe53a59de287ad76a18de2c7f05580b1f25153624161aadcc7b8ef47b0c4
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 19 Apr 2018 12:35:22 +0200
Finished: Thu, 19 Apr 2018 12:35:23 +0200
Ready: False
Restart Count: 13
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-vkpjm (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-vkpjm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-vkpjm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 43m default-scheduler Successfully assigned dockerproject2-64fbf7649-spc7h to aks-nodepool1-39744669-0
Normal SuccessfulMountVolume 43m kubelet, aks-nodepool1-39744669-0 MountVolume.SetUp succeeded for volume "default-token-vkpjm"
Normal Pulled 43m (x4 over 43m) kubelet, aks-nodepool1-39744669-0 Container image "acontainerregistry.azurecr.io/java-app:v1" already present on machine
Normal Created 43m (x4 over 43m) kubelet, aks-nodepool1-39744669-0 Created container
Normal Started 43m (x4 over 43m) kubelet, aks-nodepool1-39744669-0 Started container
Warning FailedSync 8m (x161 over 43m) kubelet, aks-nodepool1-39744669-0 Error syncing pod
Warning BackOff 3m (x184 over 43m) kubelet, aks-nodepool1-39744669-0 Back-off restarting failed container
When you run an application in a Pod, Kubernetes expects it to keep running, like a daemon, until you stop it.
In the details about your pod I see this:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 19 Apr 2018 12:35:22 +0200
Finished: Thu, 19 Apr 2018 12:35:23 +0200
This means that your application exited with code 0 (which means "all is OK") right after it started. So the image was downloaded successfully (the registry is fine) and the container ran, but the application exited.
That's why Kubernetes keeps trying to restart the pod.
The only thing I can suggest is to find out why the application stops and fix it.
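The "exited right after start" reading can be checked mechanically. The sketch below, in Python, parses the status lines quoted above and computes how long the container ran; the `last_run` helper is a hypothetical name, and the parsing assumes the `Key: value` layout that `kubectl describe` prints.

```python
from email.utils import parsedate_to_datetime

# Status lines copied from the describe output quoted above.
DESCRIBE_SNIPPET = """\
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 19 Apr 2018 12:35:22 +0200
Finished: Thu, 19 Apr 2018 12:35:23 +0200
"""


def last_run(snippet):
    """Return (exit_code, runtime_seconds) for the last terminated run."""
    fields = {}
    for line in snippet.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    started = parsedate_to_datetime(fields["Started"])
    finished = parsedate_to_datetime(fields["Finished"])
    return int(fields["Exit Code"]), (finished - started).total_seconds()


exit_code, runtime = last_run(DESCRIBE_SNIPPET)
print(f"exit code {exit_code}, ran for {runtime:.0f} s")
```

An exit code of 0 after about one second is what you would expect from a program that prints a message and returns, rather than from a crash or a failed image pull.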
