I am trying out the Kubernetes setup from here. The problem is that it hangs with the following output:
Waiting for cluster initialization.
This will continually check to see if the API for kubernetes is reachable.
This might loop forever if there was some uncaught error during start up.
...................................................................................................................................................................
And it just hangs there?
According to https://github.com/chanezon/azure-linux/tree/master/coreos/kubernetes, Kubernetes from Git is currently broken; I wasn't able to get it running either, unfortunately :(
I'm trying to use the following Helm Chart for Spark on Kubernetes
https://github.com/bitnami/charts/tree/main/bitnami/spark
The documentation is of course spotty, but I've muddled along. I have it installed with custom values that set things like resource limits. I'm accessing the master through a NodePort and the WebUI through a port forward. I am NOT using spark-submit; I'm writing Python code to drive the Spark cluster as follows:
import pyspark

# <IP>:<PORT> is the NodePort-exposed address of the Spark master service
sc = pyspark.SparkContext(appName="Testy", master="spark://<IP>:<PORT>")
This Python code is running locally on my Windows laptop; the Kubernetes cluster is on a separate set of servers. It connects, and I can see the app appear in the WebUI, but the second it tries to do something I get the following:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The master seems to be stuck in a cycle of removing and launching executors, and each of the 3 workers just fails to run its launch command. Interestingly, the command contains the hostname of my laptop:
"--driver-url" "spark://CoarseGrainedScheduler#<laptop hostname>:60557"
I've got to imagine that's not right. So in this setup, where should I actually be running the Python code? On the Kubernetes cluster? Can I run it locally on my laptop? These details are of course missing from the docs. I'm new to Spark, so I'm just looking for the absolute basics. My preferred workflow would be to develop code locally on my laptop and then run it on the Kubernetes cluster I have access to.
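In case it helps: the --driver-url above suggests the executors are trying to connect back to the driver on the laptop, and a bare laptop hostname is usually not resolvable from inside the cluster. Here is a minimal sketch of one way to make the driver reachable when running the code locally; spark.driver.host, spark.driver.port, and spark.blockManager.port are standard Spark settings, while <ROUTABLE_IP> and the port numbers are hypothetical placeholders you would need to open or forward:

import pyspark

# Executors connect back to the driver, so the driver must advertise an
# address and ports that the worker pods can actually reach.
conf = (
    pyspark.SparkConf()
    .setAppName("Testy")
    .setMaster("spark://<IP>:<PORT>")           # NodePort-exposed master
    .set("spark.driver.host", "<ROUTABLE_IP>")  # laptop address reachable from the pods
    .set("spark.driver.port", "40000")          # fixed driver RPC port
    .set("spark.blockManager.port", "40001")    # fixed block manager port
)
sc = pyspark.SparkContext(conf=conf)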
Instead of the expected output from display(my_dataframe), I get "Failed to fetch the result. Retry" when looking at the completed run (which is also marked as success).
The notebook runs fine, including the expected outputs, when run as an on-demand notebook (same cluster config etc.). It seems to be a UI issue? I honestly don't even know where to look for possible causes.
I had the same problem while running a job on Azure Databricks, and restarting my computer (or maybe just the Explorer...) helped.
I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs that involve a lot of shuffling. Increasing executor memory does not help. The same job runs fine on just a single node of the Kubernetes cluster in local[*] mode, however, so I suspect it has to do with Kubernetes or the underlying Docker.
When an executor dies, the pods are deleted immediately so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?
You can view the logs of the previously terminated container like this (the pod object itself must still exist, so this works for restarted containers but not for pods that have been deleted):
kubectl logs -p <terminated pod name>
Also consider the spec.ttlSecondsAfterFinished field of a Job, as mentioned here; it delays the automatic cleanup of a finished Job, as in the sketch below.
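A minimal sketch of such a Job (ttlSecondsAfterFinished is a standard batch/v1 Job field; the name, image, and command here are hypothetical placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job              # hypothetical name
spec:
  ttlSecondsAfterFinished: 600   # keep the finished Job and its pods for 10 minutes
  template:
    spec:
      containers:
      - name: main
        image: busybox           # hypothetical image
        command: ["echo", "done"]
      restartPolicy: Never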
Executor pods are deleted by default on any failure, and you cannot do anything about that unless you customize the Spark-on-K8s code or use some advanced K8s tooling.
What you can do (and it is probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed with one click to any K8s cluster, or some ELK stack components. These will persist logs even after the pods are deleted; see the sketch below.
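For example, a rough sketch of deploying Loki with Helm (the repo URL and the loki-stack chart are the public Grafana ones, but verify the current chart name; the release name and namespace are arbitrary):

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --namespace logging --create-namespace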
There is a deleteOnTermination setting in the Spark application YAML. See the spark-on-kubernetes README.md.
deleteOnTermination - (Optional)
DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. Maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
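A minimal sketch of the relevant fragment of a SparkApplication manifest, assuming the spark-on-k8s operator's v1beta2 CRD (the application name is a hypothetical placeholder):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-app                   # hypothetical name
spec:
  executor:
    deleteOnTermination: false   # keep executor pods after failure or normal termination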
I have a problem with a Spark application on Kubernetes. The Spark driver tries to create an executor pod, and the executor pod fails to start. The problem is that as soon as the pod fails, the Spark driver removes it and creates a new one. The new one fails for the same reason. So how can I recover logs from already-removed pods, given that this seems to be the default Spark behavior on Kubernetes? I am also not able to catch the pods, since the removal is instantaneous. I have to wonder how I am ever supposed to fix the failing-pod issue if I cannot recover the errors.
In your case it would be helpful to implement cluster logging. Even if a pod gets restarted or deleted, its logs will stay in the log aggregator's storage.
There is more than one solution for cluster logging, but the most popular is EFK (Elasticsearch, Fluentd, Kibana).
Actually, you can even go without Elasticsearch and Kibana.
Check out the excellent article Application Logging in Kubernetes with fluentd by Rosemary Wang, which explains how to configure fluentd to put aggregated logs on the fluentd pod's stdout and access them later using the command:
kubectl logs <fluentd pod>…
I'm trying to install Giraph on an HDInsight cluster with Hadoop, using script actions.
After roughly 30 minutes of deploying the cluster, an error shows up:
Deployment failed
Deployment to resource group 'graphs' failed. Additional details from the underlying API that might be helpful: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.
Thanks in advance.
Thanks a lot for reporting this issue. We found the cause and fixed it.
Issue: There's a deadlock when the Giraph script is provided during cluster creation. The Giraph script waits for /example/jars to be created in DFS (WASB/ADLS), but /example/jars can only be created after the Giraph script completes. This issue doesn't repro for runtime scripts, since by the time the script runs, /example/jars already exists.
Note: We have created and deployed the fix for the scripts, and I have also tested creating a cluster with the updated version, which works fine. Please test on your side and let me know.