Spark on k8s - emptyDir not mounted to directory - apache-spark

I kicked off a Spark job on Kubernetes with quite a big volume of data, and the job failed because there was not enough space in the /var/data/spark-xxx directory.
As the Spark documentation at https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md says:
Spark uses temporary scratch space to spill data to disk during
shuffles and other operations. When using Kubernetes as the resource
manager the pods will be created with an emptyDir volume mounted for
each directory listed in SPARK_LOCAL_DIRS. If no directories are
explicitly specified then a default directory is created and
configured appropriately.
It seems that the /var/data/spark-xx directory is the default one for the emptyDir. So I tried to map that emptyDir to a volume (with more space) that is already mounted to the driver and executor pods.
I mapped it in the properties file, and I can see that it is mounted in the shell:
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.options.claimName=sparkstorage
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.options.claimName=sparkstorage
I am wondering if it's possible to somehow mount the emptyDir on my persistent storage, so that I can spill more data and avoid job failures?

I found that Spark 3.0 has addressed this problem and implemented the feature.
Spark supports using volumes to spill data during shuffles and other operations. To use a volume as local storage, the volume's name should start with spark-local-dir-, for example:
--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.readOnly=false
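As a hedged PySpark-style sketch building on the question's setup: the claim name sparkstorage is taken from the question, while the volume name suffix and the /spilldata mount path are illustrative, not values from the thread.
from pyspark import SparkConf

sparkConf = SparkConf()
for role in ("driver", "executor"):
    # The volume name must start with spark-local-dir- so Spark uses it as scratch space.
    # "spill" is an illustrative suffix; "sparkstorage" is the claim name from the question.
    prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.spark-local-dir-spill"
    sparkConf.set(f"{prefix}.mount.path", "/spilldata")
    sparkConf.set(f"{prefix}.mount.readOnly", "false")
    sparkConf.set(f"{prefix}.options.claimName", "sparkstorage")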
Reference:
https://issues.apache.org/jira/browse/SPARK-28042
https://github.com/apache/spark/pull/24879

Related

Is there a way to retrieve kubernetes container's ephemeral-storage usage details?

I create some pods with containers for which I set an ephemeral-storage request and limit (here 10GB).
Unfortunately, for some containers the ephemeral storage gets completely filled for unknown reasons. I would like to understand which dirs/files are responsible for filling it up, but I have not found a way to do so.
I tried df -h, but unfortunately it gives stats for the whole node and not only for the particular pod/container.
Is there a way to retrieve a Kubernetes container's ephemeral-storage usage details?
Pods use ephemeral local storage for scratch space, caching, and for logs. The kubelet can provide scratch space to Pods using local ephemeral storage to mount emptyDir volumes into containers.
Depending on your Kubernetes platform, you may not be able to easily determine where these files are being written. Any filesystem can fill up, but rest assured that disk is being consumed somewhere (or worse, memory, depending on the specific configuration of your emptyDir and/or your Kubernetes platform).
Refer to this SO link for more details on how, by default, allocatable ephemeral-storage in a standard Kubernetes environment is sourced from the node filesystem (mounted at /var/lib/kubelet).
Also refer to the Kubernetes documentation on how ephemeral storage can be managed and how ephemeral-storage consumption management works.
Assuming you're a GCP user, you can get a sense of your ephemeral-storage usage this way:
Menu > Monitoring > Metrics Explorer
Resource type: kubernetes node & Metric: Ephemeral Storage
Try the commands below to get a Kubernetes pod/container's ephemeral-storage usage details:
Try du -sh / [run inside the container]: du -sh gives the space consumed by your container's files. It simply returns the amount of disk space the current directory and everything in it are using as a whole, something like 2.4G.
You can also check the size of a specific directory with du -h someDir.
Inspecting container filesystems: you can use /bin/df to monitor ephemeral-storage usage on the volumes where ephemeral container data is located, which are /var/lib/kubelet and /var/lib/containers.
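If you would rather gather this programmatically from inside the container than eyeball du output, here is a minimal Python sketch; the helper name top_dirs is made up for illustration, and it assumes du (coreutils or busybox) is present in the image.
import subprocess

def top_dirs(path="/", n=10):
    # Hypothetical helper, not from the answer: list the largest first-level
    # directories under `path`, to see what is filling the ephemeral storage.
    result = subprocess.run(
        ["du", "-x", "-k", "-d", "1", path],
        capture_output=True, text=True, check=False,
    )
    rows = []
    for line in result.stdout.strip().splitlines():
        size_kb, directory = line.split("\t", 1)
        rows.append((int(size_kb), directory))
    return sorted(rows, reverse=True)[:n]

if __name__ == "__main__":
    for size_kb, directory in top_dirs("/"):
        print(f"{size_kb / 1024:10.1f} MiB  {directory}")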

Issue mounting NFS share using Apache Spark 3.1.1 running on Kubernetes 1.21

From a Jupyter notebook I am creating a Spark context which deploys Spark on Kubernetes. This has been working fine for some time. I am now trying to configure the Spark context so that the driver and executors mount an NFS share to a local directory. Note that the NFS share I am trying to mount has been in use for some time, both via my k8s cluster and via other means.
According to the official documentation and release article for 3.1.x I should be able to modify my spark conf with options that are in turn passed to kubernetes.
My spark conf in this example is set as:
sparkConf.set(f"spark.kubernetes.driver.volumes.nfs.myshare.mount.readOnly", "false")
sparkConf.set(f"spark.kubernetes.driver.volumes.nfs.myshare.mount.path", "/deltalake")
sparkConf.set(f"spark.kubernetes.driver.volumes.nfs.myshare.options.server", "15.4.4.1")
sparkConf.set(f"spark.kubernetes.driver.volumes.nfs.myshare.options.path", "/deltalake")
In my scenario the NFS share is "15.4.4.1:/deltalake" and I arbitrarily selected the name myshare to represent this NFS mount.
When I describe the pods created when I instantiate the Spark context, I do not see any mounts resembling these directives.
# kubectl describe <a-spark-pod>
...
Volumes:
  spark-conf-volume-exec:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-exec-85efd381ea403488-conf-map
    Optional:  false
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-947xd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
I also do not see anything in the logs for the pod indicating an issue.
Update:
I missed a key line of the documentation which states that drivers and executors have different configs.
The configuration properties for mounting volumes into the executor pods use prefix spark.kubernetes.executor. instead of spark.kubernetes.driver.
The second thing I missed is that the Docker container used by the Spark conf to provision the Kubernetes pods hosting the Spark executors needs to have the software installed to mount NFS shares (i.e. the command-line utility for mounting an NFS share). The Spark integration will silently fail if the NFS utils are not installed. If we describe a pod in this scenario, the pod will list an NFS volume, and if we execute code on each executor to list the contents of the mount dir, the mount path will show an empty directory. There is no indication of the failure if we describe the pod or look at the pod's logs.
I am rebuilding the container images and will try again
There are a few things needed to get this to work:
The spark conf needs to be configured for the driver and executor
The nfs utils package needs to be installed on the driver and executor nodes
The nfs server needs to be active and properly configured to allow connections
There are a few possible problems:
The mount does not succeed (server offline, path doesn't exist, path in use)
As a workaround:
After the spark session is created, run a shell command on all the workers to confirm they have access to the mount and the contents look right.
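Pulling those points together, here is a hedged PySpark sketch that sets the NFS volume options for both the driver and the executors and then lists the mount contents on the workers. It reuses myshare, 15.4.4.1 and /deltalake from the question, assumes the remaining Kubernetes settings (master URL, container image) are already on the conf, and the partition count of 8 is arbitrary.
from pyspark import SparkConf
from pyspark.sql import SparkSession

sparkConf = SparkConf()
for role in ("driver", "executor"):
    # Identical volume options for both roles; forgetting the executor prefix
    # was the first mistake described above.
    prefix = f"spark.kubernetes.{role}.volumes.nfs.myshare"
    sparkConf.set(f"{prefix}.mount.path", "/deltalake")
    sparkConf.set(f"{prefix}.mount.readOnly", "false")
    sparkConf.set(f"{prefix}.options.server", "15.4.4.1")
    sparkConf.set(f"{prefix}.options.path", "/deltalake")

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

def check_mount(_):
    # Runs on an executor; an empty listing suggests the mount silently failed,
    # for example because the NFS utilities are missing from the container image.
    import os, socket
    yield (socket.gethostname(), os.listdir("/deltalake"))

print(spark.sparkContext.parallelize(range(8), 8).mapPartitions(check_mount).collect())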

How to configure where spark spills to disk?

I am not able to find this configuration anywhere in the official documentation. Say I decide to install Spark, or use a Spark Docker image. I would like to configure where the "spill to disk" happens so that I can mount a volume that can accommodate it. Where is the default spill-to-disk location, and how can it be changed?
Cloud or bare-metal worker nodes have a per-node spill location on the local file system, not HDFS. This is handled by default, not by you explicitly. A certain amount of the local file system is used for spilling and shuffling, and the rest for HDFS. You can name a location for the local fs yourself or leave it to the defaults, or the fs can be an NFS, etc.
For Docker, say, you need simulated HDFS or some Linux-like fs for Spark's intermediate processing. See https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html for an excellent guide.
For Spark with YARN, use yarn.nodemanager.local-dirs. See https://spark.apache.org/docs/latest/running-on-yarn.html
For Spark Standalone, use SPARK_LOCAL_DIRS: "Scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
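As a minimal sketch of the standalone/local case (the /mnt/spark-scratch path is an illustrative mount point, and on YARN this setting is superseded by yarn.nodemanager.local-dirs as noted above):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spill-location-demo")
    # Scratch directory for shuffle files and spilled blocks; defaults to /tmp
    # and is overridden by SPARK_LOCAL_DIRS in standalone mode.
    .config("spark.local.dir", "/mnt/spark-scratch")
    .getOrCreate()
)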

Local disk configuration in Spark

Hi, the official Spark documentation states:
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages. We recommend having 4-8
disks per node, configured without RAID (just as separate mount
points). In Linux, mount the disks with the noatime option to reduce
unnecessary writes. In Spark, configure the spark.local.dir variable
to be a comma-separated list of the local disks. If you are running
HDFS, it’s fine to use the same disks as HDFS.
I wonder what the purpose of 4-8 disks per node is.
Is it for parallel writes? I am not sure I understand the reason, as it is not explained.
I also have no clue about this: "If you are running HDFS, it's fine to use
the same disks as HDFS".
Any idea what is meant here?
The purpose of using RAID disks is to mirror partitions, adding redundancy to prevent data loss in the case of a hardware-level fault. In the case of HDFS, the redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
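To illustrate the comma-separated spark.local.dir form from the quoted documentation, here is a hedged sketch with one scratch directory per separate disk; the /data1 ... /data4 mount points are illustrative, not values from the thread.
from pyspark import SparkConf

conf = SparkConf()
# One scratch directory per separate physical disk, mounted without RAID as the
# docs recommend; Spark spreads shuffle and spill files across all of them.
conf.set("spark.local.dir", "/data1/spark,/data2/spark,/data3/spark,/data4/spark")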

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a DataProc Spark cluster. Initially, the master and 2 worker nodes are of type n1-standard-4 (4 vCPU, 15.0 GB memory), then I resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two workers nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with
spark.executor.cores 2
spark.executor.memory 4655m
in the /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled set to true, the number of executors would be increased automatically.
Also, the information on the Dataproc page of the web console doesn't get updated automatically either. It seems that Dataproc still thinks that all nodes are n1-standard-4.
My questions are
why are there more executors on the first worker node than the second?
why are not more executors added to each node?
Ideally, I want the whole cluster to be fully utilized. If the Spark configuration needs to be updated, how?
As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.
To answer your questions:
The Spark ApplicationMaster takes a container in YARN on a worker node, usually the first worker if only a single spark application is running.
When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once machines are advertising more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may want to modify the size of containers requested by Spark by modifying spark.executor.memory and spark.executor.cores.
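As a hedged illustration of that last step, the executor sizing could be raised either in /usr/lib/spark/conf/spark-defaults.conf or programmatically; the values below are illustrative for an n1-highmem-8 worker, not tuned recommendations.
from pyspark import SparkConf

conf = SparkConf()
# Only effective once the YARN NodeManagers advertise enough memory and vcores
# to the ResourceManager; otherwise the larger containers simply won't fit.
conf.set("spark.executor.cores", "4")
conf.set("spark.executor.memory", "10g")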
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with new machine sizes and copy any data from your old cluster to the new cluster. In general, the simplest way to move data is to use hadoop's built in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.
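As a small sketch of that pattern (replace <your_bucket> with a real bucket name; this assumes the GCS connector that Dataproc clusters ship with):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
# Writing to gs:// instead of hdfs:// keeps the data after the cluster is deleted.
df.write.parquet("gs://<your_bucket>/path/to/output")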
