Loading Hadoop-Azure in Kubernetes

I am trying to get Apache Spark to load hadoop-azure when running with the new Kubernetes feature.
No matter my efforts, Spark always gives me the following error when trying to load a file using the wasb:// scheme: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found.
My Dockerfile right now:
FROM spark:latest
COPY *.jar $SPARK_HOME/jars
ENV SPARK_EXTRA_CLASSPATH="$SPARK_HOME/jars/hadoop-azure-3.2.0.jar:$SPARK_HOME/jars/azure-keyvault-core-1.2.4.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/jetty-util-ajax-9.3.24.v20180605.jar:$SPARK_HOME/jars/wildfly-openssl-2.1.3.Final.jar"
ENV HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
(spark:latest is a build of kubernetes/dockerfiles/spark/Dockerfile from spark-3.1.1-bin-hadoop3.2)
The directory contains the following jars:
hadoop-azure-3.2.0.jar
azure-storage-8.6.6.jar
azure-keyvault-core-1.2.4.jar
jetty-util-ajax-9.3.24.v20180605.jar
wildfly-openssl-2.1.3.Final.jar
I have validated that the files are copied and stored in /opt/spark/jars

This code works for me:
ENV \
AZURE_STORAGE_VER=8.6.6 \
HADOOP_AZURE_VER=3.3.0 \
JETTY_VER=9.4.38.v20210224
# Set JARS env
ENV JARS=${SPARK_HOME}/jars/azure-storage-${AZURE_STORAGE_VER}.jar,${SPARK_HOME}/jars/hadoop-azure-${HADOOP_AZURE_VER}.jar,${SPARK_HOME}/jars/jetty-util-ajax-${JETTY_VER}.jar,${SPARK_HOME}/jars/jetty-util-${JETTY_VER}.jar
RUN echo "spark.jars ${JARS}" >> $SPARK_HOME/conf/spark-defaults.conf
But you will need to add this jar as well: jetty-util-9.4.38.v20210224.jar

So I needed to do two things before this worked.
Number one was getting the correct dependencies installed. I did this by creating a new pom.xml with the following dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-azure</artifactId>
<version>3.2.2</version>
</dependency>
And then running mvn install dependency:copy-dependencies. I could then copy the output jar files to the directory that contains my Dockerfile and create a simple Dockerfile with this content:
FROM spark:latest
USER root
COPY ./*.jar $SPARK_HOME/jars
RUN chmod -R 775 $SPARK_HOME/jars
USER 185
However, I had another issue as well. I was trying to use wasb:// when running in client mode. If you're running in client mode, the jars also have to be copied to /opt/spark/jars on the machine creating the SparkContext.
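For reference, the <dependency> fragment above sits inside a minimal pom.xml along these lines; the groupId/artifactId of the wrapper project are placeholders for a throwaway dependency-collector project:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- Placeholder coordinates; this project exists only to collect jars -->
  <groupId>example</groupId>
  <artifactId>spark-azure-deps</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-azure</artifactId>
      <version>3.2.2</version>
    </dependency>
  </dependencies>
</project>
```

Running mvn dependency:copy-dependencies against this pom places hadoop-azure and its transitive dependencies under target/dependency/, ready to be copied next to the Dockerfile.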

Related

how do we copy file from hadoop to abfs remotely

I want to copy files from the Hadoop fs to abfs (Azure Blob File System), but it throws an error.
This is the command I ran:
hdfs dfs -ls abfs://....
ls: No FileSystem for scheme "abfs"
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
Any idea how this can be done?
In core-site.xml you need to add a config property fs.abfs.impl with the value org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem, and then add any other related authentication configuration it may need.
More details on installation/configuration here - https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
The abfs binding is already in core-default.xml for any release that includes the abfs client. However, the hadoop-azure jar and its dependencies are not in the Hadoop common/lib dir where they are needed (they are in HDI and CDH, but not in the Apache release).
You can tell the hadoop script to pick them up by setting the HADOOP_OPTIONAL_TOOLS env var; you can do this in ~/.hadoop-env, but try it on your command line first:
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-aws"
After doing that, download the latest cloudstore jar and use its storediag command to attempt to connect to an abfs URL; it's the place to start debugging classpath and config issues:
https://github.com/steveloughran/cloudstore

Where to set the S3 configuration in Spark locally?

I've set up a Docker container that starts a Jupyter notebook using Spark. I've integrated the necessary jars into Spark's directory so I can access the S3 filesystem.
My Dockerfile:
FROM jupyter/pyspark-notebook
EXPOSE 8080 7077 6066
RUN conda install -y --prefix /opt/conda pyspark==3.2.1
USER root
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar)
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.213/aws-java-sdk-bundle-1.12.213.jar )
# The AWS SDK relies on Guava, but the default Guava lib in jars is too old to be compatible
RUN rm /usr/local/spark/jars/guava-14.0.1.jar
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/google/guava/guava/29.0-jre/guava-29.0-jre.jar )
USER jovyan
ENV AWS_ACCESS_KEY_ID=XXXXX
ENV AWS_SECRET_ACCESS_KEY=XXXXX
ENV PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
This works nicely so far. However, every time I create a kernel session in Jupyter, I need to set up the EnvironmentVariableCredentialsProvider manually, because by default Spark expects the IAMInstanceCredentialsProvider to deliver the credentials, which obviously isn't there. Because of this, I need to set this in Jupyter every time:
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
Can I configure this somewhere in a file, so that the credentials provider is set correctly by default?
I've tried creating ~/.aws/credentials to see whether Spark would read the credentials from there by default, but no.
The s3a connector actually looks for s3a options, then env vars, before IAM properties:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3
Something may be wrong with your Spark defaults config file.
After a few days of browsing the web, I found the correct property (not in the official documentation, though) that was missing from spark-defaults.conf:
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider
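The lookup order described above (explicit s3a options, then environment variables, then IAM instance credentials) can be illustrated with a small, purely hypothetical provider-chain sketch. The function names mirror the real provider classes, but the logic here is only an illustration, not the hadoop-aws implementation:

```python
# Illustration of a credentials provider chain: each provider is tried
# in order, and the first one that yields credentials wins.
import os

def simple_provider(conf):
    # Mirrors SimpleAWSCredentialsProvider: explicit fs.s3a.* options
    key, secret = conf.get("fs.s3a.access.key"), conf.get("fs.s3a.secret.key")
    return (key, secret) if key and secret else None

def env_provider(conf):
    # Mirrors EnvironmentVariableCredentialsProvider: AWS_* env vars
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    return (key, secret) if key and secret else None

def iam_provider(conf):
    # Mirrors IAMInstanceCredentialsProvider: instance metadata (stubbed out
    # here, since there is no metadata endpoint outside EC2)
    return None

def resolve(conf, chain=(simple_provider, env_provider, iam_provider)):
    for provider in chain:
        creds = provider(conf)
        if creds is not None:
            return creds
    raise RuntimeError("No AWS credentials found in any provider")
```

With AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY exported (as in the Dockerfile above), the env-var provider supplies the credentials unless explicit fs.s3a.* options are set, which take precedence.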

How PYSPARK environmental setup is executed by YARN in launch_container.sh

While analyzing the YARN launch_container.sh logs for a Spark job, I got confused by some parts of the log.
I will walk through those questions step by step here.
When you submit a Spark job with spark-submit having --py-files and --files in cluster mode on YARN:
The config files passed in --files and the executable Python files passed in --py-files are uploaded into the .sparkStaging directory created under the user's Hadoop home directory.
Along with these files, pyspark.zip and py4j-version_number.zip from $SPARK_HOME/python/lib are also copied into that .sparkStaging directory.
After this, launch_container.sh is triggered by YARN, and it exports all required env variables.
If we have exported anything explicitly, such as PYSPARK_PYTHON in .bash_profile, at the time of building the spark-submit job in a shell script, or in spark-env.sh, the default value is replaced by the value we provide.
This PYSPARK_PYTHON is a path on my edge node.
Then how will a container launched on another node be able to use this Python version?
The default Python version on the data nodes of my cluster is 2.7.5, so without setting PYSPARK_PYTHON, containers use 2.7.5.
But when I set PYSPARK_PYTHON to 3.5.x, they use what I have given.
launch_container.sh is also defining PWD='/data/complete-path'.
Where does this PWD directory reside?
This directory is cleaned up after job completion. I even tried running the job in one PuTTY session while keeping the /data folder open in another PuTTY session, to see whether any directories were created at run time, but couldn't find any.
It also sets PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-version.zip.
Whenever I do a Python-specific operation in Spark code, it uses PYSPARK_PYTHON. So for what purpose is this PYTHONPATH used?
After this, YARN creates soft links using ln -sf for all the files in step 1: soft links are created for pyspark.zip, py4j-<version>.zip, and all the Python files mentioned in step 1.
These links again point to a '/data/different_directories' directory (and I am not sure where those are present).
I know soft links can be used for accessing remote nodes, but why are the soft links created here?
Last but not least, will this launch_container.sh run for each container launch?
Then how will a container launched on another node be able to use this Python version?
First of all, when we submit a Spark application, there are several ways to set the configurations for the Spark application.
Such as:
Setting spark-defaults.conf
Setting environment variables
Setting spark-submit options (spark-submit --help and --conf)
Setting a custom properties file (--properties-file)
Setting values in code (exposed in both SparkConf and SparkContext APIs)
Setting Hadoop configurations (HADOOP_CONF_DIR and spark.hadoop.*)
In my environment, the Hadoop configurations are placed in /etc/spark/conf/yarn-conf/, and the spark-defaults.conf and spark-env.sh are in /etc/spark/conf/.
As the order of precedence for configurations, this is the order that Spark will use:
Properties set on SparkConf or SparkContext in code
Arguments passed to spark-submit, spark-shell, or pyspark at run time
Properties set in /etc/spark/conf/spark-defaults.conf or a specified properties file
Environment variables exported or set in scripts
So broadly speaking:
For properties that apply to all jobs, use spark-defaults.conf,
for properties that are constant and specific to a single or a few applications use SparkConf or --properties-file,
for properties that change between runs use command line arguments.
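The precedence order above can be sketched as a simple merge, where higher-priority sources overwrite lower-priority ones. The source names are only labels for this illustration, not Spark APIs:

```python
# Sketch of Spark's configuration precedence: merge the sources from
# lowest to highest priority, so higher-priority values overwrite
# lower-priority ones for the same key.
def effective_conf(spark_defaults, env_derived, submit_args, code_conf):
    merged = {}
    # Lowest priority first, highest priority last:
    # env vars < spark-defaults.conf < spark-submit args < SparkConf in code
    for source in (env_derived, spark_defaults, submit_args, code_conf):
        merged.update(source)
    return merged
```

For example, a spark.executor.memory passed on the spark-submit command line wins over the same key in spark-defaults.conf, while a value set on SparkConf in code wins over everything else.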
Now, regarding the question:
In cluster mode, the Spark driver runs in a container in YARN, and the Spark executors also run in containers in YARN.
In client mode, the Spark driver runs outside of the Hadoop cluster (outside YARN), while the executors are always in YARN.
So your question is mostly about YARN.
When an application is submitted to YARN, there will first be an ApplicationMaster container, which negotiates with the NodeManagers and is responsible for controlling the application containers (in your case, the Spark executors).
The NodeManager then creates a local temporary directory for each of the Spark executors, to prepare to launch the containers (that's why launch_container.sh has such a name).
The location of the local temporary directory is set by the NodeManager's ${yarn.nodemanager.local-dirs}, defined in yarn-site.xml.
We can set yarn.nodemanager.delete.debug-delay-sec to 10 minutes and review the launch_container.sh script.
In my environment, ${yarn.nodemanager.local-dirs} is /yarn/nm, so in this directory I can find the temporary directories of the Spark executor containers; they look like:
/yarn/nm/nm-local-dir/container_1603853670569_0001_01_000001
In that directory I can find the launch_container.sh for this specific container, along with the other files needed to run it.
Where does this PWD directory reside?
PWD is a standard environment variable in Linux, so it's better not to modify it unless you know precisely how it works in your application.
As per the above, if you export this PWD environment variable at runtime, I think it is passed to Spark the same way as any other environment variable.
I'm not sure how the PYSPARK_PYTHON environment variable is used in Spark's chain of launch scripts, but here you can find the instructions in the official documentation, showing how to set the Python binary executable when using spark-submit:
spark-submit --conf spark.pyspark.python=/<PATH>/<TO>/<FILE>
As for the last question: yes, YARN creates a temp dir for each of the containers, and launch_container.sh is included in that dir.
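A yarn-site.xml fragment for the debug-delay setting mentioned above; the value is in seconds, so 600 keeps the container directories around for 10 minutes after job completion:

```xml
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
```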

Add conf file to classpath in Google Dataproc

We're building a Spark application in Scala with a HOCON configuration, the config is called application.conf.
If I add the application.conf to my jar file and start a job on Google Dataproc, it works correctly:
gcloud dataproc jobs submit spark \
--cluster <clustername> \
--jar=gs://<bucketname>/<filename>.jar \
--region=<myregion> \
-- \
<some options>
I don't want to bundle the application.conf with my jar file but provide it separately, which I can't get working.
I've tried different things, e.g.:
Specifying the application.conf with --jars=gs://<bucketname>/application.conf (which should work according to this answer)
Using --files=gs://<bucketname>/application.conf
Same as 1. + 2. with the application conf in /tmp/ on the Master instance of the cluster, then specifying the local file with file:///tmp/application.conf
Defining extraClassPath for spark using --properties=spark.driver.extraClassPath=gs://<bucketname>/application.conf (and for executors)
With all these options I get an error, it can't find the key in the config:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: system properties: No configuration setting found for key 'xyz'
This error usually means that there's an error in the HOCON config (the key xyz is not defined in HOCON) or that application.conf is not on the classpath. Since the exact same config works when it's inside my jar file, I assume it's the latter.
Are there any other options to put the application.conf on the classpath?
If --jars doesn't work as suggested in this answer, you can try an init action. First upload your config to GCS, then write an init action that downloads it to the VMs, putting it in a folder on the classpath, or update spark-env.sh to include the path to the config.
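A minimal init-action sketch along those lines; the bucket name and target directory are placeholders, and it assumes /etc/spark/conf is on the driver classpath (otherwise the extraClassPath line extends it):

```shell
#!/bin/bash
# Hypothetical Dataproc init action: download application.conf from GCS
# to a directory visible to the Spark driver at startup.
set -euo pipefail

gsutil cp gs://<bucketname>/application.conf /etc/spark/conf/application.conf

# If that directory is not already on the driver classpath, extend it:
echo "spark.driver.extraClassPath=/etc/spark/conf" >> /etc/spark/conf/spark-defaults.conf
```

Note that extraClassPath must point at the directory containing application.conf, not at the file itself, for Typesafe Config to pick it up as a classpath resource.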

Spark on K8s - getting error: kube mode not support referencing app dependencies in local

I am trying to set up a Spark cluster on k8s. I've managed to create and set up a cluster with three nodes by following this article:
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
After that when I tried to deploy spark on the cluster it failed at spark submit setup.
I used this command:
~/opt/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
--master k8s://https://206.189.126.172:6443 \
--deploy-mode cluster \
--name word-count \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=docker.io/garfiny/spark:v2.3.0 \
--conf spark.kubernetes.driver.pod.name=word-count \
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
And it gives me this error:
Exception in thread "main" org.apache.spark.SparkException: The Kubernetes mode does not yet support referencing application dependencies in the local file system.
at org.apache.spark.deploy.k8s.submit.DriverConfigOrchestrator.getAllConfigurationSteps(DriverConfigOrchestrator.scala:122)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:229)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-06-04 10:58:24 INFO ShutdownHookManager:54 - Shutdown hook called
2018-06-04 10:58:24 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/lz/0bb8xlyd247cwc3kvh6pmrz00000gn/T/spark-3967f4ae-e8b3-428d-ba22-580fc9c840cd
Note: I followed this article for installing spark on k8s.
https://spark.apache.org/docs/latest/running-on-kubernetes.html
The error message comes from commit 5d7c4ba4d73a72f26d591108db3c20b4a6c84f3f, which references the page you mention ("Running Spark on Kubernetes") and carries the note that you indicate:
// TODO(SPARK-23153): remove once submission client local dependencies are supported.
if (existSubmissionLocalFiles(sparkJars) || existSubmissionLocalFiles(sparkFiles)) {
throw new SparkException("The Kubernetes mode does not yet support referencing application " +
"dependencies in the local file system.")
}
This is described in SPARK-18278:
it wouldn't accept running a local: jar file, e.g. local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar, on my spark docker image (allowsMixedArguments and isAppResourceReq booleans in SparkSubmitCommandBuilder.java get in the way).
And this is linked to kubernetes issue 34377
The issue SPARK-22962 "Kubernetes app fails if local files are used" mentions:
This is the resource staging server use-case. We'll upstream this in the 2.4.0 timeframe.
In the meantime, that error message was introduced in PR 20320.
It includes the comment:
The manual tests I did actually use a main app jar located on gcs and http.
To be specific and for record, I did the following tests:
Using a gs:// main application jar and a http:// dependency jar. Succeeded.
Using a https:// main application jar and a http:// dependency jar. Succeeded.
Using a local:// main application jar. Succeeded.
Using a file:// main application jar. Failed.
Using a file:// dependency jar. Failed.
That issue should have been fixed by now, and the OP garfiny confirms in the comments:
I used the newest spark-kubernetes jar to replace the one in spark-2.3.0-bin-hadoop2.7 package. The exception is gone.
According to the mentioned documentation:
Dependency Management
If your application’s dependencies are all
hosted in remote locations like HDFS or HTTP servers, they may be
referred to by their appropriate remote URIs. Also, application
dependencies can be pre-mounted into custom-built Docker images. Those
dependencies can be added to the classpath by referencing them with
local:// URIs and/or setting the SPARK_EXTRA_CLASSPATH environment
variable in your Dockerfiles. The local:// scheme is also required
when referring to dependencies in custom-built Docker images in
spark-submit.
Note that using application dependencies from the
submission client’s local file system is currently not yet supported.