Where to set the S3 configuration in Spark locally?

Where to set the S3 configuration in Spark locally? - apache-spark

I've setup a docker container that is starting a jupyter notebook using spark. I've integrated the necessary jars into spark's directoy for being able to access the S3 filesystem.
My Dockerfile:
FROM jupyter/pyspark-notebook
EXPOSE 8080 7077 6066
RUN conda install -y --prefix /opt/conda pyspark==3.2.1
USER root
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar)
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.213/aws-java-sdk-bundle-1.12.213.jar )
# The aws sdk relies on guava, but the default guava lib in jars is too old for being compatible
RUN rm /usr/local/spark/jars/guava-14.0.1.jar
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/google/guava/guava/29.0-jre/guava-29.0-jre.jar )
USER jovyan
ENV AWS_ACCESS_KEY_ID=XXXXX
ENV AWS_SECRET_ACCESS_KEY=XXXXX
ENV PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
This works nicely so far. However, everytime I'm creating a kernel session in jupyter, I do need to setup the EnvironmentCredentialsProvider manually, because by default, it's expecting the IAMInstanceCredentialsProvider to deliver the credentials which obviously isn't there. Because of this, I need to set this in jupyter everytime:
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
Can I configure this somewhere in a file, so that the credentialprovider is set correctly be default?
I've tried to create ~/.aws/credentials to see if spark would read the credentials by default from there but nope.

s3a connector actually looks for s3a options, then env vars, before IAM properties
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3
something may be wrong with your spark defaults config file

After a few days of browsing the web, I found the correct properties (not in the official documentation though) that were missing in the spark-defaults.conf:
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider

Related

SHAREDLIBRARYNAME Utimaco is not referring When I start signserver with docker

I start signserver with docker:
docker run -it --rm --name signserver \
-p 80:8080 -p 443:8443 \
-e CRYPTO_SERVER_IP=**** \
-v /ca-cert.pem:/mnt/external/secrets/tls/cas/ManagementCA.crt \
signserver:1.0
Now, i need connect signserver to PKCS11 on HSM.I has changed signserver-deploy.configuaration:
cryptotoken.p11.lib.30.name=Utimaco
cryptotoken.p11.lib.30.file=/opt/utimaco/p11/libcs_pkcs11_R3.so
Then I add PKCS#11 crypto worker from template,and i change the configuration:
WORKERGENID1.SHAREDLIBRARYNAME=Utimaco
The PKCS#11 crypto worker status is offline,so i active it and enter authentication Code.but i get errors:
- Failed to initialize crypto token: SHAREDLIBRARYNAME Utimaco is not referring to a defined value
Could you please help me
Thank you so much!

This is being discussed at the SignServer CE project's GitHub Discussions page where it is being answered that:
The current SignServer CE container does not support changing configuration in the signserver_deploy.properties.
A theoretical short-term solution for doing this could be something like this:
Find where the signserver.ear is in the container (probably under the appserver deployments folder and it might be folder instead of a ZIP file).
Find the JAR file which has the configuration, likely lib/SignServer-Common.jar
Find the properties file in that JAR file, something like org/signserver/common/.../signservercompile.properties
Change the property in that file and save it back to the ZIP file

Elastalert2 rules folder config not working

I'm using Elastalert2 now to get notifications from error log in slack.
We need to receive alarms of all service logs through our dozens of rules.
Docker builds ElastAlert2 and deploy it on Argocd.
But, there is a problem that the rules_folder config does not work
There is rules_folder in config.yaml
rules_folder: /home/elastalert/rules
and this is Example Dockerfile
FROM python:3.9.13-slim
# installation
RUN pip3 install --upgrade pip \
&& pip3 install cryptography elastalert2
ENV LANG="en_US.UTF-8"
# add configuration and alarm
RUN mkdir -p /home/elastalert
WORKDIR /home/elastalert
ADD ./config.yaml /home/elastalert
COPY ./rules /home/elastalert/rules
and this is run command
command: [ "/bin/sh", "-c" ]
args:
- >-
echo "Finda Elastalert is started!!" &&
elastalert-create-index &&
elastalert --verbose --config config.yaml
...
but error occur like...
[error][1]
I think the rule files cannot be imported as args.
In other words, it seems that rules_folder does not apply
If, specify a specific rule file in the start command, it works well.
For example,
elastalert --verbose --config config.yaml --rule ./rules/example/example.yaml
However, it can only execute one rule.
We have dozens of rules.
What's the problem?

Solve.
Don't store empty yaml in your rules/ sub.
The problem was that I commented out all the yaml files except the test rule yaml for the operation test.
By replacing the commented yaml file with another extension such as .text.
Now elastalert recognizes and operates all rules.

Loading Hadoop-Azure in Kubernetes

I am trying to get Apache Spark to load Hadoop-Azure when running using the new Kubernetes feature.
No matter my efforts, Apache Spark always gives me the following error java.lang.classnotfoundexception: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found when trying to load a file using the wabs:// schema
My dockerfile right now
FROM spark:latest
COPY *.jar $SPARK_HOME/jars
ENV SPARK_EXTRA_CLASSPATH="$SPARK_HOME/jars/hadoop-azure-3.2.0.jar:$SPARK_HOME/jars/azure-keyvault-core-1.2.4.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/jetty-util-ajax-9.3.24.v20180605.jar:$SPARK_HOME/jars/wildfly-openssl-2.1.3.Final.jar"
ENV HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
(spark:latest is a build of kubernetes\dockerfiles\spark\Dockerfile from spark-3.1.1-bin-hadoop3.2)
The directory contains the following jars:
hadoop-azure-3.2.0.jar
azure-storage-8.6.6.jar
azure-keyvault-core-1.2.4.jar
jetty-util-ajax-9.3.24.v20180605.jar
wildfly-openssl-2.1.3.Final.jar
I have validated that the files are copied and stored in /opt/spark/jars

This code works for me:
ENV \
AZURE_STORAGE_VER=8.6.6 \
HADOOP_AZURE_VER=3.3.0 \
JETTY_VER=9.4.38.v20210224
# Set JARS env
ENV JARS=${SPARK_HOME}/jars/azure-storage-${AZURE_STORAGE_VER}.jar,${SPARK_HOME}/jars/hadoop-azure-${HADOOP_AZURE_VER}.jar,${SPARK_HOME}/jars/jetty-util-ajax-${JETTY_VER}.jar,${SPARK_HOME}/jars/jetty-util-${JETTY_VER}.jar
RUN echo "spark.jars ${JARS}" >> $SPARK_HOME/conf/spark-defaults.conf
But you will need to add this jar as well: jetty-util-9.4.38.v20210224.jar

So I needed to-do two things before this worked.
Number one was getting the correct dependencies installed. I did this by creating a new pom.xml with the following dependencies
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-azure</artifactId>
<version>3.2.2</version>
</dependency>
And then running mvn install dependency:copy-dependencies. I could then copy the output jars files to the directory that contains my Dockerfile and create a simple Dockerfile with this content:
FROM spark:latest
USER root
COPY ./*.jar $SPARK_HOME/jars
RUN chmod -R 775 $SPARK_HOME/jars
USER 185
However I had another issue as well. I was trying to use wabs:// when running in Client Mode. If your running in Client Mode, the jars also have to be copied to /opt/spark/jars on the machine creating the SparkContext.

How PYSPARK environmental setup is executed by YARN in launch_container.sh

While analyzing the yarn launch_container.sh logs for a spark job, I got confused by some part of log.
I will point out those asks step by step here
When you will submit a spark job with spark-submit having --pyfiles and --files on cluster mode on YARN:
The config files passed in --files , executable python files passed in --pyfiles are getting uploaded into .sparkStaging directory created under user hadoop home directory.
Along with these files pyspark.zip and py4j-version_number.zip from $SPARK_HOME/python/lib is also getting copied
into .sparkStaging directory created under user hadoop home directory
After this launch_container.sh is getting triggered by yarn and this will export all env variables required.
If we have exported anything explicitly such as PYSPARK_PYTHON in .bash_profile or at the time of building the spark-submit job in a shell script or in spark_env.sh , the default value will be replaced by the value which we
are providing
This PYSPARK_PYTHON is a path in my edge node.
Then how a container launched in another node will be able to use this python version ?
The default python version in data nodes of my cluster is 2.7.5.
So without setting this pyspark_python , containers are using 2.7.5.
But when I will set pyspark_python to 3.5.x , they are using what I have given.
It is defining PWD='/data/complete-path'
Where this PWD directory resides ?
This directory is getting cleaned up after job completion.
I have even tried to run the job in one session of putty
and kept the /data folder opened in another session of putty to see
if any directories are getting created on run time. but couldn't find any?
It is also setting the PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-version.zip
When ever I am doing a python specific operation
in spark code , its using PYSPARK_PYTHON. So for what purpose this PYTHONPATH is being used?
3.After this yarn is creating softlinks using ln -sf for all the files in step 1
soft links are created for for pyspark.zip , py4j-<version>.zip,
all python files mentioned in step 1.
Now these links are again pointing to '/data/different_directories'
directory (which I am not sure where they are present).
I know soft links can be used for accessing remote nodes ,
but here why the soft links are created ?
Last but not the least , whether this launch_container.sh will run for each container launch ?

Then how a container launched in another node will be able to use this python version ?
First of all, when we submit a Spark application, there are several ways to set the configurations for the Spark application.
Such as:
Setting spark-defaults.conf
Setting environment variables
Setting spark-submit options (spark-submit —help and —conf)
Setting a custom properties file (—properties-file)
Setting values in code (exposed in both SparkConf and SparkContext APIs)
Setting Hadoop configurations (HADOOP_CONF_DIR and spark.hadoop.*)
In my environment, the Hadoop configurations are placed in /etc/spark/conf/yarn-conf/, and the spark-defaults.conf and spark-env.sh is in /etc/spark/conf/.
As the order of precedence for configurations, this is the order that Spark will use:
Properties set on SparkConf or SparkContext in code
Arguments passed to spark-submit, spark-shell, or pyspark at run time
Properties set in /etc/spark/conf/spark-defaults.conf, a specified properties file
Environment variables exported or set in scripts
So broadly speaking:
For properties that apply to all jobs, use spark-defaults.conf,
for properties that are constant and specific to a single or a few applications use SparkConf or --properties-file,
for properties that change between runs use command line arguments.
Now, regarding the question:
In Cluster mode of Spark, the Spark driver is running in container in YARN, the Spark executors are running in container in YARN.
In Client mode of Spark, the Spark driver is running outside of the Hadoop cluster(out of YARN), and the executors are always in YARN.
So for your question, it is mostly relative with YARN.
When an application is submitted to YARN, first there will be an ApplicationMaster container, which nigotiates with NodeManager, and is responsible to control the application containers(in your case, they are Spark executors).
NodeManager will then create a local temporary directory for each of the Spark executors, to prepare to launch the containers(that's why the launch_container.sh has such a name).
We can find the location of the local temporary directory is set by NodeManager's ${yarn.nodemanager.local-dirs} defined in yarn-site.xml.
And we can set yarn.nodemanager.delete.debug-delay-sec to 10 minutes and review the launch_container.sh script.
In my environment, the ${yarn.nodemanager.local-dirs} is /yarn/nm, so in this directory, I can find the tempory directories of Spark executor containers, they looks like:
/yarn/nm/nm-local-dir/container_1603853670569_0001_01_000001.
And in this directory, I can find the launch_container.sh for this specific container and other stuffs for running this container.
Where this PWD directory resides ?
I think this is a special Environment Variable in Linux OS, so better not to modify it unless you know how it works percisely in your application.
As per above, if you export this PWD environment at the runtime, I think it is passed to Spark as same as any other Environment Variables.
I'm not sure how the PYSPARK_PYTHON Environment Variable is used in Spark's launch scripts chain, but here you can find the instruction in the official documentation, showing how to set Python binary executable while you are using spark-submit:
spark-submit --conf spark.pyspark.python=/<PATH>/<TO>/<FILE>
As for the last question, yes, YARN will create a temp dir for each of the containers, and the launch_container.sh is included in the dir.

The environment variable AWS_ACCESS_KEY_ID must be set

I am using Linux 18.04 and I want to lunch a spark cluster on EC2.
I used the export command to set environment variables
export AWS_ACCESS_KEY_ID=MyAccesskey
export AWS_SECRET_ACCESS_KEY=Mysecretkey
but when I run the command to lunch the spark cluster I get
ERROR: The environment variable AWS_ACCESS_KEY_ID must be set
I put all the commands I used in case I made a mistake:
sudo mv ~/Downloads/keypair.pem /usr/local/spark/keypair.pem
sudo mv ~/Downloads/credentials.csv /usr/local/spark/credentials.csv
# Make sure the .pem file is readable by the current user.
chmod 400 "keypair.pem"
# Go into the spark directory and set the environment variables with the credentials information
cd spark
export AWS_ACCESS_KEY_ID=ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=SECRET_KEY
# To install Spark 2.0 on the cluster:
sudo spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
I am new to these things and any help is really appreciated

You can also retrieve the value AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by using the get subcommand for aws configure:
AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id)
AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key)
In command line:
sudo AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id) AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key) spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
source: AWS Command Line Interface User Guide

Environment variables can be simply passed after sudo in form ENV=VALUE and they'll be accepted by followed command. It's not known to me if there are restrictions to this usage, so my example problem can be solved with:
sudo AWS_ACCESS_KEY_ID=ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=SECRET_KEY spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string