I am using Ubuntu 18.04 and I want to launch a Spark cluster on EC2.
I used the export command to set the environment variables:
export AWS_ACCESS_KEY_ID=MyAccesskey
export AWS_SECRET_ACCESS_KEY=Mysecretkey
but when I run the command to launch the Spark cluster I get:
ERROR: The environment variable AWS_ACCESS_KEY_ID must be set
I have put all the commands I used below, in case I made a mistake:
sudo mv ~/Downloads/keypair.pem /usr/local/spark/keypair.pem
sudo mv ~/Downloads/credentials.csv /usr/local/spark/credentials.csv
# Make sure the .pem file is readable by the current user.
chmod 400 "keypair.pem"
# Go into the spark directory and set the environment variables with the credentials information
cd spark
export AWS_ACCESS_KEY_ID=ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=SECRET_KEY
# To install Spark 2.0 on the cluster:
sudo spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
I am new to these things and any help is really appreciated
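A likely cause (worth verifying on your machine): by default sudo resets the environment, so variables exported in your shell are not visible to the command it runs. A quick check:

export AWS_ACCESS_KEY_ID=MyAccesskey
env | grep AWS_ACCESS_KEY_ID          # visible in your own shell
sudo env | grep AWS_ACCESS_KEY_ID     # empty: sudo dropped the variable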
You can also retrieve the values of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by using the get subcommand of aws configure:
AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id)
AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key)
On the command line:
sudo AWS_ACCESS_KEY_ID=$(aws configure get aws_access_key_id) AWS_SECRET_ACCESS_KEY=$(aws configure get aws_secret_access_key) spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
source: AWS Command Line Interface User Guide
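If the get subcommand returns nothing, it usually means the CLI has no stored profile yet; aws configure list shows which values the CLI resolves and where they come from:

aws configure list
# Prints the resolved access key (masked), secret key and region,
# plus whether each came from the environment, a config file, or an IAM role.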
Environment variables can simply be passed after sudo in the form ENV=VALUE, and they will be visible to the command that follows. I am not aware of any restrictions on this usage, so the problem in my example can be solved with:
sudo AWS_ACCESS_KEY_ID=ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=SECRET_KEY spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch
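Alternatively, if the variables are already exported in your shell, sudo's -E (--preserve-env) flag keeps them, provided the sudoers policy permits it:

export AWS_ACCESS_KEY_ID=ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=SECRET_KEY
sudo -E spark-ec2/spark-ec2 -k keypair --identity-file=keypair.pem --region=us-west-2 --zone=us-west-2a --copy-aws-credentials --instance-type t2.micro --worker-instances 1 launch project-launch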
Related
I've set up a Docker container that starts a Jupyter notebook using Spark. I've added the necessary jars to Spark's jars directory to be able to access the S3 filesystem.
My Dockerfile:
FROM jupyter/pyspark-notebook
EXPOSE 8080 7077 6066
RUN conda install -y --prefix /opt/conda pyspark==3.2.1
USER root
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar)
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.213/aws-java-sdk-bundle-1.12.213.jar )
# The AWS SDK relies on Guava, but the default Guava lib in jars is too old to be compatible
RUN rm /usr/local/spark/jars/guava-14.0.1.jar
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/google/guava/guava/29.0-jre/guava-29.0-jre.jar )
USER jovyan
ENV AWS_ACCESS_KEY_ID=XXXXX
ENV AWS_SECRET_ACCESS_KEY=XXXXX
ENV PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
This works nicely so far. However, every time I create a kernel session in Jupyter, I have to set up the EnvironmentVariableCredentialsProvider manually, because by default it expects the IAMInstanceCredentialsProvider to deliver the credentials, which obviously isn't available there. Because of this, I need to run the following in Jupyter every time:
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
Can I configure this somewhere in a file, so that the credentials provider is set correctly by default?
I've tried creating ~/.aws/credentials to see if Spark would read the credentials from there by default, but no luck.
The s3a connector actually looks for s3a options, then env vars, before IAM properties:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3
Something may be wrong with your Spark defaults config file.
After a few days of browsing the web, I found the correct property (not in the official documentation, though) that was missing from spark-defaults.conf:
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider
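If you would rather bake this into the image than edit spark-defaults.conf by hand, here is a sketch of how it could be appended from the Dockerfile above, assuming the conf directory sits next to the jars directory used earlier (i.e. /usr/local/spark/conf):

USER root
# Append the credentials provider to the image's Spark defaults (path is an assumption).
RUN echo "spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider" >> /usr/local/spark/conf/spark-defaults.conf
USER jovyan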
What I'm doing
I am using AWS Batch to run a Docker container for a large compute job. I have configured ECR/ECS successfully, to the best of my knowledge, but am having issues running the required commands, for reasons that are beyond my current understanding of Docker (I'm a newbie).
What I need to do is pass the commands below to my application and start it to perform some heavy computing tasks; all commands listed below must be present.
The Issue(s)
The issue arises when I submit the job to AWS Batch; the service pulls the image from ECR (Amazon's container registry) and spins up a compute environment. The problem occurs when I try to run the commands I pass in; I will go through them below.
"command": [
"mkdir -p logging",
"chmod 777 logging/",
"docker run -t -i -e my-application", # container name
"-e APIKEY",
"-e BASEURI",
"-e APIUSER",
"-v WORKSPACE /logging:/src/log",
"DOCKERIMAGE",
"python my_app.py",
"-t APP_USER",
"-e APP_ENVIRONMENT",
"-u APP_USERNAME",
"-p APP_PASSWORD",
"-i IN_PATH",
"-o OUT_PATH",
"-b tmp/"
]
The command above generates the following error(s)
container_linux.go:370: starting container process caused: exec: "mkdir -p log": executable file not found in $PATH
I tried to pass in a command to echo the env var $PATH but was unsuccessful in getting a response; it resulted in a similar error.
I have successfully run "ls" and was able to see my application's directory contents inside.
I am not, however, able to run any of the commands that I have included in the command [] section. I have tried just running python and such, in hopes of getting a more detailed error, but was unsuccessful.
Logic in plain English
Create a path called logging if it doesn't exist
Set the permissions for logging
Run the Docker container, passing in the environment variables while doing so
Tell Docker to run the Python file my_app.py and pass in the expected runtime args
Execute and perform the required logic delegated to the Python 3 application
Questions
Why can I not create a directory called "logging" here? Where am I?
Am I running these commands properly, as defined by AWS Batch (or Docker)?
What am I missing or where am I going wrong?
AWS Batch high-level doc
AWS Batch link specific to what I'm doing
Assuming that you're following the syntax described in the Container Properties section of the AWS docs, you have several problems with the syntax of your command directive.
First
The command directive can only run a single command. You can't mash together a bunch of commands as you're trying to do in your example. If you need to run multiple commands you would need to embed them as an argument to a shell. For example, something like:
command: ["/bin/sh", "-c", "mkdir -p logging; chmod 777 logging; ..."]
Second
You must properly tokenize your command lines -- that is, when you type mkdir -p logging at the command prompt, the shell splits this into three parts (or "tokens"): ['mkdir', '-p', 'logging']. You need to do the same thing when building up the list of arguments to command.
This is invalid:
command: ["mkdir -p logging"]
That would look for a command named mkdir -p logging, and of course no such command exists. It would properly be written as:
command: ["mkdir", "-p", "logging"]
Third
I'm not very familiar with the AWS Batch environment, but it's unlikely you can run a docker command inside a Docker container as you're trying to do. It's unclear why you're doing this, though: why not just configure your AWS Batch job with the appropriate image, environment variables, etc.?
Take a look at some of these example job definitions.
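To illustrate the first two points, here is a rough sketch of what the command could look like, assuming the job definition already points at your application image and defines APP_USER, APP_ENVIRONMENT, and the other values as environment variables (so no nested docker run is needed):

"command": [
  "/bin/sh", "-c",
  "mkdir -p logging && chmod 777 logging && python my_app.py -t \"$APP_USER\" -e \"$APP_ENVIRONMENT\" -u \"$APP_USERNAME\" -p \"$APP_PASSWORD\" -i \"$IN_PATH\" -o \"$OUT_PATH\" -b tmp/"
]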
I am at my wits' end trying to figure this out.
When I execute the following command:
sudo -u icinga '/usr/lib//nagios/plugins/check_db2_health' '--database' 'mydatabase' '--environment' 'DB2DIR=/opt/IBM/db2/V11.1.4fp5a' '--environment' 'DB2INSTANCE=mydatabase' '--environment' 'INSTHOME=/srv/db2/home/mydatabase' '--report' 'short' '--username' 'icinga' '--mode' 'connection-time' '--warning' '50'
The output is as follows:
[DBinstance : mydatabase] Status : CRITICAL - cannot connect to mydatabase. install_driver(DB2) failed: Can't load '/usr/lib/nagios/plugins/PerlLib/lib/perl5/site_perl/5.18.2/x86_64-linux-thread-multi/auto/DBD/DB2/DB2.so' for module DBD::DB2: libdb2.so.1: cannot open shared object file: No such file or directory at /usr/lib/perl5/5.18.2/x86_64-linux-thread-multi/DynaLoader.pm line 190.
at (eval 10) line 3.
Compilation failed in require at (eval 10) line 3.
Perhaps a required shared library or dll isn't installed where expected
at /usr/lib//nagios/plugins/check_db2_health line 2627.
But when I log in as the user icinga using su - icinga
and run:
'/usr/lib//nagios/plugins/check_db2_health' '--database' 'mydatabase' '--environment' 'DB2DIR=/opt/IBM/db2/V11.1.4fp5a' '--environment' 'DB2INSTANCE=mydatabase' '--environment' 'INSTHOME=/srv/db2/home/mydatabase' '--report' 'short' '--username' 'icinga' '--mode' 'connection-time' '--warning' '50'
It works fine.
How do I set up environment variables when the sudo -u icinga command is fired?
I am on SUSE Linux.
I am essentially trying to set up a global environment variable, like the environment variables in icinga, that works across all commands executed on the server without having to use sudo -E etc., because I cannot change the way Icinga calls the plugin.
You need to source the db2profile when you sudo, for example by wrapping both commands in a shell:
sudo -u icinga sh -c '. /srv/db2/home/mydatabase/sqllib/db2profile; /usr/lib/nagios/plugins/check_db2_health ...'
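You can check that the profile exports what the DBD::DB2 driver needs (it normally sets LD_LIBRARY_PATH, which is what the libdb2.so.1 error points at); the path below assumes db2profile lives under the INSTHOME shown in the question:

sudo -u icinga sh -c '. /srv/db2/home/mydatabase/sqllib/db2profile; env | grep -E "DB2|LD_LIBRARY_PATH"'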
I have an Elastic Beanstalk environment running a Docker container that has a Node.js API. On the AWS console, if I select my environment and then go to Configuration/Software, I have the following:
Log groups: /aws/elasticbeanstalk/my-environment
Log streaming: Enabled
Retention: 3 days
Lifecycle: Keep after termination.
However, if I click on that log group in the CloudWatch console, I see a Last Event Time of some weeks ago (which I believe corresponds to when the environment was created) and no content in the logs.
Since this is a dockerized application, logs for the server itself should be at /aws/elasticbeanstalk/my-environment/var/log/eb-docker/containers/eb-current-app/stdouterr.log.
If I instead get the logs directly from the instances, by going once again to my EB environment, clicking "Logs" and then "Request last 100 Lines", the logging is happening correctly. I just can't see a thing when using CloudWatch.
Any help is gladly appreciated
I was able to get around this problem.
So the awslogs agent computes a fingerprint hash from the first line of your log file and the log stream key, and the problem was that the first line of my stdouterr.log file was actually an empty line!
After a couple of days of playing around and getting help from the good AWS support team, I first connected via SSH to the EC2 instance associated with the EB environment. You need to add the following line to the /etc/awslogs/config/beanstalklogs.conf file, right after the "file=/var/log/eb-docker/containers/eb-current-app/stdouterr.log" line:
file_fingerprint_lines=1-20
With this, you tell the agent to calculate the hash using lines 1 through 20 of the log file. You could change 20 to a larger or smaller number depending on your logging content; however, I don't know whether there is an upper limit for the value.
After doing so, you need to restart the AWS Logs Service on the instance.
For this you would execute:
sudo service awslogs stop
sudo service awslogs start
or simpler:
sudo service awslogs restart
After these steps I started using my environment and the logging was now being properly streamed to the CloudWatch console!
However, this will not survive a new deployment, the EC2 instance being replaced, or the Auto Scaling group spawning another instance.
To fix this permanently, you can add the log config via the .ebextensions directory at the root of your application before deploying.
I added a file called logs.config to the newly created .ebextensions directory with the following content:
files:
  "/etc/awslogs/config/beanstalklogs.conf":
    mode: "000644"
    user: root
    group: root
    content: |
      [/var/log/eb-docker/containers/eb-current-app/stdouterr.log]
      log_group_name=/aws/elasticbeanstalk/EB-ENV-NAME/var/log/eb-docker/containers/eb-current-app/stdouterr.log
      log_stream_name={instance_id}
      file=/var/log/eb-docker/containers/eb-current-app/*stdouterr.log
      file_fingerprint_lines=1-20
commands:
  01_remove_eb_stream_config:
    command: 'rm /etc/awslogs/config/beanstalklogs.conf.bak'
  02_restart_log_agent:
    command: 'service awslogs restart'
Replacing, of course, EB-ENV-NAME with my environment's name on EB.
Hope it can help someone else!
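If streaming still does not start after a change like this, the awslogs agent's own log on the instance usually explains why (the path below is the standard one for the Amazon Linux awslogs agent):

sudo tail -f /var/log/awslogs.log
# Watch for messages about the file fingerprint, missing file paths, or credential/permission errors.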
For 64-bit Amazon Linux 2 the setup is slightly different.
For log delivery, the AWS CloudWatch agent is installed in /opt/aws/amazon-cloudwatch-agent and the Elastic Beanstalk configuration is in /opt/aws/amazon-cloudwatch-agent/etc/beanstalk.json. It is set to ship the output of the container, assuming there's a file called stdouterr.log; here's a snippet of the config:
{
  "file_path": "/var/log/eb-docker/containers/eb-current-app/stdouterr.log",
  "log_group_name": "/aws/elasticbeanstalk/EB-ENV-NAME/var/log/eb-docker/containers/eb-current-app/stdouterr.log",
  "log_stream_name": "{instance_id}"
}
However, when I look for the file_path it doesn't exist; instead I have a file path that encodes the current Docker container ID: /var/log/eb-docker/containers/eb-current-app/eb-e4e26c0bc464-stdouterr.log.
This log file is created by the script /opt/elasticbeanstalk/config/private/eb-docker-log-start, which is started by the eb-docker-log service; the default contents of this file are:
EB_CONFIG_DOCKER_CURRENT_APP=`cat /opt/elasticbeanstalk/deployment/.aws_beanstalk.current-container-id | cut -c 1-12`
mkdir -p /var/log/eb-docker/containers/eb-current-app/
docker logs -f $EB_CONFIG_DOCKER_CURRENT_APP >> /var/log/eb-docker/containers/eb-current-app/eb-$EB_CONFIG_DOCKER_CURRENT_APP-stdouterr.log 2>&1
To temporarily fix the logging you can manually run the following (replacing the Docker ID), and logs will start to appear in CloudWatch:
ln -sf /var/log/eb-docker/containers/eb-current-app/eb-e4e26c0bc464-stdouterr.log /var/log/eb-docker/containers/eb-current-app/stdouterr.log
To make this permanent, I added an .ebextension that fixes the eb-docker-log service so it re-creates this link. Create a file in your source code under .ebextensions called fix-cloudwatch-logging.config and set its contents to:
files:
  "/opt/elasticbeanstalk/config/private/eb-docker-log-start":
    mode: "000755"
    owner: root
    group: root
    content: |
      EB_CONFIG_DOCKER_CURRENT_APP=`cat /opt/elasticbeanstalk/deployment/.aws_beanstalk.current-container-id | cut -c 1-12`
      mkdir -p /var/log/eb-docker/containers/eb-current-app/
      ln -sf /var/log/eb-docker/containers/eb-current-app/eb-$EB_CONFIG_DOCKER_CURRENT_APP-stdouterr.log /var/log/eb-docker/containers/eb-current-app/stdouterr.log
      docker logs -f $EB_CONFIG_DOCKER_CURRENT_APP >> /var/log/eb-docker/containers/eb-current-app/eb-$EB_CONFIG_DOCKER_CURRENT_APP-stdouterr.log 2>&1
commands:
  fix_logging:
    command: systemctl restart eb-docker-log.service
    cwd: /home/ec2-user
    test: "[ ! -L /var/log/eb-docker/containers/eb-current-app/stdouterr.log ] && systemctl is-active --quiet eb-docker-log"
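A couple of quick checks over SSH to confirm the fix took effect after a deploy (the agent log path is the CloudWatch agent's default and is assumed here):

# The symlink the patched service script should have created:
ls -l /var/log/eb-docker/containers/eb-current-app/stdouterr.log
# The CloudWatch agent's own log, to see whether it picked the file up:
sudo tail -n 50 /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log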
I'm trying to run some commands for my Node.js app that need to be run via SSH (Sequelize seeding, for instance); however, when I do so, I notice that the expected env vars are missing.
If I run eb printenv on my local machine, I see the expected environment variables that were set in my EB dashboard.
If I SSH in and run printenv, all of those variables I expect are missing.
So what happens is, when I run my seeds, I get an error:
node_modules/.bin/sequelize db:seed:all
ERROR: connect ECONNREFUSED 127.0.0.1:3306
I noticed that the port was wrong; it should be 5432. I checked with printenv to see if my environment variables were set, and they are not there. This leads me to suspect that the proper env variables are not loaded in my SSH session, and Node.js is falling back to the default values I provided in my config.
I found some solutions online, like running the following to load the env variables:
/opt/python/current/env
But the python directory doesn't exist. All that's inside /opt/ are the elasticbeanstalk and aws directories.
I had some success so I could at least see that the env variables exist somewhere on the server by running the following:
sudo /opt/elasticbeanstalk/bin/get-config environment --output YAML
But simply running this command does not fix the problem. All it does is output the expected env variables to the screen. That's something! At least I know they are definitely there! But the env variables are still not there when I run printenv.
So how do I fix this problem? Sequelize and Node.js are clearly not seeing the env variables either, and are trying to use the fallback default values that are set in my config file.
I know my answer is late, but I had the same problem, and after some attempts with a bash script I found a way to store these values in your shell's environment variables.
You can simply run the following command:
export env=`/opt/elasticbeanstalk/bin/get-config environment -k <your-variable-name>`
Now you will be able to easily access this value (it was stored in the shell variable env above):
echo $env
Afterward, you can use the variable however you like. In my case, I use it to decide which version of my code to build, in a file called build-script.sh whose content is as follows:
# get env variable to know in which environment this code is running in
export env=`/opt/elasticbeanstalk/bin/get-config environment -k environment`
# building the code based on the current environment
if [ "$env" = "production" ]
then
echo "building for production"
npm --prefix /var/app/current run build-prod
else
echo "building for non production"
npm --prefix /var/app/current run build-prod
fi
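For the original Sequelize case, the same pattern can export the individual settings before seeding; DB_HOST and DB_PORT below are placeholders for whatever property names you defined in the EB dashboard:

# Placeholder keys: replace DB_HOST/DB_PORT with your actual EB environment property names.
export DB_HOST=`/opt/elasticbeanstalk/bin/get-config environment -k DB_HOST`
export DB_PORT=`/opt/elasticbeanstalk/bin/get-config environment -k DB_PORT`
cd /var/app/current
node_modules/.bin/sequelize db:seed:all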
Hope this helps anyone facing the same issue 🤟🏻