AWS EMR - Run a bash script on master node

I am adding a step to an EMR cluster via Airflow using a BashOperator. In the bash command, I want to extract information about a previous Spark step. The issue is that the previous Spark step's information is available only on the master node, so I have to make sure my bash command runs there. Is there any way to ensure that my command runs only on the master node and not on the worker nodes?
bash_cmd = \
    "steps=`aws emr add-steps --region ap-southeast-1 --cluster-id xxxxxxxx " + \
    "--steps 'Type=CUSTOM_JAR,Name=bash_test,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[" + \
    "bash, " + \
    "-c, " + \
    " aws s3 cp s3://path_to_bucket_S3/userdata.sh .; chmod +x userdata.sh; ./userdata.sh]'`; "

step1 = BashOperator(
    task_id='step_1',
    bash_command=bash_cmd,
    xcom_push=True,
    dag=dag
)
Is there any way to make sure the above step/bash commands run only on master node?

Check out this section of the AWS EMR documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#emr-bootstrap-runif
You can incorporate this check into your bash command and run the rest of it only if the current node is the master node, by first checking with grep isMaster /mnt/var/lib/info/instance.json.
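For instance, the check could be prepended to userdata.sh itself, or to the inline bash -c command. This is only a sketch: the grep pattern is deliberately permissive because the exact JSON formatting of instance.json is not guaranteed, and the S3 path is the one from the question.

#!/bin/bash
# Bail out early on core/task nodes; only the master node runs the real work.
# /mnt/var/lib/info/instance.json is written by EMR on every node.
if ! grep -q '"isMaster"[[:space:]]*:[[:space:]]*true' /mnt/var/lib/info/instance.json; then
    echo "Not the master node; skipping." >&2
    exit 0
fi
aws s3 cp s3://path_to_bucket_S3/userdata.sh .
chmod +x userdata.sh
./userdata.sh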

Related

Spark kubernetes pod fails with no discernable errors

I'm using spark-submit on version 2.4.5 to create a spark driver pod on my k8s cluster. When I run
bin/spark-submit \
  --master k8s://https://my-cluster-url:443 \
  --deploy-mode cluster \
  --name spark-test \
  --class com.my.main.Class \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.allocation.batch.size=3 \
  --conf spark.kubernetes.namespace=my-namespace \
  --conf spark.kubernetes.container.image.pullSecrets=my-cr-secret \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.my-vol.mount.path=/opt/spark/work-dir/src/main/resources/ \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.my-vol.options.claimName=my-pvc \
  --conf spark.kubernetes.container.image=my-registry.io/spark-test:test-2.4.5 \
  local:///opt/spark/work-dir/my-service.jar
spark-submit successfully creates a pod in my k8s cluster, and the pod makes it into the running state. The pod then quickly stops with an error status. Looking at the pod's logs I see
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=<SPARK_DRIVER_BIND_ADDRESS> --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class com.my.main.Class spark-internal
20/03/04 16:44:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
But no other errors. The + lines in the log correspond to the commands executed in kubernetes/dockerfiles/spark/entrypoint.sh in the spark distribution. So it looks like it makes it through the entire entrypoint script and attempts to run the final command exec /usr/bin/tini -s -- "${CMD[@]}"
before failing after those log4j warnings. How can I debug this issue further?
edit for more details:
Pod events, as seen in kubectl describe po ...:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m41s default-scheduler Successfully assigned my-namespace/spark-test-1583356942292-driver to aks-agentpool-12301882-10
Warning FailedMount 3m40s kubelet, aks-agentpool-12301882-10 MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "spark-test-1583356942292-driver-conf-map" not found
Normal Pulling 3m37s kubelet, aks-agentpool-12301882-10 Pulling image "my-registry.io/spark-test:test-2.4.5"
Normal Pulled 3m37s kubelet, aks-agentpool-12301882-10 Successfully pulled image "my-registry.io/spark-test:test-2.4.5"
Normal Created 3m36s kubelet, aks-agentpool-12301882-10 Created container spark-kubernetes-driver
Normal Started 3m36s kubelet, aks-agentpool-12301882-10 Started container spark-kubernetes-driver
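For reference, a few generic kubectl checks that can pull the container's exit code and terminated state out of the dead driver pod (a sketch only, not specific to this failure; the pod name is the one from the events above):

kubectl -n my-namespace logs spark-test-1583356942292-driver          # full driver log, including anything after the log4j warnings
kubectl -n my-namespace describe pod spark-test-1583356942292-driver  # events, volume mounts, and the container's terminated state / exit code
kubectl -n my-namespace get pod spark-test-1583356942292-driver \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'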
My Dockerfile – slightly adapted from the provided spark Dockerfile, and built using ./bin/docker-image-tool.sh:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
FROM openjdk:8-jdk-slim
ARG spark_jars=jars
ARG img_path=kubernetes/dockerfiles
ARG k8s_tests=kubernetes/tests
ARG work_dir=/opt/spark/work-dir
# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
RUN set -ex && \
apt-get update && \
ln -s /lib /lib64 && \
apt install -y bash tini libc6 libpam-modules libnss3 && \
mkdir -p /opt/spark && \
mkdir -p ${work_dir} && \
mkdir -p /opt/spark/conf && \
touch /opt/spark/RELEASE && \
rm /bin/sh && \
ln -sv /bin/bash /bin/sh && \
echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
rm -rf /var/cache/apt/* && \
mkdir -p ${work_dir}/src/main/resources && \
mkdir -p /var/run/my-service && \
mkdir -p /var/log/my-service
COPY ${spark_jars} /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY ${img_path}/spark/entrypoint.sh /opt/
COPY examples /opt/spark/examples
COPY ${k8s_tests} /opt/spark/tests
COPY data /opt/spark/data
ADD conf/log4j.properties.template /opt/spark/conf/log4j.properties
ADD kubernetes/jars/my-service-*-bin.tar.gz ${work_dir}
RUN mv "${work_dir}/my-service-"*".jar" "${work_dir}/my-service.jar"
ENV SPARK_HOME /opt/spark
WORKDIR ${work_dir}
ENTRYPOINT [ "/opt/entrypoint.sh" ]

How to create Docker Container for Mongodb with data which should be deployed to azure cluster

Does anyone know how to create a Docker container with the MongoDB database files copied in? I have seen the dump-and-restore mechanism, but that is not helpful if I need to deploy the container to an Azure cluster. In my case the MongoDB database changes very frequently.
Here is my current docker file:
FROM microsoft/windowsservercore:ltsc2016
SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop';"]
ENV MONGO_VERSION 3.4.19
COPY mongo.msi mongo.msi
COPY MongoData/db C:\\data\\db
RUN Write-Host 'Installing ...'; \
Start-Process msiexec -Wait \
-ArgumentList #( \
'/i', \
'mongo.msi', \
'/quiet', \
'/qn', \
'INSTALLLOCATION=C:\mongodb', \
'ADDLOCAL=all' \
); \
$env:PATH = 'C:\mongodb\bin;' + $env:PATH; \
[Environment]::SetEnvironmentVariable('PATH', $env:PATH, [EnvironmentVariableTarget]::Machine); \
\
Write-Host 'Verifying install ...'; \
Write-Host ' mongo --version'; mongo --version; \
Write-Host ' mongod --version'; mongod --version; \
\
Write-Host 'Removing ...'; \
Remove-Item C:\mongodb\bin\*.pdb -Force; \
Remove-Item C:\windows\installer\*.msi -Force; \
Remove-Item mongo.msi -Force; \
\
Write-Host 'Complete.';
EXPOSE 27017
CMD ["mongod", "--bind_ip_all"]
Your help would be greatly appreciated.
Thanks,
Buddha
Thanks for the quick response @Charles Xu. I found the solution for this. What we should do is:
1. Create a docker container.
2. With the Mongo service running, dump the database (this extracts the BSON and JSON files from the MongoDB data files). Run the mongodump --host "localhost:27017" command in PowerShell in admin mode.
3. Restore the extracted data into the container. Run the command mongorestore --host localhost:27030 (the docker container is running on this port).
4. Commit the docker changes, which pushes the data changes made in step 3 into the docker image.
5. Publish/push the docker image to Azure, which will have the data persisted.
Hope this helps somebody.
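As a rough sketch, those steps map onto the following commands (the container name, image tag, and Azure registry below are placeholders; 27030 is the host port mentioned above):

mongodump --host "localhost:27017" --out ./dump                              # step 2: dump the source database
docker run -d -p 27030:27017 --name mongo-seed my-mongo-image                # run a container from the image built above
mongorestore --host "localhost:27030" ./dump                                 # step 3: restore the dump into the container
docker commit mongo-seed myregistry.azurecr.io/my-mongo-image:with-data      # step 4: bake the restored data into a new image
docker push myregistry.azurecr.io/my-mongo-image:with-data                   # step 5: push the image to Azure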

sidekiq.yml file is not being considered

I have installed gitlab community edition on my raspberry pi 3. Everything is working fine. But when the application is up there are 25 sidekiq threads. It's eating up my memory and I don't want so many threads.
I tried to control this by adding the file /opt/gitlab/embedded/service/gitlab-rails/config/sidekiq.yml.
# Sample configuration file for Sidekiq.
# Options here can still be overridden by cmd line args.
# Place this file at config/sidekiq.yml and Sidekiq will
# pick it up automatically.
---
:verbose: false
:concurrency: 5
# Set timeout to 8 on Heroku, longer if you manage your own systems.
:timeout: 30
# Sidekiq will run this file through ERB when reading it so you can
# even put in dynamic logic, like a host-specific queue.
# http://www.mikeperham.com/2013/11/13/advanced-sidekiq-host-specific-queues/
:queues:
  - critical
  - default
  - <%= `hostname`.strip %>
  - low
# you can override concurrency based on environment
production:
  :concurrency: 5
staging:
  :concurrency: 5
I have restarted the application many times and even ran "reconfigure". It's not helping. It's not considering the sidekiq.yml file at all.
Can anybody please let me know where I am going wrong?
I found your question while searching for a solution to the same problem. Nothing I found worked, so I tried by myself and found the right place to reduce sidekiq from 25 to 5 threads. I use the GitLab omnibus version, and I think the path is identical to yours:
/opt/gitlab/sv/sidekiq/run
In this file you find the following code:
#!/bin/sh
cd /var/opt/gitlab/gitlab-rails/working
exec 2>&1
exec chpst -e /opt/gitlab/etc/gitlab-rails/env -P \
-U git -u git \
/opt/gitlab/embedded/bin/bundle exec sidekiq \
-C /opt/gitlab/embedded/service/gitlab-rails/config/sidekiq_queues.yml \
-e production \
-r /opt/gitlab/embedded/service/gitlab-rails \
-t 4 \
-c 25
Change the last line to "-c 5". The result should look like this:
#!/bin/sh
cd /var/opt/gitlab/gitlab-rails/working
exec 2>&1
exec chpst -e /opt/gitlab/etc/gitlab-rails/env -P \
-U git -u git \
/opt/gitlab/embedded/bin/bundle exec sidekiq \
-C /opt/gitlab/embedded/service/gitlab-rails/config/sidekiq_queues.yml \
-e production \
-r /opt/gitlab/embedded/service/gitlab-rails \
-t 4 \
-c 5
Last but not least, you have to restart the GitLab service:
sudo gitlab-ctl restart
No idea what happens on a GitLab update; I think I will have to change this value again. It would be nice if the GitLab developers added this option to gitlab.rb in the /etc/gitlab directory.
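To double-check that the change took effect after the restart, something like this works (the process-title format in the comment is how recent Sidekiq versions report it, so it may look slightly different on your install):

sudo gitlab-ctl status sidekiq     # the sidekiq service should be back up
ps -ef | grep '[s]idekiq'          # the process title should now read something like "... [0 of 5 busy]"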

virt-install script in crontab how to control tty

I have a script that creates a virtual machine using virt-install. This script uses a kickstart file for unattended installation. It works perfectly fine when triggered from a shell, but it throws the following error when triggered through crontab:
error: Cannot run interactive console without a controlling TTY
The VM creation process continues in the background, but my script doesn't wait for virt-install to complete and moves on to the next commands. I want my script to wait for the virt-install command to finish its job before moving on to the next command. Is there any way I can either get control of a TTY or make my script wait for virt-install to complete?
Edit
Here is the virt-install command that my script executes (in case it helps you figuring out the issue):
virt-install --connect=qemu:///system \
--network=bridge:$BRIDGE \
$nic2 \
--initrd-inject=$tmp_ks_file \
--controller type=scsi,model=virtio-scsi \
--extra-args="ks=file:/$(basename $tmp_ks_file) console=tty0 console=ttyS0,115200" \
--name=$img_name \
--disk $libvirt_dir/$img_name.img,size=$disk \
--ram $mem \
--vcpus=2 \
--check-cpu \
--accelerate \
--hvm \
--location=$tree \
--nographics
Thanks in advance,
Kashif
I was finally able to resolve this issue with two steps:
First, remove the console-related configuration from the virt-install command (see --extra-args in the command above).
Second, add some logic that waits for virt-install to complete. I added a shutdown to the %post section of the kickstart file so that the VM shuts off once it has finished installing all the packages; my script then waits for the VM to reach the shut-off state before moving on to the next command.
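A minimal sketch of that wait logic, assuming $img_name is the same variable used in the virt-install command above and that the kickstart %post section powers the VM off when it is done:

# Poll the domain state until the freshly installed VM has shut itself off.
while [ "$(virsh --connect qemu:///system domstate "$img_name" 2>/dev/null)" != "shut off" ]; do
    sleep 30
done
echo "Installation of $img_name finished; continuing."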
This way I am able to run my script from crontab, and it also worked with Jenkins.
Hope this helps someone facing the same issue.

Inside Docker container, cronjobs are not getting executed

I have made a Docker image from a Dockerfile, and I want a cronjob executed periodically when a container based on this image is running. My Dockerfile is this (the relevant parts):
FROM l3iggs/archlinux:latest
COPY source /srv/visitor
WORKDIR /srv/visitor
RUN pacman -Syyu --needed --noconfirm \
&& pacman -S --needed --noconfirm make gcc cronie python2 nodejs phantomjs \
&& printf "*/2 * * * * node /srv/visitor/visitor.js \n" >> cronJobs \
&& crontab cronJobs \
&& rm cronJobs \
&& npm install -g node-gyp \
&& PYTHON=/usr/sbin/python2 && export PYTHON \
&& npm install
EXPOSE 80
CMD ["/bin/sh", "-c"]
After creation of the image I run a container and verify that indeed the cronjob has been added:
crontab -l
*/2 * * * * node /srv/visitor/visitor.js
Now, the problem is that the cronjob is never executed. I have, of course, tested that "node /srv/visitor/visitor.js" executes properly when run manually from the console.
Any ideas?
One option is to use the host's crontab in the following way:
0 5 * * * docker exec mysql mysqldump --databases myDatabase -u myUsername -pmyPassword > /backups/myDatabase.sql
The above will periodically take a daily backup of a MySQL database.
If you need to chain complicated commands you can also use this format:
0 5 * * * docker exec mysql sh -c 'mkdir -p /backups/`date +\%d` && for DB in myDB1 myDB2 myDB3; do mysqldump --databases $DB -u myUser -pmyPassword > /backups/`date +\%d`/$DB.sql; done'
The above takes a 30 day rolling backup of multiple databases and does a bash for loop in a single line rather than writing and calling a shell script to do the same. So it's pretty flexible.
Or you could also put complicated scripts inside the docker container and run them like so:
0 5 * * * docker exec mysql /dailyCron.sh
It's a little tricky to answer this definitively, as I don't have time to test, but you have various options open to you:
You could use the Phusion base image, which comes with an init system and cron installed. It is based on Ubuntu and is comparatively heavyweight (at least compared to archlinux) https://registry.hub.docker.com/u/phusion/baseimage/
If you're happy to have everything started from cron jobs, you could just start cron from your CMD and keep it in the foreground (cron -f).
You can use a lightweight process manager to start cron and whatever other processes you need (Phusion uses runit; Docker seems to recommend supervisor).
You could write your own CMD or ENTRYPOINT script that starts cron and your process. The only issue with this is that you will need to be careful to handle signals properly or you may end up with zombie processes.
In your case, if you're just playing around, I'd go with the last option; if it's anything more serious, I'd go with a process manager.
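As a minimal sketch of the CMD/entrypoint route, assuming the image keeps the cronie package installed by the Dockerfile above: cronie's daemon is crond, and -n keeps it in the foreground (other cron implementations use cron -f instead).

#!/bin/bash
# Hypothetical /entrypoint.sh: run cron in the foreground so the container stays alive
# and the crontab installed at build time actually gets executed.
exec crond -n

The image's CMD (or ENTRYPOINT) would then point at this script instead of the bare /bin/sh -c.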
If you're running your Docker container with --net=host, see this thread:
https://github.com/docker/docker/issues/5899
I had the same issue, and my cron tasks started running when I included --pid=host in the docker run command line arguments.
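For completeness, that combination would look roughly like this (the image name is a placeholder):

docker run -d --net=host --pid=host my-cron-image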
