I would like to launch multiple Spark jobs sequentially in GCP, like
gcloud dataproc jobs submit spark file1.py
gcloud dataproc jobs submit spark file2.py
...
so that each job starts only once the previous one has completed.
Is there any way to do this?
This can be done using Dataproc Workflow Templates.
The workflow will create the cluster before running the jobs and delete it once they finish.
These are the steps you can follow to create the workflow:
Create your workflow template
export REGION=us-central1
gcloud dataproc workflow-templates create workflow-id \
--region $REGION
Set the managed cluster that will be used to run the jobs
gcloud dataproc workflow-templates set-managed-cluster workflow-id \
--region $REGION \
--master-machine-type machine-type \
--worker-machine-type machine-type \
--num-workers number \
--cluster-name cluster-name
Add the jobs as steps to your workflow
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file1.py \
--region $REGION \
--step-id job1 \
--workflow-template workflow-id
The second job needs the parameter --start-after to make sure it runs after the first job.
gcloud dataproc workflow-templates add-job pyspark gs://bucket-name/file2.py \
--region $REGION \
--step-id job2 \
--start-after job1 \
--workflow-template workflow-id
Run the workflow
gcloud dataproc workflow-templates instantiate workflow-id \
--region $REGION
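If you prefer to keep the whole definition in a file, the same template can also be written as YAML and imported. A minimal sketch, assuming example ids, bucket paths, and machine types (adjust them to your own setup):
cat > workflow.yaml <<'EOF'
# Two PySpark steps; job2 only starts after job1 completes
jobs:
- stepId: job1
  pysparkJob:
    mainPythonFileUri: gs://bucket-name/file1.py
- stepId: job2
  prerequisiteStepIds:
  - job1
  pysparkJob:
    mainPythonFileUri: gs://bucket-name/file2.py
placement:
  managedCluster:
    clusterName: cluster-name
    config:
      masterConfig:
        machineTypeUri: n1-standard-4
      workerConfig:
        machineTypeUri: n1-standard-4
        numInstances: 2
EOF
# Import (create or overwrite) the template, then instantiate it as above
gcloud dataproc workflow-templates import workflow-id \
  --source workflow.yaml \
  --region $REGION
gcloud dataproc workflow-templates instantiate workflow-id \
  --region $REGION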
I've been learning Azure for two weeks now, and I'm curious about creating VMs from the CLI. For example, I want to create a VM in a region with a specific size, but if it turns out that the size isn't available, I want the script to automatically select another size with the same specs as the one I originally chose, and I don't know how to do that. This is my script now:
az group create --name TestingVM --location eastus
az vm create \
--resource-group TestingVM \
--name TestingVM \
--image Debian \
--admin-username aloha \
--size Standard_F8s_v2 \
--location eastus \
--admin-password *somepassword*
thanks!
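One way to approach the fallback is to first check which sizes the subscription can actually use in that region; a rough sketch:
# List Standard_F sizes in eastus; the Restrictions column flags sizes that are
# not available (e.g. NotAvailableForSubscription)
az vm list-skus \
  --location eastus \
  --size Standard_F \
  --output table
From that list a fallback size with the same specs (Standard_F8s_v2 is 8 vCPUs / 16 GiB) could be picked and passed to az vm create through a shell variable.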
I'm using Spark 2.4.5 to run a Spark application on Kubernetes through the spark-submit command. The application fails while trying to write outputs as detailed here, probably due to an incorrect security context. So I tried setting up a security context and running the application. I did this by creating a pod template as mentioned here, but I haven't been able to validate whether the pod template has been set up properly (because I couldn't find proper examples), or whether it's accessible from the driver and executor pods (since I couldn't find anything related to the template in the driver or Kubernetes logs). This is the content of the pod template I used to set a security context.
apiVersion: v1
kind: Pod
metadata:
  name: spark-pod-template
spec:
  securityContext:
    runAsUser: 1000
This is the command I used.
<SPARK_PATH>/bin/spark-submit --master k8s://https://dssparkcluster-dns-fa326f6a.hcp.southcentralus.azmk8s.io:443 \
--deploy-mode cluster --name spark-pi3 --conf spark.executor.instances=2 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=docker.io/datamechanics/spark:2.4.5-hadoop-3.1.0-java-8-scala-2.11-python-3.7-dm14 \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.azure-fileshare-pvc.options.claimName=azure-fileshare-pvc \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.azure-fileshare-pvc.mount.path=/opt/spark/work-dir \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.azure-fileshare-pvc.options.claimName=azure-fileshare-pvc \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.azure-fileshare-pvc.mount.path=/opt/spark/work-dir \
--conf spark.kubernetes.driver.podTemplateFile=/opt/spark/work-dir/spark_pod_template.yml \
--conf spark.kubernetes.executor.podTemplateFile=/opt/spark/work-dir/spark_pod_template.yml \
--verbose /opt/spark/work-dir/wordcount2.py
I've placed the pod template file in a persistent volume mounted at /opt/spark/work-dir. The questions I have are:
Is the pod template file accessible from the persistent volume?
Are the file contents in the appropriate format for setting a runAsUser?
Is the pod template functionality supported for Spark 2.4.5? Although it is mentioned in the 2.4.5 docs that security contexts can be implemented using pod templates, there is no pod template section as in the 3.2.0 docs.
Any help would be greatly appreciated. Thanks.
As you can read at https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template, you set spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile "to point to files accessible to the spark-submit process".
So the path refers to the local filesystem of the machine running spark-submit, not to a PVC mounted inside the driver and executor pods. You can check whether the template is actually being applied by inspecting the generated pods.
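For example (the pod name below is only an illustration; the actual driver pod name depends on the run):
kubectl get pods
# Look for the securityContext that should come from the template
kubectl get pod spark-pi3-1234567890-driver -o yaml | grep -i -B 2 -A 4 securitycontext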
A good pod & container security context:
securityContext:
  fsGroup: 1000
  runAsGroup: 1000
  runAsNonRoot: true
  runAsUser: 1000
containers:
- name: spark
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    readOnlyRootFilesystem: true
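Once the application is running, you can also confirm that runAsUser took effect inside the container (again, the pod name is just an example):
# Should print uid=1000 rather than uid=0(root) if the template was applied
kubectl exec spark-pi3-1234567890-driver -- id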
I am attempting to spin up a single node Dataproc "cluster" in GCP that installs additional packages from both conda-forge and a custom Conda channel. The gcloud command I run is:
gcloud beta dataproc clusters create MY_CLUSTER_NAME \
--enable-component-gateway \
--bucket MY_GCS_BUCKET \
--region us-central1 \
--subnet default \
--zone us-central1-a \
--single-node \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--image-version 1.5-ubuntu18 \
--properties spark:spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.4,spark-env:spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.4 \
--optional-components ANACONDA,JUPYTER \
--max-idle 7200s \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project MY_PROJECT_ID \
--metadata='CONDA_PACKAGES=pandas matplotlib seaborn scikit-learn MY_CUSTOM_PACKAGE' \
--metadata='CONDA_CHANNELS=conda-forge https://MY_CUSTOM_CONDA_CHANNEL'
I have verified that I can conda install -c https://MY_CUSTOM_CONDA_CHANNEL MY_CUSTOM_PACKAGE locally, and that the other packages are installable. When searching through the cluster's logs, I find no entries about the installation of the additional conda packages.
Questions:
Where can I find logs that will help me debug this problem?
Is there something wrong with the above command?
It seems that you didn't add the conda-install.sh init action when creating the cluster; see more details in this doc. For example:
gcloud dataproc clusters create my-cluster \
--image-version=1.4 \
--region=${REGION} \
--metadata='CONDA_PACKAGES=pandas matplotlib seaborn scikit-learn MY_CUSTOM_PACKAGE' \
--metadata='CONDA_CHANNELS=conda-forge https://MY_CUSTOM_CONDA_CHANNEL' \
--initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh
You should be able to find the init action log at /var/log/dataproc-initialization-script-0.log, see more details in this doc.
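If the init action did run, you can also pull that log straight off the master node. A quick sketch, assuming the cluster name and zone from the question (on a single-node cluster the master is named <cluster-name>-m):
# Print the tail of the conda-install init action log on the master node
gcloud compute ssh MY_CLUSTER_NAME-m \
  --zone us-central1-a \
  --command 'sudo tail -n 100 /var/log/dataproc-initialization-script-0.log'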
Using the AWS CLI, is it possible to find the tasks running on an EC2 instance? I tried
aws ecs describe-container-instances --cluster my-prod --container-instances xxxxx-b5ab-4606-b8ec-xxxxxxxxx --region us-east-1 --profile mfa
but it did not return that information.
In the browser console, under the ECS Instances tab, if I select a container instance I do see this information.
You can try the script below. It does the following:
Get the container instances in the cluster
Find the tasks on each container instance
Get the EC2 instance ID backing each container instance
Map tasks to instances
#!/bin/bash
CLUSTER_NAME=my-cluster
REGION=us-west-2

# All container instances registered to the cluster
CONTAINER_INSTANCES="$(aws ecs list-container-instances --cluster $CLUSTER_NAME --region $REGION --query 'containerInstanceArns[]' --output text)"

for container in $CONTAINER_INSTANCES; do
    # Tasks running on this container instance
    TASKS=$(aws ecs list-tasks --cluster $CLUSTER_NAME --container-instance $container --region $REGION --query 'taskArns[]' --output text)
    # EC2 instance backing this container instance
    EC2_INSTANCE_ID=$(aws ecs describe-container-instances --cluster $CLUSTER_NAME --container-instances $container --region $REGION --query 'containerInstances[*].ec2InstanceId' --output text)
    echo "**************************************"
    echo "ECS task ARN(s): $TASKS"
    echo "Running on EC2 instance: $EC2_INSTANCE_ID"
done
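If you only care about one specific EC2 instance, the same calls can be turned around; a rough sketch with example values:
#!/bin/bash
CLUSTER_NAME=my-cluster
REGION=us-west-2
TARGET_EC2_ID=i-0123456789abcdef0   # example instance ID

for container in $(aws ecs list-container-instances --cluster $CLUSTER_NAME --region $REGION --query 'containerInstanceArns[]' --output text); do
    # EC2 instance backing this container instance
    EC2_ID=$(aws ecs describe-container-instances --cluster $CLUSTER_NAME --container-instances $container --region $REGION --query 'containerInstances[*].ec2InstanceId' --output text)
    if [ "$EC2_ID" = "$TARGET_EC2_ID" ]; then
        # Tasks running on that instance
        aws ecs list-tasks --cluster $CLUSTER_NAME --container-instance $container --region $REGION --query 'taskArns[]' --output text
    fi
done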
See also: Find instance or instance ID of AWS ECS running TASK or Services.
You might also be interested in get-ecsIP-for-ecs-service.
Metrics for our Azure Container Instances have stopped showing up in the portal and when querying Azure Monitor using the CLI.
I've tried redeploying the instances, restarting the containers, and disabling features such as Log Analytics.
These are the options we pass to az container create to deploy our containers, with the values redacted:
az container create \
--resource-group "" \
--name "" \
--image \
--registry-username "" \
--registry-password "" \
--ports \
--ip-address public \
--dns-name-label "" \
--azure-file-volume-account-name "" \
--azure-file-volume-account-key "" \
--azure-file-volume-share-name "" \
--azure-file-volume-mount-path "" \
--cpu 1 \
--memory 1 \
--log-analytics-workspace "" \
--log-analytics-workspace-key ""
According to the documentation, metrics should just be there, so I'm curious why they have seemingly stopped. Is there some newly introduced option that needs to be enabled?
There doesn't seem to be a problem with the CLI command itself, assuming the parameters you pass are correct. So my guess is that the container instance is not in the Running state, which would also stop the metrics.
You can take a look at the steps for container instance logging with Azure Monitor logs to see if there are other steps missing.
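A quick way to check that from the CLI (resource group and container group name are placeholders):
# Metrics are only emitted while the container group is running, so confirm its state first
az container show \
  --resource-group "<resource-group>" \
  --name "<container-group-name>" \
  --query "instanceView.state" \
  --output tsv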