airflow 2.3.3 SparkKubernetesOperator with SparkKubernetesSensor in TaskGroup


I have a SparkKubernetesOperator >> SparkKubernetesSensor dependency.
It works fine outside of a TaskGroup, but it does not work when I put both tasks inside a TaskGroup: the sensor complains about the metadata field.
What am I doing wrong?
with TaskGroup("tg-task-1", default_args=default_args) as tg_task_1:
    task_1 = SparkKubernetesOperator(
        task_id='task-1',
        namespace="batch",
        application_file="k8s/task-1.yaml",
        do_xcom_push=True,
        dag=dag,
    )
    task_1_sensor = SparkKubernetesSensor(
        task_id='task-1-sensor',
        namespace="batch",
        application_name="{{ task_instance.xcom_pull(task_ids='task-1')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
        dag=dag,
        attach_log=True,
    )
This is the error I get:
jinja2.exceptions.UndefinedError: 'None' has no attribute 'metadata'

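I have just realised that the task_ids differ when tasks are defined within a TaskGroup: by default the group ID is prepended to each task ID. With the unprefixed ID, xcom_pull finds nothing and returns None, which is why the template fails on 'metadata'. So the task_ids should be:
application_name="{{ task_instance.xcom_pull(task_ids='tg-task-1.task-1')['metadata']['name'] }}",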
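To avoid typing the prefixed ID by hand, the template string can be built from the group ID and the task ID. This is only a minimal sketch in plain Python: the helper names prefixed_task_id and application_name_template are hypothetical and not part of Airflow, and the sketch assumes the TaskGroup uses its default behaviour of prefixing task IDs with the group ID.

```python
def prefixed_task_id(task_id, group_id=None):
    """Return the task ID as Airflow registers it for a task in a TaskGroup.

    Assumes the group's default behaviour (task IDs prefixed with '<group_id>.').
    """
    return f"{group_id}.{task_id}" if group_id else task_id


def application_name_template(task_id, group_id=None):
    """Build the Jinja template that pulls the Spark application name from XCom."""
    full_id = prefixed_task_id(task_id, group_id)
    return (
        "{{ task_instance.xcom_pull(task_ids='"
        + full_id
        + "')['metadata']['name'] }}"
    )


# Hypothetical usage: pass the result as application_name= to the sensor.
template = application_name_template("task-1", group_id="tg-task-1")
print(template)
```

If you would rather keep the short IDs, TaskGroup also accepts prefix_group_id=False, in which case the original template with task_ids='task-1' should resolve again; task IDs must then be unique across the whole DAG.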

Related

How to get the passed parameter inside python container in AWS Batch job?

I have two job definitions (job-1, job-2), and I execute Job1 first. Job1 then submits Job2, which starts executing. I need to pass some parameters to Job2 when submitting it. Below is my Python 3 code:
# job1
import boto3
import os
env = os.environ.get('environment')
batch = boto3.client('batch')
def submit_job():
    return batch.submit_job(
        jobName='Job2',
        jobQueue='job2-queue-dev',
        jobDefinition='job-2',
        containerOverrides={
            'environment': [
                {
                    'name': 'environment',
                    'value': env
                },
            ]
        },
        parameters={
            'opco': '123',
            'app': 'app1'
        },
    )
submit_job()
In Job2 I can easily get the environment variable with the code below.
# job2
import os
env = os.environ.get('environment')
def get_index_name(env):
    return 'liberty-'+env
....
So my question is: how can we get those parameters (opco, app) inside Job2?
FYI, I could pass them as environment variables, but I want to know how parameter retrieval is done here.
Thanks in advance.
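For context: AWS Batch does not expose the parameters map to the container as environment variables. Instead, it substitutes the values into Ref::name placeholders in the job definition's command, so the container receives them as ordinary command-line arguments. The sketch below assumes (this is not shown in the question) that the job-2 definition has a command such as ["python", "job2.py", "--opco", "Ref::opco", "--app", "Ref::app"]; the function name parse_job_parameters is hypothetical.

```python
import argparse


def parse_job_parameters(argv):
    """Parse the values AWS Batch substituted for Ref::opco and Ref::app.

    Assumes the job definition command ends with:
        --opco Ref::opco --app Ref::app
    """
    parser = argparse.ArgumentParser(description="job2 entry point (sketch)")
    parser.add_argument("--opco", required=True)
    parser.add_argument("--app", required=True)
    return parser.parse_args(argv)


# Inside the container you would pass sys.argv[1:]; a literal list is used
# here so the example runs anywhere.
args = parse_job_parameters(["--opco", "123", "--app", "app1"])
print(args.opco, args.app)
```

If the job definition's command contains no Ref:: placeholders, the submitted parameters have nowhere to go, which is why nothing shows up inside the container.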

KubernetesPodOperator Not Sending Arguments as expected

I have airflow running KubernetesPodOperator in order to do a Spark-submit call:
spark_image = f'{getenv("REGISTRY")}/myApp:{getenv("TAG")}'
j2g = KubernetesPodOperator(
    dag=dag,
    task_id='myApp',
    name='myApp',
    namespace='data',
    image=spark_image,
    cmds=['/opt/spark/bin/spark-submit'],
    configmaps=["data"],
    arguments=[
        '--master k8s://https://10.96.0.1:443',
        '--deploy-mode cluster',
        '--name myApp',
        f'--conf spark.kubernetes.container.image={spark_image}',
        'local:///app/run.py'
    ],
However, I'm getting the following error:
Error: Unrecognized option: --master k8s://https://10.96.0.1:443
This is weird, because when I open a shell (/bin/bash) in a running pod and execute the same spark-submit command, it works.
Any idea how to pass the arguments as expected?

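Solution from the GitHub ticket: each element of arguments is passed to the container as a single argument, so the option and its value must be joined with an equals sign rather than a space. The parameter should be sent like:
'--master=k8s://https://10.96.0.1:443',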
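The error message itself shows what happened: the whole string '--master k8s://https://10.96.0.1:443' reached spark-submit as one token, space included. Besides the --option=value form, another way around this is to split every entry into separate tokens before handing the list to the operator. The following is a minimal sketch of that idea; the helper name split_arguments is hypothetical and not part of Airflow, and the image name is a placeholder.

```python
import shlex


def split_arguments(arguments):
    """Split 'flag value' strings into one argv token per list element."""
    tokens = []
    for arg in arguments:
        # shlex.split follows shell quoting rules, so quoted values that
        # contain spaces stay together as a single token.
        tokens.extend(shlex.split(arg))
    return tokens


spark_image = "registry.example/myApp:latest"  # placeholder value
arguments = split_arguments([
    "--master k8s://https://10.96.0.1:443",
    "--deploy-mode cluster",
    "--name myApp",
    f"--conf spark.kubernetes.container.image={spark_image}",
    "local:///app/run.py",
])
print(arguments)
```

The resulting list can be passed as arguments= unchanged; each option and each value is then its own element, which is what spark-submit sees when the command is typed in a shell.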

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

I am new to Airflow automation. I don't know whether this is possible with Apache Airflow (or Luigi, etc.), or whether I should just write one long bash file to do it.
I want to build a DAG for the following steps:
1) Create/clone a cluster on AWS EMR
2) Install Python requirements
3) Install PySpark-related libraries
4) Get the latest code from GitHub
5) Submit the Spark job
6) Terminate the cluster on finish
For the individual steps I can write .sh files like the ones below (I am not sure whether this is a good approach), but I don't know how to do it in Airflow.
1) Creating a cluster with cluster.sh
aws emr create-cluster \
    --name "1-node dummy cluster" \
    --instance-type m3.xlarge \
    --release-label emr-4.1.0 \
    --instance-count 1 \
    --use-default-roles \
    --applications Name=Spark \
    --auto-terminate
2, 3 & 4) Cloning the repo and installing requirements with codesetup.sh
git clone some-repo.git
pip install -r requirements.txt
mv xyz.jar /usr/lib/spark/xyz.jar
5) Running the Spark job with sparkjob.sh
aws emr add-steps --cluster-id <Your EMR cluster id> --steps Type=spark,Name=TestJob,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,pythonjob.py,s3a://your-source-bucket/data/data.csv,s3a://your-destination-bucket/test-output/],ActionOnFailure=CONTINUE
6) Not sure, maybe this:
aws emr terminate-clusters --cluster-ids <value> [<value>...]
Finally, all of this can be executed as one .sh file. I need to know a good approach to this with Airflow/Luigi.
What I found:
This post comes close, but it is outdated (2016) and is missing the connections and the code for the playbooks:
https://www.agari.com/email-security-blog/automated-model-building-emr-spark-airflow/

I figured out that there are two options for doing this.
1) Write a bash script around aws emr create-cluster and add-steps, then use Airflow's BashOperator to schedule it.
Alternatively, there is a wrapper around these two commands called sparksteps.
An example from its documentation:
sparksteps examples/episodes.py \
    --s3-bucket $AWS_S3_BUCKET \
    --aws-region us-east-1 \
    --release-label emr-4.7.0 \
    --uploads examples/lib examples/episodes.avro \
    --submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
    --app-args="--input /home/hadoop/episodes.avro" \
    --tags Application="Spark Steps" \
    --debug
You can write a .sh script with the default options of your choice. After preparing this script, you can call it from Airflow's BashOperator as below:
# The trailing space keeps Airflow from treating the .sh path as a Jinja template file
create_command = "sparkstep_custom.sh "
t1 = BashOperator(
    task_id='create_file',
    bash_command=create_command,
    dag=dag
)
2) You can use Airflow's own AWS operators to do this:
EmrCreateJobFlowOperator (for launching the cluster)
EmrAddStepsOperator (for submitting the Spark job)
EmrStepSensor (to track when a step finishes)
EmrTerminateJobFlowOperator (to terminate the cluster when the step finishes)
Basic example that creates a cluster and submits steps:
# Steps to run on the cluster; S3_URI and JOB_FLOW_OVERRIDES are defined elsewhere
my_steps = [
    {
        'Name': 'setup - copy files',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'test.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'setup - copy files 3',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'myfiledependecy.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--jars', "jar1.jar,jar2.jar", '--py-files', '/home/hadoop/myfiledependecy.py', '/home/hadoop/test.py']
        }
    }
]
# Launch the cluster; the job flow ID is pushed to XCom as the return value
cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow2',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag
)
# Submit the steps to the cluster created above
step_adder_pre_step = EmrAddStepsOperator(
    task_id='pre_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=my_steps,
    dag=dag
)
# Wait for the first submitted step (index 0) to finish
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('pre_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)
# Terminate the cluster
cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)
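The step dictionaries above all share the same shape, so they can be generated instead of written out by hand. Below is a minimal sketch of that idea in plain Python; the helper name spark_submit_step and its parameters are hypothetical, and it only builds the dictionary that EmrAddStepsOperator receives in steps, it does not talk to AWS.

```python
def spark_submit_step(name, script, jars=None, py_files=None,
                      action_on_failure="CANCEL_AND_WAIT"):
    """Build one EMR step dict that runs spark-submit via command-runner.jar."""
    args = ["spark-submit"]
    if jars:
        # spark-submit expects a single comma-separated value for --jars
        args += ["--jars", ",".join(jars)]
    if py_files:
        # likewise for --py-files
        args += ["--py-files", ",".join(py_files)]
    args.append(script)
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": args,
        },
    }


# Reproduces the 'Run Spark' step from the example above.
run_spark = spark_submit_step(
    "Run Spark",
    "/home/hadoop/test.py",
    jars=["jar1.jar", "jar2.jar"],
    py_files=["/home/hadoop/myfiledependecy.py"],
)
print(run_spark["HadoopJarStep"]["Args"])
```

Note that the snippets above define the four tasks but do not show how they are chained; in a DAG they would typically run in the order create, add steps, watch step, terminate.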
Also, to upload code to S3 (I wanted to pick up the latest code from GitHub), you can use S3, boto3 and a PythonOperator.
Simple example:
import boto3
s3 = boto3.resource('s3')
S3_BUCKET = 'you_bucket_name'
S3_URI = 's3://{bucket}/'.format(bucket=S3_BUCKET)
def upload_file_to_S3(filename, key, bucket_name):
    # Upload a local file to the given bucket under the given key
    s3.Bucket(bucket_name).upload_file(filename, key)
upload_to_S3_task = PythonOperator(
    task_id='upload_to_S3',
    python_callable=upload_file_to_S3,
    op_kwargs={
        'filename': configdata['project_path']+'test.py',
        'key': 'test.py',
        'bucket_name': 'dep-buck',
    },
    dag=dag)

Airflow has operators for this; see the Airflow documentation.

Convert HQL to SparkSQL

I'm trying to convert HQL to Spark.
I have the following query (Works in Hue with Hive editor):
select reflect('java.util.UUID', 'randomUUID') as id,
       tt.employee,
       cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date,
       collect_set(tt.employee_detail) as employee_details,
       collect_set( tt.emp_indication ) as employees_indications,
       named_struct ('employee_info', collect_set(tt.emp_info),
                     'employee_mod_info', collect_set(tt.emp_mod_info),
                     'employee_comments', collect_set(tt.emp_comment) )
           as emp_mod_details,
from (
    select views_ctr.employee,
           if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail,
           if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info,
           if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment,
           if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info,
           if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication,
    from
        ( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr
) tt
group by employee
distribute by employee
First, I tried to run it through spark.sql as follows:
sparkSession.sql("select reflect('java.util.UUID', 'randomUUID') as id, tt.employee, cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date, collect_set(tt.employee_detail) as employee_details, collect_set( tt.emp_indication ) as employees_indications, named_struct ('employee_info', collect_set(tt.emp_info), 'employee_mod_info', collect_set(tt.emp_mod_info), 'employee_comments', collect_set(tt.emp_comment) ) as emp_mod_details, from ( select views_ctr.employee, if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail, if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info, if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment, if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info, if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication, from ( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr ) tt group by employee distribute by employee")
But I got the following exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
- object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper#30cfd641)
If I run my query without the collect_set function, it works. Could it be failing because of the struct column types in my table?
How can I write my HQL query in Spark, or fix this exception?

Difficulties in using a Gcloud Composer DAG to run a Spark job

I'm playing around with Gcloud Composer, trying to create a DAG that creates a DataProc cluster, runs a simple Spark job, then tears down the cluster. I am trying to run the Spark PI example job.
I understand that when calling DataProcSparkOperator I can choose only to define either the main_jar or the main_class property. When I define main_class, the job fails with the error:
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
When I choose to define the main_jar property, the job fails with the error:
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
I'm at a bit of a loss as to how to resolve this, as I am kinda new to both Spark and DataProc.
My DAG:
import datetime as dt
from airflow import DAG, models
from airflow.contrib.operators import dataproc_operator as dpo
from airflow.utils import trigger_rule
MAIN_JAR = 'file:///usr/lib/spark/examples/jars/spark-examples.jar'
MAIN_CLASS = 'org.apache.spark.examples.SparkPi'
CLUSTER_NAME = 'quickspark-cluster-{{ ds_nodash }}'
yesterday = dt.datetime.combine(
    dt.datetime.today() - dt.timedelta(1),
    dt.datetime.min.time())
default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': dt.timedelta(seconds=30),
    'project_id': models.Variable.get('gcp_project')
}
with DAG('dataproc_spark_submit', schedule_interval='0 17 * * *',
         default_args=default_dag_args) as dag:
    create_dataproc_cluster = dpo.DataprocClusterCreateOperator(
        project_id=default_dag_args['project_id'],
        task_id='create_dataproc_cluster',
        cluster_name=CLUSTER_NAME,
        num_workers=2,
        zone=models.Variable.get('gce_zone')
    )
    run_spark_job = dpo.DataProcSparkOperator(
        task_id='run_spark_job',
        #main_jar=MAIN_JAR,
        main_class=MAIN_CLASS,
        cluster_name=CLUSTER_NAME
    )
    delete_dataproc_cluster = dpo.DataprocClusterDeleteOperator(
        project_id=default_dag_args['project_id'],
        task_id='delete_dataproc_cluster',
        cluster_name=CLUSTER_NAME,
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )
    create_dataproc_cluster >> run_spark_job >> delete_dataproc_cluster

I compared it with a successful job submitted through the CLI and saw that, even when the class was populating the "Main class or jar" field, the path to the JAR was specified under "Jar files".
Checking the operator, I noticed there is also a dataproc_spark_jars parameter, which is not mutually exclusive with main_class:
run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    dataproc_spark_jars=[MAIN_JAR],
    main_class=MAIN_CLASS,
    cluster_name=CLUSTER_NAME
)
Adding it did the trick.
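The reason this works is visible in the shape of the job that gets submitted: the entry point is either a main class or a main JAR, while the list of additional JARs on the classpath is a separate field. The sketch below illustrates that rule in plain Python, using the field names of the Dataproc SparkJob resource (mainClass, mainJarFileUri, jarFileUris). The function build_spark_job is hypothetical; it does not call Dataproc or reproduce the operator's internals.

```python
def build_spark_job(main_class=None, main_jar=None, jars=None):
    """Build a SparkJob-style payload with exactly one entry point.

    main_class and main_jar are mutually exclusive; jars is independent
    of both and lists extra JARs to put on the classpath.
    """
    if (main_class is None) == (main_jar is None):
        raise ValueError("set exactly one of main_class or main_jar")
    job = {"jarFileUris": list(jars or [])}
    if main_class is not None:
        # The class must be findable in one of the JARs in jarFileUris
        job["mainClass"] = main_class
    else:
        # The JAR's manifest must declare a Main-Class
        job["mainJarFileUri"] = main_jar
    return {"sparkJob": job}


MAIN_JAR = "file:///usr/lib/spark/examples/jars/spark-examples.jar"
MAIN_CLASS = "org.apache.spark.examples.SparkPi"

payload = build_spark_job(main_class=MAIN_CLASS, jars=[MAIN_JAR])
print(payload)
```

Seen this way, both errors in the question make sense: with only main_class set, no JAR containing SparkPi was on the classpath, and with only main_jar set, the examples JAR apparently declares no main class of its own.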
