I have a SparkKubernetesOperator >> SparkKubernetesSensor dependency.
It works fine outside of a TaskGroup, but it fails when placed inside a TaskGroup, complaining about the metadata field.
What am I doing wrong?
with TaskGroup("tg-task-1", default_args=default_args) as tg_task_1:
    task_1 = SparkKubernetesOperator(
        task_id='task-1',
        namespace="batch",
        application_file="k8s/task-1.yaml",
        do_xcom_push=True,
        dag=dag,
    )
    task_1_sensor = SparkKubernetesSensor(
        task_id='task-1-sensor',
        namespace="batch",
        application_name="{{ task_instance.xcom_pull(task_ids='task-1')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
        dag=dag,
        attach_log=True,
    )
This is the error I get:
jinja2.exceptions.UndefinedError: 'None' has no attribute 'metadata'
I have just realised that the task_id differs when tasks are defined within a TaskGroup: it is prefixed with the group ID. So the task_ids value should be:
application_name="{{ task_instance.xcom_pull(task_ids='tg-task-1.task-1')['metadata']['name'] }}",
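As a rough illustration of why the original template pulled None: by default Airflow registers a task defined inside a TaskGroup as group_id.task_id (unless the group is created with prefix_group_id=False). The sketch below only builds the strings and does not call Airflow; the helper names qualified_task_id and xcom_name_template are hypothetical.

```python
def qualified_task_id(task_id, group_id=None):
    """Return the task_id as registered in the DAG (assumes prefix_group_id=True)."""
    return f"{group_id}.{task_id}" if group_id else task_id


def xcom_name_template(task_id, group_id=None):
    """Build the Jinja template that reads metadata.name from the operator's XCom."""
    full_id = qualified_task_id(task_id, group_id)
    return (
        "{{ task_instance.xcom_pull(task_ids='" + full_id + "')"
        "['metadata']['name'] }}"
    )


# Template for a sensor that sits in the same TaskGroup as the operator
print(xcom_name_template("task-1", "tg-task-1"))
```

Building the template from the group ID keeps the sensor in sync if the TaskGroup is renamed later.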
I have 2 job definitions (job-1, job-2) and I'm executing Job1 first. Job1 then submits Job2, which starts executing. I need to pass some parameters to Job2 when submitting the job. Below is my Python 3 code:
# job1
import boto3
import os
env = os.environ.get('environment')
batch = boto3.client('batch')
def submit_job():
    return batch.submit_job(
        jobName='Job2',
        jobQueue='job2-queue-dev',
        jobDefinition='job-2',
        containerOverrides={
            'environment': [
                {
                    'name': 'environment',
                    'value': env
                },
            ]
        },
        parameters={
            'opco': '123',
            'app': 'app1'
        },
    )
submit_job()
In Job2 I can easily get the environment variable with the code below.
# job2
import os
env = os.environ.get('environment')
def get_index_name(env):
    return 'liberty-' + env
....
So my question is: how can I get those parameters (opco, app) inside Job2?
FYI, I could pass them as environment variables, but I want to know how parameter retrieval is done here.
Thanks in advance.
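For context, AWS Batch does not expose parameters as environment variables. It substitutes each value into the matching Ref::name placeholder in the job definition's command, so Job2 receives them as command-line arguments. A minimal sketch, assuming the job-2 definition's command is something like ["python", "job2.py", "--opco", "Ref::opco", "--app", "Ref::app"] (the real job definition is not shown in the question, so that command and the helper name parse_job_parameters are assumptions):

```python
import argparse


def parse_job_parameters(argv=None):
    """Parse the values that Batch substituted into the container command."""
    parser = argparse.ArgumentParser(description="job2 entry point")
    parser.add_argument("--opco", required=True)
    parser.add_argument("--app", required=True)
    return parser.parse_args(argv)


# Simulates the argv Job2 would see after Batch replaces Ref::opco and Ref::app
args = parse_job_parameters(["--opco", "123", "--app", "app1"])
print(args.opco, args.app)
```

Inside the container, calling parse_job_parameters() with no argument reads sys.argv as usual.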
I have Airflow running a KubernetesPodOperator in order to make a spark-submit call:
spark_image = f'{getenv("REGISTRY")}/myApp:{getenv("TAG")}'
j2g = KubernetesPodOperator(
    dag=dag,
    task_id='myApp',
    name='myApp',
    namespace='data',
    image=spark_image,
    cmds=['/opt/spark/bin/spark-submit'],
    configmaps=["data"],
    arguments=[
        '--master k8s://https://10.96.0.1:443',
        '--deploy-mode cluster',
        '--name myApp',
        f'--conf spark.kubernetes.container.image={spark_image}',
        'local:///app/run.py'
    ],
)
However, I'm getting the following error:
Error: Unrecognized option: --master k8s://https://10.96.0.1:443
Which is weird, because when I bin/bash to a running pod and execute the spark-submit command, it works.
Any idea how to pass the arguments as expected?
Solution from the GitHub ticket: the parameter should be passed like this:
'--master=k8s://https://10.96.0.1:443',
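The underlying reason is that each element of the arguments list reaches the container as a single argv entry, so '--master k8s://...' arrives as one token containing a space, which spark-submit does not recognise. Besides the '--flag=value' form, splitting each flag and its value into separate list elements should give the same result. A small sketch of that alternative (the helper name split_cli_arguments is hypothetical, and the image name is a placeholder):

```python
import shlex


def split_cli_arguments(arguments):
    """Split 'flag value' strings into one list element per argv token."""
    tokens = []
    for arg in arguments:
        # shlex.split tokenises the way a shell would, honouring quotes
        tokens.extend(shlex.split(arg))
    return tokens


arguments = split_cli_arguments([
    '--master k8s://https://10.96.0.1:443',
    '--deploy-mode cluster',
    '--name myApp',
    '--conf spark.kubernetes.container.image=registry.example/myApp:latest',
    'local:///app/run.py',
])
print(arguments)
```

The resulting list can be passed as arguments= to the operator unchanged.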
I am new to Airflow automation. I don't know whether this is possible with Apache Airflow (or Luigi, etc.), or whether I should just write a long bash file to do it.
I want to build a DAG for the following steps:
Create/clone a cluster on AWS EMR
Install Python requirements
Install PySpark-related libraries
Get the latest code from GitHub
Submit the Spark job
Terminate the cluster on finish
For the individual steps I can make .sh files like the ones below (not sure whether that is a good idea), but I don't know how to do it in Airflow.
1) Creating a cluster with cluster.sh
aws emr create-cluster \
--name "1-node dummy cluster" \
--instance-type m3.xlarge \
--release-label emr-4.1.0 \
--instance-count 1 \
--use-default-roles \
--applications Name=Spark \
--auto-terminate
2 & 3 & 4) clone git and install requirements codesetup.sh
git clone some-repo.git
pip install -r requirements.txt
mv xyz.jar /usr/lib/spark/xyz.jar
5) Running spark job sparkjob.sh
aws emr add-steps --cluster-id <Your EMR cluster id> --steps Type=spark,Name=TestJob,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,pythonjob.py,s3a://your-source-bucket/data/data.csv,s3a://your-destination-bucket/test-output/],ActionOnFailure=CONTINUE
6) Not sure, maybe this:
aws emr terminate-clusters --cluster-ids <value> [<value>...]
Finally, all of this can be executed as one .sh file. I need to know a good approach to this with Airflow/Luigi.
What I found:
I found this post to be close, but it is outdated (2016) and is missing the connections and the code for the playbooks:
https://www.agari.com/email-security-blog/automated-model-building-emr-spark-airflow/
I figured out that there are two options for doing this:
1) We can write a bash script using aws emr create-cluster and add-steps, and then use Airflow's BashOperator to schedule it.
Alternatively, there is a wrapper around these two, called sparksteps.
An example from their documentation:
sparksteps examples/episodes.py \
--s3-bucket $AWS_S3_BUCKET \
--aws-region us-east-1 \
--release-label emr-4.7.0 \
--uploads examples/lib examples/episodes.avro \
--submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
--app-args="--input /home/hadoop/episodes.avro" \
--tags Application="Spark Steps" \
--debug
You can make a .sh script with the default options of your choice. After preparing this script, you can call it from Airflow's BashOperator as below:
# The trailing space stops Airflow from treating the .sh path as a Jinja template file
create_command = "sparkstep_custom.sh "
t1 = BashOperator(
    task_id='create_file',
    bash_command=create_command,
    dag=dag
)
2) You can use Airflow's own operators for AWS to do this:
EmrCreateJobFlowOperator (for launching the cluster)
EmrAddStepsOperator (for submitting the Spark job)
EmrStepSensor (to track when the step finishes)
EmrTerminateJobFlowOperator (to terminate the cluster when the step finishes)
Basic example to create a cluster and submit steps:
# Named my_steps to match the steps=my_steps argument of EmrAddStepsOperator below
my_steps = [
    {
        'Name': 'setup - copy files',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'test.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'setup - copy files 3',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'myfiledependecy.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--jars', "jar1.jar,jar2.jar", '--py-files', '/home/hadoop/myfiledependecy.py', '/home/hadoop/test.py']
        }
    }
]
cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow2',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag
)
step_adder_pre_step = EmrAddStepsOperator(
    task_id='pre_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=my_steps,
    dag=dag
)
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('pre_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)
cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)
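The snippet above references JOB_FLOW_OVERRIDES without defining it. It is a dict of cluster settings that EmrCreateJobFlowOperator merges over the defaults from the EMR connection, using the same keys as boto3's run_job_flow. A minimal sketch follows; every value in it (cluster name, release label, instance type, role names) is an illustrative assumption, not taken from the original answer.

```python
# Hypothetical cluster settings; adjust every value to your own account and region
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-spark-cluster",
    "ReleaseLabel": "emr-5.29.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        # Keep the cluster up between steps; the terminate operator shuts it down
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```

Setting KeepJobFlowAliveWhenNoSteps to True matters here because the steps are added by a later task; with False the cluster could terminate before the steps arrive.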
Also, to upload code to S3 (I was curious about how to get the latest code from GitHub), it can be done with S3, boto3 and PythonOperator.
Simple example:
import boto3
S3_BUCKET = 'you_bucket_name'
S3_URI = 's3://{bucket}/'.format(bucket=S3_BUCKET)
s3 = boto3.resource('s3')  # assumed: the original snippet uses `s3` without defining it
def upload_file_to_S3(filename, key, bucket_name):
    # Upload a local file to the given bucket under the given key
    s3.Bucket(bucket_name).upload_file(filename, key)
upload_to_S3_task = PythonOperator(
    task_id='upload_to_S3',
    python_callable=upload_file_to_S3,
    op_kwargs={
        'filename': configdata['project_path'] + 'test.py',
        'key': 'test.py',
        'bucket_name': 'dep-buck',
    },
    dag=dag)
Airflow has operators for this; see the Airflow documentation.
I'm trying to convert HQL to Spark.
I have the following query (Works in Hue with Hive editor):
select reflect('java.util.UUID', 'randomUUID') as id,
tt.employee,
cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date,
collect_set(tt.employee_detail) as employee_details,
collect_set( tt.emp_indication ) as employees_indications,
named_struct ('employee_info', collect_set(tt.emp_info),
'employee_mod_info', collect_set(tt.emp_mod_info),
'employee_comments', collect_set(tt.emp_comment) )
as emp_mod_details,
from (
select views_ctr.employee,
if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail,
if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info,
if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment,
if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info,
if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication,
from
( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr
) tt
group by employee
distribute by employee
First, I tried writing it with spark.sql as follows:
sparkSession.sql("select reflect('java.util.UUID', 'randomUUID') as id, tt.employee, cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date, collect_set(tt.employee_detail) as employee_details, collect_set( tt.emp_indication ) as employees_indications, named_struct ('employee_info', collect_set(tt.emp_info), 'employee_mod_info', collect_set(tt.emp_mod_info), 'employee_comments', collect_set(tt.emp_comment) ) as emp_mod_details, from ( select views_ctr.employee, if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail, if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info, if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment, if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info, if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication, from ( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr ) tt group by employee distribute by employee")
But I got the following exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
- object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper#30cfd641)
If I run my query without the collect_set function, it works. Could it be failing because of the struct column types in my table?
How can I write my HQL query in Spark and fix this exception?
I'm playing around with Gcloud Composer, trying to create a DAG that creates a DataProc cluster, runs a simple Spark job, then tears down the cluster. I am trying to run the Spark PI example job.
I understand that when calling DataProcSparkOperator I can define only one of the main_jar or main_class properties. When I define main_class, the job fails with the error:
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
When I choose to define the main_jar property, the job fails with the error:
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
I'm at a bit of a loss as to how to resolve this, as I am kinda new to both Spark and DataProc.
My DAG:
import datetime as dt
from airflow import DAG, models
from airflow.contrib.operators import dataproc_operator as dpo
from airflow.utils import trigger_rule
MAIN_JAR = 'file:///usr/lib/spark/examples/jars/spark-examples.jar'
MAIN_CLASS = 'org.apache.spark.examples.SparkPi'
CLUSTER_NAME = 'quickspark-cluster-{{ ds_nodash }}'
yesterday = dt.datetime.combine(
    dt.datetime.today() - dt.timedelta(1),
    dt.datetime.min.time())
default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': dt.timedelta(seconds=30),
    'project_id': models.Variable.get('gcp_project')
}
with DAG('dataproc_spark_submit', schedule_interval='0 17 * * *',
         default_args=default_dag_args) as dag:
    create_dataproc_cluster = dpo.DataprocClusterCreateOperator(
        project_id = default_dag_args['project_id'],
        task_id = 'create_dataproc_cluster',
        cluster_name = CLUSTER_NAME,
        num_workers = 2,
        zone = models.Variable.get('gce_zone')
    )
    run_spark_job = dpo.DataProcSparkOperator(
        task_id = 'run_spark_job',
        #main_jar = MAIN_JAR,
        main_class = MAIN_CLASS,
        cluster_name = CLUSTER_NAME
    )
    delete_dataproc_cluster = dpo.DataprocClusterDeleteOperator(
        project_id = default_dag_args['project_id'],
        task_id = 'delete_dataproc_cluster',
        cluster_name = CLUSTER_NAME,
        trigger_rule = trigger_rule.TriggerRule.ALL_DONE
    )
    create_dataproc_cluster >> run_spark_job >> delete_dataproc_cluster
I compared it with a successful job submitted using the CLI and saw that, even when the class was populating the "Main class or jar" field, the path to the JAR was specified under "Jar files".
Checking the operator, I noticed there is also a dataproc_spark_jars parameter, which is not mutually exclusive with main_class:
run_spark_job = dpo.DataProcSparkOperator(
    task_id = 'run_spark_job',
    dataproc_spark_jars = [MAIN_JAR],
    main_class = MAIN_CLASS,
    cluster_name = CLUSTER_NAME
)
Adding it did the trick.