Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment - apache-spark

I am new to Airflow automation. I don't know if it is possible to do this with Apache Airflow (or Luigi etc.), or whether I should just write one long bash file to do it.
I want to build a DAG for this:
Create/clone a cluster on AWS EMR
Install python requirements
Install pyspark related libararies
Get latest code from github
Submit spark job
Terminate cluster on finish
For the individual steps, I can make .sh files like the ones below (not sure whether that is a good idea or not), but I don't know how to do it in Airflow.
1) Creating a cluster with cluster.sh
aws emr create-cluster \
--name "1-node dummy cluster" \
--instance-type m3.xlarge \
--release-label emr-4.1.0 \
--instance-count 1 \
--use-default-roles \
--applications Name=Spark \
--auto-terminate
2, 3 & 4) Cloning the git repo and installing requirements with codesetup.sh
git clone some-repo.git
pip install -r requirements.txt
mv xyz.jar /usr/lib/spark/xyz.jar
5) Running spark job sparkjob.sh
aws emr add-steps --cluster-id <Your EMR cluster id> --steps Type=spark,Name=TestJob,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,pythonjob.py,s3a://your-source-bucket/data/data.csv,s3a://your-destination-bucket/test-output/],ActionOnFailure=CONTINUE
6) Not sure, maybe this:
terminate-clusters
--cluster-ids <value> [<value>...]
Finally, this could all be executed as a single .sh file. I need to know a good approach to this with Airflow/Luigi.
What I found:
I found this post to be close, but it is outdated (2016) and misses the connections and code for the playbooks:
https://www.agari.com/email-security-blog/automated-model-building-emr-spark-airflow/

I figured out that there are two options to do this.
1) We can make a bash script with the help of emr create-cluster and add-steps, and then use Airflow's BashOperator to schedule it; a minimal sketch follows below.
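A minimal sketch (the bucket and script names are placeholders, not from the original post) that creates the cluster, runs the job as a step, and auto-terminates in one CLI call:
aws emr create-cluster \
  --name "1-node dummy cluster" \
  --release-label emr-4.7.0 \
  --instance-type m3.xlarge \
  --instance-count 1 \
  --use-default-roles \
  --applications Name=Spark \
  --steps Type=Spark,Name=TestJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket/pythonjob.py] \
  --auto-terminate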
Alternatively, there is a wrapper around these two called sparksteps.
An example from their documentation:
sparksteps examples/episodes.py \
--s3-bucket $AWS_S3_BUCKET \
--aws-region us-east-1 \
--release-label emr-4.7.0 \
--uploads examples/lib examples/episodes.avro \
--submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
--app-args="--input /home/hadoop/episodes.avro" \
--tags Application="Spark Steps" \
--debug
You can make a .sh script with the default options of your choice. After preparing this script, you can call it from the Airflow BashOperator as below:
create_command = "sparkstep_custom.sh "
t1 = BashOperator(
    task_id='create_file',
    bash_command=create_command,
    dag=dag
)
2) You can use Airflow's own operators for AWS to do this:
EmrCreateJobFlowOperator (for launching a cluster)
EmrAddStepsOperator (for submitting a Spark job)
EmrStepSensor (to track when the step finishes)
EmrTerminateJobFlowOperator (to terminate the cluster when the step finishes)
Basic example to create a cluster and submit a step:
my_steps = [
    {
        'Name': 'setup - copy files',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'test.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'setup - copy files 3',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'myfiledependecy.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--jars', 'jar1.jar,jar2.jar', '--py-files', '/home/hadoop/myfiledependecy.py', '/home/hadoop/test.py']
        }
    }
]
cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow2',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag
)
step_adder_pre_step = EmrAddStepsOperator(
    task_id='pre_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=my_steps,
    dag=dag
)
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('pre_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)
cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)
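The snippet above does not wire the tasks together; one possible ordering (a sketch, not from the original answer) so the cluster is only removed after the step sensor succeeds:
# create cluster -> add steps -> wait for the step -> terminate cluster
cluster_creator >> step_adder_pre_step >> step_checker >> cluster_remover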
Also, to upload code to S3 (where I was curious how to get the latest code from GitHub), it can be done with S3, boto3 and the PythonOperator.
Simple example:
import boto3

S3_BUCKET = 'you_bucket_name'
S3_URI = 's3://{bucket}/'.format(bucket=S3_BUCKET)
s3 = boto3.resource('s3')

def upload_file_to_S3(filename, key, bucket_name):
    s3.Bucket(bucket_name).upload_file(filename, key)

upload_to_S3_task = PythonOperator(
    task_id='upload_to_S3',
    python_callable=upload_file_to_S3,
    op_kwargs={
        'filename': configdata['project_path'] + 'test.py',
        'key': 'test.py',
        'bucket_name': 'dep-buck',
    },
    dag=dag)
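For the "get latest code from GitHub" part specifically, one option (a sketch with a placeholder path, not from the original answer) is a BashOperator that pulls the repo before the upload task runs:
pull_latest_code = BashOperator(
    task_id='pull_latest_code',
    # placeholder path; assumes the repo was cloned once beforehand
    bash_command='cd /path/to/your-repo && git pull origin master',
    dag=dag)

pull_latest_code >> upload_to_S3_task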

Airflow has operators for this; see the Airflow docs.
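With a recent apache-airflow-providers-amazon package, the imports for the operators used above look roughly like this (older Airflow versions expose them under different module paths, e.g. airflow.contrib):
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor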

Related

How to connect to Glue catalog from an EMR spark-submit step

We use PySpark in an EMR cluster to run queries against our Glue database. We use two methods to execute the Python scripts, namely through a Zeppelin notebook and through EMR steps. Connecting to the Glue database works great in Zeppelin, but not in EMR steps. When we run a query against the Glue database, we get the following error:
pyspark.sql.utils.AnalysisException: Database '{glue_database_name}' does not exist.
This is the spark configuration used in the executed .py file:
spark = SparkSession\
.builder\
.appName("dfp.sln.kunderelation.work")\
.config("spark.sql.broadcastTimeout", "36000")\
.config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")\
.config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")\
.config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")\
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
.enableHiveSupport() \
.getOrCreate()
spark.conf.set("spark.sql.sources.ignoreDataLocality.enabled", "true")
The step is submitted using boto3 with the following configuration:
response = emr_client.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': name,
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', '--deploy-mode', 'client', '--master', 'yarn', "{path to .py script}"]
            }
        },
    ]
)
Both types of EC2 users have been given Glue and S3 privileges.
What do we need to set up to connect to the Glue database?
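One thing commonly checked in this situation (a sketch, not an answer from this thread): if the cluster was not created with the Glue Data Catalog enabled as the Hive metastore, the Glue client factory can be passed explicitly to spark-submit in the step definition, for example:
'Args': ['spark-submit', '--deploy-mode', 'client', '--master', 'yarn',
         '--conf', 'spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory',
         "{path to .py script}"]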

PySpark write data to Ceph returns 400 Bad Request

I have a problem with my PySpark configuration when writing data to a Ceph bucket.
With the following Python code snippet I can read data from the Ceph bucket but when I try to write inside the bucket, I get the following error:
22/07/22 10:00:58 DEBUG S3ErrorResponseHandler: Failed in parsing the error response :
org.apache.hadoop.shaded.com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in prolog
at [row,col {unknown-source}]: [1,0]
at org.apache.hadoop.shaded.com.ctc.wstx.sr.StreamScanner.throwUnexpectedEOF(StreamScanner.java:701)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.handleEOF(BasicStreamReader.java:2217)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2123)
at org.apache.hadoop.shaded.com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1179)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.createException(S3ErrorResponseHandler.java:122)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.handle(S3ErrorResponseHandler.java:71)
at com.amazonaws.services.s3.internal.S3ErrorResponseHandler.handle(S3ErrorResponseHandler.java:52)
[...]
22/07/22 10:00:58 DEBUG request: Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null
22/07/22 10:00:58 DEBUG AwsChunkedEncodingInputStream: AwsChunkedEncodingInputStream reset (will reset the wrapped stream because it is mark-supported).
Pyspark code (not working):
from pyspark.sql import SparkSession
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-bundle:1.12.264,org.apache.spark:spark-sql-kafka-0-10_2.13:3.3.0,org.apache.hadoop:hadoop-aws:3.3.3 pyspark-shell"
spark = (
SparkSession.builder.appName("app") \
.config("spark.hadoop.fs.s3a.access.key", access_key) \
.config("spark.hadoop.fs.s3a.secret.key", secret_key) \
.config("spark.hadoop.fs.s3a.connection.timeout", "10000") \
.config("spark.hadoop.fs.s3a.endpoint", "http://HOST_NAME:88") \
.config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.endpoint.region", "default") \
.getOrCreate()
)
spark.sparkContext.setLogLevel("TRACE")
# This works
spark.read.csv("s3a://test-data/data.csv")
# This throws the provided error
df_to_write = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df_to_write.write.csv("s3a://test-data/with_love.csv")
Also, referring to the same ceph bucket, I am able to read and write data to the bucket via boto3:
import boto3
from botocore.exceptions import ClientError
from botocore.client import Config

config = Config(connect_timeout=20, retries={'max_attempts': 0})
s3_client = boto3.client('s3', config=config,
                         aws_access_key_id=access_key,
                         aws_secret_access_key=secret_key,
                         region_name="defaut",
                         endpoint_url='http://HOST_NAME:88',
                         verify=False
                         )
response = s3_client.list_buckets()

# Read
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f' {bucket["Name"]}')

# Write
dummy_data = b'Dummy string'
s3_client.put_object(Body=dummy_data, Bucket='test-spark', Key='awesome_key')
Also, s3cmd with the same configuration works fine.
I think I'm missing some PySpark (hadoop-aws) configuration; could anyone help me identify the configuration problem? Thanks.
After some research on the web, I was able to solve the problem using this hadoop-aws configuration:
fs.s3a.signing-algorithm: S3SignerType
I configured this property in pySpark with:
spark = (
SparkSession.builder.appName("app") \
.config("spark.hadoop.fs.s3a.access.key", access_key) \
.config("spark.hadoop.fs.s3a.secret.key", secret_key) \
.config("spark.hadoop.fs.s3a.connection.timeout", "10000") \
.config("spark.hadoop.fs.s3a.endpoint", "http://HOST_NAME:88") \
.config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
.config("spark.hadoop.fs.s3a.path.style.access", "true") \
.config("spark.hadoop.fs.s3a.endpoint.region", "default") \
.config("spark.hadoop.fs.s3a.signing-algorithm", "S3SignerType") \
.getOrCreate()
)
From what I understand, the version of Ceph I am using (16.2.3) does not support the default signing algorithm used by Spark 3.3.0 on Hadoop 3.3.2.
For further details see this documentation.

Passing in VM arguments to Spark application via Argo

I have a Spark application which is triggered from an Argo workflow YAML via a dockerized image.
The Argo workflow YAML is as follows:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-argo-spark
  namespace: argo
spec:
  entrypoint: sparkapp
  templates:
    - name: sparkapp
      container:
        name: main
        command:
        args: [
          "/bin/sh",
          "-c",
          "/opt/spark/bin/spark-submit \
          --master k8s://https://kubernetes.default.svc \
          --deploy-mode cluster \
          --conf spark.kubernetes.container.image=/test-spark:latest \
          --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' \
          --conf spark.app.name=test-spark-job \
          --conf spark.jars.ivy=/tmp/.ivy \
          --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \
          --conf spark.kubernetes.namespace=argo \
          --conf spark.executor.instances=1 \
          --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
          --packages org.postgresql:postgresql:42.1.4 \
          --conf spark.kubernetes.driver.pod.name=custom-app \
          --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
          --class SparkMain \
          local:///opt/app/test-spark.jar"
        ]
        image: /test-spark:latest
        imagePullPolicy: IfNotPresent
        resources: {}
This calls spark-submit, which invokes the jar file containing the code.
The Java code is as follows:
public class SparkMain {
    public static void main(String args[]) {
        SparkSession spark = SparkHelper.getSparkSession("SparkMain_Application");
        System.out.println("Spark Java appn " + spark.logName());
    }
}
The Spark session is being created as follows:
public static SparkSession getSparkSession(String appName) {
    String sparkMode = System.getProperty("spark_mode");
    if (sparkMode == null) {
        sparkMode = "cluster";
    }
    if (sparkMode.equalsIgnoreCase("cluster")) {
        return createSparkSession(appName);
    } else if (sparkMode.equalsIgnoreCase("local")) {
        return createLocalSparkSession(appName);
    } else {
        throw new RuntimeException("Invalid spark_mode option " + sparkMode);
    }
}
As you can see here, there is a system property which needs to be passed in:
String sparkMode = System.getProperty("spark_mode");
Can anyone tell me how we can pass these VM args from the Argo YAML when calling spark-submit?
Also, how can we pass multiple properties to a program?
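Not an answer from the thread, just a sketch of one common approach: JVM system properties for the driver can be passed as -D flags through spark.driver.extraJavaOptions (the workflow above already uses it for the ivy settings), so the spark-submit inside the Argo args could carry something like:
--conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp -Dspark_mode=cluster' \
Multiple properties can be appended as additional -D entries inside the same quoted value, and spark.executor.extraJavaOptions works the same way for the executors.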

dataproc create cluster gcloud equivalent command in python

How do I replicate the following gcloud command in python?
gcloud beta dataproc clusters create spark-nlp-cluster \
--region global \
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--worker-machine-type n1-standard-1 \
--num-workers 2 \
--image-version 1.4-debian10 \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--optional-components=JUPYTER,ANACONDA \
--enable-component-gateway
Here is what I have so far in python:
cluster_data = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "gce_cluster_config": {"zone_uri": zone_uri},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
        "software_config": {"image_version": "1.4-debian10", "optional_components": {"JUPYTER", "ANACONDA"}},
    },
}

cluster = dataproc.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster_data}
)
Not sure how to convert these gcloud commands to python:
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--enable-component-gateway
You can try this:
cluster_data = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "gce_cluster_config": {
            "zone_uri": zone_uri,
            # metadata is a map of key/value strings
            "metadata": {"PIP_PACKAGES": "google-cloud-storage spark-nlp==2.5.3"},
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
        "software_config": {"image_version": "1.4-debian10", "optional_components": {"JUPYTER", "ANACONDA"}},
        # initialization_actions is a list of actions, each with an executable_file
        "initialization_actions": [
            {"executable_file": "gs://dataproc-initialization-actions/python/pip-install.sh"}
        ],
        "endpoint_config": {"enable_http_port_access": True},
    },
}
For more details, see: GCP Cluster Configs
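A brief usage note (based on the google-cloud-dataproc v2 client, not part of the original answer): create_cluster returns a long-running operation, so you can block until the cluster is actually ready:
operation = dataproc.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster_data}
)
result = operation.result()  # waits for cluster creation to complete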

Elasticsearch pyspark connection in insecure mode

My end goal is to insert data from HDFS into Elasticsearch, but the issue I am facing is connectivity.
I am able to connect to my Elasticsearch node using the curl command below:
curl -u username -X GET 'https://xx.xxx.xx.xxx:9200/_cat/indices?v' --insecure
But when it comes to connecting with Spark, I am unable to do so. My command to insert data is:
df.write.mode("append").format('org.elasticsearch.spark.sql').option("es.net.http.auth.user", "username").option("es.net.http.auth.pass", "password").option("es.index.auto.create","true").option('es.nodes', 'https://xx.xxx.xx.xxx').option('es.port','9200').save('my-index/my-doctype')
The error I am getting is:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
....
....
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings) - all nodes failed; tried [[xx.xxx.xx.xxx:9200]]
....
...
Here, what would be the PySpark equivalent of curl --insecure?
Thanks
After many attempts and different config options, I found a way to connect to Elasticsearch running on HTTPS insecurely:
dfToEs.write.mode("append").format('org.elasticsearch.spark.sql') \
.option("es.net.http.auth.user", username) \
.option("es.net.http.auth.pass", password) \
.option("es.net.ssl", "true") \
.option("es.net.ssl.cert.allow.self.signed", "true") \
.option("mergeSchema", "true") \
.option('es.index.auto.create', 'true') \
.option('es.nodes', 'https://{}'.format(es_ip)) \
.option('es.port', '9200') \
.option('es.batch.write.retry.wait', '100s') \
.save('{index}/_doc'.format(index=index))
With the option
es.net.ssl = true
we also have to allow the self-signed certificate like below:
es.net.ssl.cert.allow.self.signed = true
I checked a lot of things and finally I can write to the AWS Elasticsearch Service (ES), but with Scala/Spark.
In a VPC, create security groups to allow access from EMR to ES on port 443 (inbound rules in ES to the security group of EMR, and inbound rules in EMR on the same port).
Check connectivity from the EMR master node with a telnet command:
telnet xyz.eu-west-1.es.amazonaws.com 443
Once the above checks out, check the application level with a curl command:
curl 'https://xyz.eu-west-1.es.amazonaws.com:443/domainname/_search?pretty=true&q=*'
Afterwards, go to the code. In my case I tested with spark-shell, with the server confs included at startup like this:
spark-shell --jars elasticsearch-spark-20_2.11-7.1.1.jar --conf spark.es.nodes="xyz.eu-west-1.es.amazonaws.com" --conf spark.es.port=443 --conf spark.es.nodes.wan.only=true --conf spark.es.nodes.discovery="false" --conf spark.es.index.auto.create="true" --conf spark.es.resource="domain/doc" --conf spark.es.scheme="https"
Finally the code to write:
import java.util.Date
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql._
val dateformat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
val currentdate = dateformat.format(new Date)
val colorsDF = spark.read.json("multilinecolors.json")
val mcolors = colorsDF.withColumn("Date",lit(currentdate))
mcolors.write.mode("append")
  .format("org.elasticsearch.spark.sql")
  .option("es.net.http.auth.user", "")
  .option("es.net.http.auth.pass", "")
  .option("es.net.ssl", "true")
  .option("es.net.ssl.cert.allow.self.signed", "true")
  .option("mergeSchema", "true")
  .option("es.index.auto.create", "true")
  .option("es.nodes", "https://xyz.eu-west-1.es.amazonaws.com")
  .option("es.port", "443")
  .option("es.batch.write.retry.wait", "100")
  .save("domainname/_doc")
Can you try with the below SparkConf settings?
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.es.index.auto.create", "true")
  .set("spark.es.nodes", "yourESaddress")
  .set("spark.es.port", "9200")
  .set("spark.es.net.http.auth.user", "")
  .set("spark.es.net.http.auth.pass", "")
  .set("spark.es.resource", indexName)
  .set("spark.es.nodes.wan.only", "true")
If you still face the problem, then set es.net.ssl = true and see.
If you still get the error, try adding the below configs:
'es.resource' = 'ctrl_rater_resumen_lla/hb',
'es.nodes' = 'localhost',
'es.port' = '9200',
'es.index.auto.create' = 'true',
'es.index.read.missing.as.empty' = 'true',
'es.nodes.discovery' = 'true',
'es.net.ssl' = 'false',
'es.nodes.client.only' = 'false',
'es.nodes.wan.only' = 'true',
'es.net.http.auth.user' = 'xxxxx',
'es.net.http.auth.pass' = 'xxxxx',
'es.nodes.discovery' = 'false'
