dataproc create cluster gcloud equivalent command in python - python-3.x

How do I replicate the following gcloud command in python?
gcloud beta dataproc clusters create spark-nlp-cluster \
--region global \
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--worker-machine-type n1-standard-1 \
--num-workers 2 \
--image-version 1.4-debian10 \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--optional-components=JUPYTER,ANACONDA \
--enable-component-gateway
Here is what I have so far in python:
cluster_data = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "gce_cluster_config": {"zone_uri": zone_uri},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
        "software_config": {"image_version": "1.4-debian10", "optional_components": {"JUPYTER", "ANACONDA"}},
    },
}
cluster = dataproc.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster_data}
)
Not sure how to convert these gcloud commands to python:
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--enable-component-gateway

You can try this:
cluster_data = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "gce_cluster_config": {
            "zone_uri": zone_uri,
            "metadata": {"PIP_PACKAGES": "google-cloud-storage spark-nlp==2.5.3"},
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-1"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-1"},
        "software_config": {
            "image_version": "1.4-debian10",
            "optional_components": ["JUPYTER", "ANACONDA"],
        },
        "initialization_actions": [
            {"executable_file": "gs://dataproc-initialization-actions/python/pip-install.sh"}
        ],
        "endpoint_config": {"enable_http_port_access": True},
    },
}
For more details, see the GCP Cluster Configs reference.
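If it helps, here is a minimal end-to-end sketch using the google-cloud-dataproc v1 client; the region value and endpoint below are illustrative assumptions, and cluster_data is the dict above:
from google.cloud import dataproc_v1

region = "us-central1"  # illustrative; pick the region your cluster should live in
dataproc = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# create_cluster returns a long-running operation; result() blocks until the cluster is ready
operation = dataproc.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster_data}
)
cluster = operation.result()
print(f"Cluster created: {cluster.cluster_name}")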

Related

Passing in VM arguments to Spark application via Argo

I have a Spark application which is triggered from an Argo YAML via a dockerized image.
The Argo workflow YAML is:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-argo-spark
  namespace: argo
spec:
  entrypoint: sparkapp
  templates:
    - name: sparkapp
      container:
        name: main
        command:
        args: [
          "/bin/sh",
          "-c",
          "/opt/spark/bin/spark-submit \
          --master k8s://https://kubernetes.default.svc \
          --deploy-mode cluster \
          --conf spark.kubernetes.container.image=/test-spark:latest \
          --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' \
          --conf spark.app.name=test-spark-job \
          --conf spark.jars.ivy=/tmp/.ivy \
          --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \
          --conf spark.kubernetes.namespace=argo \
          --conf spark.executor.instances=1 \
          --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
          --packages org.postgresql:postgresql:42.1.4 \
          --conf spark.kubernetes.driver.pod.name=custom-app \
          --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
          --class SparkMain \
          local:///opt/app/test-spark.jar"
        ]
        image: /test-spark:latest
        imagePullPolicy: IfNotPresent
        resources: {}
This calls spark-submit, which invokes the jar file containing the code below.
The Java code is:
public class SparkMain {
    public static void main(String args[]) {
        SparkSession spark = SparkHelper.getSparkSession("SparkMain_Application");
        System.out.println("Spark Java appn " + spark.logName());
    }
}
The Spark session is being created as follows:
public static SparkSession getSparkSession(String appName) {
    String sparkMode = System.getProperty("spark_mode");
    if (sparkMode == null) {
        sparkMode = "cluster";
    }
    if (sparkMode.equalsIgnoreCase("cluster")) {
        return createSparkSession(appName);
    } else if (sparkMode.equalsIgnoreCase("local")) {
        return createLocalSparkSession(appName);
    } else {
        throw new RuntimeException("Invalid spark_mode option " + sparkMode);
    }
}
As you can see, there is a system property that needs to be passed in:
String sparkMode = System.getProperty("spark_mode");
Can anyone tell me how to pass these VM arguments from the Argo YAML when calling spark-submit?
Also, how can we pass multiple properties to a program?

Spark on Kubernetes with Minio - Postgres -> Minio - unable to create executor due to

Hi, I am facing an error when providing dependency jars for spark-submit in Kubernetes.
/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit \
--master k8s://https://112.23.123.23:6443 \
--deploy-mode cluster \
--name spark-postgres-minio-kubernetes \
--jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
--driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code \
--conf spark.hadoop.fs.s3a.fast.upload=true \
--conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 \
file:///AirflowData/kubernetes/python/postgresminioKube.py
Below is the code to execute. The jars needed for S3/MinIO and the configurations are placed in spark_home/conf and spark_home/jars, and the Docker image is built from those.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
import json
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hosnamme", "port", "db")
connectionProperties = {
    "user": "username",
    "password": "password",
    "driver": "org.postgresql.Driver",
    "fetchsize": "100000"
}
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)
df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
The error is below. It is trying to load the jar for some reason:
21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573
21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
The external jars are added to /opt/spark/work-dir, which the executor did not have access to. So I changed the Dockerfile to grant access to the folder, and then it worked:
RUN chmod 777 /opt/spark/work-dir

hudi delta streamer job via apache livy

Please help with how to pass the --props file and --source-class file to the Livy API POST.
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
--master yarn \
--deploy-mode cluster \
--conf spark.sql.shuffle.partitions=100 \
--driver-class-path $HADOOP_CONF_DIR \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field tst \
--target-base-path /user/hive/warehouse/stock_ticks_mor \
--target-table test \
--props /var/demo/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous
I have converted the configs you are using into a JSON file to be passed to the Livy API:
{
    "className": "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "proxyUser": "root",
    "driverCores": 1,
    "executorCores": 2,
    "executorMemory": "1G",
    "numExecutors": 4,
    "queue": "default",
    "name": "stock_ticks_mor",
    "file": "hdfs://tmp/hudi-utilities-bundle_2.12-0.8.0.jar",
    "conf": {
        "spark.sql.shuffle.partitions": "100",
        "spark.jars.packages": "org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.2",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.task.cpus": "1",
        "spark.executor.cores": "1"
    },
    "args": [
        "--props", "/var/demo/config/kafka-source.properties",
        "--table-type", "MERGE_ON_READ",
        "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
        "--target-base-path", "/user/hive/warehouse/stock_ticks_mor",
        "--target-table", "test",
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
        "--continuous"
    ]
}
You can submit this JSON to the Livy endpoint like this:
curl -H "X-Requested-By: admin" -H "Content-Type: application/json" -X POST -d @config.json http://localhost:8999/batches
For reference: https://community.cloudera.com/t5/Community-Articles/How-to-Submit-Spark-Application-through-Livy-REST-API/ta-p/247502
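If you would rather submit from Python than curl, here is a minimal sketch using the requests library; the Livy URL and the config.json path are illustrative:
import json
import requests

# Load the batch config shown above and POST it to Livy's /batches endpoint
with open("config.json") as f:
    payload = json.load(f)

response = requests.post(
    "http://localhost:8999/batches",
    headers={"X-Requested-By": "admin"},
    json=payload,  # requests sets Content-Type: application/json for us
)
print(response.status_code, response.json())  # Livy returns the batch id and its state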

Airflow/Luigi for AWS EMR automatic cluster creation and pyspark deployment

I am new to Airflow automation. I don't know whether it is possible to do this with Apache Airflow (or Luigi, etc.) or whether I should just write one long bash file to do it.
I want to build a DAG for this:
Create/clone a cluster on AWS EMR
Install python requirements
Install pyspark related libararies
Get latest code from github
Submit spark job
Terminate cluster on finish
For the individual steps, I can make .sh files like the ones below (not sure whether this is a good approach), but I don't know how to do it in Airflow.
1) Creating a cluster with cluster.sh
aws emr create-cluster \
--name "1-node dummy cluster" \
--instance-type m3.xlarge \
--release-label emr-4.1.0 \
--instance-count 1 \
--use-default-roles \
--applications Name=Spark \
--auto-terminate
2, 3 & 4) Clone the git repo and install requirements with codesetup.sh
git clone some-repo.git
pip install -r requirements.txt
mv xyz.jar /usr/lib/spark/xyz.jar
5) Running the Spark job with sparkjob.sh
aws emr add-steps --cluster-id <Your EMR cluster id> --steps Type=spark,Name=TestJob,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,pythonjob.py,s3a://your-source-bucket/data/data.csv,s3a://your-destination-bucket/test-output/],ActionOnFailure=CONTINUE
6) Not sure, maybe this:
terminate-clusters
--cluster-ids <value> [<value>...]
Finally, all of this could be executed as one .sh file. I need to know a good approach to this with Airflow/Luigi.
What I found:
I found this post to be close, but it is outdated (2016) and misses the connections and code for the playbooks:
https://www.agari.com/email-security-blog/automated-model-building-emr-spark-airflow/
I figured out that there are two options for doing this:
1) We can make a bash script using emr create-cluster and add-steps, and then schedule it with Airflow's BashOperator.
Alternatively, there is a wrapper around these two called sparksteps.
An example from their documentation:
sparksteps examples/episodes.py \
--s3-bucket $AWS_S3_BUCKET \
--aws-region us-east-1 \
--release-label emr-4.7.0 \
--uploads examples/lib examples/episodes.avro \
--submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
--app-args="--input /home/hadoop/episodes.avro" \
--tags Application="Spark Steps" \
--debug
You can make a .sh script with the default options of your choice. After preparing this script, you can call it from Airflow's BashOperator as below:
create_command = "sparkstep_custom.sh "
t1 = BashOperator(
    task_id='create_file',
    bash_command=create_command,
    dag=dag
)
2) You can use Airflow's own operators for AWS to do this:
EmrCreateJobFlowOperator (for launching the cluster)
EmrAddStepsOperator (for submitting the Spark job)
EmrStepSensor (to track when the step finishes)
EmrTerminateJobFlowOperator (to terminate the cluster when the step finishes)
Basic example to create a cluster and submit a step:
my_steps = [
    {
        'Name': 'setup - copy files',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'test.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'setup - copy files 3',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI + 'myfiledependecy.py', '/home/hadoop/']
        }
    },
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--jars', 'jar1.jar,jar2.jar', '--py-files', '/home/hadoop/myfiledependecy.py', '/home/hadoop/test.py']
        }
    }
]
cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow2',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag
)
step_adder_pre_step = EmrAddStepsOperator(
    task_id='pre_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=my_steps,
    dag=dag
)
step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('pre_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)
cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow2', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)
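The JOB_FLOW_OVERRIDES referenced above is the EMR job-flow definition passed to EmrCreateJobFlowOperator. Here is a minimal sketch mirroring the aws emr create-cluster flags earlier in this post; the instance type, release label and roles are illustrative, so adjust them to your account:
# Minimal sketch of the JOB_FLOW_OVERRIDES dict referenced above.
# Values mirror the `aws emr create-cluster` flags shown earlier; adjust as needed.
JOB_FLOW_OVERRIDES = {
    'Name': '1-node dummy cluster',
    'ReleaseLabel': 'emr-4.1.0',
    'Applications': [{'Name': 'Spark'}],
    'Instances': {
        'InstanceGroups': [
            {
                'Name': 'Master node',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm3.xlarge',
                'InstanceCount': 1,
            },
        ],
        # keep the cluster alive so the steps added later can run
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminateProtected': False,
    },
    'JobFlowRole': 'EMR_EC2_DefaultRole',
    'ServiceRole': 'EMR_DefaultRole',
}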
Also, to upload code to S3 (where I was curious about getting the latest code from GitHub), this can be done with S3, boto3 and a PythonOperator.
Simple example:
import boto3

S3_BUCKET = 'you_bucket_name'
S3_URI = 's3://{bucket}/'.format(bucket=S3_BUCKET)
s3 = boto3.resource('s3')  # S3 resource used by the callable below

def upload_file_to_S3(filename, key, bucket_name):
    s3.Bucket(bucket_name).upload_file(filename, key)

upload_to_S3_task = PythonOperator(
    task_id='upload_to_S3',
    python_callable=upload_file_to_S3,
    op_kwargs={
        'filename': configdata['project_path'] + 'test.py',
        'key': 'test.py',
        'bucket_name': 'dep-buck',
    },
    dag=dag)
Airflow has operators for all of this; see the airflow doc.
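For completeness, a sketch of how the EMR tasks above are typically chained into the DAG (this assumes the dag object and the tasks defined earlier):
# Wire the operators into a linear flow: create cluster -> add steps -> wait -> terminate
cluster_creator >> step_adder_pre_step >> step_checker >> cluster_remover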

Displaying ► character in Vim terminal lightline status bar

I am working on SUSE Linux Enterprise Desktop 11 (x86_64) and I am using Vim in terminal as my editor. I have recently installed a plugin called lightline from https://github.com/itchyny/lightline.vim. The plugin uses special characters to make the status line look like this:
The > part of the bar is actually a ► character coloured like the square next to it. The problem is that the bar, in my case, looks like this:
The ► character is not displayed properly, although the encoding is set to UTF-8 and all the required fonts (the Powerline fonts) are installed on the system. In this case the font set in the terminal is Liberation Mono for Powerline.
Lightline settings in my vimrc:
set encoding=utf-8
scriptencoding utf-8
let g:lightline = {
\ 'colorscheme': 'wombat',
\ 'separator': {'left': "\u25B6", 'right': ''},
\ 'subseparator': { 'left': '', 'right': ''}
\ }
I also tried copying the ► character like this
let g:lightline = {
\ 'colorscheme': 'wombat',
\ 'separator': {'left': "►", 'right': ''},
\ 'subseparator': { 'left': '', 'right': ''}
\ }
But it manifests in the same way.
Furthermore, there is a problem with ^ characters wherever there is supposed to be whitespace.
Is there any solution for this?
The following is my my_configs.vim for lightline; it works perfectly on my Fedora 26 system.
let g:lightline = {
\ 'colorscheme': 'wombat',
\ }
let g:lightline = {
\ 'colorscheme': 'wombat',
\ 'active': {
\ 'left': [ ['mode', 'paste'],
\ ['fugitive', 'readonly', 'filename', 'modified'] ],
\ 'right': [ [ 'lineinfo' ], ['percent'] ]
\ },
\ 'component': {
\ 'readonly': '%{&filetype=="help"?"":&readonly?"\ue0a2":""}',
\ 'modified': '%{&filetype=="help"?"":&modified?"\ue0a0":&modifiable?"":"-"}',
\ 'fugitive': '%{exists("*fugitive#head")?fugitive#head():""}'
\ },
\ 'component_visible_condition': {
\ 'readonly': '(&filetype!="help"&& &readonly)',
\ 'modified': '(&filetype!="help"&&(&modified||!&modifiable))',
\ 'fugitive': '(exists("*fugitive#head") && ""!=fugitive#head())'
\ },
\ 'separator': { 'left': "\ue0b0", 'right': "\ue0b2" },
\ 'subseparator': { 'left': "\ue0b1", 'right': "\ue0b3" }
\ } "" This is comment: I forgot this line in my last post, just added
Sorry for my mistake, I just fixed this config.
If you installed the Hack font from https://github.com/chrissimpkins/Hack/releases and installed powerline-fonts with "sudo dnf install powerline-fonts" on a Fedora 26 system, you probably want to add the following config to your /etc/fonts/local.conf:
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <alias>
    <family>Hack</family>
    <prefer>
      <family>PowerlineSymbols</family>
    </prefer>
  </alias>
</fontconfig>
The ^ problem was explained in this thread: stackoverflow.com/questions/7223309/. It says that if stl and stlnc have the same values, they are replaced with ^^^. It works when you put * for stlnc and whitespace for stl.
