To follow up on this question, I decided to try the workflow templates API.
Here's what it looks like:
gcloud beta dataproc workflow-templates create lifestage-workflow --region europe-west2
gcloud beta dataproc workflow-templates set-managed-cluster lifestage-workflow \
--master-machine-type n1-standard-8 \
--worker-machine-type n1-standard-16 \
--num-workers 6 \
--cluster-name lifestage-workflow-cluster \
--initialization-actions gs://..../init.sh \
--zone europe-west2-b \
--region europe-west2

gcloud beta dataproc workflow-templates add-job pyspark gs://.../main.py \
--step-id prediction \
--region europe-west2 \
--workflow-template lifestage-workflow \
--jars gs://.../custom.jar \
--py-files gs://.../jobs.zip,gs://.../config.ini \
-- --job predict --conf config.ini
The template is correctly created.
The job works when I run it manually from one of my already existing clusters. It also runs when I use an existing cluster instead of asking the workflow to create one.
The thing is, I want the cluster to be created before running the job and deleted just after; that's why I'm using a managed cluster.
But with the managed cluster I just can't make it run. I tried to use the same configuration as my existing clusters but it doesn't change anything.
I always get the same error.
Any idea why my job runs perfectly except when it is run from a generated cluster?
The problem came from the version of the managed cluster.
By default the image version was 1.2.31, while my existing cluster was using image 1.2.28. When I changed the config to add --image-version=1.2.28, it worked.
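For completeness, here is a sketch of what the adjusted set-managed-cluster call would look like, assuming the same flags as in the question (the truncated gs:// paths are left as they appear above):
# Pin the managed cluster to the image version used by the existing clusters.
gcloud beta dataproc workflow-templates set-managed-cluster lifestage-workflow \
--master-machine-type n1-standard-8 \
--worker-machine-type n1-standard-16 \
--num-workers 6 \
--cluster-name lifestage-workflow-cluster \
--initialization-actions gs://..../init.sh \
--image-version=1.2.28 \
--zone europe-west2-b \
--region europe-west2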
Dataproc image 1.2.31 upgraded Spark to 2.2.1, which introduced [SPARK-22472]:
SPARK-22472: added null check for top-level primitive types. Before
this release, for datasets having top-level primitive types, and it
has null values, it might return some unexpected results. For example,
let's say we have a parquet file with schema <a: Int>, and we read it
into Scala Int. If column a has null values, when transformation is
applied some unexpected value can be returned.
This likely added just enough generated code to push some classes over the 64 KB limit.
Related
I am using the free credits of Google Cloud. I followed the Dataproc tutorial, but when I run the following command I get an error regarding the storage capacity.
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
--region=${REGION} \
--zone=${ZONE} \
--image-version=1.5 \
--master-machine-type=n1-standard-4 \
--worker-machine-type=n1-standard-4 \
--bucket=${BUCKET_NAME} \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
Do you have any idea how to fix this? I changed n1-standard-4 to n1-standard-1, but that did not fix it. However, when I remove --image-version=1.5 the command works. Does that create any problem for the rest of the program?
Also, from the web interface, when I click on the JupyterLab link I cannot see a Python 3 icon among the kernels available on my Dataproc cluster. I only have Python 2, and it keeps saying the connection with the server is gone.
Here is a picture of the JupyterLab error:
You are seeing an error regarding storage capacity because with the 1.5 image version Dataproc uses bigger 1000 GiB disks for master and worker nodes to improve performance. You can reduce the disk size by using the --master-boot-disk-size=100GB and --worker-boot-disk-size=100GB flags:
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
--region=${REGION} \
--zone=${ZONE} \
--image-version=1.5 \
--master-machine-type=n1-standard-4 \
--master-boot-disk-size=100GB \
--worker-machine-type=n1-standard-4 \
--worker-boot-disk-size=100GB \
--bucket=${BUCKET_NAME} \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
When you removed the --image-version=1.5 flag, the command used the default 1.3 image version, which does not support Python 3 by default; that's why you are not seeing a Python 3 kernel in JupyterLab.
I'm trying to run Spark-Wiki-Parser on a GCP Dataproc cluster. The code takes in two arguments, "dumpfile" and "destloc". When I submit the following, I get a [scallop] Error: Excess arguments provided: 'gs://enwiki-latest-pages-articles.xml.bz2 gs://output_dir/'.
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- 'gs://enwiki-latest-pages-articles.xml.bz2' 'gs://output_dir/'
How do I get the code to recognize the input arguments?
I spent probably 8 hours figuring this out, but figured I'd dump the solution here since it had not been shared yet.
The gcloud CLI separates the Dataproc parameters from the class arguments with --, as noted by another user. However, Scallop also requires a -- prior to each named argument. Your CLI invocation should look something like this:
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- --dumpfile 'gs://enwiki-latest-pages-articles.xml.bz2' --destloc 'gs://output_dir/'
It seems like the Scala class needs dumpfile and destloc as args.
Could you run the following command instead and see if it works?
gcloud dataproc jobs submit spark --cluster $CLUSTER_NAME --project $CLUSTER_PROJECT \
--class 'com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain' \
--properties=^#^spark.jars.packages='com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0' \
--region=$CLUSTER_REGION \
-- dumpfile gs://enwiki-latest-pages-articles.xml.bz2 destloc gs://output_dir/
I was going through this Apache Spark documentation, and it mentions that:
When running Spark on YARN in cluster mode, environment variables
need to be set using the
spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your
conf/spark-defaults.conf file.
I am running my EMR cluster on AWS Data Pipeline. I wanted to know where I have to edit this conf file. Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?
One way to do it is the following (the tricky part is that you might need to set up the environment variables in both the executor and driver parameters):
spark-submit \
--driver-memory 2g \
--executor-memory 4g \
--conf spark.executor.instances=4 \
--conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--master yarn \
--deploy-mode cluster \
--class com.industry.class.name \
assembly-jar.jar
I have tested it on EMR in client mode, but it should work in cluster mode as well.
For future reference you could directly pass the environment variable when creating the EMR cluster using the Configurations parameter as described in the docs here.
Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.yarn.appMasterEnv.[EnvironmentVariableName]": "some_value",
    "spark.executorEnv.[EnvironmentVariableName]": "some_other_value"
  }
}
Here spark.yarn.appMasterEnv.[EnvironmentVariableName] is used to pass a variable to the driver when running in cluster mode on YARN, and spark.executorEnv.[EnvironmentVariableName] is used to pass a variable to the executor processes (both are described in the Spark configuration documentation).
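If modifying spark-defaults is not convenient, the same properties can typically also be set at submit time with --conf. A minimal sketch, reusing the placeholder class and jar names from the spark-submit above (MY_ENV_VAR and some_value are made up for illustration):
# MY_ENV_VAR and some_value are placeholders; the variable is set for both
# the driver (YARN application master) and the executors.
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.MY_ENV_VAR=some_value \
--conf spark.executorEnv.MY_ENV_VAR=some_value \
--class com.industry.class.name \
assembly-jar.jar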
I am following the instructions for starting a Google Dataproc cluster with an initialization script to start a Jupyter notebook.
https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud
How can I include extra JAR files (spark-xml, for example) in the resulting SparkContext in Jupyter notebooks (particularly pyspark)?
The answer depends slightly on which jars you're looking to load. For example, you can use spark-xml with the following when creating a cluster:
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.4.1
To specify multiple Maven coordinates, you will need to swap the gcloud dictionary separator character from ',' to something else (as we need to use that to separate the packages to install):
$ gcloud dataproc clusters create [cluster-name] \
--zone [zone] \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh \
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3
Details on how escape characters are changed can be found in gcloud:
$ gcloud help topic escaping
I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster.
Based on some other questions like this and this, I have set up the cluster to use the DominantResourceCalculator to consider both vcpus and memory for resource allocation:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
But when I submit my job with custom Spark flags, it looks like YARN doesn't respect these custom parameters and defaults to using memory as the yardstick for resource calculation:
gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.sql.broadcastTimeout=900,spark.network.timeout=800\
,yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator\
,spark.dynamicAllocation.enabled=true\
,spark.executor.instances=10\
,spark.executor.cores=14\
,spark.executor.memory=15g\
,spark.driver.memory=50g \
src/my_python_file.py
Can somebody help me figure out what's going on here?
What I did wrong was to add the configuration yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator under the yarn: prefix instead of the capacity-scheduler: prefix (i.e., capacity-scheduler.xml, where it rightly belongs) during cluster creation.
Secondly, I changed yarn:yarn.scheduler.minimum-allocation-vcores, which was initially set to 1.
I'm not sure whether one of these changes or both of them led to the solution (I will update soon). My new cluster creation command looks like this:
gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
First, as you have dynamic allocation enabled, you should set the properties spark.dynamicAllocation.maxExecutors and spark.dynamicAllocation.minExecutors (see https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
Second, make sure you have enough partitions in your Spark job. As you are using dynamic allocation, YARN only allocates just enough executors to match the number of tasks (partitions). So check in the Spark UI whether your jobs (more specifically: stages) have more tasks than you have vCores available. A rough sketch of both suggestions folded into the existing submit command follows (the numbers are placeholders, not tuned recommendations):
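# Illustrative values only: bound dynamic allocation and raise the default
# partition counts so stages generate enough tasks for the available vCores.
gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.dynamicAllocation.enabled=true,spark.dynamicAllocation.minExecutors=10,spark.dynamicAllocation.maxExecutors=140,spark.default.parallelism=280,spark.sql.shuffle.partitions=280 \
src/my_python_file.py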