Dataproc adding extra field when adding keys using --properties? - apache-spark

I am trying to update or add new fields to the Hive config in a Dataproc cluster using the --properties flag. I am running the Dataproc cluster creation command from Cloud Shell. What I am seeing is that Dataproc adds the new key with a final element. I am unable to find out what it means.
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
  <final>false</final>
  <source>Dataproc Cluster Properties</source>
</property>
Also, when does Dataproc apply these changes to hive-site.xml? After the Hive service starts running on the cluster, or before?
Also, I am unable to find any documentation on how to restart Hive and Spark after making changes to the Hive config after cluster creation.

1) If a property is marked final, it cannot be overridden by users on a per-job basis (e.g. using command-line parameters or setting properties in SparkConf/Configuration). We have explicitly made cluster-wide properties overridable. See https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html
2) Dataproc applies --properties to the XML files before starting any services (a syntax sketch follows this list).
3) If you manually change properties, you can restart the relevant services by SSHing into the master node of the cluster and running sudo systemctl restart <service>. For Hive, that's hive-metastore and hive-server2. For Spark, that's spark-history-server. Several initialization actions do this.
4) Consider deleting and recreating your cluster if you want to change properties -- that's a bit easier than figuring out which services to restart.
5) Remember that you can still set configs on a per-job basis. If you're using gcloud, that's something like gcloud dataproc jobs submit spark --properties spark.executor.cores=4 ...other args...; with spark-submit you can use --conf, and with Hive you can use set prop=value.
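For reference, a minimal sketch of the cluster-level syntax, using the prefix:key=value form; the cluster name is a placeholder:
gcloud dataproc clusters create my-cluster \
    --properties hive:hive.compactor.worker.threads=1,spark:spark.executor.cores=2
The hive: prefix writes into hive-site.xml and the spark: prefix into spark-defaults.conf.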

Related

Do I need to restart nodes if I am running Spark on YARN after changing spark-env.sh or spark-defaults?

I am working on changing the Spark conf in order to limit the logs for my Spark Structured Streaming log files. I have figured out the properties to do so, but it is not working right now. Do I need to restart all nodes (name and worker nodes), or is restarting the jobs enough?
We are using Google Dataproc clusters and running Spark with YARN.
The simplest approach is to set these properties at cluster creation time using Dataproc Cluster Properties:
gcloud dataproc clusters create $CLUSTER_NAME \
--properties spark:<key>=<value>,yarn:<key>=<value>
Or set them when submitting your Spark application.
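For the second option, a rough sketch of a per-job submission; the script path and property values are placeholders, not values from the question:
gcloud dataproc jobs submit pyspark gs://my-bucket/run.py \
    --cluster=$CLUSTER_NAME \
    --properties=spark.executor.memory=4g,spark.executor.cores=2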

How to pass Spark job properties to DataProcSparkOperator in Airflow?

I am trying to execute a Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS, and I am creating the Dataproc cluster on the fly and then executing this jar on the newly created cluster.
I am able to execute this with Airflow's DataProcSparkOperator using default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory, etc.).
The Airflow documentation didn't help, and I have tried many things that didn't work out.
Help is appreciated.
To configure the Spark job through DataProcSparkOperator, you need to use the dataproc_spark_properties parameter.
For example, you can set deployMode like this:
DataProcSparkOperator(
    dataproc_spark_properties={'spark.submit.deployMode': 'cluster'})
In this answer you can find more details.
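For context, here is a fuller, hedged sketch of how that parameter might sit inside a task definition; the class, jar path, and cluster name are placeholders, not values from the question:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

with DAG('dataproc_spark_example', start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    submit_spark_job = DataProcSparkOperator(
        task_id='submit_spark_job',
        cluster_name='my-cluster',                           # cluster created earlier in the DAG (placeholder)
        main_class='org.example.MyApp',                      # entry point in the jar (placeholder)
        dataproc_spark_jars=['gs://my-bucket/my-app.jar'],   # jar on GCS (placeholder)
        dataproc_spark_properties={
            'spark.submit.deployMode': 'cluster',            # stands in for --deploy-mode
            'spark.driver.memory': '2g',                     # stands in for --driver-memory
        },
    )
Note that --master is not needed here, since Dataproc jobs run on YARN by default.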

Airflow and Spark/Hadoop - a single cluster, or one for Airflow and another for Spark/Hadoop?

I'm trying to figure out the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster, and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.
Any advice about this? It looks like it's a little complicated to deploy Spark remotely from another cluster, and it will create some configuration-file duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
Once an ApplicationMaster is deployed within YARN, Spark runs local to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath (not every NodeManager may have a Hive client installed on it).
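As a rough sketch of that first approach, assuming you have copied the cluster's yarn-site.xml (and, if needed, hdfs-site.xml/hive-site.xml) into a local directory on the Airflow host; the paths are placeholders:
export HADOOP_CONF_DIR=/opt/remote-cluster-conf   # directory holding the copied yarn-site.xml
spark-submit --master yarn --deploy-mode client /path/to/app.py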
I prefer submitting Spark jobs using SSHOperator and running the spark-submit command, which saves you from copying around yarn-site.xml. Also, I would not create a cluster for Airflow if the only task it performs is running Spark jobs; a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow:
EMR Step
Apache Livy (see this for a hint)
SSH
Do note that none of these are plug-and-play ready, and you'll have to write your own operators to get things done (a minimal SSH-based sketch follows below).
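For the SSH option, a minimal, hedged SSHOperator sketch, assuming an Airflow SSH connection named spark_cluster_ssh that points at an edge node with spark-submit on its PATH; the application path is a placeholder:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG('ssh_spark_submit_example', start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    submit_via_ssh = SSHOperator(
        task_id='spark_submit_via_ssh',
        ssh_conn_id='spark_cluster_ssh',   # assumed connection to the Spark/Hadoop edge node
        command='spark-submit --master yarn --deploy-mode client /path/to/app.py',  # placeholder path
    )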

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers), and Spark is configured by default.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my machine.
How can I make sure that the master is using the workers in order to reduce the execution time?
You can check the Spark UI. If it's running on top of YARN, open the YARN UI and click on your application ID, which will open the Spark UI. Check under the Executors tab; it also shows the node IP addresses.
Could you please share your spark-submit config?
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do so, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. Anyway, check this link for other parameters that spark-submit accepts.
For a Dataproc cluster (Google's managed Hadoop cluster), you have two options to check the job history, including jobs that are currently running:
From the command line on the master: yarn application -list. This option sometimes needs additional configuration. If you have trouble, this link will be useful.
Through the UI. Dataproc lets you access the Spark web UI, which makes monitoring easier. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create an SSH tunnel and configure your browser to use a SOCKS proxy.
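As a hedged example of that tunnel-plus-proxy setup (the cluster name, zone, and browser binary are placeholders; the pattern follows the Dataproc web-interfaces guide):
# In one terminal: open a SOCKS tunnel to the master node (blocks while open).
gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N
# In another terminal: start a browser that routes through the proxy and open the YARN UI (port 8088).
/usr/bin/chromium-browser --proxy-server="socks5://localhost:1080" \
    --user-data-dir=/tmp/my-cluster-m http://my-cluster-m:8088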
Hope the information above helps you.

How to give JupyterHub access to Hive tables through Spark in EMR

The default installation of JupyterHub in EMR has no access to the Hive context in Spark. How can I fix this?
To grant Spark access to the Hive context, you need to edit the livy.conf file (/etc/livy/conf.dist/livy.conf) like this:
livy.repl.enableHiveContext = true
and then restart your notebook and the Livy service, following the instructions here. Basically:
sudo stop livy-server
sudo start livy-server
An easy way to check whether it's working is to list the databases from your Spark notebook:
spark.sql("show databases").show
You may want to configure this at EMR boot time by using EMR's standard application configuration feature: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
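A hedged sketch of what that might look like at cluster creation with the AWS CLI, assuming the livy-conf classification applies; the cluster name, release label, and instance settings are placeholders, and you will also need your usual EC2/role options:
aws emr create-cluster \
    --name "jupyterhub-hive" \
    --release-label emr-5.23.0 \
    --applications Name=Hadoop Name=Spark Name=Livy Name=Hive Name=JupyterHub \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"livy-conf","Properties":{"livy.repl.enableHiveContext":"true"}}]'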
