Locally installed spark history server is empty - apache-spark

I have installed Spark 3.2.1 locally through Homebrew and I want to view the Spark history server.
Below is my spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file:/tmp/spark-events
spark.history.fs.logDirectory file:/tmp/spark-events
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 7d
spark.history.fs.cleaner.maxAge 15d
When I look in /tmp/spark-events, it is always empty. Do I need to enable anything else?
I can open the Spark History Server UI, but it is empty:
http://localhost:18080/?showIncomplete=true
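In case it helps: the event log directory has to exist before a job runs, and the job has to pick up the same spark-defaults.conf that the history server uses. A minimal sketch, assuming a Homebrew layout under /opt/homebrew/opt/apache-spark/libexec (that path is an assumption; adjust to your install, and make sure spark-defaults.conf lives in its conf directory):
# Create the event log directory before running any job (path taken from the conf above)
mkdir -p /tmp/spark-events
# Start the history server from the Spark distribution's sbin directory
/opt/homebrew/opt/apache-spark/libexec/sbin/start-history-server.sh
# Run something that generates events, then check that a log file appeared
/opt/homebrew/opt/apache-spark/libexec/bin/spark-shell
ls /tmp/spark-events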

Related

Spark shuffle fails with AccessDenied exception

I run an Apache Spark streaming job on a cluster with:
spark.master yarn
spark.submit.deployMode cluster
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
But it fails with Caused by: java.nio.file.AccessDeniedException: /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1590438937007_0020/blockmgr-ff867859-36d0-4db7-8243-bfabfb3bd40d/0d/shuffle_0_3_0.index. I run it as the hadoop user, and I cannot understand why the process that created that file is unable to read (or modify) it.
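A first diagnostic step (a sketch, not a confirmed fix): because the external shuffle service serves these files from inside the NodeManager process rather than the executor, it is worth comparing the user the NodeManager runs as with the owner and permissions of the local shuffle directories:
# Check who owns the NodeManager local dirs and with what permissions
ls -ld /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop
ls -lR /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache | head
# Check which user the NodeManager itself is running as
ps -ef | grep -i nodemanager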

Apache Spark 2.4 is not working with Hadoop 2.8.3?

I have installed Hadoop 2.8.3 in my Windows 10 environment and it is working fine. Now when I try to install Apache Spark 2.4.0 with YARN as the cluster manager, it is not working. When I submit a Spark job with spark-submit for testing, it appears under the ACCEPTED tab in the YARN UI and then fails.
I have attached an image of the YARN container logs.
This is my spark-defaults.conf
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 2G
spark.executor.cores 1
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://localhost:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
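One thing worth verifying with this configuration (a hedged suggestion, since the container logs are not shown here) is that the HDFS event log directory referenced by spark.eventLog.dir actually exists and is writable by the submitting user:
# Create the event log directory before submitting jobs
hdfs dfs -mkdir -p /spark-logs
# Confirm it exists (and check ownership/permissions)
hdfs dfs -ls /spark-logs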

How do I run Spark 2.2 on YARN and HDP?

I am trying to run Spark 2.2 with HDP 2.6. I stop Spark2 from Ambari, then I run:
/spark/bin/spark-shell \
  --jars /home/ed/.ivy2/jars/stanford-corenlp-3.6.0-models.jar,/home/ed/.ivy2/jars/jersey-bundle-1.19.1.jar \
  --packages databricks:spark-corenlp:0.2.0-s_2.11,edu.stanford.nlp:stanford-corenlp:3.6.0 \
  --master yarn --deploy-mode client \
  --driver-memory 4g --executor-memory 4g --executor-cores 2 --num-executors 11 \
  --conf spark.hadoop.yarn.timeline-service.enabled=false
It used to run fine, then it started giving me:
17/12/09 10:16:54 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
I can run it OK without --master yarn --deploy-mode client, but then I only get the driver as an executor.
I have tried spark.hadoop.yarn.timeline-service.enabled = true.
yarn.nodemanager.vmem-check-enabled and pmem-check-enabled are set to false.
Can anyone help or point me where to look for errors? TIA!
PS spark-defaults.conf:
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir hdfs:///spark2-history/
spark.eventLog.enabled true
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.history.fs.logDirectory hdfs:///spark2-history/
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18081
spark.yarn.historyServer.address master.royble.co.uk:18081
spark.yarn.queue default
spark.yarn.jar=hdfs:///master.royble.co.uk/user/hdfs/sparklib/*.jar
spark.driver.extraJavaOptions -Dhdp.version=2.6.0.3-8
spark.executor.extraJavaOptions -Dhdp.version=2.6.0.3-8
spark.yarn.am.extraJavaOptions -Dhdp.version=2.6.0.3-8
I've also tried the -Dhdp.version= fixes suggested elsewhere.
Update: I upgraded to HDP 2.6.3 and it now works.
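For anyone who hits the same "Yarn application has already ended" error without the option of upgrading, the YARN application logs usually contain the real reason the ApplicationMaster died. A quick way to pull them (the application id below is a placeholder):
# List recent applications and pull the logs for the failed one
yarn application -list -appStates ALL
# Replace the id with the one YARN reports for your failed run
yarn logs -applicationId application_1512345678901_0001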

Setting yarn shuffle for spark makes spark-shell not start

I have a cluster of 4 Ubuntu 14.04 machines where I am setting up Spark 2.1.0 (prebuilt for Hadoop 2.7) to run on top of Hadoop 2.7.3, and I am configuring it to work with YARN. Running jps on each node I get:
node-1
22546 Master
22260 ResourceManager
22916 Jps
21829 NameNode
22091 SecondaryNameNode
node-2
12321 Worker
12485 Jps
11978 DataNode
node-3
15938 Jps
15764 Worker
15431 DataNode
node-4
12251 Jps
12075 Worker
11742 DataNode
Without the YARN shuffle configuration,
./bin/spark-shell --master yarn --deploy-mode client
starts just fine when called on my node-1.
In order to configure an external shuffle service, I read this: http://spark.apache.org/docs/2.1.0/running-on-yarn.html#configuring-the-external-shuffle-service
And what I have done is:
Added the following properties to yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/spark/spark-2.1.0-bin-hadoop2.7/yarn/spark-2.1.0-yarn-shuffle.jar</value>
</property>
I do have other properties in this file. Leaving these 3 properties out, as I said, lets spark-shell --master yarn --deploy-mode client start normally.
My spark-defaults.conf is:
spark.master spark://singapura:7077
spark.executor.memory 4g
spark.driver.memory 2g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://singapura:8020/spark/logs
spark.history.fs.logDirectory hdfs://singapura:8020/spark/logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.scheduler.mode FAIR
spark.yarn.stagingDir hdfs://singapura:8020/spark
spark.yarn.jars=hdfs://singapura:8020/spark/jars/*.jar
spark.yarn.am.memory 2g
spark.yarn.am.cores 4
All nodes have the same paths. singapura is my node-1. It's already set in my /etc/hosts and nslookup returns the correct IP. The machine name is not the issue here.
So, what happens is: when I add these 3 properties to my yarn-site.xml and start the Spark shell, it gets stuck without much output.
localuser#singapura:~$ /usr/local/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-shell --master yarn --deploy-mode client
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I wait, wait and wait, and nothing more is printed out. I have to kill it and erase the staging directory (if I don't erase it, I get WARN yarn.Client: Failed to cleanup staging dir the next time I call it).
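A hedged guess, following the external shuffle service instructions linked above: the spark_shuffle aux service has to be on the NodeManager's own classpath, and every NodeManager has to be restarted after the yarn-site.xml change, otherwise containers can hang or never start. A sketch of those steps (the Hadoop path below is an assumption; adjust to your layout):
# Copy the shuffle service jar somewhere already on the NodeManager classpath
cp /usr/local/spark/spark-2.1.0-bin-hadoop2.7/yarn/spark-2.1.0-yarn-shuffle.jar \
   /usr/local/hadoop/share/hadoop/yarn/lib/
# Restart the NodeManager on every worker node so the aux service is loaded
/usr/local/hadoop/sbin/yarn-daemon.sh stop nodemanager
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
# The NodeManager log will show whether YarnShuffleService registered successfully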

Spark ignores SPARK_WORKER_MEMORY?

I'm using standalone cluster mode, Spark 1.5.2.
Even though I'm setting SPARK_WORKER_MEMORY in spark-env.sh, it looks like this setting is ignored.
I can't find any indication in the scripts under bin/sbin of where -Xms/-Xmx are set.
If I run the ps command on the worker PID, it looks like the memory is set to 1G:
[hadoop#sl-env1-hadoop1 spark-1.5.2-bin-hadoop2.6]$ ps -ef | grep 20232
hadoop 20232 1 0 02:01 ? 00:00:22 /usr/java/latest//bin/java
-cp /workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/workspace/
3rd-party/hadoop/2.6.3//etc/hadoop/ -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker
--webui-port 8081 spark://10.52.39.92:7077
spark-defaults.conf:
spark.master spark://10.52.39.92:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 2g
spark.executor.cores 1
spark-env.sh:
export SPARK_MASTER_IP=10.52.39.92
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=12g
Am I missing something?
Thanks.
When using spark-shell or spark-submit, use the --executor-memory option.
When configuring it for a standalone jar, set the system property programmatically before creating the spark context.
System.setProperty("spark.executor.memory", executorMemory)
You are using the wrong setting for cluster mode.
SPARK_EXECUTOR_MEMORY is the right option to set executor memory in cluster mode.
SPARK_WORKER_MEMORY works only in standalone deploy mode.
Another way to set executor memory from the command line: -Dspark.executor.memory=2g
Have a look at one more related SE question regarding these settings:
Spark configuration, what is the difference of SPARK_DRIVER_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_WORKER_MEMORY?
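To tie the options above together, a short sketch of the command-line equivalents (the master URL is taken from the question; the application class and jar are placeholders):
# Executor memory via the dedicated flag
./bin/spark-shell --master spark://10.52.39.92:7077 --executor-memory 2g
# Or via a Spark conf entry on spark-submit (your.Main and your-app.jar are placeholders)
./bin/spark-submit --master spark://10.52.39.92:7077 \
  --conf spark.executor.memory=2g \
  --class your.Main your-app.jar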
This is my configuration for cluster mode, in spark-defaults.conf:
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
Do you have something like this?
If you don't add these settings (with your own values), the Spark executor will get 1 GB of RAM by default.
Otherwise, you can pass these options to ./spark-submit like this:
# Run on a YARN cluster (--deploy-mode can be client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
When you run an application, check the master UI at http://<master ip or hostname>:8080 to see whether resources have been allocated correctly.
I've encountered the same problem. The reason is that, in standalone mode, spark.executor.memory is actually ignored. What has an effect is spark.driver.memory, because the executor lives in the driver.
So what you can do is to set spark.driver.memory as high as you want.
This is where I've found the explanation:
How to set Apache Spark Executor memory
