how to reduce apache spark memory footprint? - apache-spark

I have a apache spark full stack+ Apache zeppelin running on a machine with very little resources (512MB) which is crashing.
Spark Command: /usr/lib/jvm/java/bin/java -cp /home/ec2-user/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/ec2-user/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/ec2-user/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/ec2-user/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/ec2-user/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ip-172-31-24-107 --port 7077 --webui-port 8080
========================================
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000daaa0000, 357957632, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 357957632 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-17290/hs_error.log
I know this is a bad idea, but I don't have anywhere else to test it and would like to be able to learn some code in scala + apache spark...
Is there a way I can reduce the memory footprint on spark so I can do my tests?
thanks

Apache zeppelin is a great tool, but I have seen the same thing, takes up a lot of RAM. You could use the command like, in the spark home folder, bin/spark-shell will give you a spark scala shell, but its not pretty and intuitive to use.
You can use Eclipse (scala IDE) or IntelliJ (has a scala plugin)for spark scala development, just need to add the jars with maven or sbt.
You can do your prototyping in the scala shell and copy and paste into the IDE.
Also check out https://github.com/andypetrella/spark-notebook, it takes a smaller RAM foot print. Spark by it self takes less, but zeppelin takes a lot of space, from what I have seen.
Also for scala notebook: https://github.com/alexarchambault/jupyter-scala, then you can add the spark jars to the env, create sparkContext object, and use it.

Related

Application master is killed by yarn while running spark job in cluster mode randomly

The error log is as follows :
20/05/10 18:40:47 ERROR yarn.Client: Application diagnostics message: Application application_1588683044535_1067 failed 2 times due to AM Container for appattempt_1588683044535_1067_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: [2020-05-10 18:40:47.661]Container [pid=209264,containerID=container_e142_1588683044535_1067_02_000001] is running 3313664B beyond the 'PHYSICAL' memory limit. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.6 GB of 3.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e142_1588683044535_1067_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 209264 209262 209264 209264 (bash) 0 0 22626304 372 /bin/bash -c LD_LIBRARY_PATH="/cdhparcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/../../../CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native:" /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.airtel.spark.execution.driver.SparkDriver' --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg 'hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json' --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties 1> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stdout 2> /hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/stderr
|- 209280 209264 209264 209264 (java) 34135 2437 3845763072 393653 /usr/java/jdk1.8.0_181-cloudera/bin/java -server -Xmx1024m -Djava.io.tmpdir=/hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/tmp -Dspark.yarn.app.container.log.dir=/hdfs16/yarn/container-logs/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.airtel.spark.execution.driver.SparkDriver --jar hdfs:///user/aiuat/lib/platform/di-platform-main-1.0.jar --arg hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irsparkbatchjobconf.json,hdfs://nameservice1/user/aiuat/conf/FMS/irrule/irruleexecution.json --properties-file /hdfs4/yarn/nm/usercache/aiuat/appcache/application_1588683044535_1067/container_e142_1588683044535_1067_02_000001/__spark_conf__/__spark_conf__.properties
Some of the observations are :
Application master is getting killed. The memory error is in container of application master itself, not of executer containers.
This job is scheduled via oozie and some instances of job get succeeded and some fails randomly without any pattern. The amount of input data is same in every case.
I have tried the most of solutions suggested on internet.
yarn.mapareduce.map.mb and yarn.mapareduce.reduce.mb is set to 8gb already.
I have also tried increasing driver memory , executer memory , overhead memory of both to very high value, low value, tweaking with these configurations but some instances still failed in every case.
yarn.nodemanager.vmem-pmem-ratio is set to 2.1 vnem check is disable and pnem check is enabled. Unfortunately these configurations can't be changed as it's a production cluster.
yarn.app.mapreduce.am.resource.mb is set to 5GB already. yarn.scheduler.maximum-allocation-mb is set to 26GB
Some of my other confusions are :
Why is memory available to Application master container only 1.5GB as shown in logs if yarn.app.mapreduce.am.resource.mb is set to 5GB ?
As this error comes in the container of application master itself and as per my understanding , application master and spark driver runs in the same jvm. I am concluding that that this error is because of either spark driver memory or application master memory not being sufficient. Does my conclusion seem correct ?
I have fixed this error. So, I thought I will answer this here.
In case of cluster mode, driver memory configurations can't be given on runtime after a sparksession is already created as application master was already launched and driver runs inside yarn application master container. What I was trying to do is to pass driver memory conf via "spark.driver.memory" after creating a sparksession. Spark doesn't give any error for this case and even shows the driver memory as exactly what was provided via this conf in the environment tab on spark ui page, which makes identifying the issue even more difficult. Application master memory was taken as default value 1GB instead of the memory I provided and thus, I was getting this error.

Add extra classpath to executors in Spark client mode

I'm using Spark 1.5.1 with the standalone cluster manager. Spark's default spark-assembly-1.5.1-hadoop2.6.0.jar includes Avro 1.7.7. I want to use my custom Avro library for all my Spark jobs, let's call it Avro 1.7.8. This works perfectly in dev mode (master=local[*]). However, when I submit my app to the cluster in client mode, the executors still use Avro 1.7.7 library.
URL url = getClass().getClassLoader().getResource(GenericData.class.getName().replace('.','/')+".class");
When I print this, my executor's log shows :
/opt/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar/org/apache/avro/generic/GenericData.class
Here is a part of my spark-env.sh on the worker node :
export SPARK_WORKER_OPTS="-Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true
Here is my worker process on the worker node (ps aux | grep worker) :
spark 955 1.8 1.9 4161448 243600 ? Sl 13:29 0:09 /usr/java/jdk1.7.0_79/jre/bin/java -cp /home/ansible/avro-1.7.8.jar:/etc/spark-worker/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-a-01:7077
Obviously, I put this jar : /home/ansible/avro-1.7.8.jar in all my worker nodes.
Does anyone knows how to force the executor to use my jar instead of the spark assembly's one ?
Try using the --packages option to spark-submit:
spark-submit --packages org.apache.avro:avro:1.7.8 ....
Something like that. If you're not using spark-submit, use it -- this is exactly what it is for.

Running a simple Spark script on Mesos with Zookeeper

I want to run a simple spark program, but i am restricted by some errors.
My Environment is:
CentOS:6.6
Java: 1.7.0_51
Scala: 2.10.4
Spark: spark-1.4.0-bin-hadoop2.6
Mesos: 0.22.1
All are installed and nodes are up.Now i have one Mesos master and Mesos slave node. My spark properties are below:
spark.app.id 20150624-185838-2885789888-5050-1291-0005
spark.app.name Spark shell
spark.driver.host 192.168.1.172
spark.driver.memory 512m
spark.driver.port 46428
spark.executor.id driver
spark.executor.memory 512m
spark.executor.uri http://192.168.1.172:8080/spark-1.4.0-bin-hadoop2.6.tgz
spark.externalBlockStore.folderName spark-91aafe3b-01a8-4c86-ac3b-999e278807c5
spark.fileserver.uri http://192.168.1.172:51240
spark.jars
spark.master mesos://zk://192.168.1.172:2181/mesos
spark.mesos.coarse true
spark.repl.class.uri http://192.168.1.172:51600
spark.scheduler.mode FIFO
Now when I started spark, it comes to scala prompt(scala>).
After that I am getting following error: mesos task 1 is now TASK_FAILED, blacklisting mesos slave value due to too many failures is Spark installed on it
How to resolve this.
With only 900MB and spark.driver.memory = 512m, you will be able to launch the scheduler/REPL, but you won't have enough memory for spark.executor.memory = 512m, so any tasks will fail. Either increasing your VM memory size or reducing the driver/executor memory requirements will help you get around these memory limits.
Could you check the mesos slave logs/ task information for more output on why the task failed. You could have a look at :5050.
Probably unrelated question: Do you actually have zookeeper:
spark.master mesos://zk://192.168.1.172:2181/mesos
running (as you mentioned you only have one master)?

SparkDeploySchedulerBackend Error: Application has been killed. All masters are unresponsive

While I'm starting Spark shell:
bin>./spark-shell
I get the following error :
Spark assembly has been built with Hive, including Data nucleus jars on classpath
Welcome to SPARK VERSION 1.3.0
Using Scala version 2.10.4 (Java HotSpot(TM) Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.
15/05/10 12:12:21 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/10 12:12:21 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
I have installed spark by follow below link :- http://www.philchen.com/2015/02/16/how-to-install-apache-spark-and-cassandra-stack-on-ubuntu
You should supply your Spark Cluster's Master URL when start a spark-shell
At least:
bin/spark-shell --master spark://master-ip:7077
All the options make up a long list and you can find the suitable ones yourself:
bin/spark-shell --help
I am assuming that you are running this in standalone/local mode.
Run your spark shell with following line. That indicates you are using all the available cores of your master which is local machine.
bin/spark-shell --master local[*]
http://spark.apache.org/docs/1.2.1/submitting-applications.html#master-urls
You also need to start spark master and slave before giving spark-submit command
start-master.sh
start-slave.sh spark://spark:7077
then use
spark-submit --master spark://spark:7077
Look at your log files for "permission denied" errors... It may happens that your client service doesn't have the proper authority to access your Master folders.

Spark SQL thrift server can't run in cluster mode?

In Spark 1.2.0, when I attempt to start the Spark SQL thrift server in cluster mode, I get the following output:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
========================================
Jar url 'spark-internal' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, file:///XX.jar)
Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options]
Usage: DriverClient kill <active-master> <driver-id>
Options:
-c CORES, --cores CORES Number of cores to request (default: 1)
-m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512)
-s, --supervise Whether to restart the driver on failure
-v, --verbose Print more debugging output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
The "spark-internal" argument seems to be a special flag to tell spark-submit that the class to be run is part of Spark's libraries, so it doesn't need to distribute a jar. But for some reason, this doesn't seem to be working here.
I filed this as SPARK-5176 and it will be addressed with an error message that explains that the Thrift server can not run in cluster mode.

Resources