Spark Driver Heap Memory Issues - apache-spark

I am seeing issues where I slowly run out of Java heap on the master node. Below is a simple example I've created, which just repeats itself 200 times. With the settings below, the master runs out of memory in about an hour with the following error:
16/12/15 17:55:46 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 97578 on executor id: 9 hostname: ip-xxx-xxx-xx-xx
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 20160"...
The Code:
import org.apache.spark.sql.functions._
import org.apache.spark._

object MemTest {
  case class X(colval: Long, colname: Long, ID: Long)

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MemTest")
    val spark = new SparkContext(conf)
    val sc = org.apache.spark.sql.SQLContext.getOrCreate(spark)
    import sc.implicits._

    for (a <- 1 to 200) {
      var df = spark.parallelize((1 to 5000000).map(x => X(x.toLong, x.toLong % 10, x.toLong / 10))).toDF()
      df = df.groupBy("ID").pivot("colname").agg(max("colval"))
      df.count
    }

    spark.stop()
  }
}
I'm running on AWS emr-5.1.0 using m4.xlarge instances (4 nodes + 1 master). Here are my Spark settings:
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "16",
    "spark.executor.memory": "2560m",
    "spark.driver.memory": "768m",
    "spark.executor.cores": "1"
  }
},
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "false"
  }
},
I compile with sbt using
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.2")
and then run it using
spark-submit --class MemTest target/scala-2.11/simple-project_2.11-1.0.jar
Looking at memory with jmap -histo, I see java.lang.Long and scala.Tuple2 instances keep growing.
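For reference, the histogram was taken against the driver process, e.g. (assuming the PID from the OOM message above, 20160, is the driver):
jmap -histo 20160 | head -n 20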

Are you sure the Spark version installed on the cluster is 2.0.2?
Or, if there are several Spark installations on your cluster, are you sure you're calling the correct (2.0.2) spark-submit?
(I unfortunately cannot comment, so that's the reason I posted this as an answer.)
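As a quick sanity check (a small sketch reusing the MemTest code above, not a fix), you can log the Spark version the driver actually runs with:
import org.apache.spark._

// Hypothetical addition at the top of MemTest.main(), right after the SparkContext is created.
val conf = new SparkConf().setAppName("MemTest")
val spark = new SparkContext(conf)
// SparkContext.version reports the Spark runtime the driver is linked against;
// it should print 2.0.2 if the expected spark-submit is being used.
println(s"Running on Spark ${spark.version}")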

Related

Spark DynamoDB Connectivity Issue

Requirement: read data from DynamoDB (on AWS, not local) via Spark using Scala, from my local machine.
Understanding: data can be read using emr-dynamodb-hadoop.jar when we are using an EMR cluster.
Question:
Can we read data from DynamoDB (in the cloud, not local) using emr-dynamodb-hadoop.jar?
An EMR cluster is not to be used; I want to access DynamoDB directly from Spark using Scala code on my local machine.
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
scalacOptions := Seq("-target:jvm-1.8")
libraryDependencies ++= Seq(
"software.amazon.awssdk" % "dynamodb" % "2.15.1",
"org.apache.spark" %% "spark-core" % "2.4.1",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2.0",
"org.apache.httpcomponents" % "httpclient" % "4.5"
)
dependencyOverrides ++= {
Seq(
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7",
"com.fasterxml.jackson.core" % "jackson-core" % "2.6.7"
)
}
readDataFromDDB.scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.{SparkConf, SparkContext}

object readDataFromDDB {
  def main(args: Array[String]): Unit = {
    var sc: SparkContext = null
    try {
      val conf = new SparkConf().setAppName("DynamoDBApplication").setMaster("local")
      sc = new SparkContext(conf)
      val jobConf = getDynamoDbJobConf(sc, "Music", "TableNameForWrite")
      val tableData = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
      println(tableData.count())
    } catch {
      case e: Exception => {
        println(e.getStackTrace)
      }
    } finally {
      sc.stop()
    }
  }

  private def getDynamoDbJobConf(sc: JavaSparkContext, tableNameForRead: String, tableNameForWrite: String) = {
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.servicename", "dynamodb")
    jobConf.set("dynamodb.input.tableName", tableNameForRead)
    jobConf.set("dynamodb.output.tableName", tableNameForWrite)
    jobConf.set("dynamodb.awsAccessKeyId", "*****************")
    jobConf.set("dynamodb.awsSecretAccessKey", "*********************")
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
    jobConf.set("dynamodb.regionid", "us-east-1")
    jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
    jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
    jobConf
  }
}
ERROR
java.lang.RuntimeException: Could not lookup table Music in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:116)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:57)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:153)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Socket not created by this factory
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:105)
... 20 more
Links already reviewed
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
read/write dynamo db from apache spark
Spark 2.2.0 - How to write/read DataFrame to DynamoDB
https://github.com/awslabs/emr-dynamodb-connector
This was solved when the following dependency versions were updated:
"software.amazon.awssdk" % "dynamodb" % "2.15.31",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.14.0"

"Unable to instantiate SparkSession with Hive support" error when trying to process hive table with spark

I want to process a Hive table using Spark, but when I run my program I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
My application code
import org.apache.spark.sql.SparkSession

object spark_on_hive_table extends App {

  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  spark.sql("select * from pbSales").show()
}
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.2",
"org.apache.spark" %% "spark-sql" % "2.3.2",
"org.apache.spark" %% "spark-streaming" % "2.3.2",
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
)
You should remove the provided qualifier from your spark-hive dependency:
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
Change it to:
"org.apache.spark" %% "spark-hive" % "2.3.2"

Derby Metastore directory is created in spark workspace

I have Spark 2.1.0 installed and integrated with Eclipse, and Hive 2 installed with its metastore configured in MySQL; I have also placed the hive-site.xml file in the spark/conf folder. I'm trying to access tables already present in Hive from Eclipse.
When I execute the program, a metastore folder and a derby.log file are created in the Spark workspace, and the Eclipse console shows the INFO below:
Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
17/06/13 18:26:43 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
Spark is not able to locate the configured MySQL metastore database, and it also throws this error:
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
Code:
import org.apache.spark.SparkContext, org.apache.spark.SparkConf
import com.typesafe.config._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object hivecore {
  def main(args: Array[String]) {
    val warehouseLocation = "hdfs://HADOOPMASTER:54310/user/hive/warehouse"

    val spark = SparkSession
      .builder().master("local[*]")
      .appName("hivecore")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("SELECT * FROM sample.source").show()
  }
}
Build.sbt
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "com.typesafe" % "config" % "1.3.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.1.0"
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.42"
NOTE: I am able to access the Hive tables from spark-shell.
Thanks
When you put context.setMaster(local), it may not look for the Spark configuration that you set up on the cluster, especially when you trigger it from Eclipse.
Make a jar out of it and trigger it from the command line with spark-submit --class <main class package> --master spark://207.184.161.138:7077 --deploy-mode client
The master URL spark://207.184.161.138:7077 should be replaced with your cluster's IP and Spark port.
Also, remember to initialize a HiveContext to run queries against the underlying Hive:
val hc = new HiveContext(sc)
hc.sql("SELECT * FROM ...")

How to run sqlContext in the spark-jobserver

I'm trying to execute a job locally in spark-jobserver. My application has the dependencies below:
name := "spark-test"
version := "1.0"
scalaVersion := "2.10.6"
resolvers += Resolver.jcenterRepo
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.6.2" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.6.2_0.4.7" % "test"
I've generated the application package using:
sbt assembly
After that, I've submitted the package like this:
curl --data-binary @spark-test-assembly-1.0.jar localhost:8090/jars/myApp
When I triggered the job, I got the following error:
{
  "duration": "0.101 secs",
  "classPath": "jobs.TransformationJob",
  "startTime": "2017-02-17T13:01:55.549Z",
  "context": "42f857ba-jobs.TransformationJob",
  "result": {
    "message": "java.lang.Exception: Could not find resource path for Web UI: org/apache/spark/sql/execution/ui/static",
    "errorClass": "java.lang.RuntimeException",
    "stack": ["org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:180)", "org.apache.spark.ui.WebUI.addStaticHandler(WebUI.scala:117)", "org.apache.spark.sql.execution.ui.SQLTab.<init>(SQLTab.scala:34)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "scala.Option.foreach(Option.scala:236)", "org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)", "jobs.TransformationJob$.runJob(TransformationJob.scala:64)", "jobs.TransformationJob$.runJob(TransformationJob.scala:14)", "spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)"]
  },
  "status": "ERROR",
  "jobId": "a6bd6f23-cc82-44f3-8179-3b68168a2aa7"
}
Here is the part of the application that is failing:
override def runJob(sparkCtx: SparkContext, config: Config): Any = {
  val sqlContext = new SQLContext(sparkCtx)
  ...
}
I have some questions:
1) I've noticed that to run spark-jobserver locally I don't need to have Spark installed. Does spark-jobserver already come with Spark embedded?
2) How do I know which version of Spark is being used by spark-jobserver? Where is that defined?
3) I'm using version 1.6.2 of spark-sql. Should I change it or keep it?
If anyone can answer my questions, I will be very grateful.
Yes, spark-jobserver has Spark dependencies. Instead of job-server/reStart you should use job-server-extras/reStart, which will pull in the SQL-related dependencies.
Look at project/Versions.scala.
I don't think you need spark-sql, because it is included if you run job-server-extras/reStart.
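If you want to confirm at runtime which Spark version the jobserver is actually running, one option (a small sketch based on the runJob signature shown in the question) is to return it from a job:
override def runJob(sparkCtx: SparkContext, config: Config): Any = {
  // SparkContext.version reports the Spark version loaded in the jobserver JVM.
  sparkCtx.version
}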

How to set PYTHONHASHSEED on AWS EMR

Is there any way to set an environment variable on all nodes of an EMR cluster?
I am getting an error regarding the hash seed when trying to use reduceByKey() in Python 3 PySpark. I can see this is a known error, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it.
I have tried adding a variable to spark-env through the cluster configuration:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3",
          "PYTHONHASHSEED": "123"
        }
      }
    ]
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
but this doesn't work. I have also tried adding a bootstrap script:
#!/bin/bash
export PYTHONHASHSEED=123
but this also doesn't seem to do the trick.
I believe that /usr/bin/python3 isn't picking up the PYTHONHASHSEED environment variable that you are defining in the cluster configuration under the spark-env scope.
You ought to use python34 instead of /usr/bin/python3 and set the configuration as follows:
[
  {
    "classification": "spark-defaults",
    "properties": {
      // [...]
    }
  },
  {
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34",
          "PYTHONHASHSEED": "123"
        }
      }
    ],
    "classification": "spark-env",
    "properties": {
      // [...]
    }
  }
]
Now, let's test it. I define a bash script that calls both Pythons:
#!/bin/bash

echo "using python34"
for i in `seq 1 10`; do
  python -c "print(hash('foo'))";
done

echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`; do
  /usr/bin/python3 -c "print(hash('foo'))";
done
The verdict:
[hadoop@ip-10-0-2-182 ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314
PS1: I am using AMI release emr-4.8.2.
PS2: Snippet inspired by this answer.
EDIT: I have tested the following using pyspark.
16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/
Using Python version 3.4.3 (default, Sep 1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
I also created a simple application (simple_app.py):
from pyspark import SparkContext
sc = SparkContext(appName = "simple-app")
numbers = [hash('foo') for i in range(10)]
print(numbers)
which also seems to work perfectly:
[hadoop@ip-*** ~]$ spark-submit --master yarn simple_app.py
Output (truncated):
[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]
As you can see, it also works, returning the same hash each time.
EDIT 2: From the comments, it seems like you are trying to compute hashes on the executors and not on the driver, so you'll need to set spark.executorEnv.PYTHONHASHSEED inside your Spark application configuration so that it can be propagated to the executors (that's one way to do it).
Note: setting environment variables for the executors is the same with YARN; use spark.executorEnv.[EnvironmentVariableName].
Thus the following minimalist example, again with simple_app.py:
from pyspark import SparkContext, SparkConf
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED","123")
sc = SparkContext(appName="simple-app", conf=conf)
numbers = sc.parallelize(['foo']*10).map(lambda x: hash(x)).collect()
print(numbers)
And now let's test it again. Here is the truncated output:
16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
I think that covers it all.
From the Spark docs:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.
The properties are listed here, so I think you want this one:
Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN.
spark.yarn.appMasterEnv.PYTHONHASHSEED="XXXX"
The EMR docs for configuring spark-defaults.conf are here.
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.PYTHONHASHSEED": "XXX"
    }
  }
]
I just encountered the same problem; adding the following configuration solved it:
# Some settings...
Configurations=[
    {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python34"
                },
                "Configurations": []
            }
        ]
    },
    {
        "Classification": "hadoop-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYTHONHASHSEED": "0"
                },
                "Configurations": []
            }
        ]
    }
],
# Some more settings...
Be careful: we do not use YARN as a cluster manager; for the moment the cluster is only running Hadoop and Spark.
EDIT: Following Tim B's comment, this also seems to work with YARN installed as the cluster manager.
You could probably do it via a bootstrap script, but you'll need to do something like this:
echo "PYTHONHASHSEED=XXXX" >> /home/hadoop/.bashrc
(or possibly .profile)
so that it's picked up by the Spark processes when they are launched.
Your configuration looks reasonable though; it might be worth setting it in the hadoop-env section instead.
