Spark DynamoDB Connectivity Issue - apache-spark

Requirement: Read data from DynamoDB (on AWS, not a local instance) via Spark using Scala from my local machine.
Understanding: Data can be read using emr-dynamodb-hadoop.jar when running on an EMR cluster.
Question:
Can we read data from DynamoDB (on AWS, not local) using emr-dynamodb-hadoop.jar?
An EMR cluster is not to be used. I want to access DynamoDB directly from Spark using Scala code on my local machine.
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
scalacOptions := Seq("-target:jvm-1.8")
libraryDependencies ++= Seq(
"software.amazon.awssdk" % "dynamodb" % "2.15.1",
"org.apache.spark" %% "spark-core" % "2.4.1",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2.0",
"org.apache.httpcomponents" % "httpclient" % "4.5"
)
dependencyOverrides ++= {
Seq(
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7",
"com.fasterxml.jackson.core" % "jackson-core" % "2.6.7"
)
}
readDataFromDDB.scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

object readDataFromDDB {
  def main(args: Array[String]): Unit = {
    var sc: SparkContext = null
    try {
      val conf = new SparkConf().setAppName("DynamoDBApplication").setMaster("local")
      sc = new SparkContext(conf)
      val jobConf = getDynamoDbJobConf(sc, "Music", "TableNameForWrite")
      // Read the table as an RDD of (Text, DynamoDBItemWritable) pairs
      val tableData = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
      println(tableData.count())
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (sc != null) sc.stop()
    }
  }

  private def getDynamoDbJobConf(sc: SparkContext, tableNameForRead: String, tableNameForWrite: String) = {
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.servicename", "dynamodb")
    jobConf.set("dynamodb.input.tableName", tableNameForRead)
    jobConf.set("dynamodb.output.tableName", tableNameForWrite)
    jobConf.set("dynamodb.awsAccessKeyId", "*****************")
    jobConf.set("dynamodb.awsSecretAccessKey", "*********************")
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
    jobConf.set("dynamodb.regionid", "us-east-1")
    jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
    jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
    jobConf
  }
}
ERROR
java.lang.RuntimeException: Could not lookup table Music in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:116)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:57)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:153)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.&lt;init&gt;(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.&lt;init&gt;(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.&lt;init&gt;(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Socket not created by this factory
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:105)
... 20 more
Links already reviewed
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
read/write dynamo db from apache spark
Spark 2.2.0 - How to write/read DataFrame to DynamoDB
https://github.com/awslabs/emr-dynamodb-connector

This was solved when the following dependency versions were updated:
"software.amazon.awssdk" % "dynamodb" % "2.15.31",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.14.0"

Related

"Unable to instantiate SparkSession with Hive support" error when trying to process hive table with spark

I want to process a Hive table using Spark, but when I run my program I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
My application code
import org.apache.spark.sql.SparkSession

object spark_on_hive_table extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  spark.sql("select * from pbSales").show()
}
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.2",
"org.apache.spark" %% "spark-sql" % "2.3.2",
"org.apache.spark" %% "spark-streaming" % "2.3.2",
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
)
You should remove the "provided" scope from your spark-hive dependency:
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
change to
"org.apache.spark" %% "spark-hive" % "2.3.2"

Derby Metastore directory is created in spark workspace

I have Spark 2.1.0 installed and integrated with Eclipse, and Hive 2 installed with its metastore configured in MySQL; I have also placed the hive-site.xml file in Spark's conf folder. I'm trying to access tables already present in Hive from Eclipse.
When I execute the program, a metastore folder and a derby.log file are created in the Spark workspace, and the Eclipse console shows the INFO below:
Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
17/06/13 18:26:43 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
Spark is not able to locate the configured MySQL metastore database, and it also throws the error:
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
Code:
import org.apache.spark.SparkContext, org.apache.spark.SparkConf
import com.typesafe.config._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object hivecore {
  def main(args: Array[String]) {
    val warehouseLocation = "hdfs://HADOOPMASTER:54310/user/hive/warehouse"
    val spark = SparkSession
      .builder().master("local[*]")
      .appName("hivecore")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("SELECT * FROM sample.source").show()
  }
}
Build.sbt
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "com.typesafe" % "config" % "1.3.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.1.0"
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.42"
NOTE: I am able to access the Hive tables from spark-shell.
Thanks
When you set the master to local, the application may not pick up the Spark configuration that you set up on the cluster, especially when you trigger it from Eclipse.
Make a jar out of it and trigger it from the command line with spark-submit --class <main class package> --master spark://207.184.161.138:7077 --deploy-mode client
The master URL spark://207.184.161.138:7077 should be replaced with your cluster's IP and Spark port.
And remember to initialize a HiveContext to run queries against the underlying Hive:
val hc = new HiveContext(sc)
hc.sql("SELECT * FROM ...")

How to run sqlContext in the spark-jobserver

I'm trying to execute a job locally in spark-jobserver. My application has the dependencies below:
name := "spark-test"
version := "1.0"
scalaVersion := "2.10.6"
resolvers += Resolver.jcenterRepo
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.6.2" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.6.2_0.4.7" % "test"
I've generated the application package using:
sbt assembly
After that, I've submitted the package like this:
curl --data-binary @spark-test-assembly-1.0.jar localhost:8090/jars/myApp
When I triggered the job, I got the following error:
{
"duration": "0.101 secs",
"classPath": "jobs.TransformationJob",
"startTime": "2017-02-17T13:01:55.549Z",
"context": "42f857ba-jobs.TransformationJob",
"result": {
"message": "java.lang.Exception: Could not find resource path for Web UI: org/apache/spark/sql/execution/ui/static",
"errorClass": "java.lang.RuntimeException",
"stack": ["org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:180)", "org.apache.spark.ui.WebUI.addStaticHandler(WebUI.scala:117)", "org.apache.spark.sql.execution.ui.SQLTab.<init>(SQLTab.scala:34)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "scala.Option.foreach(Option.scala:236)", "org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)", "jobs.TransformationJob$.runJob(TransformationJob.scala:64)", "jobs.TransformationJob$.runJob(TransformationJob.scala:14)", "spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)"]
},
"status": "ERROR",
"jobId": "a6bd6f23-cc82-44f3-8179-3b68168a2aa7"
}
Here is the part of the application that is failing:
override def runJob(sparkCtx: SparkContext, config: Config): Any = {
  val sqlContext = new SQLContext(sparkCtx)
  ...
}
I have some questions:
1) I've noticed that to run spark-jobserver locally I don't need to have Spark installed. Does spark-jobserver already come with Spark embedded?
2) How do I know which version of Spark is being used by spark-jobserver? Where is that specified?
3) I'm using version 1.6.2 of spark-sql. Should I change it or keep it?
If anyone can answer my questions, I will be very grateful.
Yes, spark-jobserver has Spark dependencies. Instead of job-server/reStart you should use job-server-extras/reStart, which will pull in the SQL-related dependencies.
Look at project/Versions.scala to see which Spark version is used.
You don't need spark-sql, I think, because it is included when you run job-server-extras/reStart.
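As a side note, and only as a sketch: Spark 1.6 provides SQLContext.getOrCreate, so the failing job could reuse an existing SQLContext instead of constructing a new one on every run. This assumes the SparkJob trait from job-server-api (validate/runJob as in the project's examples), and whether the missing Web UI resources error goes away still depends on running through job-server-extras as suggested above:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

// Sketch of a job that reuses the SQLContext tied to the shared SparkContext
object TransformationJobSketch extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val sqlContext = SQLContext.getOrCreate(sc)
    sqlContext.sql("SELECT 1").collect().length
  }
}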

Spark Driver Heap Memory Issues

I am seeing issues where I slowly run out of Java heap on the master node. Below is a simple example I've created which just repeats the same work 200 times. With the settings below, the master runs out of memory in about 1 hour with the following error:
16/12/15 17:55:46 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 97578 on executor id: 9 hostname: ip-xxx-xxx-xx-xx
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 20160"...
The Code:
import org.apache.spark.sql.functions._
import org.apache.spark._

object MemTest {
  case class X(colval: Long, colname: Long, ID: Long)

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MemTest")
    val spark = new SparkContext(conf)
    val sc = org.apache.spark.sql.SQLContext.getOrCreate(spark)
    import sc.implicits._

    for (a <- 1 to 200) {
      var df = spark.parallelize((1 to 5000000).map(x => X(x.toLong, x.toLong % 10, x.toLong / 10))).toDF()
      df = df.groupBy("ID").pivot("colname").agg(max("colval"))
      df.count
    }

    spark.stop()
  }
}
I'm running on AWS emr-5.1.0 using m4.xlarge (4 nodes+1 master). Here are my spark settings
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "16",
    "spark.executor.memory": "2560m",
    "spark.driver.memory": "768m",
    "spark.executor.cores": "1"
  }
},
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "false"
  }
},
I compile with sbt using
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.2")
and then run it using
spark-submit --class MemTest target/scala-2.11/simple-project_2.11-1.0.jar
Looking at memory with jmap -histo I see java.lang.Long and scala.Tuple2 keep growing.
Are you sure the Spark version installed on the cluster is 2.0.2?
Or, if there are several Spark installations on your cluster, are you sure you're calling the correct (2.0.2) spark-submit?
(I unfortunately cannot comment, so that's the reason I posted this as an answer.)
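One quick way to check this (a sketch, not part of the original answer): log the runtime version from the driver, which tells you which Spark installation spark-submit actually launched. The object name here is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object VersionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VersionCheck"))
    // sc.version reports the version of the Spark runtime the driver is running on
    println(s"Running on Spark ${sc.version}")
    sc.stop()
  }
}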

Not getting the proper base url using requestUri on server machine using spark-submit

My requirement is to create REST JSON from the request URI using Spray. I am using the requestUri directive to get the base URL. When I run it through the IDE or through spark-submit locally on my system, I get the proper output. But when I do spark-submit on the cluster, the requestUri directive does not give me the base URL; the URL I get is partial, so the expected output is not correct either.
The code to get the url is
requestUri { uri =>
  val reqUri = uri.toString()
  println("URI " + reqUri)
  // complete with the captured URI so the route returns it
  complete {
    reqUri
  }
}
build.sbt looks like this
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
resolvers ++= Seq(
"Akka Repository" at "http://repo.akka.io/releases/")
resolvers ++= Seq("Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/",
"Spray Repository" at "http://repo.spray.io")
libraryDependencies +=
"com.typesafe.akka" %% "akka-actor" % "2.3.0"
libraryDependencies ++= {
val sprayVersion = "1.3.1"
Seq(
"io.spray" %% "spray-can" % sprayVersion,
"io.spray" %% "spray-routing" % sprayVersion,
"io.spray" %% "spray-json" % sprayVersion
)
}
Please let me know how I can fix this issue. All your suggestions are valuable. Thanks in advance.
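For reference, a minimal self-contained sketch of the route in question, assuming spray-can's SimpleRoutingApp; the object name, interface, port, and path are placeholders, and this only reproduces the approach from the question rather than explaining the truncated URL on the cluster:

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object UriEchoService extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("uri-echo")

  // Echo back the full request URI as seen by spray's requestUri directive
  startServer(interface = "0.0.0.0", port = 8080) {
    path("echo") {
      requestUri { uri =>
        val reqUri = uri.toString()
        println("URI " + reqUri)
        complete(reqUri)
      }
    }
  }
}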
