I have Spark 2.1.0 installed and integrated with Eclipse, and Hive 2 installed with its metastore configured in MySQL; I have also placed the hive-site.xml file in Spark's conf folder. I'm trying to access tables already present in Hive from Eclipse.
When I execute the program, a metastore folder and a derby.log file are created in the Spark workspace, and the Eclipse console shows the INFO below:
Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/06/13 18:26:43 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
17/06/13 18:26:43 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
Spark is not able to locate the configured MySQL metastore database, and it also throws this error:
Exception in thread "main" java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
Code:
import org.apache.spark.SparkContext, org.apache.spark.SparkConf
import com.typesafe.config._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
object hivecore {
def main(args: Array[String]) {
val warehouseLocation = "hdfs://HADOOPMASTER:54310/user/hive/warehouse"
val spark = SparkSession
.builder().master("local[*]")
.appName("hivecore")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
sql("SELECT * FROM sample.source").show()
}
}
Build.sbt
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "com.typesafe" % "config" % "1.3.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.1.0"
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.42"
NOTE: I am able to access the Hive tables from spark-shell.
Thanks
When you set the master to local, the application may not pick up the Spark configuration you set up on the cluster, especially when you launch it from Eclipse.
Make a jar out of it and launch it from the command line: spark-submit --class <main class package> --master spark://207.184.161.138:7077 --deploy-mode client
The master URL spark://207.184.161.138:7077 should be replaced with your cluster's IP and Spark port.
Also, remember to initialize a HiveContext to run queries against the underlying Hive:
val hc = new HiveContext(sc)
hc.sql("SELECT * FROM ...")
Requirement: Read data from DynamoDB (on AWS, not local) via Spark using Scala from my local machine.
Understanding: Data can be read using the emr-dynamodb-hadoop.jar when we are using an EMR cluster.
Question:
Can we read data from DynamoDB (on AWS, not local) using the emr-dynamodb-hadoop.jar?
An EMR cluster is not to be used; I want to access DynamoDB directly from Spark using Scala code on my local machine.
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
scalacOptions := Seq("-target:jvm-1.8")
libraryDependencies ++= Seq(
"software.amazon.awssdk" % "dynamodb" % "2.15.1",
"org.apache.spark" %% "spark-core" % "2.4.1",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2.0",
"org.apache.httpcomponents" % "httpclient" % "4.5"
)
dependencyOverrides ++= {
Seq(
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7",
"com.fasterxml.jackson.core" % "jackson-core" % "2.6.7"
)
}
readDataFromDDB.scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.{SparkConf, SparkContext}
object readDataFromDDB {
def main(args: Array[String]): Unit = {
var sc: SparkContext = null
try {
val conf = new SparkConf().setAppName("DynamoDBApplication").setMaster("local")
sc = new SparkContext(conf)
val jobConf = getDynamoDbJobConf(sc, "Music", "TableNameForWrite")
val tableData = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
println(tableData.count())
} catch {
case e: Exception => {
e.printStackTrace()
}
} finally {
sc.stop()
}
}
private def getDynamoDbJobConf(sc: SparkContext, tableNameForRead: String, tableNameForWrite: String) = {
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", tableNameForRead)
jobConf.set("dynamodb.output.tableName", tableNameForWrite)
jobConf.set("dynamodb.awsAccessKeyId", "*****************")
jobConf.set("dynamodb.awsSecretAccessKey", "*********************")
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
jobConf
}
}
ERROR
java.lang.RuntimeException: Could not lookup table Music in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:116)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:57)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:153)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Socket not created by this factory
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:105)
... 20 more
Links already reviewed
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
read/write dynamo db from apache spark
Spark 2.2.0 - How to write/read DataFrame to DynamoDB
https://github.com/awslabs/emr-dynamodb-connector
This was solved when the following dependency versions were updated:
"software.amazon.awssdk" % "dynamodb" % "2.15.31",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.14.0"
First of all, I'm new to this whole tech stack, so if I don't present all the details, please let me know.
Here's my problem: I'm trying to build a jar archive of an Apache Spark/Kafka app. To package my app into a jar I use the sbt-assembly plugin:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Packaging the jar completes successfully.
Now if I try to run it with:
spark-submit kafka-consumer.jar
the app boots up successfully.
I want to do the same with the java -jar command, but unfortunately it fails.
Here's what the stack trace looks like:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/29 11:16:23 INFO SparkContext: Running Spark version 2.4.4
20/03/29 11:16:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/29 11:16:23 INFO SparkContext: Submitted application: KafkaConsumer
20/03/29 11:16:23 INFO SecurityManager: Changing view acls to: popar
20/03/29 11:16:23 INFO SecurityManager: Changing modify acls to: popar
20/03/29 11:16:23 INFO SecurityManager: Changing view acls groups to:
20/03/29 11:16:23 INFO SecurityManager: Changing modify acls groups to:
20/03/29 11:16:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(popar); groups with view permissions: Set(); users with modify permissions: Set(popar); groups with modify permissions: Set()
20/03/29 11:16:25 INFO Utils: Successfully started service 'sparkDriver' on port 55595.
20/03/29 11:16:25 INFO SparkEnv: Registering MapOutputTracker
20/03/29 11:16:25 INFO SparkEnv: Registering BlockManagerMaster
20/03/29 11:16:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/29 11:16:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/29 11:16:25 INFO DiskBlockManager: Created local directory at C:\Users\popar\AppData\Local\Temp\blockmgr-77af3fbc-264e-451c-9df3-5b7dda58f3a8
20/03/29 11:16:25 INFO MemoryStore: MemoryStore started with capacity 898.5 MB
20/03/29 11:16:26 INFO SparkEnv: Registering OutputCommitCoordinator
20/03/29 11:16:26 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/03/29 11:16:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://DESKTOP-0IISN4F.mshome.net:4040
20/03/29 11:16:26 INFO Executor: Starting executor ID driver on host localhost
20/03/29 11:16:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55636.
20/03/29 11:16:26 INFO NettyBlockTransferService: Server created on DESKTOP-0IISN4F.mshome.net:55636
20/03/29 11:16:26 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/29 11:16:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, DESKTOP-0IISN4F.mshome.net, 55636, None)
20/03/29 11:16:26 INFO BlockManagerMasterEndpoint: Registering block manager DESKTOP-0IISN4F.mshome.net:55636 with 898.5 MB RAM, BlockManagerId(driver, DESKTOP-0IISN4F.mshome.net, 55636, None)
20/03/29 11:16:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, DESKTOP-0IISN4F.mshome.net, 55636, None)
20/03/29 11:16:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, DESKTOP-0IISN4F.mshome.net, 55636, None)
Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2798)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2809)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2848)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2830)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:181)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:239)
at KafkaConsumer$.main(KafkaConsumer.scala:85)
at KafkaConsumer.main(KafkaConsumer.scala)
As you can see it fails with: Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
Here is the definition of my main class:
import Service._
import kafka.serializer.StringDecoder
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Put, Scan, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.JavaConversions
import scala.util.Try
object KafkaConsumer {
def setupLogging(): Unit = {
val rootLogger = Logger.getRootLogger
rootLogger.setLevel(Level.ERROR)
}
def persistToHbase[A](rdd: A): Unit = {
val connection = getHbaseConnection
val admin = connection.getAdmin
val columnFamily1 = "personal_data"
val table = connection.getTable(TableName.valueOf("employees"))
val scan = new Scan()
scan.addFamily(columnFamily1.getBytes())
val totalRows: Int = getLastRowNumber(scan, columnFamily1, table)
persistRdd(rdd, table, columnFamily1, totalRows + 1)
Try(table.close())
Try(admin.close())
Try(connection.close())
}
private def getLastRowNumber[A](scan: Scan,
columnFamily: String,
table: Table): Int = {
val scanner = table.getScanner(scan)
val values = scanner.iterator()
val seq = JavaConversions.asScalaIterator(values).toIndexedSeq
seq.size
}
def persistRdd[A](rdd: A,
table: Table,
columnFamily: String,
rowNumber: Int): Unit = {
val row = Bytes.toBytes(String.valueOf(rowNumber))
val put = new Put(row)
val qualifier = "test_column"
put.addColumn(columnFamily.getBytes(),
qualifier.getBytes(),
String.valueOf(rdd).getBytes())
table.put(put)
}
def main(args: Array[String]): Unit = {
// create the context with a one second batch of data & uses all the CPU cores
val context = new StreamingContext("local[*]", "KafkaConsumer", Seconds(1))
// hostname:port for Kafka brokers
val kafkaParams = Map("metadata.broker.list" -> "192.168.56.22:9092")
// list of topics you want to listen from Kafka
val topics = List("employees").toSet
setupLogging()
// create a Kafka Stream, which will contain(topic, message) pairs
// we take a map(_._2) at the end in order to only get the messages which contain individual lines of data
val stream: DStream[String] = KafkaUtils
.createDirectStream[String, String, StringDecoder, StringDecoder](
context,
kafkaParams,
topics)
.map(_._2)
// debug print
stream.print()
// stream.foreachRDD(rdd => rdd.foreach(persistToHbase(_)))
context.checkpoint("C:/checkpoint/")
context.start()
context.awaitTermination()
}
}
and the build.sbt looks like this:
import sbt._
import Keys._
name := "kafka-consumer"
version := "0.1"
scalaVersion := "2.11.8"
lazy val sparkVersion = "2.4.4"
lazy val sparkStreamingKafkaVersion = "1.6.3"
lazy val hbaseVersion = "2.2.1"
lazy val hadoopVersion = "2.8.0"
lazy val hadoopCoreVersion = "1.2.1"
resolvers in Global ++= Seq(
"Sbt plugins" at "https://dl.bintray.com/sbt/sbt-plugin-releases"
)
lazy val commonSettings = Seq(
version := "0.1",
organization := "com.rares",
scalaVersion := "2.11.8",
test in assembly := {}
)
lazy val excludeJPountz =
ExclusionRule(organization = "net.jpountz.lz4", name = "lz4")
lazy val excludeHadoop =
ExclusionRule(organization = "org.apache.hadoop")
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % sparkVersion excludeAll (excludeJPountz, excludeHadoop),
"org.apache.spark" % "spark-streaming-kafka_2.11" % sparkStreamingKafkaVersion,
"org.apache.spark" % "spark-streaming_2.11" % sparkVersion excludeAll (excludeJPountz),
"org.apache.hadoop" % "hadoop-client" % hadoopVersion,
"org.apache.hbase" % "hbase-server" % hbaseVersion,
"org.apache.hbase" % "hbase-client" % hbaseVersion,
"org.apache.hbase" % "hbase-common" % hbaseVersion
)
//Fat jar
assemblyMergeStrategy in assembly := {
case PathList("org", "aopalliance", xs # _*) => MergeStrategy.last
case PathList("javax", "inject", xs # _*) => MergeStrategy.last
case PathList("net", "jpountz", xs # _*) => MergeStrategy.last
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case PathList("jetty-dir.css", xs # _*) => MergeStrategy.last
case PathList("org", "apache", xs # _*) => MergeStrategy.last
case PathList("com", "sun", xs # _*) => MergeStrategy.last
case PathList("hdfs-default.xml", xs # _*) => MergeStrategy.last
case PathList("javax", xs # _*) => MergeStrategy.last
case PathList("mapred-default.xml", xs # _*) => MergeStrategy.last
case PathList("core-default.xml", xs # _*) => MergeStrategy.last
case PathList("javax", "servlet", xs # _*) => MergeStrategy.last
// case "git.properties" => MergeStrategy.last
// case PathList("org", "apache", "jasper", xs # _*) => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
assemblyJarName in assembly := "kafka-consumer.jar"
Any advice will be deeply appreciated!!!
OK, so here is what helped me out. Add the Hadoop configuration to the Spark context as follows:
val hadoopConfiguration = context.sparkContext.hadoopConfiguration
hadoopConfiguration.set(
"fs.hdfs.impl",
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hadoopConfiguration.set(
"fs.file.impl",
classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
Works like a charm!
A big thanks to this: hadoop No FileSystem for scheme: file
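As a side note (not part of the fix above, just a sketch): the merge strategy in the build discards everything under META-INF, which also drops the META-INF/services/org.apache.hadoop.fs.FileSystem registrations that Hadoop uses to discover LocalFileSystem and DistributedFileSystem, and that is another common trigger for "No FileSystem for scheme: file" in assembly jars. One way to keep those service files while still discarding the rest of META-INF:

assemblyMergeStrategy in assembly := {
  // Merge service-loader registrations instead of discarding them,
  // so the Hadoop FileSystem implementations stay discoverable.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}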
Is there a way to use spark-xml (https://github.com/databricks/spark-xml) in a spark .net/c# job?
I was able to use the spark-xml data source from .NET.
Here is the test program:
using Microsoft.Spark.Sql;
namespace MySparkApp
{
class Program
{
static void Main(string[] args)
{
SparkSession spark = SparkSession
.Builder()
.AppName("spark-xml-example")
.GetOrCreate();
DataFrame df = spark.Read()
.Option("rowTag", "book")
.Format("xml")
.Load("books.xml");
df.Show();
df.Select("author", "_id")
.Write()
.Format("xml")
.Option("rootTag", "books")
.Option("rowTag", "book")
.Save("newbooks.xml");
spark.Stop();
}
}
}
Check out https://github.com/databricks/spark-xml, build an assembly jar using the 'sbt assembly' command, and copy the assembly jar to the dotnet project workspace.
Build project: dotnet build
Submit Spark job:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--jars scala-2.11/spark-xml-assembly-0.10.0.jar \
--master local bin/Debug/netcoreapp3.1/microsoft-spark-2.4.x-0.10.0.jar \
dotnet bin/Debug/netcoreapp3.1/sparkxml.dll
I am seeing issues where I slowly run out of Java heap on the master node. Below is a simple example I've created that just repeats itself 200 times. With the settings below, the master runs out of memory in about an hour with the following error:
16/12/15 17:55:46 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 97578 on executor id: 9 hostname: ip-xxx-xxx-xx-xx
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 20160"...
The Code:
import org.apache.spark.sql.functions._
import org.apache.spark._
object MemTest {
case class X(colval: Long, colname: Long, ID: Long)
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("MemTest")
val spark = new SparkContext(conf)
val sc = org.apache.spark.sql.SQLContext.getOrCreate(spark)
import sc.implicits._;
for( a <- 1 to 200)
{
var df = spark.parallelize((1 to 5000000).map(x => X(x.toLong, x.toLong % 10, x.toLong / 10 ))).toDF()
df = df.groupBy("ID").pivot("colname").agg(max("colval"))
df.count
}
spark.stop()
}
}
I'm running on AWS emr-5.1.0 using m4.xlarge instances (4 nodes + 1 master). Here are my Spark settings:
{
"Classification": "spark-defaults",
"Properties": {
"spark.dynamicAllocation.enabled": "false",
"spark.executor.instances": "16",
"spark.executor.memory": "2560m",
"spark.driver.memory": "768m",
"spark.executor.cores": "1"
}
},
{
"Classification": "spark",
"Properties": {
"maximizeResourceAllocation": "false"
}
},
I compile with sbt using
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.2")
and then run it using
spark-submit --class MemTest target/scala-2.11/simple-project_2.11-1.0.jar
Looking at memory with jmap -histo, I see java.lang.Long and scala.Tuple2 keep growing.
Are you sure the spark version installed on the cluster is 2.0.2?
Or if there are several Spark installations on your cluster, are you sure you're calling the correct (2.0.2) spark-submit?
(Unfortunately I cannot comment, which is why I posted this as an answer.)
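A quick way to check is to print the version from inside the job and compare it with what spark-submit --version reports on the machine you submit from. A minimal sketch, using the SparkContext from the code above (named spark there):

// Print the Spark version the driver is actually linked against at runtime.
println("Runtime Spark version: " + spark.version)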
I am trying to JAR a simple Scala application which makes use of spark-csv and Spark SQL to create a DataFrame from a CSV file stored in HDFS, and then run a simple query that returns the max and min of a specific column in the CSV file.
I get an error when I use the sbt command to create the JAR, which I will later curl to the job server's /jars folder and execute from a remote machine.
Code:
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object sparkSqlCSV extends SparkJob {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[4]").setAppName("sparkSqlCSV")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val config = ConfigFactory.parseString("")
val results = runJob(sc, config)
println("Result is " + results)
}
override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
SparkJobValid
}
override def runJob(sc: sqlContext, config: Config): Any = {
val value = "com.databricks.spark.csv"
val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
ControlDF.registerTempTable("Control")
val aggDF = sqlContext.sql("select max(DieX) from Control")
aggDF.collectAsList()
}
}
Error:
[hduser#ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
[info] Loading project definition from /usr/local/hadoop/spark-jobserver/project
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to root (in build file:/usr/local/hadoop/spark-jobserver/)
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 2 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 9 ms
[success] created output: /usr/local/hadoop/spark-jobserver/ashesh-jobs/target
[warn] Credentials file /home/hduser/.bintray/.credentials does not exist
[info] Updating {file:/usr/local/hadoop/spark-jobserver/}ashesh-jobs...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 5 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 1 ms
[success] created output: /usr/local/hadoop/spark-jobserver/job-server-api/target
[info] Compiling 2 Scala sources and 1 Java source to /usr/local/hadoop/spark-jobserver/ashesh-jobs/target/scala-2.10/classes...
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:8: object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.SQLContext
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:14: object sql is not a member of package org.apache.spark
[error] val sqlContext = new org.apache.spark.sql.SQLContext(sc)
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:25: not found: type sqlContext
[error] override def runJob(sc: sqlContext, config: Config): Any = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:21: not found: type sqlContext
[error] override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:27: not found: value sqlContext
[error] val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:29: not found: value sqlContext
[error] val aggDF = sqlContext.sql("select max(DieX) from Control")
[error] ^
[error] 6 errors found
[error] (ashesh-jobs/compile:compileIncremental) Compilation failed
[error] Total time: 10 s, completed May 26, 2016 4:42:52 PM
[hduser#ptfhadoop01v spark-jobserver]$
I guess the main issue is that the dependencies for spark-csv and spark-sql are missing, but I have no idea where to place the dependencies before compiling the code with sbt.
I am issuing the following command to package the application; the source code is placed under the "ashesh_jobs" directory:
[hduser#ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
I hope someone can help me resolve this issue. Could you tell me which file I should add the dependencies to, and in what format?
The following link has more information on creating other contexts: https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/contexts.md
You also need job-server-extras.
Add the library dependency in build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"