I'm running a simple Spark application that trains a Word2Vec ("word to vector") model. Here is my code (taken from the Spark website):
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Word2Vector")
    val sc = new SparkContext(conf)
    val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
    val word2vec = new Word2Vec()
    val model = word2vec.fit(input)
    val synonyms = model.findSynonyms("china", 40)
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    // Save and load model
    model.save(sc, "myModelPath")
  }
}
When I run it, I get the following error message:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://GXYDEVVM:8020/user/hadoop/YOUR_SPARK_HOME/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:442)
at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:47)
at SimpleApp.main(SimpleApp.java:13)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What is the problem? Where is the path /user/hadoop/YOUR_SPARK_HOME/README.md coming from?
This is probably related to your default Spark configuration.
Take a look (or use grep) in the conf directory of your Spark home directory. You should find a spark-env.sh file, which could contain a reference to the strange file. Note also that the trace ends in SimpleApp.java:13 and a count() call, neither of which appears in the Scala code you posted, so it is worth checking that the jar you submit was actually built from this source.
Spark is trying to load the file from HDFS, which is the usual setup when you run Spark on a cluster: your input and output must be reachable by the master and by the workers. If you use Spark locally, you have to configure the SparkContext with the setMaster method. Here is my version:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object SparkDemo {
  // Runs `job`, prints how long it took, and returns its result.
  def log[A](key: String)(job: => A): A = {
    val start = System.currentTimeMillis
    val output = job
    println("===> %s in %s seconds"
      .format(key, (System.currentTimeMillis - start) / 1000.0))
    output
  }

  def main(args: Array[String]): Unit = {
    val modelName = "w2vModel"
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("SparkDemo")
        .set("spark.executor.memory", "4G")
        .set("spark.driver.maxResultSize", "16G")
        .setMaster("spark://192.168.1.53:7077") // IP of the Spark master
        // .setMaster("local[2]") // does not work... workers lose contact with the master after 120s
    )
    // take a look into the target folder if you are unsure how the jar is named
    // one-liner to compile and run: sbt package && sbt run
    sc.addJar("./target/scala-2.10/sparkling_2.10-0.1.jar")
    val input = sc.textFile("./text8").map(line => line.split(" ").toSeq)
    val word2vec = new Word2Vec()
    val model = log("compute model") { word2vec.fit(input) }
    log("save model") { model.save(sc, modelName) }
    val synonyms = model.findSynonyms("china", 40)
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }
    val model2 = log("reload model") { Word2VecModel.load(sc, modelName) }
  }
}
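As for where the /user/hadoop/ prefix comes from: a path without a scheme is resolved against the default filesystem and the user's home directory there, which the error message shows to be hdfs://GXYDEVVM:8020/user/hadoop. The sketch below only imitates that resolution with java.net.URI; resolveAgainstHome is a hypothetical helper for illustration, not a Spark or Hadoop API.

```scala
import java.net.URI

object PathResolutionDemo {
  // Hypothetical helper: resolves a relative input path against a home
  // directory URI, imitating how a scheme-less path ends up under the
  // user's HDFS home directory.
  def resolveAgainstHome(homeDir: String, relativePath: String): String =
    new URI(homeDir).resolve(relativePath).toString

  def main(args: Array[String]): Unit = {
    // Filesystem and home directory as they appear in the error message.
    val home = "hdfs://GXYDEVVM:8020/user/hadoop/"
    println(resolveAgainstHome(home, "YOUR_SPARK_HOME/README.md"))
    println(resolveAgainstHome(home, "text8"))
  }
}
```

The same rule applies to the relative "text8" path in the question. To read from the local disk instead, an explicit URI such as sc.textFile("file:///path/to/text8") can be used (the path is a placeholder), provided the file exists at that location on every worker.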
Requirement: Read data from DynamoDB (hosted on AWS, not a local instance) via Spark, using Scala, from my local machine.
Understanding: Data can be read using emr-dynamodb-hadoop.jar when running on an EMR cluster.
Question:
Can we read data from DynamoDB (in the cloud, not local) using emr-dynamodb-hadoop.jar?
I do not want to use an EMR cluster. I want to access DynamoDB directly from Spark, using Scala code on my local machine.
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
scalacOptions := Seq("-target:jvm-1.8")

libraryDependencies ++= Seq(
  "software.amazon.awssdk" % "dynamodb" % "2.15.1",
  "org.apache.spark" %% "spark-core" % "2.4.1",
  "com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2.0",
  "org.apache.httpcomponents" % "httpclient" % "4.5"
)

dependencyOverrides ++= {
  Seq(
    "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1",
    "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7",
    "com.fasterxml.jackson.core" % "jackson-core" % "2.6.7"
  )
}
readDataFromDDB.scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

object readDataFromDDB {
  def main(args: Array[String]): Unit = {
    var sc: SparkContext = null
    try {
      val conf = new SparkConf().setAppName("DynamoDBApplication").setMaster("local")
      sc = new SparkContext(conf)
      val jobConf = getDynamoDbJobConf(sc, "Music", "TableNameForWrite")
      val tableData = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
      println(tableData.count())
    } catch {
      case e: Exception =>
        // print the full stack trace rather than the array reference
        e.printStackTrace()
    } finally {
      // sc is still null if creating the SparkContext failed
      if (sc != null) sc.stop()
    }
  }

  private def getDynamoDbJobConf(sc: SparkContext, tableNameForRead: String, tableNameForWrite: String): JobConf = {
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.servicename", "dynamodb")
    jobConf.set("dynamodb.input.tableName", tableNameForRead)
    jobConf.set("dynamodb.output.tableName", tableNameForWrite)
    jobConf.set("dynamodb.awsAccessKeyId", "*****************")
    jobConf.set("dynamodb.awsSecretAccessKey", "*********************")
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
    jobConf.set("dynamodb.regionid", "us-east-1")
    jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
    jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
    jobConf
  }
}
ERROR
java.lang.RuntimeException: Could not lookup table Music in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:116)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:57)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:153)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Socket not created by this factory
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:105)
... 20 more
Links already reviewed
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
read/write dynamo db from apache spark
Spark 2.2.0 - How to write/read DataFrame to DynamoDB
https://github.com/awslabs/emr-dynamodb-connector
This was solved by updating the following dependency versions:
"software.amazon.awssdk" % "dynamodb" % "2.15.31",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.14.0"
I'm playing around with Google Cloud Composer, trying to create a DAG that creates a Dataproc cluster, runs a simple Spark job, then tears down the cluster. I am trying to run the SparkPi example job.
I understand that when calling DataProcSparkOperator I can define only one of the main_jar and main_class properties. When I define main_class, the job fails with the error:
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
When I choose to define the main_jar property, the job fails with the error:
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
I'm at a bit of a loss as to how to resolve this, as I am kinda new to both Spark and DataProc.
My DAG:
import datetime as dt
from airflow import DAG, models
from airflow.contrib.operators import dataproc_operator as dpo
from airflow.utils import trigger_rule

MAIN_JAR = 'file:///usr/lib/spark/examples/jars/spark-examples.jar'
MAIN_CLASS = 'org.apache.spark.examples.SparkPi'
CLUSTER_NAME = 'quickspark-cluster-{{ ds_nodash }}'

yesterday = dt.datetime.combine(
    dt.datetime.today() - dt.timedelta(1),
    dt.datetime.min.time())

default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': dt.timedelta(seconds=30),
    'project_id': models.Variable.get('gcp_project')
}

with DAG('dataproc_spark_submit', schedule_interval='0 17 * * *',
         default_args=default_dag_args) as dag:

    create_dataproc_cluster = dpo.DataprocClusterCreateOperator(
        project_id=default_dag_args['project_id'],
        task_id='create_dataproc_cluster',
        cluster_name=CLUSTER_NAME,
        num_workers=2,
        zone=models.Variable.get('gce_zone')
    )

    run_spark_job = dpo.DataProcSparkOperator(
        task_id='run_spark_job',
        #main_jar=MAIN_JAR,
        main_class=MAIN_CLASS,
        cluster_name=CLUSTER_NAME
    )

    delete_dataproc_cluster = dpo.DataprocClusterDeleteOperator(
        project_id=default_dag_args['project_id'],
        task_id='delete_dataproc_cluster',
        cluster_name=CLUSTER_NAME,
        trigger_rule=trigger_rule.TriggerRule.ALL_DONE
    )

    create_dataproc_cluster >> run_spark_job >> delete_dataproc_cluster
I compared it with a successful job submitted through the CLI and saw that, even when the class was populating the "Main class or jar" field, the path to the JAR was specified under "Jar files".
Checking the operator, I noticed there is also a dataproc_spark_jars parameter, which is not mutually exclusive with main_class:
run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    dataproc_spark_jars=[MAIN_JAR],
    main_class=MAIN_CLASS,
    cluster_name=CLUSTER_NAME
)
Adding it did the trick.
I'm trying to write a Spark Structured Streaming (2.3) dataset to ScyllaDB (Cassandra).
My code to write the dataset:
def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = {
  ds
    .writeStream
    .format("cassandra.ScyllaSinkProvider")
    .outputMode(OutputMode.Append)
    .queryName("KafkaToCassandraStreamSinkProvider")
    .options(
      Map(
        "keyspace" -> namespace,
        "table" -> StreamProviderTableSink,
        "checkpointLocation" -> "/tmp/checkpoints"
      )
    )
    .start()
}
My ScyllaDB Streaming Sinks:
class ScyllaSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): ScyllaSink =
    new ScyllaSink(parameters)
}

class ScyllaSink(parameters: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit =
    data.write
      .cassandraFormat(
        parameters("table"),
        parameters("keyspace")
        //parameters("cluster")
      )
      .mode(SaveMode.Append)
      .save()
}
However, when I run this code, I receive an exception:
...
[error] +- StreamingExecutionRelation KafkaSource[Subscribe[transactions_load]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
[error] at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
[error] at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
[error] Caused by: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame;
[error] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[error] at org.apache.spark.sql.Dataset.write(Dataset.scala:3103)
[error] at cassandra.ScyllaSink.addBatch(CassandraDriver.scala:113)
[error] at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
...
I have seen a similar question, but that is for CosmosDB - Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
You could convert it to an RDD first and then write:
class ScyllaSink(parameters: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    val schema = data.schema
    // this ensures that the same query plan will be used
    val rdd: RDD[Row] = data.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row])
    }
    // write the RDD to Cassandra
  }
}
build.gradle
plugins {
id 'java'
id 'com.github.johnrengelman.shadow' version '1.2.3'
}
group 'com.hello.aggregation'
version '1.0-SNAPSHOT'
apply plugin: 'java'
apply plugin: 'scala'
apply plugin: 'idea'
sourceCompatibility = 1.8
targetCompatibility = 1.8
configurations {
provided
}
sourceSets {
main {
compileClasspath += configurations.provided
}
}
repositories {
mavenCentral()
}
dependencies {
compile "org.scala-lang:scala-library:$scalaVersion"
compile "org.scala-lang:scala-reflect:$scalaVersion"
compile "org.scala-lang:scala-compiler:$scalaVersion"
compile "org.apache.spark:spark-core_$scalaBase:$sparkVersion"
compile "org.apache.spark:spark-sql_$scalaBase:$sparkVersion"
compile "com.datastax.cassandra:cassandra-driver-core:$cassandraDriverVersion"
compile "com.datastax.spark:spark-cassandra-connector_$scalaBase:$connectorVersion"
compile "org.slf4j:slf4j-api:$slf4jVersion"
compile "mysql:mysql-connector-java:$mySqlConnectorVersion"
testCompile group: 'junit', name: 'junit', version: '4.12'
}
task run(type: JavaExec, dependsOn: classes) {
main = mainClassFile
classpath sourceSets.main.runtimeClasspath
classpath configurations.runtime
}
jar {
classifier = 'all'
manifest {
attributes 'Implementation-Title': title,
'Implementation-Version': version,
'Main-Class': mainClassFile
}
include{sourceSets.main.output.classesDir}
zip64 true
}
shadowJar {
classifier = 'shadow'
append 'reference.conf'
dependencies {
}
zip64 true
}
idea {
module {
// IntelliJ does not know about the standard idiom of provided as used in managing
// uber/shaded jar dependencies. Make it so!
scopes.PROVIDED.plus += [ configurations.provided ]
}
}
gradle.properties
version=1.0.0
scalaBase=2.11
scalaVersion=2.11.4
slf4jVersion=1.7.25
sparkVersion=1.6.3
connectorVersion=1.6.7
cassandraDriverVersion=3.0.7
mySqlConnectorVersion=5.1.37
exception:
17/06/19 16:03:54 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef;
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$7.apply(CassandraConnector.scala:150)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$7.apply(CassandraConnector.scala:150)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:80)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:107)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:118)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:330)
at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:350)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:230)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$distinct$2.apply(RDD.scala:359)
at org.apache.spark.rdd.RDD$$anonfun$distinct$2.apply(RDD.scala:359)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.distinct(RDD.scala:358)
at com.achoo.scala.streambright.SimpleDailyRun$.delayedEndpoint$com$achoo$scala$streambright$SimpleDailyRun$1(SimpleDailyRun.scala:30)
at com.achoo.scala.streambright.SimpleDailyRun$delayedInit$body.apply(SimpleDailyRun.scala:14)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
Code:
package com.streambright

import java.sql.{Connection, DriverManager}
import com.mysql.jdbc.Driver
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector.toSparkContextFunctions
import org.apache.spark.sql.cassandra.CassandraSQLContext

object SimpleDailyRun extends App {
  DriverManager.registerDriver(new Driver())
  val config = new SparkConf(true).setAppName("Simple Daily Run")
  val sc = SparkContext.getOrCreate(config)
  val cc = new CassandraSQLContext(sc)
  cc.setKeyspace("achoo")
  val conn = DriverManager.getConnection("jdbc:mysql://10.175.190.95/db?useUnicode=yes&characterEncoding=UTF-8&user=user&password=pass")
  val mySqlJdbcRDD = new JdbcRDD(sc, () => conn,
    "SELECT b.project_id,a.keyword FROM keyword a " +
      "JOIN project_keyword b ON a.id = b.keyword_id LIMIT ?, ?",
    0, 100000000, 1, r => (r.getInt("project_id"), r.getString("keyword")))
  val cassandraRDD = sc.cassandraTable("hello", "instagram_keyword_analytic")
    .select("keyword", "relativepath")
    .as((_: String, _: String))
    .distinct()
  mySqlJdbcRDD.saveAsTextFile("/data/MySQL_projectid_keywords_" + System.currentTimeMillis() + ".txt")
  cassandraRDD.saveAsTextFile("/data/MySQL_projectid_keywords_" + System.currentTimeMillis() + ".txt")
}
Cassandra version: cassandra21-2.1.15-1 (DataStax)
Spark version: 1.6.3
Does anybody know how to fix this?
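The fix for this problem is to use Scala 2.10 instead of 2.11. scala.runtime.ObjectRef.zero() only exists from Scala 2.11 on, so this NoSuchMethodError means that code compiled for 2.11 (here the _2.11 connector artifact selected by scalaBase) is running on a Scala 2.10 runtime, which is what the default Spark 1.6.x builds ship with.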
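One way to confirm the mismatch is to print the Scala version the job actually runs on and compare it with the suffix of the connector artifact (the scalaBase value in gradle.properties). The following is a minimal sketch; ScalaVersionCheck and binaryVersion are hypothetical names introduced only for this illustration.

```scala
object ScalaVersionCheck {
  // Reduces a full version number such as "2.10.6" to its binary version "2.10".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the scala-library that is on the classpath at run time.
    val runtime = scala.util.Properties.versionNumberString
    println(s"Scala runtime: $runtime (binary version ${binaryVersion(runtime)})")
  }
}
```

If the binary version printed on the cluster differs from the _2.xx suffix of the Spark and connector artifacts in the build, the build needs to be changed to match the runtime.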
Goal is to create the following on a local instance of Spark JobServer:
object foo extends SparkJob with NamedRddSupport
Question: How can I fix the following error which happens on every job:
{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka://JobServer/user/context-supervisor/439b2467-spark.jobserver.genderPrediction#884262439]] after [10000 ms]",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": ["akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)", "akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)", "scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)", "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)", "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)", "akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)", "akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)", "akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)", "java.lang.Thread.run(Thread.java:745)"]
  }
}
A more detailed error description by the Spark JobServer:
job-server[ERROR] Exception in thread "pool-100-thread-1" java.lang.AbstractMethodError: spark.jobserver.genderPrediction$.namedObjectsPrivate()Ljava/util/concurrent/atomic/AtomicReference;
job-server[ERROR] at spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:248)
job-server[ERROR] at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
job-server[ERROR] at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
job-server[ERROR] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
job-server[ERROR] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
job-server[ERROR] at java.lang.Thread.run(Thread.java:745)
In case somebody wants to see the code:
package spark.jobserver

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkContext}
import com.typesafe.config.{Config, ConfigFactory}
import collection.JavaConversions._
import scala.io.Source

object genderPrediction extends SparkJob with NamedRddSupport
{
  // Main function
  def main(args: scala.Array[String])
  {
    val sc = new SparkContext()
    sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
    val config = ConfigFactory.parseString("")
    val results = runJob(sc, config)
  }

  def validate(sc: SparkContext, config: Config): SparkJobValidation = {SparkJobValid}

  def runJob(sc: SparkContext, config: Config): Any =
  {
    return "ok";
  }
}
Version information:
Spark is 1.5.0 - SparkJobServer is latest version
Thank you all very much in advance!
Adding more explanation to @noorul's answer:
It seems like you compiled the code with an old version of SJS and you are running it with the latest.
NamedObjects were recently added. You are getting AbstractMethodError because your server expects NamedObjects support and you didn't compile the code with that.
Also: you don't need the main method there since it won't be executed by SJS.
Ensure that the compile-time and run-time versions of your dependencies are the same.
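As a rough sketch of what matching versions can look like, assuming the job is built with sbt (the question does not say which build tool is used): the job-server-api dependency is pinned to the release the server runs and marked provided, so the server's own classes are used at run time. The jobServerVersion value is a placeholder, not a recommendation.

```scala
// build.sbt (fragment). Assumption: the job is built with sbt.
// Placeholder: replace with the exact version of the Spark JobServer you deploy to.
val jobServerVersion = "x.y.z"

libraryDependencies ++= Seq(
  // "provided": compile against the API, but use the server's copy at run time
  "spark.jobserver" %% "job-server-api" % jobServerVersion % "provided",
  // Spark version taken from the question
  "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
)
```

After changing the version, rebuild the job jar and upload it again, so the server does not keep running the jar that was compiled against the older API.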