Cassandra Spark job submission

I am a relative newbie to Spark/Cassandra, so I have a basic question. I have compiled an uber jar and loaded it onto my Spark/Cassandra server. Now I am stuck: how do I run it in the Cassandra (DSE) environment? I know the submit command is "dse spark-submit", but when I try to run "dse spark-submit" I get a NullPointerException.
Here is the full output:
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The program code is very basic and has been proven to work in the Spark shell:
package xxx.seaoxxxx

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

class test {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "xx.xxx.xx.xx")
      .setAppName("Seasonality")
    val sc = new SparkContext("spark://xx.xxx.xx.xx:7077", "Season", conf)
    val ks = "loadset"
    val incf = "period"
    val rdd = sc.cassandraTable(ks, incf)
    rdd.count
    println("done with test")
    sc.stop()
  }
}
The spark-submit code is as follows:
dse spark-submit \
--class xxx.seaoxxxx.test \
--master spark://xxx.xx.x.xxx:7077 \
/home/ubuntu/spark/Seasonality_v6-assembly-1.0.1.jar 100
Thanks,
Eric

The current release, DataStax Enterprise 4.5, supports dse spark-class instead of dse spark-submit: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkStart.html?scroll=sparkStart__spkShrkLaunch
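A side note on the submission error itself, independent of which launcher ends up being used: the snippet above declares the entry point as a class. spark-submit (and dse spark-submit) resolves the --class argument and invokes its static main method via reflection, and in Scala a static main only comes from an object, so a class-based entry point is one plausible cause of a NullPointerException inside SparkSubmit.launch. A minimal sketch of the entry point rewritten as an object (host placeholders kept from the question):

package xxx.seaoxxxx

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// An object compiles to a class with a static main method, which is what
// spark-submit invokes reflectively; a plain class does not provide one.
object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "xx.xxx.xx.xx")
      .setAppName("Seasonality")
    val sc = new SparkContext("spark://xx.xxx.xx.xx:7077", "Season", conf)

    val rdd = sc.cassandraTable("loadset", "period")
    println(s"row count: ${rdd.count}")
    println("done with test")
    sc.stop()
  }
}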

Related

Spark-Atlas-Connector NullPointerExceptions during startup

I'm trying to run a job I wrote to test Spark integration with Atlas.
It is a simple job that reads from one Kafka topic and writes to another:
val sparkConf = new SparkConf()
  .setAppName("atlas-test")
  .setMaster("local[2]")
  .set("spark.extraListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  .set("spark.sql.queryExecutionListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  .set("spark.sql.streaming.streamingQueryListeners", "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker")

val spark = SparkSession.builder()
  .config(sparkConf)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

val df = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", BROKER_SERVERS)
  .option("subscribe", "foobar2")
  .option("startingOffset", "earliest")
  .option("kafka.atlas.cluster.name", clusterName)
  .load()

println("---------------------------------------------")
df.printSchema()
val dfs = df.selectExpr("CAST(key as STRING)", "CAST(value AS STRING)").as[(String, String)]
dfs.show()
println("---------------------------------------------")

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", BROKER_SERVERS)
  .option("topic", "foobar-out")
  .option("kafka.atlas.cluster.name", clusterName)
  .save()
Everything seems straightforward. But when I run the job in my IDE (IntelliJ), I get this exception almost every time:
19/08/12 17:00:08 WARN SparkExecutionPlanProcessor: Caught exception during parsing event
java.lang.NullPointerException
at org.apache.spark.sql.internal.SQLConf$$anonfun$14.apply(SQLConf.scala:133)
at org.apache.spark.sql.internal.SQLConf$$anonfun$14.apply(SQLConf.scala:133)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:133)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.simpleString(SaveIntoDataSourceCommand.scala:52)
at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:177)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548)
at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$4.apply(QueryExecution.scala:197)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$4.apply(QueryExecution.scala:197)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:99)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:197)
at com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$getPlanInfo(CommandsHarvester.scala:214)
at com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$makeProcessEntities(CommandsHarvester.scala:222)
at com.hortonworks.spark.atlas.sql.CommandsHarvester$SaveIntoDataSourceHarvester$.harvest(CommandsHarvester.scala:183)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:108)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:71)
at scala.Option.foreach(Option.scala:257)
at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:71)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:38)
I'm using Spark 2.4.0 with Scala 2.11.
I'm also unsure what result to expect: after this job runs, should something appear in my (local) Atlas instance? Sometimes the job completes successfully, but nothing shows up in Atlas.

Cassandra Connector fails when run under Spark 2.3 on Kubernetes

I'm trying to use the connector, which I've used a bunch of times in the past super successfully, with the new Spark 2.3 native Kubernetes support and am running into a lot of trouble.
I have a super simple job that looks like this:
package io.rhom

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
import com.datastax.spark.connector.rdd.ReadConf

/** Backs up the Cassandra locations table to Azure blob storage as Avro. */
object BackupLocations {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("BackupLocations")
      .getOrCreate()

    spark.sparkContext.hadoopConfiguration.set(
      "fs.defaultFS",
      "wasb://<snip>"
    )
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.rhomlocations.blob.core.windows.net",
      "<snip>"
    )

    val df = spark
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "locations", "keyspace" -> "test"))
      .load()

    df.write
      .mode("overwrite")
      .format("com.databricks.spark.avro")
      .save("wasb://<snip>")

    spark.stop()
  }
}
which I'm building under SBT with Scala 2.11 and packaging with a Dockerfile that looks like this:
FROM timfpark/spark:20180305
COPY core-site.xml /opt/spark/conf
RUN mkdir -p /opt/spark/jars
COPY target/scala-2.11/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar /opt/spark/jars
and then executing with:
bin/spark-submit --master k8s://blue-rhom-io.eastus2.cloudapp.azure.com:443 \
--deploy-mode cluster \
--name backupLocations \
--class io.rhom.BackupLocations \
--conf spark.executor.instances=2 \
--conf spark.cassandra.connection.host=10.1.0.10 \
--conf spark.kubernetes.container.image=timfpark/rhom-backup-locations:20180306v12 \
--jars https://dl.bintray.com/spark-packages/maven/datastax/spark-cassandra-connector/2.0.3-s_2.11/spark-cassandra-connector-2.0.3-s_2.11.jar,http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.2/hadoop-azure-2.7.2.jar,http://central.maven.org/maven2/com/microsoft/azure/azure-storage/3.1.0/azure-storage-3.1.0.jar,http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
local:///opt/spark/jars/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar
all of this works except for the Cassandra connection piece, which eventually fails with:
2018-03-07 01:19:38 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 10.4.0.46, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Exception during preparation of SELECT "user_id", "timestamp", "accuracy", "altitude", "altitude_accuracy", "course", "features", "latitude", "longitude", "source", "speed" FROM "rhom"."locations" WHERE token("user_id") > ? AND token("user_id") <= ? ALLOW FILTERING: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:323)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:339)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
at org.apache.spark.sql.catalyst.ReflectionLock$.<init>(ReflectionLock.scala:5)
at org.apache.spark.sql.catalyst.ReflectionLock$.<clinit>(ReflectionLock.scala)
at com.datastax.spark.connector.types.TypeConverter$.<init>(TypeConverter.scala:73)
at com.datastax.spark.connector.types.TypeConverter$.<clinit>(TypeConverter.scala)
at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:50)
at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:46)
at com.datastax.spark.connector.types.ColumnType$.converterToCassandra(ColumnType.scala:231)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:312)
... 23 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.package$ScalaReflectionLock$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 41 more
2018-03-07 01:19:38 INFO TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 3, 10.4.0.46, executor 1, partition 0, ANY, 9486 bytes)
I've tried everything I can possibly think of to resolve this - anyone have any ideas? Is this possibly caused by another unrelated issue?
It turns out that version 2.0.7 of the DataStax Cassandra Connector does not currently support Spark 2.3. I opened a JIRA ticket on DataStax's site for this, and hopefully it will be addressed soon.
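For reference, the incompatibility shows up as the NoClassDefFoundError above because org.apache.spark.sql.catalyst.package$ScalaReflectionLock$ no longer exists in Spark 2.3, while the 2.0.x connector still references it. Once a connector release built for Spark 2.3 is available, the fix should amount to swapping the dependency (or --jars) coordinate. A hedged build.sbt sketch; the version numbers are assumptions and should be checked against the connector's compatibility table:

// build.sbt (sketch): pair Spark 2.3 with a connector line that targets it.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"                 % "2.3.0" % "provided",
  // 2.3.x is the connector line that adds Spark 2.3 support; verify the exact release.
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.3.0",
  "com.databricks"     %% "spark-avro"                % "4.0.0"
)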

Phoenix Spark Plugin

Is the plugin updated for Spark 2.0?
I can't use the plugin:
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "web_stat")
  .option("zkUrl", "localhost:2181")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .load()
ERROR:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
Connecting to Phoenix over plain JDBC works fine. But when I use the Spark JDBC data source instead, I get this:
val df = spark.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:localhost:2181")
  .option("dbtable", "web_stat")
  .load()
ERROR:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:167)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:345)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at org.apache.spark.sql.phoenix.SparkPhoenixExample$.main(SparkPhoenixExample.scala:65)
Spark 2.0 is not yet working with Phoenix: in Spark 2.x, DataFrame became a type alias for Dataset[Row], so a plugin compiled against the Spark 1.x API fails with NoClassDefFoundError: org/apache/spark/sql/DataFrame. See this JIRA for a patch: https://issues.apache.org/jira/browse/PHOENIX-3333

Can't run Cassandra on Docker with Spark

I have a Zeppelin notebook running on Docker. I have the following code using Cassandra:
import org.apache.spark.sql.cassandra._
val cqlContext = new CassandraSQLContext(sc)
cqlContext.sql("select * from demo.table").collect.foreach(println)
However, I am getting this error:
import org.apache.spark.sql.cassandra._
cqlContext: org.apache.spark.sql.cassandra.CassandraSQLContext = org.apache.spark.sql.cassandra.CassandraSQLContext@395e28a8
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Cannot build a cluster without contact points
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2199)
at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
at org.apache.spark.sql.cassandra.CassandraCatalog.lookupRelation(CassandraCatalog.scala:28)
at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(CassandraSQLContext.scala:219)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$2.lookupRelation(CassandraSQLContext.scala:219)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:138)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:137)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC.<init>(<console>:55)
at $iwC.<init>(<console>:57)
at <init>(<console>:59)
at .<init>(<console>:63)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at com.nflabs.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:541)
at com.nflabs.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:517)
at com.nflabs.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:510)
at com.nflabs.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:40)
at com.nflabs.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:76)
at com.nflabs.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:246)
at com.nflabs.zeppelin.scheduler.Job.run(Job.java:152)
at com.nflabs.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:101)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Cannot build a cluster without contact points
at com.datastax.driver.core.Cluster.checkNotEmpty(Cluster.java:116)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:108)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:177)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:1109)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:78)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:167)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:162)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:162)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:73)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:99)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:173)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:22)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:19)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
... 92 more
From the Docker command line I ran docker pull cassandra, but the issue still persists.
What should I do to be able to use Cassandra?
For Spark to connect to a Cassandra cluster, you have to provide at least one node of the Cassandra cluster as a contact point in the Spark conf, as follows:
conf.set("spark.cassandra.connection.host", "127.0.0.1")
I was having the same issue ("Cannot build a cluster without contact points") and managed to solve it by setting the SparkConf() as follows:
conf = SparkConf() \
    .setAppName("MyApp") \
    .setMaster("spark://127.0.0.1:7077") \
    .set("spark.cassandra.connection.host", "127.0.0.1")
So, a basic Spark < 2.0 program - in Python - that connects with a local Cassandra should look like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf() \
    .setAppName("PySpark Cassandra Test") \
    .setMaster("spark://127.0.0.1:7077") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

sc = SparkContext('local', conf=conf)
sql = SQLContext(sc)

test = sql.read.format("org.apache.spark.sql.cassandra") \
    .load(keyspace="mykeyspace", table="mytable")
test.collect()
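Since the snippet in the question is Scala, here is the same idea in Scala, as a sketch rather than a Zeppelin-specific fix: the essential part is that spark.cassandra.connection.host (127.0.0.1 below is a placeholder) points at a reachable Cassandra node on the SparkConf before the context behind the CassandraSQLContext is created. In Zeppelin itself, where sc is created for you, the same property would go into the Spark interpreter settings instead.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Without spark.cassandra.connection.host the connector has no node to
// bootstrap from, which is exactly the "Cannot build a cluster without
// contact points" error above.
val conf = new SparkConf(true)
  .setAppName("Cassandra SQL Test")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // address of a Cassandra node

val sc = new SparkContext(conf)
val cqlContext = new CassandraSQLContext(sc)

cqlContext.sql("select * from demo.table").collect.foreach(println)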

Issue with Zeppelin on Spark-Cassandra system: ClassNotFoundException

I have recently started working with Zeppelin on top of a Spark-Cassandra cluster (master + 3 workers) to run simple machine learning algorithms using the MLlib library.
Here are the libraries that I loaded into Zeppelin:
%dep
z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M1")
z.load("org.apache.spark:spark-core_2.10:1.4.1")
z.load("com.datastax.cassandra:cassandra-driver-core:2.1.3")
z.load("org.apache.thrift:libthrift:0.9.2")
z.load("org.apache.spark:spark-mllib_2.10:1.4.0")
z.load("cassandra-clientutil-2.1.3.jar")
z.load("joda-time-2.3.jar")
I've tried to implement a script for linear regression, but when I run it I get the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.xxx.xxx.xxx): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:66)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
...
The confusing part is that the script runs without problems when launched with spark-submit.
Here is some of the code that I was trying to execute:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LinearRegressionModel, LabeledPoint}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

sc.stop()
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "xxx.xxx.xxx.xxx")
  .setMaster("spark://xxx.xxx.xxx.xxx:7077")
  .setAppName("DEMONSTRATION")
val sc = new SparkContext(conf)

case class Fact(numdoc: String, numl: String, year: String, creator: Double, date: Double, day: Double, user: Double, workingday: Double, total: String)

val data = sc.textFile("~/Input/Data.csv")
val parsed = data.filter(!_.isEmpty).map { row =>
  val splitted = row.split(",")
  val Array(nd, nl, yr) = splitted.slice(0, 3)
  val Array(cr, dt, wd, us, wod) = splitted.slice(3, 8).map(_.toDouble)
  Fact(nd, nl, yr, cr, dt, wd, us, wod, splitted(8))
}

val class2id = parsed.map(_.total.toDouble).distinct.collect.zipWithIndex.map { case (k, v) => (k, v.toDouble) }.toMap
val id2class = class2id.map(_.swap)

val parsedData = parsed.map { i =>
  LabeledPoint(class2id(i.total.toDouble), Vectors.dense(i.creator, i.date, i.day, i.workingday))
}
val model: LinearRegressionModel = LinearRegressionWithSGD.train(parsedData, 3)
Thank you in advance!
I finally found a solution!
I should not have stopped the SparkContext at the beginning and created a new one. But in that case I could not reach Cassandra on a remote machine, because by default Zeppelin uses the address of the machine it is installed on as the Cassandra host. So I installed a new Cassandra instance on that machine, added it to my initial cluster, and the problem was solved.
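A minimal sketch of what the working variant looks like; the key change is reusing Zeppelin's injected context, and the Cassandra host still has to be reachable from wherever Spark runs, per the note above:

// Zeppelin already injects a SparkContext as `sc`. Stopping it and creating a
// new one in the notebook is what tends to produce the ClassNotFoundException
// on the workers, since the replacement context typically loses the REPL
// class-server setting that ships the notebook's anonymous closure classes
// ($iwC...$$anonfun$1) to the executors.
// So: keep the injected sc and drop the sc.stop() / new SparkContext lines.
import com.datastax.spark.connector._
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("~/Input/Data.csv")  // `sc` is the interpreter-provided context
// ... the same parsing, LabeledPoint construction and
// LinearRegressionWithSGD.train(parsedData, 3) call as in the question ...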
