How to connect S3 to PySpark locally (org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3") - apache-spark

Spark version = 3.2.1
Hadoop version = 3.3.1
I have followed all the posts on Stack Overflow but couldn't get it to run. I am new to Spark and am trying to read a JSON file.
On my local Mac, I have installed Homebrew and installed PySpark. I have just downloaded the JARs; do I need to keep them somewhere?
The JARs downloaded are: hadoop-aws-3.3.1 and aws-java-sdk-bundle-1.12.172.
I have kept these under /usr/local/Cellar/apache-spark/3.2.1/libexec/jars
# /opt/python/latest/bin/pip3 list
import os
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

access_id = "A*"
access_key = "C*"

## set Spark properties
conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", access_id)
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df = spark.read.json("s3://pt/raw/Deal_20220114.json")
df.show()
Error:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
How am I running it locally?
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 test.py
Error :
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:747)
at scala.collection.immutable.List.map(List.scala:293)
at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:745)
at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:577)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:405)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
22/03/06 19:31:36 INFO SparkContext: Invoking stop() from shutdown hook
22/03/06 19:31:36 INFO SparkUI: Stopped Spark web UI at http://adityakxc5zmd6m.attlocal.net:4041
22/03/06 19:31:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/06 19:31:36 INFO MemoryStore: MemoryStore cleared
22/03/06 19:31:36 INFO BlockManager: BlockManager stopped
22/03/06 19:31:36 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/06 19:31:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/06 19:31:37 INFO SparkContext: Successfully stopped SparkContext
22/03/06 19:31:37 INFO ShutdownHookManager: Shutdown hook called
22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-1e346a99-5d6f-498d-9bc4-ce8a3f951718
22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-f2727124-4f4b-4ee3-a8c0-607e207a3a98/pyspark-b96bf92a-84e8-409e-b9df-48f303c57b70
22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-f2727124-4f4b-4ee3-a8c0-607e207a3a98

No FileSystem for scheme "s3"
You haven't configured fs.s3.impl, so Spark doesn't know what to do with file paths starting with s3://.
Using fs.s3a.impl is recommended instead, and you then access files with s3a:// paths.
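As a minimal sketch of that setup (assuming the hadoop-aws and aws-java-sdk-bundle JARs matching your Hadoop build are available; the credentials are placeholders):
from pyspark.sql import SparkSession

# Sketch only: configure S3A once on the builder and read via an s3a:// path.
# The hadoop-aws version should match the Hadoop build (3.3.1 here).
access_id = "YOUR_ACCESS_KEY_ID"
access_key = "YOUR_SECRET_ACCESS_KEY"

spark = (
    SparkSession.builder
    .appName("s3a-read")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.access.key", access_id)
    .config("spark.hadoop.fs.s3a.secret.key", access_key)
    .getOrCreate()
)

df = spark.read.json("s3a://pt/raw/Deal_20220114.json")  # note s3a://, not s3://
df.show()
Because of the spark.hadoop. prefix these settings land in the Hadoop configuration, so they can also be passed as --conf options on the spark-submit command line instead of in code.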

Related

How to configure a custom Spark Plugin in Databricks?

How to properly configure Spark plugin and the jar containing the Spark Plugin class in Databricks?
I created the following Spark 3 Plugin class in Scala,
CustomExecSparkPlugin.scala:
package example

import org.apache.spark.api.plugin.{SparkPlugin, DriverPlugin, ExecutorPlugin}

class CustomExecSparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = {
    new DriverPlugin() {
      override def shutdown(): Unit = {
        // custom code
      }
    }
  }

  override def executorPlugin(): ExecutorPlugin = {
    new ExecutorPlugin() {
      override def shutdown(): Unit = {
        // custom code
      }
    }
  }
}
I have packaged it into a JAR and uploaded it to DBFS, and during DBR 7.3 (Spark 3.0.1, Scala 2.12) cluster creation I set the following Spark configs (Advanced Options):
spark.plugins com.example.CustomExecSparkPlugin
spark.driver.extraClassPath /dbfs/path/to/jar
spark.executor.extraClassPath /dbfs/path/to/jar
However, the cluster creation fails with Exception: com.example.CustomExecSparkPlugin not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@622d7e4
Driver log4j logs:
21/11/01 13:33:01 ERROR SparkContext: Error initializing SparkContext.
java.lang.ClassNotFoundException: com.example.CustomExecSparkPlugin not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader#622d7e4
at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:226)
at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:3006)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:3004)
at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:160)
at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:146)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:591)
at com.databricks.backend.daemon.driver.DatabricksILoop$.$anonfun$initializeSharedDriverContext$1(DatabricksILoop.scala:347)
at com.databricks.backend.daemon.driver.ClassLoaders$.withContextClassLoader(ClassLoaders.scala:29)
at com.databricks.backend.daemon.driver.DatabricksILoop$.initializeSharedDriverContext(DatabricksILoop.scala:347)
at com.databricks.backend.daemon.driver.DatabricksILoop$.getOrCreateSharedDriverContext(DatabricksILoop.scala:277)
at com.databricks.backend.daemon.driver.DriverCorral.com$databricks$backend$daemon$driver$DriverCorral$$driverContext(DriverCorral.scala:179)
at com.databricks.backend.daemon.driver.DriverCorral.<init>(DriverCorral.scala:216)
at com.databricks.backend.daemon.driver.DriverDaemon.<init>(DriverDaemon.scala:39)
at com.databricks.backend.daemon.driver.DriverDaemon$.create(DriverDaemon.scala:211)
at com.databricks.backend.daemon.driver.DriverDaemon$.wrappedMain(DriverDaemon.scala:216)
at com.databricks.DatabricksMain.$anonfun$main$1(DatabricksMain.scala:106)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.DatabricksMain.$anonfun$withStartupProfilingData$1(DatabricksMain.scala:321)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:431)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.DatabricksMain.withAttributionContext(DatabricksMain.scala:74)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at com.databricks.DatabricksMain.withAttributionTags(DatabricksMain.scala:74)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:412)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:338)
at com.databricks.DatabricksMain.recordOperation(DatabricksMain.scala:74)
at com.databricks.DatabricksMain.withStartupProfilingData(DatabricksMain.scala:321)
at com.databricks.DatabricksMain.main(DatabricksMain.scala:105)
at com.databricks.backend.daemon.driver.DriverDaemon.main(DriverDaemon.scala)
Caused by: java.lang.ClassNotFoundException: com.example.CustomExecSparkPlugin
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:112)
... 43 more
21/11/01 13:33:02 INFO AbstractConnector: Stopped Spark#b6bccb4{HTTP/1.1,[http/1.1]}{10.88.234.70:40001}
21/11/01 13:33:02 INFO SparkUI: Stopped Spark web UI at http://10.88.234.70:40001
21/11/01 13:33:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/11/01 13:33:02 INFO MemoryStore: MemoryStore cleared
21/11/01 13:33:02 INFO BlockManager: BlockManager stopped
21/11/01 13:33:02 INFO BlockManagerMaster: BlockManagerMaster stopped
21/11/01 13:33:02 WARN MetricsSystem: Stopping a MetricsSystem that is not running
21/11/01 13:33:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/11/01 13:33:02 INFO SparkContext: Successfully stopped SparkContext
You might consider adding this as an init script instead. Init scripts give you an opportunity to add JARs to the cluster before Spark even begins, which is probably what the Spark plugin is expecting.
Upload your JAR to DBFS, somewhere like dbfs:/databricks/plugins.
Create and upload a bash script like the one below to the same place.
Create/edit a cluster with the init script specified.
#!/bin/bash
STAGE_DIR="/dbfs/databricks/plugins"
echo "BEGIN: Upload Spark Plugins"
cp -f $STAGE_DIR/*.jar /mnt/driver-daemon/jars || { echo "Error copying Spark Plugin library file"; exit 1;}
echo "END: Upload Spark Plugin JARs"
echo "BEGIN: Modify Spark config settings"
cat << 'EOF' > /databricks/driver/conf/spark-plugin-driver-defaults.conf
[driver] {
"spark.plugins" = "com.example.CustomExecSparkPlugin"
}
EOF
echo "END: Modify Spark config settings"
I believe copying the jar to /mnt/driver-daemon/jars will make Spark aware of the jar before Spark fully initializes (doc). I'm less certain it will make it to the executors though :(

SparkContext: Error initializing SparkContext While Running Spark Job

I'm writing a Spark program that loads data from Elasticsearch to HDFS, but I am getting an "Error initializing SparkContext" error while running the job. The error occurs while creating the Spark session.
Hadoop: 3.2.1
Spark: 2.4.4
Elasticsearch Spark (for Spark 2.X): 7.5.1
EMR: 6.0.0
Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession \
    .builder \
    .getOrCreate()
Error Log:
20/10/22 10:09:12 ERROR SparkContext: Error initializing SparkContext.
java.util.ServiceConfigurationError: org.apache.spark.deploy.yarn.security.ServiceCredentialProvider: Provider org.elasticsearch.spark.deploy.yarn.security.EsServiceCredentialProvider could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableLike.to(TraversableLike.scala:678)
at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
at scala.collection.AbstractTraversable.to(Traversable.scala:108)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.loadCredentialProviders(YARNHadoopDelegationTokenManager.scala:82)
at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.getCredentialProviders(YARNHadoopDelegationTokenManager.scala:73)
at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.<init>(YARNHadoopDelegationTokenManager.scala:46)
at org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:308)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1013)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:178)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/security/ServiceCredentialProvider$class
at org.elasticsearch.spark.deploy.yarn.security.EsServiceCredentialProvider.<init>(EsServiceCredentialProvider.scala:63)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 40 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.security.ServiceCredentialProvider$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 47 more
20/10/22 10:09:12 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-1-155.us-east-2.test:4040
20/10/22 10:09:12 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
20/10/22 10:09:12 INFO YarnClientSchedulerBackend: Stopped
20/10/22 10:09:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/10/22 10:09:12 INFO MemoryStore: MemoryStore cleared
20/10/22 10:09:12 INFO BlockManager: BlockManager stopped
20/10/22 10:09:12 INFO BlockManagerMaster: BlockManagerMaster stopped
20/10/22 10:09:12 WARN MetricsSystem: Stopping a MetricsSystem that is not running
20/10/22 10:09:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/10/22 10:09:12 INFO SparkContext: Successfully stopped SparkContext
20/10/22 10:09:13 INFO ShutdownHookManager: Shutdown hook called
20/10/22 10:09:13 INFO ShutdownHookManager: Deleting directory /vol1/tmp/spark-b39bb8cc-5bc7-4721-89bd-8bd62b9e527e
20/10/22 10:09:13 INFO ShutdownHookManager: Deleting directory /vol1/tmp/spark-d94995f0-05b6-476f-935e-8ba501acbed3
at com.company.utils.ResourceScriptUtils.executeScript(ResourceScriptUtils.java:114)
at com.company.utils.ResourceScriptUtils.executeScript(ResourceScriptUtils.java:135)
at com.company.loader.impl.realTimeProcessing.RealTimeEsLoader.processJob(RealTimeEsLoader.java:232)
at com.company.loader.App.main(App.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:153)
at com.company.multijob.MultiJob$HadoopJob.call(MultiJob.java:50)
at com.company.multijob.MultiJob$HadoopJob.call(MultiJob.java:38)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Could someone please help? Thanks.
This is due to the Spark YARN JAR missing from your Spark application.
Add the below to your pom.xml if you are using Maven.
Under dependencies:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.11</artifactId>
  <version>2.4.7</version>
</dependency>
Under artifactItems:
<artifactItem>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.11</artifactId>
  <version>2.4.7</version>
  <type>jar</type>
  <overWrite>false</overWrite>
  <outputDirectory>${project.build.directory}/classes/</outputDirectory>
  <destFileName>optional-new-name.jar</destFileName>
</artifactItem>
Note: change your Scala and Spark versions accordingly; here I'm using Spark 2.4.7 and Scala 2.11.

How to write to HBase in Azure Databricks?

I'm trying to build a lambda architecture using Kafka-Spark-HBase. I'm using the Azure cloud, and the components are on the following platforms:
1. Kafka (0.10) - HDInsight
2. Spark (2.4.3) - Databricks
3. HBase (1.2) - HDInsight
All three components are under the same VNet, so there is no connectivity issue.
I'm using Spark Structured Streaming and am successfully able to connect to Kafka as a source.
Since Spark does not provide native support for HBase, I'm using the Spark Hortonworks Connector (SHC) to write data to HBase, and I have implemented the batch write to HBase in the foreachBatch API provided in Spark 2.4 onward.
The code is as below:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.Dataset
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame

//code dependencies
import com.JEM.conf.HbaseTableConf
import com.JEM.constant.HbaseConstant

//'Spark Hortonworks Connector' dependency
import org.apache.spark.sql.execution.datasources.hbase._

//---------Variables--------------//
val kafkaBroker = "valid kafka broker"
val topic = "valid kafka topic"
val kafkaCheckpointLocation = "/checkpointDir"

//---------code--------------//
import spark.sqlContext.implicits._

val kafkaIpStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBroker)
  .option("subscribe", topic)
  .option("checkpointLocation", kafkaCheckpointLocation)
  .option("startingOffsets", "earliest")
  .load()

val streamToBeSentToHbase = kafkaIpStream.selectExpr("cast (key as String)", "cast (value as String)")
  .withColumn("ts", split($"key", "/")(1))
  .selectExpr("key as rowkey", "ts", "value as val")
  .writeStream
  .option("failOnDataLoss", false)
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .write
      .options(Map(HBaseTableCatalog.tableCatalog -> HbaseTableConf.getRawtableConf(HbaseConstant.hbaseRawTable), HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase").save
  }.start()
From the logs I can see that the code is successfully able to get the data, but when it tries to write to HBase I get the following exception.
19/10/09 12:42:48 ERROR MicroBatchExecution: Query [id = 1a54283d-ab8a-4bf4-af65-63becc166328, runId = 670f90de-8ca5-41d7-91a9-e8d36dfeef66] terminated with error
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/NamespaceNotFoundException
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:72)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:88)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:183)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:116)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:306)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:292)
at com.JEM.Importer.StreamingHbaseImporter$.$anonfun$main$1(StreamingHbaseImporter.scala:57)
at com.JEM.Importer.StreamingHbaseImporter$.$anonfun$main$1$adapted(StreamingHbaseImporter.scala:53)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:36)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:568)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:566)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:565)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:207)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:296)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.NamespaceNotFoundException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 44 more
19/10/09 12:42:48 INFO SparkContext: Invoking stop() from shutdown hook
19/10/09 12:42:48 INFO AbstractConnector: Stopped Spark#42f85fa4{HTTP/1.1,[http/1.1]}{172.20.170.72:47611}
19/10/09 12:42:48 INFO SparkUI: Stopped Spark web UI at http://172.20.170.72:47611
19/10/09 12:42:48 INFO StandaloneSchedulerBackend: Shutting down all executors
19/10/09 12:42:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/10/09 12:42:48 INFO SQLAppStatusListener: Execution ID: 1 Total Executor Run Time: 0
19/10/09 12:42:49 INFO SQLAppStatusListener: Execution ID: 0 Total Executor Run Time: 0
19/10/09 12:42:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/10/09 12:42:49 INFO MemoryStore: MemoryStore cleared
19/10/09 12:42:49 INFO BlockManager: BlockManager stopped
19/10/09 12:42:49 INFO BlockManagerMaster: BlockManagerMaster stopped
19/10/09 12:42:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/10/09 12:42:49 INFO SparkContext: Successfully stopped SparkContext
19/10/09 12:42:49 INFO ShutdownHookManager: Shutdown hook called
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/temporaryReader-79d0f9b8-c380-4141-9ac2-46c257c6c854
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/temporary-d00d2f73-96e3-4a18-9d5c-a9ff76a871bb
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-b92c0171-286b-4863-9fac-16f4ac379da8
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-ef92ca6b-2e7d-4917-b407-4426ad088cee
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/spark-9b138379-fa3a-49db-95dc-436cd7040a95
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-2666c4ff-30a2-4161-868e-137af5fa3787
What I have tried so far:
There are three ways to run Spark jobs in Databricks:
Notebook: I have installed the SHC JAR as a library on the cluster and placed hbase-site.xml into my job JAR and installed that onto the cluster as well. I then pasted the main class code into a notebook; when I run it I'm able to load the dependencies from SHC but get the above error.
Jar: This is almost the same as the notebook approach, with the difference that instead of a notebook I give the main class and the JAR to the job and run it. It gives me the same error.
Spark submit: I created an uber-jar with all the dependencies including SHC, uploaded the hbase-site.xml file to a DBFS path, and provided it in the submit command as below.
[
"--class","com.JEM.Importer.StreamingHbaseImporter","dbfs:/FileStore/JarPath/jarfile.jar",
"--packages","com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
"--repositories","http://repo.hortonworks.com/content/groups/public/",
"--files","dbfs:/PathToSiteFile/hbase_site.xml"
]
Still I get the same error. Can anyone please help?
Thanks

Spark-Elasticsearch Connectivity Issue

I have a piece of code that should index data to Elasticsearch:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df = spark.createDataFrame([{'num': i} for i in xrange(10)])
df = df.drop('_id')
df.write.format('es').option('es.nodes', '3.45.67.131').option('es.nodes.wan.only','true').option('es.port', 9200).option('es.resource', '%s/%s' % ('index_name', 'doc_type_name')).save()
However, when I execute this as a spark-submit job using
spark-submit --packages org.elasticsearch:elasticsearch-hadoop:7.2.0 test-chetan.py
I get the following error:
Traceback (most recent call last):
File "/mnt/tmp/test-chetan.py", line 5, in <module>
df.write.format('es').option('es.nodes', '3.45.67.131').option('es.nodes.wan.only','true').option('es.resource', '%s/%s' % ('index_name', 'doc_type_name')).save()
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 732, in save
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.save.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [index_name] failed; server[3.15.27.191:9200] returned [503|Service Unavailable:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:469)
at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:439)
at org.elasticsearch.hadoop.rest.RestClient.exists(RestClient.java:529)
at org.elasticsearch.hadoop.rest.RestClient.indexExists(RestClient.java:524)
at org.elasticsearch.hadoop.rest.RestRepository.isEmpty(RestRepository.java:466)
at org.elasticsearch.spark.sql.ElasticsearchRelation.isEmpty(DefaultSource.scala:625)
at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:110)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
19/07/19 21:27:51 INFO SparkContext: Invoking stop() from shutdown hook
19/07/19 21:27:51 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-40-1.us-east-2.compute.internal:4041
19/07/19 21:27:51 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/19 21:27:51 INFO MemoryStore: MemoryStore cleared
19/07/19 21:27:51 INFO BlockManager: BlockManager stopped
19/07/19 21:27:51 INFO BlockManagerMaster: BlockManagerMaster stopped
19/07/19 21:27:51 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/19 21:27:51 INFO SparkContext: Successfully stopped SparkContext
19/07/19 21:27:51 INFO ShutdownHookManager: Shutdown hook called
19/07/19 21:27:51 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d23384c9-63c3-4875-a254-d403226cccdd/pyspark-5bc00e36-b585-4c18-96c2-59aa20848db2
19/07/19 21:27:51 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d23384c9-63c3-4875-a254-d403226cccdd
19/07/19 21:27:51 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-605e10a6-0232-4f55-855d-a04ef83fa886
I am not quite able to debug the cause of this line:
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [index_name] failed; server[3.15.27.191:9200] returned [503|Service Unavailable:]
My AWS Elasticsearch is publicly accessible, and EMR, where Spark is deployed, has full egress access, so I don't believe it's a security issue.
Any suggestions?
Try to specify all the nodes of your ES cluster under the 'es.nodes' configuration.
When you set 'es.nodes.wan.only' to 'true', you are preventing the connector from accessing nodes that were not specified in 'es.nodes'.
Alternatively, you could set 'es.nodes.wan.only' to 'false'. From the documentation:
"es.nodes.wan.only (default false) - Whether the connector is used against an Elasticsearch instance in a cloud/restricted environment over the WAN, such as Amazon Web Services. In this mode, the connector disables discovery and only connects through the declared es.nodes during all operations, including reads and writes. Note that in this mode, performance is highly affected."
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html
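For illustration, a hedged variant of the write from the question with discovery re-enabled (the host, index, and doc type are the question's placeholders):
from pyspark.sql import SparkSession

# Sketch only: same write as in the question, but with es.nodes.wan.only set
# to false so the connector may discover and use the other nodes of the cluster.
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df = spark.createDataFrame([{'num': i} for i in range(10)])

(df.write.format('es')
    .option('es.nodes', '3.45.67.131')        # ideally list every ES node here
    .option('es.nodes.wan.only', 'false')
    .option('es.port', '9200')
    .option('es.resource', 'index_name/doc_type_name')
    .save())
If the cluster really is only reachable over the WAN (as with a managed AWS endpoint), keeping es.nodes.wan.only=true and pointing es.nodes at the reachable endpoint is the other option the quoted setting describes.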

Spark Streaming job failing with ArrayBuffer(kafka.common.NotLeaderForPartitionException)

My Spark Streaming job (Spark 1.6.1, Kafka 0.9.0) is consuming from a Kafka topic with 20 partitions.
Offsets are being maintained in an Oracle DB.
At job startup I read the offsets from Oracle (read once), and I write the offsets back to Oracle after processing.
My job ran successfully for 8 hours and then failed with the reason below. There were no changes to the Kafka topic, Spark program, or Oracle code at the time of the failure.
Can anyone tell me why I am getting this error on a running Spark Streaming job?
16/11/02 08:09:21 ERROR JobScheduler: Error generating jobs for time 1478074160000 ms
org.apache.spark.SparkException: ArrayBuffer(kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([MyTopic,11]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: ArrayBuffer(kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([MyTopic,11]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/11/02 08:09:21 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
16/11/02 08:09:21 INFO JobGenerator: Stopping JobGenerator immediately
16/11/02 08:09:21 INFO RecurringTimer: Stopped timer for JobGenerator after time 1478074160000
16/11/02 08:09:21 INFO JobGenerator: Stopped JobGenerator
16/11/02 08:09:21 INFO JobScheduler: Stopped JobScheduler
16/11/02 08:09:21 INFO StreamingContext: StreamingContext stopped successfully
16/11/02 08:09:21 INFO SparkContext: Invoking stop() from shutdown hook
16/11/02 08:09:21 INFO SparkUI: Stopped Spark web UI at http://10.251.228.103:4040
16/11/02 08:09:21 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/11/02 08:09:21 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/11/02 08:09:21 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/02 08:09:21 INFO MemoryStore: MemoryStore cleared
16/11/02 08:09:21 INFO BlockManager: BlockManager stopped
16/11/02 08:09:21 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/02 08:09:21 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/02 08:09:21 INFO SparkContext: Successfully stopped SparkContext
16/11/02 08:09:21 INFO ShutdownHookManager: Shutdown hook called
16/11/02 08:09:21 INFO ShutdownHookManager: Deleting directory /app/spark/spark-1.6.1-bin-hadoop2.6/local/spark-30fb329c-3ccf-4d8c-a06c-2d36e6f968b3/httpd-f81472a2-3262-4eea-8d64-7ff96d2ef3e5
16/11/02 08:09:21 INFO ShutdownHookManager: Deleting directory /app/spark/spark-1.6.1-bin-hadoop2.6/local/spark-30fb329c-3ccf-4d8c-a06c-2d36e6f968b3
For me, the problem was simply that my Kafka server had been interrupted. Easy enough to start it back up:
./bin/kafka-server-start.sh -daemon config/server.properties
