SparkContext: Error initializing SparkContext While Running Spark Job

I'm writing a Spark program that loads data from Elasticsearch into HDFS, but I'm getting an "Error initializing SparkContext" error while running the job. The error occurs while creating the Spark session.
Hadoop: 3.2.1
Spark: 2.4.4
Elasticsearch Spark (for Spark 2.X): 7.5.1
EMR: 6.0.0
Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
spark = SparkSession \
    .builder \
    .getOrCreate()
Error Log:
20/10/22 10:09:12 ERROR SparkContext: Error initializing SparkContext.
java.util.ServiceConfigurationError: org.apache.spark.deploy.yarn.security.ServiceCredentialProvider: Provider org.elasticsearch.spark.deploy.yarn.security.EsServiceCredentialProvider could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableLike.to(TraversableLike.scala:678)
at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
at scala.collection.AbstractTraversable.to(Traversable.scala:108)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.loadCredentialProviders(YARNHadoopDelegationTokenManager.scala:82)
at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.getCredentialProviders(YARNHadoopDelegationTokenManager.scala:73)
at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.<init>(YARNHadoopDelegationTokenManager.scala:46)
at org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:308)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1013)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:178)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:183)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/security/ServiceCredentialProvider$class
at org.elasticsearch.spark.deploy.yarn.security.EsServiceCredentialProvider.<init>(EsServiceCredentialProvider.scala:63)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 40 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.security.ServiceCredentialProvider$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 47 more
20/10/22 10:09:12 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-1-155.us-east-2.test:4040
20/10/22 10:09:12 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
20/10/22 10:09:12 INFO YarnClientSchedulerBackend: Stopped
20/10/22 10:09:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/10/22 10:09:12 INFO MemoryStore: MemoryStore cleared
20/10/22 10:09:12 INFO BlockManager: BlockManager stopped
20/10/22 10:09:12 INFO BlockManagerMaster: BlockManagerMaster stopped
20/10/22 10:09:12 WARN MetricsSystem: Stopping a MetricsSystem that is not running
20/10/22 10:09:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/10/22 10:09:12 INFO SparkContext: Successfully stopped SparkContext
20/10/22 10:09:13 INFO ShutdownHookManager: Shutdown hook called
20/10/22 10:09:13 INFO ShutdownHookManager: Deleting directory /vol1/tmp/spark-b39bb8cc-5bc7-4721-89bd-8bd62b9e527e
20/10/22 10:09:13 INFO ShutdownHookManager: Deleting directory /vol1/tmp/spark-d94995f0-05b6-476f-935e-8ba501acbed3
at com.company.utils.ResourceScriptUtils.executeScript(ResourceScriptUtils.java:114)
at com.company.utils.ResourceScriptUtils.executeScript(ResourceScriptUtils.java:135)
at com.company.loader.impl.realTimeProcessing.RealTimeEsLoader.processJob(RealTimeEsLoader.java:232)
at com.company.loader.App.main(App.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:153)
at com.company.multijob.MultiJob$HadoopJob.call(MultiJob.java:50)
at com.company.multijob.MultiJob$HadoopJob.call(MultiJob.java:38)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Could someone please help? Thanks.

This happens because the Spark YARN JARs are missing from your Spark application.
Add the following to your pom.xml if you are using Maven.
Under dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.11</artifactId>
    <version>2.4.7</version>
</dependency>
Under artifactItems:
<artifactItem>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.11</artifactId>
    <version>2.4.7</version>
    <type>jar</type>
    <overWrite>false</overWrite>
    <outputDirectory>${project.build.directory}/classes/</outputDirectory>
    <destFileName>optional-new-name.jar</destFileName>
</artifactItem>
Note: change the Scala and Spark versions to match your environment; here I'm using Spark 2.4.7 and Scala 2.11.
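If you can't modify the build, an alternative sketch is to resolve the same artifact at submit time through spark.jars.packages when building the session (assuming the spark-yarn_2.11:2.4.7 coordinates above match your Spark/Scala build; untested against this exact EMR setup):
from pyspark.sql import SparkSession

# Sketch: pull the spark-yarn artifact at submit time instead of bundling it
# into the application JAR. Coordinates assume Spark 2.4.7 / Scala 2.11.
spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.spark:spark-yarn_2.11:2.4.7") \
    .getOrCreate()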

Related

spark-sql/spark-submit with delta lake is resulting null pointer exception (at org.apache.spark.storage.BlockManagerMasterEndpoint)

I'm using Delta Lake with PySpark, submitting it with the command below:
spark-sql --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
System specs:
Spark - 3.0.3
Scala - 2.12.10
Java - 1.8.0
Hadoop - 2.7
I'm following these reference guides:
https://docs.delta.io/latest/quick-start.html
https://www.confessionsofadataguy.com/introduction-to-delta-lake-on-apache-spark-for-data-engineers/
When I run Spark without the Delta config, it works fine.
Error (truncated stack trace):
22/12/29 20:38:10 INFO Executor: Starting executor ID driver on host 192.168.0.100
22/12/29 20:38:10 INFO Executor: Fetching spark://192.168.0.100:50920/jars/org.antlr_antlr4-runtime-4.7.jar with timestamp 1672326488255
22/12/29 20:38:23 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
22/12/29 20:38:23 INFO Executor: Told to re-register on heartbeat
22/12/29 20:38:23 INFO BlockManager: BlockManager null re-registering with master
22/12/29 20:38:23 INFO BlockManagerMaster: Registering BlockManager null
22/12/29 20:38:23 ERROR Inbox: Ignoring error
java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:404)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:97)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
22/12/29 20:38:23 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)
at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:66)
at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:567)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:934)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:200)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:404)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:97)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
22/12/29 20:38:31 ERROR Utils: Aborting task
java.io.IOException: Failed to connect to /192.168.0.100:50920
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:321)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: /192.168.0.100:50920
Caused by: java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at java.lang.Thread.run(Thread.java:745)
22/12/29 20:38:31 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Failed to connect to /192.168.0.100:50920
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:745)
22/12/29 20:38:31 INFO SparkUI: Stopped Spark web UI at http://192.168.0.100:4041
22/12/29 20:38:31 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:168)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:734)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2171)
java.lang.NullPointerException
at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:304)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
at org.apache.spark.executor.Executor.stop(Executor.scala:304)
at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:74)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
22/12/29 20:38:31 INFO ShutdownHookManager: Shutdown hook called
What am I missing?
Try it like this:
!pip3 install findspark --user
import findspark
findspark.init()
import os
from pyspark.sql import SparkSession  # missing from the original snippet

# Pick a delta-core version that matches your Spark version: delta-core 2.2.0
# pairs with Spark 3.3.x, while Spark 3.0.3 (as in the question) needs an
# older line such as delta-core 0.8.0.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.delta:delta-core_2.12:2.2.0 --driver-memory 2g pyspark-shell'
spark = SparkSession.builder.appName("your application") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
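A quick sanity check that the Delta extension is actually active (the path is a hypothetical local placeholder):
# Round-trip write/read of a tiny Delta table; fails fast if the Delta
# packages or SQL extension are not wired up.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")
spark.read.format("delta").load("/tmp/delta-smoke-test").show()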

How to connect S3 to pyspark on local (org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3")

Spark version: 3.2.1
Hadoop version: 3.3.1
I have followed all the posts on Stack Overflow but couldn't get it to run. I am new to Spark and am trying to read a JSON file.
On my local Mac I have installed Homebrew and then PySpark. I have just downloaded the JARs; do I need to put them somewhere?
The JARs downloaded are:
hadoop-aws-3.3.1
aws-java-sdk-bundle-1.12.172
I have kept them under /usr/local/Cellar/apache-spark/3.2.1/libexec/jars
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

access_id = "A*"
access_key = "C*"

## set Spark properties
conf = SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
spark._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", access_id)
spark._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df = spark.read.json("s3://pt/raw/Deal_20220114.json")
df.show()
Error:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
How am I running it locally?
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 test.py
Error:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:747)
at scala.collection.immutable.List.map(List.scala:293)
at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:745)
at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:577)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:405)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
22/03/06 19:31:36 INFO SparkContext: Invoking stop() from shutdown hook
22/03/06 19:31:36 INFO SparkUI: Stopped Spark web UI at http://adityakxc5zmd6m.attlocal.net:4041
22/03/06 19:31:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/06 19:31:36 INFO MemoryStore: MemoryStore cleared
22/03/06 19:31:36 INFO BlockManager: BlockManager stopped
22/03/06 19:31:36 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/06 19:31:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/06 19:31:37 INFO SparkContext: Successfully stopped SparkContext
22/03/06 19:31:37 INFO ShutdownHookManager: Shutdown hook called
22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-1e346a99-5d6f-498d-9bc4-ce8a3f951718
22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-f2727124-4f4b-4ee3-a8c0-607e207a3a98/pyspark-b96bf92a-84e8-409e-b9df-48f303c57b70
22/03/06 19:31:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/7q/r0xvmq6n4p55r6d8nx9gmd7c0000gr/T/spark-f2727124-4f4b-4ee3-a8c0-607e207a3a98
No FileSystem for scheme "s3"
You haven't configured fs.s3.impl, so Spark doesn't know what to do with file paths starting with s3://.
Using fs.s3a.impl is recommended instead; you then access files with s3a:// paths.
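Putting that together, a minimal sketch of the S3A setup. It uses the standard Hadoop S3A property names (fs.s3a.access.key / fs.s3a.secret.key rather than the awsAccessKeyId variants in the question) and placeholder credentials:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.1') \
    .getOrCreate()

hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder credentials
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Note the s3a:// scheme instead of s3://
df = spark.read.json("s3a://pt/raw/Deal_20220114.json")
df.show()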

SPARK Error: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.4.1-libsnappyjava [duplicate]

This question already has answers here:
UnsatisfiedLinkError: no snappyjava in java.library.path when running Spark MLLib Unit test within Intellij
(4 answers)
UnsatisfiedLinkError: /tmp/snappy-1.1.4-libsnappyjava.so Error loading shared library ld-linux-x86-64.so.2: No such file or directory
(8 answers)
spark returns error libsnappyjava.so: failed to map segment from shared object: Operation not permitted
(2 answers)
Closed 3 years ago.
I am running CDH 5.16 standalone (single node) on a RHEL 7 server.
I have written a simple Spark job that reads a text file from HDFS and loads it as a Parquet file into a separate HDFS location. Whenever I run this code on the server (I use SBT to build the JAR and deploy it to the cluster with spark-submit), the following error is thrown:
19/06/07 12:56:04 INFO spark.SparkContext: Running Spark version 1.6.0
19/06/07 12:56:04 INFO spark.SecurityManager: Changing view acls to: ak_bng
19/06/07 12:56:04 INFO spark.SecurityManager: Changing modify acls to: ak_bng
19/06/07 12:56:04 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ak_bng); users with modify permissions: Set(ak_bng)
19/06/07 12:56:05 INFO util.Utils: Successfully started service 'sparkDriver' on port 44220.
19/06/07 12:56:05 INFO slf4j.Slf4jLogger: Slf4jLogger started
19/06/07 12:56:05 INFO Remoting: Starting remoting
19/06/07 12:56:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#10.188.223.5:36304]
19/06/07 12:56:05 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem#10.188.223.5:36304]
19/06/07 12:56:05 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 36304.
19/06/07 12:56:05 INFO spark.SparkEnv: Registering MapOutputTracker
19/06/07 12:56:05 INFO spark.SparkEnv: Registering BlockManagerMaster
19/06/07 12:56:05 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-c38a27e3-c483-4f56-ab7f-56e4c1be0832
19/06/07 12:56:05 INFO storage.MemoryStore: MemoryStore started with capacity 530.0 MB
19/06/07 12:56:06 INFO spark.SparkEnv: Registering OutputCommitCoordinator
19/06/07 12:56:06 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
19/06/07 12:56:06 INFO ui.SparkUI: Started SparkUI at http://10.188.223.5:4040
19/06/07 12:56:06 INFO spark.SparkContext: Added JAR file:/home/ak_bng/spark_jars/Simple_Project-assembly-1.0.jar at spark://10.188.223.5:44220/jars/Simple_Project-assembly-1.0.jar with timestamp 1559892366578
19/06/07 12:56:06 INFO executor.Executor: Starting executor ID driver on host localhost
19/06/07 12:56:06 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46170.
19/06/07 12:56:06 INFO netty.NettyBlockTransferService: Server created on 46170
19/06/07 12:56:06 INFO storage.BlockManager: external shuffle service port = 7337
19/06/07 12:56:06 INFO storage.BlockManagerMaster: Trying to register BlockManager
19/06/07 12:56:06 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:46170 with 530.0 MB RAM, BlockManagerId(driver, localhost, 46170)
19/06/07 12:56:06 INFO storage.BlockManagerMaster: Registered BlockManager
19/06/07 12:56:07 INFO scheduler.EventLoggingListener: Logging events to hdfs://indelsrv185.in.kworld.kpmg.com:8020/user/spark/applicationHistory/local-1559892366602
19/06/07 12:56:07 INFO spark.SparkContext: Registered listener com.cloudera.spark.lineage.ClouderaNavigatorListener
19/06/07 12:56:08 INFO parquet.ParquetRelation: Listing hdfs://10.188.223.5:8020/user/ak_bng/products on driver
19/06/07 12:56:08 INFO parquet.ParquetRelation: Listing hdfs://10.188.223.5:8020/user/ak_bng/categories on driver
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:312)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.apache.spark.io.SnappyCompressionCodec$.liftedTree1$1(CompressionCodec.scala:169)
at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version$lzycompute(CompressionCodec.scala:168)
at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version(CompressionCodec.scala:168)
at org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:152)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:74)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:81)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1334)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:48)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:46)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:53)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:53)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$5.apply(QueryExecution.scala:81)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$5.apply(QueryExecution.scala:81)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:61)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:81)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:50)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.dataSource$lzycompute$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter.org$apache$spark$sql$DataFrameWriter$$dataSource$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.executeAndCallQEListener(DataFrameWriter.scala:154)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:172)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:370)
at SimpleApp$.main(SimpleApp.scala:169)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.4.1-libsnappyjava.so: /tmp/snappy-1.0.4.1-libsnappyjava.so: failed to map segment from shared object: Operation not permitted
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)
... 65 more
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
at org.apache.spark.io.SnappyCompressionCodec$.liftedTree1$1(CompressionCodec.scala:171)
at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version$lzycompute(CompressionCodec.scala:168)
at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version(CompressionCodec.scala:168)
at org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:152)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:74)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:81)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1334)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:48)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:46)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:53)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:53)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:51)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.dataSource$lzycompute$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter.org$apache$spark$sql$DataFrameWriter$$dataSource$1(DataFrameWriter.scala:181)
at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.executeAndCallQEListener(DataFrameWriter.scala:154)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:188)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:172)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:370)
at SimpleApp$.main(SimpleApp.scala:169)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
at org.apache.spark.io.SnappyCompressionCodec$.liftedTree1$1(CompressionCodec.scala:169)
... 53 more
19/06/07 12:56:08 INFO spark.SparkContext: Invoking stop() from shutdown hook
19/06/07 12:56:08 INFO ui.SparkUI: Stopped Spark web UI at http://10.188.223.5:4040
19/06/07 12:56:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/06/07 12:56:09 INFO storage.MemoryStore: MemoryStore cleared
19/06/07 12:56:09 INFO storage.BlockManager: BlockManager stopped
19/06/07 12:56:09 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/06/07 12:56:09 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/06/07 12:56:09 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
19/06/07 12:56:09 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
19/06/07 12:56:09 INFO Remoting: Remoting shut down
19/06/07 12:56:09 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
19/06/07 12:56:09 INFO spark.SparkContext: Successfully stopped SparkContext
19/06/07 12:56:09 INFO util.ShutdownHookManager: Shutdown hook called
19/06/07 12:56:09 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-111712ef-39a8-41b6-bf6d-5d317d954fa1
Spark-submit command:
spark-submit --class SimpleApp --master local[8] /home/ak_bng/spark_jars/Simple_Project-assembly-1.0.jar
I went through a few links to resolve this issue (Snappy compression not working due to /tmp folder privileges; Apache Spark - Parquet / Snappy compression error), but none of them really provided a solution.
I had previously run Spark on HDFS (a separate installation) successfully without any errors. The problem started once CDH was installed.
I am new to setting up clusters and don't quite understand what the issue is here or how to resolve it. Can someone please shed some light on this?
I am using:
CDH 5.16
Spark 1.6.0
Server OS: RHEL 7
Hadoop 2.6
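The question is closed as a duplicate, but for context: the linked duplicates attribute this UnsatisfiedLinkError to snappy-java failing to extract and execute its native library, typically because /tmp is mounted noexec. A sketch of the usual workaround, redirecting the JVM temp dir to a location that allows execution (the path is a hypothetical example; in client mode these flags are normally passed on the spark-submit command line via --conf):
from pyspark import SparkConf, SparkContext

# Assumption: /tmp is mounted noexec, so the extracted libsnappyjava.so cannot
# be mapped. Point java.io.tmpdir at an executable directory instead.
tmp_opt = "-Djava.io.tmpdir=/var/tmp/spark-tmp"  # hypothetical path
conf = (SparkConf()
        .set("spark.driver.extraJavaOptions", tmp_opt)
        .set("spark.executor.extraJavaOptions", tmp_opt))
sc = SparkContext(conf=conf)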

Spark-Elasticsearch Connectivity Issue

I have a piece of code that should index data to Elasticsearch:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df = spark.createDataFrame([{'num': i} for i in xrange(10)])
df = df.drop('_id')
df.write.format('es') \
    .option('es.nodes', '3.45.67.131') \
    .option('es.nodes.wan.only', 'true') \
    .option('es.port', 9200) \
    .option('es.resource', '%s/%s' % ('index_name', 'doc_type_name')) \
    .save()
However, when I execute this as a spark-submit job using
spark-submit --packages org.elasticsearch:elasticsearch-hadoop:7.2.0 test-chetan.py
I get this error:
Traceback (most recent call last):
File "/mnt/tmp/test-chetan.py", line 5, in <module>
df.write.format('es').option('es.nodes', '3.45.67.131').option('es.nodes.wan.only','true').option('es.resource', '%s/%s' % ('index_name', 'doc_type_name')).save()
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 732, in save
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.save.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [index_name] failed; server[3.15.27.191:9200] returned [503|Service Unavailable:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:469)
at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:439)
at org.elasticsearch.hadoop.rest.RestClient.exists(RestClient.java:529)
at org.elasticsearch.hadoop.rest.RestClient.indexExists(RestClient.java:524)
at org.elasticsearch.hadoop.rest.RestRepository.isEmpty(RestRepository.java:466)
at org.elasticsearch.spark.sql.ElasticsearchRelation.isEmpty(DefaultSource.scala:625)
at org.elasticsearch.spark.sql.DefaultSource.createRelation(DefaultSource.scala:110)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
19/07/19 21:27:51 INFO SparkContext: Invoking stop() from shutdown hook
19/07/19 21:27:51 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-40-1.us-east-2.compute.internal:4041
19/07/19 21:27:51 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/07/19 21:27:51 INFO MemoryStore: MemoryStore cleared
19/07/19 21:27:51 INFO BlockManager: BlockManager stopped
19/07/19 21:27:51 INFO BlockManagerMaster: BlockManagerMaster stopped
19/07/19 21:27:51 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/07/19 21:27:51 INFO SparkContext: Successfully stopped SparkContext
19/07/19 21:27:51 INFO ShutdownHookManager: Shutdown hook called
19/07/19 21:27:51 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d23384c9-63c3-4875-a254-d403226cccdd/pyspark-5bc00e36-b585-4c18-96c2-59aa20848db2
19/07/19 21:27:51 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d23384c9-63c3-4875-a254-d403226cccdd
19/07/19 21:27:51 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-605e10a6-0232-4f55-855d-a04ef83fa886
I am not quite able to determine the cause of this line:
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [index_name] failed; server[3.15.27.191:9200] returned [503|Service Unavailable:]
My AWS Elasticsearch cluster is publicly accessible, and the EMR cluster where Spark is deployed has full egress access, so I don't believe it's a security issue.
Any suggestions?
Try specifying all the nodes of your ES cluster under the 'es.nodes' configuration.
When you set 'es.nodes.wan.only' to 'true', you prevent the connector from accessing nodes that were not specified in 'es.nodes'.
Alternatively, you could set 'es.nodes.wan.only' to false.
From the documentation:
"es.nodes.wan.only (default false) - Whether the connector is used against an Elasticsearch instance in a cloud/restricted environment over the WAN, such as Amazon Web Services. In this mode, the connector disables discovery and only connects through the declared es.nodes during all operations, including reads and writes. Note that in this mode, performance is highly affected."
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html
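For instance, a sketch of the same write with every node listed explicitly; the second and third addresses are hypothetical placeholders for the other nodes in the cluster:
df.write.format('es') \
    .option('es.nodes', '3.45.67.131,3.45.67.132,3.45.67.133') \
    .option('es.port', '9200') \
    .option('es.nodes.wan.only', 'true') \
    .option('es.resource', 'index_name/doc_type_name') \
    .save()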

Spark Streaming job failing with ArrayBuffer(kafka.common.NotLeaderForPartitionException)

My Spark Streaming job (Spark 1.6.1, Kafka 0.9.0) is consuming from a Kafka topic with 20 partitions.
Offsets are maintained in an Oracle DB.
At job startup I read the offsets from Oracle (once), and I write the offsets back to Oracle after processing.
My job ran successfully for 8 hours and then failed for the reason below. There were no changes to the Kafka topic, the Spark program, or the Oracle code around the time of the failure.
Can anyone tell me why I am getting this error on a running Spark Streaming job?
16/11/02 08:09:21 ERROR JobScheduler: Error generating jobs for time 1478074160000 ms
org.apache.spark.SparkException: ArrayBuffer(kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([MyTopic,11]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: ArrayBuffer(kafka.common.NotLeaderForPartitionException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([MyTopic,11]))
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:114)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:246)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/11/02 08:09:21 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
16/11/02 08:09:21 INFO JobGenerator: Stopping JobGenerator immediately
16/11/02 08:09:21 INFO RecurringTimer: Stopped timer for JobGenerator after time 1478074160000
16/11/02 08:09:21 INFO JobGenerator: Stopped JobGenerator
16/11/02 08:09:21 INFO JobScheduler: Stopped JobScheduler
16/11/02 08:09:21 INFO StreamingContext: StreamingContext stopped successfully
16/11/02 08:09:21 INFO SparkContext: Invoking stop() from shutdown hook
16/11/02 08:09:21 INFO SparkUI: Stopped Spark web UI at http://10.251.228.103:4040
16/11/02 08:09:21 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/11/02 08:09:21 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/11/02 08:09:21 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/02 08:09:21 INFO MemoryStore: MemoryStore cleared
16/11/02 08:09:21 INFO BlockManager: BlockManager stopped
16/11/02 08:09:21 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/02 08:09:21 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/02 08:09:21 INFO SparkContext: Successfully stopped SparkContext
16/11/02 08:09:21 INFO ShutdownHookManager: Shutdown hook called
16/11/02 08:09:21 INFO ShutdownHookManager: Deleting directory /app/spark/spark-1.6.1-bin-hadoop2.6/local/spark-30fb329c-3ccf-4d8c-a06c-2d36e6f968b3/httpd-f81472a2-3262-4eea-8d64-7ff96d2ef3e5
16/11/02 08:09:21 INFO ShutdownHookManager: Deleting directory /app/spark/spark-1.6.1-bin-hadoop2.6/local/spark-30fb329c-3ccf-4d8c-a06c-2d36e6f968b3
For me, the problem was simply that my Kafka server had been interrupted. Easy enough to start it back up:
./bin/kafka-server-start.sh -daemon config/server.properties
