Apache Spark - Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream - apache-spark

I am trying to install Spark (without Hadoop).
Java version: 1.8.0_202
Spark version: spark-3.3.1
Python version: 3.7.15
When I execute spark-shell or pyspark, I get this error:
[spark@de ~]$ spark-shell
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Pyspark:
[spark@de ~]$ pyspark
Python 3.7.15 (default, Nov 1 2022, 23:18:36)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Traceback (most recent call last):
File "/opt/spark-3.3.1/python/pyspark/shell.py", line 36, in <module>
SparkContext._ensure_initialized()
File "/opt/spark-3.3.1/python/pyspark/context.py", line 417, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/opt/spark-3.3.1/python/pyspark/java_gateway.py", line 106, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
What is wrong? Do I need Hadoop? If so, why is there a download option called spark-3.3.1-bin-without-hadoop.tgz (without Hadoop)?

According to the Spark documentation:
Spark uses Hadoop’s client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions. Users can also download a “Hadoop free” binary and run Spark with any Hadoop version by augmenting Spark’s classpath.
You do not need a Hadoop cluster to be set up, but you do need some Hadoop libraries on Spark's classpath.
In your case, it seems you don't have a Hadoop installation set up, so you should go with the binaries that are pre-built with Hadoop (for example spark-3.3.1-bin-hadoop3.tgz).
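If you do want to stick with the "Hadoop free" build, the documented approach is to augment Spark's classpath with the Hadoop client jars yourself. A minimal sketch, assuming a separate Hadoop installation whose hadoop command is on your PATH, is to add this to conf/spark-env.sh:
# make the Hadoop client jars visible to Spark
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
With that in place, spark-shell and pyspark can load classes such as org.apache.hadoop.fs.FSDataInputStream from the Hadoop installation's own jars.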

Related

Not able to run simple pyflink word_count.py on aws emr

I have created an EMR cluster (v5.35.0) and am trying to run a sample word_count.py to verify that I am able to execute a Flink job.
I am able to use python3, as mentioned in this question: How do you run pyflink scripts on AWS EMR?
I am using the command below to submit the job from /usr/lib/flink on the master node:
flink run -m yarn-cluster --python examples/python/table/word_count.py
but I run into the following error:
Executing word_count example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Traceback (most recent call last):
File "examples/python/table/word_count.py", line 146, in <module>
word_count(known_args.input, known_args.output)
File "examples/python/table/word_count.py", line 121, in word_count
.execute_insert('sink') \
File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/table/table_result.py", line 76, in wait
File "/usr/lib/flink/opt/python/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/util/exceptions.py", line 146, in deco
File "/usr/lib/flink/opt/python/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.await.
: java.util.concurrent.ExecutionException: org.apache.flink.table.api.TableException: Failed to wait job finish
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.table.api.internal.TableResultImpl.awaitInternal(TableResultImpl.java:129)
at org.apache.flink.table.api.internal.TableResultImpl.await(TableResultImpl.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.flink.table.api.TableException: Failed to wait job finish
at org.apache.flink.table.api.internal.InsertResultIterator.hasNext(InsertResultIterator.java:56)
at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:370)
at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.isFirstRowReady(TableResultImpl.java:383)
at org.apache.flink.table.api.internal.TableResultImpl.lambda$awaitInternal$1(TableResultImpl.java:116)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8064c1bde7be5c84d7086c13da8cb82b)
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.table.api.internal.InsertResultIterator.hasNext(InsertResultIterator.java:54)
... 7 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8064c1bde7be5c84d7086c13da8cb82b)
at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:125)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$26(RestClusterClient.java:698)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
... 3 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:123)
... 24 more
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.io.IOException: Failed to execute the command: python3 -c import pyflink;import os;print(os.path.join(os.path.abspath(os.path.dirname(pyflink.__file__)), 'bin'))
output: Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pyflink'
at org.apache.flink.python.util.PythonEnvironmentManagerUtils.execute(PythonEnvironmentManagerUtils.java:211)
at org.apache.flink.python.util.PythonEnvironmentManagerUtils.getPythonUdfRunnerScript(PythonEnvironmentManagerUtils.java:154)
at org.apache.flink.python.env.beam.ProcessPythonEnvironmentManager.createEnvironment(ProcessPythonEnvironmentManager.java:156)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.createPythonExecutionEnvironment(BeamPythonFunctionRunner.java:395)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.lambda$open$0(BeamPythonFunctionRunner.java:243)
at org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:539)
at org.apache.flink.runtime.memory.SharedResources.createResource(SharedResources.java:126)
at org.apache.flink.runtime.memory.SharedResources.getOrAllocateSharedResource(SharedResources.java:72)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:555)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.open(BeamPythonFunctionRunner.java:246)
at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:131)
at org.apache.flink.table.runtime.operators.python.AbstractStatelessFunctionOperator.open(AbstractStatelessFunctionOperator.java:110)
at org.apache.flink.table.runtime.operators.python.table.PythonTableFunctionOperator.open(PythonTableFunctionOperator.java:113)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:110)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:711)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:687)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:750)
I have chosen Spark, Hadoop, Flink, Presto, and ZooKeeper as the frameworks.
It works without a glitch if I use WordCount.jar, but it doesn't work for word_count.py.
I am not sure why it reports that the pyflink module is not found. As a last-ditch effort I also installed Apache Flink again on the master node using pip, but the same error occurs:
pip install apache-flink==1.14
Any pointers would be helpful.
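One thing worth checking, though it is not from the original thread: the stack trace shows the failure comes from the YARN task containers running python3 -c "import pyflink", so the apache-flink package must be importable by the Python interpreter on the worker nodes, not just on the master. A rough sketch of that check and one possible fix (the version and paths are illustrative):
# on a worker/core node: can the cluster's python3 import pyflink?
python3 -c "import pyflink; print(pyflink.__file__)"
# install the PyFlink release matching the cluster's Flink on every node
# (for example via an EMR bootstrap action), then point the job at it:
sudo python3 -m pip install apache-flink==1.14.2
flink run -m yarn-cluster -pyexec /usr/bin/python3 --python examples/python/table/word_count.py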

How to resolve spark error when reading from s3

I am getting an error (java.io.IOException: No FileSystem for scheme: S3a) when running a Spark application. I have looked through various other questions regarding this type of error, but I'm not able to determine the solution. Spark is version 3.1.2.
Updated details below to reflect current state
pyspark script:
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("s3reader") \
    .getOrCreate()
sc = spark.sparkContext
#sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint","xxx.x.xxx.x.com", "us-1-east")
#sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
print(df)
Here are my jar versions:
cloud@spark-dev-master:/usr/local/spark/jars$ ls -ltr *aws*
-rw-rw-r-- 1 cloud cloud 126287 Aug 18 2016 hadoop-aws-2.7.4.jar
-rw-rw-r-- 1 cloud cloud 4479 Sep 17 02:36 aws-java-sdk-1.7.4.jar
stack trace:
Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 18, in <module>
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.json.
: java.io.IOException: No FileSystem for scheme: S3a
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
You need to use hadoop-aws version 3.2.0.
You can refer to my previous answer here.
I am getting an error (java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities)
This is what you see when you mix hadoop-aws and hadoop-common JAR versions. They must match point for point (as the Spark JARs also require).
Do not attempt to work around this other than by syncing up the JARs; you will only be moving stack traces around.
See the Hadoop "Troubleshooting S3A" documentation.
As there still appeared to be jar dependency issues, I did a fresh install of Spark 3.1.2 with Hadoop 3.2.0 and aligned the hadoop-aws and aws-java-sdk jars with the hadoop-common jar version on the master and worker nodes. This corrected the file system issue. Upgrading to 3.2.0 also corrected the endpoint issue we were running into, as well as the path.style.access issue: path.style.access=true is not supported in any Hadoop version older than 2.8.0. That issue is documented at https://issues.apache.org/jira/browse/HADOOP-12963 for reference.
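For reference, a minimal way to pick up a matching hadoop-aws build without copying jars around by hand is to let Spark resolve it at launch time. This is only a sketch, not the setup from the question, and the version must line up with the Hadoop version your Spark build ships with:
# hadoop-aws 3.2.0 should pull in a compatible aws-java-sdk-bundle transitively
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.0
It may also be worth writing the path with a lower-case s3a:// scheme, since the error message echoes the scheme exactly as it was typed ("S3a").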

pyspark SparkContext issue "Another SparkContext is being constructed"

I installed Spark on my EC2 instance following this tutorial:
https://sparkour.urizone.net/recipes/installing-ec2/#03
but when I try to start the pyspark shell, I get this error:
"Another SparkContext is being constructed"
Here is the full exception:
[ec2-user@ip-10-0-0-153 ~]$ pyspark
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 11:46:16 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 54, in <module>
spark = SparkSession.builder.getOrCreate()
File "/opt/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/spark/python/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/spark/python/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/opt/spark/python/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/spark/python/pyspark/context.py", line 273, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.internal.config.package$
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
I googled a lot and tried everything with no solution. I used this code to get a list of all running Contexts:
>>> from pyspark import SparkConf
>>> conf = SparkConf()
>>> conf.getAll()
And I got this:
[(u'spark.master', u'local[*]'), (u'spark.submit.deployMode', u'client'), (u'spark.app.name', u'PySparkShell')]
Any ideas how I can solve this issue?
I encountered the same error while trying to run PySpark in a Jupyter Notebook (on macOS). The problem seems to be a Java version incompatibility; it only works with Java 8. I fixed the issue by switching the Java version:
Check the installed Java (JVM) versions. If you don't have Java 8 installed, install it following the instructions for your OS:
/usr/libexec/java_home -V
Change the version to Java 8 (it is enough to do it for the current session):
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
Check:
java -version
Run PySpark. That should solve the issue.
I solved the problem by setting SPARK_MASTER_HOST=127.0.0.1 in the spark-env.sh file.
Navigate to the Spark config folder:
cd $SPARK_HOME/conf
Open the spark-env.sh file in an editor (if this file does not exist, copy it from spark-env.sh.template):
vi spark-env.sh
Edit the SPARK_LOCAL_IP value to your machine's IP (e.g. 205.210.42.205):
SPARK_LOCAL_IP="205.210.42.205"
Hope this helps.
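Putting the two suggestions together, a minimal conf/spark-env.sh sketch for a single-machine setup (the address below is just the example value used above) could be:
SPARK_MASTER_HOST=127.0.0.1
SPARK_LOCAL_IP="205.210.42.205"
Restart the pyspark shell after editing the file so the new values are picked up.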

Adding S3DistCp to PySpark

I'm trying to add S3DistCp to my local, standalone Spark install. I've downloaded S3DistCp:
aws s3 cp s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar .
And the AWS SDK as well:
wget http://sdk-for-java.amazonwebservices.com/latest/aws-java-sdk.zip
I extracted the AWS SDK:
unzip aws-java-sdk.zip
Then added s3distcp.jar to my spark-defaults.conf:
spark.driver.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp.jar
spark.executor.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp.jar
Then I added the AWS SDK and all of its dependencies to $LIBJARS and $HADOOP_CLASSPATH:
export $LIBJARS=/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/aspectjrt-1.8.2.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/aspectjweaver.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/commons-codec-1.9.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/commons-logging-1.1.3.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/freemarker-2.3.9.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/httpclient-4.5.2.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/httpcore-4.4.4.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/ion-java-1.0.1.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-annotations-2.6.0.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-core-2.6.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-databind-2.6.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-dataformat-cbor-2.6.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/javax.mail-api-1.4.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jmespath-java-1.11.86.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/joda-time-2.8.1.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/json-path-2.2.0.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/slf4j-api-1.7.16.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-beans-3.0.7.RELEASE.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-context-3.0.7.RELEASE.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-core-3.0.7.RELEASE.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-test-3.0.7.RELEASE.jar
export HADOOP_CLASSPATH=$LIBJARS
But when I try to start the pyspark shell:
$ pyspark
I get the following error:
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/06 17:48:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/06 17:48:50 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/shell.py", line 47, in <module>
spark = SparkSession.builder.getOrCreate()
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 307, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 179, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 246, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.common.cache.LocalCache
at com.google.common.cache.LocalCache$LocalLoadingCache.<init>(LocalCache.java:4867)
at com.google.common.cache.CacheBuilder.build(CacheBuilder.java:785)
at org.apache.hadoop.security.Groups.<init>(Groups.java:101)
at org.apache.hadoop.security.Groups.<init>(Groups.java:74)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:303)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:284)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2373)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2373)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2373)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
If I remove s3distcp.jar from spark-defaults.conf, the error goes away. There just doesn't seem to be much documentation on how to deploy this, since it's provided as part of EMR.
I was able to get this working by passing --driver-class-path to pyspark:
$ pyspark \
--driver-class-path \
~/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar:\
~/Downloads/aws-java-sdk-1.11.86/third-party/lib/*
To set this up in spark-defaults.conf I had to do it this way:
spark.jars /Users/mark.miller/.ivy2/jars/s3distcp.jar
spark.driver.extraClassPath /Users/mark.miller/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar:/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/*
I also learned that spark.executor.extraClassPath is only needed for backwards compatibility with older versions of Spark (https://spark.apache.org/docs/latest/configuration.html#runtime-environment).
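For completeness, the command-line equivalent of those two spark-defaults.conf entries (a sketch that simply mirrors the paths above) is:
pyspark \
  --jars /Users/mark.miller/.ivy2/jars/s3distcp.jar \
  --driver-class-path "/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar:/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/*"
Here --jars corresponds to spark.jars and --driver-class-path to spark.driver.extraClassPath.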

Cassandra Startup java.lang.reflect.InvocationTargetException

What could be the solution for this exception?
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:322)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:48)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:359)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150)
Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so: /tmp/snappy-1.0.5-libsnappyjava.so: failed to map segment from shared object: Operation not permitted
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1957)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1882)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1843)
at java.lang.Runtime.load0(Runtime.java:795)
at java.lang.System.load(System.java:1061)
at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)
... 11 more
ERROR 11:46:22,430 Exception in thread Thread[WRITE-/172.20.6.138,5,main]
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:239)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:48)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:66)
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:359)
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:150)
But the Cassandra startup continues, and at the end I am getting this exception:
java.lang.IllegalStateException: Unable to contact any seeds!
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:347)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:446)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:489)
Exception encountered during startup: Unable to contact any seeds!
ERROR 11:55:11,732 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:519)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:724)
But the seed nodes are running.
The machine firewall is disabled.
I also tried changing the default tmp directory: I tried adding JVM_OPTS=-Dorg.xerial.snappy.tempdir=/tmp to the cassandra.in.sh file, but the problem remains.
Linux version: 2.6.32-358.6.1.el6.x86_64 (mockbuild@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)) #1 SMP Tue Apr 23 19:29:00 UTC 2013
Please help me. Thanks in advance.
They mention this error here in the Cassandra JIRA:
https://issues.apache.org/jira/browse/CASSANDRA-4400
I had a similar error with Cassandra 1.2, and I fixed it by replacing the snappy 1.0.5 library with snappy 1.0.4.1. If that link doesn't work, you should be able to grab it from an earlier version of Cassandra (like 1.1.x).
Basically, remove snappy-java-1.0.5.jar from your $CASSANDRA_HOME/lib directory and replace it with snappy-java-1.0.4.1.jar.
I found an answer written by Erik in
http://mail-archives.apache.org/mod_mbox/cassandra-user/201312.mbox/%3C52C1DEC4.2080904@cj.com%3E
You can add something like this to cassandra-env.sh :
JVM_OPTS="$JVM_OPTS
-Dorg.xerial.snappy.tempdir=/path/that/allows/executables"
I created a tmp folder in /var/lib/cassandra/
so I added
JVM_OPTS="$JVM_OPTS -Dorg.xerial.snappy.tempdir=/var/lib/cassandra/tmp/"
Hope this helps somebody.
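A side note that is not in either answer above: the underlying UnsatisfiedLinkError ("failed to map segment from shared object: Operation not permitted") is what you typically see when the temp directory is mounted noexec, which is why pointing org.xerial.snappy.tempdir at a location that allows executables works. A rough sketch of that workaround (the path is just an example):
# create a temp dir on a filesystem mounted without noexec
mkdir -p /var/lib/cassandra/tmp
chown cassandra:cassandra /var/lib/cassandra/tmp
# then in cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -Dorg.xerial.snappy.tempdir=/var/lib/cassandra/tmp"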
