Pyspark throws JNI error when creating a SparkContext - apache-spark

I am trying to run PySpark on Manjaro Linux with Python 2. I've created a test script that creates a SparkContext instance and stops it again:
import findspark
findspark.init()
from pyspark import SparkContext
if __name__ == '__main__':
    sc = SparkContext()
    sc.stop()
I launch this from a terminal with python2 filename.py. This has worked previously but for reasons I don't understand, this now raises the following:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Traceback (most recent call last):
File "mwe.py", line 22, in <module>
sc = SparkContext()
File "/opt/apache-spark/python/pyspark/context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/opt/apache-spark/python/pyspark/context.py", line 292, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/opt/apache-spark/python/pyspark/java_gateway.py", line 93, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I've read on SO about others solving this problem by using Java version 8 instead of 9 or 10. However, I seem to be running version 8 already, as archlinux-java status outputs:
Available Java environments:
java-10-openjdk
java-8-jdk
java-8-jre/jre
java-8-openjdk/jre (default)
I have no idea how to proceed from here, so any help would be greatly appreciated.

I had the same problem, and I fixed it by uninstalling java-10-openjdk
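If you want to confirm which Java the `java` command on your PATH actually resolves to (as opposed to what archlinux-java reports as the default), something like the following may help. This is only a sketch: the function names are made up for illustration, it assumes Python 3.5+ for subprocess.run, and it relies on `java -version` writing its banner to stderr.

```python
import re
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_172).
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def detect_java_major(java="java"):
    """Run `<java> -version` and parse the result; None if java cannot be started."""
    try:
        result = subprocess.run(
            [java, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return None
    # The version banner is written to stderr, not stdout.
    return java_major_version(result.stderr)
```

Calling detect_java_major() from the same terminal you launch the script in shows what the gateway process is likely to pick up; a result other than 8 would be consistent with the answer above.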

Related

Not able to run simple pyflink word_count.py on aws emr

I have created an EMR cluster (v5.35.0) and am trying to run the sample word_count.py to verify that I can execute a Flink job.
I am able to use python3, as described in the question "How do you run pyflink scripts on AWS EMR?"
I submit the job from /usr/lib/flink on the master node with the following command:
flink run -m yarn-cluster --python examples/python/table/word_count.py
but I run into the following error:
Executing word_count example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Traceback (most recent call last):
File "examples/python/table/word_count.py", line 146, in <module>
word_count(known_args.input, known_args.output)
File "examples/python/table/word_count.py", line 121, in word_count
.execute_insert('sink') \
File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/table/table_result.py", line 76, in wait
File "/usr/lib/flink/opt/python/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/util/exceptions.py", line 146, in deco
File "/usr/lib/flink/opt/python/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.await.
: java.util.concurrent.ExecutionException: org.apache.flink.table.api.TableException: Failed to wait job finish
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.table.api.internal.TableResultImpl.awaitInternal(TableResultImpl.java:129)
at org.apache.flink.table.api.internal.TableResultImpl.await(TableResultImpl.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.flink.table.api.TableException: Failed to wait job finish
at org.apache.flink.table.api.internal.InsertResultIterator.hasNext(InsertResultIterator.java:56)
at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:370)
at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.isFirstRowReady(TableResultImpl.java:383)
at org.apache.flink.table.api.internal.TableResultImpl.lambda$awaitInternal$1(TableResultImpl.java:116)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8064c1bde7be5c84d7086c13da8cb82b)
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.table.api.internal.InsertResultIterator.hasNext(InsertResultIterator.java:54)
... 7 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8064c1bde7be5c84d7086c13da8cb82b)
at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:125)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$26(RestClusterClient.java:698)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
... 3 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:123)
... 24 more
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.io.IOException: Failed to execute the command: python3 -c import pyflink;import os;print(os.path.join(os.path.abspath(os.path.dirname(pyflink.__file__)), 'bin'))
output: Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pyflink'
at org.apache.flink.python.util.PythonEnvironmentManagerUtils.execute(PythonEnvironmentManagerUtils.java:211)
at org.apache.flink.python.util.PythonEnvironmentManagerUtils.getPythonUdfRunnerScript(PythonEnvironmentManagerUtils.java:154)
at org.apache.flink.python.env.beam.ProcessPythonEnvironmentManager.createEnvironment(ProcessPythonEnvironmentManager.java:156)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.createPythonExecutionEnvironment(BeamPythonFunctionRunner.java:395)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.lambda$open$0(BeamPythonFunctionRunner.java:243)
at org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:539)
at org.apache.flink.runtime.memory.SharedResources.createResource(SharedResources.java:126)
at org.apache.flink.runtime.memory.SharedResources.getOrAllocateSharedResource(SharedResources.java:72)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:555)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.open(BeamPythonFunctionRunner.java:246)
at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:131)
at org.apache.flink.table.runtime.operators.python.AbstractStatelessFunctionOperator.open(AbstractStatelessFunctionOperator.java:110)
at org.apache.flink.table.runtime.operators.python.table.PythonTableFunctionOperator.open(PythonTableFunctionOperator.java:113)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:110)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:711)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:687)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:750)
I have chosen Spark, Hadoop, Flink, Presto and ZooKeeper as the frameworks.
It works without a glitch if I use WordCount.jar, but it doesn't work for word_count.py.
I am not sure why it reports that the pyflink module is not found. As a last-ditch effort I also installed apache-flink again on the master node using pip, but the same error occurs:
pip install apache-flink==1.14
Any pointers would be helpful
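The innermost "Caused by" in the trace shows Flink running python3 -c "import pyflink; ..." and getting ModuleNotFoundError. Because that frame sits under Task.run, the command was probably executed by a task on a cluster node, which may not be the master node where pip install was run (this is an assumption, the question does not confirm it). A minimal sketch for repeating the same import check with a given interpreter on any node; can_import is a hypothetical helper name:

```python
import subprocess
import sys


def can_import(module, python="python3"):
    """Return True if `<python> -c "import <module>"` exits with status 0."""
    try:
        result = subprocess.run(
            [python, "-c", "import " + module],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError:
        # The interpreter itself could not be started.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Check the interpreter running this script; pass python="python3"
    # to test whatever python3 resolves to on the node instead.
    print("pyflink importable:", can_import("pyflink", sys.executable))
```

Running this on each core/task node (not only the master) would show whether python3 there can see pyflink.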

ClassNotFoundException loading data from snowflake with pyspark

I am getting this error when I try to load data from snowflake into a dataframe with pyspark:
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/internal/org/bouncycastle/jce/provider/BouncyCastleProvider
Here is some code to reproduce the error:
from pyspark.sql import SparkSession
from pyspark import SparkConf
conf = SparkConf()
conf.set('spark.jars.packages',
         'net.snowflake:spark-snowflake_2.11:2.8.4-spark_2.4,net.snowflake:snowflake-jdbc:3.12.17')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sf_reader_options = {'sfURL': 'example.snowflakecomputing.com', 'sfAccount': 'example_account',
                     'sfWarehouse': 'example_warehouse', 'sfRole': 'DATASCIENCE', 'sfUser': 'user',
                     'sfPassword': 'pass', 'sfDatabase': 'db_name', 'sfSchema': 'schema_name', 'sfTimezone': 'UTC'}
reader = (spark
          .read
          .format('net.snowflake.spark.snowflake')
          .options(**sf_reader_options))
result = reader.option('query', 'select * from TABLE_NAME').load()
The stacktrace for the error looks like this:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/internal/org/bouncycastle/jce/provider/BouncyCastleProvider
at net.snowflake.spark.snowflake.Parameters$.mergeParameters(Parameters.scala:202)
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:59)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: net.snowflake.client.jdbc.internal.org.bouncycastle.jce.provider.BouncyCastleProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 17 more
I am using Spark 2.4.7 and spark-snowflake 2.8.4, with snowflake-jdbc 3.12.17. I am on macOS Big Sur. This happened after I upgraded to Big Sur, though I'm not sure whether that's related.
I have tried:
adding bouncy castle provider to my configuration as a package dependency
checking that JAVA_HOME points to Java 8 (it does)
reinstalling java 8 (with homebrew and adoptopenjdk)
adding bouncy castle as a security provider, per instructions here
updating spark-snowflake and snowflake-jdbc (was using 2.7.0 and 3.12.3 before, same error)
Any help would be much appreciated!
Ultimately, I was able to resolve this by:
downloading Java straight from Oracle (rather than uninstalling and reinstalling with homebrew),
deleting spark, downloading again (from apache, not via homebrew), and setting up environment variables as described here (mostly... I use a virtual environment so I didn't hardcode PYSPARK_PYTHON to system python3)
uninstalling pyspark and reinstalling
quitting pycharm and reopening (this refreshed all my environment variables that were set in .zshrc, like JAVA_HOME)
There's almost certainly an easier way, but this worked.
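Since restarting PyCharm was what refreshed the environment variables, it can be worth checking what the Python process actually sees before blaming the Java or Spark install. A small sketch under that assumption; describe_env is a made-up helper, and the variable names are the ones mentioned in this thread:

```python
import os


def describe_env(names, environ=None):
    """Map each variable name to its value, or None when it is not set."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


if __name__ == "__main__":
    # Run this from the same interpreter/IDE run configuration as the failing job.
    for name, value in describe_env(["JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON"]).items():
        print(name, "=", value if value is not None else "<not set>")
```

If the values printed inside PyCharm differ from those in a fresh terminal, the IDE is still running with a stale environment.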

java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

I'm using PyCharm 2019.1 and Python 3.7 (in Project Interpreter).
In PyCharm, I've added PySpark 2.4.2.
When I run the following code (to create a Spark DataFrame), I get this error:
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
....
Exception: Java gateway process exited before sending its port number
From other SO questions, it seems to be related to a version mismatch.
The question is how to resolve it.
My $SPARK_HOME points to Apache Spark 2.2.0.
When I try to install pyspark 2.2.0 in PyCharm, it gives this error:
Collecting pyspark==2.2.0
Could not find a version that satisfies the requirement pyspark==2.2.0 (from versions: 2.1.2, 2.1.3, 2.2.0.post0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3)
No matching distribution found for pyspark==2.2.0
Any ideas on how to fix this?
CODE ->
from pyspark.sql import SparkSession
d = {'a':1, 'b':2, 'c':3}
spark = SparkSession.builder.master("local").appName("CreatingDF").getOrCreate()
pandaDF = spark.createDataFrame(d)
print(pandaDF)
ERROR ->
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/06 23:21:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/Users/karanalang/PycharmProjects/PythonFalcon/FalconIncremental/python_createDF2.py", line 28, in <module>
spark = SparkSession.builder.master("local").appName("CreatingDF").getOrCreate()
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Yes, it's a versioning issue. Check the Python version in your command prompt/terminal. If the default Python version is 2.7 and PyCharm is pointing to Python 3.7 in its interpreter, then it should work.
Mostly Anaconda3 and onwards cause this issue.
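The question itself mentions a second mismatch: the pip-installed pyspark is 2.4.2 while $SPARK_HOME points to Spark 2.2.0. The pip output lists 2.2.0.post0 rather than 2.2.0, so that may be the release to try. Below is a rough sketch for comparing the two versions; the helper names are hypothetical, and treating "same major.minor" as compatible is my assumption, not something Spark documents here.

```python
import re


def major_minor(version):
    """Extract (major, minor) from strings such as '2.4.2' or '2.2.0.post0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


def versions_compatible(pyspark_version, spark_version):
    """Assumption: the Python package and the Spark install should share major.minor."""
    return major_minor(pyspark_version) == major_minor(spark_version)


# In a real session the first argument would come from pyspark.__version__;
# literal values from the question are used here so the sketch runs without Spark.
print(versions_compatible("2.4.2", "2.2.0"))
print(versions_compatible("2.2.0.post0", "2.2.0"))
```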

pyspark SparkContext issue "Another SparkContext is being constructed"

I installed Spark on my EC2 instance following this tutorial:
https://sparkour.urizone.net/recipes/installing-ec2/#03
but when I try to start pyspark shell, I get this error:
"Another SparkContext is being constructed"
Here is the full exception:
[ec2-user@ip-10-0-0-153 ~]$ pyspark
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 11:46:16 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 54, in <module>
spark = SparkSession.builder.getOrCreate()
File "/opt/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/spark/python/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/spark/python/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/opt/spark/python/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/spark/python/pyspark/context.py", line 273, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.internal.config.package$
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
I googled a lot and tried everything with no solution. I used this code to get a list of all running Contexts:
>>> from pyspark import SparkConf
>>> conf = SparkConf()
>>> conf.getAll()
And I got this:
[(u'spark.master', u'local[*]'), (u'spark.submit.deployMode', u'client'), (u'spark.app.name', u'PySparkShell')]
Any ideas how I can solve this issue?
I encountered the same error while trying to run PySpark in a Jupyter Notebook (on macOS). The problem seems to be related to a Java version incompatibility: it only works with Java 8. I fixed the issue by changing the Java version:
Check the installed Java (JVM) versions. If you don't have Java 8 installed, install it following the instructions for your OS.
/usr/libexec/java_home -V
Change the version to Java 8 (it is enough to do this for the current session):
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
Check:
java -version
Run PySpark. That should solve the issue.
I solved the problem by setting SPARK_MASTER_HOST=127.0.0.1 in the spark-env.sh file.
Navigate to the Spark config folder:
cd $SPARK_HOME/conf
Open the spark-env.sh file in an editor (if this file does not exist, copy spark-env.sh.template to spark-env.sh first):
vi spark-env.sh
Set the SPARK_LOCAL_IP value to your machine's IP (e.g. 205.210.42.205):
SPARK_LOCAL_IP="205.210.42.205"
Hope this helps.

Pyspark Streaming with Kafka in PyCharm

I have recently been trying to debug the pyspark.streaming.kafka module in PyCharm, since that is easier to troubleshoot than working on the Linux box.
Here is my sample code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
sc = SparkContext(appName="sample app")
ssc = StreamingContext(sc, 1)
kafkaParams = {"metadata.broker.list": "{broker list}",
               "auto.offset.reset": "smallest"}
kafka_stream = KafkaUtils.createDirectStream(ssc, {topic list}, kafkaParams)
However, I got the error below:
Traceback (most recent call last):
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.3\helpers\pydev\pydevd.py", line 2411, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files (x86)\JetBrains\PyCharm 5.0.3\helpers\pydev\pydevd.py", line 1802, in run
launch(file, globals, locals) # execute the script
File "{script path}", line 30, in <module> {topic}], kafkaParams)
File "C:\spark-1.6.0-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\streaming\kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o20.loadClass.
: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Unknown Source)
16/02/22 11:45:49 INFO SparkContext: Invoking stop() from shutdown hook
I would appreciate it if someone could provide some guidance on how to debug the PySpark Kafka streaming module in PyCharm.
Kafka support depends on the external spark-streaming-kafka JAR, which is not shipped with the Spark binaries. Typically it is specified on submit with the --packages argument.
For local development using PyCharm, the simplest solution I can think of is to add it to $SPARK_HOME/conf/spark-defaults.conf. Assuming you use Spark 1.6.0 built with Scala 2.10:
spark.jars.packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0
Keep in mind that you won't be able to use the PyCharm debugger with the Python worker process. See "How can pyspark be called in debug mode?"
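If you would rather not edit spark-defaults.conf, the same --packages argument can, as far as I know, also be passed through the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM gateway. The sketch below assumes the same Spark 1.6.0 / Scala 2.10 coordinates as above; build_submit_args is a hypothetical helper, and the trailing pyspark-shell token is needed for the arguments to be accepted.

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls in the given Maven coordinates."""
    return "--packages {} pyspark-shell".format(",".join(packages))


# Same coordinates as the spark-defaults.conf line above (Spark 1.6.0, Scala 2.10).
KAFKA_PACKAGE = "org.apache.spark:spark-streaming-kafka_2.10:1.6.0"

# This must be set before the first SparkContext is created,
# because the JVM gateway is only launched once per Python process.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([KAFKA_PACKAGE])
```

Put these lines at the top of the script, above the SparkContext(appName="sample app") call. Setting the variable in the PyCharm run configuration instead should have the same effect.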
