Cannot load a saved Spark model in pyspark: "java.lang.NoSuchMethodException" - apache-spark

When I run the following Python program
from pyspark.ml.classification import LinearSVC
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sparkmodel").getOrCreate()
data = spark.read.format("libsvm").load("/usr/local/spark/data/mllib/sample_libsvm_data.txt")
model = LinearSVC().fit(data)
model.save("mymodel")
LinearSVC.load("mymodel")
the load fails with a "java.lang.NoSuchMethodException".
/anaconda3/envs/scratch/bin/python /Users/billmcn/src/toy/sparkmodel/sparkmodel/little.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/12 13:23:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/12 13:23:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/11/12 13:23:06 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/11/12 13:23:17 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
17/11/12 13:23:17 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Traceback (most recent call last):
File "/Users/billmcn/src/toy/sparkmodel/sparkmodel/little.py", line 9, in <module>
LinearSVC.load("mymodel")
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/pyspark/ml/util.py", line 257, in load
return cls.read().load(path)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/pyspark/ml/util.py", line 197, in load
java_obj = self._jread.load(path)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.
: java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LinearSVCModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:328)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Process finished with exit code 1
The "mymodel" directory is created and its contents appear to be valid.
I am running Spark 2.2.0 and pyspark 2.2.0. I have the following mllib jars in my installation.
> ll /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib*
-rw-r--r--# 1 billmcn admin 6501535 Jun 30 18:09 /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib_2.11-2.2.0.jar
-rw-r--r--# 1 billmcn admin 182887 Jun 30 18:09 /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib-local_2.11-2.2.0.jar
And the latter contains the class I want.
jar tf /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib_2.11-2.2.0.jar | grep LinearSVCModel
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter$Data.class
org/apache/spark/ml/classification/LinearSVCModel$.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCReader.class
org/apache/spark/ml/classification/LinearSVCModel$$anonfun$11.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter$$typecreator1$1.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter$Data$.class
org/apache/spark/ml/classification/LinearSVCModel.class
The same problem happens on two different machines.
What am I doing wrong?

I was using the wrong class to load the module. The following works
model = LinearSVCModel.load(model_path)

This looks like a version mismatch. The most likely scenario is:
Your Python (PySpark) installation uses Spark 2.2
While JVM jars have been compiled with earlier Spark version, which didn't include LinearSVCModel.

Related

java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

i'm using PyCharm 2019.1, and Python 3.7 (in Project Interpreter)
On PyCharm, i've added Pyspark 2.4.2
when i run the following code (to create a Spark DataFrame), i get error
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
....
Exception: Java gateway process exited before sending its port number
from the other SO issues, it seems that it is related to version mismatch,
question is how to resolve this
my $SPARK_HOME points to Apache Spark 2.2.0,
when i try to install 2.2.0 on Pycharm, it gives error
Collecting pyspark==2.2.0
Could not find a version that satisfies the requirement pyspark==2.2.0 (from versions: 2.1.2, 2.1.3, 2.2.0.post0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3)
No matching distribution found for pyspark==2.2.0
Any ideas on how to fix this ?
CODE ->
from pyspark.sql import SparkSession
d = {'a':1, 'b':2, 'c':3}
spark = SparkSession.builder.master("local").appName("CreatingDF").getOrCreate()
pandaDF = spark.createDataFrame(d)
print(pandaDF)
ERROR ->
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/06 23:21:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/Users/karanalang/PycharmProjects/PythonFalcon/FalconIncremental/python_createDF2.py", line 28, in <module>
spark = SparkSession.builder.master("local").appName("CreatingDF").getOrCreate()
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Yes.It's versioning issue. Verify your python version on your command prompt/terminal. If default python version is 2.7 and pyCharm is pointing to python3.7 in it's interpreter then it should work.
mostly Anaconda3 and onwards cause this issue.

Unable to Load Logistic Regression Model in Spark 2.x

I am trying save and load options available in Spark 2.x version. I built a LogisticRegression model and saved the model successfully. But while loading the model, facing the following issue
Code snippet:
from pyspark.ml.classification import LogisticRegressionModel
LogisticRegressionModel.load("lrmodel")
Error Message:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
18/10/03 16:26:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/03 16:26:20 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
Traceback (most recent call last):
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
: java.lang.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.LogisticRegressionModel but found class name org.apache.spark.ml.PipelineModel
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:404)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:383)
at org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelReader.load(LogisticRegression.scala:1197)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.base/java.lang.Thread.run(Thread.java:844)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Volumes/Data/Innominds/WorkSpace/SparkIncrementalLearning/src/PipilineBasedModelling.py", line 59, in <module>
loadAndRetrainModel(spark)
File "/Volumes/Data/Innominds/WorkSpace/SparkIncrementalLearning/src/PipilineBasedModelling.py", line 51, in loadAndRetrainModel
LogisticRegressionModel.load("lrmodel")
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/ml/util.py", line 257, in load
return cls.read().load(path)
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/ml/util.py", line 197, in load
java_obj = self._jread.load(path)
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.LogisticRegressionModel but found class name org.apache.spark.ml.PipelineModel'
Am I missing anything here?
That's because your model is not a LogisticRegressionModel. If you read the stracktrace you'll see this particular line (emphasis mine):
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.LogisticRegressionModel but found class name org.apache.spark.ml.PipelineModel'
Therefore you should use PipelineModel
from pyspark.ml import PipelineModel
PipelineModel.load("lrmodel")

pyspark SparkContext issue "Another SparkContext is being constructed"

I installed Spark on my EC2 instance following this tutorial:
https://sparkour.urizone.net/recipes/installing-ec2/#03
but when I try to start pyspark shell, I get this error:
"Another SparkContext is being constructed"
Here is the full exception:
[ec2-user#ip-10-0-0-153 ~]$ pyspark
Python 2.7.12 (default, Sep 1 2016, 22:14:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 11:46:16 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 54, in <module>
spark = SparkSession.builder.getOrCreate()
File "/opt/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/spark/python/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/spark/python/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/opt/spark/python/pyspark/context.py", line 180, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/spark/python/pyspark/context.py", line 273, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.internal.config.package$
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
I googled a lot and tried everything with no solution. I used this code to get a list of all running Contexts:
>>> from pyspark import SparkConf
>>> conf = SparkConf()
>>> conf.getAll()
And I got this:
[(u'spark.master', u'local[*]'), (u'spark.submit.deployMode', u'client'), (u'spark.app.name', u'PySparkShell')]
Any ideas how can I solve this issue?
I encountered the same error while trying to run PySpark in Jupyter Notebook (on macOS). The problem seems related to a Java version incompatibility and it only works with Java 8. I fixed the issue by changing the java version:
check java (JVM) installation versions. If you don't have java 8 installed, install it following instructions for your OS.
/usr/libexec/java_home -V
change the version to Java 8 (it is enough to do it for the current session)
export JAVA_HOME=/usr/libexec/java_home -v 1.8....
check:
java -version
run PySpark. that should solve the issue.
I solved the problem by setting the SPARK_MASTER_HOST=127.0.0.1 in spark-env.sh file
Navigate to Spark config folder:
cd $SPARK_HOME/conf
Open spark-env.sh file using an editor (is this file does not exist copy the spark-env-template file):
vi spark-env.sh
edit the SPARK_LOCAL_IP value to your machine IP (e.g. 205.210.42.205) :
SPARK_LOCAL_IP="205.210.42.205"
Hope this helps.

Adding S3DistCp to PySpark

I'm trying to add S3DistCp to my local, standalone Spark install. I've downloaded S3DistCp:
aws s3 cp s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar .
And the AWS SDK as well:
wget http://sdk-for-java.amazonwebservices.com/latest/aws-java-sdk.zip
I extracted the AWS SDK:
unzip aws-java-sdk.zip
Then added s3distcp.jar to my spark-defaults.conf:
spark.driver.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp.jar
spark.executor.extraClassPath /Users/mark.miller/.ivy2/jars/s3distcp.jar
Then I added the AWS SDK and all it's dependencies to $LIBJARS and $HADOOP_CLASSPATH
export $LIBJARS=/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/aspectjrt-1.8.2.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/aspectjweaver.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/commons-codec-1.9.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/commons-logging-1.1.3.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/freemarker-2.3.9.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/httpclient-4.5.2.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/httpcore-4.4.4.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/ion-java-1.0.1.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-annotations-2.6.0.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-core-2.6.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-databind-2.6.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jackson-dataformat-cbor-2.6.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/javax.mail-api-1.4.6.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/jmespath-java-1.11.86.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/joda-time-2.8.1.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/json-path-2.2.0.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/slf4j-api-1.7.16.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-beans-3.0.7.RELEASE.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-context-3.0.7.RELEASE.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-core-3.0.7.RELEASE.jar,/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/spring-test-3.0.7.RELEASE.jar
export HADOOP_CLASSPATH=$LIBJARS
But when I try to start the pyspark shell:
$ pyspark
I get the following error:
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/06 17:48:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/06 17:48:50 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/shell.py", line 47, in <module>
spark = SparkSession.builder.getOrCreate()
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 307, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 179, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/context.py", line 246, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.common.cache.LocalCache
at com.google.common.cache.LocalCache$LocalLoadingCache.<init>(LocalCache.java:4867)
at com.google.common.cache.CacheBuilder.build(CacheBuilder.java:785)
at org.apache.hadoop.security.Groups.<init>(Groups.java:101)
at org.apache.hadoop.security.Groups.<init>(Groups.java:74)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:303)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:284)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2373)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2373)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2373)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
If I remove s3distcp.jar from spark-defaults.conf the error goes away. There just doesn't seem to be much documentation on how to deploy this since it's provided as part of EMR.
I was able to get this working by passing --driver-class-path to pyspark:
$ pyspark \
--driver-class-path \
~/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar:\
~/Downloads/aws-java-sdk-1.11.86/third-party/lib/*
To set this up in spark-defaults.conf I had to do it this way:
spark.jars /Users/mark.miller/.ivy2/jars/s3distcp.jar
spark.driver.extraClassPath /Users/mark.miller/Downloads/aws-java-sdk-1.11.86/lib/aws-java-sdk-1.11.86.jar:/Users/mark.miller/Downloads/aws-java-sdk-1.11.86/third-party/lib/*
I also learned that spark.executor.extraClassPath is only needed for backwards-compatibility with older versions of spark (https://spark.apache.org/docs/latest/configuration.html#runtime-environment)

PySpark in Pycharm- unable to connect to remote server

Use Case: I want to use my laptop (using Win 7 Professional) to connect to the CentOS 6.4 master server using PyCharm.
Objective: To write the code in Pycharm on the laptop and then send the job to the server which will do the processing and should then return the result back to the laptop or to any other visualizing API.
The server and 3 namenodes already installed with pyspark and I have checked pyspark works in standalone mode on all four servers. Pyspark works in standalone mode on my laptop too.
I use the following code but I am not able to connect to the remote server.
import os
import sys
try:
from pyspark import SparkContext
from pyspark import SparkConf
print ("Pyspark sucess")
except ImportError as e:
print ("Error importing Spark Modules", e)
conf = SparkConf()
conf.setMaster("spark://10.210.250.400:7077")
conf.setAppName("First_Remote_Spark_Program")
sc = SparkContext(conf=conf)
print ("connection succeeded with Master",conf)
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData)
The stack trace of error is
Pyspark sucess
15/08/01 14:08:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/01 14:08:24 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2162)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2162)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2162)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:301)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
15/08/01 14:08:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/08/01 14:08:26 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#10.210.250.400:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#10.210.250.400:7077
15/08/01 14:08:26 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#10.210.250.400:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: /10.210.250.400:7077
15/08/01 14:08:46 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#10.210.250.400:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#10.210.250.400:7077
15/08/01 14:08:46 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#10.210.250.400:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: /10.210.250.400:7077
15/08/01 14:09:06 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#10.210.250.400:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#10.210.250.400:7077
15/08/01 14:09:06 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#10.210.250.400:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: /10.210.250.400:7077
15/08/01 14:09:25 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/08/01 14:09:25 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
15/08/01 14:09:25 ERROR OneForOneStrategy:
java.lang.NullPointerException
at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/08/01 14:09:25 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "C:/Users/ashish dutt/PycharmProjects/KafkaToHDFS/local2Remote.py", line 26, in <module>
sc = SparkContext(conf=conf)
File "C:\spark-1.4.0\python\pyspark\context.py", line 113, in __init__
conf, jsc, profiler_cls)
File "C:\spark-1.4.0\python\pyspark\context.py", line 165, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\spark-1.4.0\python\pyspark\context.py", line 219, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\spark-1.4.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 701, in __call__
File "C:\spark-1.4.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1501)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2005)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Process finished with exit code 1
The spark-defaults.conf file is configured as follows
#spark.eventLog.dir=hdfs://ABCD01:8020/user/spark/applicationHistory
spark.eventLog.dir hdfs://10.210.250.400:8020/user/spark/eventlog
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.yarn.historyServer.address http://ABCD04:18088
spark.master spark://10.210.250.400:7077
spark.yarn.jar local:/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
spark.driver.extraLibraryPath /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop/lib/native
spark.executor.extraLibraryPath /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop/lib/native
spark.logConf true
The spark-env.sh file is configured as follows
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##
SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
export SPARK_CONF_DIR="$SELF"
fi
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop
#export STANDALONE_SPARK_MASTER_HOST=`ABCD01`
export SPARK_MASTER_IP=spark://10.210.250.400
export SPARK_MASTER_PORT=7077
export SPARK_WEBUI_PORT=18080
### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-''}
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n "$HADOOP_HOME" ]; then
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
SPARK_EXTRA_LIB_PATH=""
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi
export LD_LIBRARY_PATH
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
# This is needed to support old CDH versions that use a forked version
# of compute-classpath.sh.
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
# Set distribution classpath. This is only used in CDH 5.3 and later.
export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt")
And the slaves.sh file is configured as
10.210.250.401
10.210.250.402
10.210.250.403
Please tell me how can I connect to the remote server.
The problem is that Spark requires some elements of a Hadoop distribution to be present on Windows in order to work. Spark-env.sh won't help you as it is a shell script not executed on Windows. I think the solution you need is already covered here submit .py script on Spark without Hadoop installation

Resources