Unable to Load Logistic Regression Model in Spark 2.x - apache-spark

I am trying out the save and load options available in Spark 2.x. I built a LogisticRegression model and saved it successfully, but loading the model fails with the following error.
Code snippet:
from pyspark.ml.classification import LogisticRegressionModel
LogisticRegressionModel.load("lrmodel")
Error Message:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
18/10/03 16:26:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/03 16:26:20 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
Traceback (most recent call last):
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
: java.lang.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.LogisticRegressionModel but found class name org.apache.spark.ml.PipelineModel
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:404)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:383)
at org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelReader.load(LogisticRegression.scala:1197)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.base/java.lang.Thread.run(Thread.java:844)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Volumes/Data/Innominds/WorkSpace/SparkIncrementalLearning/src/PipilineBasedModelling.py", line 59, in <module>
loadAndRetrainModel(spark)
File "/Volumes/Data/Innominds/WorkSpace/SparkIncrementalLearning/src/PipilineBasedModelling.py", line 51, in loadAndRetrainModel
LogisticRegressionModel.load("lrmodel")
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/ml/util.py", line 257, in load
return cls.read().load(path)
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/ml/util.py", line 197, in load
java_obj = self._jread.load(path)
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/Volumes/Data/Innominds/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.LogisticRegressionModel but found class name org.apache.spark.ml.PipelineModel'
Am I missing anything here?

That's because your model is not a LogisticRegressionModel. If you read the stack trace, you'll see this particular line:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.classification.LogisticRegressionModel but found class name org.apache.spark.ml.PipelineModel'
Therefore you should use PipelineModel to load it:
from pyspark.ml import PipelineModel
PipelineModel.load("lrmodel")
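If it is unclear which loader a saved directory needs, the class name can be read from the model's metadata before calling load. The following is a minimal sketch, assuming the directory follows the layout written by Spark ML's default writer (a metadata/ subdirectory whose part files hold one JSON object per line with a "class" field). saved_model_class is a hypothetical helper name, not part of pyspark.

```python
import glob
import json
import os


def saved_model_class(path):
    """Return the JVM class name recorded in a saved Spark ML model's metadata.

    Assumption: <path>/metadata/part-* holds a JSON line with a "class" key.
    """
    part_files = sorted(glob.glob(os.path.join(path, "metadata", "part-*")))
    if not part_files:
        raise FileNotFoundError("no metadata part files under %r" % path)
    for part in part_files:
        with open(part) as handle:
            for line in handle:
                if line.strip():
                    # The first non-empty line is taken as the metadata record.
                    return json.loads(line)["class"]
    raise ValueError("metadata under %r is empty" % path)
```

For the "lrmodel" directory in the question, this would be expected to return org.apache.spark.ml.PipelineModel, matching the error message. The fitted LogisticRegressionModel should then be one of the entries in the stages list of the loaded PipelineModel.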

Related

Not able to run simple pyflink word_count.py on aws emr

I have created an EMR cluster (v5.35.0) and am trying to run a sample word_count.py to verify if I am able to execute a flink job.
I am able to use python3 as mentioned in the question "How do you run pyflink scripts on AWS EMR?"
I use the command below to submit the job from /usr/lib/flink on the master node:
flink run -m yarn-cluster --python examples/python/table/word_count.py
but I run into the following error:
Executing word_count example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Traceback (most recent call last):
File "examples/python/table/word_count.py", line 146, in <module>
word_count(known_args.input, known_args.output)
File "examples/python/table/word_count.py", line 121, in word_count
.execute_insert('sink') \
File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/table/table_result.py", line 76, in wait
File "/usr/lib/flink/opt/python/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
File "/usr/lib/flink/opt/python/pyflink.zip/pyflink/util/exceptions.py", line 146, in deco
File "/usr/lib/flink/opt/python/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.await.
: java.util.concurrent.ExecutionException: org.apache.flink.table.api.TableException: Failed to wait job finish
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.table.api.internal.TableResultImpl.awaitInternal(TableResultImpl.java:129)
at org.apache.flink.table.api.internal.TableResultImpl.await(TableResultImpl.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282)
at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79)
at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.flink.table.api.TableException: Failed to wait job finish
at org.apache.flink.table.api.internal.InsertResultIterator.hasNext(InsertResultIterator.java:56)
at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:370)
at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.isFirstRowReady(TableResultImpl.java:383)
at org.apache.flink.table.api.internal.TableResultImpl.lambda$awaitInternal$1(TableResultImpl.java:116)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8064c1bde7be5c84d7086c13da8cb82b)
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.table.api.internal.InsertResultIterator.hasNext(InsertResultIterator.java:54)
... 7 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8064c1bde7be5c84d7086c13da8cb82b)
at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:125)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$26(RestClusterClient.java:698)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
... 3 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:123)
... 24 more
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.io.IOException: Failed to execute the command: python3 -c import pyflink;import os;print(os.path.join(os.path.abspath(os.path.dirname(pyflink.__file__)), 'bin'))
output: Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pyflink'
at org.apache.flink.python.util.PythonEnvironmentManagerUtils.execute(PythonEnvironmentManagerUtils.java:211)
at org.apache.flink.python.util.PythonEnvironmentManagerUtils.getPythonUdfRunnerScript(PythonEnvironmentManagerUtils.java:154)
at org.apache.flink.python.env.beam.ProcessPythonEnvironmentManager.createEnvironment(ProcessPythonEnvironmentManager.java:156)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.createPythonExecutionEnvironment(BeamPythonFunctionRunner.java:395)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.lambda$open$0(BeamPythonFunctionRunner.java:243)
at org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:539)
at org.apache.flink.runtime.memory.SharedResources.createResource(SharedResources.java:126)
at org.apache.flink.runtime.memory.SharedResources.getOrAllocateSharedResource(SharedResources.java:72)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:555)
at org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.open(BeamPythonFunctionRunner.java:246)
at org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:131)
at org.apache.flink.table.runtime.operators.python.AbstractStatelessFunctionOperator.open(AbstractStatelessFunctionOperator.java:110)
at org.apache.flink.table.runtime.operators.python.table.PythonTableFunctionOperator.open(PythonTableFunctionOperator.java:113)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:110)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:711)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(StreamTaskActionExecutor.java:100)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:687)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:654)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:750)
I have chosen Spark, Hadoop, Flink, Presto, and ZooKeeper as the frameworks.
It works without a glitch if I use WordCount.jar, but it doesn't work for word_count.py.
I am not sure why it reports that the pyflink module is not found. As a last-ditch effort I also reinstalled Apache Flink on the master node using pip, but the same error occurs:
pip install apache-flink==1.14
Any pointers would be helpful
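The innermost cause in the trace is the command python3 -c import pyflink failing inside a running task (Task.run). That suggests the python3 interpreter on the node executing that task cannot import pyflink, whatever is installed on the master. Below is a minimal sketch of the same check, meant to be run with the same python3 on each node; module_available is a hypothetical helper name.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Report which interpreter ran the check and whether it can see pyflink.
    print(sys.executable, "pyflink available:", module_available("pyflink"))
```

If this prints False on a worker node, the interpreter there is the one missing the package.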

ClassNotFoundException loading data from snowflake with pyspark

I am getting this error when I try to load data from snowflake into a dataframe with pyspark:
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/internal/org/bouncycastle/jce/provider/BouncyCastleProvider
Here is some code to reproduce the error:
from pyspark.sql import SparkSession
from pyspark import SparkConf
conf = SparkConf()
conf.set('spark.jars.packages',
'net.snowflake:spark-snowflake_2.11:2.8.4-spark_2.4,net.snowflake:snowflake-jdbc:3.12.17')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sf_reader_options = {'sfURL': 'example.snowflakecomputing.com', 'sfAccount': 'example_account',
'sfWarehouse': 'example_warehouse', 'sfRole': 'DATASCIENCE', 'sfUser': 'user',
'sfPassword': 'pass', 'sfDatabase': 'db_name', 'sfSchema': 'schema_name', 'sfTimezone': 'UTC'}
reader = (spark
.read
.format('net.snowflake.spark.snowflake')
.options(**sf_reader_options))
result = reader.option('query', 'select * from TABLE_NAME').load()
The stacktrace for the error looks like this:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/charlie/lark/bigbird/venv/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.NoClassDefFoundError: net/snowflake/client/jdbc/internal/org/bouncycastle/jce/provider/BouncyCastleProvider
at net.snowflake.spark.snowflake.Parameters$.mergeParameters(Parameters.scala:202)
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:59)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: net.snowflake.client.jdbc.internal.org.bouncycastle.jce.provider.BouncyCastleProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 17 more
I am using Spark 2.4.7 and spark-snowflake 2.8.4, with Snowflake JDBC 3.12.17. I am on macOS Big Sur. This started after I upgraded to Big Sur, though I'm not sure whether that's related.
I have tried:
adding bouncy castle provider to my configuration as a package dependency
checking that JAVA_HOME points to Java 8 (it does)
reinstalling java 8 (with homebrew and adoptopenjdk)
adding bouncy castle as a security provider, per instructions here
updating spark-snowflake and snowflake-jdbc (was using 2.7.0 and 3.12.3 before, same error)
Any help would be much appreciated!
Ultimately, I was able to resolve this by:
downloading Java straight from Oracle (rather than uninstalling and reinstalling with homebrew),
deleting spark, downloading again (from apache, not via homebrew), and setting up environment variables as described here (mostly... I use a virtual environment so I didn't hardcode PYSPARK_PYTHON to system python3)
uninstalling pyspark and reinstalling
quitting pycharm and reopening (this refreshed all my environment variables that were set in .zshrc, like JAVA_HOME)
There's almost certainly an easier way, but this worked.
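Since part of the fix was getting refreshed environment variables into the IDE, it may help to print what the Python process that launches Spark actually sees. This is a minimal sketch: the variable names checked are assumptions based on the ones mentioned above (JAVA_HOME, PYSPARK_PYTHON) plus SPARK_HOME, and spark_env_report is a hypothetical helper.

```python
import os


def spark_env_report(environ=None, names=("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON")):
    """Map each variable name to its value, or None when it is unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


if __name__ == "__main__":
    # Run this from the IDE and from a terminal to compare the two environments.
    for name, value in spark_env_report().items():
        print("%s=%s" % (name, value if value is not None else "<unset>"))
```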

java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

I'm using PyCharm 2019.1 and Python 3.7 (as the Project Interpreter).
In PyCharm, I've added PySpark 2.4.2.
When I run the following code (to create a Spark DataFrame), I get this error:
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
....
Exception: Java gateway process exited before sending its port number
From other SO questions, it seems to be related to a version mismatch.
The question is how to resolve it.
My $SPARK_HOME points to Apache Spark 2.2.0.
When I try to install 2.2.0 in PyCharm, it gives this error:
Collecting pyspark==2.2.0
Could not find a version that satisfies the requirement pyspark==2.2.0 (from versions: 2.1.2, 2.1.3, 2.2.0.post0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3)
No matching distribution found for pyspark==2.2.0
Any ideas on how to fix this?
CODE ->
from pyspark.sql import SparkSession
d = {'a':1, 'b':2, 'c':3}
spark = SparkSession.builder.master("local").appName("CreatingDF").getOrCreate()
pandaDF = spark.createDataFrame(d)
print(pandaDF)
ERROR ->
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/06 23:21:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/Users/karanalang/PycharmProjects/PythonFalcon/FalconIncremental/python_createDF2.py", line 28, in <module>
spark = SparkSession.builder.master("local").appName("CreatingDF").getOrCreate()
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/Users/karanalang/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Yes, it's a versioning issue. Verify your Python version at your command prompt/terminal. If the default Python version is 2.7 and PyCharm is pointing to Python 3.7 in its interpreter, then it should work.
This issue mostly occurs with Anaconda3 and later.
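The question reports PySpark 2.4.2 installed in PyCharm while $SPARK_HOME points to Spark 2.2.0. One way to make that mismatch explicit is to compare the major.minor parts of the two versions. This is a minimal sketch: same_minor_version is a hypothetical helper, and treating major.minor agreement as sufficient is an assumption rather than a documented compatibility rule. Note that the pip output in the question lists 2.2.0.post0 rather than 2.2.0.

```python
def major_minor(version):
    """Return the (major, minor) integers from a version such as '2.2.0.post0'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_minor_version(pyspark_version, spark_home_version):
    """True when both version strings share the same major and minor numbers."""
    return major_minor(pyspark_version) == major_minor(spark_home_version)
```

The installed package version is available as pyspark.__version__, so same_minor_version(pyspark.__version__, "2.2.0") could serve as a quick check before creating a session.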

Load Spark Dataframe into ElasticSearch using PySpark [duplicate]

I have a Spark DataFrame that I am trying to push to AWS Elasticsearch, but before that I was testing this sample code snippet to push to ES:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ES_indexer').getOrCreate()
df = spark.createDataFrame([{'num': i} for i in xrange(10)])
df = df.drop('_id')
df.write.format(
'org.elasticsearch.spark.sql'
).option(
'es.nodes', 'http://spark-data-push-adertadaltdpioy124.us-west-2.es.amazonaws.com'
).option(
'es.port', 9200
).option(
'es.resource', '%s/%s' % ('index_name', 'doc_type_name'),
).save()
I get an error saying:
java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions would be greatly appreciated.
Error Trace:
Traceback (most recent call last):
File "es_3.py", line 12, in <module>
'es.resource', '%s/%s' % ('index_name', 'doc_type_name'),
File "/usr/local/lib/python2.7/site-packages/pyspark/sql/readwriter.py", line 732, in save
self._jwrite.save()
File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python2.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o46.save.
: java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.spark.sql.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 12 more
tl;dr Use pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.2.0 and use format("es") to reference the connector.
Quoting Installation from the official documentation of the Elasticsearch for Apache Hadoop product:
Just like other libraries, elasticsearch-hadoop needs to be available in Spark’s classpath.
And later in Supported Spark SQL versions:
elasticsearch-hadoop supports both version Spark SQL 1.3-1.6 and Spark SQL 2.0 through two different jars: elasticsearch-spark-1.x-<version>.jar and elasticsearch-hadoop-<version>.jar
elasticsearch-spark-2.0-<version>.jar supports Spark SQL 2.0
That looks like an issue with the documentation (as it names two different versions of the jar file), but it does mean that you have to use the proper jar file on the CLASSPATH of your Spark application.
And later in the same document:
Spark SQL support is available under org.elasticsearch.spark.sql package.
That simply says that the format (in df.write.format('org.elasticsearch.spark.sql')) is correct.
Further down the document you can find that you could even use an alias df.write.format("es") (!)
I found the Apache Spark section in the project's repository on GitHub more readable and current.
Update: The current ES-hadoop package as of June 2020 is 7.7.1, so I used pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.7.1 instead.
You have to specify the version of your Elasticsearch database at the end of the package coordinate, e.g. --packages org.elasticsearch:elasticsearch-hadoop:(version). In my case it was org.elasticsearch:elasticsearch-hadoop:7.0.0.
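The --packages flag can also be expressed in code through the spark.jars.packages setting. The following is a minimal sketch, assuming pyspark is installed and the connector version is chosen to match the Elasticsearch cluster. es_package and write_to_es are hypothetical helper names, and the nodes and resource arguments are placeholders to be supplied by the caller.

```python
def es_package(version):
    """Build the Maven coordinate for the elasticsearch-hadoop connector."""
    return "org.elasticsearch:elasticsearch-hadoop:%s" % version


def write_to_es(rows, nodes, resource, version):
    """Write rows to Elasticsearch through the connector's "es" format alias."""
    # Imported here so es_package stays usable without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("ES_indexer")
        # Has to be set before the session (and its JVM) is first created.
        .config("spark.jars.packages", es_package(version))
        .getOrCreate()
    )
    df = spark.createDataFrame(rows)
    (
        df.write.format("es")
        .option("es.nodes", nodes)
        .option("es.resource", resource)
        .save()
    )
```

If a SparkSession already exists in the process, getOrCreate returns it and the package setting is unlikely to take effect, so this is best called at the start of a fresh application.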

Cannot load a saved Spark model in pyspark: "java.lang.NoSuchMethodException"

When I run the following Python program
from pyspark.ml.classification import LinearSVC
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sparkmodel").getOrCreate()
data = spark.read.format("libsvm").load("/usr/local/spark/data/mllib/sample_libsvm_data.txt")
model = LinearSVC().fit(data)
model.save("mymodel")
LinearSVC.load("mymodel")
the load fails with a "java.lang.NoSuchMethodException".
/anaconda3/envs/scratch/bin/python /Users/billmcn/src/toy/sparkmodel/sparkmodel/little.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/12 13:23:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/12 13:23:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/11/12 13:23:06 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/11/12 13:23:17 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
17/11/12 13:23:17 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Traceback (most recent call last):
File "/Users/billmcn/src/toy/sparkmodel/sparkmodel/little.py", line 9, in <module>
LinearSVC.load("mymodel")
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/pyspark/ml/util.py", line 257, in load
return cls.read().load(path)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/pyspark/ml/util.py", line 197, in load
java_obj = self._jread.load(path)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/anaconda3/envs/scratch/lib/python3.6/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.
: java.lang.NoSuchMethodException: org.apache.spark.ml.classification.LinearSVCModel.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:328)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Process finished with exit code 1
The "mymodel" directory is created and its contents appear to be valid.
I am running Spark 2.2.0 and pyspark 2.2.0. I have the following mllib jars in my installation.
> ll /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib*
-rw-r--r--# 1 billmcn admin 6501535 Jun 30 18:09 /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib_2.11-2.2.0.jar
-rw-r--r--# 1 billmcn admin 182887 Jun 30 18:09 /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib-local_2.11-2.2.0.jar
And the first of these (spark-mllib_2.11-2.2.0.jar) contains the class I want.
jar tf /usr/local/spark.versions/spark-2.2.0-bin-hadoop2.7/jars/spark-mllib_2.11-2.2.0.jar | grep LinearSVCModel
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter$Data.class
org/apache/spark/ml/classification/LinearSVCModel$.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCReader.class
org/apache/spark/ml/classification/LinearSVCModel$$anonfun$11.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter$$typecreator1$1.class
org/apache/spark/ml/classification/LinearSVCModel$LinearSVCWriter$Data$.class
org/apache/spark/ml/classification/LinearSVCModel.class
The same problem happens on two different machines.
What am I doing wrong?
I was using the wrong class to load the model: LinearSVC is the estimator, whereas the fitted model that was saved is a LinearSVCModel. The following works:
model = LinearSVCModel.load(model_path)
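If it is unclear which class a saved directory holds, one way to check is to read the class name Spark records in the save directory's metadata. The sketch below assumes the model was saved to the local filesystem and that the metadata is a single JSON line under metadata/part-*; the function names are hypothetical.

```python
import glob
import json
import os


def saved_class_name(model_dir):
    """Return the JVM class name recorded in a Spark ML save directory.

    Assumes a local path whose metadata lives in
    <model_dir>/metadata/part-* as one JSON object on the first line.
    """
    parts = sorted(glob.glob(os.path.join(model_dir, "metadata", "part-*")))
    if not parts:
        raise FileNotFoundError("no metadata part files under " + model_dir)
    with open(parts[0]) as fh:
        return json.loads(fh.readline())["class"]


def python_loader_name(jvm_class_name):
    """Strip the package prefix, leaving the simple class name."""
    return jvm_class_name.rsplit(".", 1)[-1]
```

For the directory saved in the question, python_loader_name(saved_class_name("mymodel")) should name the class whose load method to call, so a fitted model points at LinearSVCModel rather than LinearSVC.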
This looks like a version mismatch. The most likely scenario is:
Your Python (PySpark) installation uses Spark 2.2,
while the JVM jars were compiled against an earlier Spark version that did not include LinearSVCModel.
