Java error when installing rasterframes (Databricks) - apache-spark

I have followed the steps in this notebook to install rasterframes on my Databricks cluster.
Eventually I am able to import the following:
from pyrasterframes import rf_ipython
from pyrasterframes.utils import create_rf_spark_session
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import *
But when I run:
spark = create_rf_spark_session()
I get the following error: "java.lang.NoClassDefFoundError: scala/Product$class".
I am using a cluster with Spark 3.2.1. I also installed Java Runtime Environment 1.8.0_341, but this made no difference.
Could someone explain what went wrong, and how to solve this error?
The full error log:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-2354681519525034> in <module>
5
6 # Use the provided convenience function to create a basic local SparkContext
----> 7 spark = create_rf_spark_session()
/databricks/python/lib/python3.8/site-packages/pyrasterframes/utils.py in create_rf_spark_session(master, **kwargs)
97
98 try:
---> 99 spark.withRasterFrames()
100 return spark
101 except TypeError as te:
/databricks/python/lib/python3.8/site-packages/pyrasterframes/__init__.py in _rf_init(spark_session)
42 """ Adds RasterFrames functionality to PySpark session."""
43 if not hasattr(spark_session, "rasterframes"):
---> 44 spark_session.rasterframes = RFContext(spark_session)
45 spark_session.sparkContext._rf_context = spark_session.rasterframes
46
/databricks/python/lib/python3.8/site-packages/pyrasterframes/rf_context.py in __init__(self, spark_session)
37 self._jvm = self._gateway.jvm
38 jsess = self._spark_session._jsparkSession
---> 39 self._jrfctx = self._jvm.org.locationtech.rasterframes.py.PyRFContext(jsess)
40
41 def list_to_seq(self, py_list):
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1566
1567 answer = self._gateway_client.send_command(command)
-> 1568 return_value = get_return_value(
1569 answer, self._gateway_client, None, self._fqn)
1570
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling None.org.locationtech.rasterframes.py.PyRFContext.
: java.lang.NoClassDefFoundError: scala/Product$class
at org.locationtech.rasterframes.model.TileDimensions.<init>(TileDimensions.scala:35)
at org.locationtech.rasterframes.package$.<init>(rasterframes.scala:55)
at org.locationtech.rasterframes.package$.<clinit>(rasterframes.scala)
at org.locationtech.rasterframes.py.PyRFContext.<init>(PyRFContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:250)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 15 more
Many thanks in advance!

That version of RasterFrames (0.8.4) works only with DBR 6.x, which uses Spark 2.4 and Scala 2.11; it will not work on Spark 3.2.x, which uses Scala 2.12. You may try version 0.10.1 instead, which was upgraded to Spark 3.1.2, but it may not work with Spark 3.2 (I haven't tested it).
If you're looking to run geospatial queries on Databricks, you can look into the Mosaic project from Databricks Labs - it supports standard st_ functions and many other things. You can find the announcement in the corresponding blog post; more information is in the talk at Data & AI Summit 2022 and in the documentation and project on GitHub.
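To confirm the mismatch, you can print the Scala version of the cluster's JVM from a notebook. A quick sketch (note that _jvm is an internal PySpark accessor, not a public API):
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# prints e.g. 'version 2.12.14' on Spark 3.2.x, while RasterFrames 0.8.4 expects Scala 2.11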

I managed to get version 0.10.x of rasterframes working with Databricks runtime version 9.1 LTS. At the time of writing you cannot upgrade to a higher version of the runtime because of pyspark version differences. Below you'll find a step-by-step guide on how to get this to work:
The cluster should be single user; otherwise you'll get this error:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted
At the time of writing, the Databricks runtime version needs to be 9.1 LTS.
An init script should install GDAL: 
pip install gdal -f https://girder.github.io/large_image_wheels
The RasterFrames JAR should be built from source:
git clone https://github.com/mjohns-databricks/rasterframes.git
cd rasterframes
sbt publishLocal
The RasterFrames JAR should then be uploaded to Databricks. After the build, the file is located at:
/pyrasterframes/target/scala-2.12/pyrasterframes-assembly-0.10.1-SNAPSHOT.jar
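With the JAR attached to the cluster and the matching pyrasterframes wheel installed, a quick smoke test (a minimal sketch, assuming the default session factory works on your runtime) is:
from pyrasterframes.utils import create_rf_spark_session
# this call previously raised NoClassDefFoundError: scala/Product$class
spark = create_rf_spark_session()
print(spark.version)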

Related

Py4JJavaError: An error occurred while calling o65.createGraph

I wanted to install graphframes for spark following the instructions on the spark website, but the command:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
did not work for me.
I tried many ways to install it, but decided to settle on downloading the graphframes .jar, adding it to the general list of Spark .jar files, and adding it manually in the code with spark.sparkContext.addPyFile("path to /spark-2.4.7-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar").
After that, the library imports, but there is always an error when creating the GraphFrame, and I just have no idea how to solve it.
My .bashrc variables:
export CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export HADOOP_CONF_DIR="/home/german/spark-2.4.7-bin-hadoop2.7/conf"
export HADOOP_HOME="/home/german/spark-2.4.7-bin-hadoop2.7"
export HADOOP_SECURITY_LOGGER=ERROR,console
export JAVA_HOME="/home/german/jdk1.8.0_301"
export SPARK_CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export SPARK_DIST_CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export SPARK_HOME="/home/german/spark-2.4.7-bin-hadoop2.7"
export PATH="/home/german/spark-2.4.7-bin-hadoop2.7/bin:$PATH"
export PYTHONPATH="/home/german/spark-2.4.7-bin-hadoop2.7/python/lib/pyspark.zip:/home/german/spark-2.4.7-bin-hadoop2.7/python/lib:/home/german/spark-2.4.7-bin-hadoop2.7/python:$PYTHONPATH"
My JDK version is 1.8, Python 3.7.10, OS: Ubuntu 20.04 LTS.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.sql.warehouse.dir", "spark_warehouse")\
    .getOrCreate()
spark.sparkContext.setCheckpointDir("graphframes_checkpoints")
spark.sparkContext.addPyFile("path to /spark-2.4.7-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar")

vertices = spark.read.parquet("tmp_dfs/parquet/vertices.parquet")
edges = spark.read.parquet("tmp_dfs/parquet/edges.parquet")

from graphframes import *
graph = GraphFrame(vertices, edges)
And I get the error:
Py4JJavaError: An error occurred while calling o65.createGraph.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-ee0c1444db6f> in <module>
37
38 from graphframes import *
---> 39 graph = GraphFrame(vertices, edges)
/tmp/spark-9d209109-e503-4ea1-813c-9ca68e76d72a/userFiles-4417833f-c19c-4e6e-9eea-7a21b6553f5f/graphframes-0.8.1-spark3.0-s_2.12.jar/graphframes/graphframe.py in __init__(self, v, e)
87 .format(self.DST, ",".join(e.columns)))
88
---> 89 self._jvm_graph = self._jvm_gf_api.createGraph(v._jdf, e._jdf)
90
91 #property
~/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-2.4.7-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o65.createGraph.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I may have chosen the wrong installation method or something else. I would be glad to hear any suggestions on how to solve this problem.
Check which Scala version the Spark jars under the $SPARK_HOME/jars folder were built for, for example spark-sql_<scala version>-2.4.7.jar. If the version is 2.11, then you need a graphframes build compiled against Scala 2.11.
One more thing: the Spark version you are using is 2.4.7, but the graphframes jar you added targets Spark 3.0; this might also cause issues.
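A minimal sketch of the fix, assuming your Spark 2.4.7 jars are built for Scala 2.11 and that a matching build such as graphframes:graphframes:0.8.1-spark2.4-s_2.11 is published on Spark Packages (check the repository for the exact coordinates):
from pyspark.sql import SparkSession

# let Spark resolve a graphframes build matching Spark 2.4 / Scala 2.11,
# instead of hand-copying a Spark 3.0 / Scala 2.12 jar
spark = SparkSession.builder\
    .config("spark.jars.packages", "graphframes:graphframes:0.8.1-spark2.4-s_2.11")\
    .config("spark.sql.warehouse.dir", "spark_warehouse")\
    .getOrCreate()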

Cannot install geopandas PyPI package in Databricks (Runtime 5.5 LTS, Spark 2.4.3, Scala 2.11)

I cannot install the geopandas package in Databricks. I'm using a cluster on Runtime 5.5 LTS (Spark 2.4.3, Scala 2.11).
The package installs successfully in other runtime versions, but not in the version I need.
What needs to be done to install this package on cluster runtime 5.5?
I'm using the command below:
dbutils.library.installPyPI("geopandas")
Below is the error statement:
org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/python, /local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/pip, install, geopandas, --disable-pip-version-check) exited with code 1. Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-bgvkkr58/fiona/
Detailed error
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-1887950226624660> in <module>()
1
----> 2 dbutils.library.installPyPI("geopandas")
/local_disk0/tmp/1625551234943-0/dbutils.py in installPyPI(self, project, version, repo, extras)
237 def installPyPI(self, project, version = "", repo = "", extras = ""):
238 return self.print_and_return(self.entry_point.getSharedDriverContext() \
--> 239 .addIsolatedPyPILibrary(project, version, repo, extras))
240
241 def restartPython(self):
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o413.addIsolatedPyPILibrary.
: org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/python, /local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/pip, install, geopandas, --disable-pip-version-check) exited with code 1. Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-bgvkkr58/fiona/
at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1403)
at org.apache.spark.util.Utils$.installLibrary(Utils.scala:836)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1700)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1632)
at com.databricks.backend.daemon.driver.SharedDriverContext$$anonfun$addIsolatedPyPILibrary$1.apply$mcV$sp(SharedDriverContext.scala:558)
at com.databricks.backend.daemon.driver.SharedDriverContext$$anonfun$addIsolatedPyPILibrary$1.apply(SharedDriverContext.scala:558)
at com.databricks.backend.daemon.driver.SharedDriverContext$$anonfun$addIsolatedPyPILibrary$1.apply(SharedDriverContext.scala:558)
at com.databricks.logging.UsageLogging$$anonfun$recordOperation$1.apply(UsageLogging.scala:369)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
at com.databricks.backend.daemon.driver.SharedDriverContext.withAttributionContext(SharedDriverContext.scala:57)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271)
at com.databricks.backend.daemon.driver.SharedDriverContext.withAttributionTags(SharedDriverContext.scala:57)
at com.databricks.logging.UsageLogging$class.recordOperation(UsageLogging.scala:350)
at com.databricks.backend.daemon.driver.SharedDriverContext.recordOperation(SharedDriverContext.scala:57)
at com.databricks.backend.daemon.driver.SharedDriverContext.addIsolatedPyPILibrary(SharedDriverContext.scala:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)

Pyspark.ml - Error when loading model and Pipeline

I want to import a trained pyspark model (or pipeline) into a pyspark script. I trained a decision tree model like so:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
# Create assembler and labeller for spark.ml format preparation
assembler = VectorAssembler(inputCols=requiredFeatures, outputCol='features')
label_indexer = StringIndexer(inputCol='measurement_status', outputCol='indexed_label')

# Apply transformations
eq_df_labelled = label_indexer.fit(eq_df).transform(eq_df)
eq_df_labelled_featured = assembler.transform(eq_df_labelled)

# Split into training and testing datasets
(training_data, test_data) = eq_df_labelled_featured.randomSplit([0.75, 0.25])

# Create a decision tree algorithm
dtree = DecisionTreeClassifier(
    labelCol='indexed_label',
    featuresCol='features',
    maxDepth=5,
    minInstancesPerNode=1,
    impurity='gini',
    maxBins=32,
    seed=None
)

# Fit classifier object to training data
dtree_model = dtree.fit(training_data)

# Save model to given directory
dtree_model.save("models/dtree")
All of the code above works without any errors. The problem is, when I try to load this model (in the same or in another pyspark application), using:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
I get the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-b283bc2da75f> in <module>
2
3 imported_model = DecisionTreeClassifier()
----> 4 imported_model.load("models/dtree")
5
6 #lodel = DecisionTreeClassifier.load("models/dtree-test/")
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(cls, path)
328 def load(cls, path):
329 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 330 return cls.read().load(path)
331
332
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(self, path)
278 if not isinstance(path, basestring):
279 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 280 java_obj = self._jread.load(path)
281 if not hasattr(self._clazz, "_from_java"):
282 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
~/.local/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o39.load.
: java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1439)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.first(RDD.scala:1437)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I went for this approach because it also didn't work using a Pipeline object. Any ideas about what is happening?
UPDATE
I have realised that this error only occurs when I work with my Spark cluster (one master, two workers, using Spark's standalone cluster manager). If I create the Spark session like so (with the master set to local):
spark = SparkSession\
    .builder\
    .config(conf=conf)\
    .appName("MachineLearningTesting")\
    .master("local[*]")\
    .getOrCreate()
I do not get the above error.
Also, I am using Spark 3.0.0; could it be that model importing and exporting in Spark 3 still has bugs?
There were two problems:
SSH-authenticated communication must be enabled between all nodes in the cluster. Even though all nodes in my Spark cluster are on the same network, only the master had SSH authentication to the workers, and not vice versa.
The model must be available to all nodes in the cluster. This may sound obvious, but I thought the model files only needed to be available to the master, which would then distribute them to the worker nodes. In other words, when you load the model like so:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
The file /absolute_path/models/dtree must exist on every machine in the cluster. This made me understand that in production contexts, the models are probably accessed via an external shared file system.
These two steps solved my problem of loading pyspark models into a Spark application running on a cluster.
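As a side note, PySpark distinguishes the estimator from the fitted model: a model saved with dtree_model.save() is normally loaded back through the model class rather than the estimator class. A minimal sketch, assuming the saved model sits on a path every node can reach (HDFS, S3, a shared mount, ...):
from pyspark.ml.classification import DecisionTreeClassificationModel

# the path must resolve on the driver and on every worker node
model = DecisionTreeClassificationModel.load("hdfs:///models/dtree")
predictions = model.transform(test_data)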

Cannot read from Elasticsearch using PySpark

Perhaps there's someone out there who can help me.
I'm trying to read data from ES using PySpark. My Jupyter Notebook code is pretty simple:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077')
sc = pyspark.SparkContext(conf=conf)
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "some-log/doc",
        "es.nodes": "192.168.1.25",
        "es.port": "9200"
    })
I have Spark and Jupyter Notebook installed on the host running the notebook. The spark-defaults.conf file loads the elasticsearch-hadoop-6.4.0.jar via: spark.jars /opt/maya/es-hadoop/elasticsearch-hadoop-6.4.0.jar
I can connect to the ES instance and read from it using other tools like elasticsearch-py, and the Test app shows up in the Spark Master UI. However, when I execute the code above, I keep getting this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-5-c990f37c388b> in <module>
6 "es.resource": "logs-dfir-winevent-security-*/doc",
7 "es.nodes": "192.168.248.131",
----> 8 "es.port": "9200"
9 })
10 #es_rdd.first()
/opt/anaconda/lib/python3.6/site-packages/pyspark/context.py in newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)
715 jrdd = self._jvm.PythonRDD.newAPIHadoopRDD(self._jsc, inputFormatClass, keyClass,
716 valueClass, keyConverter, valueConverter,
--> 717 jconf, batchSize)
718 return RDD(jrdd, self)
719
/opt/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:302)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:286)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I've searched and cannot see an error in the code itself; I have a feeling this issue is more related to a bad Spark configuration on the host running the Jupyter Notebook. Any insight would be much appreciated!
Please refer to this question: pyspark: ship jar dependency with spark-submit
What you need to do is pass the jar of the dependency with the configuration. If you're using a Jupyter notebook, you can add it via SparkConf().set(), such as:
conf = SparkConf().set('spark.driver.extraClassPath', 'full/path/to/jar')
So just change your code to:
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077').set('spark.driver.extraClassPath', 'full/path/to/jar')
Another method is:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /full/path/to/your/jar.jar pyspark-shell'
The jar can be downloaded from https://www.elastic.co/downloads/hadoop
This works on Spark 2.3 and Elasticsearch 6.4.
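One caveat worth spelling out: PYSPARK_SUBMIT_ARGS is only read when the JVM is launched, so it must be set before the SparkContext is created. A minimal sketch, reusing the jar path from the question above:
import os

# set this first - it has no effect once a SparkContext already exists
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /opt/maya/es-hadoop/elasticsearch-hadoop-6.4.0.jar pyspark-shell'

import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077')
sc = pyspark.SparkContext(conf=conf)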

Why does pyspark fail with "Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars."?

I am using a standalone cluster of Apache Spark version 2.0.0 with two nodes, and I have not installed Hive. I am getting the following error on creating a dataframe:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
l = [('Alice', 1)]
sqlContext.createDataFrame(l).collect()
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-9-63bc4f21f23e> in <module>()
----> 1 sqlContext.createDataFrame(l).collect()
/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
297 Py4JJavaError: ...
298 """
--> 299 return self.sparkSession.createDataFrame(data, schema, samplingRatio)
300
301 #since(1.3)
/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
525 df = DataFrame(jdf, self._wrapped)
526 df._schema = schema
/home/mok/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'
So should I install Hive, or edit the configuration?
IllegalArgumentException: u'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'
I had the same issue and fixed it by using Java 8. Make sure you install JDK 8 and set the environment variables accordingly.
Do not use Java 11 with Spark / PySpark 2.4.
If you have several Java versions installed, you'll have to figure out which one Spark is using. I did this by trial and error, starting with
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
and ending with
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
This did the trick.
If you have multiple JDKs installed, you can find the Java homes as below:
/usr/libexec/java_home -V
Matching Java Virtual Machines (3):
13.0.2, x86_64: "OpenJDK 13.0.2" /Library/Java/JavaVirtualMachines/adoptopenjdk-13.0.2.jdk/Contents/Home
11.0.6, x86_64: "AdoptOpenJDK 11" /Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home
1.8.0_252, x86_64: "AdoptOpenJDK 8" /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
Now, to set JAVA_HOME to the 1.8 JDK, use:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
Please ensure that your JAVA_HOME environment variable is set.
For macOS I ran echo 'export JAVA_HOME=/Library/Java/Home' >> ~/.bash_profile and then source ~/.bash_profile, or open ~/.bash_profile and type the above line.
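If you'd rather pin the JDK per script than per shell profile, a minimal sketch is to export JAVA_HOME from Python before PySpark launches its JVM (the path below is an example taken from the java_home output above; substitute your own JDK 8 location):
import os

# must run before the SparkContext starts the JVM
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "java-home-test")
sqlContext = SQLContext(sc)
print(sqlContext.createDataFrame([('Alice', 1)]).collect())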
