Pyspark.ml - Error when loading model and Pipeline - apache-spark

I want to import a trained pyspark model (or pipeline) into a pyspark script. I trained a decision tree model like so:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
# Create assembler and labeller for spark.ml format preparation
assembler = VectorAssembler(inputCols = requiredFeatures, outputCol = 'features')
label_indexer = StringIndexer(inputCol='measurement_status', outputCol='indexed_label')
# Apply transformations
eq_df_labelled = label_indexer.fit(eq_df).transform(eq_df)
eq_df_labelled_featured = assembler.transform(eq_df_labelled)
# Split into training and testing datasets
(training_data, test_data) = eq_df_labelled_featured.randomSplit([0.75, 0.25])
# Create a decision tree algorithm
dtree = DecisionTreeClassifier(
    labelCol='indexed_label',
    featuresCol='features',
    maxDepth=5,
    minInstancesPerNode=1,
    impurity='gini',
    maxBins=32,
    seed=None
)
# Fit classifier object to training data
dtree_model = dtree.fit(training_data)
# Save model to given directory
dtree_model.save("models/dtree")
All of the code above works without any errors. The problem is that when I try to load this model (in the same or in another pyspark application), using:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
I get the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-b283bc2da75f> in <module>
2
3 imported_model = DecisionTreeClassifier()
----> 4 imported_model.load("models/dtree")
5
6 #lodel = DecisionTreeClassifier.load("models/dtree-test/")
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(cls, path)
328 def load(cls, path):
329 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 330 return cls.read().load(path)
331
332
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(self, path)
278 if not isinstance(path, basestring):
279 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 280 java_obj = self._jread.load(path)
281 if not hasattr(self._clazz, "_from_java"):
282 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
~/.local/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o39.load.
: java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1439)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.first(RDD.scala:1437)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I went for this approach because it also didn't work using a Pipeline object. Any ideas about what is happening?
UPDATE
I have realised that this error only occurs when I work with my Spark cluster (one master, two workers using Spark's standalone cluster manager). If I set up the Spark session like so (with the master set to local):
spark = SparkSession\
.builder\
.config(conf=conf)\
.appName("MachineLearningTesting")\
.master("local[*]")\
.getOrCreate()
I do not get the above error.
Also, I am using Spark 3.0.0; could it be that model importing and exporting in Spark 3 still has bugs?

There were two problems:
SSH authenticated communication must be enabled between all nodes in the cluster. Even though all nodes in my Spark cluster are in the same network, only the master had SSH authentication to the workers and not vice versa.
The model must be available to all nodes in the cluster. This may sound obvious, but I thought the model files only needed to be available to the master, which would then distribute them to the worker nodes. In other words, when you load the model like so:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
The directory /absolute_path/models/dtree must exist on every machine in the cluster. This made me realise that in production contexts, models are probably accessed via an external shared file system.
These two steps solved my problem of loading pyspark models into a Spark application running on a cluster.
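For reference, here is a minimal sketch of that round trip under the same assumption (the shared hdfs:// path is hypothetical; note also that a fitted model is read back with DecisionTreeClassificationModel rather than with the DecisionTreeClassifier estimator):
from pyspark.ml.classification import DecisionTreeClassificationModel
# Hypothetical shared location visible to the driver and to every worker
shared_path = "hdfs:///models/dtree"
# Save the fitted model to shared storage
dtree_model.write().overwrite().save(shared_path)
# Load it back with the model class, not the estimator
imported_model = DecisionTreeClassificationModel.load(shared_path)
predictions = imported_model.transform(test_data)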

Related

Java error for installation rasterframes (databricks)

I have followed the steps in this notebook to install rasterframes on my databricks cluster.
Eventually I am able to import the following:
from pyrasterframes import rf_ipython
from pyrasterframes.utils import create_rf_spark_session
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import *
But when I run:
spark = create_rf_spark_session()
I get the following error: "java.lang.NoClassDefFoundError: scala/Product$class".
I am using a cluster with Spark 3.2.1. I also installed Java Runtime Environment 1.8.0_341, but this made no difference.
Could someone explain what went wrong? And how to solve this error?
The full error log:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-2354681519525034> in <module>
5
6 # Use the provided convenience function to create a basic local SparkContext
----> 7 spark = create_rf_spark_session()
/databricks/python/lib/python3.8/site-packages/pyrasterframes/utils.py in create_rf_spark_session(master, **kwargs)
97
98 try:
---> 99 spark.withRasterFrames()
100 return spark
101 except TypeError as te:
/databricks/python/lib/python3.8/site-packages/pyrasterframes/__init__.py in _rf_init(spark_session)
42 """ Adds RasterFrames functionality to PySpark session."""
43 if not hasattr(spark_session, "rasterframes"):
---> 44 spark_session.rasterframes = RFContext(spark_session)
45 spark_session.sparkContext._rf_context = spark_session.rasterframes
46
/databricks/python/lib/python3.8/site-packages/pyrasterframes/rf_context.py in __init__(self, spark_session)
37 self._jvm = self._gateway.jvm
38 jsess = self._spark_session._jsparkSession
---> 39 self._jrfctx = self._jvm.org.locationtech.rasterframes.py.PyRFContext(jsess)
40
41 def list_to_seq(self, py_list):
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1566
1567 answer = self._gateway_client.send_command(command)
-> 1568 return_value = get_return_value(
1569 answer, self._gateway_client, None, self._fqn)
1570
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling None.org.locationtech.rasterframes.py.PyRFContext.
: java.lang.NoClassDefFoundError: scala/Product$class
at org.locationtech.rasterframes.model.TileDimensions.<init>(TileDimensions.scala:35)
at org.locationtech.rasterframes.package$.<init>(rasterframes.scala:55)
at org.locationtech.rasterframes.package$.<clinit>(rasterframes.scala)
at org.locationtech.rasterframes.py.PyRFContext.<init>(PyRFContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:250)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 15 more
Many thanks in advance!
That version of RasterFrames (0.8.4) works only with DBR 6.x, which uses Spark 2.4 & Scala 2.11, and will not work on Spark 3.2.x, which uses Scala 2.12. You may try version 0.10.1 instead, which was upgraded to Spark 3.1.2, but it may not work with Spark 3.2 (I haven't tested it).
If you're looking to run geospatial queries on Databricks, you can look into the Mosaic project from Databricks Labs - it supports standard st_ functions & many other things. You can find the announcement in the corresponding blog post; more information is in the talk at Data & AI Summit 2022, and in the documentation & project on GitHub.
I managed to get version 0.10.x of rasterframes working with Databricks runtime version 9.1 LTS. At the time of writing you cannot upgrade to a higher version of the runtime, because of pyspark version differences. Below you'll find a step-by-step guide on how to get this to work:
The cluster should be single user; otherwise you'll get this error:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.SparkConf(boolean) is not whitelisted
At the time of writing, the Databricks runtime version needs to be 9.1 LTS.
An init script should install GDAL: 
pip install gdal -f https://girder.github.io/large_image_wheels
The RasterFrames JAR should be built from source:
git clone https://github.com/mjohns-databricks/rasterframes.git
cd rasterframes
sbt publishLocal
The RasterFrames JAR should then be uploaded to Databricks. After the build, the file is located at:
/pyrasterframes/target/scala-2.12/pyrasterframes-assembly-0.10.1-SNAPSHOT.jar
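With the wheel and the assembly JAR installed as cluster libraries, session creation then follows the same pattern as in the question. A sketch, assuming the 0.10.x build above on a DBR 9.1 LTS single-user cluster:
from pyrasterframes import rf_ipython
from pyrasterframes.utils import create_rf_spark_session
# Assumes the pyrasterframes 0.10.x wheel and the assembly JAR built above
# are installed as libraries on the cluster
spark = create_rf_spark_session()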

Issue with save, load and predict with lightgbm from synapseml

Things don't work out as per the official overview.ipynb
I am trying to build a model in local mode on pyspark.
import pyspark
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassificationModel, LightGBMClassifier
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.getOrCreate()
##build modeling data
non_feats = [i for i in df.columns if '__val_col' in i.lower()]
non_feats.extend(['index','label','fold'])
feature_cols = [i for i in df.columns if i not in non_feats]
featurizer = VectorAssembler(
    inputCols=feature_cols,
    outputCol='features',
    handleInvalid='keep')
spark_df = df.to_spark()  # will not run on koalas or pandas-on-pyspark dataframes; this step is necessary
modeling_data = featurizer.transform(spark_df)
##build the model
model = LightGBMClassifier(
    objective='binary', isUnbalance=True, metric='auc',
    earlyStoppingRound=10, featuresShapCol='shap',
    validationIndicatorCol='__val_col0', labelCol='label',
)
After this, I am able to train the model successfully (I think!):
model.fit(dataset=modeling_data)
output: LightGBMClassifier_867a1dd2276d
Issues:
There is no saveNativeModel attribute available with model.
model.saveNativeModel("./lgbmclassifier.model")
output: AttributeError: 'LightGBMClassifier' object has no attribute 'saveNativeModel'
The model does not load after saving it with .save (which is the only save attribute available):
model.save("./lgbmclassifier_test.model")
LightGBMClassificationModel.loadNativeModelFromFile("./lgbmclassifier_test.model")
output:
[LightGBM] [Fatal] Model file doesn't specify the number of classes
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/tmp/ipykernel_268437/1849142297.py in <module>
----> 1 LightGBMClassificationModel.loadNativeModelFromFile("./lgbmclassifier_test.model")
/tmp/spark-5794a877-8405-42db-b23c-33bfa7d0eacd/userFiles-9a0cfba3-8761-4c76-9d83-ae455b04cb37/com.microsoft.azure_synapseml-lightgbm_2.12-0.9.4.jar/synapse/ml/lightgbm/LightGBMClassificationModel.py in loadNativeModelFromFile(filename)
19 ctx = SparkContext._active_spark_context
20 loader = ctx._jvm.com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel
---> 21 java_model = loader.loadNativeModelFromFile(filename)
22 return JavaParams._from_java(java_model)
23
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/java_gateway.py in __call__(self, *args)
1307
1308 answer = self.gateway_client.send_command(command)
-> 1309 return_value = get_return_value(
1310 answer, self.gateway_client, self.target_id, self.name)
1311
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.loadNativeModelFromFile.
: java.lang.Exception: Booster LoadFromString call failed in LightGBM with error: Model file doesn't specify the number of classes
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:24)
at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler$.com$microsoft$azure$synapse$ml$lightgbm$booster$BoosterHandler$$createBoosterPtrFromModelString(LightGBMBooster.scala:46)
at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler.<init>(LightGBMBooster.scala:63)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler$lzycompute(LightGBMBooster.scala:236)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler(LightGBMBooster.scala:230)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.numClasses$lzycompute(LightGBMBooster.scala:500)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.numClasses(LightGBMBooster.scala:500)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel$.loadNativeModelFromFile(LightGBMClassifier.scala:199)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.loadNativeModelFromFile(LightGBMClassifier.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Env information:
SynapseML Version: How do I check the version? There is no .version on synapse when I import it. I guess it is 0.9.4, since that is the jar I provide ("com.microsoft.azure:synapseml_2.12:0.9.4") when creating the session.
Spark Version: 3.2.0
Spark Platform: local[*]
Can you please help me understand what to do after training the model? I am looking for save, load and predict capabilities.
You have to save the model on the local filesystem and then move it to the sandbox:
from azure.storage.filedatalake import DataLakeServiceClient
# Save the native LightGBM model to the local filesystem first
gbm.save_model('/tmp/%s' % (gbm_name))
storage_account_name = "name"
storage_account_key = "key"
container_name = "container_name"
directory_name = "ml_path"
service_client = DataLakeServiceClient(
    account_url="{}://{}.dfs.core.windows.net".format("https", storage_account_name),
    credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
dir_client = file_system_client.get_directory_client(directory_name)
dir_client.create_directory()
# Upload the saved model file to the data lake
local_path = '/tmp/%s' % (gbm_name)
file_client = dir_client.create_file(gbm_name)
with open(local_path, 'rb') as f:
    file_contents = f.read()
file_client.upload_data(file_contents, overwrite=True)
Then you can load and score with mmlspark:
from mmlspark.lightgbm import LightGBMClassificationModel
gbm_spark = LightGBMClassificationModel().loadNativeModelFromFile('directory_name/gbm_name')
featurizer = VectorAssembler(
    handleInvalid="keep",
    inputCols=feature_cols,
    outputCol='features'
)
score_data= featurizer.transform(var_score)
scored_data = gbm_spark.transform(score_data)
I hope it helps you.
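As a hedged aside on the original AttributeError: with synapse.ml.lightgbm, fit() returns a LightGBMClassificationModel, and the native-model methods are expected on that fitted model rather than on the LightGBMClassifier estimator (treat the exact availability of saveNativeModel in 0.9.4 as an assumption). A minimal sketch:
from synapse.ml.lightgbm import LightGBMClassifier, LightGBMClassificationModel
# Keep the return value of fit(); that object is the fitted model
fitted_model = LightGBMClassifier(objective='binary', labelCol='label').fit(modeling_data)
# saveNativeModel is assumed to live on the fitted model, not the estimator
fitted_model.saveNativeModel("/tmp/lgbmclassifier_native")
# Round trip: load the native model and score with it
loaded = LightGBMClassificationModel.loadNativeModelFromFile("/tmp/lgbmclassifier_native")
scored = loaded.transform(modeling_data)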

How to set up JDBC driver for MySQL in Jupyter notebook for pyspark?

I am trying to load a bunch of CSV files row by row into a MySQL instance that is running on OpenShift, using pyspark. I have a Jupyter notebook with Spark up and running.
Below is the code I have. It fails with the following driver error:
Py4JJavaError: An error occurred while calling o89.save.
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
if __name__ == '__main__':
    scSpark = SparkSession \
        .builder \
        .appName("reading csv") \
        .getOrCreate()
data_file = '/opt/app-root/src/data/train.psv'
sdfData = scSpark.read.csv(data_file, header=True, sep="|").cache()
print('Total Records = {}'.format(sdfData.count()))
sdfData.show()
sdfData.registerTempTable("train")
output = scSpark.sql('SELECT count(*) from train')
output.show()
+--------+
|count(1)|
+--------+
| 1168686|
+--------+
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages mysql:mysql-connector-java:jar:8.0.21 pyspark-shell'
output = scSpark.sql('SELECT * from train')
output.show()
output.write.format('jdbc').options(
url='jdbc:mysql://mysql-1-28d85/sepsis',
driver='com.mysql.jdbc.Driver',
#driver='mysql-connector-java.Driver',
#driver='org.mysql.jdbc.Driver',
dbtable='train',
user='sepsis',
password='Success_2020').mode('append').save()
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-57-114af97e0442> in <module>
11 dbtable='train',
12 user='sepsis',
---> 13 password='Success_2020').mode('append').save()
/opt/app-root/lib/python3.6/site-packages/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
735 self.format(format)
736 if path is None:
--> 737 self._jwrite.save()
738 else:
739 self._jwrite.save(path)
/opt/app-root/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/app-root/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/app-root/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o1641.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:190)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:194)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I changed the code to use the package option.
Also, this is OpenShift, where all components are running as pods with no access to the outside environment.
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
That says it all. You have to start pyspark (or the environment) with the JDBC driver for MySQL using --driver-class-path or similar (that will be specific to Jupyter).
For Jupyter Notebook
Copying from PySpark in Jupyter Notebook — Working with Dataframe & JDBC Data Sources:
If you use Jupyter Notebook, you should set the PYSPARK_SUBMIT_ARGS environment variable, as following:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
Change the --packages to reference the MySQL JDBC driver.
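For example, a sketch with the MySQL coordinates substituted, reusing the JDBC URL and options from the question (the connector version is illustrative, and the variable must be set before the SparkSession is created):
import os
# Must be set before the SparkSession/SparkContext is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages mysql:mysql-connector-java:8.0.21 pyspark-shell'
from pyspark.sql import SparkSession
scSpark = SparkSession.builder.appName("reading csv").getOrCreate()
# With Connector/J 8.x the driver class is com.mysql.cj.jdbc.Driver
output.write.format('jdbc').options(
    url='jdbc:mysql://mysql-1-28d85/sepsis',
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='train',
    user='sepsis',
    password='Success_2020').mode('append').save()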
Alternatively, go to the Spark installation path; there will be a jars folder there. Download the MySQL JDBC jar file and place it into that jars folder, and then you don't need any extra options in the command or the code.
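A related in-code option, if you prefer not to touch the jars folder, is to point spark.jars at the downloaded file when building the session (the jar path below is hypothetical):
from pyspark.sql import SparkSession
# Hypothetical local path to the downloaded Connector/J jar
scSpark = SparkSession.builder \
    .appName("reading csv") \
    .config("spark.jars", "/opt/app-root/src/jars/mysql-connector-java-8.0.21.jar") \
    .getOrCreate()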

Multidimensional KDE in PySpark

My goal is to fit a Kernel Density Estimate (KDE) model to a large two-dimensional dataset using either Python or PySpark, then use the fit model to predict densities for another two-dimensional dataset.
I have the training data in a Spark RDD, which I can fit using MLlib. X is a list of lists in Python, where the (i,j)-th element represents a 'score' for the i-th example on the j-th feature.
from pyspark.mllib.stat import KernelDensity
sample = sc.parallelize(X)
kd = KernelDensity()
kd.setBandwidth(0.2)
kd.setSample(sample)
I want to use the model to estimate densities for another two-dimensional dataset. For simplicity, I perform the calculation on a small set of data:
sample2 = [[1.0, 2.2],[3.1,0.9]]
kd.estimate(sample2)
At this point, PySpark throws an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-140-7fdd873f4c51> in <module>()
----> 1 kd.estimate(sample2)
/usr/local/spark/python/pyspark/mllib/stat/KernelDensity.pyc in estimate(self, points)
56 points = list(points)
57 densities = callMLlibFunc(
---> 58 "estimateKernelDensity", self._sample, self._bandwidth, points)
59 return np.asarray(densities)
/usr/local/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, *args)
128 sc = SparkContext.getOrCreate()
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131
132
/usr/local/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124
125
/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling o1734.estimateKernelDensity.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:93)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.estimateKernelDensity(PythonMLLibAPI.scala:1067)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
I have tried transforming sample2 into an RDD, pyspark dataframe, and numpy array, but these all give various other errors.
I am curious as to why this error is occurring; the documentation for the estimate method states that a list object should be passed as the input. Is a list-of-lists not acceptable?
I could probably use sklearn's KDE models for this, but was hoping to use Spark due to the large size of the data. If there is a clever way to do this in a fast, scalable way using sklearn, I would be open to suggestions on that route as well.
I believe Spark's KDE implementation is univariate. In other words, you can only offer it a single array of values, which is probably where your casting error is coming from.
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.Double
It expects a list of doubles and you are giving it a list of lists, hence it says it can't cast a list to a double.
See documentation: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity
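A minimal sketch of the univariate usage the API expects (the numbers are illustrative):
from pyspark.mllib.stat import KernelDensity
# Univariate sample: an RDD of plain floats, not of [x, y] pairs
sample = sc.parallelize([1.0, 2.2, 3.1, 0.9])
kd = KernelDensity()
kd.setBandwidth(0.2)
kd.setSample(sample)
# estimate() likewise takes a flat list of scalar points
densities = kd.estimate([1.0, 2.0, 3.0])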

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

import os
import sys
os.chdir("/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/bin")
os.curdir
if 'SPARK_HOME' not in os.environ:
os.environ['SPARK_HOME'] = '/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7'
SPARK_HOME = os.environ['SPARK_HOME']
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.9-src.zip"))
from pyspark import SparkContext
from pyspark import SparkConf
# Optionally configure Spark Settings
conf=SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("V2 Maestros")
## Initialize SparkContext. Run only once, otherwise you get a multiple-context error.
sc = SparkContext('local', conf=conf)
#Test to make sure everything works.
lines=sc.textFile("auto-data.csv")
lines.count()
This is the error that occurred. It was a simple program counting the number of entries in the file, but this error came up. I have kept the file in both of the locations mentioned in the code, but the result is the same.
Py4JJavaError Traceback (most recent call last)
<ipython-input-6-5c9242495358> in <module>()
1 lines = sc.textFile("auto-save.csv")
----> 2 lines.count()
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in count(self)
1006 3
1007 """
-> 1008 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1009
1010 def stats(self):
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in sum(self)
997 6.0
998 """
--> 999 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
1000
1001 def count(self):
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in fold(self, zeroValue, op)
871 # zeroValue provided to each partition is unique from the one provided
872 # to the final reduce call
--> 873 vals = self.mapPartitions(func).collect()
874 return reduce(op, vals, zeroValue)
875
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in collect(self)
774 """
775 with SCCallSiteSync(self.context) as css:
--> 776 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
777 return list(_load_from_socket(port, self._jrdd_deserializer))
778
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/auto-save.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
I faced the same error and solved it. It happens if you configure the Spark context with more worker cores than your system supports. For example, I have a 3-core system, so the code below won't work because there is no 4th core.
Unsupported SparkContext configuration code, for which I got the Py4JJavaError:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Collinear Points").setMaster("local[4]") #Initialize spark context using 4 local cores as workers
sc = SparkContext(conf=conf)
from pyspark.rdd import RDD
Supported SparkContext configuration code for all types of systems, because below we do not explicitly request a number of cores as workers:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Collinear Points")
sc = SparkContext('local',conf=conf)
from pyspark.rdd import RDD
You should put the file on HDFS and read it as
lines=sc.textFile("hdfs:///tmp/auto-data.csv")
or just
lines=sc.textFile("/tmp/auto-data.csv")
Both commands read the file from HDFS.
The exception is self-explanatory. Try giving the absolute path of auto-save.csv in lines=sc.textFile("auto-data.csv"), or move auto-save.csv to /home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/auto-save.csv
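A minimal sketch of reading with an explicit absolute path, which avoids the working-directory ambiguity (the path is taken from the traceback and assumes the file actually lives there):
# Explicit local-filesystem URI; adjust to wherever the CSV really is
lines = sc.textFile("file:///home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/auto-data.csv")
print(lines.count())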
