Multidimensional KDE in PySpark - apache-spark

My goal is to fit a Kernel Density Estimate (KDE) model to a large two-dimensional dataset using either Python or PySpark, then use the fit model to predict densities for another two-dimensional dataset.
I have the training data in a Spark RDD object, which I can fit using MLlib. X is a list of lists in python, where the (i,j) th element of the array represents a 'score' for the ith example on the jth feature.
from pyspark.mllib.stat import KernelDensity
sample = sc.parallelize(X)
kd = KernelDensity()
kd.setBandwidth(0.2)
kd.setSample(sample)
I want to use the model to estimate densities for another, two-dimensional dataset. I perform the calculation, for simplicity on a small set of data:
sample2 = [[1.0, 2.2],[3.1,0.9]]
kd.estimate(sample2)
At this point, PySpark throws an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-140-7fdd873f4c51> in <module>()
----> 1 kd.estimate(sample2)
/usr/local/spark/python/pyspark/mllib/stat/KernelDensity.pyc in estimate(self, points)
56 points = list(points)
57 densities = callMLlibFunc(
---> 58 "estimateKernelDensity", self._sample, self._bandwidth, points)
59 return np.asarray(densities)
/usr/local/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, *args)
128 sc = SparkContext.getOrCreate()
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131
132
/usr/local/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124
125
/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling o1734.estimateKernelDensity.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:93)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.estimateKernelDensity(PythonMLLibAPI.scala:1067)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
I have tried transforming sample2 into an RDD, pyspark dataframe, and numpy array, but these all give various other errors.
I am curious as to why this error is occurring; the documentation for the estimate method states that a list object should be passed as the input. Is a list-of-lists not acceptable?
I could probably use sklearn's KDE models to fit this code, but was hoping to use Spark due to the large size of the data. If there is a clever way to do this in a fast, scalable way using sklearn, I would be open to suggestions on how to take that route as well.

I believe sparks KDE implementation is univariate. In other words, I believe you can only offer it a single array of values. Probably where your casting error is coming from.
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.Double
It expects a list of doubles and you are giving it a list of lists. Hence why it says it can't cast a list to a double.
See documentation: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity

Related

Issue adding Word2Vec to Spark pipeline

I'm still getting used to Spark but I am having an issue figuring out how to build a pipeline. I have a spark dataframe below and my end goal is to classify each movie by reviewing their plot and classifying them.
dataframe
I am trying to create a pipeline using a stringIndexer, tokenizer, stopwordsremover and word2vec but I am getting the below error. I'm not sure how to resolve it after looking at some similar topics.
indexer = StringIndexer(inputCol="word", outputCol="label")
tokenizer = Tokenizer(inputCol = "plot_synopsis", outputCol = "tokenized_terms")
remover = StopWordsRemover(inputCol="tokenized_terms", outputCol="filtered")
word2Vec = Word2Vec(vectorSize=5, minCount=0, inputCol="filtered", outputCol="wordVectors")
pipeline = Pipeline(stages=[tokenizer, remover, word2Vec, indexer])
encodedData = pipeline.fit(df_expand).transform(df_expand)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-25-7d237f91c3cf> in <module>
----> 1 encodedData = pipeline.fit(df_expand).transform(df_expand)
~\anaconda3\lib\site-packages\pyspark\ml\base.py in fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise TypeError("Params must be either a param map or a list/tuple of param maps, "
~\anaconda3\lib\site-packages\pyspark\ml\pipeline.py in _fit(self, dataset)
112 dataset = stage.transform(dataset)
113 else: # must be an Estimator
--> 114 model = stage.fit(dataset)
115 transformers.append(model)
116 if i < indexOfLastEstimator:
~\anaconda3\lib\site-packages\pyspark\ml\base.py in fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise TypeError("Params must be either a param map or a list/tuple of param maps, "
~\anaconda3\lib\site-packages\pyspark\ml\wrapper.py in _fit(self, dataset)
333
334 def _fit(self, dataset):
--> 335 java_model = self._fit_java(dataset)
336 model = self._create_model(java_model)
337 return self._copyValues(model)
~\anaconda3\lib\site-packages\pyspark\ml\wrapper.py in _fit_java(self, dataset)
330 """
331 self._transfer_params_to_java()
--> 332 return self._java_obj.fit(dataset._jdf)
333
334 def _fit(self, dataset):
~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
~\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o147.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18.0 (TID 14) (host.docker.internal executor driver): TaskResultLost (result lost from block manager)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.mllib.feature.Word2Vec.learnVocab(Word2Vec.scala:191)
at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:312)
at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:182)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Unknown Source)
can you check once and make sure you are passing Datafrmae to fit() method.
Related issue:
PySpark :: FP-growth algorithm ( raise ValueError("Params must be either a param map or a list/tuple of param maps, ")

PySpark using s3a throws java.lang.IllegalArgumentException

I've a bunch of code that works perfectly with s3n, but when i try to switch to s3a I just get some sort of java.lang.IllegalArgumentException without a real pointer or hint as to what exactly is wrong.. would appreciate some suggestions for debugging! I'm on hadoop-aws-2.7.3 and aws-java-sdk-1.7.4 so I believe that should be fine
error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-1aafd157ea37> in <module>
----> 1 schema_df = spark.read.json('s3a://udemy-stream-logs/cdn-access-raw/verizon/mp4-a.udemycdn.com/wpc_C9216_306_20200701_0C390000BFD7B55E_100.json_lines.gz')
2 schema = schema_df.schema
/usr/local/spark/python/pyspark/sql/readwriter.py in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, samplingRatio, dropFieldIfAllNull, encoding, locale, pathGlobFilter, recursiveFileLookup)
298 path = [path]
299 if type(path) == list:
--> 300 return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
301 elif isinstance(path, RDD):
302 def func(iterator):
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o31.json.
: java.lang.IllegalArgumentException
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1293)
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1215)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:477)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
my code:
conf = (SparkConf()
.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.master', 'local[*]')
.set('spark.driver.memory', '4g'))
scT = SparkContext(conf=conf)
scT.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
scT.setLogLevel("INFO")
hadoopConf = scT._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3.buffer.dir', '/tmp/pyspark')
hadoopConf.set('fs.s3a.awsAccessKeyId', 'key')
hadoopConf.set('fs.s3a.awsSecretAccessKey', 'secret')
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-1.amazonaws.com')
hadoopConf.set('fs.s3a.multipart.size', '104857600')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider')
spark = SparkSession(scT)
df = spark.read.json('s3a://mybucket/something_something.json_lines.gz')
Ignoring the little detail that you have set the username and password in the wrong properties for the s3a connector, that stack trace implies its from thread pool construction. Presumably one of the parameters passed in (thread pool size, keep alive time. is somehow invalid. No obvious cue as to which specific option is provided by the JVM though.
My recommendation is to stop copying and pasting other stack overflow examples and look at the s3a documentation. See what the options are for authentication and then for bounded and unbounded thread pools and make sure they're set
I got the same problem as you, and I figure out this was caused by the code you configure "fs.s3a.multipart.size". I removed it and the problem has gone. You could try it.

Pyspark.ml - Error when loading model and Pipeline

I want to import a trained pyspark model (or pipeline) into a pyspark script. I trained a decision tree model like so:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
# Create assembler and labeller for spark.ml format preperation
assembler = VectorAssembler(inputCols = requiredFeatures, outputCol = 'features')
label_indexer = StringIndexer(inputCol='measurement_status', outputCol='indexed_label')
# Apply transformations
eq_df_labelled = label_indexer.fit(eq_df).transform(eq_df)
eq_df_labelled_featured = assembler.transform(eq_df_labelled)
# Split into training and testing datasets
(training_data, test_data) = eq_df_labelled_featured.randomSplit([0.75, 0.25])
# Create a decision tree algorithm
dtree = DecisionTreeClassifier(
labelCol ='indexed_label',
featuresCol = 'features',
maxDepth = 5,
minInstancesPerNode=1,
impurity = 'gini',
maxBins=32,
seed=None
)
# Fit classifier object to training data
dtree_model = dtree.fit(training_data)
# Save model to given directory
dtree_model.save("models/dtree")
All of the code above works without any erros. The problem is, when I try to load this model (on the same or on another pyspark application), using:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
I get the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-b283bc2da75f> in <module>
2
3 imported_model = DecisionTreeClassifier()
----> 4 imported_model.load("models/dtree")
5
6 #lodel = DecisionTreeClassifier.load("models/dtree-test/")
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(cls, path)
328 def load(cls, path):
329 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 330 return cls.read().load(path)
331
332
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(self, path)
278 if not isinstance(path, basestring):
279 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 280 java_obj = self._jread.load(path)
281 if not hasattr(self._clazz, "_from_java"):
282 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
~/.local/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o39.load.
: java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1439)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.first(RDD.scala:1437)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I went for this approach because it also didnt work using a Pipeline object. Any ideas about what is happening?
UPDATE
I have realised that this error only occurs when I work with my Spark cluster (one master, two workers using Spark's standalone cluster manager). If I set Spark Session like so (where the master is set to the local one):
spark = SparkSession\
.builder\
.config(conf=conf)\
.appName("MachineLearningTesting")\
.master("local[*]")\
.getOrCreate()
I do not get the above error.
Also, I am using Spark 3.0.0, could it be that model importing and exporting in Spark 3 still has bugs?
There were two problems:
SSH authenticated communication must be enabled between all nodes in the cluster. Even though all nodes in my Spark cluster are in the same network, only the master had SSH authentication to the workers and not vise versa.
The model must be available to all nodes in the cluster. This may sound really obvious but I thought that the model files need to only be available to the master who then diffuses this to the worker nodes. In other words, when you load the model like so:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
The file /absoloute_path/models/dtree must exist on every machine in the cluster. This made me understand that in production contexts, the models are probably accessed via an external shared file system.
These two steps solved my problem of loading pyspark models into a Spark application running on a cluster.

Pyspark error - py4j.Py4JException: Method limit([class java.lang.String]) does not exist

When executing a code to get a spark dataframe from HDFS and then convert it to pandas dataframe,
spark_df = spark.read.parquet(*data_paths)
# other code in the process like filtering, groupby etc.
# ....
# write sparkdf to hadoop, get n rows if specified
if n:
spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
else:
spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
I get an error:
/home/sarah/anaconda3/envs/py27/lib/python2.7/site-packages/dspipeline/core/wf_spark.pyc in to_pd(spark_df, n, save_csv, csv_sep, csv_quote, quick)
215 # write sparkdf to hadoop, get n rows if specified
216 if n:
--> 217 spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
218 else:
219 spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
/opt/spark-2.3.0-SNAPSHOT-bin-spark-master/python/pyspark/sql/dataframe.py in limit(self, num)
472 []
473 """
--> 474 jdf = self._jdf.limit(num)
475 return DataFrame(jdf, self.sql_ctx)
476
/opt/spark-2.3.0-SNAPSHOT-bin-spark-master/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/spark-2.3.0-SNAPSHOT-bin-spark-master/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/spark-2.3.0-SNAPSHOT-bin-spark-master/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
321 raise Py4JError(
322 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 323 format(target_id, ".", name, value))
324 else:
325 raise Py4JError(
Py4JError: An error occurred while calling o1086.limit. Trace:
py4j.Py4JException: Method limit([class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
As I found the function limit(num) in pyspark documentation, I guess the reason is that I'm not correctly using it. Any help?
The exception is pretty clear here:
Method limit([class java.lang.String]) does not exist
n you are trying to pass to limit is not an int but a str.
You should go back to the point where n is defined, and fix it.
There exits .limit method for DataFrame, if you want to get the n rows from a DataFrame, you can use .limit(n) method, but parameter n is must be integer.
example:
df.limit(10)
if you use the other param like df.limit('10'), an error will occurre:
py4j.Py4JException: Method limit([class java.lang.String]) does not exist.
If you pass a number greater than 2,147,483,647 you will get the same error because limit expects an IntegerType.
IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.

mllib KernelDensity error

I'm trying to use pyspark.mllib.stat.KernelDensity this way:
data = sc.parallelize([0, 1, 2, 2, 1, 1, 1, 1, 1, 2, 0, 0])
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3)
densities = kd.estimate([-1.0, 2.0, 5.0])
but eventually get this error:
--------------------------------------------------------------------------- Py4JError Traceback (most recent call
last) in ()
8
9 # Find density estimates for the given values
---> 10 densities = kd.estimate([-1.0, 2.0, 5.0])
/home/user10215193/anaconda3/lib/python3.6/site-packages/pyspark/mllib/stat/KernelDensity.py
in estimate(self, points)
56 points = list(points)
57 densities = callMLlibFunc(
---> 58 "estimateKernelDensity", self._sample, self._bandwidth, points)
59 return np.asarray(densities)
/home/user10215193/anaconda3/lib/python3.6/site-packages/pyspark/mllib/common.py
in callMLlibFunc(name, *args)
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
130 print(api)
--> 131 return callJavaFunc(sc, api, *args)
132
133
/home/user10215193/anaconda3/lib/python3.6/site-packages/pyspark/mllib/common.py
in callJavaFunc(sc, func, *args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124
125
/home/user10215193/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py
in call(self, *args) 1131 answer =
self.gateway_client.send_command(command) 1132 return_value
= get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args:
/home/user10215193/anaconda3/lib/python3.6/site-packages/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
321 raise Py4JError(
322 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 323 format(target_id, ".", name, value))
324 else:
325 raise Py4JError(
Py4JError: An error occurred while calling o19.estimateKernelDensity.
Trace: py4j.Py4JException: Method estimateKernelDensity([class
org.apache.spark.api.java.JavaRDD, class java.lang.Integer, class
java.util.ArrayList]) does not exist at
py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at
py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:214) at
java.lang.Thread.run(Thread.java:748)
I couldn't find anything similar here so if somebody can help me with this I would much appreciate it.
You have to be careful about the types:
bandwidth has to be float
sample has to be RDD[float]
So replace your code with:
kd.setSample(data.map(float))
kd.setBandwidth(3.0)
densities = kd.estimate([-1.0, 2.0, 5.0])
and you'll be fine.

Resources