Things don't work as described in the official overview.ipynb.
I am trying to build a model in local mode on pyspark.
import pyspark
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassificationModel, LightGBMClassifier
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.getOrCreate()
##build modeling data
non_feats = [i for i in df.columns if '__val_col' in i.lower()]
non_feats.extend(['index','label','fold'])
feature_cols = [i for i in df.columns if i not in non_feats]
featurizer = VectorAssembler(
inputCols = feature_cols,
outputCol = 'features',handleInvalid='keep')
spark_df = df.to_spark()  # will not run on koalas or pandas-on-PySpark dataframes; this step is necessary
modeling_data = featurizer.transform(spark_df)
##build the model
model=LightGBMClassifier(objective='binary',isUnbalance=True,metric='auc',
earlyStoppingRound=10,featuresShapCol='shap',validationIndicatorCol='__val_col0',labelCol='label',
)
Post this I am able to train the model successfully (I think!)
model.fit(dataset=modeling_data)
output: LightGBMClassifier_867a1dd2276d
Issues:
There is no saveNativeModel attribute available with model.
model.saveNativeModel("./lgbmclassifier.model")
output: AttributeError: 'LightGBMClassifier' object has no attribute 'saveNativeModel'
Model does not load post saving it with .save (this is the only save attribute available)
model.save("./lgbmclassifier_test.model")
LightGBMClassificationModel.loadNativeModelFromFile("./lgbmclassifier_test.model")
output:
[LightGBM] [Fatal] Model file doesn't specify the number of classes
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/tmp/ipykernel_268437/1849142297.py in <module>
----> 1 LightGBMClassificationModel.loadNativeModelFromFile("./lgbmclassifier_test.model")
/tmp/spark-5794a877-8405-42db-b23c-33bfa7d0eacd/userFiles-9a0cfba3-8761-4c76-9d83-ae455b04cb37/com.microsoft.azure_synapseml-lightgbm_2.12-0.9.4.jar/synapse/ml/lightgbm/LightGBMClassificationModel.py in loadNativeModelFromFile(filename)
19 ctx = SparkContext._active_spark_context
20 loader = ctx._jvm.com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel
---> 21 java_model = loader.loadNativeModelFromFile(filename)
22 return JavaParams._from_java(java_model)
23
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/java_gateway.py in __call__(self, *args)
1307
1308 answer = self.gateway_client.send_command(command)
-> 1309 return_value = get_return_value(
1310 answer, self.gateway_client, self.target_id, self.name)
1311
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/miniconda3/envs/pyspark/lib/python3.9/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.loadNativeModelFromFile.
: java.lang.Exception: Booster LoadFromString call failed in LightGBM with error: Model file doesn't specify the number of classes
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:24)
at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler$.com$microsoft$azure$synapse$ml$lightgbm$booster$BoosterHandler$$createBoosterPtrFromModelString(LightGBMBooster.scala:46)
at com.microsoft.azure.synapse.ml.lightgbm.booster.BoosterHandler.<init>(LightGBMBooster.scala:63)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler$lzycompute(LightGBMBooster.scala:236)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.boosterHandler(LightGBMBooster.scala:230)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.numClasses$lzycompute(LightGBMBooster.scala:500)
at com.microsoft.azure.synapse.ml.lightgbm.booster.LightGBMBooster.numClasses(LightGBMBooster.scala:500)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel$.loadNativeModelFromFile(LightGBMClassifier.scala:199)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel.loadNativeModelFromFile(LightGBMClassifier.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Env information:
SynapseML Version: How do I check the version? There is no .version on synapse when I import it, but I guess it is 0.9.4 since that is the jar I provide ("com.microsoft.azure:synapseml_2.12:0.9.4") while creating the context.
Spark Version: 3.2.0
Spark Platform: local[*]
Can you please help me understand what to do after training the model? I am looking for save, load and predict capabilities.
You have to save on the filesystem and then move to the sandbox:
from azure.storage.filedatalake import DataLakeServiceClient
gbm.save_model('/tmp/%s'%(gbm_name))
storage_account_name = "name"
storage_account_key = "key"
container_name = "container_name"
directory_name = "ml_path"
service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
"https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
dir_client = file_system_client.get_directory_client(directory_name)
dir_client.create_directory()
local_path = '/tmp/%s'%(gbm_name)
file_client = dir_client.create_file(gbm_name)
with open(local_path, 'rb') as f:
    file_client.upload_data(f.read(), overwrite=True)
Then you can load and score with mmlspark.
from pyspark.ml.feature import VectorAssembler
from mmlspark.lightgbm import LightGBMClassificationModel
gbm_spark = LightGBMClassificationModel.loadNativeModelFromFile('directory_name/gbm_name')
featurizer = VectorAssembler(
handleInvalid = "keep",
inputCols=feature_cols,
outputCol='features'
)
score_data= featurizer.transform(var_score)
scored_data = gbm_spark.transform(score_data)
I hope it helps you.
I have a bunch of code that works perfectly with s3n, but when I try to switch to s3a I just get a java.lang.IllegalArgumentException with no real pointer or hint as to what exactly is wrong. I would appreciate some suggestions for debugging! I'm on hadoop-aws-2.7.3 and aws-java-sdk-1.7.4, so I believe that should be fine.
error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-1aafd157ea37> in <module>
----> 1 schema_df = spark.read.json('s3a://udemy-stream-logs/cdn-access-raw/verizon/mp4-a.udemycdn.com/wpc_C9216_306_20200701_0C390000BFD7B55E_100.json_lines.gz')
2 schema = schema_df.schema
/usr/local/spark/python/pyspark/sql/readwriter.py in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, samplingRatio, dropFieldIfAllNull, encoding, locale, pathGlobFilter, recursiveFileLookup)
298 path = [path]
299 if type(path) == list:
--> 300 return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
301 elif isinstance(path, RDD):
302 def func(iterator):
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o31.json.
: java.lang.IllegalArgumentException
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1293)
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1215)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:477)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
my code:
conf = (SparkConf()
.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.master', 'local[*]')
.set('spark.driver.memory', '4g'))
scT = SparkContext(conf=conf)
scT.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
scT.setLogLevel("INFO")
hadoopConf = scT._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3.buffer.dir', '/tmp/pyspark')
hadoopConf.set('fs.s3a.awsAccessKeyId', 'key')
hadoopConf.set('fs.s3a.awsSecretAccessKey', 'secret')
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-1.amazonaws.com')
hadoopConf.set('fs.s3a.multipart.size', '104857600')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider')
spark = SparkSession(scT)
df = spark.read.json('s3a://mybucket/something_something.json_lines.gz')
Ignoring the little detail that you have set the username and password in the wrong properties for the s3a connector, that stack trace implies the failure comes from thread pool construction. Presumably one of the parameters passed in (thread pool size, keep-alive time) is somehow invalid. The JVM gives no obvious cue as to which specific option it is, though.
My recommendation is to stop copying and pasting other Stack Overflow examples and look at the s3a documentation. See what the options are for authentication and then for bounded and unbounded thread pools, and make sure they're set.
I had the same problem as you, and I figured out that it was caused by the line that configures "fs.s3a.multipart.size". I removed it and the problem went away. You could try that.
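To narrow down which setting trips the thread pool constructor, the argument checks that `java.util.concurrent.ThreadPoolExecutor` performs can be mirrored by hand. The sketch below restates those documented conditions in Python. The function name and the sample values are illustrative only, and which `fs.s3a.*` options feed each argument depends on the Hadoop version, so that mapping is not attempted here:

```python
def thread_pool_arg_errors(core_pool_size, maximum_pool_size, keep_alive_time):
    """Return the ThreadPoolExecutor constructor conditions that are violated.

    The JDK constructor throws IllegalArgumentException if any of these hold.
    """
    errors = []
    if core_pool_size < 0:
        errors.append("corePoolSize < 0")
    if keep_alive_time < 0:
        errors.append("keepAliveTime < 0")
    if maximum_pool_size <= 0:
        errors.append("maximumPoolSize <= 0")
    if maximum_pool_size < core_pool_size:
        errors.append("maximumPoolSize < corePoolSize")
    return errors


# Example values only: a valid combination and one with max < core.
ok = thread_pool_arg_errors(core_pool_size=10, maximum_pool_size=10, keep_alive_time=60)
bad = thread_pool_arg_errors(core_pool_size=15, maximum_pool_size=10, keep_alive_time=60)
```

As for the credentials mentioned in the first answer, the s3a connector reads them from `fs.s3a.access.key` and `fs.s3a.secret.key`.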
I want to import a trained pyspark model (or pipeline) into a pyspark script. I trained a decision tree model like so:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
# Create assembler and labeller for spark.ml format preparation
assembler = VectorAssembler(inputCols = requiredFeatures, outputCol = 'features')
label_indexer = StringIndexer(inputCol='measurement_status', outputCol='indexed_label')
# Apply transformations
eq_df_labelled = label_indexer.fit(eq_df).transform(eq_df)
eq_df_labelled_featured = assembler.transform(eq_df_labelled)
# Split into training and testing datasets
(training_data, test_data) = eq_df_labelled_featured.randomSplit([0.75, 0.25])
# Create a decision tree algorithm
dtree = DecisionTreeClassifier(
labelCol ='indexed_label',
featuresCol = 'features',
maxDepth = 5,
minInstancesPerNode=1,
impurity = 'gini',
maxBins=32,
seed=None
)
# Fit classifier object to training data
dtree_model = dtree.fit(training_data)
# Save model to given directory
dtree_model.save("models/dtree")
All of the code above works without any errors. The problem is that when I try to load this model (in the same or in another PySpark application) using:
from pyspark.ml.classification import DecisionTreeClassifier
imported_model = DecisionTreeClassifier()
imported_model.load("models/dtree")
I get the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-b283bc2da75f> in <module>
2
3 imported_model = DecisionTreeClassifier()
----> 4 imported_model.load("models/dtree")
5
6 #lodel = DecisionTreeClassifier.load("models/dtree-test/")
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(cls, path)
328 def load(cls, path):
329 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 330 return cls.read().load(path)
331
332
~/.local/lib/python3.6/site-packages/pyspark/ml/util.py in load(self, path)
278 if not isinstance(path, basestring):
279 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 280 java_obj = self._jread.load(path)
281 if not hasattr(self._clazz, "_from_java"):
282 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
~/.local/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o39.load.
: java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1439)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.first(RDD.scala:1437)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I went for this approach because it also didn't work using a Pipeline object. Any ideas about what is happening?
UPDATE
I have realised that this error only occurs when I work with my Spark cluster (one master, two workers using Spark's standalone cluster manager). If I set Spark Session like so (where the master is set to the local one):
spark = SparkSession\
.builder\
.config(conf=conf)\
.appName("MachineLearningTesting")\
.master("local[*]")\
.getOrCreate()
I do not get the above error.
Also, I am using Spark 3.0.0, could it be that model importing and exporting in Spark 3 still has bugs?
There were two problems:
SSH authenticated communication must be enabled between all nodes in the cluster. Even though all nodes in my Spark cluster are in the same network, only the master had SSH authentication to the workers and not vice versa.
The model must be available to all nodes in the cluster. This may sound really obvious but I thought that the model files need to only be available to the master who then diffuses this to the worker nodes. In other words, when you load the model like so:
from pyspark.ml.classification import DecisionTreeClassificationModel
# load() is a class method; a fitted model is read back with the *Model class
imported_model = DecisionTreeClassificationModel.load("models/dtree")
The file /absolute_path/models/dtree must exist on every machine in the cluster. This made me understand that in production contexts, the models are probably accessed via an external shared file system.
These two steps solved my problem of loading pyspark models into a Spark application running on a cluster.
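The second point can be guarded against in code by refusing bare local paths when the master is not local. A rough sketch using only the standard library follows. The helper name, the set of schemes treated as shared, and the example URIs are assumptions made for illustration. It is deliberately stricter than the fix above: a local path copied to every node also works, but this check cannot see that.

```python
from urllib.parse import urlparse

# URI schemes assumed to point at storage every node can reach (illustrative list).
SHARED_SCHEMES = {"hdfs", "s3a", "abfs", "abfss", "gs"}


def model_path_is_safe(model_path, master):
    """Return True if `model_path` should be readable wherever the load runs.

    With a local master everything runs on one machine, so any path works.
    On a cluster, only paths on a shared file system are accepted.
    """
    if master.startswith("local"):
        return True
    return urlparse(model_path).scheme in SHARED_SCHEMES


# Hypothetical master URL and paths, for demonstration only.
local_ok = model_path_is_safe("models/dtree", "local[*]")
cluster_relative = model_path_is_safe("models/dtree", "spark://master:7077")
cluster_shared = model_path_is_safe("hdfs://namenode:8020/models/dtree", "spark://master:7077")
```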
I am trying to load a bunch of CSV files row by row into a MySQL instance running on OpenShift, using PySpark. I have a Jupyter notebook with Spark up and running.
Below is the code I have. It fails with a driver-related error:
Py4JJavaError: An error occurred while calling o89.save.
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
if __name__ == '__main__':
    scSpark = SparkSession \
        .builder \
        .appName("reading csv") \
        .getOrCreate()
data_file = '/opt/app-root/src/data/train.psv'
sdfData = scSpark.read.csv(data_file, header=True, sep="|").cache()
print('Total Records = {}'.format(sdfData.count()))
sdfData.show()
sdfData.registerTempTable("train")
output = scSpark.sql('SELECT count(*) from train')
output.show()
+--------+
|count(1)|
+--------+
| 1168686|
+--------+
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages mysql:mysql-connector-java:jar:8.0.21 pyspark-shell'
output = scSpark.sql('SELECT * from train')
output.show()
output.write.format('jdbc').options(
url='jdbc:mysql://mysql-1-28d85/sepsis',
driver='com.mysql.jdbc.Driver',
#driver='mysql-connector-java.Driver',
#driver='org.mysql.jdbc.Driver',
dbtable='train',
user='sepsis',
password='Success_2020').mode('append').save()
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-57-114af97e0442> in <module>
11 dbtable='train',
12 user='sepsis',
---> 13 password='Success_2020').mode('append').save()
/opt/app-root/lib/python3.6/site-packages/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
735 self.format(format)
736 if path is None:
--> 737 self._jwrite.save()
738 else:
739 self._jwrite.save(path)
/opt/app-root/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/app-root/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/app-root/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o1641.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:190)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:194)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I changed the code to include the package.
Also, this is OpenShift, where all components run as pods with no access to the outside environment.
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
That says it all. You have to start pyspark (or the environment) with the JDBC driver for MySQL using --driver-class-path or similar (that will be specific to Jupyter).
For Jupyter Notebook
Copying from PySpark in Jupyter Notebook — Working with Dataframe & JDBC Data Sources:
If you use Jupyter Notebook, you should set the PYSPARK_SUBMIT_ARGS environment variable, as following:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
Change the --packages to reference the MySQL JDBC driver.
In Spark's installation directory there is a jars folder. Download the MySQL JDBC jar file and place it in that folder; then you don't need any extra options on the command line or in code.
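Two details matter for the `PYSPARK_SUBMIT_ARGS` route: `--packages` expects Maven coordinates of the form `groupId:artifactId:version`, and the variable has to be set before the first SparkSession (and with it the JVM) is created in the notebook. A small sketch of building the value; the helper name is hypothetical, and the coordinate simply reuses the connector version mentioned in the question:

```python
import os


def submit_args_for_packages(*coordinates):
    """Build a PYSPARK_SUBMIT_ARGS value from Maven coordinates.

    Each coordinate must be `groupId:artifactId:version` (three parts).
    """
    for coord in coordinates:
        parts = coord.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError("expected groupId:artifactId:version, got %r" % coord)
    return "--packages %s pyspark-shell" % ",".join(coordinates)


# Set this before creating the SparkSession; it is read when the JVM starts.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_for_packages(
    "mysql:mysql-connector-java:8.0.21"
)
```

With Connector/J 8.x the driver class is `com.mysql.cj.jdbc.Driver`. In an environment without outbound network access `--packages` has nowhere to download the artifact from, which is where placing the jar in Spark's jars folder comes in.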
While running my Spark program in a Jupyter notebook I got the error "Job cancelled because SparkContext was shut down". I am using Spark without Hadoop. The same program produced output earlier but now shows this error. Any idea why the error might have occurred?
My code is:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("Musical_Instruments_5.json")
pd=df.select(df['asin'],df['overall'],df['reviewerID'])
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for
column in list(set(pd.columns)-set(['overall'])) ]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)
transformed.show()
(training,test)=transformed.randomSplit([0.8, 0.2])
als=ALS(maxIter=30,regParam=0.09,rank=25,userCol="reviewerID_index",itemCol="asin_index",ratingCol="overall",coldStartStrategy="drop",nonnegative=True)
model=als.fit(training)
This is the point where it gives the error.
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-2e31692d867d> in <module>()
1 #Fit ALS model to training data
----> 2 model=als.fit(training)
C:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\ml\base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
C:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\ml\wrapper.py in _fit(self, dataset)
286
287 def _fit(self, dataset):
--> 288 java_model = self._fit_java(dataset)
289 model = self._create_model(java_model)
290 return self._copyValues(model)
C:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\ml\wrapper.py in _fit_java(self, dataset)
283 """
284 self._transfer_params_to_java()
--> 285 return self._java_obj.fit(dataset._jdf)
286
287 def _fit(self, dataset):
C:\spark\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
C:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
C:\spark\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o132.fit.
: org.apache.spark.SparkException: Job 11 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1841)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1754)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
at org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:1030)
at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:674)
at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:568)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
This problem is solved now. I had to set a checkpoint directory because training ran for more than 20 iterations.
The code for setting the checkpoint directory (called on the SparkContext instance, here sc) is:
sc.setCheckpointDir("path to directory")
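A minimal sketch of how this might be wired up, assuming a live SparkContext and local mode. The helper name `ensure_checkpoint_dir` and the example directory are illustrative and not part of the original answer; the only Spark call used is `setCheckpointDir`.

```python
import os


def ensure_checkpoint_dir(sc, path):
    """Create `path` if needed and register it as the checkpoint directory.

    `sc` is expected to be a pyspark SparkContext; any object exposing
    setCheckpointDir(dirName) works, which keeps the helper easy to try out.
    """
    abs_path = os.path.abspath(path)
    # Assumption: local mode, so the checkpoint directory is a local folder.
    os.makedirs(abs_path, exist_ok=True)
    sc.setCheckpointDir(abs_path)
    return abs_path


# Hypothetical usage, before fitting the ALS model:
#   from pyspark import SparkContext
#   sc = SparkContext("local", "MyApp")
#   ensure_checkpoint_dir(sc, "/tmp/spark-checkpoints")
```

On a cluster the directory would normally live on a shared filesystem such as HDFS, in which case the `os.makedirs` step does not apply.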
import os
import sys
os.chdir("/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/bin")
os.curdir
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7'
SPARK_HOME = os.environ['SPARK_HOME']
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.1-src.zip"))
from pyspark import SparkContext
from pyspark import SparkConf
# Optionally configure Spark Settings
conf=SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.cores.max", "2")
conf.setAppName("V2 Maestros")
## Initialize SparkContext. Run only once. Otherwise you get multiple
#Context Error.
sc = SparkContext('local', conf=conf)
#Test to make sure everything works.
lines=sc.textFile("auto-data.csv")
lines.count()
This is the error that occurred. It was a simple program counting the number of lines in the file, but this error came up. I have kept the file in both of the locations mentioned in the code, but the result is the same.
Py4JJavaError Traceback (most recent call last)
<ipython-input-6-5c9242495358> in <module>()
1 lines = sc.textFile("auto-save.csv")
----> 2 lines.count()
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in count(self)
1006 3
1007 """
-> 1008 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
1009
1010 def stats(self):
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in sum(self)
997 6.0
998 """
--> 999 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
1000
1001 def count(self):
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in fold(self, zeroValue, op)
871 # zeroValue provided to each partition is unique from the one provided
872 # to the final reduce call
--> 873 vals = self.mapPartitions(func).collect()
874 return reduce(op, vals, zeroValue)
875
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.pyc in collect(self)
774 """
775 with SCCallSiteSync(self.context) as css:
--> 776 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
777 return list(_load_from_socket(port, self._jrdd_deserializer))
778
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/auto-save.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
I faced the same error and solved it. It happens if you configure the Spark context with more cores as workers than your system supports. For example, I have a 3-core system, so the code below does not work for me because I don't have a 4th core.
Unsupported SparkContext configuration, for which I got the Py4JJavaError:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Collinear Points").setMaster("local[4]") #Initialize spark context using 4 local cores as workers
sc = SparkContext(conf=conf)
from pyspark.rdd import RDD
Supported SparkContext configuration for all types of systems, because here we do not explicitly request cores as workers:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Collinear Points")
sc = SparkContext('local',conf=conf)
from pyspark.rdd import RDD
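If the cause really is an oversized local[N] request, as this answer suggests, one way to keep an explicit core count while staying within the machine's limits is to cap it. This is only a sketch: `local_master` is a hypothetical helper, not a Spark API. It relies on the standard `local[N]` and `local[*]` master URL forms.

```python
import os


def local_master(requested_cores=None):
    """Build a local-mode master URL that never asks for more worker
    threads than the number of cores the machine reports."""
    if requested_cores is None:
        # local[*] lets Spark use as many threads as there are cores.
        return "local[*]"
    available = os.cpu_count() or 1
    capped = max(1, min(requested_cores, available))
    return "local[{}]".format(capped)


# Hypothetical usage:
#   conf = SparkConf().setAppName("Collinear Points").setMaster(local_master(4))
#   sc = SparkContext(conf=conf)
```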
You should read your input as
lines=sc.textFile("hdfs:///tmp/auto-data.csv")
or just
lines=sc.textFile("/tmp/auto-data.csv")
The first command reads the file from HDFS; the second resolves the path against the default filesystem.
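To see which filesystem a given string points at before handing it to sc.textFile, one can inspect its URI scheme. A small sketch using only the standard library; `path_scheme` is a hypothetical helper, not a Spark API:

```python
from urllib.parse import urlparse


def path_scheme(path):
    """Return the URI scheme of `path`, or 'default' when none is given.

    A path without a scheme is resolved by Spark/Hadoop against whatever
    default filesystem is configured.
    """
    scheme = urlparse(path).scheme
    return scheme if scheme else "default"


for p in ("hdfs:///tmp/auto-data.csv", "/tmp/auto-data.csv", "file:///tmp/auto-data.csv"):
    print(p, "->", path_scheme(p))
```

Writing the scheme out explicitly (hdfs:// or file://) removes the dependence on the configured default.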
The exception is self-explanatory. Try giving the absolute path of auto-save.csv in
lines=sc.textFile("auto-data.csv") or move auto-save.csv to /home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/, the directory named in the exception:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hp/Downloads/spark-2.0.0-bin-hadoop2.7/auto-save.csv
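A sketch of the absolute-path suggestion, assuming local mode on a POSIX system. `to_local_uri` is a hypothetical helper introduced here for illustration. It checks that the file exists on the Python side and builds an explicit file:// URI, so the result does not depend on the working directory of the Spark process.

```python
import os


def to_local_uri(filename, base_dir=None):
    """Return a file:// URI for `filename`, failing early if it is missing.

    `base_dir` defaults to the current working directory of the Python
    process. POSIX-style paths are assumed.
    """
    base = base_dir if base_dir is not None else os.getcwd()
    abs_path = os.path.abspath(os.path.join(base, filename))
    if not os.path.isfile(abs_path):
        raise FileNotFoundError("Input path does not exist: " + abs_path)
    return "file://" + abs_path


# Hypothetical usage:
#   lines = sc.textFile(to_local_uri("auto-data.csv", base_dir="/home/hp/Downloads"))
#   lines.count()
```

Failing in Python before calling Spark gives a short error instead of the long Py4J traceback.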