py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv

py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv - apache-spark

I'm new to pyspark. I'm running pyspark in the local machine. I'm trying write to CSV file from pyspark data frame. So I wrote the following code
dataframe.write.mode('append').csv(outputPath)
But I'm getting an error message
Traceback (most recent call last):
File "D:\PycharmProjects\pythonProject\org\spark\weblog\SparkWebLogsAnalysis.py", line 71, in <module>
weblog_sessionIds.write.mode('append').csv(outputPath)
File "C:\spark-3.1.2-bin-hadoop3.2\python\pyspark\sql\readwriter.py", line 1372, in csv
self._jwrite.csv(path)
File "C:\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
File "C:\spark-3.1.2-bin-hadoop3.2\python\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "C:\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode(NativeIO.java:560)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:587)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:586)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:354)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:178)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
Can you suggest me to rectify this error?

Problem got resolve by deleting hadoop.dll file from winutils folder and using lower version of Spark

Related

Pandas UDF for pyspark - Package not found error

I am using the pandas UDF approach to scale my models. However, I am getting an error with the pmdarima package not found. The code works fine till I run it on my notebook on the pandas dataframe itself. So the package is available for use in the notebook. From few answers online, the error seems in package not being available on the worker nodes where the code is trying to parallelize. Can someone help on how to resolve this? How can I also install the package on my worker nodes, if that's the case.
FYI - I am working on Azure Databricks.
def funct1(grp_keys, df):
other statements
model = pm.auto_arima(train_data['sum_hlqty'],X=x,
test='adf',trace=False,
maxiter = 12,max_p=5,max_q=5,
njobs=-1)
forecast_df = sales.groupby('Col1','Col2').applyInPandas(funct1,schema="C1 string, C2 string, C3 date, C4 float, C5 float")
Py4JJavaError: An error occurred while calling o256.sql.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:230)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$5(TransactionalWriteEdge.scala:183)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$1(TransactionalWriteEdge.scala:135)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:431)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:19)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:19)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:412)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:338)
at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:19)
at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:56)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:129)
at com.databricks.spark.util.UsageLogger.recordOperation(UsageLogger.scala:71)
at com.databricks.spark.util.UsageLogger.recordOperation$(UsageLogger.scala:58)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:85)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:401)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:380)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:108)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:94)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.writeFiles(TransactionalWriteEdge.scala:92)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.writeFiles$(TransactionalWriteEdge.scala:88)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:112)
at com.databricks.sql.transaction.tahoe.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:111)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.write(WriteIntoDelta.scala:112)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.$anonfun$run$2(WriteIntoDelta.scala:71)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.$anonfun$run$2$adapted(WriteIntoDelta.scala:70)
at com.databricks.sql.transaction.tahoe.DeltaLog.withNewTransaction(DeltaLog.scala:203)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:70)
at com.databricks.sql.acl.CheckPermissions$.trusted(CheckPermissions.scala:1128)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.run(WriteIntoDelta.scala:69)
at com.databricks.sql.transaction.tahoe.catalog.WriteIntoDeltaBuilder$$anon$1.insert(DeltaTableV2.scala:193)
at org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1(V1FallbackWriters.scala:118)
at org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1$(V1FallbackWriters.scala:116)
at org.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.writeWithV1(V1FallbackWriters.scala:38)
at org.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.run(V1FallbackWriters.scala:44)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:234)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3709)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3707)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:234)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:680)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:675)
at sun.reflect.GeneratedMethodAccessor655.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 7774.0 failed 4 times, most recent failure: Lost task 98.3 in stage 7774.0 (TID 177293, 10.240.138.10, executor 133): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
**ModuleNotFoundError: No module named 'pmdarima''** Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'pmdarima'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 638, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/databricks/spark/python/pyspark/worker.py", line 438, in read_udfs
arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
File "/databricks/spark/python/pyspark/worker.py", line 255, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/databricks/spark/python/pyspark/worker.py", line 75, in read_command
command = serializer._read_with_length(file)
File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
**ModuleNotFoundError: No module named 'pmdarima'****strong text**

py4j.protocol.Py4JJavaError: An error occurred while calling showString

When we use pyspark to generate analytics, we find sometimes it generates error with no details to debug. How do we go about to identifying the issue and resolving it. This is the exception shown.
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 502, in show
File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1060.showString.
: org.apache.spark.util.SparkFatalException
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:183)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

java.io.IOException: No FileSystem for scheme: C and WinError 10054: An existing connection was forcibly closed by the remote host

I was trying to Connect and Fetch data from BigQuery Dataset to Local Pycharm Using Pyspark.
I ran this below Script in Pycharm:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config('spark.jars', "C:/Users/PycharmProjects/pythonProject/spark-bigquery-latest.jar")\
.getOrCreate()
conn = spark.read.format("bigquery")\
.option("credentialsFile", "C:/Users/PycharmProjects/pythonProject/google-bq-api.json")\
.option("parentProject", "Google-Project-ID")\
.option("project", "Dataset-Name")\
.option("table", "dataset.schema.tablename")\
.load()
conn.show()
For this I got the below error:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: C
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:191)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:363)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:363)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:\Users\naveen.chandar\PycharmProjects\pythonProject\BigQueryConnector.py", line 4, in <module>
spark = SparkSession.builder.config('spark.jars', 'C:/Users/naveen.chandar/PycharmProjects/pythonProject/spark-bigquery-latest.jar').getOrCreate()
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\sql\session.py", line 186, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\context.py", line 376, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\context.py", line 325, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
So, I researched and tried it from different Diecrtory like "D-drive" and also tried to fix a static port with set PYSPARK_SUBMIT_ARGS="--master spark://<IP_Address>:<Port>", but still I got the same error in Pycharm.
Then I thought of trying the same script in local Command Prompt under Pyspark and I got this error:
failed to find class org/conscrypt/CryptoUpcalls
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1152, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 381, in show
print(self._jdf.showString(n, 20, vertical))
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "D:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o42.showString
My Python Version is 3.7.9 and Spark Version is 2.4.7
So either way I ran out of idea's and I appreciate some help on any one of the situation I facing...
Thanks In Advance!!

Start your file system references with file:///c:/...

You need to replace / with \ for the path to work

Tracker issue with XGBoost in PySpark

I am using XGBoost in PySpark using by placing these two jars xgboost4j and xgboost4j-spark in $SPARK_HOME/jars folder.
When I try to fit the XGBoostClassifier model I get an error with the following message
py4j.protocol.Py4JJavaError: An error occurred while calling o413.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
I looked for the tracker in trace and noticed that it is not binding to the localhost. This is the tracker info
Tracker started, with env={}
I am using a Mac and so checked for the /etc/hosts file
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost.localdomain localhost
255.255.255.255 broadcasthost
::1 localhost
127.0.0.1 myusername
Everything looks fine in the hosts file.
Any idea why the tracker is failing to initialise properly?
Error trace
Tracker started, with env={}
2019-01-07 12:50:19 ERROR RabitTracker:91 - Uncaught exception thrown by worker:
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anon$1.run(XGBoost.scala:233)
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/myusername/Downloads/ml_project/ml_project/features/variable_selection.py", line 130, in fit
self.ttv.fit(target_col, X, test=None, validation=None)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 636, in fit
upper_bounds, self.curr_model_num_iter, is_integer_variable)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 253, in model_tuner
num_iter, is_integer_variable, random_state=42)
File "/Users/myusername/Downloads/ml_project/ml_project/models/hyperparam_optimizers.py", line 200, in aml_forest_maximize
is_integer_variable, random_state=random_state)
File "/Users/myusername/Downloads/ml_project/ml_project/models/hyperparam_optimizers.py", line 179, in aml_forest_minimize
return forest_minimize(objective_calculator, space, n_calls=num_iter, random_state=random_state, n_random_starts=n_random_starts, base_estimator="RF", n_jobs=-1)
File "/usr/local/lib/python3.7/site-packages/skopt/optimizer/forest.py", line 161, in forest_minimize
callback=callback, acq_optimizer="sampling")
File "/usr/local/lib/python3.7/site-packages/skopt/optimizer/base.py", line 248, in base_minimize
next_y = func(next_x)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 487, in objective_calculator
model_fit = init_model.fit(train) # fit model
File "/Users/myusername/Downloads/spark/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/Users/myusername/Downloads/spark/python/pyspark/ml/wrapper.py", line 288, in _fit
java_model = self._fit_java(dataset)
File "/Users/myusername/Downloads/spark/python/pyspark/ml/wrapper.py", line 285, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/Users/myusername/Downloads/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/Users/myusername/Downloads/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/myusername/Downloads/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o413.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:283)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:240)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:222)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:221)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Try to add xgboost-tracker.properties file in the folder with your jar files with the following content:
host-ip=0.0.0.0
XGBoost github
Another option is to unzip xgboost4j jar file using command:
jar xf xgboost4j-0.72.jar
You can modify tracker.py file manually and add the corrected file back to jar using
jar uf xgboost4j-0.72.jar tracker.py

Spark 2.0 with Zeppelin 0.6.1 - SQLContext not available

I am running spark 2.0 and zeppelin-0.6.1-bin-all on a Linux server. The default spark notebook runs just fine, but when I try to create and run a new notebook in pyspark using sqlContext I get the error "py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist"
I tried running a simple code,
%pyspark
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
print type(wordsDF)
wordsDF.printSchema()
I get the error,
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 266, in
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 259, in
exec(code)
File "", line 1, in
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 299, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o48.createDataFrame. Trace:
py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
When I try the same code with "sqlContext = SQLContext(sc)" it works just fine.
I have tried setting the interpreter "zeppelin.spark.useHiveContext false" configuration but it did not work.
I must obviously be missing something since this is such a simple operation. Please advice if there is any other configuration to be set or what I am missing.
I tested the same piece of code with Zeppelin 0.6.0 and it is working fine.

SparkSession is the default entry-point for Spark 2.0.0, which is mapped to spark in Zeppelin 0.6.1 (as it is in the Spark shell). Have you tried spark.createDataFrame(...)?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv - apache-spark

Problem got resolve by deleting hadoop.dll file from winutils folder and using lower version of Spark

Related

Pandas UDF for pyspark - Package not found error

py4j.protocol.Py4JJavaError: An error occurred while calling showString

java.io.IOException: No FileSystem for scheme: C and WinError 10054: An existing connection was forcibly closed by the remote host

Tracker issue with XGBoost in PySpark

Spark 2.0 with Zeppelin 0.6.1 - SQLContext not available

Categories

Resources