perhaps there's someone out there that can help me.
I'm trying to read data from ES using PySpark. My Jupyter Notebook code is pretty simple:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077')
sc = pyspark.SparkContext(conf=conf)
es_rdd = sc.newAPIHadoopRDD(
inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={
"es.resource": "some-log/doc",
"es.nodes": "192.168.1.25",
"es.port": "9200"
})
I have Spark and Jupyter Notebook installed on the host running the NB. The spark-defaults.conf file is loading the "elasticsearch-hadoop-6.4.0.jar" via: spark.jars /opt/maya/es-hadoop/elasticsearch-hadoop-6.4.0.jar
I can connect to the ES instance and read from it by using other tools like elasticsearch-py, the Test app shows up in the Spark Master UI. However, when I execute the code above, I keep getting this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-5-c990f37c388b> in <module>
6 "es.resource": "logs-dfir-winevent-security-*/doc",
7 "es.nodes": "192.168.248.131",
----> 8 "es.port": "9200"
9 })
10 #es_rdd.first()
/opt/anaconda/lib/python3.6/site-packages/pyspark/context.py in newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)
715 jrdd = self._jvm.PythonRDD.newAPIHadoopRDD(self._jsc, inputFormatClass, keyClass,
716 valueClass, keyConverter, valueConverter,
--> 717 jconf, batchSize)
718 return RDD(jrdd, self)
719
/opt/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:302)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:286)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I've searched and cannot see that the error is in the code itself, I'm having a feeling this issue is more related to bad Spark configuration within the host running the Jupyter Notebook. Any insight would be much appreciated!
Please refer to this question : pyspark: ship jar dependency with spark-submit
What you need to do is pass the jar of the dependency with the configuration. If you're using a Jupyter notebook, you can add it via SparkConf() such as :
conf = SparkConf().option('spark.driver.extraClassPath', 'full/path/to/jar')
Just change your code to :
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077').option('spark.driver.extraClassPath', 'full/path/to/jar')
Another method is:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
'--jars /full/path/to/your/jar.jar pyspark-shell'
jars could be download from https://www.elastic.co/downloads/hadoop
works on spark 2.3 and elasticsearch 6.4
Related
I have been trying to register spark with GeoSpark. I have installed apache sedona 3.1.3 version in python 3.7. Spark session has created using
#Import required libraries
import os
import folium
import geopandas as gpd
from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator, KryoSerializer
from geospark.register import upload_jars
#Generate spark session
upload_jars()
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config("spark.serializer", KryoSerializer.getName).\
config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName) .\
getOrCreate()
spark session:
SparkSession -
in-memory
SparkContext
Spark UI
Version
v3.1.3
Master
local[*]
AppName
TestApp
When I tried to register this spark session with geospark using command:
GeoSparkRegistrator.registerAll(spark)
, I'm getting error py4javaerror like this:
{
Py4JJavaError Traceback (most recent call last)
Input In [4], in <module>
----> 1 GeoSparkRegistrator.registerAll(spark)
File ~/anaconda3/envs/ox/lib/python3.10/site-packages/geospark/register/geo_registrator.py:24, in GeoSparkRegistrator.registerAll(cls, spark)
15 #classmethod
16 def registerAll(cls, spark: SparkSession) -> bool:
17 """
18 This is the core of whole package, It uses py4j to run wrapper which takes existing SparkSession
19 and register all User Defined Functions by GeoSpark developers, for this SparkSession.
(...)
22 :return: bool, True if registration was correct.
23 """
---> 24 spark.sql("SELECT 1 as geom").count()
25 PackageImporter.import_jvm_lib(spark._jvm)
26 cls.register(spark)
File ~/anaconda3/envs/ox/lib/python3.10/site-packages/pyspark/sql/dataframe.py:680, in DataFrame.count(self)
670 def count(self):
671 """Returns the number of rows in this :class:`DataFrame`.
672
673 .. versionadded:: 1.3.0
(...)
678 2
679 """
--> 680 return int(self._jdf.count())
File ~/anaconda3/envs/ox/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File ~/anaconda3/envs/ox/lib/python3.10/site-packages/pyspark/sql/utils.py:111, in capture_sql_exception.<locals>.deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
File ~/anaconda3/envs/ox/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o42.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:161)
at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:336)
at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:256)
at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:422)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:319)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1509)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5(DAGScheduler.scala:1274)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5$adapted(DAGScheduler.scala:1273)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1213)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.ClassNotFoundException: org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:209)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$7(KryoSerializer.scala:168)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:168)
... 26 more
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2303)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2252)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2251)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2251)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1443)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5(DAGScheduler.scala:1274)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5$adapted(DAGScheduler.scala:1273)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1213)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3019)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3018)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3018)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:161)
at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:336)
at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:256)
at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:422)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:319)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1509)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5(DAGScheduler.scala:1274)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5$adapted(DAGScheduler.scala:1273)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1213)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.ClassNotFoundException: org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
}
Please help. I need to run this for geospatial analysis. Thanks in advance.
I wanted to install graphframes for spark following the instructions on the spark website, but the command:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
did not work for me.
I tried many ways to install, but decided to stay at downloading graphframes .jar, adding it to the general list of Spark .jar files and adding it manually in the code spark.sparkContext.addPyFile("path to /spark-2.4.7-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar").
After that, the library is imported, but there is always an error when creating the GraphFrame. And I just have no idea how to solve it.
My .bashrc variables:
export CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export HADOOP_CONF_DIR="/home/german/spark-2.4.7-bin-hadoop2.7/conf"
export HADOOP_HOME="/home/german/spark-2.4.7-bin-hadoop2.7"
export HADOOP_SECURITY_LOGGER=ERROR,console
export JAVA_HOME="/home/german/jdk1.8.0_301"
export SPARK_CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export SPARK_DIST_CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export SPARK_HOME="/home/german/spark-2.4.7-bin-hadoop2.7"
export PATH="/home/german/spark-2.4.7-bin-hadoop2.7/bin:$PATH"
export PYTHONPATH="/home/german/spark-2.4.7-bin-hadoop2.7/python/lib/pyspark.zip:/home/german/spark-2.4.7-bin-hadoop2.7/python/lib:/home/german/spark-2.4.7-bin-hadoop2.7/python:$PYTHONPATH"
My jdk version 1.8, python 3.7.10, OS: Ubuntu 20.04 LTS.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config("spark.sql.warehouse.dir", "spark_warehouse")\
.getOrCreate()
spark.sparkContext.setCheckpointDir("graphframes_checkpoints")
spark.sparkContext.addPyFile("path to /spark-2.4.7-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar")
vertices = spark.read.parquet("tmp_dfs/parquet/vertices.parquet")
edges = spark.read.parquet("tmp_dfs/parquet/edges.parquet")
from graphframes import *
graph = GraphFrame(vertices, edges)
And I get the error:
Py4JJavaError: An error occurred while calling o65.createGraph.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-ee0c1444db6f> in <module>
37
38 from graphframes import *
---> 39 graph = GraphFrame(vertices, edges)
/tmp/spark-9d209109-e503-4ea1-813c-9ca68e76d72a/userFiles-4417833f-c19c-4e6e-9eea-7a21b6553f5f/graphframes-0.8.1-spark3.0-s_2.12.jar/graphframes/graphframe.py in __init__(self, v, e)
87 .format(self.DST, ",".join(e.columns)))
88
---> 89 self._jvm_graph = self._jvm_gf_api.createGraph(v._jdf, e._jdf)
90
91 #property
~/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-2.4.7-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o65.createGraph.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I may have chosen the wrong installation method or something else. I would be glad to hear any suggestions on how to solve this problem.
Check with which scala version spark jars are available under $SPARK_HOME/jars folder, example spark-sql_<scala version>-2.4.7.jar. If the version is 2.11 then you need to use graphframe which is compiled with scala v2.11.
And one more thing, spark version which you are using is 2.4.7 but graphframes jar which you added is related to spark 3.0, this might also cause issues.
I cannot install geopandas package in Databricks. I'm using cluster Run time 5.5 LTS Spark 2.4.3 Scala 2.11
The package is successfully installed in other runtime version but not the version I need.
What need to be done to install this package in cluster runtime 5.5?
I'm using below command
dbutils.library.installPyPI("geopandas")
Below is an error statement
org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/python, /local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/pip, install, geopandas, --disable-pip-version-check) exited with code 1. Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-bgvkkr58/fiona/
Detailed error
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-1887950226624660> in <module>()
1
----> 2 dbutils.library.installPyPI("geopandas")
/local_disk0/tmp/1625551234943-0/dbutils.py in installPyPI(self, project, version, repo, extras)
237 def installPyPI(self, project, version = "", repo = "", extras = ""):
238 return self.print_and_return(self.entry_point.getSharedDriverContext() \
--> 239 .addIsolatedPyPILibrary(project, version, repo, extras))
240
241 def restartPython(self):
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o413.addIsolatedPyPILibrary.
: org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/python, /local_disk0/pythonVirtualEnvDirs/virtualEnv-e9b469dd-aad9-4414-a208-03e3ecd8096c/bin/pip, install, geopandas, --disable-pip-version-check) exited with code 1. Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-bgvkkr58/fiona/
at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1403)
at org.apache.spark.util.Utils$.installLibrary(Utils.scala:836)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1700)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1632)
at com.databricks.backend.daemon.driver.SharedDriverContext$$anonfun$addIsolatedPyPILibrary$1.apply$mcV$sp(SharedDriverContext.scala:558)
at com.databricks.backend.daemon.driver.SharedDriverContext$$anonfun$addIsolatedPyPILibrary$1.apply(SharedDriverContext.scala:558)
at com.databricks.backend.daemon.driver.SharedDriverContext$$anonfun$addIsolatedPyPILibrary$1.apply(SharedDriverContext.scala:558)
at com.databricks.logging.UsageLogging$$anonfun$recordOperation$1.apply(UsageLogging.scala:369)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
at com.databricks.backend.daemon.driver.SharedDriverContext.withAttributionContext(SharedDriverContext.scala:57)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271)
at com.databricks.backend.daemon.driver.SharedDriverContext.withAttributionTags(SharedDriverContext.scala:57)
at com.databricks.logging.UsageLogging$class.recordOperation(UsageLogging.scala:350)
at com.databricks.backend.daemon.driver.SharedDriverContext.recordOperation(SharedDriverContext.scala:57)
at com.databricks.backend.daemon.driver.SharedDriverContext.addIsolatedPyPILibrary(SharedDriverContext.scala:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
enter code here
I've a bunch of code that works perfectly with s3n, but when i try to switch to s3a I just get some sort of java.lang.IllegalArgumentException without a real pointer or hint as to what exactly is wrong.. would appreciate some suggestions for debugging! I'm on hadoop-aws-2.7.3 and aws-java-sdk-1.7.4 so I believe that should be fine
error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-1aafd157ea37> in <module>
----> 1 schema_df = spark.read.json('s3a://udemy-stream-logs/cdn-access-raw/verizon/mp4-a.udemycdn.com/wpc_C9216_306_20200701_0C390000BFD7B55E_100.json_lines.gz')
2 schema = schema_df.schema
/usr/local/spark/python/pyspark/sql/readwriter.py in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, samplingRatio, dropFieldIfAllNull, encoding, locale, pathGlobFilter, recursiveFileLookup)
298 path = [path]
299 if type(path) == list:
--> 300 return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
301 elif isinstance(path, RDD):
302 def func(iterator):
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
126 def deco(*a, **kw):
127 try:
--> 128 return f(*a, **kw)
129 except py4j.protocol.Py4JJavaError as e:
130 converted = convert_exception(e.java_exception)
/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o31.json.
: java.lang.IllegalArgumentException
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1293)
at java.base/java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1215)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:477)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
my code:
conf = (SparkConf()
.set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.master', 'local[*]')
.set('spark.driver.memory', '4g'))
scT = SparkContext(conf=conf)
scT.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
scT.setLogLevel("INFO")
hadoopConf = scT._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3.buffer.dir', '/tmp/pyspark')
hadoopConf.set('fs.s3a.awsAccessKeyId', 'key')
hadoopConf.set('fs.s3a.awsSecretAccessKey', 'secret')
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-1.amazonaws.com')
hadoopConf.set('fs.s3a.multipart.size', '104857600')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider')
spark = SparkSession(scT)
df = spark.read.json('s3a://mybucket/something_something.json_lines.gz')
Ignoring the little detail that you have set the username and password in the wrong properties for the s3a connector, that stack trace implies its from thread pool construction. Presumably one of the parameters passed in (thread pool size, keep alive time. is somehow invalid. No obvious cue as to which specific option is provided by the JVM though.
My recommendation is to stop copying and pasting other stack overflow examples and look at the s3a documentation. See what the options are for authentication and then for bounded and unbounded thread pools and make sure they're set
I got the same problem as you, and I figure out this was caused by the code you configure "fs.s3a.multipart.size". I removed it and the problem has gone. You could try it.
I am trying to load a bunch of CSV files row by row into mysql instance which is running on OpenShift using pyspark configuration. I have a Jupyter notebook with spark up and running.
Below is the code I have. And it fails with specific driver error
Py4JJavaError: An error occurred while calling o89.save.
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
if __name__ == '__main__':
scSpark = SparkSession \
.builder \
.appName("reading csv") \
.getOrCreate()
if __name__ == '__main__':
scSpark = SparkSession \
.builder \
.appName("reading csv") \
.getOrCreate()
data_file = '/opt/app-root/src/data/train.psv'
sdfData = scSpark.read.csv(data_file, header=True, sep="|").cache()
print('Total Records = {}'.format(sdfData.count()))
sdfData.show()
sdfData.registerTempTable("train")
output = scSpark.sql('SELECT count(*) from train')
output.show()
+--------+
|count(1)|
+--------+
| 1168686|
+--------+
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages mysql:mysql-connector-java:jar:8.0.21 pyspark-shell'
output = scSpark.sql('SELECT * from train')
output.show()
output.write.format('jdbc').options(
url='jdbc:mysql://mysql-1-28d85/sepsis',
driver='com.mysql.jdbc.Driver',
#driver='mysql-connector-java.Driver',
#driver='org.mysql.jdbc.Driver',
dbtable='train',
user='sepsis',
password='Success_2020').mode('append').save()
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-57-114af97e0442> in <module>
11 dbtable='train',
12 user='sepsis',
---> 13 password='Success_2020').mode('append').save()
/opt/app-root/lib/python3.6/site-packages/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
735 self.format(format)
736 if path is None:
--> 737 self._jwrite.save()
738 else:
739 self._jwrite.save(path)
/opt/app-root/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/app-root/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/app-root/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o1641.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:190)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:194)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Changed the code with package.
Also this is openshift , where all components are running as pods with no access to outside environment.
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
That says it all. You have to start pyspark (or the environment) with the JDBC driver for MySQL using --driver-class-path or similar (that will be specific to Jupyter).
For Jupyter Notebook
Copying from PySpark in Jupyter Notebook — Working with Dataframe & JDBC Data Sources:
If you use Jupyter Notebook, you should set the PYSPARK_SUBMIT_ARGS environment variable, as following:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
Change the --packages to reference the MySQL JDBC driver.
Once you go to the installation path of the spark, there will be a jars folder. Download your mysql jdbc jar file and place it into the jars folder then you don't need any options to the command or code.