I have a dataframe "df", which I want to store in Cloud Storage Bucket "my_bucket". I am currently writing my code on Google Colab. My code is as follows:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame({
    'a': [1, 2],
    'b': [2, 4]
}))
df.write.csv('gs://my_bucket/df')
I'm getting the following error:
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o128.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:558)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:851)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Anyone have any suggestions for this? Not sure what I'm doing wrong!
The error message tells you that Spark does not understand the gs:// path to your bucket. It looks like you have to mount the bucket first.
Try this. First, authenticate your user:
from google.colab import auth
auth.authenticate_user()
Then install gcsfuse with the following snippet:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse
Then you can mount the bucket as follows:
!mkdir my_bucket
!gcsfuse my_bucket my_bucket
You can then store your data to the following path:
df.write.csv('/content/my_bucket/df')
Also check out this Medium post for the detailed workflow.
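If you want to keep writing directly to the gs:// path instead of mounting, another option is to hand Spark the GCS connector. A minimal sketch, assuming the connector jar has been downloaded into the Colab VM and a service-account key file is available (the jar URL, key-file path, and config values below are my assumptions, not part of the answer above), to be run in a fresh runtime before the first SparkSession is created:
# !wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.jars", "/content/gcs-connector-hadoop3-latest.jar") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/content/keyfile.json") \
    .getOrCreate()
df.write.csv('gs://my_bucket/df')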
Related
I have a Spark standalone cluster configured with 3 nodes.
I want to read CSV data stored in S3-compatible storage (Dell ECS) with PySpark.
Here's the method and configuration I've tried:
Put hadoop-aws-3.3.1.jar and aws-java-sdk-bundle-1.11.375.jar in the $SPARK_HOME/jars path of all Spark nodes.
In $SPARK_HOME/conf/spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key ACCESS_KEY
spark.hadoop.fs.s3a.secret.key SECRET_KEY
spark.hadoop.fs.s3a.endpoint http://ECS_IP:9020
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.path.style.access true
Then, in $SPARK_HOME/bin/pyspark:
df = spark.read.options(header='True', delimiter=',', inferSchema=True).csv("s3a://my-bucket/file.csv")
Here is the result:
22/04/25 15:45:34 WARN FileSystem: Failed to initialize fileystem s3a://my-bucket/file.csv: java.lang.IllegalArgumentException: bucket
22/04/25 15:45:34 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://my-bucket/file.csv.
java.lang.IllegalArgumentException: bucket
at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
at org.apache.hadoop.fs.s3a.S3AUtils.propagateBucketOptions(S3AUtils.java:1152)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:374)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
22/04/25 15:45:34 WARN FileSystem: Failed to initialize fileystem s3a://my-bucket/file.csv: java.lang.IllegalArgumentException: bucket
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 410, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/local/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: bucket
I am trying to read a text file from on-prem S3-compatible object storage using Spark, and I am getting an error stating UnsupportedOperationException. I am unsure what this is pointing to and have tried to adjust the code, thinking maybe it was the spark.read command. I have tried read.text and read.csv, both of which should work, but they result in the same error. The full stack trace is below along with the code:
Code being used:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("s3reader") \
    .getOrCreate()
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key","xxxxxxxxxxxx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxxxx")
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
df = spark.read.text("https://s3a.us-east-1.xxxx.xxxx.xxxx.com/bronze/xxxxxxx/test.txt")
print(df)
Stack trace:
Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 19, in <module>
df = spark.read.text("https://s3a.us-east-1.tpavcps3ednrg1.vici.verizon.com/bronze/CoreMetrics/test.txt")
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 516, in text
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.text.
: java.lang.UnsupportedOperationException
at org.apache.hadoop.fs.http.AbstractHttpFileSystem.listStatus(AbstractHttpFileSystem.java:91)
at org.apache.hadoop.fs.http.HttpsFileSystem.listStatus(HttpsFileSystem.java:23)
at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:581)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:944)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Try reading the file from S3 using an s3a:// path like the one below:
s3a://bucket/bronze/xxxxxxx/test.txt
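Concretely, keeping the Hadoop settings from the question but switching the URI scheme from https:// to s3a:// would look roughly like the sketch below (bucket, key, access keys, and endpoint are placeholders; pointing fs.s3a.endpoint at the on-prem host is my assumption, not something stated in the answer):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("s3reader").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf.set("fs.s3a.path.style.access", "true")
hconf.set("fs.s3a.access.key", "xxxxxxxxxxxx")
hconf.set("fs.s3a.secret.key", "xxxxxxxxxxxxxx")
hconf.set("fs.s3a.connection.ssl.enabled", "true")
# Put the storage host in fs.s3a.endpoint instead of embedding it in the path (placeholder host):
hconf.set("fs.s3a.endpoint", "https://s3.us-east-1.example.com")
# The path now names only the bucket and the object key:
df = spark.read.text("s3a://bucket/bronze/xxxxxxx/test.txt")
df.show(5)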
I wanted to install graphframes for Spark following the instructions on the Spark website, but the command:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
did not work for me.
I tried many ways to install it, but settled on downloading the graphframes .jar, adding it to the general list of Spark .jar files, and adding it manually in the code with spark.sparkContext.addPyFile("path to /spark-2.4.7-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar").
After that, the library imports, but there is always an error when creating the GraphFrame, and I just have no idea how to solve it.
My .bashrc variables:
export CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export HADOOP_CONF_DIR="/home/german/spark-2.4.7-bin-hadoop2.7/conf"
export HADOOP_HOME="/home/german/spark-2.4.7-bin-hadoop2.7"
export HADOOP_SECURITY_LOGGER=ERROR,console
export JAVA_HOME="/home/german/jdk1.8.0_301"
export SPARK_CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export SPARK_DIST_CLASSPATH="/home/german/spark-2.4.7-bin-hadoop2.7/jars"
export SPARK_HOME="/home/german/spark-2.4.7-bin-hadoop2.7"
export PATH="/home/german/spark-2.4.7-bin-hadoop2.7/bin:$PATH"
export PYTHONPATH="/home/german/spark-2.4.7-bin-hadoop2.7/python/lib/pyspark.zip:/home/german/spark-2.4.7-bin-hadoop2.7/python/lib:/home/german/spark-2.4.7-bin-hadoop2.7/python:$PYTHONPATH"
My JDK version is 1.8, Python 3.7.10, OS: Ubuntu 20.04 LTS.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.config("spark.sql.warehouse.dir", "spark_warehouse")\
.getOrCreate()
spark.sparkContext.setCheckpointDir("graphframes_checkpoints")
spark.sparkContext.addPyFile("path to /spark-2.4.7-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar")
vertices = spark.read.parquet("tmp_dfs/parquet/vertices.parquet")
edges = spark.read.parquet("tmp_dfs/parquet/edges.parquet")
from graphframes import *
graph = GraphFrame(vertices, edges)
And I get the error:
Py4JJavaError: An error occurred while calling o65.createGraph.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-ee0c1444db6f> in <module>
37
38 from graphframes import *
---> 39 graph = GraphFrame(vertices, edges)
/tmp/spark-9d209109-e503-4ea1-813c-9ca68e76d72a/userFiles-4417833f-c19c-4e6e-9eea-7a21b6553f5f/graphframes-0.8.1-spark3.0-s_2.12.jar/graphframes/graphframe.py in __init__(self, v, e)
87 .format(self.DST, ",".join(e.columns)))
88
---> 89 self._jvm_graph = self._jvm_gf_api.createGraph(v._jdf, e._jdf)
90
91 #property
~/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-2.4.7-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o65.createGraph.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at org.graphframes.GraphFrame$.apply(GraphFrame.scala:676)
at org.graphframes.GraphFramePythonAPI.createGraph(GraphFramePythonAPI.scala:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I may have chosen the wrong installation method or something else. I would be glad to hear any suggestions on how to solve this problem.
Check which Scala version the Spark jars under the $SPARK_HOME/jars folder were built with, for example spark-sql_<scala version>-2.4.7.jar. If the version is 2.11, then you need to use a graphframes build that is compiled with Scala v2.11.
And one more thing: the Spark version you are using is 2.4.7, but the graphframes jar you added is built for Spark 3.0; this might also cause issues.
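For example, if the jars turn out to be spark-sql_2.11-2.4.7.jar, a sketch of starting a session with a matching build would be (the coordinate below is the published graphframes 0.8.1 artifact for Spark 2.4 / Scala 2.11, which is my assumption about what fits here):
from pyspark.sql import SparkSession
# Pull a graphframes build whose Scala/Spark suffix matches the local install.
spark = SparkSession.builder \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.1-spark2.4-s_2.11") \
    .getOrCreate()
from graphframes import GraphFrame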
Perhaps there's someone out there who can help me.
I'm trying to read data from ES using PySpark. My Jupyter Notebook code is pretty simple:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077')
sc = pyspark.SparkContext(conf=conf)
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "some-log/doc",
        "es.nodes": "192.168.1.25",
        "es.port": "9200"
    })
I have Spark and Jupyter Notebook installed on the host running the NB. The spark-defaults.conf file is loading the "elasticsearch-hadoop-6.4.0.jar" via: spark.jars /opt/maya/es-hadoop/elasticsearch-hadoop-6.4.0.jar
I can connect to the ES instance and read from it by using other tools like elasticsearch-py, the Test app shows up in the Spark Master UI. However, when I execute the code above, I keep getting this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-5-c990f37c388b> in <module>
6 "es.resource": "logs-dfir-winevent-security-*/doc",
7 "es.nodes": "192.168.248.131",
----> 8 "es.port": "9200"
9 })
10 #es_rdd.first()
/opt/anaconda/lib/python3.6/site-packages/pyspark/context.py in newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)
715 jrdd = self._jvm.PythonRDD.newAPIHadoopRDD(self._jsc, inputFormatClass, keyClass,
716 valueClass, keyConverter, valueConverter,
--> 717 jconf, batchSize)
718 return RDD(jrdd, self)
719
/opt/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:302)
at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:286)
at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I've searched and cannot see that the error is in the code itself, I'm having a feeling this issue is more related to bad Spark configuration within the host running the Jupyter Notebook. Any insight would be much appreciated!
Please refer to this question: pyspark: ship jar dependency with spark-submit
What you need to do is pass the jar of the dependency with the configuration. If you're using a Jupyter notebook, you can add it via SparkConf() such as:
conf = SparkConf().set('spark.driver.extraClassPath', 'full/path/to/jar')
Just change your code to:
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077').set('spark.driver.extraClassPath', 'full/path/to/jar')
Another method is:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
'--jars /full/path/to/your/jar.jar pyspark-shell'
The jar can be downloaded from https://www.elastic.co/downloads/hadoop.
This works on Spark 2.3 and Elasticsearch 6.4.
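Putting that second method together with the reader from the question, a sketch would be (the jar path is the one already listed in the question's spark-defaults.conf; PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created):
import os
import pyspark
# Must be set before the JVM behind the SparkContext starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /opt/maya/es-hadoop/elasticsearch-hadoop-6.4.0.jar pyspark-shell'
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://spark-master:7077')
sc = pyspark.SparkContext(conf=conf)
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "some-log/doc",
        "es.nodes": "192.168.1.25",
        "es.port": "9200"
    })
print(es_rdd.first())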
I'm working on CentOS 7 with Spark 2.3.0.
I linked PySpark with Jupyter, but I get an error when I try to read an XML file:
df=pyspark.SQLContext(sc).read.format('com.databricks.spark.xml').options(rowTag='books').load('xm.xml')
Py4JJavaError Traceback (most recent call last)
<ipython-input-13-c0ea09e4b676> in <module>()
----> 1 df = pyspark.SQLContext(sc).read.format('com.databricks.spark.xml').options(rowTag='books').load('xm.xml')
/usr/lib/spark/python/pyspark/sql/readwriter.pyc in load(self, path, format,
schema, **options)
164 self.options(**options)
165 if isinstance(path, basestring):
--> 166 return self._df(self._jreader.load(path))
167 elif path is not None:
168 if type(path) != list:
...
Py4JJavaError: An error occurred while calling o61.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException:
How can I add com.databricks.spark.xml so that it works for me?
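One common approach, as a hedged sketch rather than a confirmed fix: pull the spark-xml package when the session starts so the data source class ends up on the classpath. Spark 2.3 is built against Scala 2.11, and the package version below is an assumption; check the spark-xml releases for the exact coordinate.
import os
import pyspark
# Set before the SparkContext / JVM starts; the version is assumed.
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages com.databricks:spark-xml_2.11:0.4.1 pyspark-shell'
sc = pyspark.SparkContext()
df = pyspark.SQLContext(sc).read.format('com.databricks.spark.xml') \
    .options(rowTag='books').load('xm.xml')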