RDD doesn't work - apache-spark

I'm currently working on a project and can't seem to get past an error in Spark.
Functions like .first() and .collect() won't return results.
This is my code:
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="C:\spark-2.0.1-bin-hadoop2.7"
# Append pyspark to Python Path
sys.path.append("C:\spark-2.0.1-bin-hadoop2.7\python ")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
import re
sc = SparkContext()
file = sc.textFile('rC:\\essay.txt')
word = file.map(lambda line: re.split(r'[?:\n|\s]\s*', line))
word.first()
When I run it in PyCharm, it generates the following output:
Successfully imported Spark Modules
16/12/18 17:23:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/18 17:23:43 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
Traceback (most recent call last):
File "C:/Users/User1/PycharmProjects/BigData/SparkMatrice.py", line 43, in <module>
word.first()
File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1328, in first
File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1280, in take
File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 2388, in getNumPartitions
File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o19.partitions.
: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: rC:%5Cessay.txt
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:60)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: rC:%5Cessay.txt
at java.net.URI.checkPath(Unknown Source)
at java.net.URI.<init>(Unknown Source)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 32 more
The same thing happens when I replace .first() with .collect(), and when I use the terminal instead of PyCharm.
I hope that someone can help me figure out what is wrong.

The problem is spelled out for you in the traceback; your path is wrong:
Caused by: java.net.URISyntaxException: Relative path in absolute URI: rC:%5Cessay.txt
at java.net.URI.checkPath(Unknown Source)
You need to change
file = sc.textFile('rC:\\essay.txt')
to
file = sc.textFile(r'C:\\essay.txt')
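The r (raw-string) prefix has to sit outside the quotes; written inside them it simply becomes part of the path, which is why the URI parser complains about rC:%5Cessay.txt. A minimal sketch of the corrected read, reusing the code from the question (a raw string, escaped backslashes, or an explicit file URI should all work):
import re
from pyspark import SparkContext

sc = SparkContext()

# Raw string: backslashes are taken literally
file = sc.textFile(r'C:\essay.txt')
# Equivalent alternatives:
# file = sc.textFile('C:\\essay.txt')          # escaped backslash
# file = sc.textFile('file:///C:/essay.txt')   # explicit file URI

word = file.map(lambda line: re.split(r'[?:\n|\s]\s*', line))
print(word.first())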

Related

read a text file from S3 into a Spark df : UnsupportedOperationException

I am trying to read a text file from on-prem, S3-compatible object storage using Spark, and I am getting an UnsupportedOperationException. I am unsure what it is pointing to, and I have tried adjusting the code, thinking maybe it was the spark.read command. I have tried read.text and read.csv, both of which should work, but both result in the same error. The full stack trace is below, along with the code:
Code being used:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("s3reader") \
    .getOrCreate()
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key","xxxxxxxxxxxx")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxxxx")
sc._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
df = spark.read.text("https://s3a.us-east-1.xxxx.xxxx.xxxx.com/bronze/xxxxxxx/test.txt")
print(df)
Stack trace:
Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 19, in <module>
df = spark.read.text("https://s3a.us-east-1.tpavcps3ednrg1.vici.verizon.com/bronze/CoreMetrics/test.txt")
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 516, in text
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/bin/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.text.
: java.lang.UnsupportedOperationException
at org.apache.hadoop.fs.http.AbstractHttpFileSystem.listStatus(AbstractHttpFileSystem.java:91)
at org.apache.hadoop.fs.http.HttpsFileSystem.listStatus(HttpsFileSystem.java:23)
at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:581)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:944)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Try reading the file from S3 with the s3a:// scheme instead of an https:// URL, like below.
s3a://bucket/bronze/xxxxxxx/test.txt
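For reference, a sketch of how the question's code might look with the s3a:// scheme (the bucket name, endpoint, and credentials are placeholders; fs.s3a.endpoint points the S3A connector at the on-prem, S3-compatible service instead of AWS):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("s3reader") \
    .getOrCreate()
sc = spark.sparkContext

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "https://s3.us-east-1.example.com")  # placeholder endpoint
hconf.set("fs.s3a.path.style.access", "true")
hconf.set("fs.s3a.access.key", "xxxxxxxxxxxx")
hconf.set("fs.s3a.secret.key", "xxxxxxxxxxxxxx")
hconf.set("fs.s3a.connection.ssl.enabled", "true")

# Address the object as bucket + key, not as an https:// URL
df = spark.read.text("s3a://bucket/bronze/xxxxxxx/test.txt")
df.show()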

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob

I am getting this error in a Jupyter notebook running Python 3.6.5 and in my Python shell running 3.7.2. My OS is Windows 10. I ran pip install pyspark in both environments. Both are using Spark version 2.4.0, and my Java JDK is Oracle JDK version 8 (jdk1.8.0_201). This is the code I'm running in both cases:
>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setAppName("app")
>>> sc = SparkContext(conf=conf)
>>> import os
>>> os.chdir("C:/Users/theca/Desktop/school_folders/Big Data")
>>> data = sc.textFile("post_codes.txt")
>>> data.take(1)
I was using JRE version 8, and I verified JAVA_HOME:
C:\Python\Python37\Scripts>echo %JAVA_HOME%
C:\ProgramData\Oracle\Java\javapath\java.exe
I changed to the JDK, thinking that would fix the issue:
C:\Program Files\Java\jdk1.8.0_201>setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_201"
SUCCESS: Specified value was saved.
C:\Program Files\Java\jdk1.8.0_201>setx PATH "%PATH%;%JAVA_HOME%\bin";
WARNING: The data being saved is truncated to 1024 characters.
I exited cmd, went back in, and verified my JAVA_HOME:
C:\WINDOWS\system32>echo %JAVA_HOME%
C:\Program Files\Java\jdk1.8.0_201
I have tried solutions here:
PySpark exception: Java gateway process exited before sending its port number
and here:
Pyspark: SparkContext definition in Spyder throws Java gateway error
I have also tried a few other answers on this board. I am wondering whether I need to use an earlier version of Spark.
Here is the entirety of the error message:
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
data.take(1)
File "C:\Python\Python37\lib\site-packages\pyspark\rdd.py", line 1327, in take
totalParts = self.getNumPartitions()
File "C:\Python\Python37\lib\site-packages\pyspark\rdd.py", line 391, in getNumPartitions
return self._jrdd.partitions().size()
File "C:\Python\Python37\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Python\Python37\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Python/Python37/post_codes.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Try the following, giving the full path with the file:// scheme:
data = sc.textFile("file:///path to the file/")
This should work.
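The likely reason is that os.chdir only changes the Python process's working directory; the JVM behind the SparkContext was already running and still resolves relative paths against its own working directory (hence file:/C:/Python/Python37/post_codes.txt in the traceback). A sketch that avoids the ambiguity by passing an absolute file:/// URI (the directory is the one from the question):
import os
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)

# Absolute URI, so the path is not resolved against the JVM's working directory
path = "C:/Users/theca/Desktop/school_folders/Big Data/post_codes.txt"
data = sc.textFile("file:///" + path)
print(data.take(1))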

Error while running python program for spark streaming with kafka

I have the latest Spark (2.1.0) and Python (3.5.3) installed, and Kafka (2.10.0) installed locally.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pykafka import KafkaClient
import json
import sys
import pprint
spsc = SparkContext(appName="SampleApp")
stsc = StreamingContext(spsc, 1)
print('contexts =================== {} {}'.format(spsc,stsc));
kvs = KafkaUtils.createStream(stsc, "localhost:2181", "spark-consumer", {"7T-test3": 1})
spsc.stop()
The 'print' line executes fine, but on the next line, while creating the stream, I get the following error:
Traceback (most recent call last):
File "/Users/MacAdmin/Downloads/spark-streaming/spark/spark_streaming_osample.py", line 24, in <module>
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"7T-test3": 1})
File "/Users/MacAdmin/Documents/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 70, in createStream
File "/Users/MacAdmin/Documents/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/Users/MacAdmin/Documents/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.createStream.
: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:168)
at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createStream(KafkaUtils.scala:632)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 25 more
I run my program from the command line as:
/Users/MacAdmin/Documents/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar spark_streaming_sample.py
Do I need to set any environment variables, or am I not using the correct library versions?
A few things were missing. I added the class paths:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip/:$PYTHONPATH
Also, Spark's Logging class is private from 2.x onwards, so I had to use the Kafka streaming assembly below when running the program:
spark-streaming-kafka-0-8-assembly_2.10-2.1.0.jar
Make sure that the topic (7T-test3) has been created in Kafka before executing the stream.
You may also want to provide more details leading up to the error.
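Once the matching spark-streaming-kafka-0-8 assembly jar is passed to spark-submit and the topic exists, a minimal sketch of the receiver-based stream (host, port, consumer group, and topic taken from the question) would be:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spsc = SparkContext(appName="SampleApp")
stsc = StreamingContext(spsc, 1)

# Receiver-based stream: ZooKeeper quorum, consumer group, {topic: partition count}
kvs = KafkaUtils.createStream(stsc, "localhost:2181", "spark-consumer", {"7T-test3": 1})
kvs.pprint()

stsc.start()
stsc.awaitTermination()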

MapR Stream and PySpark

Is PySpark compatible with MapR Streams?
Any example code?
I've tried the following, but I keep getting an exception:
strLoc = '/Path1:Stream1'
protocol = 'file://' if ( strLoc.startswith('/') or strLoc.startswith('\\') ) else ''
from pyspark.streaming.kafka import *;
from pyspark import StorageLevel;
APA = KafkaUtils.createDirectStream(ssc, [strLoc], kafkaParams={ \
    "oracle.odi.prefer.dataserver.packages" : "" \
    ,"key.deserializer" : "org.apache.kafka.common.serialization.StringDeserializer" \
    ,"value.deserializer" : "org.apache.kafka.common.serialization.ByteArrayDeserializer" \
    ,"zookeeper.connect" : "maprdemo:5181" \
    ,"metadata.broker.list" : "this.will.be.ignored:9092"
    ,"group.id" : "New_Mapping_2_Physical"}, fromOffsets=None, messageHandler=None)
Traceback (most recent call last):
File "/tmp/New_Mapping_2_Physical.py", line 77, in <module>
,"group.id" : "New_Mapping_2_Physical"}, fromOffsets=None, messageHandler=None)
File "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o58.createDirectStreamWithoutMessageHandler.
: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createDirectStream(KafkaUtils.scala:720)
at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createDirectStreamWithoutMessageHandler(KafkaUtils.scala:688)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
In Scala it seems to work fine, but in PySpark it does not.
I downloaded the latest build (http://package.mapr.com/releases/ecosystem-5.x/redhat/mapr-spark-1.6.1.201612010646-1.noarch.rpm) and it resolved the issue.
I checked the pyspark kafka.py and found it updated; I was using label 1605, and now 1611.

Issues Google Cloud Storage connector on Spark

I am trying to install the Google Cloud Storage connector on Spark on Mac OS to do local testing of my Spark app. I have read the following document (https://cloud.google.com/hadoop/google-cloud-storage-connector). I have added gcs-connector-latest-hadoop2.jar to my spark/lib folder. I have also added the core-site.xml file to the spark/conf directory.
When I run my pyspark shell, I get an error:
>>> sc.textFile("gs://mybucket/test.csv").count()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 847, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 838, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 759, in reduce
vals = self.mapPartitions(func).collect()
File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/pyspark/rdd.py", line 723, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/Users/poiuytrez/Documents/DataBerries/programs/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2379)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
at org.apache.spark.rdd.RDD.collect(RDD.scala:774)
at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:305)
at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
... 40 more
I am not sure where to go next.
The requirement may vary between versions of Spark, but if you peek inside bdutil-0.35.2/extensions/spark/install_spark.sh you'll see how our "Spark + Hadoop on GCE" setup using bdutil works; it includes the items you mention, adding the connector into the spark/lib folder and the core-site.xml file into the spark/conf directory, but it additionally adds this line to spark/conf/spark-env.sh:
export SPARK_CLASSPATH=\$SPARK_CLASSPATH:${LOCAL_GCS_JAR}
where ${LOCAL_GCS_JAR} is the absolute path to the jar file you added to spark/lib. Try adding that to your spark/conf/spark-env.sh and the ClassNotFoundException should go away.
