How to access the Azure Blob File System (abfss) from a standalone Spark cluster - apache-spark

I need to use a standalone Spark cluster (2.4.7) with Hadoop 3.2 and I am trying to access ADLS Gen2 storage through pyspark.
I've added a shared key to my core-site.xml and I can ls the storage account like so:
hadoop fs -ls abfss://<container>@<storage_account>.dfs.core.windows.net/
But when I try to read a json file in pyspark (using the shell) like so:
spark.conf.set("fs.azure.account.key.<<storageaccount>>.dfs.core.windows.net", "<<key>>")
spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("abfss://<container>@<storageaccount>.dfs.core.windows.net/example.json").show()
I get the following error:
WARN streaming.FileStreamSink: Error while looking for metadata directory.
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o103.json.
: java.io.IOException: No FileSystem for scheme: abfss
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:561)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:559)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:559)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:411)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I have also set SPARK_DIST_CLASSPATH to $(hadoop classpath) and copied the hadoop-azure jar into the hadoop/common folder, but I am still unable to access abfss via pyspark.
What could I be missing here?
I also tried the answers given here.

This article walks you through a do-it-yourself setup: Apache Spark and ADLS Gen 2 support. Make sure you have followed all the necessary steps to successfully configure ADLS Gen2 on Apache Spark.
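For reference, here is a minimal pyspark sketch of what that setup typically ends with on a standalone cluster. The hadoop-azure coordinate is an assumption chosen to match a Hadoop 3.2 install, and the placeholders mirror the ones used in the question:

from pyspark.sql import SparkSession

# Sketch: pull in the ABFS driver at startup and authenticate with a shared key.
# The hadoop-azure version should match the cluster's Hadoop version
# (3.2.1 here is an assumption for a Hadoop 3.2 setup).
spark = (
    SparkSession.builder
    .appName("abfss-read")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.1")
    .getOrCreate()
)

spark.conf.set("fs.azure.account.key.<<storageaccount>>.dfs.core.windows.net", "<<key>>")

df = (
    spark.read
    .option("multiLine", True)
    .option("mode", "PERMISSIVE")
    .json("abfss://<container>@<<storageaccount>>.dfs.core.windows.net/example.json")
)
df.show()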

Related

Error in reading files in Azure blob storage from laptop spark

I have set up Spark (spark-3.2.1-bin-hadoop3.2) on my laptop and am trying to read a CSV file from Azure Blob Storage, which is failing. Here is what I am doing to get the prompt:
./bin/pyspark \
--conf spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net=<key> \
--packages org.apache.hadoop:hadoop-azure:3.3.2,com.microsoft.azure:azure-storage:8.6.6
And then:
df = spark.read.csv("wasbs://<container>@<storage-account>.blob.core.windows.net/data/Fraud.csv", header=True, inferSchema=True)
It's throwing the following error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.NoSuchMethodError: org.eclipse.jetty.util.log.Log.getProperties()Ljava/util/Properties;
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createPermissionJsonSerializer(AzureNativeFileSystemStore.java:429)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.<clinit>(AzureNativeFileSystemStore.java:331)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.createDefaultStore(NativeAzureFileSystem.java:1485)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1410)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:571)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:834)
Tried the following combinations as well:
spark-3.2.1-bin-hadoop3.2 + org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3
spark-3.1.3-bin-hadoop3.2 + org.apache.hadoop:hadoop-azure:3.3.2,com.microsoft.azure:azure-storage:8.6.6
spark-3.1.3-bin-hadoop3.2 + org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3
spark-3.2.1-bin-hadoop3.2 + org.apache.hadoop:hadoop-azure:2.7.7,com.microsoft.azure:azure-storage:8.6.6
but no luck.
I also have the following two jar files in Spark's jars folder:
jetty-util-11.0.8.jar and jetty-util-ajax-11.0.8.jar
Please make sure the user has been assigned the Storage Blob Data Contributor role, and then try to run again.
df = spark.read.format("csv").load(filePath, inferSchema=True, header=True)
It could also be because multiple versions of a jar are present in the classpath: most likely a class was compiled against one version of a dependency, while a different version that is missing the method is the one being used at runtime.
Please check whether you have a mix of Jetty versions; if so, you'll need to align them to the same version.
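As a sanity check, here is a small sketch (assuming SPARK_HOME is set) that lists the Jetty jars shipped in Spark's jars folder so that a mix of major versions, such as the manually added jetty-util 11.0.8 jars sitting next to older ones pulled in elsewhere, is easy to spot:

import glob
import os

# Sketch: print every jetty jar in $SPARK_HOME/jars so conflicting major
# versions stand out; mismatched jars should be removed or aligned.
spark_home = os.environ["SPARK_HOME"]
for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "*jetty*.jar"))):
    print(os.path.basename(jar))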
References:
apache spark - Looping through files in databricks fails - Stack Overflow
Read csv from Azure blob Storage and store in a dataframe with python - Stack Overflow

How to resolve spark error when reading from s3

I am getting an error (java.io.IOException: No FileSystem for scheme: S3a) when running a Spark application. I have looked through various other questions regarding this type of error, but I'm not able to determine the solution. Spark is version 3.1.2.
Updated details below to reflect current state
pyspark script:
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("s3reader") \
.getOrCreate()
sc = spark.sparkContext
#sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint","xxx.x.xxx.x.com", "us-1-east")
#sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
print(df)
here are my jar versions:
cloud#spark-dev-master:/usr/local/spark/jars$ ls -ltr *aws*
-rw-rw-r-- 1 cloud cloud 126287 Aug 18 2016 hadoop-aws-2.7.4.jar
-rw-rw-r-- 1 cloud cloud 4479 Sep 17 02:36 aws-java-sdk-1.7.4.jar
stack trace:
Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 18, in <module>
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.json.
: java.io.IOException: No FileSystem for scheme: S3a
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
You need to use hadoop-aws version 3.2.0.
You can refer to my previous answer here.
I am getting an error (java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities)
This is what you see when you mix hadoop-aws and hadoop-common JAR versions. They must match point for point (as spark JARs also require).
Do not attempt to work around this except by syncing up the JARs; you will only be moving stack traces around.
See Hadoop troubleshooting s3a
As there still appeared to be jar dependency issues, I did a fresh install of Spark 3.1.2 with Hadoop 3.2.0 and aligned the hadoop-aws and aws-java-sdk jar versions with the hadoop-common jar version on the master and worker nodes. This corrected the file system issue. Upgrading to 3.2.0 also corrected the endpoint issue we were running into, as path.style.access=true is not supported in any Hadoop version older than 2.8.0. That issue is documented here for reference: https://issues.apache.org/jira/browse/HADOOP-12963
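For illustration, a minimal pyspark sketch of an aligned setup. The hadoop-aws coordinate is an assumption matching a Hadoop 3.2.0 install (it pulls a compatible aws-java-sdk-bundle transitively), and the credentials and endpoint values are the placeholders from the question:

from pyspark.sql import SparkSession

# Sketch: let hadoop-aws match the cluster's Hadoop version (3.2.0 is an
# assumption here); a compatible aws-java-sdk-bundle comes in transitively.
spark = (
    SparkSession.builder
    .appName("s3reader")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .getOrCreate()
)

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "xxxxxxx")
hconf.set("fs.s3a.secret.key", "xxxxxxxxxxxx")
hconf.set("fs.s3a.endpoint", "xxx.x.xxx.x.com")
hconf.set("fs.s3a.path.style.access", "true")

# Using the lowercase s3a scheme, which is the form the S3A connector registers.
df = spark.read.json("s3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
df.show()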

How do you read a file from Azure Blob w/ Apache Spark without Databricks but with wasbs on Windows 10?

I have azure-storage-8.6.0.jar and hadoop-azure-3.0.1.jar. I keep seeing on other forums that I have to modify the core-site.xml file in Hadoop's etc folder, as described here: https://github.com/hning86/articles/blob/master/hadoopAndWasb.md. I didn't know I even needed to download all of Hadoop to run Spark. I thought all I needed was winutils.exe in hadoop/bin.
spark.read.load(f"wasbs://{container_name}#{storage_account_name}.blob.core.windows.net/{container_name}/myfile.txt" )
Py4JJavaError: An error occurred while calling o53.load.
: java.io.IOException: No FileSystem for scheme: wasbs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If you want to use pyspark to read a CSV file from Azure Blob Storage on Windows 10, please refer to the following steps.
Install pyspark
pip install pyspark
Code (create .py file)
from pyspark.sql import SparkSession
import traceback
try:
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.conf.set('fs.azure.account.key.<account name>.blob.core.windows.net',
                   '<account key>')
    df = spark.read.option("header", True).csv(
        'wasbs://<container name>@<account name>.blob.core.windows.net/<directory name>/<file name>')
    df.show()
except Exception as exp:
    print("Exception occurred")
    print(traceback.format_exc())
Run code
cd <your python or env path>\Scripts
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.5 <your py file path>

How to access headnode hdfs files via jupyter notebook

I have set up a head node cluster. I successfully integrated a Jupyter notebook with it (using this answer).
I am also successfully able to run pyspark. I referred to this link for that.
Now I want to access HDFS files on the head node via the Jupyter notebook. But when I run the command below, which fetches data from HDFS:
df = sqlContext.read.json('hdfs:///192.168.21.110/user/hdfs/ML/pass/Teleram_18/notefind/2018-12-14/')
I get the following error
An error occurred while calling o29.json.
: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///192.168.21.110/user/hdfs/ML/pass/Teleram_18/notefind/2018-12-14/
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:705)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
What is actually wrong? One thing I noticed is that I have pyspark installed for both the normal user and the hdfs user on the head node, and I use the Jupyter notebook as the normal head node user.
I submit application programs as the hdfs user on the head node and am able to access HDFS files inside the hdfs user's spark shell. What can I do so that I can access HDFS files as the normal head node user? There is nothing wrong with my path; I can find the data using hadoop fs.
UPDATE: I see that the normal user uses Python 3.5 and pyspark 2.4, whereas the hdfs user uses Python 2.7 and pyspark 2.3.1. How can I resolve this?
Try it with the port included, for example:
hdfs://192.168.21.110:9000/user/hdfs/ML/pass/Teleram_18/notefind/2018-12-14/
Port 9000 - You can verify this port in core-site.xml.
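A quick sketch of the corrected read, assuming 9000 is indeed the NameNode port configured in fs.defaultFS:

# Sketch: an hdfs:// URI needs a host (and usually a port) before the path;
# the original hdfs:/// form has an empty authority, hence "no host".
df = sqlContext.read.json(
    'hdfs://192.168.21.110:9000/user/hdfs/ML/pass/Teleram_18/notefind/2018-12-14/')
df.show()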

Spark Loading Data from Azure Data Lake Store - Py4JJavaError: NoSuchMethodError

I am trying to load data in Spark 2.3.1 from ADLS using the following:
moviesfileAdls = "adl://xxxxxx.azuredatalakestore.net/Data/movies.csv"
dfMovies = spark.read.format("csv") \
.option("header", "true") \
.option("delimiter",",") \
.load(moviesfileAdls)
The setup: Hadoop-3.1.1 running on the same box as spark-2.3.1-bin-hadoop2.7. In hdfs, I am able to get the file using the following command:
hadoop distcp adl://xxxxxx.azuredatalakestore.net/Data/movies.csv /user/hadoop/movies
The above command successfully copies the file into local HDFS so I believe the hadoop setup is OK.
However, when I try to run the spark.read.format("csv") command, I am getting the following error:
Py4JJavaError: An error occurred while calling o54.load.
: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
at org.apache.hadoop.fs.adl.AdlConfKeys.addDeprecatedKeys(AdlConfKeys.java:126)
at org.apache.hadoop.fs.adl.AdlFileSystem.<clinit>(AdlFileSystem.java:98)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I tried adding the ADLS jars directly in spark-defaults.conf:
spark.jars /usr/local/hadoop/share/hadoop/tools/lib/azure-data-lake-store-sdk-2.3.1.jar, /usr/local/hadoop/share/hadoop/tools/lib/hadoop-azure-datalake-3.1.1.jar
HADOOP_CLASSPATH refers to the folder where the jars are located according to the spark user:
spark@xxxxx:~$ echo $HADOOP_CLASSPATH /usr/local/hadoop/etc/hadoop/*:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/tools/lib/*
Any pointers are greatly appreciated.
