Using Apache Spark with a local S3-compatible object store

I am trying to run a simple Apache Spark (Cloudera) read operation against a local object store that is fully S3 SDK/API compatible, but I cannot figure out how to make Spark understand that I am trying to access a local S3 bucket and not remote AWS S3.
Here's what I've tried...
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com
df = spark.read.parquet("s3a://mybucket/path1/")
Error message...
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to mybucket.s3.amazonaws.com:443 [mybucket.s3.amazonaws.com/12.345.678.90] failed: Connection refused (Connection refused)
I can list the local bucket contents from the command line without issue, so I know the access/secret key is correct, but I need to make Spark understand that it should not reach out to AWS to resolve the bucket URL.
Thanks.
Update / Resolution:
The fix to the issue was a missing prerequisite jar at maven coordinates: org.apache.hadoop:hadoop-aws:2.6.0
So the final pyspark call looked like:
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com --jars hadoop-aws-2.6.0.jar
df = spark.read.parquet("s3a://mybucket/path1/")

This is covered in the HDP docs, under Working with third-party object stores.
Settings are the same for CDH.
It comes down to:
set the endpoint: fs.s3a.endpoint = hostname
disable the DNS-to-bucket mapping: fs.s3a.path.style.access = true
play with the signing options.
There are a few other switches you can turn for better compatibility; they're in those docs.
You might find the Cloudstore storediag command useful.
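The endpoint and path-style settings above can be passed on the pyspark2 command line. The following is a minimal sketch, assuming they are given with the spark.hadoop. prefix that Spark uses to forward properties to the Hadoop configuration. The helper name s3a_conf_args is hypothetical, and the endpoint is the placeholder from the question.

```python
def s3a_conf_args(endpoint, path_style=True):
    """Build --conf arguments for an S3-compatible store (hypothetical helper)."""
    settings = {
        # Point the S3A client at the local store instead of AWS.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Address buckets as https://host/bucket/key, not https://bucket.host/key.
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Print a command line that could be pasted into a shell.
print(" ".join(["pyspark2"] + s3a_conf_args("https://myenvironment.domain.com")))
```

Signing options and the credential provider path are left out here; they would be added as further --conf pairs in the same way.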

Related

How can Spark read / write from Azurite

I am trying to read (and eventually write) from Azurite (version 3.18.0) using Spark (3.1.1).
I can't work out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running; the uri value is one of the values listed below:
val uri = ...
val spark = SparkSession.builder()
.appName(appName)
.master("local")
.config("spark.driver.host", "127.0.0.1").getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
uri value - what is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get an UnsupportedOperationException when I call spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
Spark will try to authenticate against the Azure servers (and will fail for obvious reasons).
I have tried to follow this Stack Overflow post without success.
I have also tried removing the blob.core.windows.net suffix from the configuration keys, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so that it can read from Azurite, and what is the correct file path format to pass as the URI?

Get correct path for spark.cassandra.connection.config.cloud.path in AWS EMR

I have searched the web but could not find a solution.
I am trying to connect to Astra Cassandra from AWS EMR using the bundle. The bundle file is downloaded, but it is not loaded.
spark.conf.set("spark.cassandra.connection.config.cloud.path", "secure-connect-app.zip")
This is how I am setting the path. I know it is the wrong path, because the correct config is not loaded and the connection is refused on localhost.
I don't know what the correct path is on EMR.
If you look at the documentation, you will see that the file can either be specified as the URL of a file on S3, or you can use the --files option when submitting with spark-submit or spark-shell; it will then be available by just its file name, as you're doing.
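The --files route described above can be sketched as a spark-submit argument list. This is only an illustration under assumptions: the helper name astra_submit_args, the local bundle path, and the application file my_job.py are hypothetical, and the property used is the Spark Cassandra Connector's spark.cassandra.connection.config.cloud.path.

```python
import posixpath


def astra_submit_args(bundle, app):
    """Ship the bundle with --files and refer to it by bare file name."""
    # Files passed via --files are referenced by name only,
    # so the property gets the base name rather than the full path.
    bundle_name = posixpath.basename(bundle)
    return [
        "spark-submit",
        "--files", bundle,
        "--conf", f"spark.cassandra.connection.config.cloud.path={bundle_name}",
        app,
    ]


# Hypothetical paths, for illustration only.
print(" ".join(astra_submit_args("/home/hadoop/secure-connect-app.zip", "my_job.py")))
```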

Providing AWS_PROFILE when reading S3 files with Spark

I want my Spark app (Scala) to be able to read S3 files
spark.read.parquet("s3://my-bucket-name/my-object-key")
On my dev machine, I can access S3 files using awscli with a pre-configured profile in ~/.aws/config or ~/.aws/credentials, like:
aws --profile my-profile s3 ls s3://my-bucket-name/my-object-key
But when trying to read those files from Spark, with the aws_profile provided as an env variable (AWS_PROFILE), I got the following error:
doesBucketExist on my-bucket-name: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
Also tried to provide the profile as a JVM option (-Daws.profile=my-profile), with no luck.
Thanks for reading.
The solution is to set the Spark property fs.s3a.aws.credentials.provider to com.amazonaws.auth.profile.ProfileCredentialsProvider.
If I could change the code that builds the Spark session, it would be something like:
SparkSession
  .builder()
  .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
  .getOrCreate()
The other way is to provide the JVM option -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider.
NOTE the prefix spark.hadoop.
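The spark.hadoop prefix mentioned above can be applied mechanically to any Hadoop property. A small sketch, assuming the prefixed keys are then handed to Spark as ordinary configuration entries; the helper name hadoop_props_for_spark is hypothetical.

```python
def hadoop_props_for_spark(props):
    """Prefix Hadoop property names so Spark forwards them to the Hadoop configuration."""
    return {f"spark.hadoop.{key}": value for key, value in props.items()}


conf = hadoop_props_for_spark({
    # Ask S3A to read credentials from the AWS profile files.
    "fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.profile.ProfileCredentialsProvider",
})

# Render the entries as --conf arguments for spark-submit or spark-shell.
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

In PySpark, each key/value pair could instead be passed to SparkSession.builder.config(key, value) before calling getOrCreate().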
If problems persist after setting fs.s3a.aws.credentials.provider to com.amazonaws.auth.profile.ProfileCredentialsProvider and correctly setting AWS_PROFILE, it might be because you're using Hadoop 2, for which the above configuration is not supported.
The only workaround I found was to upgrade to Hadoop 3.
Check this post and Hadoop docs for more information.

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
This throws a long Python error message, followed by something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key, as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access-key information can be supplied in the usual username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
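The question passes an https URL, whereas the answers address the object as scheme://bucket/key. As a rough sketch of that translation, the hypothetical helper below converts a path-style S3 URL (bucket as the first path segment, as in the question) into an s3a URI. It assumes credentials are supplied separately, for example through the fs.s3a.access.key and fs.s3a.secret.key properties, rather than embedded in the URI.

```python
from urllib.parse import urlparse


def https_to_s3a(url):
    """Turn https://<s3-host>/<bucket>/<key> into s3a://<bucket>/<key>."""
    path = urlparse(url).path.lstrip("/")
    # The first path segment is the bucket; the remainder is the object key.
    bucket, _, key = path.partition("/")
    if not bucket or not key:
        raise ValueError(f"expected a path-style S3 URL, got: {url}")
    return f"s3a://{bucket}/{key}"


print(https_to_s3a("https://s3.us-east-2.amazonaws.com/bucketname/filename.csv"))
# prints s3a://bucketname/filename.csv
```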

Spark submit cluster mode from s3

I have Spark stand-alone set up on EC2 instances. I'm trying to use cluster mode to submit a Spark application. The jar is in S3, and access to it is set up via IAM roles. I can run aws s3 cp s3://bucket/dir/foo.jar . to get the jar file, and that works fine. However, when I run the following:
spark-submit --master spark://master-ip:7077 --class Foo \
  --deploy-mode cluster --verbose s3://bucket/dir/foo.jar
I get the error outlined below. Seeing that the boxes have IAM roles configured to allow access, what would be the correct way to submit the job? The job itself doesn't use S3 at all...the issue seems to be fetching the jar from S3.
Any help will be appreciated.
16/07/04 11:44:09 ERROR ClientEndpoint: Exception from cluster was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy5.initialize(Unknown Source)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1686)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:598)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:395)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
I've found a workaround. I put the jar in a static http server, and use http://server/foo.jar in spark-submit. That seems to work.
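The workaround above only needs a plain static file server. A minimal sketch using Python's standard http.server module, assuming the jar has already been copied into a local directory (for example with aws s3 cp). The function name serve_directory and its defaults are hypothetical, and this is a development-grade server rather than a hardened one.

```python
import functools
import http.server
import threading


def serve_directory(directory, host="0.0.0.0", port=8000):
    """Serve the files in `directory` over HTTP from a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # Daemon thread: the server stops when the main program exits.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Example (hypothetical directory): server = serve_directory("/opt/jars")
# The jar would then be reachable as http://<this-host>:8000/foo.jar.
# Call server.shutdown() to stop serving.
```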

Resources