How to add user defined metadata to S3 object via Spark - apache-spark

I am using spark sql dataframe to write to s3 as parquet
Dataset.write
.mode(SaveMode.Overwrite)
.parquet("s3://filepath")
in the spark configuration i have specified following options for SSE and for ACL
spark.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default","BucketOwnerFullControl")
how add the user define metadata to the s3 object .
Thanks
Saravanan.

I dont think its possible today. You can't add/update the user defined metadata for S3 objects from EMR. This is from my limited knowledge. Again, AWS Support is the best source to get this answered but I dont believe the API is exposed to allow users to add/update the user defined metadata yet from EMR

It's a very good question. Unfortunately, Spark does not permit it, since Spark is FileSystem agnostic (it doesnt make difference between local, HDFS, or S3).
In my mind, since all file system support kind of metadata, Spark should propose something as well...
But, as a workaround, you can always change meta-data after file have been uploaded.
Example with Java, make file public:
Gradle:
[group: 'org.apache.hadoop', name: 'hadoop-aws', version: '2.8.0'],
(provides: [group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.60'],
[group: 'com.amazonaws', name: 'aws-java-sdk-core', version: '1.10.60'])
// Write data to S3
df
.write()
.save("s3a://BUCKET/test_json");
// (this generate test_json/_SUCCESS + test_json/part-XXX keys)
// Create S3 client
private final AmazonS3 conn;
AWSCredentials credentials = new BasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setProtocol(Protocol.HTTP);
this.conn = new AmazonS3Client(credentials, clientConfig);
this.conn.setEndpoint(S3_ENDPOINT);
this.conn.setS3ClientOptions(new S3ClientOptions().withPathStyleAccess(true));
// Get Spark generated folder files
this.conn.listObjects(BUCKET, "test_json").getObjectSummaries().forEach(obj -> {
this.conn.setObjectAcl(BUCKET, obj.getKey(), CannedAccessControlList.PublicRead);
});

Related

How to write a spark dataframe to S3 with MD5 header?

I have a Spark DataFrame that I need to write to an S3 Object Lock enabled bucket.
A simple write results in this error
df.write.parquet(output_path)
com.amazonaws.services.s3.model.AmazonS3Exception:
Content-MD5 HTTP header is required for Put Object requests with Object Lock parameters
Any ideas how can I solve this?
There are ways to work this with boto3 type uploads, but how to do it with spark.df.write
s3_client.put_object(
Bucket=<S3_BUCKET_NAME>,
Key=<KEY>,
Body=open(<FILE_NAME>, "rb"),
ContentMD5=<MD5_HASH>
)
not supported in hadoop's s3a connector I'm afraid.
As is usual in OSS projects, your participation will help there: https://issues.apache.org/jira/browse/HADOOP-15224. That JIRA was created in 2018 and hasn't had attention as nobody new it was actually needed. We could maybe revisit that.
Don't know about the EMR s3 connector.

How to mount S3 bucket on Kubernetes container/pods?

I am trying to run my spark job on Amazon EKS cluster. My spark job required some static data (reference data) at each data nodes/worker/executor and this reference data is available at S3.
Can somebody kindly help me to find out a clean and performant solution to mount S3 bucket on pods ?
S3 API is an option and I am using it for my input records and output results. But "Reference data" is static data so I dont want to download it in each run/execution of my spark job. In first run job will download the data and upcoming jobs will check if data is already available locally and there is no need to download it again.
We recently opensourced a project that looks to automate this steps for you: https://github.com/IBM/dataset-lifecycle-framework
Basically you can create a dataset:
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
name: example-dataset
spec:
local:
type: "COS"
accessKeyID: "iQkv3FABR0eywcEeyJAQ"
secretAccessKey: "MIK3FPER+YQgb2ug26osxP/c8htr/05TVNJYuwmy"
endpoint: "http://192.168.39.245:31772"
bucket: "my-bucket-d4078283-dc35-4f12-a1a3-6f32571b0d62"
region: "" #it can be empty
And then you will get a pvc you can mount in your pods
in general, you just don't do that. You should instead interact directly with S3 API to retrieve/store what you need (probably via some tools like aws cli).
As you run in AWS, you can have IAM configured in a way that your nodes can access particular data authorized on "infrastructure" level, or you can provide S3 access tokens via secrets/confogmaps/env etc.
S3 is not a filesystem, so don't expect it to behave like one (even if there are FUSE clients that emulate FS for your needs, this is rarely the right solution)

spark read partitioned data in S3 partly in glacier

I have a dataset in parquet in S3 partitioned by date (dt) with oldest date stored in AWS Glacier to save some money. For instance, we have...
s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
I want to read this dataset, but only the a subset of date that are not yet in glacier, eg:
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
Unfortunately, I have the exception
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
I seems that spark does not like partitioned dataset when some partitions are in Glacier. I could always read specifically each date, add the column with current date and reduce(_ union _) at the end, but it is ugly like hell and it should not be necessary.
Is there any tip to read available data in the datastore even with old data in glacier?
Error you are getting not related to Apache spark , you are getting exception because of Glacier service in short S3 objects in the Glacier storage class are not accessible in the same way as normal objects, they need to be retrieved from Glacier before they can be read.
Apache Spark cannot handle directly glacier storage TABLE/PARTITION mapped to an S3 .
java.io.IOException:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The operation is not valid for the object's storage class (Service:
Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request
ID: C444D508B6042138)
When S3 moves any objects from S3 storage classes
STANDARD,
STANDARD_IA,
REDUCED_REDUNDANCY
to GLACIER storage class, you have object S3 has stored in Glacier which is not visible
to you and S3 will bill only Glacier storage rates.
It is still an S3 object, but has the GLACIER storage class.
When you need to access one of these objects, you initiate a restore,
which temporary copy into S3 .
Move data into S3 bucket read into Apache Spark will resolve your issue.
https://aws.amazon.com/s3/storage-classes/
Note : Apache Spark , AWS athena etc cannot read object directly from glacier if you try will get 403 error.
If you archive objects using the Glacier storage option, you must
inspect the storage class of an object before you attempt to retrieve
it. The customary GET request will work as expected if the object is
stored in S3 Standard or Reduced Redundancy (RRS) storage. It will
fail (with a 403 error) if the object is archived in Glacier. In this
case, you must use the RESTORE operation (described below) to make
your data available in S3.
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
403 error is due to the fact you can not read object that is archieve in Glacier, source
Reading Files from Glacier
If you want to read files from Glacier, you need to restore them to s3 before using them in Apache Spark, a copy will be available on s3 for the time mentioned during restore command, for details see here, you can use S3 console, cli or any language to do that too
Discarding some Glacier files that you do not want to restore
Let's say you do not want to restore all the files from Glacier and discard them during processing, from Spark 2.1.1, 2.2.0 you can ignore those files (with IO/Runtime Exception), by setting spark.sql.files.ignoreCorruptFiles to true source
If you define your table through Hive, and use the Hive metastore catalog to query it, it won't try to go onto the non selected partitions.
Take a look at the spark.sql.hive.metastorePartitionPruning setting
try this setting:
ss.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
or
add the spark-defaults.conf config:
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
The S3 connectors from Amazon (s3://) and the ASF (s3a://) don't work with Glacier. Certainly nobody tests s3a against glacier. and if there were problems, you'd be left to fix them yourself. Just copy the data into s3 or onto local HDFS and then work with it there

Write to S3 from Spark without access and secret keys

Our EC2 server is configured to allow access to my-bucket when using DefaultAWSCredentialsProviderChain, so the following code using plain AWS SDK works fine:
AmazonS3 s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
s3client.putObject(new PutObjectRequest("my-bucket", "my-object", "/path/to/my-file.txt"));
Spark's S3AOutputStream uses the same SDK internally, however trying to upload a file without providing acces and secret keys doesn't work:
sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// not setting access and secret key
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");
gives:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 25DF243A166206A0, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: Ki5SP11xQEMKb0m0UZNXb4FhfWLMdbehbknQ+jeZuO/wjhwurjkFoEYVfrQfW1KIq435Lo9jPkw=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:130)
<truncated>
Is there a way to force Spark to use default credential provider chain instead of relying on access and secret key?
technically, Hadoop's s3a output stream. Look at the stack trace to see who to file bugreports against :)
And s3a does support Instance Credentials from Hadoop 2.7+, proof.
If you can't connect, you need to have the 2.7 JARs on your CP, along with the exact version of the AWS SDK is used (1.7.4, I recall).
Spark has one little feature: if you submit work and you have the AWS_* env vars set, then it picks them up, copies them in as the fs.s3a keys, so propagating them to your systems.

How to read Azure Table Storage data from Apache Spark running on HDInsight

Is it any way of doing that from a Spark application running on Azure HDInsight? We are using Scala.
Azure Blobs are supported (through WASB). I don't understand why Azure Tables aren't.
Thanks in advance
You can actually read from Table Storage in Spark, here's a project done by a Microsoft guy doing just that:
https://github.com/mooso/azure-tables-hadoop
You probably won't need all the Hive stuff, just the classes at root level:
AzureTableConfiguration.java
AzureTableInputFormat.java
AzureTableInputSplit.java
AzureTablePartitioner.java
AzureTableRecordReader.java
BaseAzureTablePartitioner.java
DefaultTablePartitioner.java
PartitionInputSplit.java
WritableEntity.java
You can read with something like this:
import org.apache.hadoop.conf.Configuration
sparkContext.newAPIHadoopRDD(getTableConfig(tableName,account,key),
classOf[AzureTableInputFormat],
classOf[Text],
classOf[WritableEntity])
def getTableConfig(tableName : String, account : String, key : String): Configuration = {
val configuration = new Configuration()
configuration.set("azure.table.name", tableName)
configuration.set("azure.table.account.uri", account)
configuration.set("azure.table.storage.key", key)
configuration
}
You will have to write a decoding function to transform your WritableEntity to the Class you want.
It worked for me!
Currently Azure Tables are not supported. Only Azure blobs support the HDFS interface required by Hadoop & Spark.

Resources