Read CSV file from AWS S3 - apache-spark

I have an EC2 instance running pyspark, and I can connect to it over SSH and run interactive code in a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
This throws a long Python error message that ends with something like:
Py4JJavaError: An error occurred while calling o131.csv.

Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'

Access key information can be supplied in the usual username + password form for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
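A minimal Python sketch of the approach above, assuming the hadoop-aws module and a matching AWS SDK are on the Spark classpath. The helper names `https_to_s3a` and `read_s3_csv`, and the use of the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables, are illustrative assumptions and not part of the answers above. It rewrites the path-style HTTPS URL from the question into an `s3a://` URI and passes the keys through the Hadoop configuration instead of embedding them in the URL, which newer Hadoop releases discourage.

```python
import os
from urllib.parse import urlparse


def https_to_s3a(url):
    """Turn a path-style S3 URL (https://s3.<region>.amazonaws.com/<bucket>/<key>)
    into the s3a://<bucket>/<key> form used by Spark's S3A connector."""
    parsed = urlparse(url)
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError("expected a path-style URL with bucket and key: " + url)
    return "s3a://{}/{}".format(bucket, key)


def read_s3_csv(url, app_name="Basics"):
    """Hypothetical helper: read a CSV from S3 with keys taken from the environment."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark installed

    spark = (
        SparkSession.builder.appName(app_name)
        # spark.hadoop.* options are copied into the Hadoop configuration
        .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
        .getOrCreate()
    )
    return spark.read.csv(https_to_s3a(url))
```

With this in place, `read_s3_csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')` would read from `s3a://bucketname/filename.csv`.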

Related

Can't access files via Local file API on Databricks

I'm trying to access a small text file stored directly on DBFS using the local file API.
I'm getting the following error.
No such file or directory
My code:
import scala.io.Source

val filename = "/dbfs/test/test.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}
At the same time, I can access this file without any problems using dbutils, or load it into an RDD via the Spark context.
I've tried specifying the path starting with dbfs:/ or /dbfs/ or just with the test folder name, both in Scala and Python, and I get the same error each time. I'm running the code from a notebook. Is it a problem with the cluster configuration?
Check whether your cluster has Credential Passthrough enabled. If so, the local file API is not available.
https://docs.azuredatabricks.net/data/databricks-file-system.html#local-file-apis
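As a rough Python sketch of the path handling involved: `dbfs:/` is the scheme used by dbutils and Spark, while local file APIs go through the `/dbfs/` mount on the driver. The helper names `to_local_dbfs_path` and `read_lines` are hypothetical, and treating a bare path as relative to the DBFS root is an assumption.

```python
import os


def to_local_dbfs_path(path):
    """Map a dbfs:/ URI to the /dbfs/... path used by local file APIs.
    Paths that are already under /dbfs/ are returned unchanged."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    if path.startswith("/dbfs/"):
        return path
    # Assumption: a bare path such as test/test.txt is relative to the DBFS root
    return "/dbfs/" + path.lstrip("/")


def read_lines(path):
    """Read a DBFS file through the local mount, failing with a clearer message."""
    local_path = to_local_dbfs_path(path)
    if not os.path.exists(local_path):
        raise FileNotFoundError(
            local_path + " is not visible on the driver; the /dbfs mount may be "
            "unavailable on this cluster (for example with credential passthrough)"
        )
    with open(local_path) as handle:
        return [line.rstrip("\n") for line in handle]
```

If `read_lines("dbfs:/test/test.txt")` raises while dbutils can still list the file, that points at the mount rather than at the path.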

fs.s3 configuration with two S3 accounts on EMR

I have a pipeline using Lambda and EMR, where I read CSV from an S3 bucket in account A and write Parquet to another S3 bucket in account B.
I created the EMR cluster in account B, and it has access to S3 in account B.
I cannot add access to the account A bucket in EMR_EC2_DefaultRole (that account is enterprise-wide data storage), so I use an access key and secret key to access the account A bucket. These are obtained through a Cognito token.
METHOD1
I am using the fs.s3 protocol to read CSV from S3 in account A and write to S3 in account B.
I have PySpark code that reads from S3 (A) and writes Parquet to S3 (B). I submit about 100 jobs at a time. This PySpark code runs on EMR.
Reading uses the following settings:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.set("fs.s3.awsAccessKeyId", dl_access_key)
hadoop_config.set("fs.s3.awsSecretAccessKey", dl_secret_key)
hadoop_config.set("fs.s3.awsSessionToken", dl_session_key)
spark_df_csv = spark_session.read.option("Header", "True").csv("s3://somepath")
Writing:
I am using the s3a protocol: s3a://some_bucket/
It works, but sometimes I see a _temporary folder left in the S3 bucket, and not all CSV files are converted to Parquet.
When I set EMR step concurrency to 256 (EMR 5.28) and submit 100 jobs, I get a _temporary rename error.
Issues:
This method creates a _temporary folder and sometimes does not delete it; I can see the _temporary folder in the S3 bucket.
When I enable EMR concurrency (latest EMR version, 5.28), which allows steps to run in parallel, I get a _temporary rename error for some of the files.
METHOD2:
I feel s3a is not good for parallel jobs.
So I want to read and write using fs.s3, as it has better file committers.
Initially I set the Hadoop configuration as above for account A, and then unset it so that the job falls back to the default credentials for the account B bucket.
Like this:
hadoop_config = sc._jsc.hadoopConfiguration()
hadoop_config.unset("fs.s3.awsAccessKeyId")
hadoop_config.unset("fs.s3.awsSecretAccessKey")
hadoop_config.unset("fs.s3.awsSessionToken")
spark_df_csv.repartition(1).write.partitionBy(['org_id', 'institution_id']). \
mode('append').parquet(write_path)
Issues:
This works, but if I trigger the Lambda, which in turn submits jobs for 100 files (in a loop), about 10 of the files fail with access denied while writing to the S3 bucket.
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n ... 1 more\nCaused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service:
This could be because the unset sometimes does not work, or
because, with parallel runs, the set and unset on the Spark context/session happen concurrently. I mean the Spark context for one job is unsetting the Hadoop configuration while another is setting it, which may cause this issue, though I am not sure how Spark contexts work in parallel.
Doesn't each job have a separate Spark context and session?
Please suggest alternatives for my situation.
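One possible alternative, offered here as an assumption and not taken from the post: the S3A connector (Hadoop 2.8 and later) supports per-bucket settings, so the temporary credentials for account A can be bound to that bucket alone and nothing has to be set and then unset. The sketch below only builds the options; the function names and the bucket name are hypothetical, and it assumes the account A data is read through `s3a://`, not through EMR's own `s3://` connector.

```python
def per_bucket_s3a_options(bucket, access_key, secret_key, session_token):
    """Build spark.hadoop.* options that attach temporary credentials to one
    bucket only; other buckets keep whatever credentials they already use."""
    prefix = "spark.hadoop.fs.s3a.bucket.{}.".format(bucket)
    return {
        prefix + "aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
        prefix + "session.token": session_token,
    }


def build_session(options, app_name="csv-to-parquet"):
    """Create a SparkSession with the given options (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import keeps the helper above testable

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Usage would look like `build_session(per_bucket_s3a_options("account-a-bucket", dl_access_key, dl_secret_key, dl_session_key))`, followed by a read from `s3a://account-a-bucket/...` and a write to the account B bucket with the cluster's default role.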

Access hdfs cluster from pydoop

I have an HDFS cluster and Python running on the same Google Cloud Platform. I want to access files in the HDFS cluster from Python. I found that pydoop can do that, but I am struggling to give it the right parameters. Below is the code I have tried so far:
import pydoop.hdfs as hdfs
import pydoop
pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
"""
class pydoop.hdfs.hdfs(host='default', port=0, user=None, groups=None)
A handle to an HDFS instance.
Parameters
host (str) – hostname or IP address of the HDFS NameNode. Set to an empty string (and port to 0) to connect to the local file system; set to 'default' (and port to 0) to connect to the default (i.e., the one defined in the Hadoop configuration files) file system.
port (int) – the port on which the NameNode is listening
user (str) – the Hadoop domain user name. Defaults to the current UNIX user. Note that, in MapReduce applications, since tasks are spawned by the JobTracker, the default user will be the one that started the JobTracker itself.
groups (list) – ignored. Included for backwards compatibility.
"""
#print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
It gives this error:-
RuntimeError: Hadoop config not found, try setting HADOOP_CONF_DIR
And if I execute this line of code:
print (hdfs.ls("/vs_co2_all_2019_v1.csv"))
nothing happens. The file "vs_co2_all_2019_v1.csv" does exist in the cluster, although it was not available at the moment I took the screenshot.
My hdfs screenshot is shown below:
and the credentials that I have are shown below:
Can anybody tell me what I am doing wrong? Which credentials do I need to put where in the pydoop API? Or maybe there is a simpler way around this problem. Any help will be much appreciated!
Have you tried the following?
import pydoop.hdfs as hdfs
import pydoop
hdfs_object = pydoop.hdfs.hdfs(host='url of the file system goes here',
port=9864, user=None, groups=None)
hdfs_object.list_directory("/vs_co2_all_2019_v1.csv")
or simply:
hdfs_object.list_directory("/")
Keep in mind that the module-level functions in pydoop.hdfs are not tied to the hdfs instance you created (hdfs_object). Thus, the connection that you established in the first command is not used by hdfs.ls("/vs_co2_all_2019_v1.csv").
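A small sketch of that point, keeping the handle and calling methods on it. The names `list_names` and `connect` are hypothetical, and it assumes `list_directory` returns a list of dictionaries with a `name` entry. Note also that 9864 is normally the DataNode web port in Hadoop 3, while the NameNode RPC port is commonly 8020 or 9000; which one applies here is an assumption about the cluster.

```python
def list_names(fs, path):
    """Return the names under `path` using an explicit HDFS handle.
    `fs` is expected to behave like pydoop.hdfs.hdfs (has list_directory)."""
    return [entry["name"] for entry in fs.list_directory(path)]


def connect(host, port, user=None):
    """Open a handle to the NameNode (needs pydoop and a Hadoop client install)."""
    import pydoop.hdfs as hdfs  # lazy import so list_names can be tested alone

    return hdfs.hdfs(host=host, port=port, user=user)
```

For example, `list_names(connect("namenode-host", 8020), "/")` lists the root directory through the handle, where `namenode-host` is a placeholder.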

spark read partitioned data in S3 partly in glacier

I have a Parquet dataset in S3 partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have...
s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.:
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
Unfortunately, I get the exception
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
It seems that Spark does not like partitioned datasets when some partitions are in Glacier. I could always read each date specifically, add a column with the current date, and reduce(_ union _) at the end, but it is ugly as hell and should not be necessary.
Is there any tip for reading the available data in the datastore even with old data in Glacier?
The error you are getting is not related to Apache Spark. You get the exception because of the Glacier service: in short, S3 objects in the Glacier storage class are not accessible in the same way as normal objects; they need to be retrieved from Glacier before they can be read.
Apache Spark cannot directly handle a table or partition mapped to S3 objects in Glacier storage.
java.io.IOException:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The operation is not valid for the object's storage class (Service:
Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request
ID: C444D508B6042138)
When S3 moves objects from the S3 storage classes
STANDARD,
STANDARD_IA,
REDUCED_REDUNDANCY
to the GLACIER storage class, the object is stored in Glacier, where it is not directly visible
to you, and S3 bills only Glacier storage rates.
It is still an S3 object, but it has the GLACIER storage class.
When you need to access one of these objects, you initiate a restore,
which places a temporary copy in S3.
Restoring the data into the S3 bucket and then reading it into Apache Spark will resolve your issue.
https://aws.amazon.com/s3/storage-classes/
Note: Apache Spark, AWS Athena, etc. cannot read objects directly from Glacier; if you try, you will get a 403 error.
If you archive objects using the Glacier storage option, you must
inspect the storage class of an object before you attempt to retrieve
it. The customary GET request will work as expected if the object is
stored in S3 Standard or Reduced Redundancy (RRS) storage. It will
fail (with a 403 error) if the object is archived in Glacier. In this
case, you must use the RESTORE operation (described below) to make
your data available in S3.
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
The 403 error occurs because you cannot read an object that is archived in Glacier (source).
Reading Files from Glacier
If you want to read files from Glacier, you need to restore them to S3 before using them in Apache Spark. A copy will be available on S3 for the period specified in the restore command (for details see here). You can use the S3 console, the CLI, or any language SDK to do that.
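The restore step could be sketched in Python roughly as follows, assuming boto3 is installed and the credentials allow `s3:RestoreObject`. The function name `restore_glacier_prefix` and the default values for `days` and `tier` are illustrative assumptions.

```python
def restore_glacier_prefix(s3_client, bucket, prefix, days=7, tier="Standard"):
    """Request a temporary restore for every GLACIER object under a prefix.
    `s3_client` is a boto3 S3 client, e.g. boto3.client("s3").
    Returns the keys for which a restore was requested."""
    requested = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj.get("StorageClass") != "GLACIER":
                continue  # already readable, nothing to restore
            s3_client.restore_object(
                Bucket=bucket,
                Key=obj["Key"],
                RestoreRequest={
                    "Days": days,  # how long the restored copy stays available
                    "GlacierJobParameters": {"Tier": tier},
                },
            )
            requested.append(obj["Key"])
    return requested
```

Restores are asynchronous, so the objects become readable only some time after the request; a second request for an object whose restore is still in progress is rejected by S3, which a fuller version would need to handle.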
Discarding some Glacier files that you do not want to restore
Let's say you do not want to restore all the files from Glacier and would rather skip them during processing. From Spark 2.1.1 and 2.2.0 onward, you can ignore those files (which fail with an IO/Runtime exception) by setting spark.sql.files.ignoreCorruptFiles to true (source).
If you define your table through Hive and use the Hive metastore catalog to query it, Spark won't try to read the non-selected partitions.
Take a look at the spark.sql.hive.metastorePartitionPruning setting.
Try this setting:
ss.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
or
add this to the spark-defaults.conf config:
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
The S3 connectors from Amazon (s3://) and the ASF (s3a://) don't work with Glacier. Certainly nobody tests s3a against Glacier, and if there were problems, you'd be left to fix them yourself. Just copy the data into S3 or onto local HDFS and then work with it there.
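The per-date workaround mentioned in the question can also be written without a union, by passing the wanted partition directories explicitly together with Spark's `basePath` option, so that `dt` is still exposed as a column. This is a sketch in Python; `partition_paths` and `read_partitions` are hypothetical names, and it assumes a directory exists for every date in the range and that none of those dates is in Glacier.

```python
from datetime import date, timedelta


def partition_paths(base_path, start, end):
    """List dt=YYYY-MM-DD partition directories for an inclusive date range."""
    base = base_path.rstrip("/")
    days = (end - start).days
    return [
        "{}/dt={}/".format(base, (start + timedelta(days=offset)).isoformat())
        for offset in range(days + 1)
    ]


def read_partitions(spark, base_path, start, end):
    """Read only the listed partitions; basePath keeps `dt` as a partition column."""
    paths = partition_paths(base_path, start, end)
    return spark.read.option("basePath", base_path).parquet(*paths)
```

For instance, `read_partitions(spark, "s3://my-bucket/my-dataset/", date(2017, 7, 15), date(2017, 7, 24))` only touches the ten listed directories.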

Write to S3 from Spark without access and secret keys

Our EC2 server is configured to allow access to my-bucket when using DefaultAWSCredentialsProviderChain, so the following code using plain AWS SDK works fine:
AmazonS3 s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
s3client.putObject(new PutObjectRequest("my-bucket", "my-object", "/path/to/my-file.txt"));
Spark's S3AOutputStream uses the same SDK internally; however, trying to upload a file without providing access and secret keys doesn't work:
sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// not setting access and secret key
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");
gives:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 25DF243A166206A0, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: Ki5SP11xQEMKb0m0UZNXb4FhfWLMdbehbknQ+jeZuO/wjhwurjkFoEYVfrQfW1KIq435Lo9jPkw=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:130)
<truncated>
Is there a way to force Spark to use default credential provider chain instead of relying on access and secret key?
Technically, it's Hadoop's s3a output stream. Look at the stack trace to see who to file bug reports against :)
And s3a does support instance credentials from Hadoop 2.7+ (proof).
If you can't connect, you need to have the 2.7 JARs on your classpath, along with the exact version of the AWS SDK they were built against (1.7.4, I recall).
Spark has one little feature: if you submit work and you have the AWS_* environment variables set, it picks them up and copies them in as the fs.s3a keys, thereby propagating them to your systems.
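On Hadoop 2.8 and later, S3A also exposes the `fs.s3a.aws.credentials.provider` option, and pointing it at the SDK's default chain is one way to get the behaviour asked for. The sketch below is in Python, not the question's Java; the function names are hypothetical, and it assumes the AWS SDK v1 class `com.amazonaws.auth.DefaultAWSCredentialsProviderChain` is on the classpath.

```python
DEFAULT_CHAIN = "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"


def default_chain_options(provider=DEFAULT_CHAIN):
    """spark.hadoop.* options that make S3A use a credentials provider class
    instead of fs.s3a.access.key / fs.s3a.secret.key."""
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.aws.credentials.provider": provider,
    }


def save_text(lines, target, options=None):
    """Write lines to S3 without configuring keys (requires pyspark)."""
    from pyspark import SparkConf, SparkContext  # lazy import

    conf = SparkConf().setAppName("s3a-default-chain")
    for key, value in (options or default_chain_options()).items():
        conf = conf.set(key, value)
    sc = SparkContext.getOrCreate(conf)
    sc.parallelize(lines).saveAsTextFile(target)
```

A call such as `save_text(["hello", "stackoverflow"], "s3a://my-bucket/my-file-txt")` mirrors the Java snippet in the question.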
