How to create S3 bucket dynamically in pyspark - apache-spark

I would like to "create if not exists" an S3 bucket named in YYYY-MM-DD format and store my transformed parquet files there. How do you achieve this in PySpark? Should I use boto3, or does PySpark have something built-in?
I am using the code below to read data from S3. I would like to create the bucket and put my transformed files in it.
spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id)
spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key", config.access_key)
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

It seems you just need to enable: fs.s3.buckets.create.enabled
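Alternatively, if you would rather create the bucket explicitly with boto3 before writing, here is a minimal sketch. The bucket naming scheme and region handling are assumptions based on the question; config.access_id / config.access_key and df come from the question's own code, and bucket names must be globally unique.
import boto3
from datetime import date
from botocore.exceptions import ClientError

bucket_name = date.today().strftime("%Y-%m-%d") + "-my-transformed-data"  # placeholder naming scheme
s3 = boto3.client("s3",
                  aws_access_key_id=config.access_id,
                  aws_secret_access_key=config.access_key)
try:
    s3.head_bucket(Bucket=bucket_name)   # succeeds if the bucket already exists and is accessible
except ClientError:
    # Outside us-east-1 you also need CreateBucketConfiguration={'LocationConstraint': region}
    s3.create_bucket(Bucket=bucket_name)

# df is your transformed DataFrame
df.write.parquet("s3a://{}/output/".format(bucket_name), mode="overwrite")
With the s3a connector the bucket generally has to exist before you write to it, so creating it once with boto3 ahead of the write is a reasonable approach.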

Related

parse gz file in aws s3 using python

I am trying to bulk-copy tables from Snowflake to PostgreSQL. From Snowflake, I was able to extract the tables in CSV format using COPY, which compresses the extracts as gz files in AWS S3.
The second step is to load these files into PostgreSQL. I plan to use the PostgreSQL COPY utility to ingest the data, but I don't want to unzip the files first. I would rather stream the data directly from the gz files and pass that buffer as input to the psycopg2 copy_from function.
Is there a way to parse gz files in AWS S3 using Python? Thanks in advance!
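One possible sketch, using boto3 to fetch the object, gzip to decompress it in memory, and psycopg2's copy_expert to stream it into a table. The bucket, key, table name, and connection string are placeholders, and this assumes the CSV has a header row.
import gzip
import io
import boto3
import psycopg2

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="exports/table1.csv.gz")  # placeholder bucket/key
buf = io.BytesIO(obj["Body"].read())          # pull the compressed bytes into memory
decompressed = gzip.GzipFile(fileobj=buf)     # file-like object yielding the decompressed CSV

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY table1 FROM STDIN WITH (FORMAT csv, HEADER true)", decompressed)
For very large extracts you may prefer to stream in chunks rather than read the whole object into memory.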

How to get the file(s) created by Spark df.write?

I have a requirement to capture the parquet files created as the outcome of a df.write.parquet("s3://bkt/folder", mode="append") command.
I am running this on AWS EMR with PySpark.
I can achieve this with awswrangler using wr.s3.to_parquet(), but it is not really a fit for my EMR Spark use case.
Is there such functionality?
I want the list of files in s3://bkt/folder that Spark wrote.
Thanks all.
If you want a list of the files that Spark wrote to a particular S3 path, you can use either of the approaches below.
Use input_file_name, which gives the file path each record originates from, then select that column and take the distinct values:
from pyspark.sql.functions import input_file_name
df = spark.read.parquet("s3://bkt/folder")
files = df.withColumn("filename", input_file_name()).select("filename").distinct()
files.show(truncate=False)
Or you can use boto3 to list the files:
from boto3 import client
conn = client('s3')  # assumes credentials are configured, e.g. via the default AWS config
for key in conn.list_objects(Bucket='bkt', Prefix='folder/')['Contents']:
    print(key['Key'])
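Note that list_objects returns at most 1,000 keys per call, so for larger prefixes a paginator is safer. A sketch, with the bucket and prefix as placeholders:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='bkt', Prefix='folder/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])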

Lambda Function to convert csv to parquet from s3

I have a requirement:
1. Convert a parquet file present in S3 to CSV format and place it back in S3. The process should exclude the use of EMR.
2. The parquet file has more than 100 columns; I need to extract just 4 of them and create the CSV in S3.
Does anyone have a solution to this?
Note - cannot use EMR or AWS Glue.
Assuming you want to keep things simple within the AWS environment, without using Spark (Glue / EMR), you could use AWS Athena in the following way.
Let's say your parquet files are located in s3://bucket/parquet/.
You can create a table in the Data Catalog (e.g. using Athena or a Glue Crawler) pointing to that parquet location, for example by running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
    col_1 string,
    ...
    col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket/parquet/';
Once you can query your parquet_table table (since it is partitioned, remember to load the partitions first, e.g. with MSCK REPAIR TABLE parquet_table), you should be able to create the CSV files in the following way, again using Athena and selecting only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table;
After this, you can drop the temporary csv_table and just use the CSV files under s3://bucket/csv/, for example by having an S3-triggered Lambda function pick them up for further processing.
Remember that all of this can be driven from Lambda by interacting with Athena (example here), and bear in mind that Athena also has an ODBC connector and PyAthena for use from Python, so Lambda and the AWS Console are not your only options if you want to automate this differently. A sketch of invoking Athena from Python is shown below.
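For illustration, a minimal sketch of kicking off a query from Python (e.g. inside a Lambda) with boto3; the database name, query, and output location are placeholders:
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT col_1, col_2, col_3, col_4 FROM parquet_table",   # or the CTAS statement above
    QueryExecutionContext={"Database": "my_database"},                    # placeholder database
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"} # placeholder output location
)
query_id = response["QueryExecutionId"]
# Poll for completion with athena.get_query_execution(QueryExecutionId=query_id)
# and check the Status.State field (SUCCEEDED / FAILED / CANCELLED).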
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in pandas: I think the best way would be a Glue Python Shell job, but you mentioned you didn't want to use Glue. If you do decide to, here is a basic example:
import sys
import pandas as pd
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])
## #params and #variables: [JOB_NAME]
## Variables used for now. Job input parameters to be used.
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']
## AWS job settings
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)
for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file_name = s3_key.split('/')[-1]
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    df = pd.read_csv(s3_file['Body'], sep=';')
    # partKey_variable, year_variable, month_variable and day_variable are placeholders:
    # derive them from the data or from the object key as needed.
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)
    # Write the new dataset back to S3 as CSV (put expects bytes or a string, not a DataFrame):
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(Body=df.to_csv(index=False))
Carlos.
It all depends on your business requirement and whether you want the call to be asynchronous or synchronous:
You can trigger a Lambda function asynchronously when a parquet file arrives in the specified bucket (GitHub example, AWS S3 docs).
You can configure S3 to send a notification to SNS or SQS when an object is added to or removed from the bucket, which in turn can invoke a Lambda to process the file (Triggering a Notification).
You can run a Lambda asynchronously every 5 minutes by scheduling it with AWS CloudWatch Events; the finest resolution of a cron expression is one minute.
You can invoke a Lambda synchronously over HTTPS (REST API endpoint) using API Gateway.
Also check how big your parquet file is, as a Lambda can run for at most 15 minutes (900 seconds).
This page is also worth a look: Using AWS Lambda with Other Services. A sketch of an S3-triggered handler doing the actual conversion follows below.
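For the conversion itself, here is a minimal sketch of an S3-triggered handler using awswrangler (the AWS SDK for pandas, available as a managed Lambda layer). The column names and output location are placeholders, and this is a sketch rather than a production-ready function:
import urllib.parse
import awswrangler as wr

WANTED_COLUMNS = ["col_1", "col_2", "col_3", "col_4"]   # placeholder: the 4 columns to keep
OUTPUT_PREFIX = "s3://my-output-bucket/csv/"            # placeholder output location

def lambda_handler(event, context):
    # Get the bucket/key of the parquet file that triggered the event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    # Read only the columns we need, then write them back out as CSV
    df = wr.s3.read_parquet(path="s3://{}/{}".format(bucket, key), columns=WANTED_COLUMNS)
    out_path = OUTPUT_PREFIX + key.rsplit("/", 1)[-1].replace(".parquet", ".csv")
    wr.s3.to_csv(df=df, path=out_path, index=False)
    return {"written": out_path}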
It is also worth taking a look at CTAS queries in Athena: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
With CTAS we can store the query results in a different format.

Writing databricks dataframe to S3 using python

I have a Databricks DataFrame called df. I want to write it to an S3 bucket as a CSV file. I have the S3 bucket name and the credentials. I checked the online documentation here https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 and it says to use the following commands:
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
dbutils.fs.put(s"/mnt/$MountName", "<file content>")
But what I have is a DataFrame, not a file. How can I achieve this?
I had the same problem. I found two solutions.
1st:
df.write \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("s3a://{}:{}@{}/{}".format(ACCESS_KEY, SECRET_KEY, BUCKET_NAME, DIRECTORY))
Worked like a charm. (Note that if the secret key contains a "/", it needs to be URL-encoded for this URI form to work.)
2nd
You can indeed mount an S3 bucket and then write a file to it directly, like this:
#### MOUNT AND READ S3 FILES
AWS_BUCKET_NAME = "your-bucket-name"
MOUNT_NAME = "a-directory-name"
dbutils.fs.mount("s3a://%s" % AWS_BUCKET_NAME, "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
#### WRITE FILE
df.write.save('/mnt/{}/{}'.format(MOUNT_NAME, "another-directory-name"), format='csv')
This is also going to sync to your S3 Bucket.

Spark SQL DataFrame write to S3

I am using Spark SQL to retrieve records from Redshift and write them to S3. I am referring to this link Spark-Redshift ... I am able to print the schema of the DataFrame, but I am not able to write to S3. It throws this error:
error: S3ServiceException:The AWS Access Key Id you provided does not exist in our records.,Status 403,Error InvalidAccessKeyId
What do I need to do to write to S3? Is there any other way to save Redshift data to S3 using Spark?
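The 403 InvalidAccessKeyId points at the credentials used by the S3 connector rather than at Spark itself, so one thing to try is setting the AWS keys on the Hadoop configuration (as in the first question above) before writing, and double-checking that the key pair is active. A minimal sketch, assuming the s3a connector and that key_id / secret are hypothetical variables holding valid credentials:
# sc is the SparkContext; key_id / secret are hypothetical variables with your credentials
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", key_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret)
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Write the DataFrame retrieved from Redshift out to S3 (placeholder bucket/path)
df.write.mode("overwrite").parquet("s3a://my-output-bucket/redshift-export/")
Note that the spark-redshift connector also unloads data through a tempdir S3 location, which must be readable and writable with the same credentials.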
