Writing databricks dataframe to S3 using python - apache-spark

I have a Databricks DataFrame called df. I want to write it to an S3 bucket as a CSV file. I have the S3 bucket name and other credentials. I checked the online documentation at https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3 and it says to use the following commands:
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
dbutils.fs.put(s"/mnt/$MountName", "<file content>")
But what I have is a dataframe and not a file. How can I achieve it?

I had the same problem. I found two solutions.
1st
df.write \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("s3a://{}:{}@{}/{}".format(ACCESS_KEY, SECRET_KEY, BUCKET_NAME, DIRECTORY))
Worked like a charm.
2nd
You can indeed mount an S3 bucket and then write a file to it directly, like this:
#### MOUNT AND READ S3 FILES
AWS_BUCKET_NAME = "your-bucket-name"
MOUNT_NAME = "a-directory-name"
dbutils.fs.mount("s3a://%s" % AWS_BUCKET_NAME, "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
#### WRITE FILE
df.write.save('/mnt/{}/{}'.format(MOUNT_NAME, "another-directory-name"), format='csv')
This is also going to sync to your S3 Bucket.
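A practical wrinkle when credentials are embedded in an s3a:// URI: AWS secret keys may contain characters such as "/" that break URI parsing, so the secret is usually percent-encoded first. The sketch below shows one way to do that with the standard library. The helper name s3a_uri and the sample values are hypothetical, and embedding credentials in a URI is assumed here only because the answer above does so; many setups prefer instance profiles or Hadoop configuration properties instead.

```python
from urllib.parse import quote


def s3a_uri(access_key, secret_key, bucket, path=""):
    """Build an s3a:// URI with embedded credentials (hypothetical helper)."""
    # Percent-encode the secret so characters like "/" or "+" cannot be
    # mistaken for URI delimiters.
    encoded_secret = quote(secret_key, safe="")
    return "s3a://{}:{}@{}/{}".format(
        access_key, encoded_secret, bucket, path.lstrip("/")
    )


# Made-up credentials, for illustration only.
print(s3a_uri("AKIAEXAMPLE", "abc/def+ghi", "my-bucket", "out/csv"))
```

The resulting string can be passed to .save(...) in place of the hand-formatted URI.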

Related

How to create S3 bucket dynamically in pyspark

I would like to create an S3 bucket named in YYYY-MM-DD format if it does not already exist, and store my transformed Parquet files there. How do you achieve this in PySpark? Should I use boto3, or does PySpark have something built in?
I am using the code below to read data from S3. I would like to create the S3 bucket and put my transformed files there.
spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id)
spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key", config.access_key)
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
It seems you just need to enable fs.s3.buckets.create.enabled.
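If you go the boto3 route mentioned in the question, a "create if not exists" helper could look roughly like the sketch below. It uses the boto3 client calls list_buckets and create_bucket. The function name ensure_dated_bucket and the prefix argument are assumptions added for illustration: bucket names are globally unique, so a bare YYYY-MM-DD name is unlikely to be available. Note that list_buckets only reports buckets owned by the calling account.

```python
from datetime import date


def ensure_dated_bucket(s3_client, prefix, day=None, region=None):
    """Create '<prefix>-YYYY-MM-DD' if the account does not own it yet.

    s3_client is expected to behave like boto3.client("s3").
    """
    day = day or date.today()
    name = "{}-{}".format(prefix, day.isoformat())

    # Names of the buckets owned by the current account.
    existing = {b["Name"] for b in s3_client.list_buckets()["Buckets"]}
    if name not in existing:
        kwargs = {"Bucket": name}
        # Outside us-east-1, S3 requires an explicit location constraint.
        if region and region != "us-east-1":
            kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
        s3_client.create_bucket(**kwargs)
    return name


# Possible usage (requires boto3 and valid credentials):
#   import boto3
#   bucket = ensure_dated_bucket(boto3.client("s3"), "my-team-exports")
#   df.write.parquet("s3a://{}/transformed/".format(bucket))
```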

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back to the same location. But when I write the file, I get a lot of files with different names.
I am not sure why these files are created in the location I specified.
Also, another file named "new_location" was created after I performed the write operation.
What I want is, after reading the file from Azure Blob Storage, to write it back to the same location with the same name as the original. But I am unable to do so. Please help me out, as I am new to PySpark.
I have already mounted the storage, and I am now reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
Spark will save a partial csv file for each partition of your dataset. To generate a single csv file, you can convert it to a pandas dataframe, and then write it out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)
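If the data is too large to collect into pandas, another option is to let Spark write a single part file (for example with df.coalesce(1).write...) and then rename that file afterwards. The sketch below covers only the rename step, using the local file API. It assumes the output directory is reachable through a local path such as the /dbfs/ prefix mentioned above; the function name promote_single_part_file is hypothetical, and the target path must lie outside the output directory because that directory is deleted at the end.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, target_path):
    """Move the only part-*.csv in output_dir to target_path, then drop output_dir."""
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(
            "expected exactly one part file, found {}".format(len(parts))
        )
    shutil.move(parts[0], target_path)
    # Remove the leftover directory, including marker files such as _SUCCESS.
    shutil.rmtree(output_dir)
    return target_path


# Possible usage after a single-partition write to /mnt/ndemo/nsalman/new_location:
#   promote_single_part_file("/dbfs/mnt/ndemo/nsalman/new_location",
#                            "/dbfs/mnt/ndemo/nsalman/addresses.csv")
```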

Partitioned download of a CSV using S3A from S3 object storage

I am looking to download a file that has been saved to s3 using partitioned upload. I tried to add * at the end of my address but it looks like this format is not valid. My code is as follows:
df = spark.read.csv('s3a://bucket-name/file.csv/*',
                    header='true',
                    inferSchema='true')
The files are stored with the following:
file.csv/part1.csv
file.csv/part2.csv
I'm wondering whether using * is supported or not. And if not, what is an alternative?
You can try giving just the directory location, as below, instead of specifying the '*' character:
val df=spark.read
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.csv("s3a://bucket-name/file.csv/")

How to write parquet file from pandas dataframe in S3 in python

I have a pandas DataFrame. I want to write this DataFrame to a Parquet file in S3.
I need sample code for this. I tried to Google it, but I could not find a working sample.
For your reference, the following code works for me.
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the .aws folder.
Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909
The function below writes the Parquet output to an in-memory buffer and then sends buffer.getvalue() to S3, without any need to save the Parquet file locally.
Also, since you are creating an S3 client, you can create credentials using AWS S3 keys that can be stored locally, in an Airflow connection, or in AWS Secrets Manager.
from io import BytesIO, StringIO

def dataframe_to_s3(s3_client, input_datafame, bucket_name, filepath, format):
    # serialize the dataframe into an in-memory buffer, then upload it
    if format == 'parquet':
        out_buffer = BytesIO()
        input_datafame.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        out_buffer = StringIO()
        input_datafame.to_csv(out_buffer, index=False)
    else:
        raise ValueError("format must be 'parquet' or 'csv'")
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
s3_client is simply a boto3 client object. Hope this helps!
courtesy- https://stackoverflow.com/a/40615630/12036254
First ensure that you have pyarrow or fastparquet installed with pandas.
Then install boto3 and aws cli. Use aws cli to set up the config and credentials files, located at .aws folder.
Here is a simple script using pyarrow, and boto3 to create a temporary parquet file and then send to AWS S3.
Sample code:
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")

def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # write the dataframe to a local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to s3; parquet is binary, so read the file in 'rb' mode
    s3 = boto3.client("s3")
    with open(fileName, 'rb') as f:
        object_data = f.read()
    s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
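One drawback of the temporary-file approach is that the local file is left behind. A possible variation, sketched below, writes into a tempfile.TemporaryDirectory so the file is removed automatically after the upload. The helper name upload_via_temp_file and its write_fn parameter are hypothetical; s3_client is assumed to behave like boto3.client("s3"), whose upload_file method takes a local filename, a bucket, and a key.

```python
import os
import tempfile


def upload_via_temp_file(s3_client, write_fn, bucket, key):
    """Write a file with write_fn(path), upload it, then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(key))
        # write_fn is any callable that writes a file at the given path,
        # for example a pandas DataFrame's to_parquet method.
        write_fn(local_path)
        s3_client.upload_file(local_path, bucket, key)
    # The temporary directory and the file inside it are gone at this point.


# Possible usage (requires boto3, pandas and pyarrow):
#   upload_via_temp_file(boto3.client("s3"), df.to_parquet,
#                        "bucket", "folder/test/file.parquet")
```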
For Python 3.6+, AWS has a library called aws-data-wrangler (awswrangler) that helps with the integration between pandas, S3, and Parquet.
To install it, run:
pip install awswrangler
To write your pandas DataFrame as a Parquet file to S3:
import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)

Specify AWS profile name to use when uploading Pandas dataframe to S3

I would like to upload a pandas DataFrame directly to S3 by specifying an S3 URL. I have a multi-profile AWS environment, and I would like to specify the name of the profile to use for this upload.
Since it is not possible to specify the region in the S3 URL, I would also like to know if there is any other way to specify the (non-default) region in the code.
I could not find any such option in the s3fs library, which is used internally by pandas for uploading to S3.
Note that I do not want to use environment variables or modify the default configuration in the AWS credentials files.
import pandas as pd
data = [1, 2, 3]
df = pd.DataFrame()
# I would like to specify non-default profile to use here
s3_url = 's3://my_bucket/path/to/file.parquet'
df.to_parquet(s3_url)
Use a session:
session = boto3.Session(profile_name='dev')
s3_client = session.client('s3')
Save the DataFrame to a Parquet file:
df.to_parquet(parquet_pandas_file)
Upload the file to S3:
with open(parquet_pandas_file, 'rb') as s3_source_data:
    s3_client.upload_fileobj(s3_source_data, 'bucket_name', 'bucket_key_name')
Use the code below to set the profile name when using s3fs:
fs = s3fs.S3FileSystem(profile_name='<profile name>')
with fs.open('s3://bucketname/root1/file.csv', 'w') as f:
    df.to_csv(f)
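The session-based answer can also be done without an intermediate local file, by serializing into memory and handing the buffer to upload_fileobj. The sketch below is one way this might look. The function name upload_parquet_with_session is hypothetical, session is assumed to be a boto3.Session created with profile_name (and optionally region_name), and df is assumed to be a pandas DataFrame with pyarrow or fastparquet installed.

```python
from io import BytesIO


def upload_parquet_with_session(session, df, bucket, key):
    """Serialize df to Parquet in memory and upload it with the session's S3 client."""
    buffer = BytesIO()
    df.to_parquet(buffer, index=False)
    # Rewind so the upload starts reading from the first byte.
    buffer.seek(0)
    session.client("s3").upload_fileobj(buffer, bucket, key)


# Possible usage (requires boto3 and pandas):
#   import boto3
#   session = boto3.Session(profile_name="dev", region_name="eu-west-1")
#   upload_parquet_with_session(session, df, "my_bucket", "path/to/file.parquet")
```

Because the profile and region live on the session object, nothing has to change in environment variables or in the default AWS configuration files.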
