Lambda function to convert parquet to csv from S3 - python-3.x

I have a requirement:
1. Convert a parquet file present in S3 to csv format and place it back in S3. The process should exclude the use of EMR.
2. The parquet file has more than 100 columns; I need to extract just 4 of them and create the csv in S3.
Does anyone have a solution for this?
Note - Cannot use EMR or AWS Glue.

Assuming you want to keep things simple within the AWS environment and avoid Spark (Glue / EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in s3://bucket/parquet/.
You can create a table in the Data Catalog (e.g. using Athena or a Glue Crawler) pointing to that parquet location, for example by running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
  col_1 string,
  ...
  col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket/parquet/';
Because the table is partitioned by date, you will also need to load the partitions (for example with MSCK REPAIR TABLE parquet_table, assuming Hive-style date=... prefixes in S3). Once you can query your parquet_table table, which reads the parquet files, you should be able to create the CSV files in the following way, again using Athena and selecting only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table;
After this, you can drop the temporary csv_table and just use the CSV files under s3://bucket/csv/, for example by having an S3-triggered Lambda function pick them up for further processing.
Remember that all of this can be driven from Lambda, interacting with Athena (example here); also bear in mind Athena has an ODBC connector and PyAthena to use it from Python, among other options, so going through Lambda or the AWS Console is not the only way to automate this.
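For reference, here is a minimal sketch of what driving the CTAS query from Lambda with boto3 could look like; the database name and the query result location are placeholders for your own values:
import time
import boto3

athena = boto3.client('athena')

# Submit the CTAS statement shown above
response = athena.start_query_execution(
    QueryString="""
        CREATE TABLE csv_table
        WITH (format = 'TEXTFILE',
              field_delimiter = ',',
              external_location = 's3://bucket/csv/')
        AS SELECT col_1, col_2, col_3, col_4
        FROM parquet_table
    """,
    QueryExecutionContext={'Database': 'your_database'},
    ResultConfiguration={'OutputLocation': 's3://bucket/athena-query-results/'}
)

# Athena runs asynchronously, so poll until the query reaches a terminal state
query_id = response['QueryExecutionId']
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)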
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in Pandas: I think the best way would be a Glue Python Shell job, but you mentioned you didn't want to use Glue. So, if you decide to, here is a basic example of how to:
import sys
import io
import pandas as pd
import boto3
from awsglue.utils import getResolvedOptions

## Job input parameters
args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])

s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']

## AWS clients
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)

for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    df = pd.read_csv(s3_file['Body'], sep=';')

    ## Build the destination key; the partition values and file name below are
    ## placeholders you would compute from your own data
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(
        partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)

    ## Serialize the DataFrame before writing the new dataset back to S3
    csv_buffer = io.StringIO()
    df.to_csv(csv_buffer, index=False)
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(Body=csv_buffer.getvalue())
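Note that the loop above reads csv input with pd.read_csv; for the parquet-to-csv case in the original question you would swap in a parquet read instead, assuming pyarrow (or fastparquet) is available to pandas. A rough sketch, reusing the same s3_client and key variables, with the column names as placeholders:
obj = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
# Read only the 4 columns of interest from the parquet object
df = pd.read_parquet(io.BytesIO(obj['Body'].read()), columns=['col_1', 'col_2', 'col_3', 'col_4'])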
Carlos.

It all depends on your business requirement and what sort of action you want to take: an asynchronous or a synchronous call.
You can trigger a Lambda function asynchronously when a parquet file arrives in the specified bucket (github example, aws s3 doc); a minimal handler sketch is shown below.
You can also configure S3 to send a notification to SNS or SQS when an object is added to or removed from the bucket, which in turn can invoke a Lambda to process the file (Triggering a Notification).
You can run a Lambda asynchronously every 5 minutes by scheduling AWS CloudWatch Events; the finest resolution using a cron expression is one minute.
You can invoke a Lambda synchronously over HTTPS (a REST API endpoint) using API Gateway.
It is also worth checking how big your parquet file is, as Lambda can run for at most 15 minutes (900 seconds).
This page is also worth checking: Using AWS Lambda with Other Services.
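For illustration, here is a minimal sketch of the S3-triggered option, assuming pandas and pyarrow are packaged with the function (for example in a Lambda layer); the output prefix and the column names are placeholders:
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client('s3')

def handler(event, context):
    # One record per object that landed in the bucket
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Read the parquet object and keep only the 4 columns of interest
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        df = pd.read_parquet(io.BytesIO(body), columns=['col_1', 'col_2', 'col_3', 'col_4'])

        # Write the csv version under a separate prefix
        out_key = 'csv/' + key.rsplit('/', 1)[-1].replace('.parquet', '.csv')
        s3.put_object(Bucket=bucket, Key=out_key, Body=df.to_csv(index=False))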

It is worth taking a look at CTAS queries, which Athena added recently: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
With CTAS we can store the query results in a different format.

Related

How to get the file/files created by Spark df.write?

I have a requirement to capture the parquet files created as the outcome of a df.write.parquet("s3://bkt/folder", mode="append") command.
I am running this on AWS EMR with PySpark.
I can achieve this with awswrangler's wr.s3.to_parquet(), but that is not really a fit for my EMR Spark use case.
Is there such functionality?
I want a list of the files from s3://bkt/folder which Spark wrote.
Thx all
If you want a list of files that Spark wrote to a particular S3 path, you can use either of the approaches below:
Use input_file_name, which gives the file path each record originates from, and then select the distinct filenames:
from pyspark.sql.functions import input_file_name

df = spark.read.parquet("s3://bkt/folder")
files_df = df.withColumn("filename", input_file_name()).select("filename").distinct()
Or you can use boto3 to list the files:
from boto3 import client

conn = client('s3')  # again assumes boto.cfg setup, assume AWS S3
for key in conn.list_objects(Bucket='bucket_name')['Contents']:
    print(key['Key'])
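One caveat worth adding: list_objects returns at most 1000 keys per call, so for larger prefixes a paginator is the safer pattern. A sketch, with the bucket and prefix as placeholders:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Walk every page of results for the prefix Spark wrote to
for page in paginator.paginate(Bucket='bkt', Prefix='folder/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])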

boto3 - Getting files only uploaded in the past month in S3

I am writing a python3 lambda function which needs to return all of the files that were uploaded to an S3 bucket in the 30 days before the function is run.
How should I approach this? Ideally, I want to iterate only through the files from the past 30 days and nothing else - there are thousands upon thousands of files in the S3 bucket that I am iterating through, and maybe 100 at most will be updated/uploaded per month. It would be very inefficient to iterate through every file and compare dates like that. There is also a 29-second time limit for AWS API Gateway.
Any help would be greatly appreciated. Thanks!
You will need to iterate through the list of objects (sample code: List s3 buckets with its size in csv format) and compare the date within the Python code (sample code: Get day old filepaths from s3 bucket).
There is no filter when listing objects (aside from Prefix).
An alternative is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket. You could parse that CSV instead of listing the objects.
A more extreme option is to keep a separate database of objects, which would need to be updated whenever objects are added/deleted. This could be done via Amazon S3 Events that trigger an AWS Lambda function. Lots of work, though.
I can't give you a 100% answer, since you asked for the upload date, but if you can live with the 'last modified' value, this code snippet should do the job:
import boto3
import datetime

paginator = boto3.resource('s3').meta.client.get_paginator('list_objects')
date = datetime.datetime.now() - datetime.timedelta(30)

# search() yields the matching Contents entries across all result pages
filtered_files = (
    obj['Key']
    for obj in paginator.paginate(Bucket="bucketname").search(
        f"Contents[?to_string(LastModified)>='\"{date}\"']")
)
For filtering I used JMESPath.
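For completeness, filtered_files is a lazy generator, so you consume it like any other iterator:
for key in filtered_files:
    print(key)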
From the architect's perspective
The bottleneck is whether you can iterate over all objects within 30 seconds. If there are simply too many files, there are a few more options you can use:
Create an AWS Lambda function triggered by the S3:PutObject event, and store the S3 key and last_modified_at information in DynamoDB (an AWS key-value NoSQL database). Then you can easily query DynamoDB to filter the S3 keys and retrieve those S3 objects accordingly (a minimal sketch follows below).
Create an AWS Lambda function triggered by the S3:PutObject event, and move the file to a partitioned S3 key schema such as s3://bucket/datalake/year=${year}/month=${month}/day=${day}/your-file.csv. Then you can easily use the partition information to locate the subset of your objects, which fits within the 30-second hard limit.
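A minimal sketch of the first option, assuming a DynamoDB table (here called s3_object_index, a placeholder name) keyed on the object key:
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3_object_index')  # placeholder table name

def handler(event, context):
    # Record each new object's key and event timestamp so later queries can skip listing S3
    for record in event['Records']:
        key = record['s3']['object']['key']
        last_modified_at = record['eventTime']  # ISO 8601 timestamp from the S3 event
        table.put_item(Item={'key': key, 'last_modified_at': last_modified_at})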
From the programming perspective
Here's a code snippet that solves your problem using the s3pathlib library:
from datetime import datetime, timedelta
from s3pathlib import S3Path

# define a folder
p_dir = S3Path("bucket/my-folder/")

# find the datetime one month ago
now = datetime.utcnow()
one_month_ago = now - timedelta(days=30)

# filter by last modified
for p in p_dir.iter_objects().filter(
    # any Filterable Attribute can be used for filtering
    S3Path.last_modified_at >= one_month_ago
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the console and inspect
If you want to use other S3Path attributes for filtering, other comparators, or even define your own custom filter, you can follow this document.

how to create file in aws S3 using python boto3

I want to create an empty file in AWS S3 using Python.
I'm using boto3 and Python.
I want to know: apart from the put method, is there any other way to create files in S3?
Assuming that you genuinely want a zero-byte file, you can do it as follows:
import boto3

s3 = boto3.client('s3')
s3.put_object(
    Bucket='mybucket',
    Key='myemptyfile'
)
Note the lack of a Body parameter, resulting in an empty file.
You can also use the upload_file() method (note that it uploads an existing local file, so for an empty object the file on disk would have to be empty):
s3_resource.Bucket(bucket_name).upload_file(Filename="file_name", Key="key")
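If you prefer not to rely on a local file at all, a small variation (a sketch, with the bucket and key as placeholders) is to upload an empty in-memory buffer with upload_fileobj():
import io
import boto3

s3_resource = boto3.resource('s3')

# Upload a zero-byte in-memory buffer, producing an empty object
s3_resource.Bucket('mybucket').upload_fileobj(io.BytesIO(b''), Key='myemptyfile')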

AWS Lambda Nodejs: Get all objects created in the last 24 hours from an S3 bucket

I have a requirement whereby I need to convert all the JSON files in my bucket into one newline-delimited JSON file for a 3rd party to consume. However, I need to make sure that each newly created newline-delimited JSON only includes files that were received in the last 24 hours, in order to avoid picking up the same files over and over again. Can this be done inside the s3.getObject(getParams, function(err, data)) callback? Any advice regarding a different approach is appreciated.
Thank you
You could try the S3 ListObjects operation and filter the result by the LastModified metadata field. For new objects, LastModified tells you when the file was created; for objects that have been overwritten, it tells you when they were last modified.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
There is a more complicated approach, using Amazon Athena with AWS Glue, but it requires modifying your S3 object keys so they are split into partitions, where the partition key is the date.
For example:
s3://bucket/reports/date=2019-08-28/report1.json
s3://bucket/reports/date=2019-08-28/report2.json
s3://bucket/reports/date=2019-08-28/report3.json
s3://bucket/reports/date=2019-08-29/report1.json
This approach can be implemented in two ways, depending on your file schema. If all your JSON files have the same format/properties/schema, you can create a Glue table, add the root reports path as its source, add the date partition values (e.g. 2019-08-28) and query the data with Amazon Athena using a regular SELECT * FROM reports WHERE date='2019-08-28'. If not, create a Glue crawler with a JSON classifier, which will populate your tables, and then use the same Athena queries to export the data to a combined JSON file.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html

How to list S3 objects in parallel in PySpark using flatMap()?

I have a dataframe where each row contains a prefix that points to a location in S3. I want to use flatMap() to iterate over each row, list the S3 objects in each prefix and return a new dataframe that contains a row per file that was listed in S3.
I've got this code:
import boto3
from pyspark.sql import Row

s3 = boto3.resource('s3')

def flatmap_list_s3_files(row):
    bucket = s3.Bucket(row.bucket)
    s3_files = []
    for obj in bucket.objects.filter(Prefix=row.prefix):
        s3_files.append(obj.key)
    rows = []
    for f in s3_files:
        row_dict = row.asDict()
        row_dict['s3_obj'] = f
        rows.append(Row(**row_dict))
    return rows

df = <code that loads the dataframe>
df.rdd.flatMap(lambda x: flatmap_list_s3_files(x)).toDF()
The only problem is that the s3 object isn't pickleable I guess? So I'm getting this error and I'm not sure what to try next:
PicklingError: Cannot pickle files that are not opened for reading
I'm a spark noob so I'm hoping there's some other API or some way to parallelize the listing of files in S3 and join that together with the original dataframe. To be clear, I'm not trying to READ any of the data in the S3 files themselves, I'm building a table that is essentially a metadata catalogue of all the files in S3. Any tips would be greatly appreciated.
You can't send an S3 client around your Spark cluster; you need to share the information needed to create one and instantiate it at the far end. I don't know about the Python API, but in the Java APIs you'd just pass the path around as a string, convert it to a Path object, call Path.getFileSystem() and work on that. The Spark workers will cache the FileSystem instances for fast reuse.
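In PySpark terms, a minimal sketch of that idea is to create the boto3 resource inside the function that runs on the executors, so nothing unpicklable is captured in the closure (reusing the df from the question):
import boto3
from pyspark.sql import Row

def flatmap_list_s3_files(row):
    # Create the client on the executor, not on the driver, so the closure stays picklable
    s3 = boto3.resource('s3')
    rows = []
    for obj in s3.Bucket(row.bucket).objects.filter(Prefix=row.prefix):
        row_dict = row.asDict()
        row_dict['s3_obj'] = obj.key
        rows.append(Row(**row_dict))
    return rows

files_df = df.rdd.flatMap(flatmap_list_s3_files).toDF()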
