How to change output CSV file name of AWS Athena select query - aws-cli

I have an Athena select query whose result is saved to an S3 bucket location.
This works fine, but the output file name is a bunch of random characters.
I need to save the result under a specific file name:
the file name should be report.csv
The query is executed from a shell script.
aws athena start-query-execution \
  --query-string "select user_id,case file_type from <table name> group by file_type,user_id" \
  --work-group "primary" \
  --query-execution-context Database=<database name> \
  --result-configuration "OutputLocation=s3://<bucket name>/report.csv"
The current output file has a random name, as described above.
Is there any simple way to set the file name?

You cannot change the name of the results. However, you can make a copy of the file once the query has finished:
aws s3 cp s3://<output_bucket>/9411<…>.csv s3://<other_bucket>/report.csv
aws athena start-query-execution only starts the query; it doesn't wait for the query to finish. You can either poll the status of the query with aws athena get-query-execution, or wait for the result file to appear on S3 with aws s3api wait object-exists.
If you want a shell script that runs a query, waits for it to finish, and handles error cases, see https://gist.github.com/iconara/447a569d00a7a9c4aff86e9e0b14ff11
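If you'd rather do the same thing from Python with boto3, here is a minimal sketch that starts the query, polls until it finishes, and copies the result to report.csv. The bucket, database, and query below are placeholders, not values from the question.

import time
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

# Placeholders -- replace with your own values.
OUTPUT_BUCKET = "my-output-bucket"
QUERY = "SELECT user_id, file_type FROM my_table GROUP BY file_type, user_id"

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "my_database"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": f"s3://{OUTPUT_BUCKET}/athena-results/"},
)
execution_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    # Athena names the result file <query execution id>.csv under the output location.
    s3.copy_object(
        Bucket=OUTPUT_BUCKET,
        CopySource={"Bucket": OUTPUT_BUCKET, "Key": f"athena-results/{execution_id}.csv"},
        Key="report.csv",
    )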

Related

Deleting delta files data from s3 path file

I am writing files in "delta" format to AWS S3.
Due to some corrupt data I need to delete it. I am using enterprise Databricks, which can access the S3 path and has delete permission.
I am trying to delete using the script below:
val p = "s3a://bucket/path1/table_name"
import io.delta.tables._
import org.apache.spark.sql.functions._
val deltaTable = DeltaTable.forPath(spark, p)
deltaTable.delete("date > '2023-01-01'")
But it is not deleting the data in the S3 path that matches "date > '2023-01-01'".
I waited for an hour but still see the data, and I have run the script above multiple times.
So what is wrong here, and how do I fix it?
If you want to delete the data physically from S3, you can use dbutils.fs.rm("path").
If you just want to delete rows, run spark.sql("delete from table_name where cond"), or use the %sql magic command and run the DELETE statement.
You can also try the VACUUM command, but the default retention period is 7 days; if you want to delete data that is less than 7 days old, set SET spark.databricks.delta.retentionDurationCheck.enabled = false; and then execute VACUUM.
The DELETE operation does not remove data from storage; it only dereferences the files from the latest version of the Delta table. To delete the data physically from storage you have to run a VACUUM command:
Check: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
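Putting the two answers together, here is a minimal PySpark sketch (the path is the one from the question) that deletes the rows, relaxes the retention check, and then vacuums so the underlying files are actually removed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "s3a://bucket/path1/table_name"

# Logical delete: removes the rows from the latest table version only.
spark.sql(f"DELETE FROM delta.`{path}` WHERE date > '2023-01-01'")

# Allow a retention period shorter than the 7-day default (use with care).
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

# Physically remove files no longer referenced by the latest version.
# Note: this also removes the history needed for time travel.
spark.sql(f"VACUUM delta.`{path}` RETAIN 0 HOURS")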

How can I copy AWS s3 objects that are Last Modified on a particular date? I have a bucket with millions of objects

We store thousands of objects every day. I just want to copy the objects for a single day using the CLI or the AWS Node.js SDK.
I am trying this script, but it takes too much time:
#!/bin/bash
SOURCE_BUCKET="b1"
DESTINATION_BUCKET="b2"
content=$(aws s3api list-objects-v2 --bucket "$SOURCE_BUCKET" --query 'Contents[?contains(LastModified, `2022-11-22`)]' | jq -r ".[].Key")
for file in $content;
do
  aws s3api copy-object --copy-source "$SOURCE_BUCKET/$file" --key "$file" --bucket "$DESTINATION_BUCKET" | jq
done
A bucket with a large number of objects (100,000+) will be difficult to use with ListObjects() because each API call will only return 1000 objects. Thus, a bucket with 1 million objects would require 1000 API calls.
Some ways to make it faster:
- Only list a particular Prefix (folder) -- for example, store each day's files in a separate path
- Use Amazon S3 Inventory to provide a daily CSV file listing all objects, so you don't need to list the bucket yourself
- Trigger an AWS Lambda function when a new object is created and store the object details in a database, then query that database instead of using ListObjects()
- Trigger an AWS Lambda function when a new object is created and copy the object to a different folder organised by day (or perhaps one folder for 'latest' files that you can download and then delete)
Bottom line: Don't call ListObjects() on a bucket with millions of objects. Get creative.
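As a sketch of the first suggestion, assuming each day's objects already live under their own prefix (the bucket names come from the question, the prefix layout is an assumption), the copy then only needs to list that one prefix:

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "b1"
DESTINATION_BUCKET = "b2"
DAY_PREFIX = "2022-11-22/"  # assumed layout: one folder per day

# Paginate over just that day's prefix instead of the whole bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=DAY_PREFIX):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=DESTINATION_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
        )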

AWS Glue Spark "No such file or directory" but the file exists

I want to execute a very simple Spark script on AWS Glue as a Spark job,
but I encounter the following error:
An error occurred while calling o76.sql. No such file or directory 's3://bucketname/pathToFile/file.parquet'
I'm sure that the file is present at the specified path, but I don't understand why it cannot find the file.
Here is the code:
from pyspark.sql import SparkSession
from awsglue.context import GlueContext

spark_context = SparkSession.builder.getOrCreate().sparkContext
glue_context = GlueContext(spark_context)
spark = glue_context.spark_session

simple_query = "SELECT * FROM orion_staging.conforama_purchase LIMIT 10"
email_purchase = spark.sql(simple_query)

url = 'my valid url'
email_purchase.write.format("parquet").option("header", "true").mode("Overwrite").save(url)
print("DONE")
And the error:
An error occurred while calling o76.sql. No such file or directory 's3://bucketname/pathToFile/file.parquet'
It happens when reading, and the table I'm trying to read from is an Athena table.
I'm sure that Glue is aware of this table because it is displayed when I browse the Glue interface.
I already tried to:
- enable Hive support.
I would also like to try:
- --enable-glue-datacatalog, but I don't know how to do it.
The problem is related to the IAM role you have assigned to the AWS Glue job; make sure it has access to the S3 bucket.
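As a sketch of that suggestion (the role name, bucket name, and policy name below are hypothetical), you could attach an inline policy granting the Glue job's role access to the bucket:

import json
import boto3

iam = boto3.client("iam")

# Hypothetical names -- substitute the role your Glue job actually uses.
ROLE_NAME = "MyGlueJobRole"
BUCKET = "bucketname"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        }
    ],
}

iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="glue-s3-access",
    PolicyDocument=json.dumps(policy),
)

As for the --enable-glue-datacatalog option mentioned in the question, it is a Glue job parameter: adding the key --enable-glue-datacatalog (value true, per the Glue documentation) under the job's parameters makes Spark use the Glue Data Catalog as its Hive metastore.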

Lambda Function to convert csv to parquet from s3

I have a requirement:
1. Convert a parquet file present in S3 to CSV format and place it back in S3. The process should exclude the use of EMR.
2. The parquet file has more than 100 columns; I need to extract just 4 of them and create the CSV in S3.
Does anyone have a solution to this?
Note: cannot use EMR or AWS Glue.
Assuming you want to keep things easy within the AWS environment, and not using Spark (Glue / EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in s3://bucket/parquet/.
You can create a table in the Data Catalog (e.g. using Athena or a Glue Crawler) pointing to that parquet location, for example by running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
  col_1 string,
  ...
  col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket/parquet/';
Once you can query your parquet_table table, which reads the parquet files, you should be able to create the CSV files in the following way, again using Athena and selecting only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table;
After this, you can drop the temporary csv_table and just use the CSV files under s3://bucket/csv/, and go further, for example by having an S3-triggered Lambda function do something else with them.
Remember that all of this can be achieved from Lambda by interacting with Athena (example here). Also bear in mind that Athena has an ODBC connector and PyAthena for use from Python, among other options, so Lambda and the AWS Console are not the only ways to drive it if you want to automate this differently.
I hope this helps.
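As a sketch of the Lambda route just mentioned (the database name, output location, and handler name below are hypothetical), the CTAS statement could be submitted from a Lambda handler with boto3 like this:

import boto3

athena = boto3.client("athena")

CTAS = """
CREATE TABLE csv_table
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table
"""

def lambda_handler(event, context):
    # Submit the CTAS query; Athena writes the CSV output to the external_location.
    response = athena.start_query_execution(
        QueryString=CTAS,
        QueryExecutionContext={"Database": "my_database"},  # hypothetical database name
        ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
    )
    return response["QueryExecutionId"]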
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in pandas: I think the best way would be using Glue Python Shell, but you mentioned you didn't want to use Glue. So, if you decide to, here is a basic example of how to do it:
import sys
import pandas as pd
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])
## #params and #variables: [JOB_NAME]
## Variables used for now. Job input parameters to be used.
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']
## aws Job Settings
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)

for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    df = pd.read_csv(s3_file['Body'], sep=';')
    # partKey_variable, year_variable, month_variable, day_variable and
    # s3_file_name are placeholders to fill in for your own layout.
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)
    # Writing the new dataset to S3 (serialize the DataFrame to CSV first):
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(Body=df.to_csv(index=False))
Carlos.
It all depends upon your business requirement and what sort of action you want to take: an asynchronous or a synchronous call.
You can trigger a Lambda (GitHub example) asynchronously when a parquet file arrives in the specified bucket (AWS S3 docs).
You can also configure S3 to send a notification to SNS or SQS when an object is added to/removed from the bucket, which in turn can invoke a Lambda to process the file (Triggering a Notification).
You can run a Lambda asynchronously every 5 minutes by scheduling it with AWS CloudWatch Events; the finest resolution using a cron expression is one minute.
You can invoke a Lambda synchronously over HTTPS (REST API endpoint) using API Gateway.
It is also worth checking how big your parquet file is, as a Lambda can run for at most 15 minutes (900 seconds).
Worth checking this page as well: Using AWS Lambda with Other Services
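A sketch of the Lambda itself, assuming pandas, a parquet engine (e.g. pyarrow), and s3fs are packaged with the function (for example as a Lambda layer); the bucket, keys, and column names are hypothetical:

import pandas as pd

# Hypothetical locations and columns -- adjust to your data.
SOURCE = "s3://my-bucket/parquet/input.parquet"
TARGET = "s3://my-bucket/csv/output.csv"
COLUMNS = ["col_1", "col_2", "col_3", "col_4"]

def lambda_handler(event, context):
    # Read only the 4 columns of interest from the parquet file.
    df = pd.read_parquet(SOURCE, columns=COLUMNS)
    # Write the result back to S3 as CSV.
    df.to_csv(TARGET, index=False)
    return {"rows": len(df)}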
It is also worth taking a look at CTAS queries in Athena: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We can store the query results in a different format using CTAS.

AWS Lambda Nodejs: Get all objects created in the last 24hours from a S3 bucket

I have a requirement whereby I need to convert all the JSON files in my bucket into one newline-delimited JSON file for a third party to consume. However, I need to make sure that each newly created newline-delimited JSON file only includes files that were received in the last 24 hours, in order to avoid picking up the same files over and over again. Can this be done inside the s3.getObject(getParams, function(err, data)) callback? Any advice regarding a different approach is appreciated.
Thank you
You could try the S3 ListObjects operation and filter the result by the LastModified metadata field. For new objects, LastModified is the time the file was created; for overwritten objects, it is the time they were last modified.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
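A minimal sketch of that filtering in Python with boto3 (the question uses the JavaScript SDK, where listObjectsV2 returns the same LastModified field); the bucket name is a placeholder:

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

recent_keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket"):  # placeholder bucket
    for obj in page.get("Contents", []):
        # Keep only objects created/modified in the last 24 hours.
        if obj["LastModified"] >= cutoff:
            recent_keys.append(obj["Key"])

print(recent_keys)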
There is a more complicated approach, using Amazon Athena with AWS Glue, but this requires modifying your S3 object keys to split them into partitions, where the partition key is the date.
For example:
s3://bucket/reports/date=2019-08-28/report1.json
s3://bucket/reports/date=2019-08-28/report2.json
s3://bucket/reports/date=2019-08-28/report3.json
s3://bucket/reports/date=2019-08-29/report1.json
This approach can be implemented in two ways, depending on your file schema. If all your JSON files have the same format/properties/schema, you can create a Glue table, add the root reports path as its source, add the date partition value (2019-08-28), and query the data with Amazon Athena using a regular SELECT * FROM reports WHERE date='2019-08-28'. If not, create a Glue crawler with a JSON classifier, which will populate your tables, and then use Athena in the same way to query the data into a combined JSON file.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html
