Download file from S3 subfolder by dynamically setting the key - node.js

In my S3 bucket, I have a folder structure of rootFolder1/year/month/day/myfile.csv. Depending on the time, rootFolder1 could have multiple subfolders, organised by year, month and day, with each day folder containing multiple CSV files. There could also be a rootFolder2 in my bucket with the same folder structure.
I know that for downloading an object from S3 I can use the getObject method, passing in a params object with the Bucket and Key like so:
var params = {
  Key: "book/2020/01/01/all_books_000.csv",
  Bucket: 'my-bucket-test'
}
s3.getObject(params, (err, data) => {
  if (err) throw err;
  // do something with data
})
But in my case I don't want to hardcode the Key. Is there an elegant way to call getObject with a dynamically set Key, so it downloads the files under each date folder?
Edit: for further clarity, these are the keys I get back when I call s3.listObjects:
'author/',
'author/2020/',
'author/2020/01/',
'author/2020/01/01/',
'author/2020/01/01/all_authors_000.csv',
'book/',
'book/2020/',
'book/2020/01/',
'book/2020/01/01/',
'book/2020/01/01/all_books_000.csv',
'book/2020/01/01/all_books_001.csv'
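Given key listings like the one above, one common approach (an assumption on my part, not something stated in the post) is to list the keys under a date prefix with listObjectsV2 and then call getObject for each CSV key found. A minimal sketch, reusing the s3 client and bucket name from the question:

// Sketch only: the prefix is a hypothetical example built from rootFolder/year/month/day.
const prefix = 'book/2020/01/01/';
s3.listObjectsV2({ Bucket: 'my-bucket-test', Prefix: prefix }, (err, listing) => {
  if (err) throw err;
  listing.Contents
    .filter((obj) => obj.Key.endsWith('.csv')) // skip the zero-byte "folder" placeholder keys
    .forEach((obj) => {
      s3.getObject({ Bucket: 'my-bucket-test', Key: obj.Key }, (err, data) => {
        if (err) throw err;
        // do something with data.Body for this file
      });
    });
});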

Related

How to upload only the file (mp4/mp3/pdf), given the path to the file, to a cloud storage like S3 or Spaces (DigitalOcean)

Trying to read a video file from a relative path with fs.readFileSync(path) and upload it to an S3 bucket (in my case DigitalOcean Spaces), but the whole folder tree is being uploaded to S3 instead of just the file.
const file = fs.readFileSync('./downloads/this.movie_details.title/Understanding Network Hacks Attack And Defense With Python/Understanding Network Hacks Attack And Defense With Python.pdf');
const filename = video_path;
var params = {
  Body: file,
  Bucket: 'ocean-bucket21',
  Key: filename,
};
s3.putObject(params, function (err, data) { });
The result of the above code is that the whole folder structure is recreated in the cloud instead of only the one file, i.e. the PDF.
I think you are saying that you'd like to upload the file to the top-level (root) of the bucket, rather than putting it inside the folders.
If so, then you should modify the Key value in the params object. The Key includes the full path of the object; if the Key contains slashes, they will be interpreted as directory names.
If you want it at the top-level, edit the value of Key so it just contains your desired filename.
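As a concrete illustration of that suggestion, here is a minimal sketch assuming Node's built-in path module and the same variables used in the question (video_path is the question's own placeholder):

const path = require('path');

// Keep only the final file name, dropping the directory components,
// so the object is stored at the top level of the bucket.
const filename = path.basename(video_path);

s3.putObject({
  Body: file,
  Bucket: 'ocean-bucket21',
  Key: filename,
}, function (err, data) {
  if (err) console.log(err);
});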

AWS S3 nodejs - How to get an object by its prefix

I'm trying to find out how to check whether an object exists in my AWS S3 bucket in Node.js without listing all my objects (~1500) and checking each one's prefix, but I cannot find how.
The format is like this:
<prefix I want to search>.<random string>/
Ex:
tutturuuu.dhbsfd7z63hd7833u/
Because you don't know the entire object Key, you will need to perform a list filtered by prefix. The AWS Node.js SDK provides such a method. Here is an example:
s3.listObjectsV2({
  Bucket: 'yourBucket',
  MaxKeys: 1,
  Prefix: 'tutturuuu.'
}, function (err, data) {
  if (err) throw err;
  const objectExists = data.Contents.length > 0;
  console.log(objectExists);
});
Note that it is important to use MaxKeys in order to reduce network usage. If more than one object has the prefix, then you will need to return everything and decide which one you need.
This API call will return metadata only. After you have the full key you can use getObject to retrieve the object contents.
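A short sketch of that two-step flow, reusing the placeholder bucket and prefix from the example above:

s3.listObjectsV2({
  Bucket: 'yourBucket',
  MaxKeys: 1,
  Prefix: 'tutturuuu.'
}, function (err, data) {
  if (err) throw err;
  if (data.Contents.length === 0) return console.log('No object with that prefix');

  // Use the full key returned by the listing to fetch the object itself.
  s3.getObject({ Bucket: 'yourBucket', Key: data.Contents[0].Key }, function (err, obj) {
    if (err) throw err;
    console.log(obj.Body.toString());
  });
});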

AWS transcribe: How to give a path to output folder

I am using AWS Transcribe to get the text of a video using Node.js. I can specify a particular destination bucket in params, but not a particular folder. Can anyone help me with this? This is my code:
var params = {
  LanguageCode: "en-US",
  Media: { /* required */
    MediaFileUri: "s3://bucket-name/public/events/545/videoplayback1.mp4"
  },
  TranscriptionJobName: 'STRING_VALUE', /* required */
  MediaFormat: "mp4", // mp3 | mp4 | wav | flac
  OutputBucketName: 'test-rekognition',
};
transcribeservice.startTranscriptionJob(params, function (err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data);               // successful response
});
I have specified the destination bucket name in the OutputBucketName field, but how do I specify a particular folder?
I would recommend creating a dedicated S3 bucket for the Transcribe output and adding a trigger on 'Object create (All)' with a Lambda function that responds to it. Essentially, as soon as a new object is added to your S3 bucket, a Lambda function is invoked to move/process that output by placing it under a specific 'folder' of your choice.
This doesn't solve the API issue, but I hope it serves as a good workaround - you could look at this article (https://linuxacademy.com/hands-on-lab/0e291fc6-52a4-4ed3-ad65-8cf2fd84e0df/) as a guide.
Have a good one.
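A minimal sketch of such a trigger handler (the bucket wiring, the transcripts/ prefix, and the SDK v2 client are all assumptions, not from the answer):

const aws = require('aws-sdk');
const s3 = new aws.S3();

// Invoked on 'Object create (All)' for the bucket that Transcribe writes to.
exports.handler = async (event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

  // Avoid re-processing objects this function has already moved.
  if (key.startsWith('transcripts/')) return;

  // Copy the transcript under the desired prefix ("folder"), then remove the original.
  const targetKey = `transcripts/${key}`;
  await s3.copyObject({ Bucket: bucket, CopySource: `${bucket}/${key}`, Key: targetKey }).promise();
  await s3.deleteObject({ Bucket: bucket, Key: key }).promise();

  return { moved: targetKey };
};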
Very late to the party, but I just had the same issue.
It appears that Amazon added a new parameter called OutputKey that allows you to save your data in a specific folder of your bucket:
You can use output keys to specify the Amazon S3 prefix and file name of the transcription output. For example, specifying the Amazon S3 prefix, "folder1/folder2/", as an output key would lead to the output being stored as "folder1/folder2/your-transcription-job-name.json". If you specify "my-other-job-name.json" as the output key, the object key is changed to "my-other-job-name.json". You can use an output key to change both the prefix and the file name, for example "folder/my-other-job-name.json".
Just make sure to end the OutputKey with a '/' (e.g. 'folder/'); otherwise it will be interpreted as a file name rather than the folder in which to store the output.
Hope it'll be useful to someone.
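Applied to the question's params, a hedged sketch (the OutputKey value is just a placeholder folder):

var params = {
  LanguageCode: "en-US",
  Media: {
    MediaFileUri: "s3://bucket-name/public/events/545/videoplayback1.mp4"
  },
  TranscriptionJobName: 'STRING_VALUE', /* required */
  MediaFormat: "mp4",
  OutputBucketName: 'test-rekognition',
  // Trailing slash keeps this as a prefix; the job name becomes the file name.
  OutputKey: 'transcripts/545/',
};
transcribeservice.startTranscriptionJob(params, function (err, data) {
  if (err) console.log(err, err.stack);
  else console.log(data);
});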

Reading tab-separated files in gzip archive on S3 using Lambda (NodeJS)

I have the following use case to solve. I need to ingest data from a S3 bucket using a Lambda function (NodeJS 12). The Lambda function will be triggered when a new file is created. The file is a gz archive and can contain multiple TSV (tab-separated) files. For each row an API call will be triggered from the Lambda function. Questions:
1 - Does it have to be a two-step process: uncompress the archive into the /tmp folder and then read the TSV files? Or can you stream the content of the archive directly?
2 - Do you have a snippet of code you could share that shows how to stream a GZ file from an S3 bucket and read its content (TSV)? I've found a few examples, but only for pure Node.js, not for Lambda/S3.
Thanks a lot for your help.
Adding a snippet of code from my first test; it doesn't work and no data is logged to the console:
const csv = require('csv-parser')
const aws = require('aws-sdk');
const s3 = new aws.S3();

exports.handler = async (event, context, callback) => {
  const bucket = event.Records[0].s3.bucket.name;
  const objectKey = event.Records[0].s3.object.key;
  const params = { Bucket: bucket, Key: objectKey };
  var results = [];

  console.log("My File: " + objectKey + "\n")
  console.log("My Bucket: " + bucket + "\n")

  var otherOptions = {
    columns: true,
    auto_parse: true,
    escape: '\\',
    trim: true,
  };

  s3.getObject(params).createReadStream()
    .pipe(csv({ separator: '|' }))
    .on('data', (data) => results.push(data))
    .on('end', () => {
      console.log("My data: " + results);
    });

  return await results
};
You may want to take a look at the wiki:
Although its file format also allows for multiple [compressed files / data streams] to be concatenated (gzipped files are simply decompressed concatenated as if they were originally one file[5]), gzip is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive (also called tarball), and then compressing that archive with gzip. The final compressed file usually has the extension .tar.gz or .tgz.
What this means is that by itself, gzip (or a Node package to use it) is not powerful enough to decompress a single .gz file into multiple files. I hope that if a single .gz item in S3 contains more than one file, it's actually a .tar.gz or similar compressed collection. To deal with these, check out
Simplest way to download and unzip files in NodeJS
You may also be interested in node-tar.
In terms of getting just one file out of the archive at a time, this depends on what the compressed collection actually is. Some compression schemes allow extracting just one file at a time, others don't (they require you to decompress the whole thing in one go). Tar does the former.
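If the object turns out to be a plain .gz wrapping a single TSV stream (rather than a tarball), you can stream it straight through Node's zlib without touching /tmp. A minimal sketch, assuming the AWS SDK v2 and the csv-parser package that the question already uses:

const aws = require('aws-sdk');
const zlib = require('zlib');
const csv = require('csv-parser');
const s3 = new aws.S3();

exports.handler = async (event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

  const rows = [];
  await new Promise((resolve, reject) => {
    s3.getObject({ Bucket: bucket, Key: key })
      .createReadStream()
      .pipe(zlib.createGunzip())        // decompress on the fly
      .pipe(csv({ separator: '\t' }))   // parse tab-separated rows
      .on('data', (row) => rows.push(row))
      .on('end', resolve)
      .on('error', reject);
  });

  console.log(`Parsed ${rows.length} rows from ${key}`);
  return rows; // or fire the per-row API calls here
};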
The first step should be to decompress the .tar.gz file, using the decompress package:
// TypeScript code for decompressing a .tar.gz file
const decompress = require("decompress");

try {
  // getObject returns a Request in SDK v2; .promise() resolves to the response with Body
  const targzSrc = await aws.s3
    .getObject({
      Bucket: BUCKET_NAME,
      Key: fileRequest.key
    })
    .promise();

  const filesPromise = decompress(targzSrc.Body);
  const outputFileAsString = await filesPromise.then((files: any) => {
    console.log("inflated file:", files["0"].data.toString("utf-8"));
    return files["0"].data.toString("utf-8");
  });

  console.log("And here goes the file content:", outputFileAsString);
  // here should be the code that parses the CSV content using outputFileAsString
} catch (err) {
  console.log("G_ERR:", err);
}

Dynamically created s3 folders are not showing in listObjects

I'm using a signed URL to upload a file from my React application to an S3 bucket. I specify the path as part of my Key and the folders are getting created properly:
let params = {
  Bucket: vars.aws.bucket,
  Key: `${req.body.path}/${req.body.fileName}`,
  Expires: 5000,
  ACL: 'public-read-write',
  ContentType: req.body.fileType,
};
s3.getSignedUrl('putObject', params, (err, data) => {...
However, when I use s3.listObjects, the folders created this way are not returned. Here is my Node API code:
const getFiles = (req, res) => {
  let params = {
    s3Params: {
      Bucket: vars.aws.bucket,
      Delimiter: '',
      Prefix: req.body.path
    }
  }
  s3.listObjects(params.s3Params, function (err, data) {
    if (err) {
      res.status(401).json(err);
    } else {
      res.status(200).json(data);
    }
  });
}
The folders that are getting created through the portal are showing in the returned object properly. Is there any attribute I need to set as part of generating the signed URL to make the folder recognized as an object?
I specify the path as part of my Key and the folders are getting created properly
Actually, they aren't.
Rule 1: The console displays a folder icon for the folder foo because one or more objects exist in the bucket with the prefix foo/.
The console appears to allow you to create "folders," but that isn't what's happening when you do that. If you create a folder named foo in the console, what actually happens is that an ordinary object, zero bytes in length, with the name foo/ is created. Because this now means there is at least one object that exists in the bucket with the prefix foo/, a folder is displayed in the console (see Rule 1).
But that folder is not really a folder. It's just a feature of the console interacting with another feature of the console. You can actually delete the foo/ object using the API/SDK and nothing happens, because the console still shows that folder as long as there remains at least one object in the bucket with the prefix foo/. (Deleting the folder in the console sends delete requests for all objects with that prefix. Deleting the dummy object via the API does not.)
In short, the behavior you are observing is normal.
If you set the delimiter to /, then the listObjects response will include CommonPrefixes -- and this is where you should be looking if you want to see "folders." Objects ending with / are just the dummy objects the console creates. CommonPrefixes does not depend on these.
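A minimal sketch of that listing, reusing the question's placeholder bucket and prefix:

s3.listObjects({
  Bucket: vars.aws.bucket,
  Prefix: req.body.path,
  Delimiter: '/' // group keys by the next '/' after the prefix
}, function (err, data) {
  if (err) throw err;
  // "Folders" appear here whether or not a zero-byte placeholder object exists.
  const folders = data.CommonPrefixes.map((p) => p.Prefix);
  // Objects stored directly under the prefix:
  const files = data.Contents.map((o) => o.Key);
  console.log({ folders, files });
});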
