How does `aws s3 sync` determine if a file has been updated? - node.js

When I run the command in the terminal back to back, it doesn't sync the second time. Which is great! It shouldn't. But, if I run my build process and run aws s3 sync programmatically, back to back, it syncs all the files both times, as if my build process is changing something differently the second time.
Can't figure out what might be happening. Any ideas?
My build process is basically pug source/ --out static-site/ and stylus -c styles/ --out static-site/styles/

According to this - http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
S3 sync compares the size of the file and the last modified timestamp to see if a file needs to be synced.
In your case, I'd suspect the build system is resulting in a newer timestamp even though the file size hasn't changed?

AWS CLI sync:
A local file will require uploading if the size of the local file is
different than the size of the s3 object, the last modified time of
the local file is newer than the last modified time of the s3 object,
or the local file does not exist under the specified bucket and
prefix.
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
You want the --size-only option, which compares only the file size, not the last modified date. This is perfect for an asset build system that changes the last modified date frequently but not the actual contents of the files (I ran into this with webpack builds, where things like fonts kept syncing even though the file contents were identical). If your build doesn't incorporate a hash of the contents into the filename, it is possible to run into problems (a build that emits a same-sized file with different contents would be skipped), so watch out for that.
I did manually test adding a new file that wasn't on the remote bucket and it is indeed added to the remote bucket with --size-only.
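If the sync is driven from a Node build script, as in the question, the flag just needs to be added to the spawned command. A minimal sketch (the local directory and bucket name are placeholders):
const { execFile } = require('child_process');

// Sync by size only so that rebuilt files with identical contents but a
// fresher mtime are not re-uploaded. Paths and bucket are placeholders.
execFile(
  'aws',
  ['s3', 'sync', 'static-site/', 's3://my-bucket/', '--size-only'],
  (error, stdout, stderr) => {
    if (error) throw error;
    console.log(stdout);
  }
);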

This thread is a bit dated, but I'll contribute nonetheless for folks arriving here via Google.
I agree with the accepted answer. To add context: aws s3 sync behaves differently from a standard Linux rsync in a number of ways. With rsync, an MD5 hash can be computed to determine whether a file has changed. S3 sync does not do this, so it can only decide based on size and/or timestamp. What's worse, AWS does not preserve timestamps when transferring in either direction, so the timestamp is ignored when syncing down to local and only used when syncing up to S3.
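To illustrate why a content comparison has to be done client-side, here is a rough sketch using the AWS SDK for JavaScript v3. Note the caveat: an object's ETag only equals the MD5 of its contents for single-part, non-KMS uploads, so this is not a general-purpose check.
const crypto = require('crypto');
const fs = require('fs');
const { S3Client, HeadObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

// Compare a local file's MD5 against the S3 object's ETag.
// Only valid when the object was uploaded in a single part without SSE-KMS.
async function differsFromS3(localPath, bucket, key) {
  const localMd5 = await new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(localPath)
      .on('data', (chunk) => hash.update(chunk))
      .on('error', reject)
      .on('end', () => resolve(hash.digest('hex')));
  });
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  return head.ETag.replace(/"/g, '') !== localMd5;
}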

Related

How to work with local tmp files in a Node.js API?

I am currently making a little API which returns JSON according to an input. This API needs to run some local programs on the server and also needs to place some temporary files. It all works if I call the API one request at a time, with just one "user".
The problem is that I only have one temp folder to store the temporary files, so whenever there are multiple API queries, the tmp folder gets screwed up - the data in there mixes up.
What would be a good way to have an API that uses temp files - and still keep it working when every run needs its own temp data?
The current process is:
server/api/getGeometry?lat=52.5167776&lon=13.4092091&bboxSize=2000&output=glb
server queries zips according to lat/lon
unpacks
converts
does voodoo
generates json
sends back the json
cleans up the tmp folder
next run same story...
So every run uses the same tmp folder, which I guess is not the way to go.
Thanks a lot for ideas!
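Not an authoritative answer, but a common pattern is to give every request its own scratch directory with fs.mkdtemp and remove it when the response has been sent. A minimal sketch with Express; buildGeometry is a hypothetical stand-in for the unzip/convert/voodoo pipeline:
const fs = require('fs/promises');
const os = require('os');
const path = require('path');
const express = require('express');

const app = express();

app.get('/api/getGeometry', async (req, res) => {
  // Each request gets its own unique temp directory, e.g. /tmp/geometry-Ab12Cd
  const workDir = await fs.mkdtemp(path.join(os.tmpdir(), 'geometry-'));
  try {
    // Hypothetical helper: queries the zips, unpacks, converts, does voodoo,
    // writing all intermediate files inside workDir only.
    const json = await buildGeometry(req.query, workDir);
    res.json(json);
  } catch (err) {
    res.status(500).json({ error: err.message });
  } finally {
    // Clean up this request's directory without touching other requests' data
    await fs.rm(workDir, { recursive: true, force: true });
  }
});

app.listen(3000);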

Generate hash of newly downloaded file

I'd like my bash script to perform an action every time a new file is downloaded to /Downloads (generate a hash of the downloaded file and send it to an API). So far I've been trying to make use of "inotify-tools", but it only works for newly created files and that won't do.
Script should work like this:
I download a file via browser (normal way)
Script notices new file and is executed automatically
Thanks in advance for help :D
You can use /etc/crontab to check the ~/Downloads folder at startup and every n minutes. A script that runs every nth minute can do either of the following:
Keep a count of the files. If the count decreases, the script updates its cache. If the count increases, it takes the most recently created (or modified) file and sends that file's hash to the API via curl.
Keep the file names. If a file no longer exists, the script updates its cache of file names. If a new file appears, it hashes it and sends the hash to the API via curl.
You can keep the cache under /tmp.
If you can provide an example scenario I can write a simple script.
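Since this page is tagged node.js: a sketch of the same idea in Node instead of a cron/bash script, using chokidar to watch the folder and crypto to hash the file. The endpoint URL is a placeholder, global fetch assumes Node 18+, and awaitWriteFinish is there so a file is not hashed before the browser finishes writing it.
const chokidar = require('chokidar');
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

const downloads = path.join(os.homedir(), 'Downloads');

const watcher = chokidar.watch(downloads, {
  ignoreInitial: true, // only react to files added after startup
  awaitWriteFinish: { stabilityThreshold: 2000, pollInterval: 500 },
});

watcher.on('add', (file) => {
  const hash = crypto.createHash('sha256');
  fs.createReadStream(file)
    .on('data', (chunk) => hash.update(chunk))
    .on('end', async () => {
      // Placeholder endpoint: replace with the real API
      await fetch('https://example.com/api/hashes', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ file: path.basename(file), sha256: hash.digest('hex') }),
      });
    });
});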

Heroku cannot store files temporarily

I am writing a Node.js app which works with fonts. One action it performs is downloading a .ttf font from the web, converting it to a base64 string, deleting the .ttf, and using that string for other things. I need the .ttf file stored somewhere so that I can convert it. This process takes about 1-2 seconds. I know Heroku has an ephemeral file system, but I only need to store the file for a very short time. Is there any way I can store my files? Using fs.writeFile currently returns this error:
Error: EROFS: read-only file system, open '/app\test.txt']
I had an idea: how about you make an action that gets the font, converts it, and stores it in a global variable before it is used by another task.
When you want to use it again, make sure you check whether that global variable is already filled with the font buffer.
Reference
Singleton
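A rough sketch of that singleton idea (the module and URL handling are illustrative, and global fetch assumes Node 18+): a module-level Map keeps the converted base64 string per font URL, so the .ttf never needs to touch the filesystem and the conversion only runs once per process.
// fontCache.js - the module-level Map acts as a singleton per Node process
const cache = new Map();

async function getFontBase64(url) {
  if (cache.has(url)) return cache.get(url); // reuse the already converted font

  const res = await fetch(url);                      // download the .ttf
  const buffer = Buffer.from(await res.arrayBuffer());
  const base64 = buffer.toString('base64');          // convert in memory, no temp file

  cache.set(url, base64);
  return base64;
}

module.exports = { getFontBase64 };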
I didn't know that you could store stuff in the /tmp directory. It is working for the moment, but according to the dyno/ephemeral filesystem documentation it gets cleaned frequently, so I don't know whether it may cause other problems in the long run.

What's the best practice for watching for finished video encodes?

TL;DR
I'm unsure of the best way to use Chokidar to recognise when encoding videos have finished. Given the different ways encoders build their video files, what's the best way to accommodate all of them?
Context
I've developed a workflow for our office that allows us to quickly queue encode jobs in Adobe Premiere Pro. To queue them locally, I made use of Premiere's CEP API. I can easily send a job to Adobe Media Encoder (on the same machine) and it will automatically encode the video file to the relative project directory. This works great.
To queue encode jobs onto LAN workstations, I've taken a different approach, as the CEP API doesn't allow for any extensibility beyond the local machine. Instead I made use of Adobe Media Encoder's watch folders to detect added Premiere project files to a subfolder on our NAS (everything is on the NAS). This works great too.
Unfortunately, I'm unaware of a way for the queued encodes to be output to the relative project directory in the same way queuing locally does. I'm trying to find a way to do this by watching a common directory and moving finished files.
Since each video filename I'm queuing has this structure:
"projectName_sequenceName_givenName_renderType.mp4/.mxf", I've been able to move the files with this information easily. However, I'm struggling to accommodate the different methods different encoding processes use. Different encoders - x264, MainConcept H264, etc. - write to disk differently.
Using Chokidar, I watched how different encoders build their files:
Example #1:
If I start a DNxHR MXF encode, it will first create the final .MXF container and then fill it. When it finishes, it writes the sidecar .XMP file. If the encode fails or is cancelled, the sidecar file is not written.
Example #2:
If I start a TMPG x264 encode, it will first create the final .mp4 container, then create a temporary file: '.mp4_00_' appended. It will then write some initial metadata to the final container, start encoding to the '.mp4_00_' and depending on file size, create additional temporary files, '_.mp4_01_', etc. Finally it writes some additional information to the container, then to the temporary files and then deletes the temporary files. If the encode fails or is cancelled, the files are deleted.
Example #3:
If I start a MainConcept H264 encode (Premiere's default), it will first create the audio temp file, in this case '.aac'. Then create another temp file '.mkv.md0'. Halfway through encoding, it will create the video container '.m4v', start encoding to that, create some more temporary files '.md7/md6', create the final container '.mp4', along with 'sbjo.tmp', copy the '.mkv' file and '.aac' into the '.mp4' container, add a '00' file, very quickly delete it and then finish writing the '.mp4' metadata. Some of this happens very quickly and Chokidar has not always picked it up. Unless the encoder is being inconsistent.
These are the three encode types I've observed, and they're the three we need and use. I suppose I could watch each of them differently, but my concern is that if we ever switched encoders, I'd have to rewrite the code to accommodate them. The watch folder feature in Adobe Media Encoder recognises when files have finished encoding before attempting to use them. I haven't tested every format, but a good deal of them. Would Media Encoder be accommodating each unique encoding process? Simply polling locked files? Or is there something I'm missing?
The code I have currently (below, with a retry sketch after it) works fine for DNxHR MXFs, provided they don't fail or get cancelled. It struggles with the H264/x264 examples: since the container is created and then left untouched while the encoder writes to the temporary files, Chokidar registers 'add', and because the file is still locked the move fails. Obviously this works fine when simply copying or moving a finished video file.
const watcher = chokidar.watch(['Z:/NETWORKRENDER/Finished/*.{mp4,mxf}'], {
  persistent: true,
  // On start, works on existing files
  ignoreInitial: false,
  followSymlinks: true,
  interval: 1000,
  awaitWriteFinish: {
    stabilityThreshold: 5000,
    pollInterval: 20000
  },
});
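Not from the thread, but one possible workaround for the locked-container case: treat a failed rename as "still encoding" and retry later. The EBUSY/EPERM codes are an assumption (they vary by platform and encoder), and the rename also assumes source and destination are on the same volume.
const fs = require('fs/promises');

// Try to move a finished render; if the encoder still holds the file open,
// the rename fails and we retry after a delay instead of giving up.
async function moveWhenReleased(src, dest, { retries = 30, delayMs = 10000 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      await fs.rename(src, dest);
      return true;
    } catch (err) {
      if (err.code !== 'EBUSY' && err.code !== 'EPERM') throw err; // real error
      await new Promise((resolve) => setTimeout(resolve, delayMs)); // still locked, wait
    }
  }
  return false; // still locked after all retries; leave it for the next pass
}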

Junk Spark output file on S3 with dollar signs

I have a simple Spark job that reads a file from S3, takes five records, and writes them back to S3.
What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$.
What is it? How I can prevent spark from creating it?
Here is some code to show what I am doing...
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have an S3 "directory" called output, which contains the results, and another S3 object called output_$folder$, which I don't know the purpose of.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
Ok, it seems I found out what it is.
It is some kind of marker file, probably used for determining if the S3 directory object exists or not.
How did I reach this conclusion?
First, I found this link that shows the source of
org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I searched other source repositories to see whether I could find a different version of the method. I didn't.
In the end, I did an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file. The job failed, saying that the output directory already exists.
My conclusion: this is Hadoop's way of knowing whether a directory with the given name exists in S3, and I will have to live with that.
All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
s3n:// and s3a:// don't generate a marker like <output>_$folder$.
If you are using Hadoop with AWS EMR, I found moving from s3 to s3n straightforward, since they both use the same file system implementation, whereas s3a involves AWS credential-related code changes.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
