Which way to go when Zipping files on a cloud bucket? - node.js

I have a pipeline process which streams data from the DB, maps it, converts the mapped data into XML, and finally creates the files on Google Cloud.
Think of it this way:
pipeline(
  [
    db_stream(),
    mapJsonStream(),
    convertToXmlStream(),
    checkIfNewFileIsNeeded(),
    // option 1. Add a zipping stream here before writing to the cloud
    writeToCloudStream(),
  ],
  callback // cleanup stuff - option 2. I can re-stream the files, zip them here and stream the archive back to the cloud bucket
)
This will create multiple XML files on the bucket (there is a limit on each XML file's size, so the data has to be split across several files).
Now I need all of these files to be zipped. I looked for a way to do that directly on the cloud using the Google Node.js client, but it doesn't seem to support that kind of functionality.
I am a bit confused about how to proceed because, after searching a bit, it looks like the zip libraries I found need to build the archive in memory before streaming it, which defeats the purpose of the pipeline I already have (option 1).
Option 2 seems a bit expensive and unnecessary.
If you have any thoughts on this I would really appreciate it. Thanks in advance!
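For what it's worth, here is a minimal sketch of what option 1 could look like if a streaming-capable zip library is used. It assumes the archiver package and the @google-cloud/storage client; the bucket name, object name, entry names and zipXmlStreamsToBucket are placeholders rather than anything from the pipeline above, and this is not a drop-in Transform for that pipeline, only an illustration that the zip output can be piped to the bucket as it is produced:

const { Storage } = require('@google-cloud/storage');
const archiver = require('archiver');

async function zipXmlStreamsToBucket(xmlStreams) {
  const storage = new Storage();
  const bucketWriteStream = storage
    .bucket('my-output-bucket') // placeholder bucket name
    .file('export.zip')         // placeholder object name
    .createWriteStream();

  // archiver emits the zip bytes as they are produced, so the archive is
  // piped straight into the bucket object instead of being built in memory.
  const archive = archiver('zip', { zlib: { level: 9 } });
  archive.pipe(bucketWriteStream);

  // Each XML stream becomes one entry in the archive.
  xmlStreams.forEach((xmlStream, i) => {
    archive.append(xmlStream, { name: `part-${i + 1}.xml` });
  });

  archive.finalize();

  // Resolve once GCS has acknowledged the upload (or fail on error).
  await new Promise((resolve, reject) => {
    bucketWriteStream.on('finish', resolve);
    bucketWriteStream.on('error', reject);
    archive.on('error', reject);
  });
}

Whether this fits depends on the file-splitting logic: the archive has to be told where one entry ends and the next begins, so it would sit where checkIfNewFileIsNeeded() and writeToCloudStream() are today rather than being a single extra stage.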

Related

API to get MIP label from a file residing on remote share

I need to read the MIP label (if it is there) from a file residing on a remote share such as an SMB/DFS or NFS share. One option is to download the file locally and then read the label using the MIP SDK, but considering the data files can be very big, I find this option very inefficient.
Is there a better option to read MIP labels from a very large file without downloading the complete file locally?
Thanks,
Bishnu
Unfortunately, there isn't. The SDK needs the entire file.

Quick way of uploading many images (3000+) to a google cloud server

I am working on object detection for a school project. To train my CNN model I am using a Google Cloud server, because I do not own a GPU strong enough to train it locally.
The training data consists of images (.jpg files) and annotations (.txt files) and is spread over around 20 folders. The folders come from different sources and I do not want to mix pictures from different sources, so I want to keep this directory structure.
My current issue is that I could not find a fast way of uploading them to my Google Cloud server.
My workaround was to upload those image folders as .zip files to Google Drive, download them on the cloud server and unzip them there. This process takes far too much time because I have to upload many folders, and Google Drive does not have a good API for downloading folders to Linux.
On my local computer, I am using Windows 10 and my cloud server runs Debian.
Therefore, I'd be really grateful if you know a fast and easy way to either upload my images directly to the server or at least to upload my zipped folders.
Couldn't you just create an infinite loop that looks for jpg files and scp/sftp them directly to your server once a file is there? On Windows, you can achieve this using WSL.
(Sorry, this may not be your final answer, but I don't have the reputation to ask you this question.)
I would upload them to a Google Cloud Storage bucket using gsutil with multithreading. This means that multiple files are copied at once, so the only limitation here is your internet speed. Gsutil installers for Windows and Linux are found here. Example command:
gsutil -m cp -r dir gs://my-bucket
Then on the VM you do exactly the opposite:
gsutil -m cp -r gs://my-bucket dir
This is super fast, and you only pay a small amount for the storage, which is super cheap and/or falls within the GCP free tier.
Note: make sure you have write permissions on the storage bucket and the default compute service account (i.e. the VM service account) has read permissions on the storage bucket.
The best stack for this use case is gsutil + a storage bucket.
Copy the zip files to a Cloud Storage bucket and set up a sync cron job to pull the files onto the VM.
Make use of gsutil:
https://cloud.google.com/storage/docs/gsutil
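For example, a crontab entry along these lines could keep a directory on the VM in sync with the bucket (the bucket name reuses the example above; the destination path and schedule are placeholders, and gsutil rsync only copies what has changed):

# every 10 minutes, mirror the bucket into a local training-data directory
*/10 * * * * gsutil -m rsync -r gs://my-bucket /home/user/training-data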

How do I load a file from Cloud Storage into memory

I have end users that are going to be uploading a csv file into a bucket which will then be loaded to BigQuery.
The issue is the content of the data is unreliable.
i.e. it contains free-text fields that may contain line feeds, extra commas, invalid date formats, etc.
I have a python script that will pre-process the file and write out a new one with all errors corrected.
I need to be able to automate this into the cloud.
I was thinking I could load the contents of the file (it's only small) into memory and process the records then write it back out to the Bucket.
I do not want to process the file locally.
Despite extensive searching I can't find how to load a file in a bucket into memory and then write it back out again.
Can anyone help?
I believe what you're looking for is Google Cloud Functions. You can set a Cloud Function to be triggered by an upload to the GCS bucket and use your Python code in the same Cloud Function to process the .csv and upload it to BigQuery. However, please bear in mind that Python 3.7.1 support for Cloud Functions is currently in a beta state of development.
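As a rough sketch of that flow: the answer assumes your Python code runs inside the function, but the snippet below uses the Node.js runtime and the @google-cloud/storage client purely to illustrate the read-into-memory / write-back mechanics. fixCsv, the function name and the clean/ prefix are placeholder assumptions.

const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

// Placeholder for the actual clean-up logic described in the question
// (strip stray line feeds, fix dates, etc.).
function fixCsv(text) {
  return text;
}

// Background function triggered by object-finalize events on the bucket.
exports.preprocessCsv = async (event) => {
  // Skip objects this function wrote itself, otherwise the upload below
  // would trigger the function again.
  if (event.name.startsWith('clean/')) return;

  const bucket = storage.bucket(event.bucket);

  // Small files only: download() pulls the whole object into memory.
  const [contents] = await bucket.file(event.name).download();
  const cleaned = fixCsv(contents.toString('utf8'));

  // Write the corrected copy back to the bucket under a separate prefix.
  await bucket.file(`clean/${event.name}`).save(cleaned);
};

The cleaned object can then be loaded into BigQuery, either from the same function or from a separate load job.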

AWS S3 File Sync

I am trying to do file syncing from a local source to an S3 bucket, where I upload the files to the S3 bucket after calculating an MD5 checksum and putting it in the metadata of each file. The issue is that, while doing so, I also check the files which are already at the destination to avoid duplicate uploads. I do this by building a list of files to upload that do not match on both name and MD5. This operation of fetching the metadata for the S3 files, computing MD5 for the local files on the fly and then matching them is taking a lot of time, as I have around 200,000 to 500,000 files to match.
Is there any better way to achieve this, either by using multithreading or anything else? I don't have much idea how to achieve it in a multithreaded environment, as I eventually need one list with multiple threads doing the processing and adding to that same list. Any code sample or help is much appreciated.
This is a Windows job application written in C#, using the .NET 4.6.1 framework.
You could use the AWS Command-Line Interface (CLI), which has an aws s3 sync command that does something very similar to what you describe. However, with several hundred thousand files, it is going to be slow on the matching, too.
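For reference, a single sync invocation looks roughly like this (the local path and bucket name are placeholders); it only uploads files that are new or changed:

aws s3 sync C:\data s3://my-bucket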
Or, you could use Amazon S3 Inventory to obtain a daily listing of the files in the S3 bucket (including MD5 checksums) and then compare your files against that.

A way to convert bitrate/format of audio files (between upload & storage to S3)

Currently using PHP 5.3.x & Fedora
Ok. I'll try to keep this simple. I'm working on a tool that allows uploading and storing audio files on S3 for playback. Essentially, the user uploads a file (currently only mp3 & m4a are allowed) to the server, and the file is then pushed to S3 for storage via the PHP SDK for Amazon AWS.
The missing link is that I would like to perform a simple bitrate and format conversion of the file prior to uploading it (ensuring that all files are 160 kbps .mp3).
I've looked into ffmpeg, although it seems that the PHP library only allows reading bitrates and other metadata, not actual conversion.
Does anyone have any thoughts on the best way to approach this? Would running a shell_exec() command that performs the conversion be sufficient to do this, or is there a more efficient/better way of doing this?
Thanks in advance! Any help or advice is much appreciated.
You need to perform the conversion and the upload to S3 'outside' of the PHP application, as it will take too long for the user to hang around on the page. This could be a simple app that uses ffmpeg from the command line.
I'm not familiar with Linux, so perhaps someone else can provide a more specific answer, but here is the basic premise:
User uploads file to server.
You set some kind of flag (eg in a database) for the user to see that the file is being processed.
You 'tell' your external encoder that a file needs to be processed and uploaded - you could use an entry in a database or some kind of message queue for this.
The encoder (possibly a command-line app that invokes ffmpeg; see the example command after this list) picks up the next file in the queue and encodes it.
When complete, it uploads it to S3.
The flag is then updated to show that processing is complete and that the file is available.
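The encoding step itself can be a single ffmpeg invocation along these lines (the file names are placeholders, and the libmp3lame encoder has to be available in your ffmpeg build):

ffmpeg -i input.m4a -codec:a libmp3lame -b:a 160k output.mp3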
