Google Cloud Storage - GSUtil - Copy files, skip existing, do not overwrite - linux

I want to sync a local directory to a bucket in Google Cloud Storage. I want to copy the local files that do not exist remotely, skipping files that already exist both remotely and locally. Is it possible to do this with GSUtil? I can't seem to find a "sync" option for GSUtil or a "do not overwrite" option. Is it possible to script this?
I am on Linux (Ubuntu 12.04).

gsutil supports the noclobber flag (-n) on the cp command. This flag will skip files that already exist at the destination.

You need to add the -n flag to the command; it is documented officially by Google Cloud Platform:
-n: No-clobber. When specified, existing files or objects at the destination will not be overwritten. Any items that are skipped by this option will be reported as being skipped. This option will perform an additional GET request to check if an item exists before attempting to upload the data. This will save retransmitting data, but the additional HTTP requests may make small object transfers slower and more expensive.
Example (Using multithreading):
gsutil -m cp -n -a public-read -R large_folder gs://bucket_name

Using rsync, you can copy missing/modified files/objects:
gsutil -m rsync -r <local_folderpath> gs://<bucket_id>/<cloud_folderpath>
In addition, if you use the -d option, you will also delete files/objects in your bucket that are no longer present locally.
Another option could be to use Object Versioning, so you will replace the files/objects in your bucket with your local data, but you can always go back to the previous version.
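For example, a minimal sketch of both options, assuming a local folder named my_photos and a bucket named my-bucket:
gsutil -m rsync -d -r ./my_photos gs://my-bucket/photos
gsutil versioning set on gs://my-bucket
The first command mirrors the local folder and deletes remote objects that no longer exist locally; the second enables Object Versioning on the bucket, so overwritten or deleted objects are kept as noncurrent versions you can restore later.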

Related

How to download files which are created in last 24 hours using gsutil in GCP console?

I have a directory in a GCP storage bucket, and there are 2 subdirectories in that bucket.
Is there a way to download files which were created in the last 24 hours in those subdirectories, using a gsutil command from the console?
gsutil does not support filtering by date.
An option is to create a list of files to download via another tool or script, one object name per line.
Use stdin to specify a list of files or objects to copy. You can use gsutil in a pipeline to upload or download objects as generated by a program. For example:
cat filelist | gsutil -m cp -I gs://my-bucket
or:
cat filelist | gsutil -m cp -I ./download_dir
where the output of cat filelist is a one-per-line list of files, cloud URLs, and wildcards of files and cloud URLs.
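For example, since gsutil ls -l prints the size, creation time and URL of each object on one line, a rough sketch of such a pipeline could look like this (the bucket path is a placeholder, and GNU date is assumed):
CUTOFF=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
gsutil ls -l "gs://my-bucket/mydir/**" | awk -v cutoff="$CUTOFF" '$2 > cutoff && $3 ~ /^gs:/ {print $3}' | gsutil -m cp -I ./download_dir
The awk filter relies on ISO 8601 timestamps sorting lexicographically, and it skips the trailing TOTAL line because that line has no gs:// URL in its third field.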
I was able to achieve part of it using the GCP console and Cloud Shell.
Steps:
Go to the storage directory in the GCP console in your browser.
Click on Filter; you'll get options to filter based on created before, created after, etc.
Provide the date and apply the filter.
Click on the Download button.
Copy the command, open the GCP Cloud Shell and run it. The required files will be downloaded there.
Run the zip command in the shell to archive the downloaded files.
Select the Download option in Cloud Shell and provide the file path to download (see the example commands below).
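For reference, the archive and download steps in Cloud Shell look roughly like this (the file and directory names are just examples):
zip -r files.zip ./downloaded_files
cloudshell download files.zip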

List all RSYNCed folders in GCP GAE Linux

I set up some folders in GAE to be synced using the command -
gsutil rsync -r gs://sample1bucket1 ./sample1;
But I have forgotten all the places where I have done this. How can I list them all?
As per my understanding of your question, all your GAE folders are in the Cloud Storage bucket “sample1bucket1”, and you are trying to sync them into the directory “sample1”. If yes, then when writing rsync commands you have to mention both source and destination, so you should know where you are syncing all your files to, as per the public documentation.
However, you can list the folders in the current directory using the “ls” command to check for your destination folder, and later cd into those folders (“cd sample1” in your case) to see if the content has been copied from your bucket to the folder.
You can also list the number of running rsync processes using:
ps -ef | grep rsync | wc -l
I am leaving some information regarding these commands, in case you need it:
You can list all objects in a bucket using:
gsutil ls -r gs://bucket
You can list a remote directory with detailed information using:
rsync --list-only username@servername:/directoryname
You can list the folder contents using:
rsync --list-only username@servername:/directoryname/
You can also use the following command to parse out exactly what you need:
rsync -i
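For instance, combining -i with a dry run (-n) shows exactly what rsync would transfer without copying anything; the host and paths here are placeholders:
rsync -ain username@servername:/directoryname/ ./localcopy/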

Retain owner and file permissions info when syncing to AWS S3 Bucket from Linux

I am syncing a directory to AWS S3 from a Linux server for backup.
rsync -a --exclude 'cache' /path/live /path/backup
aws s3 sync path/backup s3://myBucket/backup --delete
However, I noticed that when I want to restore a backup like so:
aws s3 sync s3://myBucket/backup path/live/ --delete
The owner and file permissions are different. Is there anything I can do or change in the code to retain the original Linux information of the files?
Thanks!
I stumbled on this question while looking for something else and figured you (or someone) might like to know you can use other tools that can preserve original (Linux) ownership information.
There must be others but I know that s3cmd can keep the ownership information (stored in the metadata of the object in the bucket) and restore it if you sync it back to a Linux box.
The syntax for syncing is as follows
/usr/bin/s3cmd --recursive --preserve sync /path/ s3://mybucket/path/
And you can sync it back with the same command just reversing the from/to.
But, as you might know (if you did a little research on S3 cost optimisation), depending on the situation it could be wiser to use a compressed file.
It saves space and should take fewer requests, so you could end up with some savings at the end of the month.
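As a rough sketch of that compressed-archive approach (the bucket and paths are just examples):
tar czf live-backup.tar.gz -C /path live
/usr/bin/s3cmd put live-backup.tar.gz s3://mybucket/backups/
Since tar stores ownership and permissions inside the archive itself, extracting it later as root (tar xzpf) restores the original Linux metadata.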
Also, s3cmd is not the fastest tool to synchronise with S3, as it does not use multi-threading (and is not planning to) like other tools do, so you might want to look for other tools that can preserve ownership and also benefit from multi-threading, if that's still what you're looking for.
To speed up data transfer with s3cmd, you could run multiple s3cmd instances with different --exclude --include statements.
For example
/usr/bin/s3cmd --recursive --preserve --exclude="*" --include="a*" sync /path/ s3://mybucket/path/ & \
/usr/bin/s3cmd --recursive --preserve --exclude="*" --include="b*" sync /path/ s3://mybucket/path/ & \
/usr/bin/s3cmd --recursive --preserve --exclude="*" --include="c*" sync /path/ s3://mybucket/path/

wget to download new wildcard files and overwrite old ones

I'm currently using wget to download specific files from a remote server. The files are updated every week, but always have the same file names, e.g. a newly uploaded file1.jpg will replace the local file1.jpg.
This is how I am grabbing them, nothing fancy:
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/file1.jpg
This downloads file1.jpg from the remote server if it is newer than the local version then overwrites the local one with the new one.
Trouble is, I'm doing this for over 100 files every week and have set up cron jobs to fire the 100 different download scripts at specific times.
Is there a way I can use a wildcard for the file name and have just one script that fires every 5 minutes for example?
Something like....
wget -N -P /path/to/local/folder/ http://xx.xxx.xxx.xxx/remote/files/*.jpg
Will that work? Will it check the local folder for all current file names, see what is new and then download and overwrite only the new ones? Also, is there any danger of it downloading partially uploaded files on the remote server?
I know that some kind of file sync script between servers would be a better option but they all look pretty complicated to set up.
Many thanks!
You can specify the files to be downloaded one by one in a text file, and then pass that file name using option -i or --input-file.
e.g. contents of list.txt:
http://xx.xxx.xxx.xxx/remote/files/file1.jpg
http://xx.xxx.xxx.xxx/remote/files/file2.jpg
http://xx.xxx.xxx.xxx/remote/files/file3.jpg
....
then
wget .... --input-file list.txt
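If the list of URLs is fairly stable, a single cron entry running every 5 minutes could then look roughly like this (the paths are placeholders):
*/5 * * * * wget -N -P /path/to/local/folder/ --input-file /path/to/list.txt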
Alternatively, if all your *.jpg files are linked from a particular HTML page, you can use recursive downloading, i.e. let wget follow the links on your page to all linked resources. You might need to limit the "recursion level" and file types in order to prevent downloading too much. See wget --help for more info.
wget .... --recursive --level=1 --accept=jpg --no-parent http://.../your-index-page.html

rsync: copy files if local file doesn't exist. Don't check filesize, time, checksum, etc.

I am using rsync to backup a million images from my linux server to my computer (windows 7 using Cygwin).
The command I am using now is :
rsync -rt --quiet --rsh='ssh -p2200' root@X.X.X.X:/home/XXX/public_html/XXX /cygdrive/images
Whenever the process is interrupted and I start it again, it takes a long time to start the copying process.
I think it is checking each file for updates.
The images on my server won't change once they are created.
So, is there a faster way to run the command so that it copies a file only if the local file doesn't exist, without checking filesize, time, checksum, etc.?
Please suggest.
Thank you
Did you try this flag? It might help, but it might still take some time to resume the transfer:
--ignore-existing
This tells rsync to skip updating files that already exist on the destination (this does not ignore existing directories, or nothing would get done). See also --existing.
This option is a transfer rule, not an exclude, so it doesn't affect the data that goes into the file-lists, and thus it doesn't affect deletions. It just limits the files that the receiver requests to be transferred.
This option can be useful for those doing backups using the --link-dest option when they need to continue a backup run that got interrupted. Since a --link-dest run is copied into a new directory hierarchy (when it is used properly), using --ignore-existing will ensure that the already-handled files don't get tweaked (which avoids a change in permissions on the hard-linked files). This does mean that this option is only looking at the existing files in the destination hierarchy itself.
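Applied to your command, it would look roughly like this (same placeholder host and paths as in your question):
rsync -rt --ignore-existing --quiet --rsh='ssh -p2200' root@X.X.X.X:/home/XXX/public_html/XXX /cygdrive/images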
