Process a file in Google Cloud Storage - python-3.x

I have some very large files (100GB) in GCS that need to be processed to remove invalid characters.
Downloading them and processing them and uploading them again takes forever.
Does anyone know if it is possible to process them on Google Cloud Platform, eliminating the need for download/upload?
I am familiar with Python and Cloud functions if those are an option.

As John Hanley said in the comments section, there are no compute features on Cloud Storage, so to process a file you need to download it.
That said, instead of downloading the huge file locally to process it, you can start a Compute Engine VM, download the file there, process it with a Python script (since you have stated that you're familiar with Python), and upload the processed file back.
It will probably be quicker to download the file on a Compute Engine VM (it depends on the machine type, though) than to download it to your own computer.
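If it helps, here is a minimal sketch of what that processing script could look like on the VM, assuming the file is text and that "invalid characters" means non-printable characters; the file names and the filter are placeholders to adapt:
import string

VALID = set(string.printable)      # what counts as "valid" here is an assumption
CHUNK = 64 * 1024 * 1024           # read 64 MB at a time so 100 GB never sits in RAM

with open("my-huge-file", "r", encoding="utf-8", errors="replace") as src, \
     open("my-huge-file.clean", "w", encoding="utf-8") as dst:
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write("".join(c for c in chunk if c in VALID))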
Also, for faster downloads of huge files, you can use some gsutil options:
gsutil \
-o 'GSUtil:parallel_thread_count=1' \
-o 'GSUtil:sliced_object_download_max_components=16' \
cp gs://my-bucket/my-huge-file .
And for faster uploads of huge files, you can use parallel composite uploads :
gsutil \
-o 'GSUtil:parallel_composite_upload_threshold=150M' \
cp my-huge-file gs://my-bucket

Related

How to compress/zip files in Microsoft Azure App Service console?

I know about the KUDU service in Microsoft Azure and it works great. However, I have some very large data (more than 3 GB) and it takes a very long time to download it and then upload it to my new server. Is there a way to zip the data on Azure through the command line and then use wget to download it onto the new server? I have been doing this manually until now, but it takes forever to download everything to my PC first and then upload it to the server through FTP.
I am logged into the Microsoft Azure App Service console. I have tried compress, Compress-Archive and even the zip command, but nothing works. It gives that famous "not recognized as an internal or external command" message:
'compress' is not recognized as an internal or external command, operable program or batch file.
How could I compress these files on the Azure console? Or is there a way to install some compression tool on this server through the command line? Any help would be appreciated.
Now that Windows and Kudu both support a native tar command, why not use that?
tar -cvzf my_archive.tar.gz input_dir
tar -xf my_archive.tar.gz
While the unzip utility is available, there's no zip tool. One way around that is to upload the command-line version of 7-Zip, which is a standalone .EXE file.

Unable to extract shape_predictor_68_face_landmarks.dat for bz

I am trying to run some face frontalization code (using Python 3 on Windows 10). The code uses OpenCV and dlib and requires a file called shape_predictor_68_face_landmarks.dat. The code tries to automatically download it and then unzip it, but it fails to unzip, giving an "unexpected end of archive" error. I tried to use WinRAR to repair the file (which I also tried manually downloading from http://sourceforge.net/projects/dclib/files/dlib/v18.10/shape_predictor_68_face_landmarks.dat.bz2), but it says it can only repair .zip and .rar files.
Does anyone know where I can download the uncompressed .dat file from? Or alternatively how I can repair a damaged .bz file in Windows?
The file is available at
http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
I downloaded it and verified that extraction works. The file is smaller than the one used in the previous version, but I think that is due to improvements.
In case this does not work, let me (or Davis King, who maintains the dlib blog) know so that you can get the uncompressed version.
Downloading using the CLI is a lot easier.
wget http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
To decompress the compressed file you just downloaded, use the following command:
bzip2 -d shape_predictor_68_face_landmarks.dat.bz2
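Since the question mentions Python 3 on Windows 10, where wget and bzip2 may not be installed, the same two steps can also be done with nothing but the Python standard library; a rough sketch (the output file name is an assumption):
import bz2
import urllib.request

url = "http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2"
urllib.request.urlretrieve(url, "shape_predictor_68_face_landmarks.dat.bz2")

# stream-decompress so the whole file never has to fit in memory
with bz2.open("shape_predictor_68_face_landmarks.dat.bz2", "rb") as src, \
     open("shape_predictor_68_face_landmarks.dat", "wb") as dst:
    for chunk in iter(lambda: src.read(1024 * 1024), b""):
        dst.write(chunk)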
As mentioned above, download shape_predictor_68_face_landmarks.dat
from here. But the download sometimes fails partway through (I faced this issue), so if you're facing the same problem, I recommend downloading it via the command line:
$ wget link

Downloading from s3 bucket fails while running the s3cmd get from cron job

I am running a script to download files from an S3 bucket, scheduled in cron. At times the script fails, but when I run it manually it always works.
Can anyone help me with this?
It appears that your requirement is to download all new files from Amazon S3, so that you have a local copy of all files (without downloading them repeatedly).
I would recommend using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. This will synchronize the files from Amazon S3 to your local directory (or the other way). If something goes wrong, it will try to copy the files again on the next sync.
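If you would rather stay in Python for the cron job instead of the CLI, a rough boto3 equivalent of "download only the objects I don't have yet" looks like the sketch below; the bucket name and local directory are placeholders, and aws s3 sync remains the simpler option since it also compares sizes/timestamps and handles retries for you:
import os
import boto3

BUCKET = "my-bucket"        # placeholder
DEST = "/data/s3-mirror"    # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):                     # skip "folder" placeholder objects
            continue
        local_path = os.path.join(DEST, key)
        if os.path.exists(local_path):            # already fetched on a previous run
            continue
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)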

Zip files corrupt over 4 gigabytes - No warnings or errors - Did I lose my data?

I created a bunch of zip files on my computer (Mac OS X) using a command like this:
zip -r bigdirectory.zip bigdirectory
Then, I saved these zip files somewhere and deleted the original directories.
Now, when I try to extract the zip files, I get this kind of error:
$ unzip -l bigdirectory.zip
Archive: bigdirectory.zip
warning [bigdirectory.zip]: 5162376229 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [bigdirectory.zip]: start of central directory not found;
zipfile corrupt.
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
I have since discovered that this could be because zip can't handle files over a certain size, maybe 4 gigs. At least I read that somewhere.
But why would the zip command let me create these files? The zip file in question is 9457464293 bytes and it let me make many more like this with absolutely no errors.
So clearly it can create these files.
I really hope my files aren't lost. I've learned my lesson and in the future I will check my archives before deleting the original files, and I'll probably also use another file format like tar/gzip.
For now though, what can I do? I really need my files.
Update
Some people have suggested that my unzip tool did not support big enough files (which is weird, because I used the built-in OS X zip and unzip). At any rate, I installed a new unzip from Homebrew, and lo and behold, I do get a different error now:
$ unzip -t bigdirectory.zip
testing: bigdirectory/1.JPG OK
testing: bigdirectory/2.JPG OK
testing: bigdirectory/3.JPG OK
testing: bigdirectory/4.JPG OK
:
:
file #289: bad zipfile offset (local header sig): 4294967295
(attempting to re-compensate)
file #289: bad zipfile offset (local header sig): 4294967295
file #290: bad zipfile offset (local header sig): 9457343448
file #291: bad zipfile offset (local header sig): 9457343448
file #292: bad zipfile offset (local header sig): 9457343448
file #293: bad zipfile offset (local header sig): 9457343448
:
:
This is really worrisome because I need these files back. And there were definitely no errors upon creation of this zip file using the system zip tool. In fact, I made several of these at the same time and now they are all exhibiting the same problem.
If the file really is corrupt, how do I fix it?
Or, if it is not corrupt, how do I extract it?
Unzip below version 6 seemingly fails; use
jar -xf <zipfile>
if you have Java installed, or yet another unzip tool, before you write the file off.
See: https://serverfault.com/questions/235139/how-to-unzip-files-bigger-than-4gb
Try 7z x
I had the same issue with unzip on Linux for a .zip file larger than 4 GB, compounded with an "only DEFLATED entries can have EXT descriptor" error.
The command 7z x resolved all my issues.
Be careful though: 7z x will extract all files with paths rooted in the current directory. The -o option lets you specify an output directory.
I had a similar problem backing up a 12GB directory before performing a hard disk format. Funnily enough I used the same command as you.
I read around and found suggestions to run:
zip -F
and
zip -FF
to try to fix the file.
Unfortunately these did not work and I still received errors.
After looking around some more, I found the ditto command and it worked perfectly against my original (untouched) zip file:
ditto -x -k original-file.zip dst-directory
-x to extract an archive
-k Specifies it to be a PKZip archive instead of the default CPIO
After using this command, I successfully extracted all of the files.
The built-in macOS Archive Utility (which is the default used when you select something in Finder and go to File -> Compress "<item>") also creates "corrupt" archives when a file in the archive is over 4 gigabytes in size, the size of the archive itself is over 4 gigabytes or you are trying to compress more than 65536 files into a single zip. This happens because it doesn't use the Zip64 extension format.
This is mentioned on https://apple.stackexchange.com/questions/221020/large-zip-files-created-in-os-x-cannot-be-opened-in-windows and is well covered in the "Apple Archive Utility (and ditto) and very large ZIP archives" 2009 blog post for the now defunct Springy utility. You can also see the 7-Zip folks are aware of the Apple tools creating corrupt zips issue too.
But why would the zip command let me create these files?
Strictly speaking, the original zip format only supports archives up to 2^32 bytes (4 GiB) that do not contain files that were originally larger than 4 GiB and that hold no more than 65,535 files. Because the command-line version of the Infozip tools shipped with OS X up to version 10.11 (El Capitan) was no newer than 5.52, it could only produce non-conformant archives if you forced it to exceed the original zip format limits. Infozip 6.0 and above know how to make Zip64 archives, and that standard has much higher limits. The Infozip 6.0 command-line tools started shipping with macOS 10.12 (Sierra). In 2014, when the question was originally asked, the newest OS X was 10.10 (Yosemite).
As stated above, even in macOS 10.15 (Catalina) the GUI Archive Utility still creates such "corrupt" zips.
If the file really is corrupt, how do I fix it?
It's corrupt in the sense that it's non-conformant and will cause a lot of conformant tools to choke. You could extract it (see below) and then compress again with a tool that knows how to make Zip64 files...
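One way to re-compress without leaving Python is the standard-library zipfile module, which can write the Zip64 extensions; a minimal sketch, with the directory and archive names as placeholders:
import os
import zipfile

src_dir = "bigdirectory"    # placeholder
# allowZip64=True lets the archive and its members exceed the old 4 GiB limits
with zipfile.ZipFile("bigdirectory-fixed.zip", "w",
                     compression=zipfile.ZIP_DEFLATED,
                     allowZip64=True) as zf:
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path)    # arcname defaults to the path as given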
Or, if it is not corrupt, how do I extract it?
Technically, all of the data from the files that were compressed is still in the archive, but the headers that allow fast listing of the zip's content are broken. Such zips can be a struggle to work with when using other tools (even testing such a zip with the command-line unzip tool on the same version of macOS can report issues like invalid compressed data to inflate / bad zipfile offset (local header sig)).
To get at the files of such zips you need to use a program that will quietly just extract whatever was compressed without checking for conformance or trying to check/list the files. Examples of tools that can do this are:
macOS Archive Utility GUI tool
macOS command line tool ditto
7-zip
Java's jar tool
Infozip-based tools won't be able to work with or repair such zip files once you've made one.
You can use
zip -FF corrupted.zip --out fixed.zip
Replace corrupted.zip with your problem zip and fixed.zip with the name you want for the repaired .zip file.
I faced exactly the same issue when I tried to unzip huge zip files (~7 GB). I was damn sure that there was no error while copying the zip files to the server (I double-checked it with rsync).
Depending on your situation, the solution is:
1) If you're doing this on a local machine, right-click the zip file and choose Extract Here; this will work for .zip files of any size.
2) If your zip files are on a remote server, first mount the server filesystem locally using sftp (sftp://username@server.url.address.com). After that, just navigate to the directory and do the same thing as in (1), i.e. right-click the zip file and extract it.
Might not be the best solution but that's one way of doing it.

Linux unzip preserve case?

Working on a web site. A number of third-party JavaScript libraries use mixed case in their file and folder names.
I am working on a Windows system.
When ready to upload from my local Windows XAMPP environment to my Linux hosting, I use 7-Zip to create a zip file of my site. I use 7-Zip's -xr! feature to skip certain directories like my .git repository.
I FTP the resulting .zip file to my server and use the server's "unzip" function to explode it. All my files are there but they are all changed to lowercase!
This kills the website as the third party libraries that are mixed-case are no longer found.
I've tried unzip -C but that did not seem to do anything.
I also look in the archive prior to uploading and on windows, all the file name cases are preserved.
Tried using GNU32's Windows tar, but the --exclude function is not allowing me to skip the .git directories.
I need some help in the form of:
How to use unzip in Linux such that it preserves case (googled until hairless, but no love found...)
How to use tar on Windows such that it excludes particular directories
How to use something else to achieve my goal. I honestly don't care what it is... I'm downloading Cygwin right now to see if it'll help at all. I may end up installing Linux in a virtual box just to try tar-gz from a virtual machine actually running Linux, but would REALLY rather avoid that hassle every time I want to pack up a pretty simple archive.
Zip works fine for packing, but unpacking is not kosher.
Use tar's --exclude-vcs option:
--exclude-vcs
exclude version control system directories
Example:
tar --exclude-vcs -czf foo.tar.gz foo
or for a *.tar.bz2 archive
tar --exclude-vcs -cjf foo.tar.bz2 foo
Try unzip -U file.zip; this might work if you have an old version of unzip. Otherwise, post the output of unzip -v and unzip -l file.zip.
