How to read a large 30GB tar.xz with colab - python-3.x

I downloaded a 30GB tar.xz file to my G-Drive using Google Colab. I need help extracting and reading this archive in Colab. Inside the tar archive there are ten folders. Is it possible to read these folders individually? I have tried the following, but both attempts failed.
Untarring the 30GB archive in G-Drive failed because of the limitations on reading and writing files in G-Drive.
I can download the file directly to the local Colab directory, but because of the space limitations in Colab I cannot extract or read it there.
Any suggestions on how to proceed with this problem?
Thank you

You can extract just one directory inside the tar file by using the --wildcards option.
!tar xf file.tar.xz --wildcards 'path_to/dir/*'
Here's an example notebook.
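If you want to do the same thing from Python and read only one of the ten folders at a time, a minimal sketch along these lines may help; the archive path, destination, and folder name (folder01) are hypothetical placeholders. Streaming mode ("r|xz") reads the members sequentially, so the 30GB archive is never seeked back and forth.
import os
import tarfile

ARCHIVE = "/content/drive/MyDrive/file.tar.xz"  # hypothetical path to the archive
WANTED = "folder01/"                            # one of the ten top-level folders
DEST = "/content/extracted"                     # local Colab disk

os.makedirs(DEST, exist_ok=True)

# Stream through the archive and extract only members under the wanted folder.
with tarfile.open(ARCHIVE, "r|xz") as tar:
    for member in tar:
        if member.name.startswith(WANTED):
            tar.extract(member, path=DEST)
Extracting one folder at a time keeps the local disk usage to a single folder rather than the whole archive.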

Related

How can I download a folder from google drive or dropbox using command in linux?

I am trying to download a folder in a Linux shell using a Dropbox or Google Drive link. The download works, but the result is not saved as a folder; after it is downloaded I cannot enter it with cd. When I try, I get the message that the file is not a directory.
How can I download the folder and access it? I am also executing this in a virtual machine.
I do not know which method you are using to download the directory. In order to download a directory, you either need to recursively download all the files in it or create a tar or zip of the directory.
You can consider using gdown.
Please also read the detailed explanation from the following post: wget/curl large file from google drive
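For example, a minimal sketch using gdown's folder support (available in recent gdown versions) might look like the following; the folder URL is a placeholder and the folder must be shared with "anyone with the link".
# pip install gdown
import gdown

# Download an entire shared Drive folder into ./my_folder (placeholder names).
url = "https://drive.google.com/drive/folders/<FOLDER_ID>"
gdown.download_folder(url, output="my_folder", quiet=False)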

How to upload a zip file from a google drive in colab using zip file link

I have a zip file (files.zip) in my Google Drive. This zipped file is used by a shared Colab notebook. I want to write the program so that the zip file is fetched by its file id, so that anybody can run the program from their own drive.
In other words, the zip file is in my drive, but the Colab notebook can be run from another Google Drive account. I am using:
!wget --no-check-certificate https://drive.google.com/uc?export=download&id=FileId
This downloads a file that does not have the .zip extension. Please help.
Thanks.
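One likely cause is that the unquoted & ends the wget command, so everything after it is dropped and the output is saved as uc?export=download. A minimal sketch of an alternative, assuming the file is shared with "anyone with the link" and FileId stands in for the real id, is to let gdown handle the download and the naming:
# pip install gdown
import gdown

# Recent gdown versions accept the Drive file id directly and follow
# Google Drive's confirmation redirects; save the result with a .zip name.
gdown.download(id="FileId", output="files.zip", quiet=False)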

Some zip files not unarchiving in Python

I am downloading a zip file from an API and trying to unzip it using Python's shutil.
shutil.unpack_archive(file_name)
It behaves inconsistently: for some files it works, while for others it shows the following error:
name.zip is not a zip file
There is no issue with the downloaded file; I am able to unarchive it manually.
Any help here would be appreciated.
You should use zipfile (for zip archives) or tarfile (for tar archives).
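Since the error suggests some of the downloaded files are not actually zip archives, a minimal sketch that checks the file signature before extracting (the helper name and arguments are hypothetical, not part of the original answer) could look like this:
import tarfile
import zipfile

def extract(file_name, dest="."):
    # is_zipfile/is_tarfile inspect the file contents, so a misnamed tar
    # (or an HTML error page saved as .zip) is reported explicitly.
    if zipfile.is_zipfile(file_name):
        with zipfile.ZipFile(file_name) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(file_name):
        with tarfile.open(file_name) as tf:
            tf.extractall(dest)
    else:
        raise ValueError(f"{file_name} is neither a zip nor a tar archive")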

How to verify the files after unzipping from a zip file (7zip)

How can I verify that the unzipped files are not corrupted after the extraction from the zip file?
Scenario: I am packaging some font files with 7zip, and during installation of the application I unzip the archive and copy the files to a specific directory. In some cases, although the unzipping succeeds, the unzipped files turn out to be corrupted and I am not able to load the font files. Upon investigation, I found that they are somehow getting corrupted during this process.
Is it possible that during extraction, extracted files can get corrupted?
What is the way to check if the extracted files are fine or corrupted?
I can test the zip file with the -t flag, which checks that all the files are correctly stored in the archive, but what I want to check is whether the files are corrupted after extraction.
Thanks
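One possible approach, sketched below under the assumption that the original .zip is still available next to the extracted files: every zip member carries a CRC-32 checksum, so recomputing it for each extracted file will flag anything that was corrupted after extraction. The function name and arguments are hypothetical.
import os
import zipfile
import zlib

def verify_extraction(zip_path, extract_dir):
    bad = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            out_path = os.path.join(extract_dir, info.filename)
            # Recompute the CRC-32 of the extracted copy and compare it with
            # the checksum stored in the archive for that member.
            with open(out_path, "rb") as f:
                crc = zlib.crc32(f.read())
            if crc != info.CRC:
                bad.append(info.filename)
    return bad  # an empty list means every extracted file matched its checksum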

Using tar -zcvf against a folder creates an empty compressed file

I am ssh'ed into an Acquia server trying to download some files. I need to back up these files for local development (mainly to get user-uploaded images).
I am using the following command:
tar -zcvf ~/download/stage-files_3-19-2015_1344.tar.gz files/
I have read/write access to the download folder, which I created. I am in the parent folder of "files", and permissions on that folder are 777.
I was able to run this the other day with no issues. So I am very confused as to why this is happening now.
Actually I just figured this darn thing out. Must have run out of disk space because once I removed a prior compressed backup of the files it started running just fine. Dang disk quotas. Sorry guys.
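Since the root cause here was disk space, a quick check of free space before creating the archive can make this failure obvious; a minimal sketch (the ~/download path just mirrors the command above):
import os
import shutil

# Report free space in the directory that will hold the new .tar.gz.
free_bytes = shutil.disk_usage(os.path.expanduser("~/download")).free
print(f"Free space: {free_bytes / 1024**3:.1f} GiB")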
