Unzipping image directory in Google Colab doesn't unzip entire contents - zip

I'm trying to unzip a directory of 75,000 images for training a CNN. When unzipping using,
!unzip -uq "/content/BDD_gt_8045.zip" -d "/content/drive/My Drive/Ground_Truth"
not all images unzip. I have about 5,000 I believe. I tried doing it several times but then I have some duplicates. Is there a limit to the number of images I can unzip?
I'm currently stuck on how else I'm meant to get all files into my drive to train the model.

Colab's default 'unzip' binary doesn't work as expected. It seems to cancel the unzipping automatically after a few cycles. Run latest version of 7z and you are good to go.
# To extract with full paths
!7z x <filename.zip>
# To extract all the files in the same folder (ignore paths)
!7z e <filename.zip>
# To specify output directory, use '-o'
!7z x <filename.zip> -o '/content/drive/My Drive/Datasets/FashionMNIST'

Related

Extract tar.gz{some integer} in python

I am trying to extract a file name with this format--> filename.tar.gz10
I have tried mutpile wayd but for all of them, I get the error that is unknow format. it works fine for files ends with tar.gz00. I tried to change the name but still does not work.
Here are what I have tried,
import tarfile
file = tarfile.open('filename.tar.gz10')
file.extractall('./extracted_path')
file.close()
Another way is,
shutil.unpack_archive('./filename.tar.gz10', './extracted_path', 'tar.gz17')
Thanks for your help in advance.
This coule be because the archive was split into smaller chunks, on linux you could do so using the split -b command so one big file is actually multiple smaller ones now, and they are named like
file.tar.gz01
file.tar.gz02
file.tar.gz03
file.tar.gz04
etc...
you wont be able to decompress these file individually, so you have to concatenate them first into one file then decompress.
To verify whther it was split or not, run file {filename} and if does not recognize it as a gzip compressed archive then it is propably split (this is why you get unknown format error)
You can try to do the following:
from glob import glob
import os
path = '/path/to/' # location of your files
list_of_files = glob(path + '*.tar.gz*') # list all gzip files
bash_command = 'gzip -dk filename.tar.gz' + ' '.join(list_of_files) # create bash command to concatenate the files
os.system(bash_command)

Splitting large tar file into multiple tar files

I have a tar file which is 3.1 TB(TeraByte)
File name - Testfile.tar
I would like to split this tar file into 2 parts - Testfil1.tar and Testfile2.tar
I tried the following so far
split -b 1T Testfile.tar "Testfile.tar"
What i get is Testfile.taraa(what is "aa")
And i just stopped my command. I also noticed that the output Testfile.taraa doesn't seem to be a tar file when I do ls in the directory. It seems like it is a text file. May be once the full split is completed it will look like a tar file?
The behavior from split is correct, from man page online: http://man7.org/linux/man-pages/man1/split.1.html
Output pieces of FILE to PREFIXaa, PREFIXab, ...
Don't stop the command let it run and then you can use cat to concatenate (join) them all back again.
Examples can be seen here: https://unix.stackexchange.com/questions/24630/whats-the-best-way-to-join-files-again-after-splitting-them
split -b 100m myImage.iso
# later
cat x* > myImage.iso
UPDATE
Just as clarification since I believe you have not understood the approach. You split a big file like this to transport it for example, files are not usable this way. To use it again you need to concatenate (join) pieces back. If you want usable parts, then you need to decompress the file, split it in parts and compress them. With split you basically split the binary file. I don't think you can use those parts.
You are doing the compression first and the partition later.
If you want each part to be a tar file, you should use 'split' first with de original file, and then 'tar' with each part.

How to define the condition of a corrupted file for audio file in Python

I am using Python 3.6, Jupyter notebook by connecting to a remote machine. I have a large dataset of mp3 files. I use FFmpeg (version is 2.8.14-0ubuntu0.16.04.1.) to convert mp3 files to wav format.
My code below goes over the file path list and if the file is mp3 it converts it to wav format and deletes the mp3 file. The code works but for a few files it stops and gives error. I opened those files and saw that they have no duration and each of them has size 600 looking at the terminal folder size column but it might be a coincidence. The error is file not found for 'temp_name.wav'.
I can see that these corrupted files are not able to be converted to wav. When I delete them manually and run the code again it works. But I have large datasets and cannot know which files are corrupted beforehand. Is there a way to make the code (before converting the file to wav) if the file is corrupted it deletes it and continues to next file. I just don`t know how to define the condition of a corrupted file or if the file cannot be converted to wav.
# npaths is the list of full file paths
for fpath in npaths:
if (fpath.endswith(".mp3")):
cdir=os.path.dirname(fpath) # extract the directory of file
os.chdir(cdir) # change the directory to cdir
filename=os.path.basename(fpath) # extract the filename from the path
os.system("ffmpeg -i {0} temp_name.wav".format(filename))
ofnamepath=os.path.splitext(fpath)[0] # filename without extension
temp_name=os.path.join(cdir, "temp_name.wav")
new_name = os.path.join(ofnamepath+'.wav')
os.rename(temp_name,new_name) # use original filename with wav ext
old_file = os.path.join(ofnamepath+'.mp3') # find and delete the mp3
os.remove(old_file)

Change huge amount of data from NIST to RIFF wav file

So, I am writing a speech recognition program. To do that I downloaded 400MB of data from TIMIT. When I inteded to read the wav files (I tried two libraries) as follow:
import scipy.io.wavfile as wavfile
import wave
(fs, x) = wavfile.read('../data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV')
w = wave.open('../data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV')
In both cases they have the problem that the wav file format says 'NIST' and it must be in 'RIFF' format. (Something about sph also I readed but the nist file I donwloaded are .wav, not .sph).
I downloaded then SOX from http://sox.sourceforge.net/
I added the path correctly to my enviromental variables so that my cmd recognize sox. But I can't really find how to use it correctly.
What I need now is a script or something to make sox change EVERY wav file format from NIST to RIFF under certain folder and subfolder.
EDIT:
in reading a WAV file from TIMIT database in python I found a response that worked for me...
Running sph2pipe -f wav input.wav output.wav
What I need is a script or something that searches under a folder, all subfolders that contain a .wav file to apply that line of code.
Since forfiles is a Windows command, here is a solution for unix.
Just cd to the upper folder and type:
find . -name '*.WAV' | parallel -P20 sox {} '{.}.wav'
You need to have installed parallel and sox though, but for Mac you can get both via brew install. Hope this helps.
Ok, I got it finally. Go to the upper folder and run this code:
forfiles /s /m *.wav /c "cmd /c sph2pipe -f wav #file #fnameRIFF.wav"
This code searches for every file and make it readble for the python libs. Hope it helps!

Is it possible to download just part of a ZIP archive (e.g. one file)? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 1 year ago.
Improve this question
Is there a way by which I can download only a part of a .rar or .zip file without downloading the whole file?
There is a ZIP file containing files A, B, C, and D.
I only need A. Can I somehow tweak the download to download only A or if possible extract the file in the server itself and get A only?
The trick is to do what Sergio suggests without doing it manually. This is easy if you mount the ZIP file via an HTTP-backed virtual filesystem and then use the standard unzip command on it. This way the unzip utility's I/O calls are translated to HTTP range GETs, which means only the chunks of the ZIP file that you want get transferred over the network.
Here's an example for Linux using HTTPFS, a very lightweight virtual filesystem (it uses FUSE). There are similar tools for Windows.
Get/build httpfs:
$ wget http://sourceforge.net/projects/httpfs/files/httpfs/1.06.07.02
$ mv 1.06.07.10 httpfs_1.06.07.10.tar.bz2
$ tar -xjf httpfs_1.06.07.10.tar.bz2
$ rm httpfs
$ ./make_httpfs
Mount a remote ZIP file and extract one file from it:
$ mkdir mount_pt
$ sudo ./httpfs http://server.com/zipfile.zip mount_pt
$ sudo ls mount_pt
zipfile.zip
$ sudo unzip -p mount_pt/zipfile.zip the_file_I_want.txt > the_file_I_want.txt
$ sudo umount mount_pt
Of course you can also use whatever other tools beside the command-line one (I need sudo because it seems FUSE is set up that way on my machine, you shouldn't have to need it).
In a way, yes, you can.
ZIP file format says that there's a "central directory". Basically, this is a table that stores what files are in the archive and what offsets do they have.
So, using Content-Range you could download part of the file from the end (the central directory is the last thing in a ZIP file) and try to identify the central directory in it. If you succeed then you know the file list and offsets, so you can proceed and get those chunks separately and decompress them yourself.
This approach is quite error-prone and is not guaranteed to work. But so is hacking in general :-)
Another possible approach would be to build a custom server for that (see pst's answer for more details).
There are several ways for a normal person to be able to download an individual file from a compressed ZIP file, unfortunately they aren't common knowledge. There are some open-source tools and online web services, including:
Windows: Iczelion's HTTP Zip Dowloader (open-source) (that I've used for over 10 years!)
Linux: partial-zip (open-source)
Online: wobzip.org (closed-source)
You can arrange for your file to appear in the back of the ZIP file.
Download 100k:
$ curl -r -100000 https://www.keepassx.org/releases/2.0.2/KeePassX-2.0.2.zip -o tail.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 97k 100 97k 0 0 84739 0 0:00:01 0:00:01 --:--:-- 84817
Check what files we did get:
$ unzip -t tail.zip
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]: attempt to seek before beginning of zipfile
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]: attempt to seek before beginning of zipfile
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]: attempt to seek before beginning of zipfile
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]: attempt to seek before beginning of zipfile
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
testing: KeePassX-2.0.2/share/translations/keepassx_uk.qm OK
testing: KeePassX-2.0.2/share/translations/keepassx_zh_CN.qm OK
testing: KeePassX-2.0.2/share/translations/keepassx_zh_TW.qm OK
testing: KeePassX-2.0.2/zlib1.dll OK
At least one error was detected in tail.zip.
Then extract the last file:
$ unzip tail.zip KeePassX-2.0.2/zlib1.dll
Archive: tail.zip
error [tail.zip]: missing 7751495 bytes in zipfile
(attempting to process anyway)
inflating: KeePassX-2.0.2/zlib1.dll
I think Sergio Tulentsev's idea is brilliant.
However, if there is control over the server -- e.g., custom code can be deployed -- then it is a rather trivial operation (in the scheme of things :) to map/handle a request, extract the relevant portion of the ZIP archive, and send the data back in the HTTP stream.
The request might look like:
http://foo.bar/myfile.zip_a.jpeg
Which would mean extract -- and return -- "a.jpeg" from "myfile.zip".
(I intentionally chose this silly format so that browsers would likely choose "myfile.zip_a.jpeg" as the name in the download dialog when it appears.)
Of course, how this is implemented depends on the server/language/framework and there may already be existing solutions that support a similar operation (but I know not).
Based on the good input I have written a code-snippet in Powershell to show how it could work:
# demo code downloading a single DLL file from an online ZIP archive
# and extracting the DLL into memory to mount it finally to the main process.
cls
Remove-Variable * -ea 0
# definition for the ZIP archive, the file to be extracted and the checksum:
$url = 'https://github.com/sshnet/SSH.NET/releases/download/2020.0.1/SSH.NET-2020.0.1-bin.zip'
$sub = 'net40/Renci.SshNet.dll'
$md5 = '5B1AF51340F333CD8A49376B13AFCF9C'
# prepare HTTP client:
Add-Type -AssemblyName System.Net.Http
$handler = [System.Net.Http.HttpClientHandler]::new()
$client = [System.Net.Http.HttpClient]::new($handler)
# get the length of the ZIP archive:
$req = [System.Net.HttpWebRequest]::Create($url)
$req.Method = 'HEAD'
$length = $req.GetResponse().ContentLength
$zip = [byte[]]::new($length)
# get the last 10k:
# how to get the correct length of the central ZIP directory here?
$start = $length-10kb
$end = $length-1
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$last10kb = $result.content.ReadAsByteArrayAsync().Result
$last10kb.CopyTo($zip, $start)
# get the block containing the DLL file:
# how to get the exact file-offset from the ZIP directory?
$start = $length-3537kb
$end = $length-3201kb
$client.DefaultRequestHeaders.Clear()
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$block = $result.content.ReadAsByteArrayAsync().Result
$block.CopyTo($zip, $start)
# extract the DLL file from archive:
Add-Type -AssemblyName System.IO.Compression
$stream = [System.IO.Memorystream]::new()
$stream.Write($zip,0,$zip.Length)
$archive = [System.IO.Compression.ZipArchive]::new($stream)
$entry = $archive.GetEntry($sub)
$bytes = [byte[]]::new($entry.Length)
[void]$entry.Open().Read($bytes, 0, $bytes.Length)
# check MD5:
$prov = [Security.Cryptography.MD5CryptoServiceProvider]::new().ComputeHash($bytes)
$hash = [string]::Concat($prov.foreach{$_.ToString("x2")})
if ($hash -ne $md5) {write-host 'dll has wrong checksum.' -f y ;break}
# load the DLL:
[void][System.Reflection.Assembly]::Load($bytes)
# use the single demo-call from the DLL:
$test = [Renci.SshNet.NoneAuthenticationMethod]::new('test')
'done.'

Resources