How to pack (without compressing) files using multiple cores - zip

Hello, and thanks in advance.
I have over 2,000 files that I need to archive in DIVA, so I pack them first.
(I found out how to archive over 1,000 files in DIVA, but that isn't possible now.)
Also, a few years ago there was an incident where a compressed file turned out to be corrupted, so since then I have had to pack files without compression.
The problem is that I normally use .tar, and when I pack files, Mac and Windows only use one core, so it takes too long (packing 1 TB takes over 10 hours).
In Bandizip I can select options when creating a .zip (no compression, CPU thread number), but it still uses only one core. I guess the no-compression option only supports a single core.
According to Google search results, Linux supports multiple cores when packing a .tar.
Is that really possible with no compression? And how can I create a .tar or .zip with no compression using multiple cores on Linux, Windows, or Mac?

You can use pigz instead of gzip to compress your tar files to tar.gz. pigz can use all of your cores to speed up compression.
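For example, on Linux or macOS you could pipe an uncompressed tar stream through pigz. Here is a rough Python sketch of that pipeline; the directory name, output path, and thread count are placeholders, and GNU tar plus pigz are assumed to be installed:
import subprocess

# Roughly equivalent to the shell pipeline: tar -cf - data/ | pigz -p 8 > archive.tar.gz
with open("archive.tar.gz", "wb") as out:
    tar = subprocess.Popen(["tar", "-cf", "-", "data/"], stdout=subprocess.PIPE)
    pigz = subprocess.Popen(["pigz", "-p", "8"], stdin=tar.stdout, stdout=out)
    tar.stdout.close()  # let tar receive SIGPIPE if pigz exits early
    pigz.wait()
    tar.wait()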

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so I can download it easily.
What is the simplest possible archiving format to implement, so I can easily roll my own implementation in the script? It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However, if a suitable format already exists, that would save me from having to roll my own unpacking as well as packing; I would only need the packing side, which would be nice. Bonus points if it's zip-compatible, so that I can try zipping with compression where possible and implement my own uncompressed version otherwise, without having to worry about which it is on the other side.
The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in PowerShell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data etc.
The header is pure ASCII, so it doesn't require any bit manipulation - you can literally append strings. Once you've written the header, you append the file bytes and pad them with NUL characters until the total is a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).
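For illustration, here is a rough Python sketch of that approach: build a 512-byte ustar header out of ASCII/octal fields, append the file data, pad to a 512-byte boundary, and end the archive with two zero-filled blocks. The file name and contents are made up, and it only handles short ASCII names for regular files:
import time

def tar_header(name, size):
    # Minimal POSIX ustar header: 512 bytes, numeric fields stored as octal ASCII.
    buf = bytearray(512)
    buf[0:len(name)] = name.encode("ascii")              # file name (<= 100 chars)
    buf[100:108] = b"0000644\x00"                        # mode
    buf[108:116] = b"0000000\x00"                        # uid
    buf[116:124] = b"0000000\x00"                        # gid
    buf[124:136] = ("%011o" % size).encode() + b"\x00"   # file size in octal
    buf[136:148] = ("%011o" % int(time.time())).encode() + b"\x00"  # mtime
    buf[148:156] = b" " * 8                              # checksum field starts as spaces
    buf[156] = ord("0")                                  # typeflag '0' = regular file
    buf[257:263] = b"ustar\x00"                          # magic
    buf[263:265] = b"00"                                 # version
    buf[148:156] = ("%06o" % sum(buf)).encode() + b"\x00 "  # real checksum over the header
    return bytes(buf)

with open("out.tar", "wb") as f:
    data = b"hello world\n"                              # made-up file contents
    f.write(tar_header("dir1/file1", len(data)))
    f.write(data)
    f.write(b"\x00" * (-len(data) % 512))                # pad data to a 512-byte boundary
    f.write(b"\x00" * 1024)                              # archive ends with two zero blocks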

Downloading & extracting in parallel, maximizing performance?

I want to download and extract 100 tar.gz files that are each 1GB in size. Currently, I've sped it up with multithreading and by avoiding disk IO via in-memory byte streams, but can anyone show me how to make this faster (just for curiosity's sake)?
from bs4 import BeautifulSoup
import requests
import tarfile
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

# speed up by only extracting what we need
def select(members):
    for file in members:
        if any(ext in file.name for ext in [".tif", ".img"]):
            yield file

# for each url download the tar.gz and extract the necessary files
def download_and_extract(x):
    # read and unzip as a byte stream
    r = requests.get(x, stream=True)
    tar = tarfile.open(fileobj=r.raw, mode='r|gz')
    tar.extractall(members=select(tar))
    tar.close()

# parallel download and extract the 96 1GB tar.gz files
links = get_asset_links()

# 3 * cpu count seemed to be fastest on a 4 core cpu
with ThreadPoolExecutor(3 * mp.cpu_count()) as executor:
    executor.map(download_and_extract, links)
My current approach takes 20 - 30 minutes. I'm not sure what the theoretically possible speed-up is, but if it's helpful, the download speed for a single file is 20 MB/s in isolation.
If anyone could indulge my curiosity, that would be greatly appreciated! Some things I looked into were asyncio, aiohttp, aiomultiprocess, io.BytesIO, etc., but I wasn't able to get them to work well with the tarfile library.
Your computation is likely IO bound. Decompression is generally a slow task too, especially with the gzip format (newer algorithms can be much faster). From the provided information, the average read speed is about 70 MB/s, which means the storage throughput is at least roughly 140 MB/s. That looks completely normal and expected, especially if you are using an HDD or a slow SSD.
Besides this, it seems you iterate over the files twice because of the member selection. Keep in mind that a tar.gz file is one big block of files packed together and then compressed with gzip: to iterate over the file names, the tar stream already has to be (at least partially) decompressed. This may not be a problem depending on how tarfile is implemented (possible caching). If the total size of the discarded files is small, it may be better to simply decompress the whole archive in one go and then delete the files you don't need. Moreover, if you have a lot of memory and the discarded files are not small, you can extract everything to an in-memory virtual storage device first, so the unwanted files never touch the real storage. This can be done natively on Linux systems.
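As a rough illustration of that last idea on Linux, the sketch below stages the extraction under /dev/shm (a tmpfs, i.e. an in-memory filesystem available on most Linux systems) and then moves only the wanted files to their destination; the paths and extensions are placeholders:
import pathlib
import shutil
import tarfile

def extract_and_prune(archive_path, dest_dir):
    scratch = pathlib.Path("/dev/shm/extract_scratch")   # in-memory staging area (tmpfs)
    scratch.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        tar.extractall(scratch)                          # one sequential decompression pass
    for path in scratch.rglob("*"):
        if path.is_file() and path.suffix in (".tif", ".img"):
            target = pathlib.Path(dest_dir) / path.relative_to(scratch)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))          # keep only the wanted files
    shutil.rmtree(scratch)                               # everything else never hits the disk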

Handling archives with resource forks on non-HFS file-systems

I'm working on a website that is supposed to store compressed archive files for downloading, for different platforms (Mac and Windows).
Unfortunately, the Mac version of the download uses "resource forks", which I understand is a vendor-specific feature of the MacOS file system that attaches extra data to a file identifier. Previously, the only solution was to create the Mac archive (at that time a .sit archive, specifically) on a Mac, and manually upload both versions.
I would now like to let the website accept only the Windows file (a regular .zip that can be decompressed on any file-system), and generate a Mac archive with resource forks automatically. Basically, all I need is some way to generate an archive file on the Linux server (in any reasonably common format that can support resource forks; not sure if .sit is still the best option) that will yield the correct file structure when decompressed on Mac. As the file system doesn't support forks, the archive probably has to be assembled in memory and written to disk, rather than using any native compression tool.
Is there some software that can do this, or at least some format specification that would allow implementing it from scratch?
(1) Resource (and other "named") forks are legacy technology in macOS. While still supported, no modern software uses resource forks for anything substantial. I'd first suggest reviewing your requirements to see if this is even necessary anymore.
(2) macOS has long settled on .zip as the standard / built-in archive format. .sit was a third-party compression application (StuffIt) that has drifted out of favor.
(3) Resource forks are translated to non-native filesystems using a naming convention. For example, let's say the file Chart.jpg has a resource fork. When macOS writes this to a filesystem that doesn't support named forks, it creates two files: Chart.jpg and ._Chart.jpg, with the latter containing the resource fork and metadata. Typically all that's required is for the .zip file to contain these two files, and macOS's unarchiving utility will reassemble the original file with both forks.
I found some files with resource forks and compressed them using macOS's built-in compression command. Here's the content of the archive (unzip -v Archive.zip):
Archive:  /Users/james/Development/Documentation/Archive.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
 1671317  Defl:N  1108973  34% 12-19-2009 12:09 b1b6083c  svn-book.pdf
       0  Stored        0   0% 01-30-2018 12:59 00000000  __MACOSX/
     263  Defl:N      157  40% 12-19-2009 12:09 9802493b  __MACOSX/._svn-book.pdf
     265  Defl:N      204  23% 06-01-2007 23:49 88130a77  Python Documentation.webloc
     592  Defl:N      180  70% 06-01-2007 23:49 f41cd5d1  __MACOSX/._Python Documentation.webloc
--------          -------  ---                            -------
 1672437          1109514  34%                            5 files
So it appears that the special filenames are being sequestered in an invisible __MACOSX subfolder. All you'd have to do is generate a .zip file with the same structure and it would be reassembled on a macOS system into a native file with a resource fork.
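For example, with Python's zipfile module you could emit that structure directly on the Linux server. A rough sketch, assuming you already have the AppleDouble payload (the bytes that normally live in the ._ file) for each entry; the names and contents below are made up:
import zipfile

files = {"Chart.jpg": b"...jpeg data..."}            # hypothetical data forks
forks = {"Chart.jpg": b"...AppleDouble bytes..."}    # hypothetical ._ payloads

with zipfile.ZipFile("mac-archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)                      # the regular data fork
        if name in forks:
            zf.writestr("__MACOSX/._" + name, forks[name])  # the resource fork / metadata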

How to modify a gzip compressed file

I have a single gzip-compressed file (100 GB uncompressed, 40 GB compressed). Now I would like to modify some bytes / ranges of bytes - I do NOT want to change the file's size.
For example
bytes 8 - 10 and bytes 5000 - 40000
Is this possible without recompressing the whole file?
Stefan
Whether you want to change the file size makes no difference (the compressed gzip stream isn't laid out according to the original byte offsets anyway). However, if you split the file into parts so that the ranges you want to modify sit in their own chunks, and use a multi-file archive format instead of the single-file gzip format, you could recompress just the changed chunks without decompressing and compressing the entire file.
In your example:
bytes1-7.bin         \
bytes8-10.bin         \  bytes.zip
bytes11-4999.bin      /
bytes5000-40000.bin  /
Then you could update bytes8-10.bin and bytes5000-40000.bin but not the other two. But whether this will take less time is dubious.
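A rough Python sketch of that layout, assuming a hypothetical source file data.bin: each byte range that might change later becomes its own zip member, so only that member needs recompressing on an update. The rest of the 100 GB file would be carved into further members the same way, which is omitted here.
import zipfile

ranges = [(0, 7), (7, 10), (10, 4999), (4999, 40000)]    # 0-based [start, end) offsets

with open("data.bin", "rb") as src, \
     zipfile.ZipFile("bytes.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for start, end in ranges:
        src.seek(start)
        zf.writestr("bytes%d-%d.bin" % (start + 1, end), src.read(end - start))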
In a word, no. It would be necessary to replace one or more deflate blocks with new blocks with exactly the same total number of bits, but with different contents. If the new data is less compressible with deflate, this becomes impossible. Even if it is more compressible, it would require a lot of bit twiddling by hand to try to get the bits to match. And it still might not be possible.
The man page for gzip says "If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip." I believe that means that gzip compression continues through the files, therefore is context-sensitive, and therefore will not permit what you want.
Either decompress/patch/recompress, or switch to a different representation of your data (perhaps an uncompressed tar or zip of individually compressed files, so you only have to decompress/recompress the one you want to change.) The latter will not store your data as compactly, in general, but that's the tradeoff you have to make.
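A rough sketch of that alternative representation, using hypothetical chunk files: gzip each chunk on its own and store the results in a plain, uncompressed tar, so a later change only requires recompressing the one affected chunk.
import gzip
import tarfile

chunks = ["bytes1-7.bin", "bytes8-10.bin", "bytes11-4999.bin", "bytes5000-40000.bin"]

for name in chunks:
    with open(name, "rb") as src, gzip.open(name + ".gz", "wb") as dst:
        dst.write(src.read())                    # compress each chunk independently

with tarfile.open("data.tar", "w") as tar:       # mode "w" = plain tar, no compression
    for name in chunks:
        tar.add(name + ".gz")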

Compress .ipa MonoTouch

Starting from the assumption that I have already deleted all unnecessary files, my app contains a folder of JPG images (1024*700 is the minimum permitted resolution) that is 400 MB in size. When I generate my .ipa, its size is 120 MB. I have tried converting those images to PNG and then generating the .ipa, but the size is more than 120 MB (140 MB) and the quality is a bit worse.
What best practices are recommended to reduce the size of the application?
P.S. Those files are shown as a gallery.
One tool we used in our game, Draw a Stickman: EPIC, is smusher.
To install it (you need Ruby or the Xcode command line tools):
sudo gem install smusher
It might print some errors during installation that you can ignore.
To use it:
smusher mypng.png
smusher myjpg.jpg
The tool will send the picture off to Yahoo's smush.it web service and compress the image in a non-lossy way.
Generally you can save maybe 20% file size with no loss in quality.
There are definitely other techniques we used like using indexed PNGs, but you are already using JPGs, which are smaller.
