Handling archives with resource forks on non-HFS file-systems - linux

I'm working on a website that is supposed to store compressed archive files for downloading, for different platforms (Mac and Windows).
Unfortunately, the Mac version of the download uses "resource forks", which I understand is a vendor-specific feature of the MacOS file system that attaches extra data to a file identifier. Previously, the only solution was to create the Mac archive (at that time a .sit archive, specifically) on a Mac, and manually upload both versions.
I would now like to let the website accept only the Windows file (a regular .zip that can be decompressed on any file-system), and generate a Mac archive with resource forks automatically. Basically, all I need is some way to generate an archive file on the Linux server (in any reasonably common format that can support resource forks; not sure if .sit is still the best option) that will yield the correct file structure when decompressed on Mac. As the file system doesn't support forks, the archive probably has to be assembled in memory and written to disk, rather than using any native compression tool.
Is there some software that can do this, or at least some format specification that would allow implementing it from scratch?

(1) Resource (and other "named") forks are legacy technology in macOS. While still supported, no modern software uses resource forks for anything substantial. I'd first suggest reviewing your requirements to see if this is even necessary anymore.
(2) macOS has long settled on .zip as the standard / built-in archive format. .sit was a third-party compression application (StuffIt) that has drifted out of favor.
(3) Resource forks are translated to non-native filesystems using a naming convention. For example, let's say the file Chart.jpg has a resource fork. When macOS writes this to a filesystem that doesn't support named forks, it creates two files: Chart.jpg and ._Chart.jpg, with the latter containing the resource fork and metadata. Typically all that's required is for the .zip file to contain these two files, and macOS's unarchiving utility will reassemble the original file, with both forks.
I found some files with resource forks and compressed them using macOS's built-in compression command. Here's the content of the archive (unzip -v Archive.zip):
Archive: /Users/james/Development/Documentation/Archive.zip
  Length  Method    Size  Cmpr     Date    Time    CRC-32   Name
--------  ------  -------  ----  ---------- -----  --------  ----
 1671317  Defl:N  1108973   34%  12-19-2009 12:09  b1b6083c  svn-book.pdf
       0  Stored        0    0%  01-30-2018 12:59  00000000  __MACOSX/
     263  Defl:N      157   40%  12-19-2009 12:09  9802493b  __MACOSX/._svn-book.pdf
     265  Defl:N      204   23%  06-01-2007 23:49  88130a77  Python Documentation.webloc
     592  Defl:N      180   70%  06-01-2007 23:49  f41cd5d1  __MACOSX/._Python Documentation.webloc
--------          -------  ----                              -------
 1672437          1109514   34%                              5 files
So it appears that the special filenames are being sequestered in an invisible __MACOSX subfolder. All you'd have to do is generate a .zip file with the same structure and it would be reassembled on a macOS system into a native file with a resource fork.
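To make this concrete, here is a minimal sketch in Python of how such an archive could be generated on a Linux server. The helper names (appledouble, write_mac_zip) are made up, and the AppleDouble container it builds carries only a resource-fork entry, whereas the ._ files macOS itself writes usually also include Finder info, so treat this as a starting point rather than a verified implementation:

import struct
import zipfile

def appledouble(resource_fork: bytes) -> bytes:
    """Build a minimal AppleDouble container holding only a resource fork (entry ID 2)."""
    # magic, version, 16 filler bytes, number of entries
    header = struct.pack(">II16xH", 0x00051607, 0x00020000, 1)
    # entry descriptor: ID 2 = resource fork; data starts at 38 = 26-byte header + 12-byte descriptor
    entry = struct.pack(">III", 2, 38, len(resource_fork))
    return header + entry + resource_fork

def write_mac_zip(zip_path, files):
    """files: iterable of (archive name, data fork bytes, resource fork bytes or None)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data, rsrc in files:
            zf.writestr(name, data)                               # the ordinary data fork
            if rsrc:                                              # sidecar entry the unarchiver recombines
                zf.writestr("__MACOSX/._" + name, appledouble(rsrc))

# Hypothetical usage: pack one file together with separately obtained resource fork bytes.
write_mac_zip("Archive.zip", [("Chart.jpg", b"jpeg data", b"resource fork bytes")])

For files in subdirectories, macOS-produced archives mirror the directory structure under __MACOSX (e.g. __MACOSX/somedir/._file), so the sidecar name would need the same treatment.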

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so I can download it easily.
What is the simplest possible archiving format to implement, so I can easily roll my own implementation in the script? It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However, if a suitable format already exists, I only have to roll my own packing, not unpacking, which would be nice. Bonus points if it's zip-compatible, so that I can zip it with compression where possible, implement my own uncompressed version otherwise, and not have to worry about which it is on the other side.
The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in PowerShell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data, etc.
The header is plain ASCII, so it doesn't require any bit manipulation - you can literally append strings. Once you've written the 512-byte header, you append the file bytes and pad them with NUL characters until the length is a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).
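To illustrate that scheme, here is a minimal ustar writer sketched in Python (the function names are made up; it handles only regular files with paths up to 100 characters and ignores ownership and timestamps), intended to show the header layout and the 512-byte padding rather than to replace a real tar implementation:

import io

def tar_header(name: str, size: int) -> bytes:
    """Build a 512-byte ustar header for a regular file with default permissions."""
    buf = bytearray(512)
    buf[0:len(name)] = name.encode("ascii")                      # file name, NUL padded (max 100 chars here)
    buf[100:108] = b"0000644\x00"                                # mode, octal ASCII
    buf[108:116] = b"0000000\x00"                                # uid
    buf[116:124] = b"0000000\x00"                                # gid
    buf[124:136] = ("%011o" % size).encode("ascii") + b"\x00"    # file size, octal ASCII
    buf[136:148] = b"00000000000\x00"                            # mtime (epoch)
    buf[148:156] = b" " * 8                                      # checksum field counts as spaces for now
    buf[156:157] = b"0"                                          # typeflag: regular file
    buf[257:263] = b"ustar\x00"                                  # magic
    buf[263:265] = b"00"                                         # version
    checksum = sum(buf)                                          # simple byte sum, stored as octal ASCII
    buf[148:156] = ("%06o" % checksum).encode("ascii") + b"\x00 "
    return bytes(buf)

def make_tar(entries) -> bytes:
    """entries: iterable of (path, data bytes). Returns an uncompressed tar stream."""
    out = io.BytesIO()
    for path, data in entries:
        out.write(tar_header(path, len(data)))
        out.write(data)
        out.write(b"\x00" * (-len(data) % 512))                  # pad data to a 512-byte boundary
    out.write(b"\x00" * 1024)                                    # two zero blocks mark the end of the archive
    return out.getvalue()

open("out.tar", "wb").write(make_tar([("dir1/file1", b"hello\n")]))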

How to create a compressed dmg file using Linux tools?

There are lots of questions about how to create a DMG file from Linux, but none of them is clear about how to add compression.
I usually create a DMG package to redistribute to macOS users, but I would like to add compression as Apple specifies.
Has anyone had a chance to try a tool that supports compression during DMG packing?
Similar questions without compression:
How can I generate a DMG file from a folder in Linux?
How to create dmg file in Centos through command line
It depends. Here is a good analysis of the .dmg format. From that document, the compression is specified for each block chunk:
Table: DMG blxx types
Type        Scheme     Meaning
0x00000000  ---        Zero-Fill
0x00000001  UDRW/UDRO  RAW or NULL compression (uncompressed)
0x00000002  ---        Ignored/unknown
0x80000004  UDCO       Apple Data Compression (ADC)
0x80000005  UDZO       zLib data compression
0x80000006  UDBZ       bz2lib data compression
0x7ffffffe  ---        No blocks - Comment: +beg and +end
0xffffffff  ---        No blocks - Identifies last blxx entry
To convert the gist of @Valerio's comment into a proper answer so nobody else misses it like I almost did:
According to this answer by @uckelman:
Install libdmg-hfsplus.
Run dmg uncompressed.dmg compressed.dmg
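If you need to drive that conversion from a script, a trivial Python wrapper might look like this; it assumes the dmg binary built from libdmg-hfsplus is on the PATH and accepts the two-argument invocation quoted above, and both file names are placeholders:

import subprocess

# Convert an uncompressed DMG into a compressed one with libdmg-hfsplus's dmg tool.
subprocess.run(["dmg", "uncompressed.dmg", "compressed.dmg"], check=True)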

How can I print/log the checksum calculated by rsync?

I have to transfer millions of files of very different sizes, totaling almost 100 TB, between two Linux servers. It's easy to do it the first time with rsync, and quite safe, because the data can be checksummed.
However, I need to keep a list of files and their checksum to do some checks regularly in the future.
Is there a way to tell rsync to print/log the checksum of the file?
And in case this is not feasible: Which tool/command would you recommend considering that performance is very important?
Thanks in advance!
It is possible to include the transfer md5 checksum in logging since rsync 3.1.0 (released on 28 Sep 2013):
Added the "%C" escape to the log-output handling, which will output the
MD5 checksum of any transferred file, or all files if --checksum was
specified (when protocol 30 or above is in effect).
For example, the log format %i %f B:%l md5:%C will log each transfer similar to
>f+++++++++ 00/64235/0664eccc-364e-11e2-af18-57a6d04fd4d5 B:16035388 md5:8ab769aa5224514a41cee0e3e2fe3aad
Take note that this is the md5 sum calculated to verify transfer integrity - it is available even for transfers without the --checksum flag.
This change also makes it possible to log the checksum when just one side of the transfer runs 3.1.0 or newer. For example, you can have a newer rsync daemon on the target machine do the checksum logging while sending with an older rsync client, as long as MD5 is used (3.0.0 or newer).
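As an illustration, here is a small Python sketch that runs such a transfer and collects the per-file MD5s from the %C output. It assumes rsync 3.1.0 or newer, the paths are placeholders, and the parsing is deliberately naive (it breaks on file names containing " B:" or " md5:"):

import subprocess

cmd = [
    "rsync", "-a", "--checksum",
    "--out-format=%i %f B:%l md5:%C",
    "/data/", "backup-host:/data/",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)

checksums = {}
for line in result.stdout.splitlines():
    if " md5:" not in line:
        continue                                       # skip messages without a checksum
    info, md5 = line.rsplit(" md5:", 1)
    name = info.split(" ", 1)[1].rsplit(" B:", 1)[0]   # strip the %i prefix and the B:<length> suffix
    checksums[name] = md5

for name, md5 in sorted(checksums.items()):
    print(md5, name)                                   # e.g. save this list for later re-checks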

Check ISO is valid or not

Is there a way in C# to check whether an ISO file is valid, i.e. in a valid ISO format? Is any other check possible?
The scenario is this: a text file (or a file in any other format) is renamed to .iso and handed over for further processing. I want to check whether this ISO file is a valid ISO file or not. Is there any way to do this programmatically, such as checking a property of the file, the file header, or anything else?
Thanks in advance for any reply
To quote the wiki gods:
There is no standard definition for ISO image files. ISO disc images
are uncompressed and do not use a particular container format; they
are a sector-by-sector copy of the data on an optical disc, stored
inside a binary file. ISO images are expected to contain the binary
image of an optical media file system (usually ISO 9660 and its
extensions or UDF), including the data in its files in binary format,
copied exactly as they were stored on the disc. The data inside the
ISO image will be structured according to the file system that was
used on the optical disc from which it was created.
reference
So you basically want to detect whether a file is an ISO file at all, rather than check whether it is valid (e.g. incomplete, corrupted, ...)?
There's no easy way to do that and there certainly is not a C# function (that I know of) that can do this.
The best way to approach this is to guess the number of bytes per block stored in the ISO.
Guess, or simply try all possible situations one by one, unless you have an associated CUE file that actually stores this information. PS. If the ISO is accompanied by a same-name .CUE file then you can be 99.99% sure that it's an ISO file anyway.
Sizes would be 2048 (user data) or 2352 (raw or audio) bytes per block. Other sizes are possible as well! I just mentioned the two most common ones. In the case of 2352 bytes per block, the user data starts at an offset in that block, usually 16 or 24 depending on the mode.
Next I would try to detect the CD/DVD file-systems. Assume that the image starts at sector 0 (although you could for safety implement a scan that assumes -150 to 16 for instance).
You'll need to look into the specifics of ISO9660 and UDF for that. Sectors 16, 256 etc. will be interesting sectors to check!
Bottom line, it's not an easy task and you will need to familiarize yourself with optical disc layouts and optical disc file-systems (ISO9660, UDF, but possibly also HFS and even FAT on BD).
If you're digging into this, I strongly suggest getting IsoBuster (www.isobuster.com) to help you see what the size per block is, what file systems there are, and to inspect the different key blocks etc.
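The question asks about C#, but the detection logic described above is easy to sketch; here it is in Python (the function name is made up, only the two common block sizes mentioned above are tried, and it detects ISO 9660 only, not UDF-only or HFS-only images):

def looks_like_iso9660(path: str) -> bool:
    # (block size, user data offset) layouts to try: 2048 cooked, 2352 raw mode 1 / mode 2 form 1
    layouts = [(2048, 0), (2352, 16), (2352, 24)]
    with open(path, "rb") as f:
        for block_size, data_offset in layouts:
            f.seek(16 * block_size + data_offset)   # sector 16 holds the first volume descriptor
            descriptor = f.read(7)
            if len(descriptor) == 7 and descriptor[1:6] == b"CD001":
                return True
    return False

print(looks_like_iso9660("image.iso"))

A corresponding C# version would do the same seeks and compare the five bytes at offset 1 of the descriptor against "CD001".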
In addition to the answers above (and especially @peter's answer): I recently made a very simple Python tool for the detection of truncated/incomplete ISO images. It definitely does not do validation (which, as @Jake1164 correctly points out, is impossible), but it may nevertheless be useful for some scenarios. It also supports ISO images that contain Apple (HFS) partitions. For more details see the following blog post:
Detecting broken ISO images: introducing Isolyzer
And the software's Github repo is here:
Isolyzer
You may run the md5sum command to check the integrity of an image.
For example, here's a list of ISO: http://mirrors.usc.edu/pub/linux/distributions/centos/5.4/isos/x86_64/
You may run:
md5sum CentOS-5.4-x86_64-LiveCD.iso
The output is supposed to be the same as 1805b320aba665db3e8b1fe5bd5a14cc, which you can find here:
http://mirrors.usc.edu/pub/linux/distributions/centos/5.4/isos/x86_64/md5sum.txt
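For scripted verification, a short Python sketch along these lines computes the MD5 in chunks and compares it against the published value (the file name and expected hash are the ones from the links above):

import hashlib

expected = "1805b320aba665db3e8b1fe5bd5a14cc"

md5 = hashlib.md5()
with open("CentOS-5.4-x86_64-LiveCD.iso", "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):   # read in 1 MiB chunks to bound memory use
        md5.update(chunk)

print("OK" if md5.hexdigest() == expected else "MISMATCH")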

How much disk space do shared libraries really save in modern Linux distros?

In the static vs. shared libraries debate, I've often heard that shared libraries eliminate duplication and reduce overall disk space. But how much disk space do shared libraries really save in modern Linux distros? How much more space would be needed if all programs were compiled using static libraries? Has anyone crunched the numbers for a typical desktop Linux distro such as Ubuntu? Are there any statistics available?
ADDENDUM:
All answers were informative and are appreciated, but they seemed to shoot down my question rather than attempt to answer it. Kaleb was on the right track, but he chose to crunch the numbers for memory space instead of disk space (my question was for disk space).
Because programs only "pay" for the portions of static libraries that they use, it seems practically impossible to quantitatively know what the disk space difference would be for all static vs all shared.
I feel like trashing my question now that I realize it's practically impossible to answer. But I'll leave it here to preserve the informative answers.
So that SO stops nagging me to choose an answer, I'm going to pick the most popular one (even if it sidesteps the question).
I'm not sure where you heard this, but reduced disk space is mostly a red herring as drive space approaches pennies per gigabyte. The real gain with shared libraries comes with security and bugfix updates for those libraries; applications using static libraries have to be individually rebuilt with the new libraries, whereas all apps using shared libraries can be updated at once by replacing only a few files.
Not only do shared libraries save disk space, they also save memory, and that's a lot more important. The prelinking step is important here... you can't share the memory pages between two instances of the same library unless they are loaded at the same address, and prelinking allows that to happen.
Shared libraries do not necessarily save disk space or memory.
When an application links to a static library, only those parts of the library that the application uses will be pulled into the application binary. The library archive (.a) contains object files (.o), and if they are well factored, the application will use less memory by only linking with the object files it uses. Shared libraries will contain the whole library on disk and in memory whether parts of it are used by applications or not.
For desktop and server systems, this is less likely to result in a win overall, but if you are developing embedded applications, it's worth trying to statically link all the applications to see if that gives you an overall saving.
I was able to figure out a partial quantitative answer without having to do an obscene amount of work. Here is my (hare-brained) methodology:
1) Use the following command to generate a list of packages with their installed size and list of dependencies:
dpkg-query -Wf '${Package}\t${Installed-Size}\t${Depends}\n'
2) Parse the results and build a map of statistics for each package:
#include <map>
#include <string>

struct PkgStats
{
    PkgStats() : kbSize(0), dependentCount(0) {}

    int kbSize;
    int dependentCount;
};

typedef std::map<std::string, PkgStats> PkgMap;
Where dependentCount is the number of other packages that directly depend on that package.
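For reference, here is a rough Python sketch of steps 1 and 2 (the original presumably used a C++ program built around the struct above); the Depends parsing is simplified and ignores alternatives, architecture qualifiers, and version constraints:

import subprocess
from collections import defaultdict

out = subprocess.run(
    ["dpkg-query", "-Wf", r"${Package}\t${Installed-Size}\t${Depends}\n"],
    capture_output=True, text=True, check=True,
).stdout

kb_size = {}                          # package -> installed size in KB
dependent_count = defaultdict(int)    # package -> number of packages that directly depend on it

for line in out.splitlines():
    fields = line.split("\t")
    if len(fields) < 3:
        continue
    package, size, depends = fields[0], fields[1], fields[2]
    kb_size[package] = int(size) if size.strip().isdigit() else 0
    for dep in depends.replace("|", ",").split(","):
        name = dep.strip().split(" ")[0]            # drop version constraints like (>= 1.2)
        if name:
            dependent_count[name] += 1

# Dup'd MB = installed_size * (dependent_count - 1), for packages with more than one dependent
dup_mb = {p: kb_size.get(p, 0) * (n - 1) / 1024.0
          for p, n in dependent_count.items() if n > 1}
for p, mb in sorted(dup_mb.items(), key=lambda kv: -kv[1])[:20]:
    print(p, kb_size.get(p, 0), dependent_count[p], round(mb))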
Results
Here is the Top 20 list of packages with the most dependents on my system:
Package              Installed KB  # Deps  Dup'd MB
libc6                       10096     750      7385
python                        624     112        68
libatk1.0-0                   200      92        18
perl                        18852      48       865
gconf2                        248      34         8
debconf                       988      23        21
libasound2                   1428      19        25
defoma                        564      18         9
libart-2.0-2                  164      14         2
libavahi-client3              160      14         2
libbz2-1.0                    128      12         1
openoffice.org-core        124908      11      1220
gcc-4.4-base                  168      10         1
libbonobo2-0                  916      10         8
cli-common                    336       8         2
coreutils                   12928       8        88
erlang-base                  6708       8        46
libbluetooth3                 200       8         1
dictionaries-common          1016       7         6
where Dup'd MB is the number of megabytes that would be duplicated if there were no sharing (= installed_size * (dependent_count - 1), for dependent_count > 1).
It's not surprising to see libc6 on top. :) BTW, I have a typical Ubuntu 9.10 setup with a few programming-related packages installed, as well as some GIS tools.
Some statistics:
Total installed packages: 1717
Average # of direct dependents: 0.92
Total duplicated size with no sharing (ignoring indirect dependencies): 10.25GB
Histogram of # of direct dependents (note logarithmic Y scale):
Note that the above totally ignores indirect dependencies (i.e. everything should be at least indirectly dependent on libc6). What I really should have done is build a graph of all dependencies and use that as the basis for my statistics. Maybe I'll get around to it sometime and post a lengthy blog article with more details and rigor.
OK, perhaps not an answer, but the memory savings are what I'd consider. The savings depend on the number of times a library is loaded after the first application, so let's find out how much is saved per library on this system with a quick script:
#!/bin/sh
# For each shared library currently mapped by a running process, count how many times
# it appears and multiply the extra copies by its on-disk size.
lastlib=""
cnt=1
lsof | grep 'lib.*\.so$' | awk '{print $9}' | sort | while read lib ; do
    if [ "$lastlib" = "$lib" ] ; then
        cnt=$((cnt + 1))
    else
        if [ -n "$lastlib" ] ; then
            size=$(ls -l "$lastlib" | awk '{print $5}')
            savings=$(( (cnt - 1) * size ))   # every mapping after the first is a would-be duplicate
            echo "$lastlib: $savings"
        fi
        cnt=1
    fi
    lastlib="$lib"
done
That gives us the savings per library, like so:
...
/usr/lib64/qt4/plugins/crypto/libqca-ossl.so: 0
/usr/lib64/qt4/plugins/imageformats/libqgif.so: 540640
/usr/lib64/qt4/plugins/imageformats/libqico.so: 791200
...
Then, the total savings:
$ ./checker.sh | awk '{total = total + $2}END{print total}'
263160760
So, roughly speaking on my system I'm saving about 250 Megs of memory. Your mileage will vary.
