Determining where the extra information from squashfs comes from - linux

I extracted the root file system from an IoT device, and I was able to peruse it using unsquashfs. I then changed a single byte in a single file, and recompressed it again using mksquashfs. When I inspect the two files, the original and the one I created, the output from binwalk is identical, except for the size. The original had a size of 1038570 bytes while the one I created had a size of 1086112. I have no idea where the extra data came from. Are there any tools or methods for determining what the difference is?

So it turns out I was missing a flag while creating the squashed file system.
When using the xz compressor method, you can specify another flag, -Xbcj, that further compresses optimized per architecture. Once I added this and chose arm as my architecture, the file size was the same

Related

Is there a way to dynamically determine the vhdSize flag?

I am using the MSIX manager tool to convert a *.msix (an application installer) to a *.vhdx so that it can be mounted in an Azure virtual machine. One of the flags that the tool requires is -vhdSize, which is in megabytes. This has proven to be problematic because I have to guess what the size should be based off the MSIX. I have ran into numerous creation errors due to too small of a vhdSize.
I could set it to an arbitrarily high value in order to get around these failures, but that is not ideal. Alternatively, guessing the correct size is an imprecise science and a chore to do repeatedly.
Is there a way to have the tool dynamically set the vhdSize, or am I stuck guessing a value that is both large enough to accommodate the file, but not too large as to waste disk space? Or, is there a better way to create a *.vhdx file?
https://techcommunity.microsoft.com/t5/windows-virtual-desktop/simplify-msix-image-creation-with-the-msixmgr-tool/m-p/2118585
There is an MSIX Hero app that could select a size for you, it will automatically check how big the uncompressed files are, add an extra buffer for safety (currently double the original size), and round it to the next 10MB. Reference from https://msixhero.net/documentation/creating-vhd-for-msix-app-attach/

Force rsync to compare local files byte by byte instead of checksum

I have written a Bash script to backup a folder. At the core of the script is an rsync instruction
rsync -abh --checksum /path/to/source /path/to/target
I am using --checksum because I neither want to rely on file size nor modification time to determine if the file in the source path needs to be backed up. However, most -- if not all -- of the time I run this script locally, i.e., with an external USB drive attached which contains the backup destination folder; no backup over network. Thus, there is no need for a delta transfer since both files will be read and processed entirely by the same machine. Calculating the checksums even introduces a speed down in this case. It would be better if rsync would just diff the files if they are both on stored locally.
After reading the manpage I stumbled upon the --whole-file option which seems to avoid the costly checksum calculation. The manpage also states that this is the default if source and destination are local paths.
So I am thinking to change my rsync statement to
rsync -abh /path/to/source /path/to/target
Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up? I definitely do not want to rely on file size or modification times to decide if a backup should take place.
UPDATE
Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced. So blindly rsync'ing all files in the source folder, e.g., by supplying --ignore-times as suggested in the comments, is not an option. It would create too many duplicate files and waste storage space. Keep also in mind that I am trying to reduce backup time and workload on a local machine. Just backing up everything would defeat that purpose.
So my question could be rephrased as, is rsync capable of doing a file comparison on a byte by byte basis?
Question: is rsync capable of doing a file comparison on a byte by byte basis?
Strictly speaking, Yes:
It's a block by block comparison, but you can change the block size.
You could use --block-size=1, (but it would be unreasonably inefficient and inappropriate for basically every)
The block based rolling checksum is the default behavior over a network.
Use the --no-whole-file option to force this behavior locally. (see below)
Statement 1. Calculating the checksums even introduces a speed down in this case.
This is why it's off by default for local transfers.
Using the --checksum option forces an entire file read, as opposed to the default block-by-block delta-transfer checksum checking
Statement 2. Will rsync now check local source and target files byte by byte or
       will it use modification time and/or size to determine if the source
file        needs to be backed up?
By default it will use size & modification time.
You can use a combination of --size-only, --(no-)ignore-times, --ignore-existing and
--checksum to modify this behavior.
Statement 3. I definitely do not want to rely on file size or modification times to decide if a        backup should take place.
Then you need to use --ignore-times and/or --checksum
Statement 4. supplying --ignore-times as suggested in the comments, is not an option
Perhaps using --no-whole-file and --ignore-times is what you want then ? This forces the use of the delta-transfer algorithm, but for every file regardless of timestamp or size.
You would (in my opinion) only ever use this combination of options if it was critical to avoid meaningless writes (though it's critical that it's specifically the meaningless writes that you're trying to avoid, not the efficiency of the system, since it wouldn't actually be more efficient to do a delta-transfer for local files), and had reason to believe that files with identical modification stamps and byte size could indeed be different.
I fail to see how modification stamp and size in bytes is anything but a logical first step in identifying changed files.
If you compared the following two files:
File 1 (local) : File.bin - 79776451 bytes and modified on the 15 May 07:51
File 2 (remote): File.bin - 79776451 bytes and modified on the 15 May 07:51
The default behaviour is to skip these files. If you're not satisfied that the files should be skipped, and want them compared, you can force a block-by-block comparison and differential update of these files using --no-whole-file and --ignore-times
So the summary on this point is:
Use the default method for the most efficient backup and archive
Use --ignore-times and --no-whole-file to force delta-change (block by block checksum, transferring only differential data) if for some reason this is necessary
Use --checksum and --ignore-times to be completely paranoid and wasteful.
Statement 5. Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced
Yes, but this can work however you want it to, it doesn't necessarily mean a full backup every time a file is updated, and it certainly doesn't mean that a full transfer will take place at all.
You can configure rsync to:
Keep 1 or more versions of a file
Configure it with a --backup-dir to be a full incremental backup system.
Doing it this way doesn't waste space other than what is required to retain differential data. I can verify that in practise as there would not be nearly enough space on my backup drives for all of my previous versions to be full copies.
Some Supplementary Information
Why is Delta-transfer not more efficient than copying the whole file locally?
Because you're not tracking the changes to each of your files. If you actually have a delta file, you can merge just the changed bytes, but you need to know what those changed bytes are first. The only way you can know this is by reading the entire file
For example:
I modify the first byte of a 10MB file.
I use rsync with delta-transfer to sync this file
rsync immediately sees that the first byte (or byte within the first block) has changed, and proceeds (by default --inplace) to change just that block
However, rsync doesn't know it was only the first byte that's changed. It will keep checksumming until the whole file is read
For all intents and purposes:
Consider rsync a tool that conditionally performs a --checksum based on whether or not the file timestamp or size has changed. Overriding this to --checksum is essentially equivalent to --no-whole-file and --ignore-times, since both will:
Operate on every file, regardless of time and size
Read every block of the file to determine which blocks to sync.
What's the benefit then?
The whole thing is a tradeoff between transfer bandwidth, and speed / overhead.
--checksum is a good way to only ever send differences over a network
--checksum while ignoring files with the same timestamp and size is a good way to both only send differences over a network, and also maximize the speed of the entire backup operation
Interestingly, it's probably much more efficient to use --checksum as a blanket option than it would be to force a delta-transfer for every file.
There is no way to do byte-by-byte comparison of files instead of checksum, the way you are expecting it.
The way rsync works is to create two processes, sender and receiver, that create a list of files and their metadata to decide with each other, which files need to be updated. This is done even in case of local files, but in this case processes can communicate over a pipe, not over a network socket. After the list of changed files is decided, changes are sent as a delta or as whole files.
Theoretically, one could send whole files in the file list to the other to make a diff, but in practice this would be rather inefficient in many cases. Receiver would need to keep these files in the memory in case it detects the need to update the file, or otherwise the changes in files need to be re-sent. Any of the possible solutions here doesn't sound very efficient.
There is a good overview about (theoretical) mechanics of rsync: https://rsync.samba.org/how-rsync-works.html

Integrity of two image files of different size running same os issue

I have a requirement. I have two virtual image files running light weight linux distribution (eg : slitaz),whose disk sizes are different. I want to check the integrity of the kernel running of these image files at a given point in time at block/sector level.
I have already accomplished the integrity check at file system level,by mounting the image to loop device and then accessing the required kernel files (vmlinuz and initrd) and hashing them and then comparing that hash with the genuine hash for these files.
Now I want to perform case to check integrity at block level,Here is what I did :
But is there a way to check the integrity in this case?
As we know that the contents at the block/sector level match for the part which belongs to the kernel in the two image files since they are running same linux distro.
I am unable to get block level information of the kernel resides to check for its integrity.Assuming my kernel files reside more than one block,how do i get info?
Any tool or any guidance in this is greatly appreciated.
If I understand correctly, does your question not simply become:
I need to know on which disk blocks exactly (within these file systems) the kernel files are located.
If that is the case (depending on the file system involved) you could probably use debugfs like described in this post.

How to estimate a file size from header's sector start address?

Suppose I have a deleted file in my unallocated space on a linux partition and i want to retrieve it.
Suppose I can get the start address of the file by examining the header.
Is there a way by which I can estimate the number of blocks to be analyzed hence (this depends on the size of the image.)
In general, Linux/Unix does not support recovering deleted files - if it is deleted, it should be gone. This is also good for security - one user should not be able to recover data in a file that was deleted by another user by creating huge empty file spanning almost all free space.
Some filesystems even support so called secure delete - that is, they can automatically wipe file blocks on delete (but this is not common).
You can try to write a utility which will open whole partition that your filesystem is mounted on (say, /dev/sda2) as one huge file and will read it and scan for remnants of your original data, but if file was fragmented (which is highly likely), chances are very small that you will be able to recover much of the data in some usable form.
Having said all that, there are some utilities which are trying to be a bit smarter than simple scan and can try to be undelete your files on Linux, like extundelete. It may work for you, but success is never guaranteed. Of course, you must be root to be able to use it.
And finally, if you want to be able to recover anything from that filesystem, you should unmount it right now, and take a backup of it using dd or pipe dd compressed through gzip to save space required.

How to make file sparse?

If I have a big file containing many zeros, how can i efficiently make it a sparse file?
Is the only possibility to read the whole file (including all zeroes, which may patrially be stored sparse) and to rewrite it to a new file using seek to skip the zero areas?
Or is there a possibility to make this in an existing file (e.g. File.setSparse(long start, long end))?
I'm looking for a solution in Java or some Linux commands, Filesystem will be ext3 or similar.
A lot's changed in 8 years.
Fallocate
fallocate -d filename can be used to punch holes in existing files. From the fallocate(1) man page:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place,
without using extra disk space. The minimum size of the hole
depends on filesystem I/O block size (usually 4096 bytes).
Also, when using this option, --keep-size is implied. If no
range is specified by --offset and --length, then the entire
file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the
need for extra disk space.
See --punch-hole for a list of supported filesystems.
(That list:)
Supported for XFS (since Linux 2.6.38), ext4 (since Linux
3.0), Btrfs (since Linux 3.7) and tmpfs (since Linux 3.5).
tmpfs being on that list is the one I find most interesting. The filesystem itself is efficient enough to only consume as much RAM as it needs to store its contents, but making the contents sparse can potentially increase that efficiency even further.
GNU cp
Additionally, somewhere along the way GNU cp gained an understanding of sparse files. Quoting the cp(1) man page regarding its default mode, --sparse=auto:
sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
But there's also --sparse=always, which activates the file-copy equivalent of what fallocate -d does in-place:
Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
I've finally been able to retire my tar cpSf - SOURCE | (cd DESTDIR && tar xpSf -) one-liner, which for 20 years was my graybeard way of copying sparse files with their sparseness preserved.
Some filesystems on Linux / UNIX have the ability to "punch holes" into an existing file. See:
LKML posting about the feature
UNIX file trunctation FAQ (search for F_FREESP)
It's not very portable and not done the same way across the board; as of right now, I believe Java's IO libraries do not provide an interface for this.
If hole punching is available either via fcntl(F_FREESP) or via any other mechanism, it should be significantly faster than a copy/seek loop.
I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied.
Making a file sparse would result in those sections being fragmented if they were ever re-used. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file.
You can use $ truncate -s filename filesize on linux teminal to create sparse file having
only metadata.
NOTE --Filesize is in bytes.
According to this article, it seems there is currently no easy solution, except for using FIEMAP ioctl. However, I don't know how you can make "non sparse" zero blocks into "sparse" ones.

Resources