Reclaim space in differential VHD - vpc

http://www.petri.co.il/virtual_creating_differencing_disks_with.htm
I followed these steps to create a differencing disk of the WSSv3 demo VHD from Microsoft. Well, some time has passed, I forgot that it was a differencing disk, and upon defragging, the VUD (Virtual Undo Disk) expanded to consume the remainder of the free space on my hard drive.
Other than committing these changes back to the original VHD file, is there any other way for me to shrink a VUD?
Thanks
[Update]
Unfortunately, the change history seems to keep every change to a file, even file fragmentation (why?).
org -> verA -> verB -> verC -> verD -> verE -> current.
A tool would be nice that collapses the history tree to something like org -> current, dropping/ignoring the change history in between, as well as defragmenting the change log for optimization.
[Update#2]
First, sorry for the extensions to my question.
Second, is it possible to shrink a differential disk by merging it with its differential parent disk?
Base.VHD
-> Child.VHD (Differential)
-> Grandchild.VHD (Differential)
In merging the Grandchild with the Child, would the size be [Child Size] + [Grandchild Size], or would it be something like [Child Size] + [Size of Actual File Differences in Grandchild]?
Thanks again.

The differencing disk recorded every change made by the defragmentation program, which is why it grew out of control. I doubt you can shrink it since it contains changes that have been made and not yet committed.
I think you are going to have to either commit the changes to the original VHD, or throw away all the changes.

Thanks Grant, you're correct. I was stuck with the bloated VHD to merge, but somehow managed to screw that up and lost the changes.
Here is what else I found.
http://www.invirtus.com/blog/?p=7
This is a great article explaining why differentials are so large. Apparently each byte is written into its own 512-byte sector, wasting tons of space.
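To get a feel for the scale of that overhead, here's a toy calculation assuming the simple "every changed byte ends up in its own 512-byte sector" model the article describes (the real VHD block layout has extra bookkeeping on top):
SECTOR = 512  # bytes per sector in the VHD format

def sectors_consumed(write_sizes):
    # Worst case: each write lands in its own sector, rounded up to 512 bytes
    return sum((size + SECTOR - 1) // SECTOR for size in write_sizes)

# 10,000 scattered one-byte changes (e.g. a defrag touching metadata everywhere)
print(sectors_consumed([1] * 10000) * SECTOR)  # 5,120,000 bytes stored for 10,000 bytes changed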
http://www.invirtus.com/downloads/Differencing_Disk_Discussion.ppt
This presentation explains how to use disk compression when storing differencing disks, and notes that undo disks utilize less space. In short, placing your differencing VHD or VUD in an NTFS-compressed folder will save you tons of space.
[Example]
I created a differencing disk for the WSSv3 image from Microsoft (5GB), booted it up and installed software. Just the boot process added 300MB to the VHD, installing TortoiseSVN (20MB) added 200MB, and installing WSPBuilderExtensions (800KB) added 1GB to the VHD.
The end result was a 1.5GB differential from installing 21MB of software. I merged it with the base, which resulted in only 29MB being added back to the parent.
I then created another differential VHD inside an NTFS-compressed folder, started it up and created a new WSS Web Application through Central Admin. The file size jumped up to 900MB, but only took 90MB on the file system due to the NTFS compression. I then created a VUD, renamed it to VHD and performed the same action. The file size increased to 300MB, which resulted in 12MB on the file system.
So yes, the differential VHD is highly inefficient and has no intelligence in it whatsoever, but the bloating allows for some nice compression.
For development you should also create a new VHD, attach it as a secondary drive and move your files there, since any and every file I/O is captured in the differencing or undo disk.

Related

Ext4 on magnetic disk: Is it possible to process an arbitrary list of files in a seek-optimized manner?

I have deduplicated storage of a few million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed from its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
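For reference, the path computation is simply (in Python; the storage root is a placeholder):
import hashlib
from pathlib import Path

def dedup_path(data: bytes, root: Path = Path("/storage")) -> Path:
    # Two-level hashed layout, e.g. e9/3a/e93ac67def11bbef905a7519efbe3aa7
    digest = hashlib.md5(data).hexdigest()
    return root / digest[0:2] / digest[2:4] / digest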
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks ("randomized", I assume, by the hashed directory layout).
My actual question is: is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given that they are stored on an ext4 partition on a magnetic disk (implying the use of Linux)?
Such optimization is of course only useful if there is a sufficient share of small files. So please don't worry too much about the size distribution of files. Without loss of generality, you may actually assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
extract metadata from media files (ID3, EXIF etc.) (files may be large, but only some small parts of the files are read, so they become effectively smaller)
compute smaller versions of all JPEG images to process them with a classifier
reading portions of the storage for compression and/or encryption (e.g. put all files newer than X and smaller than Y in a tar archive)
extract the headlines of all Word documents
recompute all MD5 hashes to verify data integrity
While researching this question, I learned of the FIBMAP ioctl command (e.g. mentioned here), which may be worth a shot, because the files will not be moved around and the results may be stored alongside the metadata. But I suppose that will only work as a sort criterion if the location of a file's inode correlates somewhat with the location of the contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.
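For concreteness, a sketch of the FIBMAP-based sort I have in mind (assuming root privileges, since FIBMAP needs CAP_SYS_RAWIO, and assuming that sorting by the first data block is a good enough proxy for head position; FIEMAP would be the more modern interface):
import array
import fcntl
import os

FIBMAP = 1  # from <linux/fs.h>

def first_block(path):
    # Ask the filesystem for the block number of the file's first data block
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = array.array('i', [0])  # block index 0 in, block number out
        fcntl.ioctl(fd, FIBMAP, buf, True)
        return buf[0]
    finally:
        os.close(fd)

def sort_by_disk_location(paths):
    return sorted(paths, key=first_block)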
A file (especially when it is large enough) is scattered over several blocks on the disk (look e.g. at the figure on the ext2 wiki page; it is still somewhat relevant for ext4, even if the details differ). More importantly, it could be in the page cache (so it won't require any disk access). So "sorting the file list by disk location" usually does not make much sense.
I recommend instead improving the code accessing these files. Look into system calls like posix_fadvise(2) and readahead(2).
If the files are really small (only hundreds of bytes each), it is probable that using something else (e.g. sqlite, some real RDBMS like PostgreSQL, or gdbm ...) could be faster.
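For instance, a minimal sketch of stuffing such small files into sqlite as blobs (the table layout here is just for illustration):
import sqlite3

conn = sqlite3.connect("smallfiles.db")
conn.execute("CREATE TABLE IF NOT EXISTS blobs (md5 TEXT PRIMARY KEY, data BLOB)")

def store(md5_hex, data):
    conn.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (md5_hex, data))
    conn.commit()

def load(md5_hex):
    row = conn.execute("SELECT data FROM blobs WHERE md5 = ?", (md5_hex,)).fetchone()
    return row[0] if row else None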
BTW, adding more RAM would enlarge the page cache, and so improve the overall experience. And replacing your HDD with an SSD would also help.
(see also linuxatemyram)
Is it possible to sort a list of files to optimize read speed / minimize seek times?
That is not really possible. File system fragmentation is not (in practice) important with ext4. Of course, backing up your whole file system (e.g. into some tar or cpio archive) and restoring it sequentially (after making a fresh file system with mkfs) might slightly lower fragmentation, but not by much.
You might optimize your file system settings (block size, cluster size, etc... e.g. various arguments to mke2fs(8)). See also ext4(5).
Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner?
If the list is not too long (otherwise, split it in chunks of several hundred files each), you might open(2) each file there and use readahead(2) on each such file descriptor (and then close(2) it). This would somehow prefill your page cache (and the kernel could reorder the required IO operations).
(I don't know how effective that is in your case; you need to benchmark.)
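For example, a sketch of that chunked prefetch in Python: there is no readahead(2) wrapper in the standard library, but os.posix_fadvise with POSIX_FADV_WILLNEED gives the kernel a similar prefetch hint (the chunk size is arbitrary):
import os

CHUNK = 256  # number of files to prefetch before reading them

def process_prefetched(paths, process):
    for i in range(0, len(paths), CHUNK):
        chunk = paths[i:i + CHUNK]
        fds = [os.open(p, os.O_RDONLY) for p in chunk]
        for fd in fds:
            # Hint the kernel to prefetch the whole file; it may reorder the IO
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        for fd, p in zip(fds, chunk):
            with os.fdopen(fd, "rb") as f:
                process(p, f.read())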
I am not sure there is a software solution to your issue. Your problem is likely IO-bound, so the bottleneck is probably the hardware.
Notice that on most current hard disks, the CHS addressing (as used by the kernel) is a "logical" addressing handled by the disk controller and is not much related to the physical geometry any more. Read about LBA, TCQ, NCQ (so today, the kernel has no direct influence on the actual mechanical movements of a hard disk head). I/O scheduling mostly happens in the hard disk itself (not so much in the kernel).

Cannot restore big file from Azure backup because of six-hour timeout

I am trying to restore a big file (~40GB) from Azure Backup. I can see my recovery point and mount it as a disk drive so I can copy/paste the file I need. The problem is that the copy takes approx. 8 hours, but the disk drive (recovery point) is automatically unmounted after 6 hours and the process fails consistently. I couldn't find any setting in the backup agent to increase this time limit.
Any thoughts how to overcome this?
You can extend the mount time by setting the timeout to a higher number of hours.
RecoveryJobTimeOut under "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Azure Backup\Config\CloudBackupProvider"
Type is DWORD; value is the number of hours.
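For example, from an elevated Python prompt using the standard winreg module (the 24-hour value is just an example; the key path and value name are the ones above):
import winreg

path = r"SOFTWARE\Microsoft\Windows Azure Backup\Config\CloudBackupProvider"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE) as key:
    # DWORD value = number of hours the recovery volume stays mounted
    winreg.SetValueEx(key, "RecoveryJobTimeOut", 0, winreg.REG_DWORD, 24)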
After some struggling I've found a workaround, so I'll post it here for others...
I've mounted the needed recovery point as a disk drive and started a file copy. It shows the standard Windows file copy progress dialog, which has a pause option. So after ~5.5 hours, just before the drive was unmounted, I paused the copy, unmounted the drive manually, mounted it again (getting another 6-hour slot), and then resumed the copy. Well, I don't think this is how Microsoft wanted me to work, but it gets the job done.
Happy restoring!
Try compressing a copy of the file into a .zip on the server, then download the (hopefully smaller) file.
Also, if you don't mind me asking, what the heck made a 40Gig file?

Bitbake build consumes more space

I recently started using BitBake for building Yocto. Every time I build, it consumes more space, and currently I'm running out of disk space. The images are not getting overwritten; a new set of files with a timestamp is created for every build. I have deleted old files from build/tmp/deploy/images/, but it doesn't make much difference to the free disk space. Are there any other locations from which I can delete stuff?
The error I observe during build is:
WARNING: The free space of source/build/tmp (/dev/sda4) is running low (0.999GB left)
ERROR: No new tasks can be executed since the disk space monitor action is "STOPTASKS"!
WARNING: The free space of source/build/sstate-cache (/dev/sda4) is running low (0.999GB left)
ERROR: No new tasks can be executed since the disk space monitor action is "STOPTASKS"!
WARNING: The free space of source/build/downloads (/dev/sda4) is running low (0.999GB left)
ERROR: No new tasks can be executed since the disk space monitor action is "STOPTASKS"!
Kindly suggest some pointers to avoid this issue.
In order of effectiveness and how easy the fix is (a quick sketch for checking how much space each of these locations takes follows the list):
Buy more disk space: Putting $TMPDIR on an SSD of its own helps a lot and removes the need to micromanage.
Delete $TMPDIR (build/tmp): old images, old packages and work directories/sysroots for MACHINEs you aren't currently building for accumulate and can take up quite a lot of space. You can normally just delete the whole $TMPDIR once in a while: as long as you're using the sstate-cache, the next build should still be pretty fast.
Delete $SSTATE_DIR (build/sstate-cache): If you do a lot of builds sstate itself accumulates over time. Deleting the directory is safe but the next build will take a long time as everything will be rebuilt.
Delete $DL_DIR (build/downloads): if you use a build directory for a long time (while pulling updates from master or changing to a newer branch), obsolete downloads keep taking up disk space. Keep in mind that deleting the directory will mean re-downloading everything. Looking at just the largest files and deleting the old versions may be a useful compromise here.
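To decide which of these is worth cleaning, a quick sketch that reports how much space each of the usual locations takes (the paths assume the default build layout):
import os

def dir_size(path):
    # Total size in bytes of all regular files under path
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

for d in ("build/tmp", "build/sstate-cache", "build/downloads"):
    if os.path.isdir(d):
        print(f"{d}: {dir_size(d) / 2**30:.1f} GiB")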
There are some official ways to reclaim space instead of deleting things by hand.
By deliberately deleting things you could be forcing unnecessary rebuilds and downloads. Some elements of the build may not be controlled by BitBake, and you can find yourself in a situation where you cannot rebuild those items easily.
With these recommendations, you can beat the unwritten 50GB-per-build Yocto rule:
Check your IMAGE_FSTYPES variable. In my experience it is safe to delete all image files of these types that are not symlinks or symlink targets. Keep the last one generated, to avoid breaking the link to the latest build, and keep anything related to bootloaders and configuration files, as they may be regenerated only rarely.
If you are keeping more than one build with the same set of layers, then you can use a common download folder for builds.
DL_DIR ?= "common_dir_across_all_builds/downloads/"
And afterwards:
To keep your /deploy clean:
RM_OLD_IMAGE: reclaims disk space by removing previously built versions of the same image from the images directory pointed to by the DEPLOY_DIR variable. Set this variable to "1" in your local.conf file to remove these images:
RM_OLD_IMAGE = "1"
IMAGE_FSTYPES: remove the image types that you do not plan to use; you can always enable a particular one when you need it:
IMAGE_FSTYPES_remove = "tar.bz2"
IMAGE_FSTYPES_remove = "rpi-sdimg"
IMAGE_FSTYPES_remove = "ext3"
For tmp/work, you do not need the work files of all recipes. You can specify which ones you are interested in keeping for your development.
RM_WORK_EXCLUDE:
With rm_work enabled, this variable specifies a list of recipes whose work directories should not be removed. See the "rm_work.bbclass" section for more details.
INHERIT += "rm_work"
RM_WORK_EXCLUDE += "home-assistant widde"

How to most easily make never ending incremental offline backups

I have for some time been thinking that I can save some money on external hard drives with this backup scheme: if I have 3TB of data to back up, where less than 1TB changes from one backup to the next, and I always want to have one copy out of the house, it should be enough to have three 2TB external hard drives. The idea is that each time a disk is used for a backup it is completely filled; a full backup is, however, never made on a single disk, as 3TB >> 2TB.
So the backup starts by taking disk1 and filling it with 2TB of data. Then take disk2 and fill it with the remaining 1TB of data plus 1TB of redundant data that also exists on disk1. Now disk1 and disk2 can be taken out of the house.
When the next backup is made, disk2 will already contain 2TB of data, of which at least 2TB - 1TB = 1TB is still valid, since only 1TB has changed. So by backing up 2TB of data (some of which may also exist on disk2) to disk3, we have a complete backup on disk2 + disk3. Now disk3 can be moved out of the house, and disk1 can be moved back in, wiped and reused for the next backup.
This can of course be refined so that we can use disks of different sizes, a different number of disks, a higher required number of copies out of the house, etc.
In theory it is quite easy to implement by storing checksums of which files are on which disks, so we can detect changes by comparing checksums.
However, in practice there are a lot of cases to handle: running out of disk space, hard links, soft links, file permissions, file ownership, etc.
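To make the checksum idea concrete, here is a rough sketch of the selection step (the per-disk JSON manifest format is made up, and all the hard cases above are left out):
import hashlib
import json
import os

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def files_to_copy(source_root, manifest_paths):
    # Yield files whose checksum is not covered by the disks that stay offsite
    covered = set()
    for m in manifest_paths:  # one JSON manifest per offsite disk: {path: md5}
        with open(m) as f:
            covered.update(json.load(f).values())
    for root, _dirs, files in os.walk(source_root):
        for name in files:
            path = os.path.join(root, name)
            if file_md5(path) not in covered:
                yield path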
I've tried to find existing backup programs that can do this but I have not found any.
So my question is: How do I most easily do this? Writing it from scratch would probably take too much time. So I was wondering if I could put it on top of something existing. Any ideas?
If I were in your shoes I would buy an external hard drive that is large enough to hold all your data.
Then write a Bash script that would:
Mount the external hard drive
Execute rsync to back up everything that has changed
Unmount the external hard drive
Send me a message (e-mail or whatever) letting me know the backup is complete
So you'd plug in your external drive, execute the Bash script and then return the external hard drive to a safe deposit box at a bank (or other similarly secure location).
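A rough sketch of such a script, written here in Python around subprocess calls rather than Bash (the device, mount point and source paths are placeholders):
import subprocess

DEVICE = "/dev/sdX1"        # placeholder: the external drive's partition
MOUNTPOINT = "/mnt/backup"  # placeholder mount point
SOURCE = "/home/"           # what to back up

def run(*cmd):
    subprocess.run(cmd, check=True)

run("mount", DEVICE, MOUNTPOINT)
try:
    # -a preserves permissions/ownership/links; --delete mirrors deletions
    run("rsync", "-a", "--delete", SOURCE, MOUNTPOINT)
finally:
    run("umount", MOUNTPOINT)
print("Backup complete")  # replace with your preferred notification (e-mail or whatever)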

How to estimate a file size from header's sector start address?

Suppose I have a deleted file in unallocated space on a Linux partition and I want to retrieve it.
Suppose I can get the start address of the file by examining the header.
Is there a way by which I can estimate the number of blocks to be analyzed from there (this depends on the size of the image)?
In general, Linux/Unix does not support recovering deleted files - if a file is deleted, it should be gone. This is also good for security - one user should not be able to recover data from a file that was deleted by another user by creating a huge empty file spanning almost all free space.
Some filesystems even support so-called secure delete - that is, they can automatically wipe file blocks on delete (but this is not common).
You can try to write a utility which opens the whole partition that your filesystem is mounted on (say, /dev/sda2) as one huge file, reads it and scans for remnants of your original data, but if the file was fragmented (which is highly likely), the chances are very small that you will be able to recover much of the data in a usable form.
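A rough sketch of such a scanning utility (run as root; the JPEG start-of-image marker is just an example of a signature you might look for):
import sys

SIGNATURE = b"\xff\xd8\xff"  # example: JPEG start-of-image marker
CHUNK = 1 << 20              # read 1 MiB at a time

def scan(device):
    # Print byte offsets on the raw device where the signature occurs
    offset = 0
    tail = b""
    with open(device, "rb") as dev:
        while True:
            block = dev.read(CHUNK)
            if not block:
                break
            data = tail + block
            pos = data.find(SIGNATURE)
            while pos != -1:
                print(offset - len(tail) + pos)
                pos = data.find(SIGNATURE, pos + 1)
            tail = data[-(len(SIGNATURE) - 1):]
            offset += len(block)

if __name__ == "__main__":
    scan(sys.argv[1])  # e.g. python scan.py /dev/sda2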
Having said all that, there are some utilities which try to be a bit smarter than a simple scan and can attempt to undelete your files on Linux, like extundelete. It may work for you, but success is never guaranteed. Of course, you must be root to be able to use it.
And finally, if you want to be able to recover anything from that filesystem, you should unmount it right now and take a backup of it using dd, or pipe dd through gzip to save the space required.
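For example, the dd-through-gzip variant driven from a small script (the device name is a placeholder; run as root):
import subprocess

# Image the partition (placeholder name) into a compressed file
with open("partition-backup.img.gz", "wb") as out:
    dd = subprocess.Popen(["dd", "if=/dev/sdXN", "bs=1M"], stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-c"], stdin=dd.stdout, stdout=out, check=True)
    dd.stdout.close()
    dd.wait()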
