How to most easily make never ending incremental offline backups - linux

I have for some time being thinking I can save some money on external hard drives by making this backup scheme: If I have 3TB data to backup, where less than 1TB changes from one backup to the next and I always want to have 1 copy out of the house, it should be enough to have 3 2TB external hard drives. The idea is that each time a disk is used for backup it is completely filled - a full backup is however never made as 3TB>>2TB.
So the backup starts by taking disk1 filling it with 2TB of data. Then take disk2 filling it with 1TB of data and 1TB of redundant data as it also exist on disk1. Now disk1 and disk2 can be taken out of the house.
When the next backup is made disk2 will already contain 2TB of data, where at least 2TB-1TB=1TB is still valid as only 1TB have changes. So by backing up 2TB of data (where some may also exist on disk2) to disk3 we have a complete backup on disk2+disk3. Now disk3 can be moved out of the house and disk1 can be moved back in, deleted and reused for backup.
This can of course be made better so we can use different sizes of disks, have different number of disks, have higher requirement for number of copies out of the house etc.
In theory it is quite easy to make by having stored checksums of which files is on all disks, so we can check for changes by checking the checksums.
However in practice there is a lot of cases to handle: out of disk-space, hardlinks, softlinks, file permissions, file ownership, etc.
I've tried to find existing backup programs that can do this but I have not found any.
So my question is: How do I most easily do this? Writing it from scratch would probably take too much time. So I was wondering if I could put it on top of something existing. Any ideas?

If I were in your shoes I would buy an external hard drive that is large enough to hold all your data.
Then write a Bash script that would:
Mount the external hard drive
Execute rsync to back up everything that has changed
Unmount the external hard drive
Send me a message (e-mail or whatever) letting me know the backup is complete
So you'd plug in your external drive, execute the Bash script and then return the external hard drive to a safe deposit box at a bank (or other similarly secure location).

Related

What is the performance impact of the touch command or its equivalent system call?

In a custom-developed NodeJS web server (running on Linux) that can dynamically generate thumbnail images, I want to cache these thumbnails on the filesystem and keep track of when they are actually used. If they haven't been used for a certain period of time (say, one year), I'd consider them "orphans" and delete them.
To this end, I considered to touch them each time they're requested from a client, so that I can use the modification time to check when they were last used.
I assume this would incur a significant performance hit on the web server in high-load situations, as it is an "unnecessary" filesystem write, while, apart from logging, most requests will only consist of reads.
Has anyone performed any benchmarks on how big an impact this might have and if it's worthwhile?
It's probably not great, and probably worth avoiding updating every time you open a file. That's the reason the relatime / noatime mount options were invented, to prevent the existing Unix access-time timestamp from being updated every time a file was opened.
Is your filesystem mounted with relatime? That updates atime at most once per day, when the file is opened (even for reading). The other mount option that's common on Linux is noatime: never update atime.
If you can't let the kernel do this for you without needing extra system calls, you might be better off making an fstat system call after opening the file and only touching it to update the mod time if the mod time is older than a day or week. (You're concerned about intervals of a year, so a week is fine.) i.e. manually implement the relatime logic, but for mod time.
Frequently accessed files will not need updates (and you're still making a total of one system call for them, plus a date-compare). Rarely accessed files will need another system call and a metadata write. If most of the accesses in your access pattern are to a smallish set of files repeatedly, this should be excellent.
Possible reasons for not being able to use atime could include:
The filesystem is mounted with noatime and it's not worth changing that
The files sometimes get read by something other than your web server / CGI setup. (e.g. a backup job that does more than compare size / timestamps)
Of course, the other option is to not update timestamps on use, and simply let a thumbnail be regenerated once a year after your weekly cron job deleted it. That might be ok depending on your workload.
If you manually touch some of the "hottest" thumbnails so you stagger their deletion, instead of having a big load spike this time next year, you could be ok. And/or have your deleter walk your filesystem very slowly, again so you don't have a big batch of frequently-needed thumbnails deleted at once.
You could come up with schemes like enabling mod-time updates in the week before the bi-annual cleanup, so thumbnails that should stay hot in cache get their modtimes updated. But probably better to just fstat / check / update all the time since that shouldn't be too much extra load.

Mirroring files from one partition to another on the second disk without RAID1

I am looking for a program that would allow me to mirror one partition to another disk (something like RAID1) for Linux. It doesn't have to be a windowed application, it can be a console application, I just want what is in one place to be mirrored to another.
It would be nice if it were possible to mirror a specific folder that I would care for instead of copying everything from the given partition.
I was looking on the internet, but it's hard to find something that would give such opportunities, hence the idea to ask such a question.
I do not want to make fake RAID on Linux or hardware RAID because I read that if the motherboard fails then it is best to have the same second one to recover data.
I will be grateful for every suggestion :)
You can check my script "CopyDirFile" written in bash, which is located on github.
You can perform a replication (mirroring) task of any source folder to another destination folder (deleting a file in the source folder means deleting it in the destination folder).
The script also allows you to create copy tasks (deleted files in the source folder will not be deleted in the target folder).
The tasks are executed in background at a specified time, not all the time, frequency is set by the user when creating the task.
You can also set the task to start automatically when the user logs on.
All the necessary information can be found in the README file in repository.
If I understood you correctly, I think it meets your requirements.
Linux has standard support for software RAID: mdraid.
It allows you to bundle two disk devices into a RAID 1 device (among other things); you then create a filesystem on top of that device.
LVM offers another way to do software RAID; it doesn't seem to be very popular, but it's certainly supported.
(If your system supports hardware RAID, on the motherboard or with a separate RAID controller, Linux can use that, too, but that doesn't seem to be what you're asking here.)

Cannot restore big file from azure backup because of six hours timeout

I am trying to restore a big file (~40GB) from Azure backup. I can see my recovery point and mount it as disk drive so I can copy/paste the file I need. The problem is that the copying takes approx. 8 hours, but the disk drive (recovery point) is automatically unmounted after 6 hours and the process fails consistently. I couldn't find any setting in the backup agent to increase this slot.
Any thoughts how to overcome this?
You can extend the mount time by setting the number of hours as a higher value.
RecoveryJobTimeOut under "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows Azure Backup\Config\CloudBackupProvider”
Type is DWORD; value is number of hours.
After some struggling I've found a workaround, so I'll post it here for others...
I've mounted the needed recovery point as a disk drive and started a file copy. It shows the standard Windows copy file progress dialog, which has an option of pausing. So after ~5.5 hours, just before the drive is unmounted, I paused the copy, unmounted the drive manually, mounted it again (getting another 6-hours slot), and than resumed the copy. Well, I don't think that this is how Microsoft wanted me to work, but it gets the job done.
Happy restoring!
Try to compress a copy of the file into a .zip on the server, then download the (hopefully smaller file)
Also, if you don't mind me asking, what the heck made a 40Gig file?

How to estimate a file size from header's sector start address?

Suppose I have a deleted file in my unallocated space on a linux partition and i want to retrieve it.
Suppose I can get the start address of the file by examining the header.
Is there a way by which I can estimate the number of blocks to be analyzed hence (this depends on the size of the image.)
In general, Linux/Unix does not support recovering deleted files - if it is deleted, it should be gone. This is also good for security - one user should not be able to recover data in a file that was deleted by another user by creating huge empty file spanning almost all free space.
Some filesystems even support so called secure delete - that is, they can automatically wipe file blocks on delete (but this is not common).
You can try to write a utility which will open whole partition that your filesystem is mounted on (say, /dev/sda2) as one huge file and will read it and scan for remnants of your original data, but if file was fragmented (which is highly likely), chances are very small that you will be able to recover much of the data in some usable form.
Having said all that, there are some utilities which are trying to be a bit smarter than simple scan and can try to be undelete your files on Linux, like extundelete. It may work for you, but success is never guaranteed. Of course, you must be root to be able to use it.
And finally, if you want to be able to recover anything from that filesystem, you should unmount it right now, and take a backup of it using dd or pipe dd compressed through gzip to save space required.

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large amount of files (thousands of different files, with avg. size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- following its reading of a file from the 'incoming' folder on the 1.5TB hard drive -- will be processing the information and be ready to output the processed information back to the 'processed' folder on the 1.5TB drive. the processed information for every file is of roughly the same average size as the input files (about ~2MB per file).
Are there any 'do' and 'donts' when one is building such an operation? is it a problem to have 30 machines or so read (or write) information to the same network drive, at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am use Linux, Ubuntu 10.04 LTS on all machines if it all matters)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the amount of parallel files you read, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, some seconds per file) you'll go past a point in which adding more processes will only slow down matters to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?

Resources