Prepend to Very Large File in Fixed Time or Very Fast [closed] - linux

I have a very large file (>500GB) that I want to prepend with a relatively small header (<20KB). Commands such as:
cat header bigfile > tmp
mv tmp bigfile
or similar approaches (e.g., with sed) are very slow.
What is the fastest method of writing a header to the beginning of an existing large file? I am looking for a solution that can run under CentOS 7.2. It is okay to install packages from CentOS install or updates repo, EPEL, or RPMForge.
It would be great if some method exists that doesn't involve relocating or copying the large amount of data in the bigfile. That is, I'm hoping for a solution that can operate in fixed time for a given header file regardless of the size of the bigfile. If that is too much to ask for, then I'm just asking for the fastest method.
Compiling a helper tool (as in C/C++) or using a scripting language is perfectly acceptable.

Is this something that needs to be done once, to "fix" a design oversight perhaps? Or is it something that you need to do on a regular basis, for instance to add summary data (say, the number of data records) to the beginning of the file?
If you need to do it just once, then your best option is to accept that a mistake has been made and take the consequences of the retro-fix. As long as you make your destination drive different from the source drive, you should be able to fix up a 500GB file within about two hours. So after a week of batch processes running after hours you could have upgraded perhaps thirty or forty files.
If this is a standard requirement for all such files, and you think you can apply the change only when the file is complete -- some sort of summary information perhaps -- then you should reserve the space at the beginning of each file and leave it empty. Then it is a simple matter of seeking into the header region and overwriting it with the real data once it can be supplied, as sketched below.
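To make the reserved-header idea concrete, here is a minimal sketch (my own illustration, not from the original answer) of filling in the reserved region later with pwrite(). The 20KB reservation size and the function name write_header are assumptions for the example.

/* Minimal sketch (illustrative): the file was created with the first
 * HEADER_RESERVE bytes left empty; once the real header is known, it is
 * written into that reserved region in place. HEADER_RESERVE and
 * write_header are assumptions for this example. */
#include <fcntl.h>
#include <unistd.h>

#define HEADER_RESERVE (20 * 1024)  /* space reserved when the file was created */

int write_header(const char *path, const void *hdr, size_t hdr_len)
{
    if (hdr_len > HEADER_RESERVE)
        return -1;                  /* header would overrun the reserved region */

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Overwrite the reserved region at the start of the file, in place. */
    if (pwrite(fd, hdr, hdr_len, 0) != (ssize_t)hdr_len) {
        close(fd);
        return -1;
    }
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}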
As has been explained, standard file systems require the whole of a file to be copied in order to add something at the beginning.
If your 500GB file is on a standard hard disk, which will allow data to be read at around 100MB per second, then reading the whole file will take about 5,120 seconds, or roughly 1 hour 25 minutes.
As long as you arrange for the destination to be a separate drive from the source, you can mostly write the new file in parallel with the read, so it shouldn't take much longer than that. But there's no way to speed it up other than that, I'm afraid.

If you were not bound to CentOS 7.2, your problem could be solved (with some reservations [1]) by fallocate, which provides the needed functionality for the ext4 filesystem starting from Linux 4.2 and for the XFS filesystem since Linux 4.1:
int fallocate(int fd, int mode, off_t offset, off_t len);
This is a nonportable, Linux-specific system call. For the portable,
POSIX.1-specified method of ensuring that space is allocated for a
file, see posix_fallocate(3).
fallocate() allows the caller to directly manipulate the allocated
disk space for the file referred to by fd for the byte range starting
at offset and continuing for len bytes.
The mode argument determines the operation to be performed on the
given range. Details of the supported operations are given in the
subsections below.
...
Increasing file space
Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux 4.1)
in mode increases the file space by inserting a hole within the
file size without overwriting any existing data. The hole will start
at offset and continue for len bytes. When inserting the hole inside
file, the contents of the file starting at offset will be shifted
upward (i.e., to a higher file offset) by len bytes. Inserting a
hole inside a file increases the file size by len bytes.
...
FALLOC_FL_INSERT_RANGE requires filesystem support. Filesystems that
support this operation include XFS (since Linux 4.1) and ext4 (since
Linux 4.2).
[1] fallocate allows prepending data to the file only at multiples of the filesystem block size. So it will solve your problem only if it's acceptable for you to pad the extra space with whitespace, comments, etc.
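To make this concrete, here is a minimal, untested sketch of a helper that inserts block-aligned space at the start of the big file and then writes the header into it. It assumes a kernel and filesystem that support FALLOC_FL_INSERT_RANGE (so not stock CentOS 7.2) and that padding the header up to a block boundary is acceptable; the command-line interface is just for illustration.

/* Minimal, untested sketch: insert block-aligned space at the start of an
 * existing file with FALLOC_FL_INSERT_RANGE, then write the header into
 * the newly created hole. Requires XFS (Linux >= 4.1) or ext4
 * (Linux >= 4.2); any padding between the header and the old data reads
 * back as zero bytes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>   /* FALLOC_FL_INSERT_RANGE on older glibc */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s bigfile headerfile\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open bigfile"); return 1; }

    int hfd = open(argv[2], O_RDONLY);
    if (hfd < 0) { perror("open header"); return 1; }

    struct stat st, hst;
    if (fstat(fd, &st) < 0 || fstat(hfd, &hst) < 0) { perror("fstat"); return 1; }

    /* st_blksize is typically the filesystem block size; round the header
     * size up to a whole number of blocks, as required by the API. */
    off_t blksz = st.st_blksize;
    off_t insert_len = ((hst.st_size + blksz - 1) / blksz) * blksz;

    /* Shift the whole file up by insert_len bytes without copying its data. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, insert_len) < 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        return 1;
    }

    /* Fill the hole with the header. */
    char buf[65536];
    ssize_t n;
    off_t off = 0;
    while ((n = read(hfd, buf, sizeof buf)) > 0) {
        if (pwrite(fd, buf, n, off) != n) { perror("pwrite"); return 1; }
        off += n;
    }

    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    close(hfd);
    close(fd);
    return 0;
}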
Without support for fallocate() + FALLOC_FL_INSERT_RANGE, the best you can do is the following (sketched after this list):
Increase the file (so that it has its final size);
mmap() the file;
memmove() the data;
Fill in the header data at the beginning.
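A minimal, untested sketch of these four steps follows. A 64-bit process is assumed, since the whole file is mapped at once; note that every byte of the big file still moves through memory, so this is O(file size) rather than fixed time. The function name and the hdr/hdr_len parameters are my own.

/* Minimal, untested sketch of the four steps above. hdr/hdr_len are
 * assumed to hold the header contents; old_size is the current file size. */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int prepend_by_memmove(int fd, const void *hdr, size_t hdr_len, off_t old_size)
{
    off_t new_size = old_size + (off_t)hdr_len;

    /* 1. Increase the file so that it has its final size. */
    if (ftruncate(fd, new_size) < 0)
        return -1;

    /* 2. mmap() the file. */
    void *p = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* 3. memmove() the existing data up by hdr_len bytes. */
    memmove((char *)p + hdr_len, p, (size_t)old_size);

    /* 4. Fill in the header data at the beginning. */
    memcpy(p, hdr, hdr_len);

    /* Flush and unmap. */
    if (msync(p, (size_t)new_size, MS_SYNC) < 0) {
        munmap(p, (size_t)new_size);
        return -1;
    }
    return munmap(p, (size_t)new_size);
}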

Related

Efficiently inserting blocks into the middle of a file

I'm looking for, essentially, the ext4 equivalent of mremap().
I have a big mmap()'d file that I'm allocating arrays in, and the arrays need to grow. So I want to make the first array larger at its current location, and budge all the other arrays along in the file and the address space to make room.
If this was just anonymous memory, I could use mremap() to budge over whole pages in constant time, as long as I'm inserting a whole number of memory pages. But this is a disk-backed file, so the data needs to move in the file as well as in memory.
I don't actually want to read and then rewrite whole blocks of data to and from the physical disk. I want the data to stay on disk in the physical sectors it is in, and to induce the filesystem to adjust the file metadata to insert new sectors where I need the extra space. If I have to keep my inserts to some multiple of a filesystem-dependent disk sector size, that's fine. If I end up having to copy O(N) sector or extent references around to make room for the inserted extent, that's fine. I just don't want to have 2 gigabytes move from and back to the disk in order to insert a block in the middle of a 4 gigabyte file.
How do I accomplish an efficient insert by manipulating file metadata? Is a general API for this actually exposed in Linux? Or one that works if the filesystem happens to be e.g. ext4? Will a write() call given a source address in the memory-mapped file reduce to the sort of efficient shift I want under the right circumstances?
Is there a C or C++ API function with the semantics "copy bytes from here to there and leave the source with an undefined value" that I should be calling in case this optimization gets added to the standard library and the kernel in the future?
I've considered just always allocating new pages at the end of the file, and mapping them at the right place in memory. But then I would need to work out some way to reconstruct that series of mappings when I reload the file. Also, shrinking the data structure would be a nontrivial problem. At that point, I would be writing a database page manager.
I think I actually may have figured it out.
I went looking for "linux make a file sparse", which led me to an answer on Unix & Linux Stack Exchange that mentioned the fallocate command-line tool. The fallocate tool has a --dig-holes option, which turns parts of a file that could be represented by holes into holes.
I then went looking for "fallocate dig holes" to find out how that works, and I got the fallocate man page. I noticed it also offers a way to insert a hole of some size:
-i, --insert-range
Insert a hole of length bytes from offset, shifting existing
data.
If a command line tool can do it, Linux can do it, so I dug into the source code for fallocate, which you can find on GitHub:
case 'i':
        mode |= FALLOC_FL_INSERT_RANGE;
        break;
It looks like the fallocate tool accomplishes a cheap hole insert (and a move of all the other file data) by calling the Linux-specific fallocate() function with the FALLOC_FL_INSERT_RANGE flag, added in Linux 4.1. This flag won't work on all filesystems, but it does work on ext4 and it does exactly what I want: adjust the file metadata to efficiently free up some space in the file's offset space at a certain point.
It's not immediately clear to me how this interacts with currently memory-mapped pages, but I think I can work with this.
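For illustration (this is my sketch, not part of the original answer), the syscall can be used roughly as below. The offset and length must be multiples of the filesystem block size, and because the interaction with live mappings is left open above, the mapping is conservatively dropped and re-created around the call.

/* Minimal, untested sketch: insert "len" bytes of hole at "offset" in an
 * open, mmap()'d file. offset and len must be multiples of the filesystem
 * block size, and the filesystem must support FALLOC_FL_INSERT_RANGE
 * (XFS since Linux 4.1, ext4 since Linux 4.2). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>   /* FALLOC_FL_INSERT_RANGE on older glibc */
#include <sys/mman.h>

void *insert_hole_and_remap(int fd, void *map, size_t old_size,
                            off_t offset, off_t len)
{
    if (munmap(map, old_size) < 0)
        return MAP_FAILED;

    /* Shift everything at or above "offset" up by "len" bytes, by
     * manipulating extents rather than copying the data. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len) < 0)
        return MAP_FAILED;

    /* The file is now old_size + len bytes long; map it again. */
    return mmap(NULL, old_size + (size_t)len, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}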

How to process a file from one to another without doubled storage requirements? [duplicate]

This question already has answers here:
Remove beginning of file without rewriting the whole file
I'm creating an archiver, which processes data in two steps:
creates a temporary archive file
from the temporary archive file, it creates the final archive. After the final archive is created, the temporary file is deleted
The 2nd step processes the temporary archive file linearly, and the result is written to the final archive while processing. So this process temporarily needs twice as much storage as the archive file.
I'd like to avoid the double storage need. So my idea is that, during processing, I'd like to tell the OS that it can drop the processed part of the temporary file. Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?
Write all the data. Shift the data by opening the file twice: for reading and for writing in overwrite mode (inject the table of contents and make sure that you're not overwriting before reading a chunk).
If the table of contents has a fixed length, then preallocate that size in the file to avoid shifting completely.
Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?
No, that is not possible with plain files. The Linux-specific fallocate(2) can do something close, but it is not portable and might not work with every filesystem, so I don't recommend relying on it.
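For reference, on filesystems that do support it, fallocate(2) can deallocate an already-processed prefix in place. This is my reading of the suggestion, not something spelled out in the answer: FALLOC_FL_PUNCH_HOLE frees the disk space of a byte range while keeping the file size and the offsets of the remaining data unchanged. A minimal, untested sketch:

/* Minimal, untested sketch: free the disk space of the already-processed
 * prefix of the temporary file without changing the offsets of the data
 * still to be read. FALLOC_FL_PUNCH_HOLE must be combined with
 * FALLOC_FL_KEEP_SIZE; only whole filesystem blocks are actually
 * deallocated, and the filesystem must support hole punching. */
#define _GNU_SOURCE
#include <fcntl.h>

int drop_processed_prefix(int fd, off_t processed_bytes)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     0, processed_bytes);
}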
Alternatively, look into SQLite and GDBM indexed files. They provide an abstraction (above files) which enables you to "delete records".
Or just keep all the data in memory temporarily.
Or consider a double-pass (or multiple-pass) approach. Maybe nftw(3) could be useful.
(Today, disk space is very cheap, so your requirement of avoiding the double storage need is really strange; if you are handling a huge amount of data you should have mentioned it.)

How to estimate a file size from header's sector start address?

Suppose I have a deleted file in my unallocated space on a Linux partition and I want to retrieve it.
Suppose I can get the start address of the file by examining the header.
Is there a way to estimate the number of blocks that need to be analyzed from there (this depends on the size of the image)?
In general, Linux/Unix does not support recovering deleted files: if it is deleted, it should be gone. This is also good for security: one user should not be able to recover data in a file that was deleted by another user by creating a huge empty file spanning almost all free space.
Some filesystems even support so-called secure delete; that is, they can automatically wipe file blocks on delete (but this is not common).
You can try to write a utility which will open the whole partition that your filesystem is mounted on (say, /dev/sda2) as one huge file, read it, and scan for remnants of your original data. But if the file was fragmented (which is highly likely), chances are very small that you will be able to recover much of the data in a usable form.
Having said all that, there are some utilities which try to be a bit smarter than a simple scan and can attempt to undelete your files on Linux, like extundelete. It may work for you, but success is never guaranteed. Of course, you must be root to be able to use it.
And finally, if you want to be able to recover anything from that filesystem, you should unmount it right now and take a backup of it using dd, or pipe dd through gzip to save space.

Is moving a file safer than deleting it if you want to remove all traces of it? [closed]

I recently accidentally called "rm -rf *" on a directory and deleted some files that I needed. However, I was able to recover most of them using photorec. Apparently, "deleting" a file just removes references to said file, and it is not truly deleted until it is overwritten by something else.
So if I wanted to remove the file completely, couldn't I just execute
mv myfile.txt /temp/myfile.txt
(or move it to external storage)?
You should consider using the Linux command shred, which overwrites the target file multiple times before deleting it completely, which makes it 'impossible' to recover the file.
You can read a bit about the shred command here.
Just moving the file does not cover you for good; if you moved it to external storage, the local version of the file is deleted just as it is with the rm command.
No, that won't help either.
A move between file systems is really still just a "copy + rm" internally. The original storage location of the file on the "source" media is still there, just marked as available. A move WITHIN a file system doesn't touch the file bytes at all; it just updates the bookkeeping to say "file X is now in location Y".
To truly wipe a file, you must overwrite all of its bytes. And yet again, technology gets in the way of that: if you're using a solid-state storage medium, there is a VERY high chance that writing 'garbage' to the file won't touch the actual transistors the file is stored in, but will actually get written somewhere completely different.
For magnetic media, repeated overwriting with alternating 0x00, 0xFF, and random bytes will eventually totally nuke the file. For SSD/flash systems, the device either has to offer a "secure erase" option, or you have to smash the chips into dust. For optical media, it's even more complicated: -R media cannot be erased, only destroyed; for -RW, I don't know how many repeated-write cycles are required to truly erase the bits.
No (and not just because moving it somewhere else on your computer is not removing it from the computer). The way to completely remove a file is to completely overwrite the space on the disk where it resided. The Linux command shred will accomplish this.
Basically, no: in most file systems you can't guarantee that a file is overwritten without going very low level. Removing a file and/or moving it will only change the pointer to the file; the file's data still exists in the file system. Even the Linux command shred won't guarantee a file's removal in many file systems, since it assumes files are overwritten in place.
On SSDs, it's even more likely that your data stays there for a long time: even if the file system attempts to overwrite blocks, the SSD will remap the write to a new block (erasing takes a lot of time; if it wrote in place, things would be very slow).
In the end, with modern file systems and disks, the best chance you have of keeping files secure is to keep them encrypted to begin with. If they're stored anywhere in clear text, they can be very hard to remove, and recovering an encrypted file from disk (or a backup, for that matter) won't be much use to anyone without the encryption key.

Estimation or measurement of amount of iops to create a file

I'd like to know how many I/O operations (IOPS) it takes to create an empty file. I am interested in Linux and the GFS file system; however, information about other file systems is also very welcome.
Suggestions how to accurately measure this would be also very welcome.
Real scenario (requested by answers):
Linux
GFS file system (if you can estimate for another, please do)
create a new file in an existing directory; the file does not already exist
using the following code
Assume the directory is in cache and the directory depth is D
Code:
int fd = open("/my_dir/1/2/3/new_file", O_CREAT | O_WRONLY, S_IRWXU);  // S_IRWXU is the mode argument, not an open() flag
// assuming fd is valid
fsync(fd);
For an artificial measurement:
Create a blank filesystem on its own block device (e.g., a VMware SCSI disk)
Mount it, call sync(), then record the I/O counters for that block device (for instance from /proc/diskstats); a sketch of reading these counters follows this answer.
Run your test program against the filesystem, and do no further operations (not even "ls").
Wait until all unflushed blocks have been flushed - say about 1 minute or so
Snapshot the I/O counters again
Of course this is highly unrealistic, because if you created two files rather than one, you'd probably find that there were fewer than twice as many I/O operations.
Also, creating empty or blank files is unrealistic, as they don't do anything useful.
Directory structures (how deep the directories are, how many entries) might contribute, but also how fragmented it is and other arbitrary factors.
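A minimal sketch (my own illustration, not part of the original answer) of snapshotting a block device's I/O counters around such a test. The device path /sys/block/sdb/stat is an assumption; field 1 of that file is the number of completed reads and field 5 is the number of completed writes.

/* Minimal sketch (illustrative only): snapshot a block device's I/O
 * counters before and after the test. */
#include <stdio.h>

static int read_io_counts(const char *path, long *reads, long *writes)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long rd, rd_merged, rd_sectors, rd_ticks, wr;
    int n = fscanf(f, "%ld %ld %ld %ld %ld",
                   &rd, &rd_merged, &rd_sectors, &rd_ticks, &wr);
    fclose(f);
    if (n != 5)
        return -1;
    *reads = rd;
    *writes = wr;
    return 0;
}

int main(void)
{
    const char *stat_path = "/sys/block/sdb/stat";  /* assumed test device */
    long r0, w0, r1, w1;

    if (read_io_counts(stat_path, &r0, &w0) < 0)
        return 1;

    /* ... run the file-creation test here and wait for writeback ... */

    if (read_io_counts(stat_path, &r1, &w1) < 0)
        return 1;

    printf("reads: %ld, writes: %ld\n", r1 - r0, w1 - w0);
    return 0;
}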
The nature of the answer to this question is: best case, normal case, worst case. There is no single answer, because the number of I/O operations required will vary according to the current state of the file system. (A pristine file system is a highly unrealistic scenario.)
Taking FAT32 as an example, the best case is 1. The normal case depends on the degree of file system fragmentation and the directory depth of the pathname for the new file. The worst case is unbounded (except by the size of the file system, which imposes a limit on the maximum possible number of I/O operations to create a file).
Really, the question is not answerable, unless you define a particular file system scenario.
I did the following measurement: we wrote an application that creates N files as described in the question.
We ran this application on a disk which was devoted to this application only, and measured the number of I/O operations using iostat -x 1.
The result, on GFS with Linux kernel 2.6.18, is 2 I/O operations per file creation.
This answer is based on MarkR's answer.

Resources