Efficient in-place search and replace for a large file

There are some standard tools to do this, but I need a simple GUI to assist some users (on Windows). They will get an open-file dialog and pick the file to process.
The file will be an XML file. The file will contain (within the first few lines) a text string that needs to be deleted or replaced with whitespace (doesn't matter which).
The problem is that the XML file is several gigabytes in size, but the fixed search-and-replace string will occur within the first 4 KB or so.
What's the best way to overwrite the search string and save in place, without reading the whole file into memory or writing excessively to disk?

Obviously, replacing with whitespace so that the overall size of the file doesn't change is the best choice here; otherwise you must stream through the entire file to update it on disk.
If this was for a Unix environment, I would look into using mmap() to map a suitable part of the start of the file into RAM, then edit it in-place and be done.
The Win32 equivalent is the CreateFileMapping() function, used together with MapViewOfFile().
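A minimal sketch of that approach in C, assuming the token lies entirely within the first 4 KiB; the file name "big.xml" and the search string "NEEDLE" are placeholders:

/* Sketch: map the first 4 KiB of the file on Windows and blank out a fixed
   token in place. Error handling is abbreviated; "big.xml" and "NEEDLE" are
   placeholders for the real file and search string. */
#include <windows.h>
#include <string.h>

int main(void)
{
    const char needle[] = "NEEDLE";   /* hypothetical search string */
    const DWORD window = 4096;        /* token known to be in the first 4 KiB */

    HANDLE file = CreateFileA("big.xml", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, window, NULL);
    if (mapping == NULL) { CloseHandle(file); return 1; }

    char *view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, window);
    if (view != NULL) {
        for (DWORD i = 0; i + sizeof needle - 1 <= window; i++) {
            if (memcmp(view + i, needle, sizeof needle - 1) == 0) {
                memset(view + i, ' ', sizeof needle - 1);  /* same-size overwrite */
                break;
            }
        }
        FlushViewOfFile(view, 0);     /* push the dirty page back to disk */
        UnmapViewOfFile(view);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}

Because the mapped view is edited in place, the file size never changes and only the touched page is written back.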

You can easily write your own tool. If the string is in the very beginning, then any brute-force approach will work: just keep scanning until you find it.
However, avoiding a lot of disk writes is only possible if you do not change the file size. If you wish to delete or insert bytes somewhere in the middle, you will have to overwrite everything that follows them, which in your case would be practically the whole file. So you'll have to replace the string with whitespace. As long as you just replace one byte with another, there is no overhead.
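The same brute-force idea in portable C stdio, assuming (as above) that the token sits entirely within the first 4 KiB; "big.xml" and "NEEDLE" remain placeholders. Opening the file with "r+b" lets you overwrite the match in place:

/* Minimal portable sketch: read the first 4 KiB, find the fixed token, and
   overwrite it with spaces of the same length so the file size never changes. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char needle[] = "NEEDLE";   /* hypothetical search string */
    char head[4096];

    FILE *f = fopen("big.xml", "r+b");  /* read/update, binary */
    if (!f) return 1;

    size_t n = fread(head, 1, sizeof head, f);
    for (size_t i = 0; i + sizeof needle - 1 <= n; i++) {   /* naive scan */
        if (memcmp(head + i, needle, sizeof needle - 1) == 0) {
            fseek(f, (long)i, SEEK_SET);        /* reposition to the match */
            for (size_t k = 0; k < sizeof needle - 1; k++)
                fputc(' ', f);                  /* same-size overwrite */
            break;
        }
    }
    return fclose(f);
}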

Related

Node.js: manipulate file like a stack

I'm envisioning an implementation in node.js that can manipulate a file on disk as if it were a stack data structure.
Suppose the file is UTF-8-encoded plain text, each element of the stack corresponds to a '\n'-delimited line in the file, and the top of the stack points to the first line of the file. I want something that can simultaneously read and write the file.
const file = new FileAsStack("/path/to/file");
// read the first line from the file,
// also remove that line from the file.
let line = await file.pop();
To implement such an interface naively, I could simply read the whole file into memory, have .pop() read from memory, and write the remainder back to disk. Obviously such an approach isn't ideal: imagine dealing with a 10 GB file; it would be both memory-intensive and I/O-intensive.
With fs.read() I can read just a slice of the file, so the "read" part is solved. But I have no idea how to do the "write" part. How can I efficiently take just one line and write the rest of the file back? I hope I don't have to read every byte of the file into memory and then write it back to disk...
I vaguely remember that a file in a filesystem is just a pointer to a position on disk. Is there any way I can simply move that pointer to the start of the next line?
I need some insight into what syscalls or other mechanisms can do this efficiently, but I'm quite ignorant of low-level systems stuff. Any help is appreciated!
What you're asking for is not something that a standard file system can do. You can't insert data into the beginning of a file in any traditional OS file system without rewriting the entire file. That's just the way they work.
Systems that absolutely need to do something like that without rewriting the entire file, while still using a traditional OS file system, will build their own mini file system on top of the regular one, so that one virtual file consists of many pieces written to separate files or to separate blocks of a file. In such a system, you can insert data at the beginning of a virtual file without rewriting any of the existing data: write a new block of data to disk, then update the virtual-file index (stored in some other file) to record that the first block of the virtual file now comes from that new location. The index specifies the order of the blocks of data in the file and where each one comes from.
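Applied to the stack question above, a degenerate form of that index is enough: a tiny sidecar file stores only the consumed-bytes offset, so a pop never rewrites the big file (the consumed space is reclaimed only by an occasional compaction pass). A sketch in C, where load_head, store_head, and pop_line are illustrative names, not a real library:

#include <stdio.h>

/* Read the consumed-bytes offset from the sidecar index (0 if absent). */
static long load_head(const char *idx)
{
    long off = 0;
    FILE *f = fopen(idx, "r");
    if (f) { fscanf(f, "%ld", &off); fclose(f); }
    return off;
}

static void store_head(const char *idx, long off)
{
    FILE *f = fopen(idx, "w");
    if (f) { fprintf(f, "%ld", off); fclose(f); }
}

/* Pop one '\n'-terminated line; returns 0 on EOF. Assumes a line fits in buf.
   The data file is never rewritten; only the tiny index file changes per pop. */
static int pop_line(const char *data, const char *idx, char *buf, size_t cap)
{
    long off = load_head(idx);
    FILE *f = fopen(data, "r");
    if (!f) return 0;
    fseek(f, off, SEEK_SET);
    if (!fgets(buf, (int)cap, f)) { fclose(f); return 0; }
    store_head(idx, ftell(f));   /* advance the virtual "top of stack" */
    fclose(f);
    return 1;
}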
Most programs that need to do something like this will instead use a database for storing records and then use indexes and queries for controlling order and let the underlying database worry about where individual bits get stored on disk. In this way, you can very efficiently insert data anywhere you want in a resulting query.

How to process a file from one form to another without doubled storage requirements? [duplicate]

Closed as a duplicate of: Remove beginning of file without rewriting the whole file.
I'm creating an archiver, which processes data in two steps:
creates a temporary archive file
from the temporary archive file, it creates the final archive. After the final archive is created, the temporary file is deleted.
The 2nd step processes the temporary archive file linearly, and the result is written to the final archive while processing. So this process temporarily needs twice as much storage as the archive itself.
I'd like to avoid the double storage need. So my idea is that, during processing, I'd like to tell the OS that it can drop the already-processed part of the temporary file. Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?
Write all the data. Then shift the data by opening the file twice: once for reading and once for writing in overwrite mode (inject the table of contents, and make sure you never overwrite a chunk before it has been read).
If the table of contents has a fixed length, then preallocate that much space at the start of the file to avoid the shifting completely.
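A sketch of that fixed-length variant in C: reserve the table-of-contents bytes up front, stream the archive body after them, then seek back and fill the TOC in. TOC_SIZE and write_archive are illustrative, and out must be a seekable stream (e.g. opened with fopen(path, "wb")):

#include <stdio.h>

#define TOC_SIZE 4096   /* assumed fixed TOC length */

int write_archive(FILE *out /*, ...archive inputs... */)
{
    char toc[TOC_SIZE] = {0};

    fwrite(toc, 1, TOC_SIZE, out);   /* placeholder bytes at the front */

    /* ... stream the archive body here, recording entry offsets in toc ... */

    fseek(out, 0, SEEK_SET);         /* come back and fill the TOC in */
    fwrite(toc, 1, TOC_SIZE, out);
    return fflush(out);
}

No shifting is needed because the body is never moved; only the small reserved region is rewritten.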
Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?
No, that is not possible with plain files in general. However, you could look into the Linux-specific fallocate(2) with FALLOC_FL_COLLAPSE_RANGE, which can remove a (block-aligned) range from the front of a file on some file systems; it is not portable and might not work with every file system, so I don't recommend relying on it.
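For completeness, a minimal sketch of that Linux-only call; the offset and length must be multiples of the file-system block size, and unsupported file systems fail with EOPNOTSUPP:

/* Cut `len` bytes off the front of an open file without rewriting it.
   Linux-only (ext4, XFS, ...); `len` must be block-aligned. */
#define _GNU_SOURCE
#include <fcntl.h>

static int drop_front(int fd, off_t len)
{
    return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len);
}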
Alternatively, look into SQLite or GDBM indexed files. They provide an abstraction (above plain files) which enables you to "delete records".
Or just keep temporarily all the data in memory.
Or consider a double-pass (or multiple-pass) approach. Maybe nftw(3) could be useful.
(Today, disk space is very cheap, so your requirement of avoiding the double storage need is really strange; if you are handling a huge amount of data, you should have mentioned it.)

How to edit a big file

Imagine a huge file that should be edited by my program. In order to speed up reads, I use mmap() and then read only the parts I'm viewing. However, if I want to add a line in the middle of the file, what's the best approach for that?
Is the only way to add a line and then move the rest of the file? That sounds expensive.
So my question is basically:
What's the most efficient way of adding data in the middle of a huge file?
The only way to insert data in the middle of any (huge or small) file (on Linux or POSIX) is to copy that file into a fresh one, and later rename(2) the copy over the original. So you'll copy its head (up to the insertion point), append the new data to that copy, and then copy the tail (after the insertion point). You might also consider calling posix_fadvise(2) (or even the Linux-specific readahead(2)...), but that does not alleviate the need to copy all the data. mmap(2) might be used e.g. to replace read(2), but whatever you do requires you to copy all the data.
Of course, if it happens that you are replacing a data chunk in the middle of the file by another chunk of the same size (so no real insertion), you can use plain lseek(2) + write(2).
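A minimal sketch of that same-size overwrite, assuming POSIX; overwrite_at and its parameters are illustrative:

#include <fcntl.h>
#include <unistd.h>

/* Replace `len` bytes at `off` with new bytes of identical length;
   nothing else in the file moves, so no copying is needed. */
static int overwrite_at(const char *path, off_t off, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    if (lseek(fd, off, SEEK_SET) < 0 ||
        write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return close(fd);
}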
Is the only way to add a line and then move the rest of the file? That sounds expensive.
Yes it is conceptually the only way.
You should consider using something other than a plain text file: look into SQLite or GDBM (they might be very efficient in your use case). See also this answer. Both provide a higher abstraction than POSIX files, and so give you the ability to "insert" data. (Of course, they are still internally based upon, and using, POSIX files.)

Windows: modify text without creating a new file

I have a 1 TB file that needs to be modified with a simple change: either delete the first line or prepend a few characters to the file. Because of space limitations, I cannot use commands that redirect to a new file or editors that load the entire file into memory.
What is the best way to achieve this?
EDIT: The best tool I've found thus far is FART (http://fart-it.sourceforge.net/), but my file is encoded in UCS-2 and this tool doesn't seem to support that.
There is no simple way to do this with any mainstream operating system / file system, because they do not support "prepend to beginning of file" or "remove from beginning of file" operations, only "append to end of file" and "truncate file". Any solution will require reading the whole file and writing it back out with the desired changes.
This could be done with some relatively straightforward C code (or C++, or probably even Python or perl or whatever). However, there would be no backup, so if it doesn't go right... well, how important is this file?
The idea for the insert case would be to use ftruncate() to extend the size of the file to include space for the new bits, then, working from the end of the file to the start (to avoid overwriting any of the existing data), read each block and write it back out offset by the correct amount. Then write the new data at the front.
The deletion case would work by finding the first byte past what you want to delete and, starting there, reading blocks and writing them back shifted towards the front of the file by the proper amount; then at the end, ftruncate() the extra bytes off the end of the file.
These are obviously not safe operations if they are interrupted for any reason, or if the code to perform them has not been well tested beforehand, but it can be done. Buying the additional storage to make it possible to retain the original file while writing out the new one would probably be a considerably better investment.
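Under those caveats, a minimal sketch of the insert case in POSIX C, using pread(2)/pwrite(2) for clarity (on Windows the analogues would be ReadFile()/WriteFile() with explicit offsets plus SetEndOfFile()); prepend and BLOCK are illustrative, and the operation is not crash-safe:

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLOCK (1 << 20)   /* 1 MiB copy window */

static int prepend(const char *path, const void *prefix, size_t plen)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    off_t oldsize = st.st_size;

    /* Grow the file to make room for the new prefix. */
    if (ftruncate(fd, oldsize + (off_t)plen) < 0) { close(fd); return -1; }

    /* Shift existing bytes toward the end, last block first, so no
       not-yet-read data is ever overwritten. */
    static char buf[BLOCK];
    off_t remaining = oldsize;
    while (remaining > 0) {
        size_t n = remaining < BLOCK ? (size_t)remaining : BLOCK;
        off_t src = remaining - (off_t)n;
        if (pread(fd, buf, n, src) != (ssize_t)n ||
            pwrite(fd, buf, n, src + (off_t)plen) != (ssize_t)n) {
            close(fd);
            return -1;
        }
        remaining = src;
    }
    /* Finally, write the new data at the front. */
    if (pwrite(fd, prefix, plen, 0) != (ssize_t)plen) { close(fd); return -1; }
    return close(fd);
}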

Editing large data files

I'm about to start on a project wherein I foresee there being large files (mostly flat text files, but possibly CSV, fixed-width, XML, ...so far) that need to be edited. I need to develop the pieces to do this editing within the application.
In trying to determine a Good Way to handle editing large amounts of data (possibly into the GB range) without having to load the whole thing, I found that Audacity is able to handle large files quite well. Audacity is open source, so I thought it would make an excellent teaching tool for me in this circumstance. However, I started going around in circles reading through the code, and now I'm thoroughly confused.
I'm hoping for two results from this question:
A good way to handle this editing without loading the whole file. I've thought about loading the data as the user edits it, caching it on demand.
An explanation of how Audacity does it.
I'm using C# and .NET, but the answers don't need to be coupled to that environment.
Several tricks can make editing simpler and faster.
INDEX it for faster access. While the user is doing nothing, skim through the file and create an index so you can quickly find a specific spot in the file (see below).
Only store changes the user makes. Don't try to apply them directly to the file until the user saves.
Set a limit on how much to read into memory when the user jumps to a point. Read one or two screens of data initially so you can display it, and then if the user doesn't jump to a new spot immediately, read a bit before and a bit after the current spot.
Indexing:
When the user wants to jump to line X or timestamp T, you don't want to skim the whole file counting line breaks and characters. Instead, skim the data once and create a record: say, every 50 lines, record the byte offset, character count, and line number. This data can be stored in a hashtable, a tree, or just an ordered list. Then, when the user jumps within the file, you can find the nearest index point and read from there until you find the requested spot. This technique is especially useful when working with Unicode, where the number of bytes per character may vary. If files are so large that a full index won't fit in memory, you may want to limit the index points and space them more widely, or store the index in a temporary file.
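A minimal sketch of such a sparse index in C, using the 50-line stride from the text; LineIndex, build_index, and seek_line are illustrative names, and error handling is abbreviated:

#include <stdio.h>
#include <stdlib.h>

#define STRIDE 50   /* one checkpoint per 50 lines, as in the text */

typedef struct {
    long *offsets;   /* offsets[i] = byte offset where line i*STRIDE starts */
    size_t count;
} LineIndex;

/* Scan the file once (f positioned at the start), recording a checkpoint
   every STRIDE lines. */
static LineIndex build_index(FILE *f)
{
    LineIndex ix = { malloc(sizeof(long) * 16), 0 };
    size_t cap = 16, line = 0;
    long pos = 0;
    int c;

    ix.offsets[ix.count++] = 0;              /* line 0 starts at offset 0 */
    while ((c = fgetc(f)) != EOF) {
        pos++;
        if (c == '\n' && ++line % STRIDE == 0) {
            if (ix.count == cap)
                ix.offsets = realloc(ix.offsets, sizeof(long) * (cap *= 2));
            ix.offsets[ix.count++] = pos;    /* next line begins here */
        }
    }
    return ix;
}

/* Jump to line n: seek to the nearest checkpoint, then scan forward. */
static void seek_line(FILE *f, const LineIndex *ix, size_t n)
{
    size_t checkpoint = n / STRIDE;
    if (checkpoint >= ix->count) checkpoint = ix->count - 1;
    fseek(f, ix->offsets[checkpoint], SEEK_SET);
    for (size_t line = checkpoint * STRIDE; line < n; ) {
        int c = fgetc(f);
        if (c == EOF) break;
        if (c == '\n') line++;
    }
}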
Editing and altering big files:
As suggested by Harvey -- only store changes in memory (as diffs), and then apply them to the file when saved by streaming from input to output. A tree or ordered list may be helpful, so you can quickly find the next place where you need to make a change when writing from input to output.
If changes are too large to fit in memory, you may want to track them in a separate temporary file (perhaps in the same folder as the original). You can just keep writing a continuous list of changes, with new ones appended to this change file. When you save, you'll read through the change list and create a final list of alterations to apply, before deleting the temp file. For performance reasons, it may be helpful to avoid rewriting the change log file; instead, just append to the end of it, and remove redundant or cancelling edits when performing a save.
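To make the save pass concrete, here is a sketch of streaming the original file to the output while applying an offset-sorted change list in one pass; Edit and apply_edits are illustrative names, not anything from Audacity:

#include <stdio.h>

typedef struct Edit {
    long offset;        /* position in the ORIGINAL file */
    long del;           /* bytes to drop from the original */
    const char *ins;    /* bytes to insert instead */
    size_t ins_len;
    struct Edit *next;  /* list kept sorted by offset */
} Edit;

/* Stream src to dst, applying the sorted edit list in a single pass. */
static void apply_edits(FILE *src, FILE *dst, const Edit *e)
{
    char buf[1 << 16];
    long pos = 0;
    for (; e; e = e->next) {
        long keep = e->offset - pos;   /* unchanged span before this edit */
        while (keep > 0) {
            size_t want = keep < (long)sizeof buf ? (size_t)keep : sizeof buf;
            size_t n = fread(buf, 1, want, src);
            if (n == 0) return;
            fwrite(buf, 1, n, dst);
            keep -= (long)n;
        }
        fwrite(e->ins, 1, e->ins_len, dst);  /* insertion */
        fseek(src, e->del, SEEK_CUR);        /* skip deleted bytes */
        pos = e->offset + e->del;
    }
    size_t n;                                /* copy the unedited tail */
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        fwrite(buf, 1, n, dst);
}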
Interesting fact: the same structures you use for the change log can be used to provide Undo/Redo information.
Sound files are basically a data stream, right? So you don't actually need to deal with the whole file at once. Audacity users may only work with a small snippet of that large file at any given moment.
Hypothetically, if you are adding a 1 second snippet of sound to a large sound file, you only actually have to deal with the entire file when you have to save, at which point you splice together 3 parts: Before, 1 second snippet, and after. So the only thing that needs to actually be in memory is the 1 second snippet, and maybe a small portion of the sound before and after the snippet.
So when you save, you read, say 64 megabytes of file at a time (if you are really aggressive), and stream that out to a temporary file, until you get to your insertion point. Then you stream out the 1 second snippet, stream the remainder of the original file, close the temporary write file, delete the original file, and rename the new file to the original file name.
Of course, it's a little more complicated than this. There might be multiple edits before save, for example, and an undo buffer. But I can pretty much guarantee you that Audacity is limited in unsaved edit complexity by the amount of available RAM.
