Editing large data files

I'm about to start on a project in which I foresee there being large files (mostly flat text files, but possibly CSV, fixed-width, XML, ... so far) that need to be edited. I need to develop the pieces to do this editing within the application.
In trying to determine a Good Way to handle editing large amounts of data (possibly into the GB range) without having to load the whole thing, I found that Audacity is able to handle large files quite well. Audacity is open source, so I thought it would make an excellent teaching tool for me in this circumstance. However, I started thinking myself in circles going through the code and now I'm thoroughly confused.
I'm hoping for two results from this question:
A good way to handle this editing without loading the whole file. I thought about loading the data as the user edits it, caching it on demand.
An explanation of how Audacity does it.
I'm using C# and .NET, but the answers don't need to be coupled to that environment.

Several tricks can make editing simpler and faster.
INDEX it for faster access. While the user is doing nothing, skim through the file and create an index so you can quickly find a specific spot in the file (see below).
Only store changes the user makes. Don't try to apply them directly to the file until the user saves.
Set a limit on how much to read into memory when the user jumps to a point. Read one or two screens of data initially so you can display it, and then if the user doesn't jump to a new spot immediately, read a bit before and a bit after the current spot.
Indexing:
When the user wants to jump to line X or timestamp T, you don't want to skim the whole file counting line breaks and characters. Skim the data, and create a record. Say, every 50 lines, record the byte offset, character count, and line number. This data can be stored in a hashtable, tree, or just an ordered list. Then when the user jumps within the file, you can find the nearest index spot and read from there until you find the requested point. This technique is especially useful when working with Unicode, where the number of bytes per character may vary. If files are so large a full index won't fit in memory, you may want to limit the index points and space them more widely, or store the index in a temporary file.
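As a concrete sketch of that indexing idea (plain C, purely illustrative; it tracks only byte offsets, and a real index for a Unicode file would also need the character counts mentioned above):

    #include <stdio.h>
    #include <stdlib.h>

    #define INDEX_INTERVAL 50   /* record an entry every 50 lines */

    typedef struct {
        long line;     /* first line number covered by this entry (0-based) */
        long offset;   /* byte offset where that line starts */
    } IndexEntry;

    /* Scan the file once and record the byte offset of every 50th line.
     * (A real version would read in larger chunks rather than char by char.) */
    static IndexEntry *build_index(FILE *f, size_t *count)
    {
        size_t cap = 1024, n = 0;
        IndexEntry *idx = malloc(cap * sizeof *idx);
        long line = 0, offset = 0;
        int ch;

        idx[n++] = (IndexEntry){0, 0};          /* line 0 starts at offset 0 */
        while ((ch = fgetc(f)) != EOF) {
            offset++;
            if (ch == '\n') {
                line++;
                if (line % INDEX_INTERVAL == 0) {
                    if (n == cap)
                        idx = realloc(idx, (cap *= 2) * sizeof *idx);
                    idx[n++] = (IndexEntry){line, offset};
                }
            }
        }
        *count = n;
        return idx;
    }

    /* Find the nearest indexed spot at or before the requested line,
     * then read forward from there until the exact line is reached. */
    static long offset_of_line(FILE *f, const IndexEntry *idx, size_t count, long target)
    {
        size_t i = (size_t)(target / INDEX_INTERVAL);
        if (i >= count) i = count - 1;
        long line = idx[i].line, offset = idx[i].offset;
        int ch;

        fseek(f, offset, SEEK_SET);
        while (line < target && (ch = fgetc(f)) != EOF) {
            offset++;
            if (ch == '\n') line++;
        }
        return offset;   /* byte offset where `target` begins */
    }
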
Editing and altering big files:
As suggested by Harvey -- only store the changes in memory (as diffs), and then apply them to the file on save by streaming from input to output. A tree or ordered list may be helpful, so you can quickly find the next place where you need to make a change when writing from input to output.
If changes are too large to fit in memory, you may want to track them in a separate temporary file (perhaps in the same folder as the original). You can just keep writing a continuous list of changes, with new ones appended to this change file. When you save, you'll read through the change list and create a final list of alterations to apply, before deleting the temp file. For performance reasons, it may be helpful to avoid rewriting the change log file; instead, just append to the end of it, and remove redundant or cancelling edits when performing a save.
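One possible shape for those change records, sketched in C (the names and the length-prefixed log format are made up for illustration, not taken from any particular editor): each edit remembers where it applies in the original file, how many original bytes it replaces, and what text goes in their place. Appending each record to the temporary change file gives you the on-disk log described above.

    #include <stdio.h>

    /* One edit: at `offset` in the original file, remove `remove_len` bytes
     * and insert `insert_len` bytes of `text`. A pure insertion has
     * remove_len == 0; a pure deletion has insert_len == 0.
     * (For Undo you would also keep a copy of the bytes being removed.) */
    typedef struct {
        long   offset;
        long   remove_len;
        long   insert_len;
        char  *text;
    } Edit;

    /* Append one edit record to the change-log file (simple length-prefixed
     * binary layout, purely illustrative). */
    static int log_edit(FILE *log, const Edit *e)
    {
        if (fwrite(&e->offset, sizeof e->offset, 1, log) != 1) return -1;
        if (fwrite(&e->remove_len, sizeof e->remove_len, 1, log) != 1) return -1;
        if (fwrite(&e->insert_len, sizeof e->insert_len, 1, log) != 1) return -1;
        if (e->insert_len > 0 &&
            fwrite(e->text, 1, (size_t)e->insert_len, log) != (size_t)e->insert_len)
            return -1;
        return fflush(log) == 0 ? 0 : -1;
    }
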
Interesting fact: the same structures you use for the change log can be used to provide Undo/Redo information.

Sound files are basically a data stream, right? So you don't actually need to deal with the whole file at once. Audacity users may only work with a small snippet of that large file at any given moment.
Hypothetically, if you are adding a 1-second snippet of sound to a large sound file, you only actually have to deal with the entire file when you save, at which point you splice together 3 parts: the part before, the 1-second snippet, and the part after. So the only thing that needs to actually be in memory is the 1-second snippet, and maybe a small portion of the sound before and after it.
So when you save, you read, say, 64 megabytes of the file at a time (if you are really aggressive) and stream that out to a temporary file, until you get to your insertion point. Then you stream out the 1-second snippet, stream the remainder of the original file, close the temporary write file, delete the original file, and rename the new file to the original file name.
Of course, it's a little more complicated than this. There might be multiple edits before save, for example, and an undo buffer. But I can pretty much guarantee you that Audacity is limited in unsaved edit complexity by the amount of available RAM.
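A minimal sketch of that splice-on-save (plain C, nothing Audacity-specific; the temp file name and 64 KB buffer are arbitrary choices): copy everything up to the insertion point into a temporary file, write the snippet, copy the rest, then swap the temporary file into place.

    #include <stdio.h>

    /* Copy `count` bytes from `in` to `out` (count < 0 means "until EOF"). */
    static int copy_bytes(FILE *in, FILE *out, long count)
    {
        char buf[64 * 1024];
        while (count != 0) {
            size_t want = sizeof buf;
            if (count > 0 && (long)want > count) want = (size_t)count;
            size_t got = fread(buf, 1, want, in);
            if (got == 0) break;                      /* EOF or error */
            if (fwrite(buf, 1, got, out) != got) return -1;
            if (count > 0) count -= (long)got;
        }
        return ferror(in) ? -1 : 0;
    }

    /* Save `orig` with `snippet` (snippet_len bytes) inserted at byte `pos`. */
    static int splice_save(const char *orig, long pos,
                           const char *snippet, size_t snippet_len)
    {
        FILE *in = fopen(orig, "rb");
        FILE *out = fopen("save.tmp", "wb");
        if (!in || !out) { if (in) fclose(in); if (out) fclose(out); return -1; }

        int rc = copy_bytes(in, out, pos);                      /* before  */
        if (rc == 0 && fwrite(snippet, 1, snippet_len, out) != snippet_len)
            rc = -1;                                            /* snippet */
        if (rc == 0) rc = copy_bytes(in, out, -1);              /* after   */

        fclose(in);
        if (fclose(out) != 0) rc = -1;
        if (rc == 0) {
            remove(orig);                  /* as described above: delete ... */
            rc = rename("save.tmp", orig); /* ...and rename the temp file    */
        }
        return rc;
    }
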

Related

What is the optimal way of merge few lines or few words in the large file using NodeJS?

I would appreciate insight from anyone who can suggest the best (or a better) solution for editing large files, ranging anywhere from 1 MB to 200 MB, using Node.js.
Our process needs to merge lines into an existing file on the filesystem. We receive the changed data in the following format, which needs to be merged into that file at the positions defined in the change details.
[{"range":{"startLineNumber":3,"startColumn":3,"endLineNumber":3,"endColumn":3},"rangeLength":0,"text":"\n","rangeOffset":4,"forceMoveMarkers":false},{"range":{"startLineNumber":4,"startColumn":1,"endLineNumber":4,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":5,"forceMoveMarkers":false},{"range":{"startLineNumber":5,"startColumn":1,"endLineNumber":5,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":6,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":1,"endLineNumber":6,"endColumn":1},"rangeLength":0,"text":"f","rangeOffset":7,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":2,"endLineNumber":6,"endColumn":2},"rangeLength":0,"text":"a","rangeOffset":8,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":3,"endLineNumber":6,"endColumn":3},"rangeLength":0,"text":"s","rangeOffset":9,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":4,"endLineNumber":6,"endColumn":4},"rangeLength":0,"text":"d","rangeOffset":10,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":5,"endLineNumber":6,"endColumn":5},"rangeLength":0,"text":"f","rangeOffset":11,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":6,"endLineNumber":6,"endColumn":6},"rangeLength":0,"text":"a","rangeOffset":12,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":7,"endLineNumber":6,"endColumn":7},"rangeLength":0,"text":"s","rangeOffset":13,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":8,"endLineNumber":6,"endColumn":8},"rangeLength":0,"text":"f","rangeOffset":14,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":9,"endLineNumber":6,"endColumn":9},"rangeLength":0,"text":"s","rangeOffset":15,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":10,"endLineNumber":6,"endColumn":10},"rangeLength":0,"text":"a","rangeOffset":16,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":11,"endLineNumber":6,"endColumn":11},"rangeLength":0,"text":"f","rangeOffset":17,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":12,"endLineNumber":6,"endColumn":12},"rangeLength":0,"text":"s","rangeOffset":18,"forceMoveMarkers":false}]
Just opening the full file and merging those details would work, but it breaks down if we receive too many of these change details very frequently: the file gets opened many times, which can cause out-of-memory issues and is also very inefficient.
There is a similar question aimed specifically at C# here. If we open the file in stream mode, is there a similar example in Node.js?
General OS file systems do not directly support the concept of inserting info into a file. So, if you have a flat file and you want to insert data into it starting at a particular line number, you have to do the following steps:
Open the file and start reading from the beginning.
As you read data from the file, count lines until you reach the desired line number.
Then, if you're inserting new data, you need to read some more and buffer into memory the amount of data you intend to insert.
Then do a write to the file at the position of insertion of the data to insert.
Now, using a second buffer the same size as the data you inserted, leapfrog down the file: read the next chunk into one buffer, then write the previously read chunk out in its place, and swap buffers.
Continue until the end of the file is reached and all the data has been written back to the file (after the newly inserted data).
This has the effect of rewriting all the data after the insertion point back to the file so it will now correctly be in its new location in the file.
As you can tell, this is not efficient at all for large files as you have to read the entire file a buffer at a time and you have to write the insertion and everything after the insertion point.
In node.js, you can use features in the fs module to carry out all these steps, but you have to write the logic to connect them all together as there is no built-in feature to insert new data into a file while pushing the existing data after it.
There is a similar question aimed specifically at C# here. If we open the file in stream mode, is there a similar example in Node.js?
The C# example you reference appears to just be appending new data onto the end of the file. That's trivial to do in pretty much any file system library. In node.js, you can do that with fs.appendFile() or you can open any file handle in append mode and then write to it.
To insert data into a file more efficiently, you would need to use a more efficient storage system than a single flat file for all the data. For example, if you stored the file in pieces in approximately 100 line blocks, then to insert data you'd only have to rewrite a portion of one block of data and then perhaps have some cleanup process that rebalances the block boundaries if a block gets way too big or too small.
For efficient line management, you would need to maintain an accurate index of how many lines each file piece contains and obviously what order the pieces should be in. This would allow you to insert data at a somewhat fixed cost no matter how big the entire file was as the most you would need to do is to rewrite one or two blocks of data, even if the entire content was hundreds of GB in size.
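A sketch of the bookkeeping that block scheme needs (hypothetical C, not a real library): an ordered list of blocks, each knowing which piece file holds it and how many lines it contains. To insert at a global line number you walk the list to find the one affected block, rewrite only that piece file, and bump its counter.

    #include <stddef.h>

    /* One piece of the logical file, stored as its own small file on disk. */
    typedef struct Block {
        char          path[64];   /* piece file, e.g. "data.0042" (made up)  */
        long          lines;      /* how many lines this piece contains      */
        struct Block *next;       /* blocks kept in logical order            */
    } Block;

    /* Locate the block containing `line` (0-based across the whole logical
     * file) and the line's position within that block. Cost grows with the
     * number of blocks, not with the total data size. */
    static Block *find_block(Block *head, long line, long *line_in_block)
    {
        for (Block *b = head; b; b = b->next) {
            if (line < b->lines) {
                *line_in_block = line;
                return b;       /* only this piece needs to be rewritten */
            }
            line -= b->lines;
        }
        return NULL;            /* past the end of the file */
    }

    /* After inserting `added` lines into block `b` (by rewriting just that
     * piece file), only its counter changes; every other block is untouched.
     * A cleanup pass can later split blocks that grow far beyond ~100 lines. */
    static void note_insert(Block *b, long added)
    {
        b->lines += added;
    }
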
Note, you would essentially be building a new file system on top of the OS file system in order to give yourself more efficient inserts or deletions within the overall data. Obviously, the chunks of data could also be stored in a database too and managed there.
Note, if this project is really an editor, text editing a line-based structure is a very well studied problem and you could also study the architectures used in previous projects for further ideas. It's a bit beyond the scope of a typical answer here to study the pros and cons of various architectures. If your system is also a client/server editor where the change instructions are being sent from a client to a server, that also affects some of the desired tradeoffs in the design since you may desire differing tradeoffs in terms of the number of transactions or the amount of data to be sent between client and server.
If some other language has an optimal way of doing this, then I think it would be better to find that option, since you're saying Node.js might not have one.
This doesn't really have anything to do with the language you choose. This is about how modern and typical operating systems store data in files.
In the fs module there is a function named appendFile. It lets you append data to your file. Link.

How to process a file from one to another without doubled storage requirements? [duplicate]

(This question was closed as a duplicate of "Remove beginning of file without rewriting the whole file".)
I'm creating an archiver, which processes data in two steps:
creates a temporary archive file
from the temporary archive file, it creates the final archive. After the final archive is created, the temporary file is deleted
The 2nd step processes temporary archive file linearly, and the result is written to the final archive while processing. So this process needs twice as much storage (temporarily) as the archive file.
I'd like to avoid the double storage need. So my idea is that during processing, I'd like to tell the OS that it can drop the processed part of the temporary file. Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?
Write all the data. Then shift the data by opening the file twice: once for reading and once for writing in overwrite mode (inject the table of contents, and make sure each chunk is read before its location is overwritten).
If the table of contents has a fixed length, then preallocate that much space at the start of the file and avoid the shifting entirely.
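A sketch of that preallocation trick in C (the 4 KB table size and the file layout are assumptions for illustration): reserve the header space up front with a placeholder, stream the archive data after it, then seek back and fill in the real table at the end. No data ever has to be shifted.

    #include <stdio.h>

    #define TOC_SIZE 4096   /* fixed, known in advance */

    int write_archive(const char *path)
    {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;

        /* 1. Reserve space for the table of contents with a placeholder. */
        char toc[TOC_SIZE] = {0};
        if (fwrite(toc, 1, sizeof toc, f) != sizeof toc) { fclose(f); return -1; }

        /* 2. Stream the archive payload directly after the reserved area,
         *    filling in `toc` (offsets, sizes, ...) as entries are written. */
        /* ... write entries, record their offsets into toc ... */

        /* 3. Seek back to the start and overwrite the placeholder with the
         *    real table of contents. */
        if (fseek(f, 0, SEEK_SET) != 0) { fclose(f); return -1; }
        if (fwrite(toc, 1, sizeof toc, f) != sizeof toc) { fclose(f); return -1; }

        return fclose(f) == 0 ? 0 : -1;
    }
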
Like a truncate call, but it should truncate the file at the beginning, not the end. Is it possible to do something like this?
No, that is not possible with plain files. The closest thing is the Linux-specific fallocate(2), but it is not portable and might not work with every file system, so I don't recommend relying on it.
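For completeness, this is roughly what using it would look like (Linux only; the collapse-range mode shown is an assumption about the intended call, and both the offset and length must be multiples of the file-system block size, which is exactly why it is hard to recommend):

    #define _GNU_SOURCE
    #include <fcntl.h>      /* fallocate, FALLOC_FL_COLLAPSE_RANGE */
    #include <unistd.h>

    /* Drop the first `len` bytes of the file without rewriting the rest.
     * `len` (and the offset, here 0) must be a multiple of the file system
     * block size, and only some file systems (e.g. ext4, XFS) support it. */
    int drop_file_head(int fd, off_t len)
    {
        return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len);
    }
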
Instead, look into SQLite and GDBM indexed files. They provide an abstraction (above plain files) which enables you to "delete records".
Or just keep all the data in memory temporarily.
Or consider a double-pass (or multiple-pass) approach. Maybe nftw(3) could be useful.
(Today, disk space is very cheap, so your requirement of avoiding the doubled storage need is really strange; if you are handling a huge amount of data, you should have mentioned it.)

How to edit a big file

Imagine a huge file that should be edited by my program. In order to speed up reading, I use mmap() and then read only the parts I'm viewing. However, if I want to add a line in the middle of the file, what's the best approach for that?
Is the only way to add a line and then move the rest of the file? That sounds expensive.
So my question is basically:
What's the most efficient way of adding data in the middle of a huge file?
The only way to insert data in the middle of any (huge or small) file (on Linux or POSIX) is to copy that file (into a fresh one, then later rename(2) the copy as the original). So you'll copy its head (up to the insertion point), append the new data to that copy, and then copy the tail (after the insertion point). You might also consider calling posix_fadvise(2) (or even the Linux-specific readahead(2)...), but that does not alleviate the need to copy all the data. mmap(2) might be used e.g. to replace read(2), but whatever you do requires you to copy all the data.
Of course, if it happens that you are replacing a data chunk in the middle of the file by another chunk of the same size (so no real insertion), you can use plain lseek(2) + write(2).
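That same-size overwrite is just a seek followed by a write; a minimal POSIX sketch:

    #include <unistd.h>
    #include <sys/types.h>

    /* Overwrite `len` bytes at byte position `pos` with `buf`.
     * Only valid when the replacement is exactly the same size as the
     * original chunk, so the rest of the file keeps its offsets. */
    ssize_t replace_in_place(int fd, off_t pos, const void *buf, size_t len)
    {
        if (lseek(fd, pos, SEEK_SET) == (off_t)-1)
            return -1;
        return write(fd, buf, len);   /* pwrite(fd, buf, len, pos) does both */
    }
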
Is the only way to add a line and then move the rest of the file? That sounds expensive.
Yes it is conceptually the only way.
You should consider using something other than a plain textual file: look into SQLite or GDBM (they might be very efficient in your use case). See also this answer. Both provide a higher abstraction than POSIX files, and so give you the ability to "insert" data (of course, they are still internally based upon and using POSIX files).
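As an illustration of what that higher abstraction can look like with SQLite's C API (a sketch only, not code from the answer): store each line as a row keyed by a floating-point position, so "inserting in the middle" becomes inserting a row whose key falls between its neighbours, which SQLite's B-tree handles without rewriting the rest of the data.

    #include <sqlite3.h>

    /* Keep the "file" as one row per line, ordered by a REAL sort key.
     * Inserting between two lines just means picking a key between theirs. */
    int demo(void)
    {
        sqlite3 *db;
        if (sqlite3_open("bigfile.db", &db) != SQLITE_OK) return 1;

        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS lines(pos REAL PRIMARY KEY, txt TEXT);"
            "INSERT INTO lines VALUES (1.0, 'first line'), (2.0, 'second line');",
            NULL, NULL, NULL);

        /* "Insert a line in the middle": a key between 1.0 and 2.0. */
        sqlite3_exec(db,
            "INSERT INTO lines VALUES (1.5, 'inserted between the two');",
            NULL, NULL, NULL);

        /* Read the document back in order: SELECT txt FROM lines ORDER BY pos; */
        sqlite3_close(db);
        return 0;
    }

A real implementation would occasionally renumber the keys, since repeatedly splitting the same interval eventually exhausts floating-point precision.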

Windows modify text without creating new file

I have a 1TB file that needs to be modified with a simple change: Either delete the first line or prepend a few characters to the file. Because of space limitations, I cannot use commands that redirect to a new file or use editors that load the entire file into memory.
What is the best way to achieve this?
EDIT: The best tool I've found thus far is FART http://fart-it.sourceforge.net/, but my file is encoded in UCS-2 and this tool doesn't seem to support it.
There is no simple way to do this with any mainstream operating system / file system, because they do not support "prepend to beginning of file" or "remove from beginning of file" operations, only "append to end of file" and "truncate file". Any solution will require reading the whole file and writing it back out with the desired changes.
This could be done with some relatively straightforward C code (or C++, or probably even Python or perl or whatever). However, there would be no backup, so if it doesn't go right... well, how important is this file?
The idea for the insert case would be to use ftruncate() to extend the file by the size of the new bits, then, working from the end of the file back toward the start (so you never overwrite data you have not yet moved), read a block and write it back out offset by the correct amount. Then write the new data at the front.
The deletion case would work by finding the first byte past what you want to delete and, starting there, reading blocks and writing them back shifted toward the front of the file by the proper amount; at the end, ftruncate() the extra bytes off the end of the file.
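A rough sketch of the insert (prepend) case in POSIX-style C; on Windows the same idea applies with the equivalent file APIs. It extends the file, moves blocks from back to front so nothing unmoved is ever overwritten, and finally writes the new data at the start:

    #include <unistd.h>
    #include <sys/stat.h>

    #define BLOCK (64 * 1024)   /* move the file in 64 KB chunks */

    /* Prepend `add_len` bytes (`add`) to an already-open file. Everything
     * after the front is shifted toward the end, working from the back of
     * the file so no unmoved data is ever overwritten. */
    int prepend_bytes(int fd, const char *add, size_t add_len)
    {
        struct stat st;
        if (fstat(fd, &st) != 0) return -1;
        off_t old_size = st.st_size;

        /* 1. Grow the file to make room for the new bytes. */
        if (ftruncate(fd, old_size + (off_t)add_len) != 0) return -1;

        /* 2. Shift existing data towards the end, last block first. */
        char buf[BLOCK];
        off_t remaining = old_size;
        while (remaining > 0) {
            size_t chunk = remaining > BLOCK ? BLOCK : (size_t)remaining;
            off_t  src   = remaining - (off_t)chunk;
            if (pread(fd, buf, chunk, src) != (ssize_t)chunk) return -1;
            if (pwrite(fd, buf, chunk, src + (off_t)add_len) != (ssize_t)chunk) return -1;
            remaining = src;
        }

        /* 3. Write the new data into the hole at the front. */
        if (pwrite(fd, add, add_len, 0) != (ssize_t)add_len) return -1;
        return 0;
    }
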
These are obviously not safe operations if they are interrupted for any reason, or if the code to perform them has not been well tested beforehand - but it can be done. Buying the additional storage to make it possible to retain the original file while writing out the new one would probably be a considerably better investment.

Efficient in-line search and replace for large file

There are some standard tools to do this, but I need a simple GUI to assist some users (on Windows). They will get an open file dialog and pick the file to process.
The file will be an XML file. The file will contain (within the first few lines) a text string that needs to be deleted or replaced with whitespace (doesn't matter which).
The problem is that the XML file is several gigabytes big but the fixed search and replace string will occur within the first 4k or so.
What's the best way to overwrite the search string and save in-place without requiring reading of whole amount into memory and or writing excessively to disk?
Obviously, replacing with whitespace (so the size of the file as a whole doesn't change) is the best choice here; otherwise you must stream through the entire file to update it on disk.
If this was for a Unix environment, I would look into using mmap() to map a suitable part of the start of the file into RAM, then edit it in-place and be done.
This snippet shows how to use the Win32 equivalent, the CreateFileMapping() function.
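Since the linked snippet isn't reproduced here, this is roughly what that approach looks like with the Win32 API (a sketch with error handling trimmed; it assumes the search string is single-byte text that sits entirely within the first 4 KB of a much larger file, per the question):

    #include <windows.h>
    #include <string.h>

    /* Map the first 4 KB of the file, blank out the first occurrence of
     * `needle` with spaces, flush, and unmap. The rest of the multi-GB file
     * is never read or written. */
    BOOL blank_out_string(const char *path, const char *needle)
    {
        BOOL found = FALSE;
        HANDLE file = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE) return FALSE;

        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, 0, NULL);
        if (mapping) {
            const SIZE_T view_size = 4096;          /* only the head of the file */
            char *view = (char *)MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, view_size);
            if (view) {
                size_t nlen = strlen(needle);
                for (SIZE_T i = 0; i + nlen <= view_size; i++) {
                    if (memcmp(view + i, needle, nlen) == 0) {
                        memset(view + i, ' ', nlen);  /* same length: size unchanged */
                        found = TRUE;
                        break;
                    }
                }
                FlushViewOfFile(view, view_size);
                UnmapViewOfFile(view);
            }
            CloseHandle(mapping);
        }
        CloseHandle(file);
        return found;
    }
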
You can easily write your own tool. If it is in the very beginning, then any brute-force approach will work. Just keep on scanning until you find it.
However, avoiding a lot of disk writes is only possible if you do not change the file size. If you wish to delete or insert bytes somewhere in the middle, you will have to overwrite everything that follows them, which in your case would be practically all of the file. So you'll have to replace the string with whitespace. As long as you just replace one byte with another, there will be no overhead.

Resources