I would appreciate insight from anyone who can suggest the best (or a better) approach to editing large files, anywhere from 1 MB to 200 MB, using Node.js.
Our process needs to merge changes into an existing file on the filesystem. We receive the changed data in the following format, and it needs to be merged into the file at the position defined in the change details.
[{"range":{"startLineNumber":3,"startColumn":3,"endLineNumber":3,"endColumn":3},"rangeLength":0,"text":"\n","rangeOffset":4,"forceMoveMarkers":false},{"range":{"startLineNumber":4,"startColumn":1,"endLineNumber":4,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":5,"forceMoveMarkers":false},{"range":{"startLineNumber":5,"startColumn":1,"endLineNumber":5,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":6,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":1,"endLineNumber":6,"endColumn":1},"rangeLength":0,"text":"f","rangeOffset":7,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":2,"endLineNumber":6,"endColumn":2},"rangeLength":0,"text":"a","rangeOffset":8,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":3,"endLineNumber":6,"endColumn":3},"rangeLength":0,"text":"s","rangeOffset":9,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":4,"endLineNumber":6,"endColumn":4},"rangeLength":0,"text":"d","rangeOffset":10,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":5,"endLineNumber":6,"endColumn":5},"rangeLength":0,"text":"f","rangeOffset":11,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":6,"endLineNumber":6,"endColumn":6},"rangeLength":0,"text":"a","rangeOffset":12,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":7,"endLineNumber":6,"endColumn":7},"rangeLength":0,"text":"s","rangeOffset":13,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":8,"endLineNumber":6,"endColumn":8},"rangeLength":0,"text":"f","rangeOffset":14,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":9,"endLineNumber":6,"endColumn":9},"rangeLength":0,"text":"s","rangeOffset":15,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":10,"endLineNumber":6,"endColumn":10},"rangeLength":0,"text":"a","rangeOffset":16,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":11,"endLineNumber":6,"endColumn":11},"rangeLength":0,"text":"f","rangeOffset":17,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":12,"endLineNumber":6,"endColumn":12},"rangeLength":0,"text":"s","rangeOffset":18,"forceMoveMarkers":false}]
Simply opening the full file and merging those details would work, but it breaks down if we receive too many of those change details too frequently: the file ends up being opened many times, which can cause out-of-memory issues and is also very inefficient.
There is a similar question aimed specifically at C# here. If we open the file in stream mode, is there a similar example in Node.js?
I would appreciate insight from anyone who can suggest the best (or a better) approach to editing large files, anywhere from 1 MB to 200 MB, using Node.js.
Our process needs to merge changes into an existing file on the filesystem. We receive the changed data in the following format, and it needs to be merged into the file at the position defined in the change details.
General OS file systems do not directly support the concept of inserting data into a file. So, if you have a flat file and you want to insert data into it starting at a particular line number, you have to follow these steps:
Open the file and start reading from the beginning.
As you read data from the file, count lines until you reach the desired line number.
Then, if you're inserting new data, read ahead and buffer in memory an amount of existing file data equal in size to the data you intend to insert (so it isn't lost when overwritten).
Then write the new data to the file at the insertion position.
Now, using a second buffer the same size as the first, alternate between reading the next chunk of the file and writing out the previously buffered one.
Continue until the end of the file is reached and all data has been written back to the file (after the newly inserted data).
This has the effect of rewriting all the data after the insertion point back to the file so it will now correctly be in its new location in the file.
As you can tell, this is not at all efficient for large files, since you have to read the entire file one buffer at a time and rewrite the insertion plus everything after the insertion point.
In Node.js, you can use features of the fs module to carry out all these steps, but you have to write the logic to connect them yourself, as there is no built-in feature to insert new data into a file while pushing the existing data after it.
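In case a concrete starting point helps, here is a minimal Node.js sketch of those steps; insertAtLine() and the 64 KB block size are placeholders, not a vetted implementation. It takes one shortcut: instead of leapfrogging two buffers down the file, it grows the file first with fs.ftruncateSync() and shifts the tail block by block from the end backwards, which amounts to the same rewrite:

```js
'use strict';
const fs = require('fs');

// Insert `text` so that it begins at 1-based line `lineNo` of `path`.
// Steps 1-2: scan from the start, counting newlines to find the offset.
// Steps 3-5: grow the file, shift the tail backwards, fill in the gap.
function insertAtLine(path, lineNo, text) {
  const fd = fs.openSync(path, 'r+');
  try {
    const size = fs.fstatSync(fd).size;
    const buf = Buffer.alloc(64 * 1024); // never more than one block in memory
    let offset = 0;
    let line = 1;
    scan: while (line < lineNo && offset < size) {
      const n = fs.readSync(fd, buf, 0, buf.length, offset);
      if (n === 0) break;
      for (let i = 0; i < n; i++) {
        if (buf[i] === 0x0a && ++line === lineNo) { offset += i + 1; break scan; }
      }
      offset += n;
    }

    const insert = Buffer.from(text);
    fs.ftruncateSync(fd, size + insert.length); // make room at the end
    let pos = size;
    while (pos > offset) { // move the tail, last block first
      const chunk = Math.min(buf.length, pos - offset);
      pos -= chunk;
      fs.readSync(fd, buf, 0, chunk, pos);
      fs.writeSync(fd, buf, 0, chunk, pos + insert.length);
    }
    fs.writeSync(fd, insert, 0, insert.length, offset); // new data into the gap
  } finally {
    fs.closeSync(fd);
  }
}
```

Note that this still touches every byte after the insertion point, exactly as described above.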
There is a similar question aimed specifically at C# here. If we open the file in stream mode, is there a similar example in Node.js?
The C# example you reference appears to just be appending new data onto the end of the file. That's trivial to do in pretty much any file system library. In Node.js, you can do that with fs.appendFile(), or you can open any file handle in append mode and then write to it.
To insert data into a file more efficiently, you would need to use a more efficient storage system than a single flat file for all the data. For example, if you stored the file in pieces in approximately 100 line blocks, then to insert data you'd only have to rewrite a portion of one block of data and then perhaps have some cleanup process that rebalances the block boundaries if a block gets way too big or too small.
For efficient line management, you would need to maintain an accurate index of how many lines each file piece contains and obviously what order the pieces should be in. This would allow you to insert data at a somewhat fixed cost no matter how big the entire file was as the most you would need to do is to rewrite one or two blocks of data, even if the entire content was hundreds of GB in size.
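Purely as an illustration (there is no real library behind this), the index could be as small as an ordered array of block descriptors; finding the one block that owns a given line number is then a scan over cached line counts:

```js
// Hypothetical block index: an ordered list of pieces, each a small file
// on disk plus a cached count of the lines it contains.
const index = [
  { file: 'blocks/0001.txt', lineCount: 100 },
  { file: 'blocks/0002.txt', lineCount: 97 },
  // ...
];

// Find which block holds 1-based line `lineNo`, and the line's position
// within that block. Only that one block then needs to be rewritten.
function locate(lineNo) {
  let remaining = lineNo;
  for (const block of index) {
    if (remaining <= block.lineCount) return { block, lineInBlock: remaining };
    remaining -= block.lineCount;
  }
  return null; // past the end of the data
}
```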
Note, you would essentially be building a new file system on top of the OS file system in order to give yourself more efficient inserts or deletions within the overall data. Obviously, the chunks of data could also be stored in a database too and managed there.
Note, if this project is really an editor, text editing a line-based structure is a very well studied problem and you could also study the architectures used in previous projects for further ideas. It's a bit beyond the scope of a typical answer here to study the pros and cons of various architectures. If your system is also a client/server editor where the change instructions are being sent from a client to a server, that also affects some of the desired tradeoffs in the design since you may desire differing tradeoffs in terms of the number of transactions or the amount of data to be sent between client and server.
If some other language has an optimal way of doing this, then I think it would be better to find that option, since you're saying Node.js might not have one.
This doesn't really have anything to do with the language you choose. This is about how modern and typical operating systems store data in files.
In the fs module there is a function named appendFile. It lets you append data to your file. Link.
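For example (with a made-up file name):

```js
const fs = require('fs');

// Appends the text to big.log, creating the file if it does not exist.
fs.appendFile('big.log', 'one more line\n', (err) => {
  if (err) throw err;
});
```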
Imagine a huge file that my program has to edit. To speed up reads, I use mmap() and then read only the parts I'm viewing. However, if I want to add a line in the middle of the file, what's the best approach for that?
Is the only way to add a line and then move the rest of the file? That sounds expensive.
So my question is basically:
What's the most efficient way of adding data in the middle of a huge file?
The only way to insert data in the middle of any (huge or small) file (on Linux or POSIX) is to copy that file into a fresh one, then rename(2) the copy over the original. So you'll copy its head (up to the insertion point), append the new data to that copy, and then copy the tail (after the insertion point). You might also consider calling posix_fadvise(2) (or even the Linux-specific readahead(2)...), but that does not alleviate the need to copy all the data. mmap(2) might be used, e.g. to replace read(2), but whatever you do still requires you to copy all the data.
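For what it's worth, here is the head/insert/tail copy expressed as a Node.js sketch rather than C; the system-call pattern underneath (read, write, rename(2)) is the same, and all names are placeholders:

```js
'use strict';
const fs = require('fs');

// Copy-based insert: head + new data + tail into a temporary file,
// then rename it over the original (the rename(2) step).
// `insertOffset` is a byte offset; all names are placeholders.
function insertByCopy(path, insertOffset, data) {
  const tmp = path + '.tmp';
  const src = fs.openSync(path, 'r');
  const dst = fs.openSync(tmp, 'w');
  const buf = Buffer.alloc(1 << 20); // copy in 1 MiB chunks
  try {
    let pos = 0;
    const copyUpTo = (limit) => {
      while (pos < limit) {
        const n = fs.readSync(src, buf, 0, Math.min(buf.length, limit - pos), pos);
        if (n === 0) break;
        fs.writeSync(dst, buf, 0, n);
        pos += n;
      }
    };
    copyUpTo(insertOffset);               // the head, up to the insertion point
    fs.writeSync(dst, Buffer.from(data)); // the inserted data
    copyUpTo(fs.fstatSync(src).size);     // the tail, everything after it
  } finally {
    fs.closeSync(src);
    fs.closeSync(dst);
  }
  fs.renameSync(tmp, path); // replace the original with the copy
}
```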
Of course, if it happens that you are replacing a chunk of data in the middle of the file with another chunk of the same size (so no real insertion), you can use plain lseek(2) + write(2).
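In Node.js, the equivalent of that lseek(2) + write(2) pair is a single positional write, e.g. (file name and offset are made up):

```js
const fs = require('fs');

// Same-size replacement needs no copying: one positional write suffices.
// The replacement must be exactly as long as the bytes it overwrites.
const fd = fs.openSync('huge.dat', 'r+');
fs.writeSync(fd, Buffer.from('REPLACED'), 0, 8, 1024); // 8 bytes at offset 1024
fs.closeSync(fd);
```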
Is the only way to add a line and then move the rest of the file? That sounds expensive.
Yes, it is conceptually the only way.
You should consider using something other than a plain text file: look into SQLite or GDBM (they might be very efficient in your use case). See also this answer. Both provide you with a higher abstraction than POSIX files, and so give you the ability to "insert" data (of course, they are still internally based upon POSIX files).
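To sketch why that helps, here is a hypothetical line store using the third-party better-sqlite3 package; the schema and the fractional-key trick are illustrative choices of mine, not something SQLite prescribes:

```js
// Hypothetical line storage on top of SQLite. Each line is a row keyed
// by a REAL position, so inserting "between" two lines is just a cheap
// new row rather than a rewrite of everything that follows.
const Database = require('better-sqlite3');
const db = new Database('lines.db');
db.exec('CREATE TABLE IF NOT EXISTS lines (pos REAL PRIMARY KEY, text TEXT)');

// Insert `text` between the rows currently keyed a and b.
// (Repeated halving eventually exhausts REAL precision, so a real
// system would renumber the keys once in a while.)
function insertBetween(a, b, text) {
  db.prepare('INSERT INTO lines (pos, text) VALUES (?, ?)').run((a + b) / 2, text);
}

// Read the whole "file" back in order.
const allLines = db.prepare('SELECT text FROM lines ORDER BY pos').all();
```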
I have a 1TB file that needs to be modified with a simple change: Either delete the first line or prepend a few characters to the file. Because of space limitations, I cannot use commands that redirect to a new file or use editors that load the entire file into memory.
What is the best way to achieve this?
EDIT: The best tool I've found thus far is FART (http://fart-it.sourceforge.net/), but my file is encoded in UCS-2 and this tool doesn't seem to support it.
There is no simple way to do this with any mainstream operating system / file system, because they do not support "prepend to beginning of file" or "remove from beginning of file" operations, only "append to end of file" and "truncate file". Any solution will require reading the entire file and writing it back out with the desired changes.
This could be done with some relatively straightforward C code (or C++, or probably even Python or perl or whatever). However, there would be no backup, so if it doesn't go right... well, how important is this file?
The idea for the insert case would be to use ftruncate() to extend the file by the size of the new data, then, working from the end of the file toward the start (to avoid overwriting any of the existing data), read a block and write it back out offset by the correct amount. Finally, write the new data at the front.
The deletion case would work by finding the first byte past what you want to delete and, starting there, reading blocks and writing them back offset toward the front of the file by the proper amount, and then at the end using ftruncate() to cut the extra bytes off the end of the file.
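Here is a sketch of both cases. It is in Node.js rather than C, which the answer's "or probably even Python or perl or whatever" seems to permit; BLOCK, the function names, and the (absent) error handling are placeholders. For the delete-first-line case you would first scan the leading block for the first newline to compute cut, remembering that UCS-2 uses two bytes per character:

```js
'use strict';
const fs = require('fs');
const BLOCK = 1 << 20; // 1 MiB working buffer, never the whole file

// Prepend `data` by growing the file, then shifting everything toward
// the end, working backwards so no unread data is overwritten.
function prepend(path, data) {
  const fd = fs.openSync(path, 'r+');
  const size = fs.fstatSync(fd).size;
  const extra = Buffer.from(data);
  fs.ftruncateSync(fd, size + extra.length);
  const buf = Buffer.alloc(BLOCK);
  let pos = size;
  while (pos > 0) {
    const n = Math.min(BLOCK, pos);
    pos -= n;
    fs.readSync(fd, buf, 0, n, pos);
    fs.writeSync(fd, buf, 0, n, pos + extra.length);
  }
  fs.writeSync(fd, extra, 0, extra.length, 0); // new bytes at the front
  fs.closeSync(fd);
}

// Delete the first `cut` bytes by shifting everything toward the front,
// then truncating the now-duplicated tail off the end.
function deleteFront(path, cut) {
  const fd = fs.openSync(path, 'r+');
  const size = fs.fstatSync(fd).size;
  const buf = Buffer.alloc(BLOCK);
  let pos = cut;
  while (pos < size) {
    const n = fs.readSync(fd, buf, 0, Math.min(BLOCK, size - pos), pos);
    if (n === 0) break;
    fs.writeSync(fd, buf, 0, n, pos - cut);
    pos += n;
  }
  fs.ftruncateSync(fd, size - cut);
  fs.closeSync(fd);
}
```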
These are obviously not safe operations if they are interrupted for any reason, or if the code performing them has not been well tested beforehand - but it can be done. Buying the additional storage needed to retain the original file while writing out the new one would probably be a considerably better investment.
I am building a driver and I want to read some files.
Is there any way to use ZwReadFile() or a similar function to read the contents of the files line by line, so that I can process them in a loop?
The documentation on MSDN states:
ZwReadFile begins reading from the given ByteOffset or the current file position into the given Buffer. It terminates the read operation under one of the following conditions:
The buffer is full because the number of bytes specified by the Length parameter has been read. Therefore, no more data can be placed into the buffer without an overflow.
The end of file is reached during the read operation, so there is no more data in the file to be transferred into the buffer.
Thanks.
No, there is not. You'll have to create a wrapper to achieve what you want.
However, given that kernel-mode code has the potential to crash the system rather than just the process it runs in, you have to make sure that problems familiar from user mode, such as very long lines, will not cause issues.
If the amount of data is (and will stay) below the threshold of what registry values can hold, you should use the registry instead. In particular REG_MULTI_SZ, which has the properties you are looking for ("line-wise" storage of data).
In this situation, unless performance is critical (as in 'real-time'), I would pass the filtering to a user-mode service or application. Send the file name to the application to process. A user-mode application is easier to test and easier to debug. It won't blue-screen or hang your box, either.
There are some standard tools to do this, but I need a simple GUI to assist some users (on Windows). They will get an open-file dialog and pick the file to process.
The file will be an XML file. The file will contain (within the first few lines) a text string that needs to be deleted or replaced with whitespace (doesn't matter which).
The problem is that the XML file is several gigabytes big, but the fixed search-and-replace string will occur within the first 4 KB or so.
What's the best way to overwrite the search string and save in-place without requiring reading of whole amount into memory and or writing excessively to disk?
Obviously, replacing with whitespace so that the overall size of the file doesn't change is the best choice here; otherwise you must stream through the entire file to update it on disk.
If this was for a Unix environment, I would look into using mmap() to map a suitable part of the start of the file into RAM, then edit it in-place and be done.
This snippet shows how to use the Win32 equivalent, the CreateFileMapping() function.
You can easily write your own tool. If the string is near the very beginning, then any brute-force approach will work. Just keep scanning until you find it.
However, avoiding a lot of disk writes is only possible if you do not change the file size. If you wish to delete or insert bytes somewhere in the middle, you will have to rewrite everything that follows them, which in your case would be practically the whole file. So you'll have to replace the string with whitespace. As long as you just replace one byte with another, there is no overhead.
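A minimal Node.js sketch of that approach; the file name, search string, and scan window are placeholders, and a single-byte-compatible encoding such as UTF-8 is assumed:

```js
'use strict';
const fs = require('fs');

// Blank out the first occurrence of `needle` near the start of the file
// by overwriting it in place with spaces. Only the leading chunk is read;
// the rest of the multi-gigabyte file is never touched.
function blankNearStart(path, needle, scanBytes = 64 * 1024) {
  const fd = fs.openSync(path, 'r+');
  try {
    const buf = Buffer.alloc(scanBytes);
    const n = fs.readSync(fd, buf, 0, scanBytes, 0);
    const at = buf.slice(0, n).indexOf(needle);
    if (at === -1) return false; // not within the scanned region
    const spaces = Buffer.alloc(Buffer.byteLength(needle), 0x20);
    fs.writeSync(fd, spaces, 0, spaces.length, at); // same size: no shifting
    return true;
  } finally {
    fs.closeSync(fd);
  }
}
```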