Spark: Provide input data as file to external executable

Spark: Provide input data as file to external executable - apache-spark

Using (py)Spark, is it possible somehow to provide input data to an executable as a file in an efficient manner?
What I currently do is:
listOfPaths=["path1.bin","path2.bin",...]
rdd_paths=spark_context.parallelize(listOfPaths)
def downloadAndRun(path):
localFilePath = downloadFile(path)
subprocess.Popen(["executable.exe", "--input", localFilePath])
rdd_paths.map(lambda x: downloadAndRun(x))
But this downloads the file from an accessible location each time to the worker node, stores it locally and then runs the application on it. This works, but is inefficient.
Since I need to run many iterations of that kind I rather would like to have the input data inside an rdd itself, which then would need to be provided to the 3rd party executable in the map stage. Ideally directly out of memory. Sadly I cannot change the executable itself, which only takes data via its --input argument.
Storing the input data inside an rdd might be possible by using a bytestream or something similar, but what about providing it to the executable?
Is this possible somehow without dumping the bytestream to a local file each time? I imagine something like a RamDisk or so for making the data a "virtual" file.
Any hints or suggestions are highly appreciated.
Thanks!

Related

Node.js: manipulate file like a stack

I'm envisioning a implementation in node.js that can manipulate a file on disk as if it's a stack data struct.
Suppose file is utf-8 encoded plain text, each element of the stack corresponds to a '\n' delimited line in the file, and top of stack point to first line of that file. I want something that can simultaneously read and write the file.
const file = new FileAsStack("/path/to/file");
// read the first line from the file,
// also remove that line from the file.
let line = await file.pop();
To implement such interface naively, I could simply read the whole file into memory, and when .pop() read from memory, and write the remainder back to disk. Obviously such approach isn't ideal. Imagine dealing with a 10GB file, it'll be both memory intensive and I/O intensive.
With fs.read() I can read just a slice of the file, so the "read" part is solved. But the "write" part I have no idea. How can I effectively take just one line, and write the rest of the file back to it? I hope I don't have to read every bytes of that file into the memory then write back to disk...
I remember vaguely that file in a filesystem is just a pointer to a position on disk, is there any way I can simply move the pointer to the start of next line?
I need some insight into what syscalls or whatever can do this effectively, but I'm quite ignorant to low level system stuffs. Any help is appreciated!

What you're asking for is not something that a standard file system can do. You can't insert data into the beginning of a file in any traditional OS file system without rewriting the entire file. That's just the way they work.
Systems that absolutely need to be able to do something like that without rewriting the entire file and still use a traditional OS file system will build their own mini file system on top of the regular file system so that one virtual file consists of many pieces written to separate files or to separate blocks of a file. Then, in a system like that, you can insert data at the beginning of a virtual file without rewriting any of the existing data by writing a new block of data to disk and then updating your virtual file index (stored in some other file) to indicate that the first block of your virtual file now comes from a specific location. This file index specifies the order of the blocks of data in the file and where they come from.
Most programs that need to do something like this will instead use a database for storing records and then use indexes and queries for controlling order and let the underlying database worry about where individual bits get stored on disk. In this way, you can very efficiently insert data anywhere you want in a resulting query.

What is the optimal way of merge few lines or few words in the large file using NodeJS?

I would appreciate insight from anyone who can suggest the best or better solution in editing large files anyway ranges from 1MB to 200MB using nodejs.
Our process needs to merge lines to an existing file in the filesystem, we get the changed data in the following format which needs to be merged to filesystem file at the position defined in the changed details.
[{"range":{"startLineNumber":3,"startColumn":3,"endLineNumber":3,"endColumn":3},"rangeLength":0,"text":"\n","rangeOffset":4,"forceMoveMarkers":false},{"range":{"startLineNumber":4,"startColumn":1,"endLineNumber":4,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":5,"forceMoveMarkers":false},{"range":{"startLineNumber":5,"startColumn":1,"endLineNumber":5,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":6,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":1,"endLineNumber":6,"endColumn":1},"rangeLength":0,"text":"f","rangeOffset":7,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":2,"endLineNumber":6,"endColumn":2},"rangeLength":0,"text":"a","rangeOffset":8,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":3,"endLineNumber":6,"endColumn":3},"rangeLength":0,"text":"s","rangeOffset":9,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":4,"endLineNumber":6,"endColumn":4},"rangeLength":0,"text":"d","rangeOffset":10,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":5,"endLineNumber":6,"endColumn":5},"rangeLength":0,"text":"f","rangeOffset":11,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":6,"endLineNumber":6,"endColumn":6},"rangeLength":0,"text":"a","rangeOffset":12,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":7,"endLineNumber":6,"endColumn":7},"rangeLength":0,"text":"s","rangeOffset":13,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":8,"endLineNumber":6,"endColumn":8},"rangeLength":0,"text":"f","rangeOffset":14,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":9,"endLineNumber":6,"endColumn":9},"rangeLength":0,"text":"s","rangeOffset":15,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":10,"endLineNumber":6,"endColumn":10},"rangeLength":0,"text":"a","rangeOffset":16,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":11,"endLineNumber":6,"endColumn":11},"rangeLength":0,"text":"f","rangeOffset":17,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":12,"endLineNumber":6,"endColumn":12},"rangeLength":0,"text":"s","rangeOffset":18,"forceMoveMarkers":false}]
If we just open the full file and merge those details would work but it would break if we getting too many of those changed details very frequently that can cause out of memory issues as the file been opened many times which is also a very inefficient way.
There is a similar question aimed specifically at c# here. If we open the file in stream mode, is there similar example in nodejs?

I would appreciate insight from anyone who can suggest the best or better solution in editing large files anyway ranges from 1MB to 200MB using nodejs.
Our process needs to merge lines to an existing file in the filesystem, we get the changed data in the following format which needs to be merged to filesystem file at the position defined in the changed details.
General OS file systems do not directly support the concept of inserting info into a file. So, if you have a flat file and you want to insert data into it starting at a particular line number, you have to do the following steps:
Open the file and start reading from the beginning.
As you read data from the file, count lines until you reach the desired linenumber.
Then, if you're inserting new data, you need to read some more and buffer into memory the amount of data you intend to insert.
Then do a write to the file at the position of insertion of the data to insert.
Now using another buffer the size of the data you inserted, take turns reading another buffer, then writing out the previous buffer.
Continue until the end of the file is reach and all data is written back to the file (after the newly inserted data).
This has the effect of rewriting all the data after the insertion point back to the file so it will now correctly be in its new location in the file.
As you can tell, this is not efficient at all for large files as you have to read the entire file a buffer at a time and you have to write the insertion and everything after the insertion point.
In node.js, you can use features in the fs module to carry out all these steps, but you have to write the logic to connect them all together as there is no built-in feature to insert new data into a file while pushing the existing data after it.
There is a similar question aimed specifically at c# here. If we open the file in stream mode, is there similar example in nodejs?
The C# example you reference appears to just be appending new data onto the end of the file. That's trivial to do in pretty much any file system library. In node.js, you can do that with fs.appendFile() or you can open any file handle in append mode and then write to it.
To insert data into a file more efficiently, you would need to use a more efficient storage system than a single flat file for all the data. For example, if you stored the file in pieces in approximately 100 line blocks, then to insert data you'd only have to rewrite a portion of one block of data and then perhaps have some cleanup process that rebalances the block boundaries if a block gets way too big or too small.
For efficient line management, you would need to maintain an accurate index of how many lines each file piece contains and obviously what order the pieces should be in. This would allow you to insert data at a somewhat fixed cost no matter how big the entire file was as the most you would need to do is to rewrite one or two blocks of data, even if the entire content was hundreds of GB in size.
Note, you would essentially be building a new file system on top of the OS file system in order to give yourself more efficient inserts or deletions within the overall data. Obviously, the chunks of data could also be stored in a database too and managed there.
Note, if this project is really an editor, text editing a line-based structure is a very well studied problem and you could also study the architectures used in previous projects for further ideas. It's a bit beyond the scope of a typical answer here to study the pros and cons of various architectures. If your system is also a client/server editor where the change instructions are being sent from a client to a server, that also affects some of the desired tradeoffs in the design since you may desire differing tradeoffs in terms of the number of transactions or the amount of data to be sent between client and server.
If some other language uses an optimal way then I think it would be better to find that option as you saying nodejs might not have that option.
This doesn't really have anything to do with the language you choose. This is about how modern and typical operating systems store data in files.

In fs module there is a function named appendFile. It would let you append data in your file. Link.

Transactionally writing files in Node.js

I have a Node.js application that stores some configuration data in a file. If you change some settings, the configuration file is written to disk.
At the moment, I am using a simple fs.writeFile.
Now my question is: What happens when Node.js crashes while the file is being written? Is there the chance to have a corrupt file on disk? Or does Node.js guarantee that the file is written in an atomic way, so that either the old or the new version is valid?
If not, how could I implement such a guarantee? Are there any modules for this?

What happens when Node.js crashes while the file is being written? Is
there the chance to have a corrupt file on disk? Or does Node.js
guarantee that the file is written in an atomic way, so that either
the old or the new version is valid?
Node implements only a (thin) async wrapper over system calls, thus it does not provide any guarantees about atomicity of writes. In fact, fs.writeAll repeatedly calls fs.write until all data is written. You are right that when Node.js crashes, you may end up with a corrupted file.
If not, how could I implement such a guarantee? Are there any modules for this?
The simplest solution I can come up with is the one used e.g. for FTP uploads:
Save the content to a temporary file with a different name.
When the content is written on disk, rename temporary file to destination file.
The man page says that rename guarantees to leave an instance of newpath in place (on Unix systems like Linux or OSX).

fs.writeFile, just like all the other methods in the fs module are implemented as simple wrappers around standard POSIX functions (as stated in the docs).
Digging a bit in nodejs' code, one can see that the fs.js, where all the wrappers are defined, uses fs.c for all its file system calls. More specifically, the write method is used to write the contents of the buffer. It turns out that the POSIX specification for write explicitly says that:
Atomic/non-atomic: A write is atomic if the whole amount written in
one operation is not interleaved with data from any other process.
This is useful when there are multiple writers sending data to a
single reader. Applications need to know how large a write request can
be expected to be performed atomically. This maximum is called
{PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether
write requests for more than {PIPE_BUF} bytes are atomic, but requires
that writes of {PIPE_BUF} or fewer bytes shall be atomic.
So it seems it is pretty safe to write, as long as the size of the buffer is smaller than PIPE_BUF. This is a constant that is system-dependent though, so you might need to check it somewhere else.

write-file-atomic will do what you need. It writes to temporary file, then rename. That's safe.

Program embedded inside an ISO

I'm trying to find a good way to embed some application / script inside / beside an ISO file. What I'm trying to achieve is having an cd image which can be used to kick-start its own installation via virtual-manager (virt-install). It's just to simplify the downloads really, so that you don't need to have an installer and an image separately.
The simplest option to do something like that would be to put uu/b64-encoded image after a bash script. The bad side of that is that the file needs to be copied again, its size is bigger than it needs to be and bash likes to reserve a lot of memory for strings that are streamed into some other program.
Another alternative is a Perl script with the binary contents after an __END__. That saves me from encoding the contents and high memory usage, but it doesn't prevent the copy.
Is there any elegant solution then? Are there any script languages that can be embedded inside of another file with no specific header (or at the end)? Or something that can be merged with the ISO format itself?

store some data in the struct inode

Hello I am a newbie to kernel programming. I am writing a small kernel module
that is based on wrapfs template to implement a backup mechanism. This is
purely for learning basis.
I am extending wrapfs so that when a write call is made wrapfs transparently
makes a copy of that file in a separate directory and then write is performed
on the file. But I don't want that I create a copy for every write call.
A naive approach could be I check for existence of file in that directory. But
I think for each call checking this could be a severe penalty.
I could also check for first write call and then store a value for that
specific file using private_data attribute. But that would not be stored on
disk. So I would need to check that again.
I was also thinking of making use of modification time. I could save a
modification time. If the older modification time is before that time then only
a copy is created otherwise I won't do anything. I tried to use inode.i_mtime
for this but it was the modified time even before write was called, also
applications can modify that time.
So I was thinking of storing some value in inode on disk that indicates its
backup has been created or not. Is that possible? Any other suggestions or
approaches are welcome.

You are essentially saying you want to do a Copy-On-Write virtual filesystem layer.
IMO, some of these have been done, and it would be easier to implement these in userland (using libfuse and the fuse module, e.g.). That way, you can be king of your castle and add your metadata in any which way you feel is appriate:
just add (hidden) metadata files to each directory
use extended POSIX attributes (setfattr and friends)
heck, you could even use a sqlite database
If you really insist on doing these things in-kernel, you'll have a lot more work since accessing the metadata from kernel mode is goind to take a lot more effort (you'd most likely want to emulate your own database using memory mapped files so as to minimize the amount of 'userland (style)' work required and to make it relatively easy to get atomicity and reliability right1.
1
On How Everybody Gets File IO Wrong: see also here

You can use atime instead of mtime. In that case setting S_NOATIME flag on the inode prevents it from updating (see touch_atime() function at the inode.c). The only thing you'll need is to mount your filesystem with noatime option.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string