QFileSystemWatcher - does it need to run in another thread? - multithreading

I have a class that does some parsing of two large (~90K rows, 11 columns in the first and around ~20K, 5 columns in the second) CSV files. According to the specification I'm working with the CSV files can be externally changed (removing/adding of new rows; columns remain constant as well as the paths). Such updates can happen at any time (though highly unlikely that an update will be launched in time intervals shorter than a couple of minutes) and an update of any of the two files has to terminate the current processing of all that data (CSV, XML from an HTTP GET request, UDP telegrams), followed by re-parsing the content of each of the two (or just one if only one has changed).
I keep the CSV data (quite reduced since I apply multiple filters to remove unwanted entries) in memory to speed working with it and also to avoid unnecessary IO operations (opening, reading, closing file).
Right now I'm looking into the QFileSystemWatcher, which seems to be exactly what I need. However I'm unable to find any information on how it actually works internally.
Since all I need is to monitor 2 files for changes the number of files shouldn't be an issue. Do I need to run it in a separate thread (since the watcher is part of the same class where the CSV parsing happens) or is it safe to say that it can run without too much fuss (that is it works asynchronously like the QNetworkAccessManager)? My dev environment for now is a 64bit Ubuntu VM (VirtualBox) on a relatively powerful host (a HP Z240 workstation) however the target system is an embedded one. While the whole parsing of the CSV files takes just 2-3 seconds at the most I don't know how much performance impact there will be once the application gets deployed so additional overhead is something of a concern of mine.

Related

What is the optimal way of merge few lines or few words in the large file using NodeJS?

I would appreciate insight from anyone who can suggest the best or better solution in editing large files anyway ranges from 1MB to 200MB using nodejs.
Our process needs to merge lines to an existing file in the filesystem, we get the changed data in the following format which needs to be merged to filesystem file at the position defined in the changed details.
[{"range":{"startLineNumber":3,"startColumn":3,"endLineNumber":3,"endColumn":3},"rangeLength":0,"text":"\n","rangeOffset":4,"forceMoveMarkers":false},{"range":{"startLineNumber":4,"startColumn":1,"endLineNumber":4,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":5,"forceMoveMarkers":false},{"range":{"startLineNumber":5,"startColumn":1,"endLineNumber":5,"endColumn":1},"rangeLength":0,"text":"\n","rangeOffset":6,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":1,"endLineNumber":6,"endColumn":1},"rangeLength":0,"text":"f","rangeOffset":7,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":2,"endLineNumber":6,"endColumn":2},"rangeLength":0,"text":"a","rangeOffset":8,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":3,"endLineNumber":6,"endColumn":3},"rangeLength":0,"text":"s","rangeOffset":9,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":4,"endLineNumber":6,"endColumn":4},"rangeLength":0,"text":"d","rangeOffset":10,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":5,"endLineNumber":6,"endColumn":5},"rangeLength":0,"text":"f","rangeOffset":11,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":6,"endLineNumber":6,"endColumn":6},"rangeLength":0,"text":"a","rangeOffset":12,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":7,"endLineNumber":6,"endColumn":7},"rangeLength":0,"text":"s","rangeOffset":13,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":8,"endLineNumber":6,"endColumn":8},"rangeLength":0,"text":"f","rangeOffset":14,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":9,"endLineNumber":6,"endColumn":9},"rangeLength":0,"text":"s","rangeOffset":15,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":10,"endLineNumber":6,"endColumn":10},"rangeLength":0,"text":"a","rangeOffset":16,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":11,"endLineNumber":6,"endColumn":11},"rangeLength":0,"text":"f","rangeOffset":17,"forceMoveMarkers":false},{"range":{"startLineNumber":6,"startColumn":12,"endLineNumber":6,"endColumn":12},"rangeLength":0,"text":"s","rangeOffset":18,"forceMoveMarkers":false}]
If we just open the full file and merge those details would work but it would break if we getting too many of those changed details very frequently that can cause out of memory issues as the file been opened many times which is also a very inefficient way.
There is a similar question aimed specifically at c# here. If we open the file in stream mode, is there similar example in nodejs?
I would appreciate insight from anyone who can suggest the best or better solution in editing large files anyway ranges from 1MB to 200MB using nodejs.
Our process needs to merge lines to an existing file in the filesystem, we get the changed data in the following format which needs to be merged to filesystem file at the position defined in the changed details.
General OS file systems do not directly support the concept of inserting info into a file. So, if you have a flat file and you want to insert data into it starting at a particular line number, you have to do the following steps:
Open the file and start reading from the beginning.
As you read data from the file, count lines until you reach the desired linenumber.
Then, if you're inserting new data, you need to read some more and buffer into memory the amount of data you intend to insert.
Then do a write to the file at the position of insertion of the data to insert.
Now using another buffer the size of the data you inserted, take turns reading another buffer, then writing out the previous buffer.
Continue until the end of the file is reach and all data is written back to the file (after the newly inserted data).
This has the effect of rewriting all the data after the insertion point back to the file so it will now correctly be in its new location in the file.
As you can tell, this is not efficient at all for large files as you have to read the entire file a buffer at a time and you have to write the insertion and everything after the insertion point.
In node.js, you can use features in the fs module to carry out all these steps, but you have to write the logic to connect them all together as there is no built-in feature to insert new data into a file while pushing the existing data after it.
There is a similar question aimed specifically at c# here. If we open the file in stream mode, is there similar example in nodejs?
The C# example you reference appears to just be appending new data onto the end of the file. That's trivial to do in pretty much any file system library. In node.js, you can do that with fs.appendFile() or you can open any file handle in append mode and then write to it.
To insert data into a file more efficiently, you would need to use a more efficient storage system than a single flat file for all the data. For example, if you stored the file in pieces in approximately 100 line blocks, then to insert data you'd only have to rewrite a portion of one block of data and then perhaps have some cleanup process that rebalances the block boundaries if a block gets way too big or too small.
For efficient line management, you would need to maintain an accurate index of how many lines each file piece contains and obviously what order the pieces should be in. This would allow you to insert data at a somewhat fixed cost no matter how big the entire file was as the most you would need to do is to rewrite one or two blocks of data, even if the entire content was hundreds of GB in size.
Note, you would essentially be building a new file system on top of the OS file system in order to give yourself more efficient inserts or deletions within the overall data. Obviously, the chunks of data could also be stored in a database too and managed there.
Note, if this project is really an editor, text editing a line-based structure is a very well studied problem and you could also study the architectures used in previous projects for further ideas. It's a bit beyond the scope of a typical answer here to study the pros and cons of various architectures. If your system is also a client/server editor where the change instructions are being sent from a client to a server, that also affects some of the desired tradeoffs in the design since you may desire differing tradeoffs in terms of the number of transactions or the amount of data to be sent between client and server.
If some other language uses an optimal way then I think it would be better to find that option as you saying nodejs might not have that option.
This doesn't really have anything to do with the language you choose. This is about how modern and typical operating systems store data in files.
In fs module there is a function named appendFile. It would let you append data in your file. Link.

Node .fs Working with a HUGE Directory

Picture a directory with a ton of files. As a rough gauge of magnitude I think the most that we've seen so far is a couple of million but it could technically go another order higher. Using node, I would like to read files from this directory, process them (upload them, basically), and then move them out of the directory. Pretty simple. New files are constantly being added while the application is running, and my job (like a man on a sinking ship holding a bucket) is to empty this directory as fast as it's being filled.
So what are my options? fs.readdir is not ideal, it loads all of the filenames into memory which becomes a problem at this kind of scale. Especially as new files are being added all the time and so it would require repeated calls. (As an aside for anybody referring to this in the future, there is something being proposed to address this whole issue which may or may not have been realised within your timeline.)
I've looked at the myriad of fs drop-ins (graceful-fs, chokadir, readdirp, etc), none of which have this particular use-case within their remit.
I've also come across a couple of people suggesting that this can be handled with child_process, and there's a wrapper called inotifywait which tasks itself with exactly what I am asking but I really don't understand how this addresses the underlying problem, especially at this scale.
I'm wondering if what I really need to do is find a way to just get the first file (or, realistically, batch of files) from the directory without having the overhead of reading the entire directory structure into memory. Some sort of stream that could be terminated after a certain number of files had been read? I know Go has a parameter for reading the first n files from a directory but I can't find a node equivalent, has anybody here come across one or have any interesting ideas? Left-field solutions more than welcome at this point!
You can use your operation system listing file command, and stream the result into NodeJS.
For example in Linux:
var cp=require('child_process')
var stdout=cp.exec('ls').stdout
stdout.on('data',function(a){
console.log(a)
});0
RunKit: https://runkit.com/aminanadav/57da243180f3bb140059a31d

How to open and read 1000s of files very quickly

My problem is that application takes too long to load thousands of files. Yes, I know it's going to take a long time, but I would like to make it faster by any amount of time. What I mean by "load" is open the file to get its descriptor and then read the first 100 bytes or so of it.
So, my main strategy has been to create a second thread that will open and close (without reading any contents) all the files. This seems to help because the thread runs ahead of the main thread and I'm guessing the OS is caching these file descriptors ahead of time so that when my main thread opens them it's a quick open. This has actually helped because the thread can start caching these file descriptors while my main thread is parsing the data read in from these files.
So my real question is...what else can I do to make this faster? What approaches are there? Has anyone had success doing this?
I've heard of OS prefetching calls but it was for virtual memory pages. Is there a way to tell the OS, hey I'm going to be needed all these files pretty soon - I suggest that you start gathering them for me ahead of time. My lookahead thread is pretty crude.
Are there low level disk techniques I could use? Is there possibly a pattern of file access that would help? Right now, the files that are loaded all come from the same folder. I suppose there is no way to determine where exactly on disk they lie and which ordering of file opens would be fastest for the disk. I'm also guessing that the disk has some hard ware to make this as efficient as possible too.
My application is mainly for windows, but unix suggestions would help as well.
I am programming in C++ if that makes a difference.
Thanks,
-julian
My first thought is that this is going to be hard to work around from a programmatic level.
You'll find Linux and OSX can access thousands of files like this in a fraction of the time it takes Windows. I don't know how much control you have over the machine. If you can keep the thousands of files on a FAT partition, you should see better results than with NTFS.
How often are you scanning these files and how often are they changing. If the ratio is heavily on the reading side, it would make sense to copy the start of each file into a cache. The cache could store the filename, modification time, and 100 bytes of each of the thousand files.

multithreading and reading from one file (perl)

Hej sharp minds!
I need your expert guidance in making some choices.
Situation is like this:
1. I have approx. 500 flat files containing from 100 to 50000 records that have to be processed.
2. Each record in the files mentioned above has to be replaced using value from the separate huge file (2-15Gb) containing 100-200 million entries.
So I thought to make the processing using multicores - one file per thread/fork.
Is that a good idea? Since each thread needs to read from same huge file? It's a bit of a problem loading it into memory do to the size? Using file::tie is an option, but is that working with threads/forks?
Need your advise how to proceed.
Thanks
Yes, of course, using multiple cores for multi-threaded application is a good idea, because that's what those cores are for. Though it sounds like your problem involves heavy I/O, so, it might be that you will not use that much of CPU anyway.
Also since you are only going to read that big file, tie should work perfectly. I haven't heard of problems with that. But if you are going to search that big file for each record in your smaller files, then I guess it would take you a long time despite of the number of threads you use. If data from big file can be indexed based on some key, then I would advice to put it in some NoSQL databse and access it from your program. That would probably speed up your task even more than using multiple threads/cores.

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large amount of files (thousands of different files, with avg. size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- following its reading of a file from the 'incoming' folder on the 1.5TB hard drive -- will be processing the information and be ready to output the processed information back to the 'processed' folder on the 1.5TB drive. the processed information for every file is of roughly the same average size as the input files (about ~2MB per file).
Are there any 'do' and 'donts' when one is building such an operation? is it a problem to have 30 machines or so read (or write) information to the same network drive, at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am use Linux, Ubuntu 10.04 LTS on all machines if it all matters)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the amount of parallel files you read, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, some seconds per file) you'll go past a point in which adding more processes will only slow down matters to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?

Resources