On our Linux servers we observe quite a few .vscode-server processes (essentially $PREFIX/.vscode-server/bin/$ID/node) from developers using the vscode-remote-ssh extension. Unfortunately these processes put considerable load on the systems because they spend a lot of time waiting for I/O (state "D", uninterruptible sleep).
All affected filesystems are NFS (v3 and v4.0) mounted shares. There's nothing we can do on the fileserver end.
Why exactly do these processes require so much I/O? The .vscode-server processes sometimes generate more load than some of the actual data-processing jobs on these servers.
Is this a known problem of vscode-remote-ssh and/or is there a way to solve or work around this I/O problem?
I'm developing a service in NodeJS which will create text files from images using a node wrapper for the tesseract OCR engine. I want it to be a constantly running service, being started and restarted (on crash) by upstart.
I have the option of making the servers (the virtual machines this is going to run on) multi-core machines with large RAM and disk space, or I have the option of creating 4 or 5 small VMs with one core each, 1 GB RAM, and relatively small disks.
With the first approach, I would have to fork various child processes to make use of all cores, which adds complexity to the code. On the other hand, I just have one VM to worry about.
With the second approach, I don't have to worry about forking child processes, but I would have to create and configure multiple VMs.
Are there other pros and cons of each approach that I haven't thought of?
I'd avoid splitting the work across several small VMs, since that likely means wasting RAM and CPU -- it's quite likely you'll find one VM using 100% of its resources while another sits idle. There's also non-trivial overhead in running 5 operating systems instead of one.
Why are you considering forking many processes? If you use the right library, this will be unnecessary.
Many of the tesseract libraries on npm are poorly written. They are ultra-simplistic bindings to the tesseract C code. In JS, you call the addon's recognize() function, which just calls tesseract's Recognize(), which does CPU-intensive work in a blocking fashion. This means you're doing the recognition on the main V8 thread, which we know is a no-no. I assume this is why you're considering forking processes, since each would only be able to do a single blocking OCR operation at once.
Instead, you want a library that does the OCR work on a separate thread. tesseract_native is an example. It is properly designed: it uses libuv to call into tesseract on a worker thread.
libuv maintains a worker thread pool, so you can have as many concurrent OCR operations as you have cores, all in a single process.
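To make the threading model concrete, here is a minimal C sketch of that libuv pattern (this is not tesseract_native's actual code; do_ocr() is a hypothetical stand-in for a blocking Tesseract call):

```c
#include <stdio.h>
#include <uv.h>

/* Hypothetical blocking OCR call standing in for tesseract's Recognize(). */
static void do_ocr(const char *image_path) {
    /* ... CPU-intensive work ... */
    (void)image_path;
}

/* Runs on a libuv worker thread, so the event loop stays responsive. */
static void ocr_work(uv_work_t *req) {
    do_ocr((const char *)req->data);
}

/* Runs back on the loop thread once the worker has finished. */
static void ocr_done(uv_work_t *req, int status) {
    printf("finished %s (status %d)\n", (const char *)req->data, status);
}

int main(void) {
    /* The pool defaults to 4 threads; raise UV_THREADPOOL_SIZE to match your core count. */
    uv_work_t req;
    req.data = (void *)"page1.png";
    uv_queue_work(uv_default_loop(), &req, ocr_work, ocr_done);
    return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```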
The question title is pretty awkward, sorry about that.
I am currently working on the design of a server, and one of my co-workers commented that we should use multiple processes, since there is supposedly some performance hit to having too many threads in a single process (as opposed to spreading the same number of threads over multiple processes on the same machine).
The only thing I can think of that would cause this (other than bad OS scheduling) would be increased contention (for example, on the memory allocator), but I'm not sure how much that matters.
Is this a 'best practice'? Does anyone have benchmarks they could share with me? Of course the answer may depend on the platform (I'm interested mostly in Windows/Linux/OS X, although I also need to care about HP-UX, AIX, and Solaris to some extent).
There are of course other benefits to using a multi-process architecture, such as process isolation to limit the effect of a crash, but I'm interested about performance for this question.
For some context, the server is going to service long-running, stateful connections (so they cannot be migrated to other server processes) which send back a lot of data and can also cause a lot of local DB processing on the server machine. It's going to use the proactor architecture in-process and be implemented in C++. The server will be expected to run for weeks/months without needing a restart (although this may be implemented by rotating new instances transparently behind some proxy).
Also, we will be using a multi-process architecture either way; my concern is more about how connections are scheduled across processes.
I am using flock within an HPC application on a file system shared among many machines via NFS. Locking works fine as long as all machines behave as expected (Quote from http://en.wikipedia.org/wiki/File_locking: "Kernel 2.6.12 and above implement flock calls on NFS files using POSIX byte-range locks. These locks will be visible to other NFS clients that implement fcntl-style POSIX locks").
I would like to know what is expected to happen if one of the machines that has acquired a certain lock unexpectedly shuts down, e.g. due to a power outage. I am not sure where to look this up. My guess is that this is entirely up to NFS and how it deals with the NFS handles of non-responsive machines. I could imagine that the other clients will still see the lock until a timeout occurs and the NFS server declares all NFS handles of the machine that timed out invalid. Is that correct? What would that timeout be? What happens if the machine comes back up within the timeout? Can you recommend a definitive reference where I can look all of this up?
Thanks!
When you use NFS v4 (!), the file will be unlocked when the server hasn't heard from the client for a certain amount of time. This lease period defaults to 90 seconds.
There is a good explanation in the O'Reilly book about NFS and NIS, chapter 11.2. To sum up quickly: As NFS is stateless, the server has no way of knowing the client has crashed. The client is responsible for clearing the lock after it reboots.
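For reference, acquiring one of those fcntl-style POSIX byte-range locks looks roughly like this in C (a sketch only; the path is made up, and on an NFSv4 mount the server would drop the lock once the crashed client's lease expires):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical file on the NFS share. */
    int fd = open("/nfs/shared/jobs.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Lock the whole file; F_SETLKW blocks until the lock is granted. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }

    /* ... critical section: if this client dies, an NFSv4 server drops
       the lock after the lease period (typically 90s) expires ... */

    fl.l_type = F_UNLCK;            /* release the lock explicitly */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}
```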
Running Windows Server 2008 R2 SP1. The application I'm running was not designed with NUMA in mind. Would it be better to disable NUMA on my dual-socket system? My guess is yes, but I wanted to confirm. My server is a Westmere dual-socket system.
If your application is not multithreaded, or is multithreaded but does not employ the threads to work simultaneously on the same problem (e.g. is not data parallel), then you can simply bind the program to one of the NUMA nodes. This can be done with various tools, e.g. with the "Set Affinity..." context menu command in Windows Task Manager. If your program is parallel, you can still use half of the available processor cores and bind them to one NUMA node.
Note that remote memory accesses on Westmere systems are not that expensive - the latency is roughly 1.6x that of a local access and the bandwidth is almost the same as the local one - so if you do a lot of processing on each memory value the impact will be minimal. On the other hand, disabling NUMA on such systems results in a fine-grained interleaving of memory across both NUMA domains, which makes all applications perform equally badly, as roughly 50% of all memory accesses will be local and 50% will be remote.
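For the "bind to one NUMA node" suggestion, a rough C sketch of doing it programmatically with the Win32 APIs instead of Task Manager (node 0 is an arbitrary choice here):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    ULONGLONG nodeMask = 0;

    /* Ask Windows which logical processors belong to NUMA node 0. */
    if (!GetNumaNodeProcessorMask(0, &nodeMask)) {
        fprintf(stderr, "GetNumaNodeProcessorMask failed: %lu\n", GetLastError());
        return 1;
    }

    /* Restrict this process (and its threads) to that node's processors;
       Windows then tends to satisfy its allocations from the same node. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)nodeMask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    printf("Bound to NUMA node 0 (mask 0x%llx)\n", nodeMask);
    /* ... run the workload ... */
    return 0;
}
```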
If I understand correctly, turning NUMA on cannot harm the performance.
If your application is not NUMA-aware, memory accesses will be managed by the OS, so they might cross NUMA nodes or stay on the same one, depending on what other pressure the OS is under, how much memory/CPU you're using, etc. The OS will try to keep your data access fast.
If you have it turned off, the OS doesn't know enough to even try to keep each application's execution CPU close to its memory.
Scenario A:
To share a read/write block of memory between two processes running on the same host, Joe mmaps the same local file from both processes.
Scenario B:
To share a read/write block of memory between two processes running on two different hosts, Joe shares a file via NFS between the hosts, and then mmaps the shared file from both processes.
Has anyone tried Scenario B? What are the extra problems that arise in Scenario B that do not apply to Scenario A?
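For what it's worth, Scenario A itself is just the usual MAP_SHARED setup; a minimal sketch (file name made up) of what each process would run:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096

int main(void) {
    /* Both processes open and map the same local file with MAP_SHARED. */
    int fd = open("/tmp/shared.region", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0) { perror("setup"); return 1; }

    char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Writes by one process become visible to the other through the
       shared page cache on the same host. */
    strcpy(p, "hello from process A");

    munmap(p, REGION_SIZE);
    close(fd);
    return 0;
}
```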
mmap will not share the data across hosts without some additional steps.
If you change data in the mmapped part of the file, the changes are initially held only in memory. They will not be flushed to the filesystem (local or remote) until msync or munmap or close is called, or until the OS kernel and its filesystem decide to write them back on their own.
When using NFS, locking and writing data back will be slower than with a local FS. Flush timeouts and the duration of file operations will vary too.
On the sister site, people say that NFS may have a poor caching policy, so there will be many more I/O requests to the NFS server than with a local FS.
You will also need byte-range locks for correct behavior. They are available in NFS >= v4.0.
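A small sketch of the extra step described above: after modifying the mapping, call msync() explicitly so the changes actually reach the (NFS-backed) file instead of sitting in the local page cache (path and size are illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096

int main(void) {
    int fd = open("/mnt/nfs/shared.region", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "update from host A");

    /* Force the dirty pages out to the server; without this (or munmap/close)
       the other host may not see the update for a long time. */
    if (msync(p, REGION_SIZE, MS_SYNC) < 0) { perror("msync"); return 1; }

    munmap(p, REGION_SIZE);
    close(fd);
    return 0;
}
```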
I'd say Scenario B has all kinds of problems (assuming it works as suggested in the comments). The most obvious is the standard concurrency issue - 2 processes sharing 1 resource with no form of locking etc. That could lead to problems... Not sure whether NFS has its own peculiar quirks in this regard or not.
Assuming you can get around the concurrency issues somehow, you are now reliant on maintaining a stable (and speedy) network connection. Obviously if the network drops out, you might miss some changes. Whether this matters depends on your architecture.
My thought is that it sounds like an easy way to share a block of memory across machines, but I can't say I've heard of it being done, which makes me think it isn't that good an idea. When I think of sharing data between processes, I think of DBs, messaging, or a dedicated server. In this case, if you made one process the master (to handle concurrency and own the data - i.e. whatever it says is the best copy of the data) it might work...