where is disk scheduling implemented - io

I've recently been learning about disk scheduling in operating systems. I understand the various algorithms for this, like FCFS, LIFO, SSTF, SCAN and so on, but I was wondering where these algorithms are implemented.
I don't think the operating system is the answer, because the OS can't know the details of the I/O devices. So are they implemented on the devices themselves? Could anyone clarify this for me? Any related literature or links would be appreciated.

The simple answer is that these days, this all takes place in the drive controller.
In ye olde days, operating systems usually implemented disk I/O in two layers. At the top was a drive-independent logical layer. This viewed the drive as an array of blocks. Below this was a physical layer that viewed disks as platters, tracks, and sectors. Because the physical details varied among drives, the physical layer was usually implemented in a device driver specific to a disk (or class of disks).
In these dark times, you often had to wait for your drive vendor to create a new device driver before you could upgrade your operating system.
In the mid-1980s it started to become common for disk drives to provide a logical I/O interface. The device driver no longer saw platters/tracks/sectors. Instead, it just saw an array of logical blocks. The drive took care of physical locations and the redirecting of bad blocks (tasks that the operating system used to handle). This allowed a single device driver to manage multiple types of devices, sharing the same interface and differing only in the number of logical blocks.
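To make the difference between the two views concrete, here is a minimal sketch of the textbook mapping between a cylinder/head/sector address and a logical block number. The geometry values (16 heads, 63 sectors per track) are only illustrative assumptions; on a modern drive the real layout is hidden behind the logical interface.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Textbook CHS -> LBA mapping (cylinders and heads count from 0, sectors from 1)."""
    if not (0 <= head < heads_per_cylinder and 1 <= sector <= sectors_per_track):
        raise ValueError("head or sector out of range for this geometry")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: logical block number back to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1


# Assumed, purely illustrative geometry.
HEADS, SECTORS = 16, 63
lba = chs_to_lba(2, 3, 4, HEADS, SECTORS)
```

With a logical interface the driver only ever deals with the `lba` side of this mapping.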
These days, you'd be hard pressed to find a disk drive that does not provide a logical interface.
All the scheduling algorithms that involve physical locations have to take place within the disk drive.
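For reference, this is roughly what such a location-based algorithm looks like. The sketch below compares FCFS with SSTF on the classic textbook request queue; the cylinder numbers are illustrative, and real drive firmware is certainly more involved than this.

```python
def total_seek_distance(start, order):
    """Total head movement (in cylinders) when servicing requests in the given order."""
    total, position = 0, start
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total


def sstf_order(start, requests):
    """Shortest Seek Time First: always service the pending request nearest the head."""
    pending, position, order = list(requests), start, []
    while pending:
        nearest = min(pending, key=lambda c: abs(c - position))
        pending.remove(nearest)
        order.append(nearest)
        position = nearest
    return order


requests = [98, 183, 37, 122, 14, 124, 65, 67]
fcfs_cost = total_seek_distance(53, requests)                  # 640 cylinders
sstf_cost = total_seek_distance(53, sstf_order(53, requests))  # 236 cylinders
```

Both functions need the physical cylinder of every request, which is exactly the information a logical block interface hides from the operating system.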
Unless you are doing disk drive engineering, such scheduling algorithms are quite meaningless. If you are learning hard drive engineering, expect that occupation to disappear soon.

In practice, disk scheduling (in the sense of e.g. reordering the pending disk reads to minimize rotational delay) is less important today than it was in the 20th century, for several reasons:
Hard disks are probably less used, in favor of SSDs, and they are even slower relative to fast RAM access times.
The disk sectors as seen by the kernel have been reorganized by the disk controller itself, so the CHS addressing (as seen by the OS kernel) does not correspond to geometrical reality.
The hard disk drive is smarter today, and its internal controller has significant memory and computing capabilities. The SATA protocol has some "higher level" requests (e.g. TRIM). Read about SMART and hybrid drives.
However, application code can give hints to the operating system about access patterns. Look for example into posix_fadvise(2).
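A minimal sketch of such a hint, using Python's `os.posix_fadvise` wrapper around the same call (Linux/Unix only). The temporary file and the helper name `open_with_random_hint` are just for illustration, and the kernel is free to ignore the advice.

```python
import os
import tempfile


def open_with_random_hint(path):
    """Open a file read-only and tell the kernel to expect random access."""
    fd = os.open(path, os.O_RDONLY)
    # offset=0 and length=0 mean "the whole file".
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    return fd


with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 8192)
    tmp.flush()
    fd = open_with_random_hint(tmp.name)
    try:
        data = os.pread(fd, 16, 4096)  # read 16 bytes at offset 4096
    finally:
        os.close(fd)
```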
Read also Operating Systems: Three Easy Pieces


Using loopback for synchronous IPC when using NUMA architecture

(For a Linux platform) Is it feasible (from a performance point of view) to try to communicate (in a synchronous way) via loopback interface between processes on different NUMA nodes?
What about if the processes reside on the same NUMA node?
I know it's possible to bind a process's memory and/or set its CPU affinity to a node (using libnuma). I don't know if this is also true for the network interface.
Later edit. If loopback interface is just a memory buffer used by kernel, is there a way to be sure that buffer is on the same NUMA node in order for two processes to communicate without the cross node overhead?
Network interfaces don't reside on a node; they're a device - virtual or real - shared across the whole machine. The loopback interface is just a memory buffer somewhere or other, and some kernel code. The code that runs to support that device is likely bouncing round the CPU cores, just like any other thread in the system.
You talk of NUMA nodes, and tagged the question with Linux. Linux doesn't run on pure NUMA architectures, it runs on SMP architectures. Modern CPUs from, say, Intel, AMD, ARM all synthesise an SMP hardware environment using separate cores, varying degrees of cache / memory interface unification, and high speed serial links between cores or CPUs. Effectively the operating system and the software running on top don't have to deal with the underlying NUMA architecture; they can behave as if they were running on a classical SMP machine (the kernel can still discover the topology, which is what libnuma relies on).
Intel / AMD / everyone else have done this because, back in the day, successful multiple CPU machines really were SMP; they had multiple CPUs all sharing the same memory bus, and had equal access to the RAM at the other end of the bus. Software got written to take advantage of that (Linux, Windows, etc).
Then the CPU manufacturers realised that SMP architectures suck so far as speed improvements are concerned. AMD blinked first, and ditched SMP in favour of Hypertransport, and were successful. Intel persisted with pure SMP for longer, but soon gave up too and started using QPI between CPUs.
But to give the old software (Linux, Windows, etc) backward compatibility, the CPU designers had to create a synthetic SMP hardware environment on top of Hypertransport and QPI. In principle they might have, at that point in time, decided that SMP was dead and delivered us pure NUMA architectures. But that would likely have been commercial suicide; it would have taken coordination of the entire hardware and software industries to agree to go that way, and by then it was already far too late to rewrite everything from scratch.
Things like network sockets (including via the loopback interface), pipes, and serial ports are not synchronous. They're stream carriers, and the sender and receiver are not synchronised by the act of transferring data. That is, the sender can write() data and think that that has completed, but the data is in reality still stuck in some network buffer somewhere and hasn't yet made it into the read() that the destination process will have to call to receive the data.
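A small sketch of that behaviour. It uses `socket.socketpair()` (a Unix-domain stream pair) rather than the loopback interface, purely to keep the example self-contained; the buffering works the same way for this purpose.

```python
import socket

sender, receiver = socket.socketpair()
try:
    # sendall() returns as soon as the kernel has buffered the bytes.
    # Nobody has called recv() yet, so the "completed" write is still
    # sitting in a kernel buffer at this point.
    sender.sendall(b"hello")
    send_completed_before_any_read = True

    # Only now does the receiving side pick the data up.
    received = receiver.recv(5)
finally:
    sender.close()
    receiver.close()
```

If you need the sender to know the data has actually been consumed, the receiver has to send an explicit acknowledgement back.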
What Linux will do with processes and threads is endeavour to run them all at once, up to the limit of the number of CPU cores in the machine. By and large that will result in your processes running simultaneously on separate cores. I think Linux will also use knowledge of which physical CPU's memory holds the bulk of a process's data, and will try to run the process on that CPU; memory latency will be a tiny bit better that way.
If your processes try to communicate via socket, pipe or similar, it results in data being copied out of one process's memory space into a memory buffer controlled by the kernel (that's what write() is doing under the hood), and then being copied out of that into the receiving process's memory space (that's what read() does). Where that intermediate kernel buffer actually is doesn't really matter because the transactions taking place at the microelectronic level (below the SMP level) are pretty much the same regardless. Memory allocations and processes can be bound to specific CPU cores, but you can't influence whereabouts the kernel puts its memory buffers through which the exchanged data must pass.
Regarding memory and process core affinity - it's really, really hard to do this to any measurable benefit. The OSes are so good nowadays at understanding the behaviour of CPUs that it's almost always best to simply let the OS run your processes on whichever cores it chooses. Companies like Intel make large code contributions to the Linux project, specifically to ensure that Linux does this as well as possible on the latest and greatest chips.
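If you do want to experiment and measure for yourself, CPU affinity can be set from Python on Linux via `os.sched_setaffinity`. This is only a sketch: the helper name `run_pinned` is made up for illustration, and it pins the CPU only. Binding memory to a node would need libnuma or numactl, which are not shown here.

```python
import os


def run_pinned(cpu, func):
    """Run func() with the calling process restricted to one CPU, then restore."""
    original = os.sched_getaffinity(0)   # 0 means "the calling process"
    os.sched_setaffinity(0, {cpu})
    try:
        return func()
    finally:
        os.sched_setaffinity(0, original)


allowed = os.sched_getaffinity(0)
first_cpu = min(allowed)
# Inside the pinned section the process is allowed to run on one CPU only.
seen_while_pinned = run_pinned(first_cpu, lambda: os.sched_getaffinity(0))
```

Benchmark with and without the pinning before deciding it is worth keeping.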
Additions in the light of engaging comments!
By "pure NUMA" I really mean systems where one CPU core cannot directly address memory physically attached to another CPU core. Such systems include Transputers, and even the Cell processor found in the Sony PS3. These aren't SMP, there's nothing in the silicon that unifies the separate memories into a single address space, so the question of cache coherency doesn't come into it.
With Transputer systems the only way to access memory attached to another transputer was to have the application software send the data over via a serial link; what made it CSP was that the sending application would not finish sending until the receiving application had read the last byte.
For the Cell processor, there were 8 maths cores each with 256kbyte of RAM. That was the only RAM the maths cores could address. To use them the application had to move data and code into that 256k of RAM, tell the core to run, and then move the results out (possibly back out to RAM, or onto another maths core).
There are some supercomputers today that aren't dissimilar to this. The K machine (Riken, Kobe in Japan) has an awful lot of cores, a very complex on-chip interconnect fabric, and OpenMPI is used by applications to move data around between nodes; nodes cannot directly address memory on other nodes.
The point is that on the PS3 it was up to application software to decide what data was in what memory and when, whereas modern x86 implementations from Intel and AMD make all data in all memories (no matter if they're shared via an L3 cache or are remote at the other end of a Hypertransport or QPI link) accessible from any core (that's what SMP means, after all).
The all-out performance of code written for the Cell processor was truly astounding for the Watts and transistor count. Trouble was, in a world where programmers are trained in writing for SMP environments, it takes a brain transplant to get to grips with one that isn't.
Newer languages like Rust and Go have reintroduced the concept of communicating sequential processes, which is all one had with Transputers back in the 1980s, early 1990s. CSP is almost ideal for multicore systems as the hardware does not need to implement an SMP environment. In principle this saves an awful lot of silicon.
CSP implemented on top of today's cache coherent SMP chips in languages like C generally involves a thread writing data into a buffer, and that being copied into a buffer belonging to another thread (Rust can do it a little differently because Rust knows about memory ownership, and so can transfer ownership instead of copying memory. I'm not familiar with Go - maybe it can do the same).
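As a rough illustration of the message-passing style (not of the copying itself), here is a sketch with two threads that share nothing except a channel. It uses Python's `queue.Queue`, which hands over object references rather than copying buffers, and a bounded queue is only an approximation of a true CSP rendezvous.

```python
import queue
import threading


def producer(channel, items):
    for item in items:
        channel.put(item)   # blocks while the channel is full
    channel.put(None)       # sentinel: no more data


def consumer(channel, out):
    while (item := channel.get()) is not None:
        out.append(item * 2)


channel = queue.Queue(maxsize=1)   # capacity 1 keeps the two sides in step
results = []
threads = [
    threading.Thread(target=producer, args=(channel, [1, 2, 3])),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```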
Looked at at the microelectronic level, copying data from one buffer to another is not really any different to what happens if the data is shared by 2 cores instead of copied (especially in AMD's Hypertransport CPUs where each has its own memory system). To share data, the remote core has to use Hypertransport to request data from another core's memory, plus more traffic to maintain cache coherency. That's about the same amount of Hypertransport traffic as if the data were copied from one core to the other, but with a copy there's no subsequent cache coherency traffic.

Does disk IO correspond directly to its physical sector location?

I've been playing around with disk IO on flash drives, HDDs, and SSDs by opening /dev/sd* paths in Linux the way I would any other file.
I understand that the memory controller on the disk can hide true block order (via a mapping) from the OS.
This boils down to these questions:
Are the blocks in /dev/sd* in the order perceived by the OS, or in the order as perceived by the disk's memory controller?
Is the order of blocks in /dev/sd* subjective between POSIX OSes?
Can these properties change if done on an NT or Cygwin system?
Is this property different among Flash, HDD, and SSD?
Can a write occur to a specific index in an opened /dev/sd* path, or is this determined by the memory controller?
Thanks in advance!
If you use the device nodes for entire disks (/dev/sda, /dev/sdb, and so on), then the file offsets for the block device correspond to logical block addresses and will be portable across systems (assuming that the disk sector size is supported). This is independent of the storage technology.
However, the names of the device nodes are different from system to system.
If you use sub-devices (partitions), this is not necessarily the case because interpretation of and support for partition tables varies considerably.
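The offset arithmetic for whole-disk device nodes can be sketched like this. A temporary file stands in for something like /dev/sda so the example runs without root, and the 512-byte sector size is an assumption (many drives report 4096).

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed logical sector size


def read_sector(fd, lba, sector_size=SECTOR_SIZE):
    """Read one logical block: the file offset is simply lba * sector_size."""
    return os.pread(fd, sector_size, lba * sector_size)


# Stand-in "disk image": sector N is filled with the byte value N.
with tempfile.TemporaryFile() as image:
    for lba in range(4):
        image.write(bytes([lba]) * SECTOR_SIZE)
    image.flush()
    sector2 = read_sector(image.fileno(), 2)
```

Writes work the same way with `os.pwrite`; where the block physically ends up is still the drive's business.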

How to make the OS schedule disk accesses optimally?

Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1TB+) mmapped file where I needed to read 1000+ chunks of about 1KB, each time in new, unpredictable places.
In the early days when parameters like Wikipedia: Hard disk drive performance characteristics → Seek time were very expensive and thus very important, database vendors paid attention to the on-disk data representation and layout, as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with appearance of Solid-state drives (SSD) where the seek time is 0 (or at least constant) as there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, all big players on it.
So how to schedule your accesses optimally depends a lot on the use case (1000 concurrent hits is not a sufficient problem description). Buying some RAM is one of the options, and "how can the programmer communicate with the OS" will be one of the last (not first) questions.
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay on caches and later be committed to your storage devices when your OS sees fit.
That said, there are always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads because cache contention will make things even slower than one thread.
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) and perhaps (in a separate thread) readahead(2). If, instead of read(2)-ing, you use mmap(2) to project some file portion into virtual memory, you might also use madvise(2).
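One way to combine these ideas, as a sketch only: announce every range you are going to need with POSIX_FADV_WILLNEED first, so the kernel and drive can order the actual disk reads, and only then read the chunks. The helper name `read_chunks` and the test file are made up for illustration; whether this helps is something you have to measure.

```python
import os
import tempfile


def read_chunks(fd, offsets, size):
    """Hint all wanted ranges up front, then read them in ascending offset order."""
    ordered = sorted(offsets)
    for off in ordered:
        # Non-blocking hint: "I will need this range soon".
        os.posix_fadvise(fd, off, size, os.POSIX_FADV_WILLNEED)
    return {off: os.pread(fd, size, off) for off in ordered}


with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(bytes(range(256)) * 256)   # 64 KiB of repeating 0..255
    tmp.flush()
    fd = os.open(tmp.name, os.O_RDONLY)
    try:
        chunks = read_chunks(fd, [40000, 300, 0], 4)
    finally:
        os.close(fd)
```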
Of course, the file system does not usually guarantee that a sequential portion of a file is physically sequentially located on the disk (and even the disk firmware might reorder sectors). See picture in Ext2 wikipage, also relevant for Ext4. Some file systems might be better in that respect, and you could tune their block size (at mkfs time).
I would not recommend having thousands of threads (at most a few dozen).
Lastly, it might be worth buying an SSD or some more RAM (for file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed file library like GDBM, a database library like SQLite, or a real database like PostgreSQL might be worthwhile! Perhaps having fewer but bigger files could help.
BTW, you are mmap-ing and reading small chunks of 1K (smaller than the page size of 4K). You could use madvise (if possible in advance), but you should try to read larger chunks, since every file access will bring in at least a whole page.
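A sketch of the mmap variant: round the wanted chunk out to page boundaries and pass that range to madvise before touching it. It relies on `mmap.madvise` and `mmap.MADV_WILLNEED`, which need Python 3.8+ on Linux; `page_span` is a helper name invented for this example.

```python
import mmap
import tempfile


def page_span(offset, length, page=mmap.PAGESIZE):
    """Return (start, size) of the page-aligned range covering [offset, offset+length)."""
    start = (offset // page) * page
    end = -(-(offset + length) // page) * page   # round up to a page boundary
    return start, end - start


with tempfile.TemporaryFile() as f:
    f.write(b"\0" * (4 * mmap.PAGESIZE))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        offset = mmap.PAGESIZE + 100
        start, span = page_span(offset, 1024)
        if hasattr(view, "madvise"):
            # Ask the kernel to start paging this range in ahead of the access.
            view.madvise(mmap.MADV_WILLNEED, start, span)
        chunk = view[offset:offset + 1024]
```

A 1K read that straddles a page boundary costs two pages, which is one more reason to batch or enlarge the reads.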
You really should benchmark!

Are they same thing: Linux's framebuffer and GPU's memory

From my understanding they are different.
The Linux framebuffer is a software object, and the GPU's memory is physical memory mapped to the GPU device.
My questions are the following:
1) Is my understanding correct?
2) If so, it looks like somehow merging the two into one could improve performance (I guess there are many technical details explaining why this is not possible, and so on...)
3) If not, could you explain how Linux framebuffer and GPU work together?
The Linux framebuffer device is a virtual device that wraps the data it receives for display. So generally the answer is no - it is not GPU memory. In theory a driver can map GPU memory into fbdev, but it is unlikely that anyone does this. The main problem is that there may be many virtual consoles but, e.g., only one monitor - fbdev must handle this. Another thing is that GPU memory only quite recently became virtualised (directly accessible); on older GPUs you can't just write anything you like into GPU memory.
Aside from that, fbdev provides a unified interface, while direct access to GPU memory would require hardware-specific data formats. When there is a difference between formats, the fbdev driver performs the conversion.
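To give an idea of what such a conversion involves, here is a sketch that packs 24-bit RGB pixels into 16-bit RGB565. The choice of RGB565 and little-endian byte order is purely an assumption for illustration; the actual formats depend on the hardware and driver.

```python
def rgb888_to_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit value: 5 bits red, 6 green, 5 blue."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def convert_row(pixels):
    """Convert a row of (r, g, b) tuples to little-endian RGB565 bytes."""
    out = bytearray()
    for r, g, b in pixels:
        out += rgb888_to_rgb565(r, g, b).to_bytes(2, "little")
    return bytes(out)


row = convert_row([(255, 0, 0), (0, 255, 0), (0, 0, 255)])
```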
As for performance - it is already very good. There is probably not much to gain from raising it even further.

single common address space for all tasks

How can I give all tasks a single common address space? If all tasks share a common address space, can we avoid virtual-to-physical memory mapping, and how?
There are a few modern (research) OS's that do this, like Singularity and there are performance benefits, primarily because it no longer needs to do context changes and the file/symbol loader no longer needs to do address translation for global caches and kernel functions.
You do need to be a bit more specific about what you're looking for, tho'. You tagged your post as OSX and Linux, both of which require virtual memory. When running on systems without a MMU (and thus no virtual memory) it emulates it, which I'm fairly certain you can't circumvent. I'm not an expert by any means.
uClinux is an implementation of Linux that runs on processors that lack an MMU (such as ARM7), so by definition must have a single address space for all tasks.
So one answer to "how" is "use uClinux".
You tagged this VxWorks, and there is another answer: VxWorks supports a flat memory model. In fact, when I last used it the MMU protection was an (expensive) add-on. Many other RTOSes designed for microcontrollers similarly do not support an MMU, such as eCos and FreeRTOS.
Of RTOS's that do support an MMU, QNX is probably amongst the most robust and mature, while still maintaining high performance.
I'm not sure why you would want to disable virtual memory mapping - it's a built in function of the cpu, and pretty much essential when running an OS to properly isolate processes from each other.
Most operating systems allow you to disable virtual memory, so that your memory capacity is limited by physical memory. However, a process's address space is still virtual, and virtual to physical mapping is still happening.
A way to get what you want is to run an operating system that executes in Real Mode, such as DOS or Windows 3.0, or write your own.
The advantages of virtual memory far outweigh the disadvantages. Why do you want to avoid virtual memory?
This is how some older operating systems and even how some modern operating systems that lack VM still work. It has many disadvantages for things like desktop and server applications but it can be useful in an embedded and/or real-time context, or where you have minimal hardware.
VxWorks AE (Advanced Edition) deviates from the concept of a common address space for all tasks, so it can effectively be used on systems both with and without an MMU. The common address space for all tasks is called the flat memory model, and separate address spaces for different tasks are called the overlapped memory model or segmented memory model. You should not confuse the memory model with the memory layout as seen in object files, which divides data into code segment, data segment, BSS, etc. The two are entirely different things :).
This Stack Overflow question may help further:
Difference between flat memory model and protected memory model?
