Load a disk (HD) trail into memory (Minix) - Linux

Minix is a microkernel OS, written in C and based on the Unix architecture, sometimes used in embedded systems, and I have a task to alter the way it works in some ways.
In Minix there is a cache for disk blocks (used to make disk access fast). I need to alter that cache so that it keeps disk trails instead of disk blocks.
A trail is a circular area of the HD, composed of sectors.
So I'm a bit lost here: how can I load a disk trail into memory? (Answers related to Linux systems might help.)
Should I alter the disk driver, or use functions and methods of an existing one?
How do I calculate where on the HD a disk block is located?
Thanks for your attention.

The usual term for what you're describing is a disk track, not a "trail" (the set of tracks at the same radius across all platters is a cylinder).
What you're trying to do isn't precisely possible; modern hard drives do not expose their physical organization to the operating system. While cylinder/head/sector addressing is still supported for compatibility, the numbers used have no relationship to the actual location of data on the drive.
Instead, consider defining fixed "chunks" of the disk which will always be loaded into cache together. (For instance, perhaps you could group every 128 sectors together, creating a 64 KB "chunk". So a read for sector 400 would cause the cache to pull in sectors 384-511, for example.) Figuring out how to make the Minix disk cache do this will be your project. :)
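For illustration, here is a minimal sketch of that chunk arithmetic in plain C (this is not Minix cache code; the 128-sector / 64 KB figures are just the example numbers from the paragraph above):

```c
/* Hypothetical helper, not Minix code: map a sector number to the fixed
 * "chunk" that the cache would load as a unit. Assumes 512-byte sectors
 * and 128-sector (64 KB) chunks, i.e. the example numbers above. */
#include <stdio.h>

#define SECTORS_PER_CHUNK 128UL   /* 128 * 512 B = 64 KB per chunk */

struct chunk_range {
    unsigned long first_sector;
    unsigned long last_sector;
};

static struct chunk_range chunk_for_sector(unsigned long sector)
{
    struct chunk_range r;
    r.first_sector = (sector / SECTORS_PER_CHUNK) * SECTORS_PER_CHUNK;
    r.last_sector  = r.first_sector + SECTORS_PER_CHUNK - 1;
    return r;
}

int main(void)
{
    struct chunk_range r = chunk_for_sector(400);
    /* prints: a read of sector 400 pulls in sectors 384-511 */
    printf("a read of sector 400 pulls in sectors %lu-%lu\n",
           r.first_sector, r.last_sector);
    return 0;
}
```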

Related

Does disk IO correspond directly to its physical sector location?

I've been playing around with disk IO on flash drives, HDDs, and SSDs by opening /dev/sd* paths in Linux the way I would any other file.
I understand that the memory controller on the disk can hide true block order (via a mapping) from the OS.
This boils down to these questions:
Are the blocks in /dev/sd* in the order perceived by the OS, or in the order as perceived by the disk's memory controller?
Is the order of blocks in /dev/sd* subjective between POSIX OSes?
Can these properties change if done on an NT or Cygwin system?
Is this property different among Flash, HDD, and SSD?
Can a write occur to a specific index in an opened /dev/sd* path, or is this determined by the memory controller?
Thanks in advance!
If you use the device nodes for entire disks (/dev/sda, /dev/sdb, and so on), then the file offsets for the block device correspond to logical block addresses and will be portable across systems (assuming that the disk sector size is supported). This is independent of the storage technology.
However, the names of the device nodes are different from system to system.
If you use sub-devices (partitions), this is not necessarily the case because interpretation of and support for partition tables varies considerably.
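As a concrete illustration of "file offset == logical block address", here is a minimal sketch that reads one logical block from a whole-disk node. The device path, the LBA, and the 512-byte sector size are assumptions for the example; real code should query the sector size (e.g. with the BLKSSZGET ioctl) and will normally need root:

```c
/* Minimal sketch: read one logical block from a whole-disk device node.
 * /dev/sdb, LBA 2048 and the 512-byte sector size are assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const uint64_t lba = 2048;        /* logical block address to read */
    const uint64_t sector_size = 512;
    unsigned char buf[512];

    int fd = open("/dev/sdb", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* File offset == LBA * sector size: the kernel presents the disk as a
     * flat array of logical blocks, whatever the physical layout is. */
    if (pread(fd, buf, sizeof buf, (off_t)(lba * sector_size)) != (ssize_t)sizeof buf) {
        perror("pread");
        close(fd);
        return 1;
    }

    printf("first byte of LBA %llu: 0x%02x\n", (unsigned long long)lba, buf[0]);
    close(fd);
    return 0;
}
```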

In a non-DMA scenario, does disk content go to CPU registers first and then to main memory during a disk read?

I am learning computer organization but am struggling with the following concept. In non-DMA scenarios, do all disk reads follow this sequence to get into main memory:
Disk storage surface -> Disk registers -> CPU registers -> Main memory
Similarly for writes, is the sequence:
Main memory -> CPU registers -> Disk registers -> Disk storage surface
(I know that in a DMA scenario, the CPU only initiates the transfer, after which the disk's contents are transferred directly to main memory.)
If so, before DMA existed, was the above sequence a serious bottleneck, given that CPU register capacity is tiny compared to main memory and disk storage? Or is it fast enough that a human user wouldn't notice in non-DMA modes?
PS: Please bear with my rudimentary terminology, but I hope I conveyed what I want to ask.
Yes, what you describe is what happened in the bad old days with programmed-I/O instead of DMA.
For example, IDE disk-controller hardware used to be less well standardized, so the Linux drivers defaulted to programmed I/O (i.e. a copy loop using x86 IN instructions, since ATA predated memory-mapped I/O registers being common). For decent performance, you had to manually enable DMA in your boot scripts.
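For a sense of what that programmed-I/O copy loop looked like, here is a rough user-space sketch against the legacy ATA ports on x86. It is illustrative only: a real driver runs in the kernel, checks error bits, handles timeouts, and supports more than a single sector and 28-bit LBA, and running this requires ioperm()/root.

```c
/* Rough sketch of a legacy ATA PIO sector read on x86, to illustrate the
 * "copy loop using IN instructions" mentioned above. Illustrative only. */
#include <sys/io.h>
#include <stdint.h>

#define ATA_DATA    0x1F0   /* 16-bit data register */
#define ATA_SECCNT  0x1F2
#define ATA_LBA_LO  0x1F3
#define ATA_LBA_MID 0x1F4
#define ATA_LBA_HI  0x1F5
#define ATA_DRIVE   0x1F6
#define ATA_CMD     0x1F7   /* write: command, read: status */
#define ATA_STATUS  0x1F7

#define STATUS_BSY  0x80
#define STATUS_DRQ  0x08

#define CMD_READ_SECTORS 0x20

static void ata_pio_read_sector(uint32_t lba, uint16_t *buf)
{
    /* Program the 28-bit LBA and sector count, then issue READ SECTORS. */
    outb(0xE0 | ((lba >> 24) & 0x0F), ATA_DRIVE);
    outb(1, ATA_SECCNT);
    outb(lba & 0xFF, ATA_LBA_LO);
    outb((lba >> 8) & 0xFF, ATA_LBA_MID);
    outb((lba >> 16) & 0xFF, ATA_LBA_HI);
    outb(CMD_READ_SECTORS, ATA_CMD);

    /* Busy-wait until the drive has data ready (BSY clear, DRQ set). */
    while (inb(ATA_STATUS) & STATUS_BSY)
        ;
    while (!(inb(ATA_STATUS) & STATUS_DRQ))
        ;

    /* The copy loop itself: 256 16-bit IN instructions per 512-byte sector,
     * every word passing through a CPU register on its way to memory. */
    for (int i = 0; i < 256; i++)
        buf[i] = inw(ATA_DATA);
}
```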
But before doing that, you had to check that manually enabling DMA didn't lead to lockups or, far worse, data corruption.
Re: memory-mapped files: they have nothing to do with how the data gets from disk into the pagecache (or vice versa). mmap() just means your process's address space includes a shared mapping of the same pages that the OS is using to cache the file's contents.

Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for big block sizes >4 KB at a time (even MB-sized).
My understanding is that Linux and its filesystems will "chop" files into 4 KB blocks that are passed to the block device driver, which then needs to physically transfer that data to or from the device (e.g., for a write).
I am also aware that the kernel page size plays a role in this limitation, as it is set to 4 KB.
As an experiment, I want to find out whether there is a way to actually increase this block size, so that we save some time (instead of doing multiple 4 KB writes, we could do one write with a bigger block size).
Is there any FS or any existing project I can take a look at for this?
If not, what is needed to do this experiment - which parts of Linux need to be modified?
I'm trying to find out the level of difficulty and the resources needed. Or, whether it is simply impossible to do, and/or any reason why we don't even need to do so. Any comment is appreciated.
Thanks.
The 4k limitation is due to the page cache. The main issue is that if you have a 4k page size, but a 32k block size, what happens if the file is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000, and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?
These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that, for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if any of them are dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and dirty state not at the 4k page level, but at the compound 32k page/"block" level. It basically will involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
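To make the extra bookkeeping concrete, here is a hypothetical sketch (none of these names exist in the kernel) of the kind of per-compound-block state the VM layer would have to track:

```c
/* Hypothetical sketch, not real kernel code: extra state for a 32 KB
 * "compound block" built from eight 4 KB pages - which sub-pages are
 * resident, which hold valid on-disk data, and which are dirty. */
#include <stdint.h>
#include <stdbool.h>

#define PAGES_PER_BLOCK 8   /* 8 * 4 KB = 32 KB block */

struct compound_block {
    uint8_t present_mask;   /* bit i set: page i is resident in the page cache */
    uint8_t valid_mask;     /* bit i set: page i covers valid file data on disk */
    uint8_t dirty_mask;     /* bit i set: page i was modified in memory         */
};

/* Writing back any dirty page forces a read-modify-write whenever some of
 * the block's valid pages are not resident - exactly the case above where
 * only the page at offset 20000 is in memory and it is dirty. */
static bool writeback_needs_rmw(const struct compound_block *b)
{
    return b->dirty_mask != 0 &&
           (b->present_mask & b->valid_mask) != b->valid_mask;
}
```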
Also consider that even if you did hire a Linux VM expert to do this work, (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDD's with a 32k or 64k physical sector size), it will be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup who is hoping to sell your SSD product into the enterprise market --- you might as well give up now with this approach. It's just not going to work before you run out of money.
Now, if you happen to be working for a large cloud company that is making its own hardware (à la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since they don't use enterprise kernels that add new features at a glacial pace --- but for that reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.
If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies in this same space, and maybe you could collaborate with them to see if together you could do this kind of development work and try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream Linux kernel developers will demand that this not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private patch to your kernel, but something you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.
Although I've never written a device driver for Linux, I find it very unlikely that this is a real limitation of the driver interface. I guess it's possible that you would want to break I/O into scatter-gather lists where each entry in the list is one page long (to improve memory allocation performance and decrease memory fragmentation), but most device types can handle those directly nowadays, and I don't think anything in the driver interface actually requires it. In fact, the simplest way that requests are issued to block devices (described on page 13 -- marked as page 476 -- of that text) looks like it receives:
a sector start number
a number of sectors to transfer (no limit is mentioned, let alone a limit of 8 512B sectors)
a pointer to write the data into / read the data from (not a scatter-gather list for this simple case, I guess)
whether this is a read versus a write
I suspect that if you're seeing exclusively 4K accesses it's probably a result of the caller not requesting more than 4K at a time -- if the filesystem you're running on top of your device only issues 4K reads, or whatever is using the filesystem only accesses one block at a time, there is nothing your device driver can do to change that on its own!
Using one block at a time is common for random access patterns like database read workloads, but database log or FS journal writes or large serial file reads on a traditional (not copy-on-write) filesystem would issue large I/Os more like what you're expecting. If you want to try issuing large reads against your device directly to see if it's possible through whatever driver you have now, you could use dd if=/dev/rdiskN of=/dev/null bs=N to see if increasing the bs parameter from 4K to 1M shows a significant throughput increase.
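If you'd rather run that experiment from a small program than with dd, here is a minimal sketch that issues large reads straight at the block device. The device path and the 1 MB request size are assumptions; O_DIRECT is used so the page cache is bypassed and the size you pass to read() is what reaches the block layer (it may still be split by the driver's own limits):

```c
/* Minimal sketch: issue large O_DIRECT reads against a block device to see
 * whether big requests reach the driver. /dev/sdb and 1 MB are examples. */
#define _GNU_SOURCE         /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t req = 1024 * 1024;   /* 1 MB per read */
    void *buf;

    /* O_DIRECT buffers must be aligned; 4 KB alignment is safe. */
    if (posix_memalign(&buf, 4096, req) != 0) { perror("posix_memalign"); return 1; }

    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Read the first 64 MB in 1 MB chunks; each read() becomes one large
     * request, which is the point of the experiment. */
    for (int i = 0; i < 64; i++) {
        ssize_t n = read(fd, buf, req);
        if (n <= 0) { perror("read"); break; }
    }

    close(fd);
    free(buf);
    return 0;
}
```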

Using a hard disk without a filesystem for big data

I'm working on a web crawler and have to handle big data (about 160 TB raw data in trillions of data files).
The data should be stored sequentially as one big bz2 file on the magnetic hard disk. An SSD is used to hold the metadata. The most important operation on the hard disk is a sequential read over all 4 TB of the disk, which should happen at the full maximum speed of 150 MB/s.
I want to avoid the overhead of a file system and instead use the /dev/ device files directly. Does this access use the OS block buffer? Are the access operations queued, or synchronous in a FIFO style?
Is it better to use the /dev/ device directly or to write your own user-level file system?
Does anyone have experience with this?
If you don't use any file system but read your disk device (e.g. /dev/sdb) directly, you are losing all the benefit of file system cache. I am not at all sure it is worthwhile.
Remember that you could use syscalls like readahead(2) or posix_fadvise(2) or madvise(2) to give hints to the kernel to improve performance.
Also, when making your file system, you might use a larger than usual block size. And don't forget to use big blocks (e.g. 64 to 256 KB) when read(2)-ing data. You could also use mmap(2) to get the data from disk.
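A minimal sketch of those hints, assuming the data sits in one big file such as the /home/somebigfile example used below (the 256 KB chunk is just a value in the suggested range):

```c
/* Sketch: tell the kernel the access pattern is sequential, then read in
 * large chunks. The path and 256 KB chunk size are example values. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_CHUNK (256 * 1024)   /* 256 KB per read(2) */

int main(void)
{
    int fd = open("/home/somebigfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint: the whole file will be read sequentially, so the kernel can do
     * aggressive readahead and drop pages behind us. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char *buf = malloc(READ_CHUNK);
    if (!buf) { close(fd); return 1; }

    ssize_t n;
    while ((n = read(fd, buf, READ_CHUNK)) > 0) {
        /* ... process n bytes of crawled data here ... */
    }

    free(buf);
    close(fd);
    return 0;
}
```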
I would recommend against coding your own file system. Existing file systems are quite well tuned (and some are used on petabytes of storage). You may want to choose big blocks when making them (e.g. -b with mke2fs(8)...).
BTW, choosing between filesystem and raw disk data is mostly a configuration issue (you specify a /dev/sdb path if you want raw disk, and /home/somebigfile if you want a file). You could code a webcrawler to be able to do both, then benchmark both approaches. Very likely, performance could depend upon actual system and hardware.
As a case in point, relational database engines often used raw disk partitions in the previous century (e.g. the 1990s), but they seem to mostly use big files today.
Remember that the real bottleneck is the hardware (i.e. disk): CPU time used by filesystems is often insignificant and cannot even be measured.
PS. I don't have much recent real-world experience with these issues.

Store more than 3GB of video-frames in memory, on 32-bit OS

At work we have an application to play 2K (2048*1556 px) OpenEXR film sequences. It works well, apart from sequences that are over 3 GB (quite common): then it has to unload old frames from memory, despite the fact that all machines have 8-16 GB of memory (which is addressable via the Linux BIGMEM stuff).
The frames have to be cached in memory to play back in realtime. The OS is a several-year-old 32-bit Fedora distro (upgrading to 64-bit is not possible for the foreseeable future). The address-space limit is 3 GB per process.
Basically, is it possible to cache more than 3 GB of data in memory, somehow? My initial idea was to spread the data between multiple processes, but I've no idea if this is possible.
One possibility may be to use mmap. You would map/unmap different parts of your data into the same virtual memory region. You could only have one set mapped at a time, but as long as there was enough physical memory, the data should stay resident.
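A minimal sketch of that map/unmap approach, assuming the frames have been packed into one large cache file (the path and the 512 MB window size are made up for the example); on 32-bit Linux you also need 64-bit file offsets:

```c
/* Sketch of the map/unmap idea: keep only one window of the frame data
 * mapped at a time; pages already touched can stay resident in the kernel's
 * page cache as long as there is physical memory for them. */
#define _FILE_OFFSET_BITS 64   /* 64-bit offsets on 32-bit Linux */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE (512UL * 1024 * 1024)   /* 512 MB mapped at a time */

int main(void)
{
    int fd = open("/data/sequence_cache.bin", O_RDONLY);   /* example path */
    if (fd < 0) { perror("open"); return 1; }

    off_t file_size = lseek(fd, 0, SEEK_END);

    for (off_t off = 0; off < file_size; off += WINDOW_SIZE) {
        size_t len = (file_size - off) < (off_t)WINDOW_SIZE
                         ? (size_t)(file_size - off)
                         : WINDOW_SIZE;

        void *win = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
        if (win == MAP_FAILED) { perror("mmap"); break; }

        /* ... decode/play the frames that live in this window ... */

        munmap(win, len);   /* releases the address range, not the cached pages */
    }

    close(fd);
    return 0;
}
```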
How about creating a RAM drive and loading the file into that ... assuming the RAM drive supports the BIGMEM stuff for you.
You could use multiple processes: each process loads a view of the file as a shared memory segment, and the player process then maps the segments in turn as needed.
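A rough sketch of that scheme using POSIX shared memory (the segment names and the 1 GB segment size are invented for the example; link with -lrt on older glibc):

```c
/* Rough sketch of the multi-process scheme: each loader process creates and
 * fills one segment ("/frames0", "/frames1", ...), and the player maps the
 * segment it needs next and unmaps it before moving on, so the 3 GB limit
 * applies per mapping rather than to the whole sequence. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_SIZE (1UL * 1024 * 1024 * 1024)   /* 1 GB per segment */

/* Loader side: create a segment and return a writable mapping to fill. */
static void *create_segment(const char *name)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SEGMENT_SIZE) != 0)
        return NULL;
    void *p = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* segment persists until shm_unlink() */
    return p == MAP_FAILED ? NULL : p;
}

/* Player side: map one segment read-only while it is being played. */
static void *map_segment(const char *name)
{
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, SEGMENT_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```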
My, what an interesting problem :)
(EDIT: Oh, I just read Rob's ram drive post...I got all excited by the problem...but have a bit more to suggest, so I won't delete)
Would it be possible to...
set up a multi-gigabyte RAM disk, and then
modify the program to do all its reading from the "disk"?
I'd guess the RAM disk part is where all the problems would be, since the size of the RAM disk would be OS and file system dependent. You might have to create multiple RAM disks and have your code jump between them. Or maybe you could set up a RAID-0 stripe set over multiple RAM disks. Or, if there are still OS limitations and you can afford to drop a couple grand (4k?), set up a hardware RAID-0 stripe set with some of those new blazing fast solid state drives. Or...
Fun, fun, fun.
Be sure to follow up!
I assume you can modify the application. If so, the easiest thing would be to start the application several times (once for each 3GB chunk of video), have each one hold a chunk of video, and use another program to synchronize them so they each take control of the framebuffer (or other video output) in turn.
The synchronization is going to be a little messy, perhaps, but it can be simplified if each app has its own framebuffer and the sync program points the video controller to the correct framebuffer in between frames when switching to the next app.
#dbr said:
There is a review machine with an absurd fiber-channel-RAID-array that can play 2K files direct from the array easily. The issue is with the artist-workstations, so it wouldn't be one $4000 RAID array, it'd be hundreds..
Well, if you can accept a limit of ~30GB, then maybe a single 36GB SSD drive would be enough? Those go for ~US$1k each I think, and the data rates might be enough. That may very well be cheaper than a pure RAM approach. There are smaller sizes available, too. If ~60GB is enough you could probably get away with a JBOD array of 2 for double the cost, and skip the RAID controller. Be sure only to look at the higher end SSD options--the low end is filled with glorified memory sticks. :P
