Do partition tables use logical block size or 512 bytes as the unit?

When I read the partition table (MBR or GPT) from a device, are the numbers in units of logical block size, or nominal 512-byte sectors? Surprisingly, I couldn't find the canonical answer through googling.

Conclusion reversed based on further investigation
Although almost all drives use 512-byte logical sectors, modern partition tables use LBA addresses, and the LBA unit is the logical sector size of the device, which today may be as large as 4096 bytes.
In the end I posted the question about unit size to the main GNU parted (partition editor) mailing list and received this response:
"LBA always refers to the drive's block size. So it may be 512 or 4096
or some other value, depending on what the drive reports."
Incorrect previous answer version: [[Partition tables (in the MBR and otherwise) refer to 512 byte blocks / logical sectors. See for example https://en.wikipedia.org/wiki/Master_boot_record#PTE.]]
Background information
Reporting of physical disk sector sizes is fundamentally done through commands in the ATA-8 specification, specifically the "IDENTIFY DEVICE" command. The compatibility issues most often discussed concern alignment of I/O operations. Apparently most drives handle 512-byte-aligned I/O, but with performance penalties, though there are some drives advertised as "4K native" or "4Kn" that do not accept 512-byte I/O at all. In general, drives with physical 4K sectors use what is called "Advanced Format", which may help you search if you want more info.
This article https://linuxconfig.org/linux-wd-ears-advanced-format has some relatively clear discussion, especially if you are a Linux user. For what it's worth, on Linux the "parted -l" command reports physical and logical sector size, and parted also knows how to align partitions appropriately for Advanced Format devices.
Also, you might find this article http://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-master-ti/ informative and reassuring on the issue.
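As a sketch of what this means in practice, here is a minimal MBR parser. The buffer below is synthetic (on a real device you would read the first 512 bytes of e.g. /dev/sda); the point is that the LBA fields in the partition entries are dimensionless sector counts, and the byte offsets you compute from them depend on the logical sector size you multiply by:

```python
import struct

def parse_mbr(mbr: bytes, logical_sector_size: int = 512):
    """Parse the four primary partition entries of a classic MBR.

    The 'start LBA' and 'sector count' fields are in units of the
    drive's *logical* sector size, not a fixed 512 bytes.
    """
    assert len(mbr) >= 512 and mbr[510:512] == b"\x55\xaa", "missing MBR signature"
    parts = []
    for i in range(4):
        entry = mbr[446 + 16 * i : 446 + 16 * (i + 1)]
        ptype = entry[4]  # partition type byte (0x83 = Linux)
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0:
            parts.append({
                "type": ptype,
                "start_lba": start_lba,
                "start_byte": start_lba * logical_sector_size,
                "size_bytes": num_sectors * logical_sector_size,
            })
    return parts

# Build a synthetic MBR with one Linux partition: LBA 2048, 2*1024*1024 sectors.
mbr = bytearray(512)
mbr[510:512] = b"\x55\xaa"
mbr[446:462] = struct.pack("<8BII", 0, 0, 0, 0, 0x83, 0, 0, 0, 2048, 2 * 1024 * 1024)
print(parse_mbr(bytes(mbr)))
```

Note that the same entry describes a 1 GiB partition on a 512-byte-sector drive but an 8 GiB partition if the drive reports 4096-byte logical sectors, which is exactly why the unit question matters.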

Related

When does a device get a 512B request from the filesystem?

I am new to Linux and have been doing a bit of reading, but I am a little confused about the following. Can the device receive a request for a single 512B sector? Under what conditions does this happen? From what I understand, while the sector size defines the smallest unit by which a device can be addressed, the FS usually has a block size of 4K (the smallest unit of access for the FS). So this means most (all?) commands are issued by the FS at 4K granularity.
Can a file system generate traffic smaller than 4K (1-7 512-byte sectors) from application traffic?
Is there some file system metadata that can cause this kind of traffic?
If we align the partition to a 4K boundary, will the device always get commands aligned on 4K boundaries?
This can happen for a variety of reasons (assuming your disks expose a logical sector size of 512 bytes), usually because something outside the filesystem sends a correctly aligned direct request for 512 bytes.
Some cases when this can happen during general usage:
Reading an old-style MBR partition table (which fits in the 512 bytes at the start of the disk)
Rewriting the bootloader, or any other raw device access you explicitly request
Reading the smallest possible unit at a time from a broken disk with 512-byte sectors
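To the third question, a sketch of why the partition's own starting sector matters: a 4K-aligned offset inside the filesystem only lands on a 4K device boundary if the partition starts on one. The starting sectors 2048 and 63 below are just illustrative (1 MiB alignment vs. the old DOS convention):

```python
def device_aligned_4k(partition_start_sector: int, fs_offset_bytes: int,
                      logical_sector_size: int = 512) -> bool:
    """Return True if a filesystem byte offset lands on a 4 KiB device boundary."""
    device_byte = partition_start_sector * logical_sector_size + fs_offset_bytes
    return device_byte % 4096 == 0

# Partition starting at sector 2048 (1 MiB): 4K-aligned FS blocks stay aligned.
print(device_aligned_4k(2048, 8192))   # True
# Old-style partition starting at sector 63: every 4K FS block is misaligned.
print(device_aligned_4k(63, 8192))     # False
```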

Load a Disc (HD) trail into memory (Minix)

Minix is a micro-kernel OS, written in C, based on the Unix architecture and sometimes used in embedded systems, and I have a task to alter the way it works in some ways.
In Minix there is a cache for disk blocks (used to make access to disk fast). I need to alter that cache, so it will keep disc trails instead of disk blocks.
A trail is a circular area of the HD, composed of sectors.
So I'm a bit lost here: how can I load a disk trail into memory? (Answers related to Linux systems might help.)
Should I alter the disk driver or use functions and methods of an existing one?
How do I calculate where on the HD a disk block is located?
Thanks for your attention.
The typical term for what you're describing is a disk cylinder, not a "trail".
What you're trying to do isn't precisely possible; modern hard drives do not expose their physical organization to the operating system. While cylinder/head/sector addressing is still supported for compatibility, the numbers used have no relationship to the actual location of data on the drive.
Instead, consider defining fixed "chunks" of the disk which will always be loaded into cache together. (For instance, perhaps you could group every 128 sectors together, creating a 64 KB "chunk". So a read for sector 400 would cause the cache to pull in sectors 384-511, for example.) Figuring out how to make the Minix disk cache do this will be your project. :)
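The grouping suggested above can be sketched in a few lines (128 sectors per chunk is just the example figure from the answer, not a Minix constant):

```python
CHUNK_SECTORS = 128            # sectors grouped per cache "chunk" (assumed)
SECTOR_BYTES = 512

def chunk_for_sector(sector: int):
    """Map a sector number to the (first, last) sector range of its chunk."""
    first = (sector // CHUNK_SECTORS) * CHUNK_SECTORS
    return first, first + CHUNK_SECTORS - 1

# A read for sector 400 pulls the whole 64 KiB chunk covering sectors 384-511.
print(chunk_for_sector(400))   # (384, 511)
```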

writeback of dirty pages in linux

I have a question regarding the writeback of the dirty pages. If a portion of page data is modified, will the writeback write the whole page to the disk, or only the partial page with modified data?
The memory management hardware on x86 systems has a granularity of 4096 bytes. This means it is not possible to find out which bytes of a 4096-byte page were really changed and which are unchanged.
Theoretically the disk driver system could check if bytes have been changed and not write the 512-byte blocks that have not been changed.
However this would mean that - if the blocks are no longer in disk cache memory - the page must be read from hard disk to check if it has changed before writing.
I do not think that Linux would do this in that way because reading the page from disk would cost too much time.
Upon each hardware interrupt, the CPU wants to transfer as much data as the hard disk controller can handle; this size is the block size (one sector, in Linux terms):
http://en.wikipedia.org/wiki/Disk_sector
https://superuser.com/questions/121252/how-do-i-find-the-hardware-block-read-size-for-my-hard-drive
But waiting on a single interrupt for a large file can make the system appear unresponsive, so it is logical to break the transfer into smaller chunks (like 512 bytes) so that the CPU can handle other tasks in between. Therefore, whether you changed one byte or 511 bytes, as long as the change is within a single block, all of that block's data gets written at the same time. And throughout the Linux kernel, flagging blocks as dirty for writeback goes by a single unique identifier, the sector number, so anything smaller than the sector size is too difficult to manage efficiently.
All these said, don't forget that the harddisk controller itself also has a minimum block size for write operation.
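A sketch of the consequence of page-granular dirty tracking, assuming 4096-byte pages and 512-byte blocks: when any byte of a page changes, all eight of its blocks become writeback candidates.

```python
PAGE = 4096
BLOCK = 512

def blocks_for_dirty_page(page_index: int):
    """Block numbers written back when one page is dirty: always the whole
    page's worth, since dirty tracking is per page, not per byte."""
    per_page = PAGE // BLOCK   # 8 blocks per page
    first = page_index * per_page
    return list(range(first, first + per_page))

# Changing a single byte in page 3 dirties blocks 24..31.
print(blocks_for_dirty_page(3))
```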

Linux: writes are split into 512K chunks

I have a user-space application that generates big SCSI writes (details below). However, when I look at the SCSI commands that reach the SCSI target (i.e. the storage, connected via FC), something is splitting these writes into 512K chunks.
The application basically does 1M-sized direct writes directly into the device:
fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);
This code causes two SCSI WRITEs to be sent, 512K each.
However, if I issue a direct SCSI command, without the block layer, the write is not split.
I issue the following command from the command line:
sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct
I can see one single 1M-sized SCSI WRITE.
The question is, what is splitting the write and, more importantly, is it configurable?
Linux block layer seems to be guilty (because SG_IO doesn't pass through it) and 512K seems too arbitrary a number not to be some sort of a configurable parameter.
As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):
max_hw_sectors_kb maximum size of a single I/O the hardware can accept
max_sectors_kb the maximum size the block layer will send
max_segment_size and max_segments the DMA engine limitations for scatter gather (SG) I/O (maximum size of each segment and the maximum number of segments for a single I/O)
The segment restrictions matter a lot when the buffer the I/O is coming from is not contiguous; in the worst case each segment can be as small as a page (4096 bytes on x86 platforms). This means the SG I/O for one request can be limited to a size of 4096 * max_segments.
The question is, what is splitting the write
As you guessed the Linux block layer.
and, more importantly, is it configurable?
You can fiddle with max_sectors_kb, but the rest is fixed and comes from device/driver restrictions (so in your case I'm going to guess probably not, though you might see bigger I/Os right after a reboot due to less memory fragmentation).
512K seems too arbitrary a number not to be some sort of a configurable parameter
The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform with a max_segments of 128, so:
4096 * 128 / 1024 = 512
and that's where 512K could come from.
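The cap can be written out as a toy calculation, treating max_segments = 128 as an assumption taken from this scenario rather than a universal value:

```python
def effective_io_limit_kb(max_sectors_kb: int, max_segments: int,
                          page_size: int = 4096) -> int:
    """Worst-case cap on a single I/O: the block layer's max_sectors_kb,
    further limited by one-page-per-segment scatter-gather fragmentation."""
    segment_limit_kb = page_size * max_segments // 1024
    return min(max_sectors_kb, segment_limit_kb)

# With max_sectors_kb = 1024 and max_segments = 128 on x86 (assumed values),
# the fragmentation limit of 512 KiB wins, matching the observed split.
print(effective_io_limit_kb(max_sectors_kb=1024, max_segments=128))  # 512
```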
Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...
The blame is indeed on the block layer; the SCSI layer itself places few restrictions on the size. You should check, though, that the underlying layers are able to pass your request through, especially with regard to direct I/O, since that may be split into many small pages and require a scatter-gather list longer than what the hardware or even the drivers can support (libata is/was somewhat limited).
You should look at and tune the files under /sys/class/block/$DEV/queue; the most likely to match what you need is max_sectors_kb, but you can just try things out and see what works for you. You may need to tune the partition's variables as well.
There's a max sectors per request attribute of the block driver. I'd have to check how to modify it. You used to be able to get this value via blockdev --getmaxsect but I'm not seeing the --getmaxsect option on my machine's blockdev.
Looking at the following files should tell you whether the logical block size is different (possibly 512 in your case). I am not sure, however, whether you can write to these files to change those values (the logical block size, that is).
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
try ioctl(fd, BLKSECTSET, &blocks)

Historical perspective to Linux Filesystems

Jonathan Leffler's comment in the question "How can I find the Size of some specified files?" is thought-provoking. I will break it into parts for analysis.
files are stored on pages; you normally end up with more space being used than that calculation gives because a 1 byte file (often) occupies one page (of maybe 512 bytes). The exact values vary - it was easier in the days of the 7th Edition Unix file system (though not trivial even then if you wanted to take account of indirect blocks referenced by the inode as well as the raw data blocks).
Questions about the parts
What is the definition of "page"?
Why is the word "maybe" in the after-thought "one page (of maybe 512 bytes)"?
Why was it easier to measure exact sizes in the "7th Edition Unix file system"?
What is the definition of "indirect block"?
How can you have references by two things: "the inode" and "the raw data blocks"?
Historical Questions Emerged
I. What is the historical context Leffler is speaking about?
II. Have the definitions changed over time?
I think he means block instead of page, a block being the minimum addressable unit on the filesystem.
block sizes can vary
Not sure why, but perhaps the filesystem interface exposed APIs allowing a more exact measurement.
An indirect block is a block that contains pointers to further blocks, itself referenced from the inode.
The inode occupies space (blocks) just as the raw data does. This is what the author meant.
As usual for Wikipedia pages, Block (data storage) is informative despite being far too exuberant about linking all keywords.
In computing (specifically data transmission and data storage), a block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured is said to be blocked. The process of putting data into blocks is called blocking. Blocking is used to facilitate the handling of the data-stream by the computer program receiving the data. Blocked data is normally read a whole block at a time. Blocking is almost universally employed when storing data to 9-track magnetic tape, to rotating media such as floppy disks, hard disks, optical discs and to NAND flash memory.
Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data, though the block size in file systems may be a multiple of the physical block size. In classical file systems, a single block may only contain a part of a single file. This leads to space inefficiency due to internal fragmentation, since file lengths are often not multiples of block size, and thus the last block of files will remain partially empty. This will create slack space, which averages half a block per file. Some newer file systems attempt to solve this through techniques called block suballocation and tail merging.
There's also a reasonable overview of the classical Unix File System.
Traditionally, hard disk geometry (the layout of blocks on the disk itself) has been CHS.
Head: the magnetic reader/writer on each (side of a) platter; the heads move in and out together to access different cylinders
Cylinder: the set of tracks at one radial position, one track per head; a track is the circle that passes under a head as the platter rotates
Sector: a constant-sized amount of data stored contiguously on a portion of a track; the smallest unit of data that the drive can deal with
CHS isn't used much these days, as
Hard disks no longer use a constant number of sectors per cylinder. More data is squeezed onto a platter by using a constant arclength per sector rather than a constant rotational angle, so there are more sectors on the outer cylinders than there are on the inner cylinders.
By the ATA specification, a drive may have no more than 2^16 cylinders, 2^4 heads, and 2^8 sectors per track; with 512B sectors, this is a limit of 128 GiB. Through BIOS INT13, it is not possible to access anything beyond 7.88 GiB through CHS anyway.
For backwards-compatibility, larger drives still claim to have a CHS geometry (otherwise DOS wouldn't be able to boot), but getting to any of the higher data requires using LBA addressing.
CHS doesn't even make sense on RAID or non-rotational media.
but for historical reasons, this has affected block sizes: because sector sizes were almost always 512B, filesystem block sizes have always been multiples of 512B. (There is a movement afoot to introduce drives with 1kB and 4kB sector sizes, but compatibility looks rather painful.)
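The two CHS ceilings quoted above can be checked with a few lines of arithmetic (1024 cylinders x 256 heads x 63 sectors is the conventional BIOS INT13 geometry):

```python
# ATA CHS field widths: 16-bit cylinder, 4-bit head, 8-bit sector numbers.
cylinders, heads, sectors = 2**16, 2**4, 2**8
ata_limit = cylinders * heads * sectors * 512
print(ata_limit // 2**30, "GiB")    # the 128 GiB ATA CHS ceiling

# BIOS INT13 CHS geometry limit.
int13_limit = 1024 * 256 * 63 * 512
print(int13_limit / 2**30, "GiB")   # roughly 7.88 GiB
```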
Generally speaking, smaller filesystem block sizes result in less wasted space when storing many small files (unless advanced techniques like tail merging are in use), while larger block sizes reduce external fragmentation and have lower overhead on large disks. The filesystem block size is usually a power of 2, is limited below by the block device's sector size, and is often limited above by the OS's page size.
The page size varies by OS and platform (and, in the case of Linux, can vary by configuration as well). As with block sizes, smaller page sizes reduce internal fragmentation but require more administrative overhead. A 4 kB page size is common on 32-bit platforms.
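The slack-space effect described above (and in the Wikipedia excerpt) is simple to quantify: a file's on-disk footprint is its length rounded up to whole blocks, so on average half a block per file is wasted.

```python
import math

def on_disk_size(file_bytes: int, block_size: int) -> int:
    """Space actually consumed: a file occupies whole blocks, so the tail
    of its last block is wasted (internal fragmentation)."""
    return math.ceil(file_bytes / block_size) * block_size

print(on_disk_size(1, 4096))      # 4096: a 1-byte file occupies a whole block
print(on_disk_size(5000, 4096))   # 8192: 5000 bytes need two 4 KiB blocks
```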
Now, on to describe indirect blocks. In the UFS design,
An inode describes a file.
In the UFS design, the number of pointers to data blocks that an inode could hold is very limited (less than 16). The specific number appears to vary in derived implementations.
For small files, the pointers can directly point to the data blocks that compose a file.
For larger files, there must be indirect pointers, which point to a block which only contains more pointers to blocks. These may be direct pointers to data blocks belonging to the file, or if the file is very large, they may be even more indirect pointers.
Thus the amount of storage required for a file may be greater than just the blocks containing its data, when indirect pointers are in use.
Not all filesystems use this method for keeping track of the data blocks belonging to a file. FAT simply uses a single file allocation table, effectively a gigantic series of linked lists, and many modern filesystems use extents.
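The indirect-block accounting described above can be sketched numerically. The parameters here (12 direct pointers, 4 KiB blocks, 4-byte pointers) are illustrative rather than those of any particular UFS implementation, and only single indirection is modeled:

```python
import math

BLOCK = 4096
N_DIRECT = 12                     # direct block pointers in the inode (assumed)
PTRS_PER_BLOCK = BLOCK // 4       # 4-byte pointers -> 1024 per indirect block

def blocks_used(file_bytes: int):
    """Return (data_blocks, indirect_blocks) for a file, single indirection only."""
    data = math.ceil(file_bytes / BLOCK)
    overflow = data - N_DIRECT    # data blocks that don't fit in the inode
    indirect = math.ceil(overflow / PTRS_PER_BLOCK) if overflow > 0 else 0
    return data, indirect

print(blocks_used(16 * 1024))     # 4 data blocks, all direct: (4, 0)
print(blocks_used(1024 * 1024))   # 256 data blocks, 244 via one indirect: (256, 1)
```

This is exactly why the total storage charged to a file can exceed the blocks holding its data: the indirect blocks are overhead that a naive size calculation misses.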
