How does the stat command calculate the blocks of a file? - linux

I am wondering how the stat command calculates the number of blocks for a file. I read this article that says:
The value st_blocks gives the size of the file in 512-byte blocks. (This may be smaller than st_size/512 e.g. when the file has holes.) The value st_blksize gives the "preferred" blocksize for efficient file system I/O. (Writing to a file in smaller chunks may cause an inefficient read-modify-rewrite.)
Yet I cannot verify this with my own tests.
My file system is ext3.
The command dumpe2fs -h /dev/sda3 shows:
...
First block: 0
Block size: 4096
Fragment size: 4096
...
Then I run
kent@KentT60:~/Desktop$ stat Email
File: `Email'
Size: 965 Blocks: 8 IO Block: 4096 regular file
Device: 80ah/2058d Inode: 746095 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ kent) Gid: ( 1000/ kent)
Access: 2009-08-11 21:36:36.000000000 +0200
Modify: 2009-08-11 21:36:35.000000000 +0200
Change: 2009-08-11 21:36:35.000000000 +0200
If "Blocks" here means: "how many 512 bytes blocks", the number should be 2, not 8. I thought that the block size of the file system (IO block) is 4k.
If the file system gets the file Email, it will fetch a minimum of 4k from the disk (8 x 512 bytes blocks), which means 965/512 + 6 = 8. I am not sure if this guess is correct.
Another test:
kent@KentT60:~/Desktop$ stat wxPython-demo-2.8.10.1.tar.bz2
File: `wxPython-demo-2.8.10.1.tar.bz2'
Size: 3605257 Blocks: 7056 IO Block: 4096 regular file
Device: 80ah/2058d Inode: 746210 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ kent) Gid: ( 1000/ kent)
Access: 2009-08-12 21:45:45.000000000 +0200
Modify: 2009-08-12 21:43:46.000000000 +0200
Change: 2009-08-12 21:43:46.000000000 +0200
3605257/512 = 7041.xx, rounded up to 7042.
Following my guess above, this would be 7042 + 6 = 7048. But the stat result shows 7056.
And here is another example from the internet, at https://www.computerhope.com/unix/stat.htm (pasted from the bottom of that page):
File: `index.htm'
Size: 17137 Blocks: 40 IO Block: 8192 regular file
Device: 8h/8d Inode: 23161443 Links: 1
Access: (0644/-rw-r--r--) Uid: (17433/comphope) Gid: ( 32/ www)
Access: 2007-04-03 09:20:18.000000000 -0600
Modify: 2007-04-01 23:13:05.000000000 -0600
Change: 2007-04-02 16:36:21.000000000 -0600
In this example, the file system block size is 8K. I would expect the "Blocks" value to be a multiple of 16 (8192/512 = 16), but it is 40. I'm getting lost...
Can anyone explain how stat calculates the "Blocks" value?
Thanks!

The stat command-line tool uses the stat / fstat etc. functions, which return data in the stat structure. The st_blocks member of the stat structure returns:
The total number of physical blocks of size 512 bytes actually allocated on disk. This field is not defined for block special or character special files.
So for your "Email" example, with a size of 965 and a block count of 8, it is indicating that 8*512 = 4096 bytes are physically allocated on disk. The reason it's not 2 is that the file system on disk does not allocate space in units of 512 bytes; it evidently allocates it in units of 4096. (And the unit of allocation may vary depending on file size and filesystem sophistication, e.g. ZFS supports different units of allocation.)
Similarly, for the wxPython example, it indicates that 7056*512 bytes, or 3612672 bytes are physically allocated on disk. You get the idea.
The IO block size is "a hint as to the 'best' unit size for I/O operations" - it's usually the unit of allocation on the physical disk. Don't get confused between the IO block and the block that stat uses to indicate physical size; the blocks for physical size are always 512 bytes.
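As a quick cross-check, here is a minimal sketch in Python; os.stat exposes the same fields, and the path is simply the example file from the question:

import os

st = os.stat("Email")                          # the example file from the question
print("apparent size (st_size):", st.st_size)
print("512-byte blocks (st_blocks):", st.st_blocks)
print("preferred IO size (st_blksize):", st.st_blksize)
print("bytes allocated on disk:", st.st_blocks * 512)

For the Email file above this prints 965, 8, 4096 and 4096, i.e. exactly one 4096-byte allocation unit.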
Update based on comment:
Like I said, st_blocks is how the OS indicates how much space is used by the file on disk. The actual units of allocation on disk are the choice of the file system. For example, ZFS can have allocation blocks of variable size, even in the same file, because of the way it allocates blocks: files start out having a small block size, and the block size keeps on increasing until it reaches a particular point. If the file is later truncated, it will probably keep the old block size. So based on the history of the file, it can have multiple possible block sizes. So given a file size it is not always obvious why it has a particular physical size.
Concrete example: on my Solaris box, with a ZFS file system, I can create a very short file:
$ echo foo > test
$ stat test
Size: 4 Blocks: 2 IO Block: 512 regular file
(irrelevant details omitted)
OK, small file, 2 blocks, physical disk usage is 1024 for this file.
$ dd if=/dev/zero of=test2 bs=8192 count=4
$ stat test2
Size: 32768 Blocks: 65 IO Block: 32768 regular file
OK, now we see physical disk usage of 32.5K, and an IO block size of 32K. I then copied it to test3 and truncated this test3 file in an editor:
$ cp test2 test3
$ joe -hex test3
$ stat test3
Size: 4 Blocks: 65 IO Block: 32768 regular file
Well now, here's a file with 4 bytes in it - just like test - but it's using 32.5K physically on the disk, because of the way the ZFS file system allocates space. Block sizes increase as the file gets larger, but they don't decrease when the file gets smaller. (And yes, this can lead to substantial wasted space depending on the kinds of files and file operations you do on ZFS, which is why it allows you to set the maximum block size on a per-filesystem basis, and change it dynamically.)
Hopefully, you can now appreciate that there isn't necessarily a simple relationship between file size and physical disk usage. Even in the above it's not clear why 32.5K bytes are needed to store a file that's exactly 32K in size - it appears that ZFS generally needs an extra 512 bytes for extra storage of its own. Perhaps it's using that storage for checksums, reference counts, transaction state - file system bookkeeping. By including these extras in the indicated physical file size, it seems like ZFS is trying not to mislead the user as to the physical costs of the file. That doesn't mean it's trivial to reverse-engineer the calculation without knowing intimate details about the underlying file system implementation.

Related

I just "move" image, and its metadata changes...

I simply copied my image and saved it to another temp folder in the current directory. Nothing was modified, but the image is taking up way more "disk space" than its "byte size".
And when I did so, I lost most of my image's metadata, such as location data, device model, F-number and others; only color space, alpha channel and dimensions are preserved.
Here is the code I use:
from PIL import Image
import os

image_path = "/Users/moomoochen/Desktop/XXXXX.jpg"
img = Image.open(image_path)
pathname, filename = os.path.split(image_path)
new_pathname = pathname + "/temp"
if not os.path.exists(new_pathname):
    os.makedirs(new_pathname)
img.save(os.path.join(new_pathname, filename))
# If I want to adjust the quality, I do this:
img.save(os.path.join(new_pathname, filename), quality=80)
So my questions are:
1) Why is the byte size different from the disk space?
2) How can I adjust my code so that it preserves all of the image's metadata?
Two things...
You are not actually "simply copying" your file. You are opening it in an image processor, expanding it out to a processable matrix of pixels and then resaving the image to disk - minus anything that your image processor wasn't interested in :-)
If you want to copy the complete file including EXIF data, use shutil like this:
#!/usr/local/bin/python3
from shutil import copyfile
copyfile('source.jpg', 'destination.jpg')
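If you actually need to re-encode with Pillow (for example to change the quality) rather than just copy the file, you can hand the original EXIF blob back to save(). A sketch, assuming a JPEG input whose EXIF bytes Pillow exposes under img.info['exif']:

from PIL import Image

img = Image.open("source.jpg")
exif_blob = img.info.get("exif")        # raw EXIF bytes, if the file has any
save_kwargs = {"quality": 80}
if exif_blob:
    save_kwargs["exif"] = exif_blob     # pass the original EXIF back to the encoder
img.save("destination.jpg", **save_kwargs)

Note that this still re-encodes the pixel data and only carries the EXIF block over, so shutil.copyfile remains the right tool when nothing needs to change.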
Secondly, all "on-disk" filesystems have a minimum allocation unit, which means that if your file grows, it grows by a whole unit, even if you just need 1 more byte of space. Most disks use a 4,096-byte allocation unit, so a 33-byte file will take up 4,096 bytes of space. I must say yours shows rather more than 4k of slack, so maybe you are running on a fat RAID that works in big blocks to increase performance?
As an example, you can do this in Terminal:
# Write a file with 1 logical byte
echo -n "1" > file
# Look at file on disk
ls -ls file
8 -rw-r--r-- 1 mark staff 1 15 Nov 08:10 file
Look carefully: the 1 after staff means the logical size is 1 byte, and that's what programs get if they read the whole file. But the first 8 on the left is the number of blocks on disk. Each block is 512 bytes, so this 1-byte file takes 8 blocks of 512 bytes, i.e. 4 kB on disk!
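The same check can be done programmatically; a small sketch in Python (the file name is arbitrary):

import os

with open("file", "w") as f:       # 1 logical byte, as in the echo example above
    f.write("1")
    f.flush()
    os.fsync(f.fileno())           # make sure the block is really allocated on disk

st = os.stat("file")
print(st.st_size)                  # 1: the logical (apparent) size
print(st.st_blocks)                # e.g. 8: 512-byte blocks allocated
print(st.st_blocks * 512)          # e.g. 4096: bytes actually reserved on disk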

Linux `top` command: how much process memory is physically stored in swap space?

Let's say I run my program on a 64-bit Linux machine with 64 GB of RAM. In my very small C program, immediately after the start, I do
void *p = sbrk(1024ull * 1024 * 1024 * 120);
thus moving my data segment break forward by 120 GB.
After the above sbrk call, the top entry for my process shows RES at some low value, VIRT at 120g, and SWAP at 120g.
After this operation I write something into the first 90 GB of the above region:
memset(p, 0xAB, 1024ull * 1024 * 1024 * 90);
This causes some changes in the top entry for my process: VIRT expectedly remains at 120g, RES becomes almost 64g, and SWAP drops to around 56g.
The overall swap statistics in the header of the top output show that swap file usage increases, which is expected, since my program has to push about 26 GB of memory pages into the swap file.
So, according to the above observations, the SWAP column simply reports my process's non-resident address space, regardless of whether that address space has been "materialized", i.e. regardless of whether I have already written something into that region of virtual memory.
But is there any way to figure out how much of that SWAP size has actually been "materialized" and backed by something stored in the swap file? I.e., is there any way to make top display that 26 GB value for my process?
The behavior depends on the version of procps you are using. For instance, in version 3.0.5 the SWAP value equals:
task->size - task->resident
and that is exactly what you are encountering. The top(1) man page says:
VIRT = SWAP + RES
procps-ng, however, reads /proc/<pid>/status and sets SWAP correctly:
https://gitlab.com/procps-ng/procps/blob/master/proc/readproc.c#L383
So you can either update procps or look at /proc/<pid>/status directly.
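For example, here is a minimal sketch in Python that reads the VmSwap line from /proc/<pid>/status (available on reasonably recent kernels); this is the amount of the process's memory that actually lives in swap, i.e. the 26 GB figure from the question:

import sys

def vmswap_kb(pid):
    # Amount of the process's memory that is actually swapped out, in kB,
    # taken from the VmSwap line of /proc/<pid>/status (0 if the line is absent).
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmSwap:"):
                return int(line.split()[1])   # value is reported in kB
    return 0

if __name__ == "__main__":
    print("VmSwap: %d kB" % vmswap_kb(int(sys.argv[1])))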

Determine the number of "logical" bytes read/written in a Linux system

I would like to determine the number of bytes logically read/written by all processes via syscalls such as read() and write(). This is different from the number of bytes actually fetched from the storage layer (displayed by tools like iotop), since it includes (for example) reads that hit the page cache, and it also differs in when writes are recognized: the logical write IO happens immediately when the write call is issued, while the actual physical IO may occur some time later, depending on various factors (Linux usually buffers writes and does the physical IO some time later).
I know how to do it on a per-process basis (see this question for example), but not how to get the system-wide count.
If you want to use the /proc filesystem for total counts (rather than per-second counts), it is quite easy.
This also works on quite old kernels (tested on a Debian Squeeze 2.6.32 kernel).
# cat /proc/1979/io
rchar: 111195372883082
wchar: 10424431162257
syscr: 130902776102
syscw: 6236420365
read_bytes: 2839822376960
write_bytes: 803408183296
cancelled_write_bytes: 374812672
For a system-wide total, just sum the numbers from all processes. This is only good enough in the short term, however, because as processes die their statistics are removed from memory; you would need process accounting enabled to preserve them.
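A rough sketch of that summation in Python; it simply walks /proc and adds up the counters (run it as root, otherwise only your own processes are readable):

import os

totals = {"rchar": 0, "wchar": 0, "read_bytes": 0, "write_bytes": 0}

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/io" % pid) as f:
            for line in f:
                key, _, value = line.partition(":")
                if key in totals:
                    totals[key] += int(value)
    except OSError:
        continue   # the process exited mid-scan, or we lack permission

for key, value in totals.items():
    print("%s: %d" % (key, value))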
The meaning of these fields is documented in the kernel source file Documentation/filesystems/proc.txt:
rchar - I/O counter: chars read
The number of bytes which this task has caused
to be read from storage. This is simply the sum of bytes which this
process passed to read() and pread(). It includes things like tty IO
and it is unaffected by whether or not actual physical disk IO was
required (the read might have been satisfied from pagecache)
wchar - I/O counter: chars written
The number of bytes which this task has
caused, or shall cause to be written to disk. Similar caveats apply
here as with rchar.
syscr - I/O counter: read syscalls
Attempt to count the number of read I/O
operations, i.e. syscalls like read() and pread().
syscw - I/O counter: write syscalls
Attempt to count the number of write I/O
operations, i.e. syscalls like write() and pwrite().
read_bytes - I/O counter: bytes read
Attempt to count the number of bytes which
this process really did cause to be fetched from the storage layer.
Done at the submit_bio() level, so it is accurate for block-backed
filesystems.
write_bytes - I/O counter: bytes written
Attempt to count the number of bytes which
this process caused to be sent to the storage layer. This is done at
page-dirtying time.
cancelled_write_bytes
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it
will have been accounted as having caused 1MB of write. In other
words: The number of bytes which this process caused to not happen, by
truncating pagecache. A task can cause "negative" IO too.
Here is a SystemTap script that tracks the logical IO. It is based on the script at https://sourceware.org/systemtap/SystemTap_Beginners_Guide/traceiosect.html
#! /usr/bin/env stap
# traceio.stp
# Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com>
# Copyright (C) 2009 Kai Meyer <kai@unixlords.com>
#   Fixed a bug that allows this to run longer
#   And added the humanreadable function
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
#
global reads, writes

probe vfs.read.return {
    if ($return > 0) {
        reads += $return
    }
}

probe vfs.write.return {
    if ($return > 0) {
        writes += $return
    }
}

function humanreadable(bytes) {
    if (bytes > 1024*1024*1024) {
        return sprintf("%d GiB", bytes/1024/1024/1024)
    } else if (bytes > 1024*1024) {
        return sprintf("%d MiB", bytes/1024/1024)
    } else if (bytes > 1024) {
        return sprintf("%d KiB", bytes/1024)
    } else {
        return sprintf("%d B", bytes)
    }
}

probe timer.s(1) {
    printf("reads: %12s writes: %12s\n", humanreadable(reads), humanreadable(writes))
    # Note: reads and writes are not zeroed out here,
    # so the values are cumulative since the script started.
}
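If you save this as traceio.stp, it can usually be run with sudo stap traceio.stp; SystemTap itself and the debug information for the running kernel need to be installed first.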

du and size report different values for object files

While compiling a project I notice that du and size command outputs don't add up:
> du -sh X.o
490K X.o
> size X.o
text data bss dec hex filename
2128 0 12 2140 85c X.o
Why is the disk space taken by the object file different from the sum of the text data and bss segments of the file? What am I missing here?
The size command shows you how much the code and data will take up during execution. The object file consists of much more than that of course.
It begins with the overhead of the file format itself, which would have to contain at least the information that size uses to find out how big each part will be in memory. Then there are symbol tables, debugging information and who knows what else (it depends on the compiler and object file format).
You can get more comprehensive information with objdump -h (or objdump -x to see just how many relocation records there are) which still doesn't cover the overhead, but shows how much actual content there is.
du displays the space the file occupies on the file system, whereas the actual size is measured in bytes.
The reason du reports a larger figure is that file systems allocate space in blocks, and files rarely fit into those blocks exactly, which causes the difference. For example, if a file is 4096 bytes, its byte size is 4096, the same as du reports; but when the file is 5000 bytes, its byte size is 5000 bytes while du reports 8192.
This is referred to as slack space.
Note: the above assumes a file system allocation unit of 4096 bytes.
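The rounding itself is easy to reproduce; a small sketch in Python, assuming the 4096-byte allocation unit mentioned in the note above:

ALLOC_UNIT = 4096   # allocation unit assumed above; yours may differ

def on_disk_size(apparent_size, unit=ALLOC_UNIT):
    # Round the apparent size up to a whole number of allocation units.
    return -(-apparent_size // unit) * unit   # ceiling division

print(on_disk_size(4096))   # 4096
print(on_disk_size(5000))   # 8192

# For a small, non-sparse file this usually matches os.stat(path).st_blocks * 512;
# large files may carry extra file-system metadata blocks, so the two can diverge.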

why is the output of `du` often so different from `du -b`

Why is the output of du often so different from du -b? -b is shorthand for --apparent-size --block-size=1. Using only --apparent-size gives me the same result most of the time, but --block-size=1 seems to do the trick. I wonder whether the output is even correct then, and which numbers are the ones I want (i.e. the actual file size, if copied to another storage device)?
Apparent size is the number of bytes your applications think are in the file. It's the amount of data that would be transferred over the network (not counting protocol headers) if you decided to send the file over FTP or HTTP. It's also the result of cat theFile | wc -c, and the amount of address space that the file would take up if you loaded the whole thing using mmap.
Disk usage is the amount of space that can't be used for something else because your file is occupying that space.
In most cases, the apparent size is smaller than the disk usage because the disk usage counts the full size of the last (partial) block of the file, and apparent size only counts the data that's in that last block. However, apparent size is larger when you have a sparse file (sparse files are created when you seek somewhere past the end of the file, and then write something there -- the OS doesn't bother to create lots of blocks filled with zeros -- it only creates a block for the part of the file you decided to write to).
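A quick way to see the sparse-file case from Python; a sketch, assuming the file system supports sparse files (the name and offset are arbitrary):

import os

# Create a sparse file: seek 1 MiB past the start, then write a single byte.
with open("sparse_demo", "wb") as f:
    f.seek(1024 * 1024)
    f.write(b"x")
    f.flush()
    os.fsync(f.fileno())

st = os.stat("sparse_demo")
print("apparent size:", st.st_size)          # 1048577 bytes
print("disk usage:   ", st.st_blocks * 512)  # typically just one block's worth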
Minimal block granularity example
Let's play a bit to see what is going on.
mount tells me I'm on an ext4 partition mounted at /.
I find its block size with:
stat -fc %s .
which gives:
4096
Now let's create some files with sizes 1 4095 4096 4097:
#!/usr/bin/env bash
for size in 1 4095 4096 4097; do
    dd if=/dev/zero of=f bs=1 count="${size}" status=none
    echo "size ${size}"
    echo "real $(du --block-size=1 f)"
    echo "apparent $(du --block-size=1 --apparent-size f)"
    echo
done
and the results are:
size 1
real 4096 f
apparent 1 f
size 4095
real 4096 f
apparent 4095 f
size 4096
real 4096 f
apparent 4096 f
size 4097
real 8192 f
apparent 4097 f
So we see that anything below or equal to 4096 takes up 4096 bytes in fact.
Then, as soon as we cross 4097, it goes up to 8192 which is 2 * 4096.
It is clear then that the disk always stores data at a block boundary of 4096 bytes.
What happens to sparse files?
I haven't investigated what the exact representation is, but it is clear that --apparent-size does take it into consideration.
This can lead to apparent sizes being larger than actual disk usage.
For example:
dd seek=1G if=/dev/zero of=f bs=1 count=1 status=none
du --block-size=1 f
du --block-size=1 --apparent f
gives:
8192 f
1073741825 f
Related: How to test if sparse file is supported
What to do if I want to store a bunch of small files?
Some possibilities are:
use a database instead of filesystem: Database vs File system storage
use a filesystem that supports block suballocation
Bibliography:
https://serverfault.com/questions/565966/which-block-sizes-for-millions-of-small-files
https://askubuntu.com/questions/641900/how-file-system-block-size-works
Tested in Ubuntu 16.04.
Compare (for example) du -bm to du -m.
The -b sets --apparent-size --block-size=1,
but then the m overrides the block-size to be 1M.
Similar for -bh versus -h:
the -bh means --apparent-size --block-size=1 --human-readable, and again the h overrides that block-size.
Files and folders have their real size and their size on disk.
--apparent-size reports a file or folder's real size, while the size on disk is the number of bytes the file or folder actually takes up on disk. The same applies when using plain du.
If you find that the apparent size is almost always several magnitudes higher than the disk usage, it means that you have a lot of (`sparse') files, or files with internal fragmentation or indirect blocks.
Because by default du gives disk usage, which is the same as or larger than the file size. As stated under --apparent-size:
print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be
larger due to holes in (`sparse') files, internal fragmentation, indirect blocks, and the like
