Get directory size with xen image - linux

I want to check the size of my directory. The directory contains a Xen domU image and is named xendisk.
du -sh ./xendisk
returns 5.4G, but the Xen domU image size is 10G. ls -alh and du -sh report different sizes for the image. What happened?

You have created a sparse file for your image. If you used a command like truncate -s 10G domU.img to create the image, then this would be the result.
The wiki article I have linked has more information, but basically a sparse file is one where the empty parts of the file take no space. This is useful when dealing with VMs because in most cases your VM will only use a fraction of the space available to it, so a sparse file takes far less space on your filesystem (as you have observed). The article states that this is achieved using the following mechanism:
When reading sparse files, the file system transparently converts metadata representing empty blocks into "real" blocks filled with zero bytes at runtime. The application is unaware of this conversion.
If you need to check the size with du, you may be interested in the --apparent-size option, which counts the file's full logical size (including the holes) rather than only the allocated blocks. Therefore you could use this command if you need the output to match what ls is telling you:
du -sh --apparent-size ./xendisk
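If you want to see the two numbers from a program rather than from du, here is a minimal C sketch using stat(2); the program name and the image path below are just examples, not anything taken from your setup. st_size is the apparent size that ls -l reports, while st_blocks is always counted in 512-byte units and reflects what the file actually occupies on disk, which is what plain du sums:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* Apparent size: what ls -l and du --apparent-size report. */
    printf("apparent size: %lld bytes\n", (long long)st.st_size);

    /* Allocated size: st_blocks is in 512-byte units regardless of the
     * filesystem block size; this matches what plain du adds up. */
    printf("allocated:     %lld bytes\n", (long long)st.st_blocks * 512LL);

    return 0;
}

Run it against the domU image (for example ./sizes ./xendisk/domU.img, if you call the binary sizes) and you should see roughly 10G apparent versus 5.4G allocated.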

Related

How does du estimate file size?

I am downloading a large file with wget, which I ran in the background with wget -bqc. I wanted to see how much of the file was downloaded so I ran
du -sh *
in the directory. (I'd also be interested to know a better way to check wget progress in this case if anyone knows...) I saw that 25 GB had been downloaded, but for several attempts afterwards it showed the same result of 25 GB. I became worried that du had somehow interfered with the download until some time later when du showed a result of 33 GB and subsequently 40 GB.
In searching Stack Overflow and online, I didn't find whether it is safe to use du on files that are being written to, but I did see that it is only an estimate that can be somewhat off. However, 7-8 GB seems like a lot, particularly because it is a single file, and not a directory tree, which it seems is what causes errors in the estimate. I'd be interested to know how it makes this estimate for a single file that is being written and why I would see this result.
The operating system has to guarantee safe access.
du does not estimate anything. The kernel knows the size of the file, and when du asks for it, that's what it learns.
If the file is in the range of gigabytes and the reported size only has that granularity, it should not be a surprise that consecutive invocations show the same size - do you expect wget to fetch enough data to flip to another gigabyte between your checks? You can try running du without the -h option to get a more precise reading.
Also, wget will hold some amount of data in RAM, but that should be negligible.
du doesn't estimate, it sums up. But it has access to some filesystem-internal information which might make its output surprising. The various aspects should be looked up separately, as they are a bit too much to explain here in detail.
Sparse files may make a file look bigger than it is on disk.
Hard links may make a directory tree look bigger than it is on disk.
Block sizes may make a file look smaller than it is on disk.
du will always print out the size that a directory tree (or several) actually occupies on disk. Due to various facts (the three most common are given above), this can differ from the size of the information stored in these trees.
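To make the hard-link point above concrete: a du-like tool charges each inode only once, typically by remembering which (device, inode) pairs it has already seen and skipping files with a link count greater than one if they have been counted before. This is not du's actual implementation, only a minimal C illustration of the idea; the fixed-size array stands in for a real hash table:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MAX_SEEN 1024

/* (device, inode) pairs we have already charged. */
static struct { dev_t dev; ino_t ino; } seen[MAX_SEEN];
static int nseen;

static int already_counted(const struct stat *st)
{
    for (int i = 0; i < nseen; i++)
        if (seen[i].dev == st->st_dev && seen[i].ino == st->st_ino)
            return 1;
    if (nseen < MAX_SEEN) {
        seen[nseen].dev = st->st_dev;
        seen[nseen].ino = st->st_ino;
        nseen++;
    }
    return 0;
}

int main(int argc, char **argv)
{
    long long blocks = 0;

    for (int i = 1; i < argc; i++) {
        struct stat st;
        if (lstat(argv[i], &st) != 0)
            continue;
        /* A file with st_nlink > 1 may show up under several names;
         * count its blocks only the first time we meet its inode. */
        if (st.st_nlink > 1 && already_counted(&st))
            continue;
        blocks += (long long)st.st_blocks;
    }

    printf("%lld bytes on disk\n", blocks * 512LL);
    return 0;
}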

in linux, how to (quickly) get a list of all files in a directory - with their filesize

I need to get a list of all the files, along with their sizes, in Linux.
The filesystem is ext4, running over USB on a machine with very little RAM.
The functions I'm using are these - is there a better technique?
a) opendir()
b) readdir()
c) stat()
I believe I'm getting hit pretty hard by the stat() call; I don't have much RAM and the HD is USB-connected.
Is there a way to say
"give me all the files in the directory, along with the file sizes"? My guess is that I'm getting hit because stat() needs to go and query the inode for the size, leading to lots of seeks.
No, not really. If you don't want to hit the disk, you would need to have the inodes cached in memory. That's the tradeoff in this case.
You could try tuning inode_readahead_blks and vfs_cache_pressure though.
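At the system-call level there is also a small optimization worth knowing about: instead of building a full path string for every entry and calling stat() on it, you can pass the directory's file descriptor to fstatat(), so the kernel does not have to re-resolve the path for each file. The per-inode lookup that actually hits the disk still happens, so this is a modest win, not a cure. A hedged sketch:

#include <stdio.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";

    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    /* fstatat() with the directory's own fd avoids re-resolving the
     * whole path for every entry. */
    int dfd = dirfd(dir);

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        struct stat st;
        if (fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;
        printf("%12lld  %s\n", (long long)st.st_size, de->d_name);
    }

    closedir(dir);
    return 0;
}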
Use cd to get to the directory whose file and directory sizes you want, then type:
du -sh *
This will give you all the sizes of the files and directories.
The -s option gives a total for each argument, and -h makes the output human-readable. The * means that it will show all the contents of that directory.
Type man du to find out more about the du command (q or Ctrl-c to exit)

how to get size of folder including apparent size of sparse files? (du is too slow)

I have a folder containing a lot of KVM qcow2 files; they are all sparse files.
Now I need to get the total size of the folder; the qcow2 file size should be counted as the apparent size (not the real size).
for example:
image: c9f38caf104b4d338cc1bbdd640dca89.qcow2
file format: qcow2
virtual size: 100G (107374182400 bytes)
disk size: 3.3M
cluster_size: 65536
the image should be treated as 100G but not 3.3M
Originally I used statvfs(), but it can only return the real size of the folder. Then I switched to du --apparent-size, but it's too slow given that I have 10000+ files, and it takes almost 5 minutes to calculate.
Does anybody know a fast way to get the size of the folder, counting each qcow2's virtual size? Thank you.
There is no way to find out this information without stat()ing every file in the directory. It is slow if you have this many files in a single directory. stat() needs to retrieve the inode of every single file.
Adding more memory might help due to caching.
You could use something like this:
find images/ -name "*.qcow2" -exec qemu-img info {} \; | grep virtual | cut -d"(" -f2 | awk '{ SUM += $1} END { print SUM }'
Modern Unix-ish OSes provide a way to retrieve the stats of all entries of a directory in one step. This also needs to look at all inodes, but it can probably be optimized inside the filesystem driver itself and thus might be faster.
Apparently you are not looking for a way to do this using system calls from C, so I guess a feasible approach could be to use Python. There you have access to this feature through the os.scandir() function.
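One caution: summing apparent file sizes (which is what du --apparent-size and any stat-based approach give you) is not the same thing as summing qcow2 virtual sizes; for the virtual size you still need something like the qemu-img one-liner above. For completeness, if you do stay at the system-call level, statx(2) (Linux 4.11+, glibc 2.28+) lets you ask for only the size field, which trims some per-file work even though the inode generally still has to be read. A hedged C sketch that sums the apparent sizes of a directory's entries:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";

    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    int dfd = dirfd(dir);
    long long total = 0;
    struct dirent *de;

    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        /* Ask the kernel to fill in only the size; the other statx
         * fields are not requested, which avoids some work. */
        struct statx stx;
        if (statx(dfd, de->d_name, AT_SYMLINK_NOFOLLOW,
                  STATX_SIZE, &stx) == 0)
            total += (long long)stx.stx_size;
    }

    closedir(dir);
    printf("apparent total: %lld bytes\n", total);
    return 0;
}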

How to make file sparse?

If I have a big file containing many zeros, how can I efficiently make it a sparse file?
Is the only possibility to read the whole file (including all zeroes, which may partially be stored sparsely already) and to rewrite it to a new file, using seek to skip the zero areas?
Or is there a possibility to make this in an existing file (e.g. File.setSparse(long start, long end))?
I'm looking for a solution in Java or some Linux commands, Filesystem will be ext3 or similar.
A lot's changed in 8 years.
Fallocate
fallocate -d filename can be used to punch holes in existing files. From the fallocate(1) man page:
-d, --dig-holes
Detect and dig holes. This makes the file sparse in-place,
without using extra disk space. The minimum size of the hole
depends on filesystem I/O block size (usually 4096 bytes).
Also, when using this option, --keep-size is implied. If no
range is specified by --offset and --length, then the entire
file is analyzed for holes.
You can think of this option as doing a "cp --sparse" and then
renaming the destination file to the original, without the
need for extra disk space.
See --punch-hole for a list of supported filesystems.
(That list:)
Supported for XFS (since Linux 2.6.38), ext4 (since Linux
3.0), Btrfs (since Linux 3.7) and tmpfs (since Linux 3.5).
tmpfs being on that list is the one I find most interesting. The filesystem itself is efficient enough to only consume as much RAM as it needs to store its contents, but making the contents sparse can potentially increase that efficiency even further.
GNU cp
Additionally, somewhere along the way GNU cp gained an understanding of sparse files. Quoting the cp(1) man page regarding its default mode, --sparse=auto:
sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
But there's also --sparse=always, which activates the file-copy equivalent of what fallocate -d does in-place:
Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
I've finally been able to retire my tar cpSf - SOURCE | (cd DESTDIR && tar xpSf -) one-liner, which for 20 years was my graybeard way of copying sparse files with their sparseness preserved.
Some filesystems on Linux / UNIX have the ability to "punch holes" into an existing file. See:
LKML posting about the feature
UNIX file truncation FAQ (search for F_FREESP)
It's not very portable and not done the same way across the board; as of right now, I believe Java's IO libraries do not provide an interface for this.
If hole punching is available either via fcntl(F_FREESP) or via any other mechanism, it should be significantly faster than a copy/seek loop.
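On Linux the hole-punching call is fallocate(2) with FALLOC_FL_PUNCH_HOLE, which must be combined with FALLOC_FL_KEEP_SIZE so the file length stays the same. A minimal C sketch, assuming you have already found the offset and length of a region that contains only zeros (detecting those regions is a separate step):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s FILE OFFSET LENGTH\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t offset = atoll(argv[2]);
    off_t length = atoll(argv[3]);

    /* Deallocate the blocks backing this range; reads of the range
     * still return zeros afterwards. Needs a filesystem that supports
     * hole punching (ext4, XFS, Btrfs, tmpfs, ...). */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, length) != 0) {
        perror("fallocate");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}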
I think you would be better off pre-allocating the whole file and maintaining a table/BitSet of the pages/sections which are occupied.
Making a file sparse would result in those sections being fragmented if they were ever re-used. Perhaps saving a few TB of disk space is not worth the performance hit of a highly fragmented file.
You can use truncate -s <size> <filename> in a Linux terminal to create a sparse file that has only metadata.
Note: without a unit suffix (K, M, G, ...), the size is interpreted as bytes.
According to this article, it seems there is currently no easy solution, except for using the FIEMAP ioctl. However, I don't know how you can turn "non-sparse" zero blocks into "sparse" ones.
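On kernels from Linux 3.1 on, a simpler alternative to FIEMAP for seeing which parts of a file are already holes is lseek(2) with SEEK_DATA and SEEK_HOLE. A hedged sketch that only prints the allocated (data) extents of a file; turning the zero-filled parts of those extents into holes would then be a job for the punching mechanisms described above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;

    /* Everything between a SEEK_DATA offset and the following
     * SEEK_HOLE offset is allocated; the rest of the file is holes. */
    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)          /* only a trailing hole remains */
            break;
        off_t hole = lseek(fd, data, SEEK_HOLE);
        printf("data: %lld .. %lld\n", (long long)data, (long long)hole);
        pos = hole;
    }

    close(fd);
    return 0;
}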

When running ls -l, why does the filesize on a directory not match the output of du?

What does 4096 mean in the output of ls -l?
[root@file nutch-0.9]# du -csh resume.new/
2.3G resume.new/
[root@file nutch-0.9]# ls -l
total 55132
drwxr-xr-x 7 root root 4096 Jun 18 03:19 resume.new
That the directory takes up 4096 bytes of disk space (not including its contents).
I have been wondering about it too. So, after searching I came across:
"It's the size necessary to store the
meta-data about files (including the
file names contained in that
directory). The number of files /
sub-directories at a given time might
not map directly to the size reported,
because once allocated, space is not
freed if the number of files changes.
This behaviour makes sense for most
use cases (where disk space is cheap,
and once a directory has a lot of
files in it, it will probably have
them again in future), and helps to
reduce fragmentation."
Reference: http://www.linuxquestions.org/questions/showthread.php?p=2978839#post2978839
Directories are just like files containing <name, inode> tuples, and they are specially treated by the filesystem. The size reported by ls is the size of this "file". Check this answer on Server Fault for an overview of how directories work under the hood.
So, the 4096 bytes mean, most likely, that the filesystem block size is 4096 and that directory is currently using a single block to store this table of names and inodes.
4096, in your example, is the number of bytes used by the directory itself. In other words, this is the space required to store the list of items contained in the directory. It is not, as the question title suggests, the sum of the space of all of the items stored in the directory.
You don't say what system you're using, but in many UNIX/Linux file systems, the minimum unit of storage allocation is 4K, which is why the size is showing as 4096. The directory entries for two items, plus "." and "..", should take considerably less space.
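If you want to confirm this from a program, a small C sketch: stat(2) on the directory itself reports st_size, which is only the size of the directory's entry table, and st_blksize shows the filesystem block size that makes 4096 the usual minimum:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2) {
        fprintf(stderr, "usage: %s DIRECTORY\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* st_size of a directory covers only its table of names and inode
     * numbers, not the contents of the files listed in it. */
    printf("directory entry table: %lld bytes\n", (long long)st.st_size);
    printf("filesystem block size: %lld bytes\n", (long long)st.st_blksize);
    return 0;
}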
