What size does ls -la show for folders? [duplicate] - linux

Possible Duplicate:
When running ls -l, why does the filesize on a directory not match the output of du?
Hi.
I'm interested in what the output of ls -la on Linux is showing me. By default the size of a directory is 4K, but if it contains a lot of files (even zero-size files, such as PHP sessions =) ), the size is no longer 4K.
What is ls -la showing me?
And afterwards, when I clean out the folder, I still see the last, maximum size.

ls -al will give you the space taken up by the directory itself, not the files within it.
As such, it has a minimum size. When a directory is created, it's given this much space to store file information, which is a set number of bytes per file (let's say 64 bytes, though the number could be different).
If the initial size was 4K, that would allow up to 64 files. Once you put more than 64 files into the directory, it would have to be expanded.
As for your comment:
The reason it may not get smaller when you delete all the files in it is that there's usually no real advantage. It's just left at the same size so that it doesn't have to be expanded again the next time you put a bucketload of files in there (it tends to assume that past behaviour is an indicator of future behaviour).
If you want to reduce the space taken, there's an old trick for doing that. To reduce the size of /tmp/qq, create a brand new /tmp/qq2, copy all the files across (after deleting those you don't need), then simply rename /tmp/qq to /tmp/qq3 and /tmp/qq2 to /tmp/qq. Voila! Oh yeah, eventually delete /tmp/qq3, the old oversized directory.
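A hedged sketch of that sequence, using the /tmp/qq paths from the answer above:
# copy the surviving files into a fresh directory, then swap the names
mkdir /tmp/qq2
cp -a /tmp/qq/. /tmp/qq2/     # keep ownership, permissions and hidden files
mv /tmp/qq /tmp/qq3
mv /tmp/qq2 /tmp/qq
rm -r /tmp/qq3                # finally drop the old, oversized directory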

Related

How does du estimate file size?

I am downloading a large file with wget, which I ran in the background with wget -bqc. I wanted to see how much of the file was downloaded so I ran
du -sh *
in the directory. (I'd also be interested to know a better way to check wget progress in this case if anyone knows...) I saw that 25 GB had been downloaded, but for several attempts afterwards it showed the same result of 25 GB. I became worried that du had somehow interfered with the download until some time later when du showed a result of 33 GB and subsequently 40 GB.
In searching stackoverflow and online, I didn't find whether it is safe to use du on files being written to but I did see that it is only an estimate that can be somewhat off. However, 7-8 GB seems like a lot, particularly because it is a single file, and not a directory tree, which it seems is what causes errors in the estimate. I'd be interested to know how it makes this estimate for a single file that is being written and why I would see this result.
The operating system has to guarantee safe access.
du does not estimate anything. The kernel knows the size of the file, and when du asks for it, that's what it learns.
If the file is in the range of gigabytes and the reported size only has that granularity, it should not be a surprise that consecutive invocations show the same size - do you expect wget to fetch enough data to tick over to the next gigabyte between your checks? You can try running du without the -h flag in order to get a more accurate reading.
Also, wget will hold some amount of data in RAM, but that should be negligible.
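For example (a hedged sketch; "bigfile.iso" is a stand-in for the partially downloaded file):
du -sh bigfile.iso                      # human-readable, rounded, e.g. "25G"
du -s --block-size=1 bigfile.iso        # allocated size, in exact bytes
du -s --apparent-size -B1 bigfile.iso   # logical file length, in exact bytes
ls -l bigfile.iso                       # also shows the logical length in bytes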
du doesn't estimate; it sums up. But it has access to some filesystem-internal information which might make its output surprising. The various aspects should be looked up separately, as they are a bit too much to explain here in detail.
Sparse files may make a file look bigger than it is on disk.
Hard links may make a directory tree look bigger than it is on disk.
Block sizes may make a file look smaller than it is on disk.
du will always print out the size that a directory tree (or several) actually occupies on disk. Due to various factors (the three most common are given above), this can differ from the size of the information stored in these trees.
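A quick, hedged illustration of the sparse-file case (assuming GNU coreutils; the file name is illustrative):
# create a 1 GiB file with no allocated data blocks, then compare the two views
truncate -s 1G sparse.img
du -h sparse.img                    # allocated size: close to 0
du -h --apparent-size sparse.img    # logical size: 1.0G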

Prepend to Very Large File in Fixed Time or Very Fast [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a file that is very large (>500GB) that I want to prepend with a relatively small header (<20KB). Using commands such as:
cat header bigfile > tmp
mv tmp bigfile
or similar approaches (e.g., with sed) is very slow.
What is the fastest method of writing a header to the beginning of an existing large file? I am looking for a solution that can run under CentOS 7.2. It is okay to install packages from CentOS install or updates repo, EPEL, or RPMForge.
It would be great if some method exists that doesn't involve relocating or copying the large amount of data in the bigfile. That is, I'm hoping for a solution that can operate in fixed time for a given header file regardless of the size of the bigfile. If that is too much to ask for, then I'm just asking for the fastest method.
Compiling a helper tool (as in C/C++) or using a scripting language is perfectly acceptable.
Is this something that needs to be done once, to "fix" a design oversight perhaps? Or is it something that you need to do on a regular basis, for instance to add summary data (for instance, the number of data records) to the beginning of the file?
If you need to do it just once, then your best option is to accept that a mistake has been made and take the consequences of the retro-fix. As long as you make your destination drive different from the source drive, you should be able to fix up a 500GB file within about two hours. So after a week of batch processes running after hours, you could have upgraded perhaps thirty or forty files.
If this is a standard requirement for all such files, and you think you can apply the change only when the file is complete -- some sort of summary information perhaps -- then you should reserve the space at the beginning of each file and leave it empty. Then it is a simple matter of seeking into the header region and overwriting it with the real data once it can be supplied.
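If the space really was reserved up front, dropping the real header in later is a one-liner; a hedged sketch, using the file names from the question and assuming the reserved region is at least as large as the header:
# overwrite the reserved region in place; conv=notrunc writes the header's bytes
# at the start of bigfile without truncating or moving the rest of the file
dd if=header of=bigfile conv=notrunc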
As has been explained, standard file systems require the whole of a file to be copied in order to add something at the beginning.
If your 500GB file is on a standard hard disk, which will allow data to be read at around 100MB per second, then reading the whole file will take about 5,120 seconds, or roughly 1 hour 30 minutes.
As long as you arrange for the destination to be a separate drive from the source, you can mostly write the new file in parallel with the read, so it shouldn't take much longer than that. But there's no way to speed it up other than that, I'm afraid.
If you were not bound to CentOS 7.2, your problem could be solved (with some reservations[1]) by fallocate, which provides the needed functionality for the ext4 filesystem starting from Linux 4.2 and for the XFS filesystem since Linux 4.1:
int fallocate(int fd, int mode, off_t offset, off_t len);
This is a nonportable, Linux-specific system call. For the portable, POSIX.1-specified method of ensuring that space is allocated for a file, see posix_fallocate(3).

fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes.

The mode argument determines the operation to be performed on the given range. Details of the supported operations are given in the subsections below.

...

Increasing file space

Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux 4.1) in mode increases the file space by inserting a hole within the file size without overwriting any existing data. The hole will start at offset and continue for len bytes. When inserting the hole inside the file, the contents of the file starting at offset will be shifted upward (i.e., to a higher file offset) by len bytes. Inserting a hole inside a file increases the file size by len bytes.

...

FALLOC_FL_INSERT_RANGE requires filesystem support. Filesystems that support this operation include XFS (since Linux 4.1) and ext4 (since Linux 4.2).
[1] fallocate allows prepending data to the file only at multiples of the filesystem block size. So it will solve your problem only if it's acceptable for you to pad the extra space with whitespace, comments, etc.
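A hedged sketch of this approach using the util-linux fallocate(1) command-line tool (it needs a kernel and filesystem with FALLOC_FL_INSERT_RANGE support, as described above; "header" and "bigfile" are the question's file names, everything else is an assumption):
blksz=$(stat -fc %S bigfile)                      # fundamental filesystem block size
hdrsz=$(stat -c %s header)                        # actual header length
padsz=$(( (hdrsz + blksz - 1) / blksz * blksz ))  # round up to a block multiple
fallocate --insert-range --offset 0 --length "$padsz" bigfile
dd if=header of=bigfile conv=notrunc              # write the header into the inserted range
# the remaining padsz - hdrsz bytes are a hole and read back as NUL; pad them
# (whitespace, comments, ...) if the file format requires it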
Without support for fallocate() + FALLOC_FL_INSERT_RANGE, the best you can do is:
Increase the file (so that it has its final size);
mmap() the file;
memmove() the data;
Fill in the header data at the beginning.

Is there a Linux filesystem, perhaps fuse, which gives the directory size as the size of its contents and its subdirs? [closed]

If there isn't, how feasible would it be to write one? That is, a filesystem which, for each directory, keeps the size of its contents recursively, and which is kept up to date not by recalculating the size on every change to the filesystem, but, for example, by updating the directory size when a file is removed or grows.
I am not aware of such a filesystem. From the filesystem's point of view, a directory is a file.
You can use:
du -s -h <dir>
to display the total size of all the files in the directory.
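If a periodically refreshed figure is good enough, a crude stand-in (the path is a placeholder) is:
watch -n 60 'du -sh /path/to/dir'    # recompute the tree size every 60 seconds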
From the filesystem's point of view, the size of a directory is the size of the information needed to record its existence, which has to be stored physically on the medium. Note that the "size" of a directory containing files totalling 10GB will actually be the same as the "size" of an empty directory, because the information needed to mark its existence takes the same storage space. That's why the total size of the files (and sockets, links and other things inside) is not the same as the "directory size". Subdirectories can also be mounted from various locations, including remote ones, and even recursively.
In a sense, "directory size" is just a human notion: files are not physically "inside" directories; a directory is simply marked as a container, in exactly the same way that a special file (e.g. a device file) is marked as special. Recounting and updating a total directory size depends more on the NUMBER of items in it than on the sum of their sizes, and a modern filesystem can keep hundreds of thousands of files (if not more) "in" one directory, even without subdirectories, so keeping those sizes summed up could be quite a heavy task compared with the possible benefit of having this information. In short, when you run e.g. "du" (disk usage), or count a directory's size on Windows, doing the same work in the kernel with the filesystem driver would not be any faster - counting is counting.
There are quota systems which keep and update information about the total size of files owned by a particular user or group. They are, however, limited to monitoring partitions separately, since quota may be enabled or not per partition. Moreover, quota usage gets updated, as you said, when a file grows or is removed, which is why the information may become inaccurate - for this reason the quota database is rebuilt from time to time, e.g. with a cron job, by scanning all files in all directories "from scratch" on the partition on which it is enabled.
Also note that the bottleneck for I/O operations (including reading information about files) is usually the speed of the medium itself, then the communication bus, and then the CPU - whereas you seem to be assuming that every filesystem is as fast as a RAM FS. A RAM FS is probably the most trivial filesystem, kept virtually in RAM, which makes I/O operations very fast. You could build one as a module and try to add the functionality you've described; you would learn many interesting things :)
FUSE stands for "Filesystem in Userspace", and filesystems implemented with FUSE are usually quite slow. They make sense when, in a particular case, functionality is more important than speed - e.g. you could create a pseudo-filesystem based on temperature readings from the e-thermometer you just connected to your computer via USB - but they're not speed demons, you know :)

Maximum number of files/directories on Linux?

I'm developing a LAMP online store, which will allow admins to upload multiple images for each item.
My concern is - right off the bat there will be 20000 items meaning roughly 60000 images.
Questions:
What is the maximum number of files and/or directories on Linux?
What is the usual way of handling this situation (best practice)?
My idea was to make a directory for each item, based on its unique ID, but then I'd still have 20000 directories in a main uploads directory, and it will grow indefinitely as old items won't be removed.
Thanks for any help.
ext[234] filesystems have a fixed maximum number of inodes; every file or directory requires one inode. You can see the current count and limits with df -i. For example, on a 15GB ext3 filesystem, created with the default settings:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/xvda 1933312 134815 1798497 7% /
There's no limit on directories in particular beyond this; keep in mind that every file or directory requires at least one filesystem block (typically 4KB), though, even if it's a directory with only a single item in it.
As you can see, though, 80,000 inodes is unlikely to be a problem. And with the dir_index option (which can be enabled with tune2fs), lookups in large directories aren't much of a big deal. However, note that many administrative tools (such as ls or rm) can have a hard time dealing with directories that contain too many files. As such, it's recommended to split your files up so that you don't have more than a few hundred to a thousand items in any given directory. An easy way to do this is to hash whatever ID you're using, and use the first few hex digits as intermediate directories.
For example, say you have item ID 12345, and it hashes to 'DEADBEEF02842.......'. You might store your files under /storage/root/d/e/12345. You've now cut the number of files in each directory to 1/256th of what it would otherwise be.
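A hedged shell sketch of that layout (the /storage/root prefix comes from the example above; the rest is illustrative):
# derive a two-level prefix from the first hex digits of the ID's md5 hash
id=12345
prefix=$(printf '%s' "$id" | md5sum | cut -c1-2)
dir="/storage/root/${prefix:0:1}/${prefix:1:1}/$id"
mkdir -p "$dir"    # then store the item's images inside $dir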
If your server's filesystem has the dir_index feature turned on (see tune2fs(8) for details on checking and turning on the feature) then you can reasonably store upwards of 100,000 files in a directory before the performance degrades. (dir_index has been the default for new filesystems for most of the distributions for several years now, so it would only be an old filesystem that doesn't have the feature on by default.)
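For example, a hedged sketch (/dev/sda1 is a placeholder for the actual block device):
tune2fs -l /dev/sda1 | grep -i features    # check whether dir_index is listed
tune2fs -O dir_index /dev/sda1             # enable it if it is missing
e2fsck -fD /dev/sda1                       # rebuild/optimize directory indexes (unmounted filesystem)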
That said, adding another directory level to reduce the number of files in a directory by a factor of 16 or 256 would drastically improve the chances of things like ls * working without over-running the kernel's maximum argv size.
Typically, this is done by something like:
/a/a1111
/a/a1112
...
/b/b1111
...
/c/c6565
...
i.e., prepending a letter or digit to the path, based on some feature you can compute from the name. (Using the first two characters of the md5sum or sha1sum of the file name is one common approach, but if you have unique object IDs, then 'a' + id % 16 is an easy enough mechanism to determine which directory to use.)
60000 is nothing, and so is 20000. But you should group these 20000 somehow in order to speed up access to them. Maybe in groups of 100 or 1000, by taking the item's number and dividing it by 100, 500, 1000, whatever.
E.g., I have a project where the files have numbers. I group them in 1000s, so I have
id/1/1332
id/3/3256
id/12/12334
id/350/350934
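A hedged sketch of that bucketing scheme (the file names are illustrative):
# place item 350934 into id/350/ by integer division
id=350934
bucket=$(( id / 1000 ))
mkdir -p "id/$bucket"
mv "$id" "id/$bucket/$id"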
You actually might have a hard limit - some systems have 32-bit inodes, so you are limited to 2^32 files per filesystem.
In addition to the general answers (basically "don't bother that much", "tune your filesystem", and "organize your directory with subdirectories containing a few thousand files each"):
If the individual images are small (e.g. less than a few kilobytes), then instead of putting them in a folder you could also put them in a database (e.g. in MySQL as a BLOB) or perhaps inside a GDBM indexed file. Then each small item won't consume an inode and at least one filesystem block (on many filesystems a small file still costs a few kilobytes of allocation). You could also apply a threshold (e.g. put images bigger than 4 kilobytes in individual files, and smaller ones in a database or GDBM file). Of course, don't forget to back up your data (and define a backup strategy).
The year is 2014. I come back in time to add this answer.
Lots of big/small files? You can use Amazon S3 and other alternatives based on Ceph like DreamObjects, where there are no directory limits to worry about.
I hope this helps someone decide from all the alternatives.
md5($id) ==> 0123456789ABCDEF
$file_path = items/012/345/678/9AB/CDE/F.jpg
1 node = 4096 subnodes (fast)

When running ls -l, why does the filesize on a directory not match the output of du?

What does 4096 mean in the output of ls -l?
[root@file nutch-0.9]# du -csh resume.new/
2.3G resume.new/
[root@file nutch-0.9]# ls -l
total 55132
drwxr-xr-x 7 root root 4096 Jun 18 03:19 resume.new
It means that the directory itself takes up 4096 bytes of disk space (not including its contents).
I have been wondering about it too. So, after searching I came across:
"It's the size necessary to store the
meta-data about files (including the
file names contained in that
directory). The number of files /
sub-directories at a given time might
not map directly to the size reported,
because once allocated, space is not
freed if the number of files changes.
This behaviour makes sense for most
use cases (where disk space is cheap,
and once a directory has a lot of
files in it, it will probably have
them again in future), and helps to
reduce fragmentation."
Reference: http://www.linuxquestions.org/questions/showthread.php?p=2978839#post2978839
Directories are just like files containing <name, inode> tuples, which are treated specially by the filesystem. The size reported by ls is the size of this "file". Check this answer on Server Fault for an overview of how directories work under the hood.
So the 4096 bytes most likely mean that the filesystem block size is 4096 and that the directory is currently using a single block to store this table of names and inodes.
4096, in your example, is the number of bytes used by the directory itself. In other words, this is the space required to store the list of items contained in the directory. It is not, as the question title suggests, the sum of the space of all of the items stored in the directory.
You don't say what system you're using, but in many UNIX/Linux file systems, the minimum unit of storage allocation is 4K, which is why the size is showing as 4096. The directory entries for two items, plus "." and "..", should take considerably less space.
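A hedged demonstration of both points (the directory name and the grown size are illustrative; exact numbers depend on the filesystem):
mkdir /tmp/dirsize-demo && cd /tmp/dirsize-demo
ls -ld .              # freshly created: one block, typically 4096
touch file{1..10000}  # enough entries to force the directory to grow
ls -ld .              # now several blocks
rm -f file*
ls -ld .              # usually stays at the larger size even though it is empty again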
