Fastest Way To Calculate Directory Sizes - linux

What is the best and fastest way to calculate directory sizes? For example, we will have the following structure:
/users
/a
/b
/c
/...
We need the output to be per user directory:
a = 1224KB
b = 3533KB
c = 3324KB
...
We plan on having tens, maybe even hundreds of thousands, of directories under /users. The following shell command works:
du -cms /users/a | grep total | awk '{print $1}'
But we would have to call it N times. The whole point is that the output, each user's directory size, will be stored in our database. We would also love to have it update as frequently as possible, but without tying up all the resources on the server. Is it even possible to recalculate the users' directory sizes every minute? How about every 5 minutes?
Now that I am thinking about it some more, would it make sense to use Node.js? That way we could calculate the directory sizes and insert them into the database all in one transaction. We could do that in PHP or Python as well, but I'm not sure it would be as fast.
Thanks.

Why not just:
du -sm /users/*
(The slowest part is still likely to be du traversing the filesystem to calculate the size, though).
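If the end goal is the database, that one du pass can feed the load directly. A minimal sketch, assuming a PostgreSQL table user_dirs(name, size_mb) with a unique index on name; the table, database and psql invocation are placeholders for whatever your schema actually looks like:
du -sm /users/*/ | while read -r size path; do
    name=$(basename "$path")
    psql -d mydb -c "INSERT INTO user_dirs (name, size_mb) VALUES ('$name', $size)
                     ON CONFLICT (name) DO UPDATE SET size_mb = EXCLUDED.size_mb;"
done
One INSERT per directory is the simple version; with hundreds of thousands of directories you would batch the rows or use COPY instead.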

What do you need this information for? If it's only for reminding the users that their home directories are too big, you should add quota limits to the filesystem. You can set the quota to 1000 GB if you just want the numbers without really limiting disk usage.
The numbers are usually accurate whenever you access anything on the disk. The only downside is that they tell you how large the files are that are owned by a particular user, instead of how large the files below his home directory are. But maybe you can live with that.
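A rough sketch of that setup, assuming /users is its own filesystem and user quotas are already enabled on it (quotacheck/quotaon done beforehand); the 1000 GB figure mirrors the "large enough to never bite" idea above:
# give user 'a' a roughly 1000 GB soft and hard block limit (quota blocks are 1 KiB)
setquota -u a 1000000000 1000000000 0 0 /users
# per-user usage report for the filesystem; the used-blocks column is what you'd store
repquota /users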

I think what you are looking for is:
du -cm --max-depth=1 /users | awk '{user = substr($2,8,300); ans = user ": " $1; print ans}'
The magic number 8 skips over the 7-character /users/ prefix, and 300 is just an arbitrarily big length (awk is not one of my best languages =D, but I am guessing that part is not going to be written in awk anyway). It's faster since you don't involve grepping for the total, and the loop over directories is contained inside du. I bet it can be done faster, but this should be fast enough.
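If the hard-coded offsets bother you, the same report can be produced without magic numbers by stripping everything up to the last slash; a small variation on the command above:
du -cm --max-depth=1 /users | awk '{sub(".*/", "", $2); print $2 ": " $1}'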

If you have multiple cores you can run the du commands in parallel.
For example (running from the folder you want to examine):
parallel du -sm ::: *
ls -A | xargs -P4 du -sm
[The number after the -P argument sets how many du processes run in parallel; -A keeps . and .. out of the list.]
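To address the every-minute / every-5-minutes part of the question: whichever variant you settle on can be driven from cron at low CPU and I/O priority so it doesn't starve the rest of the server. A sketch of a user crontab entry; /usr/local/bin/update-user-sizes.sh is a hypothetical wrapper around the du-plus-database step:
# every 5 minutes, lowest CPU priority, "idle" I/O scheduling class
*/5 * * * * nice -n 19 ionice -c3 /usr/local/bin/update-user-sizes.sh
Whether every minute is feasible depends entirely on how long one full pass over /users takes; measure a single run first.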

Not that slow, and it will show you folder sizes: du -sh /* > total.size.files.txt

The fastest way to analyze storage is the ncdu package:
sudo apt-get install ncdu
Example command:
ncdu /your/directory/

Related

How can I use the parallel command to exploit multi-core parallelism on my MacBook?

I often use the find command on Linux and macOS. I just discovered the parallel command, and I would like to combine it with find if possible, because find takes a long time when searching for a specific file in large directories.
I have searched for this information but the results are not accurate enough. There appear to be a lot of possible syntaxes, but I can't tell which one is relevant.
How do I combine the parallel command with the find command (or any other command) in order to benefit from all 16 cores that I have on my MacBook?
Update
From @OleTange, I think I have found the kind of commands that interest me.
So, to understand these commands better, I would like to know the purpose of the characters {} and ::: in the following command:
parallel -j8 find {} ::: *
1) Are these characters mandatory?
2) How can I insert classic find options like -type f or -name '*.txt'?
3) For the moment I have defined the following function in my .zshrc:
ff () {
find $1 -type f -iname $2 2> /dev/null
}
How could I do the equivalent with a fixed number of jobs (I could also set the number as a shell argument)?
Parallel processing makes sense when your work is CPU bound (the CPU does the work, and the peripherals are mostly idle) but here, you are trying to improve the performance of a task which is I/O bound (the CPU is mostly idle, waiting for a busy peripheral). In this situation, adding parallelism will only add congestion, as multiple tasks will be fighting over the already-starved I/O bandwidth between them.
On macOS, the system already indexes all your data anyway (including the contents of word-processing documents, PDFs, email messages, etc); there's a friendly magnifying glass on the menu bar at the upper right where you can access a much faster and more versatile search, called Spotlight. (Though I agree that some of the more sophisticated controls of find are missing; and the "user friendly" design gets in the way for me when it guesses what I want, and guesses wrong.)
Some Linux distros offer a similar facility; I would expect that to be the norm for anything with a GUI these days, though the details will differ between systems.
A more traditional solution on any Unix-like system is the locate command, which performs a similar but more limited task; it will create a (very snappy) index on file names, so you can say
locate fnord
to very quickly obtain every file whose name matches fnord. The index is simply a copy of the results of a find run from last night (or however you schedule the backend to run). The command is already installed on macOS, though you have to enable the back end if you want to use it. (Just run locate locate to get further instructions.)
You could build something similar yourself if you find yourself often looking for files with a particular set of permissions and a particular owner, for example (these are not features which locate records); just run a nightly (or hourly etc) find which collects these features into a database -- or even just a text file -- which you can then search nearly instantly.
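As a concrete illustration of that idea, a minimal sketch (GNU find's -printf is assumed, so on macOS you'd install findutils or substitute a stat call; the index location, the recorded fields and the user name alice are all just examples):
# nightly cron job: record mode, owner and path for every file under $HOME
find "$HOME" -type f -printf '%m %u %p\n' 2>/dev/null > "$HOME/.file-index"

# later: near-instant lookups against the flat index
grep ' alice ' "$HOME/.file-index"        # files owned by alice
awk '$1 == "600"' "$HOME/.file-index"     # files with mode 600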
For running jobs in parallel, you don't really need GNU parallel, though it does offer a number of conveniences and enhancements for many use cases; you already have xargs -P. (The xargs on macOS which originates from BSD is more limited than GNU xargs which is what you'll find on many Linuxes; but it does have the -P option.)
For example, here's how to run eight parallel find instances with xargs -P:
printf '%s\n' */ | xargs -I {} -P 8 find {} -name '*.ogg'
(This assumes the wildcard doesn't match directories which contain single quotes or newlines or other shenanigans; GNU xargs has the -0 option to fix a large number of corner cases like that; then you'd use '%s\0' as the format string for printf.)
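For completeness, a sketch of that null-separated variant (both GNU xargs and the BSD xargs on macOS understand -0):
printf '%s\0' */ | xargs -0 -I {} -P 8 find {} -name '*.ogg'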
As the parallel documentation readily explains, its general syntax is
parallel -options command ...
where {} will be replaced with the current input line (if it is missing, it will be implicitly added at the end of command ...) and the (obviously optional) ::: special token allows you to specify an input source on the command line instead of as standard input.
Anything outside of those special tokens is passed on verbatim, so you can add find options at your heart's content just by specifying them literally.
parallel -j8 find {} -type f -name '*.ogg' ::: */
I don't speak zsh but refactored for regular POSIX sh your function could be something like
ff () {
parallel -j8 find {} -type f -iname "$2" ::: "$1"
}
though I would perhaps switch the arguments so you can specify a name pattern and a list of files to search, à la grep.
ff () {
# "local" is not POSIX but works in many sh versions
local pat=$1
shift
parallel -j8 find {} -type f -iname "$pat" ::: "$@"
}
But again, spinning your disk to find things which are already indexed is probably something you should stop doing, rather than facilitate.
Just run background jobs on each first-level path separately.
The example below will create a separate analysis for each of 12 subdirectories:
$ for i in [A-Z]*/ ; do find "$i" -name "*.ogg" >> logfile & done
[1] 16945
[2] 16946
[3] 16947
# many lines
[1] Done find "$i" -name "*.ogg"
[2] Done find "$i" -name "*.ogg"
#many lines
[11] Done find "$i" -name "*.ogg"
[12] Done find "$i" -name "*.ogg"
$
Doing so creates many find processes, which the system will dispatch across cores like any other processes.
Note 1: it looks like a crude way to do it, but it just works.
Note 2: the find command itself is not hard on CPUs/cores; in 99% of use cases this is pointless because each find process spends its time waiting for disk I/O, so using parallel or similar commands won't help.
As others have written, find is I/O heavy and most likely not limited by your CPUs.
But depending on your disks it can be better to run the jobs in parallel.
NVMe disks are known for performing best if there are 4-8 accesses running in parallel. Some network file systems also work faster with multiple processes.
So some level of parallelization can make sense, but you really have to measure to be sure.
To parallelize find with 8 jobs running in parallel:
parallel -j8 find {} ::: *
This works best if you are in a dir that has many subdirs: Each subdir will then be searched in parallel. Otherwise this may work better:
parallel -j8 find {} ::: */*
Basically the same idea, but now using subdirs of dirs.
If you want the results printed as soon as they are found (and not after the find is finished) use --line-buffer (or --lb):
parallel --lb -j8 find {} ::: */*
To learn about GNU Parallel spend 20 minutes reading chapter 1+2 of https://doi.org/10.5281/zenodo.1146014 and print the cheat sheet: https://www.gnu.org/software/parallel/parallel_cheat.pdf
Your command line will thank you for it.
You appear to want to be able to locate files quickly in large directories under macOS. I think the correct tool for that job is mdfind.
I made a hierarchy with 10,000,000 files under my home directory, all with unique names that resemble UUIDs, e.g. 80104d18-74c9-4803-af51-9162856bf90d. I then tried to find one with:
mdfind -onlyin ~ -name 80104d18-74c9-4803-af51-9162856bf90d
The result was instantaneous, far too fast to measure, so I did 100 lookups; they took under 20 s in total, so on average a lookup takes under 0.2 s.
If you actually wanted to locate 100 files, you can group them into a single search like this:
mdfind -onlyin ~ 'kMDItemDisplayName==ffff4bbd-897d-4768-99c9-d8434d873bd8 || kMDItemDisplayName==800e8b37-1f22-4c7b-ba5c-f1d1040ac736 || kMDItemDisplayName==800e8b37-1f22-4c7b-ba5c-f1d1040ac736'
and it executes even faster.
If you only know a partial filename, you can use:
mdfind -onlyin ~ "kMDItemDisplayName = '*cdd90b5ef351*'"
/Users/mark/StackOverflow/MassiveDirectory/800f0058-4021-4f2d-8f5c-cdd90b5ef351
You can also use creation dates, file types, author, video duration, or tags in your search. For example, you can find all PNG images whose name contains "25DD954D73AF" like this:
mdfind -onlyin ~ "kMDItemKind = 'PNG image' && kMDItemDisplayName = '*25DD954D73AF*'"
/Users/mark/StackOverflow/MassiveDirectory/9A91A1C4-C8BF-467E-954E-25DD954D73AF.png
If you want to know what fields you can search on, take a file of the type you want to be able to look for, and run mdls on it and you will see all the fields that macOS knows about:
mdls SomeMusic.m4a
mdls SomeVideo.avi
mdls SomeMS-WordDocument.doc
More examples here.
Also, unlike with locate, there is no need to update a database frequently.

How to speed up grep/awk command?

I am going to process a text file (>300 GB) by splitting it into small text files (~1 GB). I want to speed up the grep/awk commands.
I need to grep the lines that have values in column b. Here are my approaches:
# method 1:
awk -F',' '$2 ~ /a/ { print }' input
# method 2:
grep -e ".a" < input
Both ways take about 1 minute per file. So how can I speed up this operation?
Sample of input file:
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
9,,3,12
10,0,34,45
24,4a83944,3,22
45,,435,34
Expected output file:
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
24,4a83944,3,22
Are you so sure that grep or awk is the culprit of your perceived slowness? Do you know about cut(1) or sed(1)? Have you benchmarked the time to run wc(1) on your data? Probably the textual I/O is taking a lot of time.
Please benchmark several times, and use time(1) to benchmark your program.
I have a high-end Debian desktop (with an AMD 2970WX, 64 GB RAM, a 1 TB SSD system disk, and multi-terabyte 7200 RPM SATA data disks), and just running wc on a 25 GB file (some *.tar.xz archive) sitting on a hard disk takes more than 10 minutes (measured with time). wc does some really simple textual processing, reading that file sequentially, so it should run faster than grep (but, to my surprise, does not!) or awk on the same data:
wc /big/basile/backup.tar.xz 640.14s user 4.58s system 99% cpu 10:49.92 total
and (using grep on the same file to count occurrences of a)
grep -c a /big/basile/backup.tar.xz 38.30s user 7.60s system 33% cpu 2:17.06 total
General answer to your question:
Write an equivalent program cleverly (with efficient data structures: red-black trees, hash tables, etc.) in C, C++, OCaml, or most other good languages and implementations. Or buy more RAM to increase your page cache. Or buy an SSD to hold your data. And repeat your benchmarks more than once (because of the page cache).
Suggestion for your problem: use a relational database
Using a plain text file of 300 GB is likely not the best approach. Huge text files are usually the wrong choice, and certainly so once you need to process the same data several times; you would be better off pre-processing it somehow.
If you repeat the same grep search or awk execution on the same data file more than once, consider instead using sqlite (see also this answer) or even some other real relational database (e.g. PostgreSQL or another good RDBMS) to store and then process your original data.
So a possible approach (if you have enough disk space) might be to write some program (in C, Python, Ocaml etc...), fed by your original data, and filling some sqlite database. Be sure to have clever database indexes and take time to design a good enough database schema, being aware of database normalization.
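A minimal sketch of that route using the sqlite3 command-line shell; demo.db and the table name t are placeholders, and since the table doesn't exist yet, the CSV header row (a,b,c,d) becomes the column names on import:
sqlite3 demo.db <<'EOF'
.mode csv
.import input t
CREATE INDEX idx_t_b ON t(b);
-- exact or prefix lookups on column b can now use the index:
SELECT * FROM t WHERE b = '4a337485';
-- and even a full filter runs over pre-parsed binary pages rather than raw text:
SELECT count(*) FROM t WHERE b <> '';
EOF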
Use mawk, avoid regex and do:
$ mawk -F, '$2!=""' file
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
10,0,34,45
24,4a83944,3,22
Let us know how long that took.
I did some tests with 10M records of your data; based on the results, use mawk and regex:
GNU awk and regex:
$ time gawk -F, '$2~/a/' file > /dev/null
real 0m7.494s
user 0m7.440s
sys 0m0.052s
GNU awk and no regex:
$ time gawk -F, '$2!=""' file >/dev/null
real 0m9.330s
user 0m9.276s
sys 0m0.052s
mawk and no regex:
$ time mawk -F, '$2!=""' file >/dev/null
real 0m4.961s
user 0m4.904s
sys 0m0.060s
mawk and regex:
$ time mawk -F, '$2~/a/' file > /dev/null
real 0m3.672s
user 0m3.600s
sys 0m0.068s
I suspect your real problem is that you're calling awk repeatedly (probably in a loop), once per set of values of $2 and generating an output file each time, e.g.:
awk -F, '$2==""' input > novals
awk -F, '$2!=""' input > yesvals
etc.
Don't do that as it's very inefficient since it's reading the whole file on every iteration. Do this instead:
awk -F, '{out=($2=="" ? "novals" : "yesvals")} {print > out}' input
That will create all of your output files with one call to awk. Once you get past about 15 output files it would require GNU awk for internal handling of open file descriptors or you need to add close(out)s when $2 changes and use >> instead of >:
awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}' input
and that would be more efficient if you sorted your input file first (requires GNU sort's -s for a stable sort if you care about preserving the input ordering within each unique $2 value):
sort -t, -k2,2 -s
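Putting those two pieces together, the sorted single-pass pipeline would look something like this (the output files novals and yesvals land in the current directory; remove stale ones first since the awk appends):
rm -f novals yesvals
sort -t, -k2,2 -s input |
awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}'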

What is the most reliable command to find the actual size of a file on Linux?

Recently I tried to find the size of a file using various commands, and they showed huge differences.
ls -ltr showed its size as around 34 GB (bytes rounded off by me), while
du -sh filename showed it to be around 11 GB, while
the stat command showed it to be around 34 GB.
Any idea which is the most reliable command to find the actual size of the file?
A copy operation was performed on the file, and we are unsure whether it completed properly, because after a certain time the source file it was being copied from was removed by one of our jobs.
There is no inaccuracy or reliability issue here, you're just comparing two different numbers: logical size vs physical size.
Wikipedia's article on sparse files illustrates this:
ls reports the full logical length of the file, holes included; du (without --apparent-size) reports only the allocated blocks, since those are the ones that actually take up space.
You can create a sparse file with dd count=0 bs=1M seek=100 of=myfile.
ls shows 100MiB because that's how long the file is:
$ ls -lh myfile
-rw-r----- 1 me me 100M Jul 15 10:57 myfile
du shows 0, because that's how much data it's allocated:
$ du myfile
0 myfile
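With GNU coreutils you can put the two numbers side by side for the file in question (myfile is just the example from above):
# logical size in bytes, and how many blocks (of %B bytes each) are actually allocated
stat -c 'logical=%s bytes, allocated=%b blocks of %B bytes' myfile

# the same comparison with du: apparent (logical) size vs. real disk usage
du -h --apparent-size myfile
du -h myfile
A large gap between the two usually just means the file is sparse; whether that is expected depends on how the copy was performed.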
ls -l --block-size=M
will give you a long format listing (needed to actually see the file size) and round file sizes up to the nearest MiB.
If you want MB (10^6 bytes) rather than MiB (2^20 bytes) units, use --block-size=MB instead.
If you don't want the M suffix attached to the file size, you can use something like --block-size=1M. Thanks Stéphane Chazelas for suggesting this.
This is described in the man page for ls; man ls and search for SIZE. It allows for units other than MB/MiB as well, and from the looks of it (I didn't try that) arbitrary block sizes as well (so you could see the file size as number of 412-byte blocks, if you want to).
Note that the --block-size parameter is a GNU extension on top of the Open Group's ls, so this may not work if you don't have a GNU userland (which most Linux installations do). The ls from GNU coreutils 8.5 does support --block-size as described above.
There are several notions of file size, as explained in that other guy's answer and the Wikipedia figure on sparse files.
However, you might want to use both ls(1) & stat(1) commands.
If coding in C, consider using stat(2) & lseek(2) syscalls.
See also the references in this answer.

How to copy multiple files simultaneously using scp

I would like to copy multiple files simultaneously to speed up my process. I currently use the following:
scp -r root@xxx.xxx.xx.xx:/var/www/example/example.example.com .
but it only copies one file at a time. I have a 100 Mbps fibre connection, so I have the bandwidth available to copy a lot at the same time. Please help.
You can use background tasks with the wait command.
wait ensures that all the background tasks have completed before the next line is processed, i.e. echo will run only after the scp to all three nodes has completed.
#!/bin/bash
scp -i anuruddha.pem myfile1.tar centos@192.168.30.79:/tmp &
scp -i anuruddha.pem myfile2.tar centos@192.168.30.80:/tmp &
scp -i anuruddha.pem myfile.tar centos@192.168.30.81:/tmp &
wait
echo "SCP completed"
SSH can do so-called "multiplexing": multiple sessions over one connection (to the same server). That can be one way to achieve what you want; look up keywords like "ControlMaster".
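A sketch of what that looks like in ~/.ssh/config (the Host alias and ControlPath are arbitrary); once the first connection is up, subsequent ssh/scp sessions to the same server reuse it instead of negotiating a new one:
Host example
    HostName xxx.xxx.xx.xx
    User root
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m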
The second way is to use more connections and send each job to the background:
for file in file1 file2 file3 ; do
scp "$file" server:/tmp/ &
done
But this is the answer to your question, "How to copy multiple files simultaneously". To speed things up you can use a weaker cipher (rc4 etc.), and don't forget that the bottleneck can be your hard drive, since scp doesn't implicitly limit the transfer speed.
The last option is rsync; in some cases it can be a lot faster than scp...
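For example, something along these lines (archive mode, compression, and keeping partial transfers on interruption):
rsync -az --partial root@xxx.xxx.xx.xx:/var/www/example/example.example.com .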
I am not sure if this helps you, but I generally archive the files at the source (compression is not required; just archiving is sufficient), download the archive, and extract it. This speeds up the process significantly.
Before archiving it took > 8 hours to download 1GB
After archiving it took < 8 minutes to do the same
You can use parallel-scp (AKA pscp): http://manpages.ubuntu.com/manpages/natty/man1/parallel-scp.1.html
With this tool, you can copy a file (or files) to multiple hosts simultaneously.
100 Mbit Ethernet is pretty slow, actually. You can expect 8 MiB/s in theory. In practice, you usually get 4-6 MiB/s at best.
That said, you won't see a speed increase if you run multiple sessions in parallel. You can try it yourself, simply run two parallel SCP sessions copying two large files. My guess is that you won't see a noticeable speedup. The reasons for this are:
The slowest component on the network path between the two computers determines the max. speed.
Other people might be accessing example.com at the same time, reducing the bandwidth that it can give you
100mbit Ethernet requires pretty big gaps between two consecutive network packets. GBit Ethernet is much better in this regard.
Solutions:
Compress the data before sending it over the wire
Use a tool like rsync (which uses SSH under the hood) to copy only the files which have changed since the last time you ran the command.
Creating a lot of small files takes a lot of time. Try to create an archive of all the files on the remote side and send that as a single archive.
The last suggestion can be done like this:
ssh root@xxx "cd /var/www/example ; tar cf - example.example.com" > example.com.tar
or with compression:
ssh root@xxx "cd /var/www/example ; tar czf - example.example.com" > example.com.tar.gz
Note: bzip2 compresses better but slower. That's why I use gzip (z) for tasks like this.
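And if you don't actually need the tarball locally, the same trick can be piped straight into a local tar so the files are extracted as they arrive:
ssh root@xxx "cd /var/www/example ; tar czf - example.example.com" | tar xzf -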
If you specify multiple files, scp will download them sequentially:
scp root@xxx.xxx.xx.xx:/var/www/example/file1 root@xxx.xxx.xx.xx:/var/www/example/file2 .
Alternatively, if you want the files to be downloaded in parallel, then use multiple invocations of scp, putting each in the background.
#! /usr/bin/env bash
scp root@xxx.xxx.xx.xx:/var/www/example/file1 . &
scp root@xxx.xxx.xx.xx:/var/www/example/file2 . &

Directory Stats command line interface?

WinDirStat / KDirStat / Disk Inventory X have been nothing short of revolutionary in file management. Why is there no text-only command line equivalent? I'd need it for SSH administration of my file servers.
We have all the building blocks: du, tree etc.
Is there one? Why not? Can someone please write one? :)
EDIT: du does ALMOST what I want. What I want is something that sorts each subdirectory by size (rather than full path) and indents so that it's easier to avoid double-counting. du would give me this:
cd a
du . -h
1G b
2G c
1K c/d
1K c/e
2G c/f
It's not immediately obvious that c and c/f are overlapping. What I want is this:
cd a
dir_stats .
1G b
2G c
|
+---- 2G f
|
+---- 1K d
|
+---- 1K e
in which it is clear that the 2G under c comes from the 2G in f. I can find all the info not related to c more easily (i.e. by just scanning the first column).
I'd recommend using ncdu, which stands for NCurses Disk Usage. Basically it's a collapsible version of du, with a basic command line user interface.
One thing worth noting is that it runs a bit slower than du on large amounts of data, so I'd recommend running it in a screen or using the command line options to first scan the directory and then view the results. Note the q option, it reduces the refresh rate from 1/10th of a second to 2 seconds, recommended for SSH connections.
Viewing total root space usage:
ncdu -xq /
Generate results file and view later:
ncdu -1xqo- / | gzip > export.gz
# ...some time later:
zcat export.gz | ncdu -f-
You can use KDirStat (or the new QDirStat) together with the perl script that comes along with either one to collect data on your server, then copy that file to your desktop machine and view it with KDirStat / QDirStat.
See also
https://github.com/shundhammer/qdirstat/tree/master/scripts
or
https://github.com/shundhammer/kdirstat/blob/master/kdirstat/kdirstat-cache-writer
The script does not seem to be included with the KDE 4 port K4DirStat, but it can still read and write the same cache files.
--
HuHa (Stefan Hundhammer - author of the original KDirStat)
As mentioned here: https://unix.stackexchange.com/questions/45828/print-size-of-directory-content-with-tree-command-in-tree-1-5
tree --du -h -L 2
is very much in the spirit of my goal. The only problem is that I don't think it supports sorting, so it is not suitable for huge filesystem hierarchies :(
Don't bother trying to do disk space management with ASCII-art visualizations. du follows Unix's elegant philosophy in all respects, so it composes with sort and friends for free.
Get comfortable with du and you'll have much more power for finding disk hogs remotely.
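For example, a sorted per-directory breakdown is just a pipe away (GNU sort's -h understands du's human-readable suffixes; the path is a placeholder):
du -xh --max-depth=1 /some/dir | sort -rh | head -n 20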
