Total size of the contents of all the files in a directory [closed] - linux

When I use ls or du, I get the amount of disk space each file is occupying.
I need the sum total of all the data in files and subdirectories I would get if I opened each file and counted the bytes. Bonus points if I can get this without opening each file and counting.

If you want the 'apparent size' (that is, the number of bytes in each file) rather than the space the files take up on disk, use the -b or --bytes option (if you have a Linux system with GNU coreutils):
% du -sbh <directory>

Use du -sb:
du -sb DIR
Optionally, add the h option for more user-friendly output:
du -sbh DIR

cd to the directory, then:
du -sh
ftw!
Originally wrote about it here:
https://ao.ms/get-the-total-size-of-all-the-files-in-a-directory/

Just an alternative:
ls -lAR | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'
grep -v '^d' will exclude the directories.

stat's "%s" format gives you the actual number of bytes in a file.
find . -type f |
xargs stat --format=%s |
awk '{s+=$1} END {print s}'
Feel free to substitute your favourite method for summing numbers.

If you use BusyBox's du on an embedded system, you cannot get exact byte counts with du, only kilobytes.
BusyBox v1.4.1 (2007-11-30 20:37:49 EST) multi-call binary
Usage: du [-aHLdclsxhmk] [FILE]...
Summarize disk space used for each FILE and/or directory.
Disk space is printed in units of 1024 bytes.
Options:
-a Show sizes of files in addition to directories
-H Follow symbolic links that are FILE command line args
-L Follow all symbolic links encountered
-d N Limit output to directories (and files with -a) of depth < N
-c Output a grand total
-l Count sizes many times if hard linked
-s Display only a total for each argument
-x Skip directories on different filesystems
-h Print sizes in human readable format (e.g., 1K 243M 2G )
-m Print sizes in megabytes
-k Print sizes in kilobytes(default)

On Windows, from the command prompt, you can run:
c:> dir /s c:\directory\you\want
and the penultimate line will tell you how many bytes the files take up.
I know this reads all files and directories, but it works faster in some situations.

When a folder is created, many Linux filesystems allocate 4096 bytes to store some metadata about the directory itself.
This space is increased by a multiple of 4096 bytes as the directory grows.
The du command (with or without the -b option) takes this space into account, as you can see by typing:
mkdir test && du -b test
You will get a result of 4096 bytes for an empty dir.
So, if you put 2 files of 10000 bytes inside the dir, the total amount given by du -sb would be 24096 bytes.
If you read the question carefully, this is not what was asked. The questioner asked for:
the sum total of all the data in files and subdirectories I would get if I opened each file and counted the bytes
which in the example above should be 20000 bytes, not 24096.
So, the correct answer IMHO is a blend of Nelson's answer and hlovdal's suggestion to handle filenames containing spaces:
find . -type f -print0 | xargs -0 stat --format=%s | awk '{s+=$1} END {print s}'
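For instance, here is a quick way to reproduce the difference (a sketch; the directory overhead is not always exactly 4096 bytes, it depends on the filesystem):
mkdir test
head -c 10000 /dev/zero > test/file1
head -c 10000 /dev/zero > test/file2
du -sb test    # about 24096: 20000 bytes of file data plus the directory's own metadata
find test -type f -print0 | xargs -0 stat --format=%s | awk '{s+=$1} END {print s}'    # exactly 20000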

There are at least three ways to get the "sum total of all the data in files and subdirectories" in bytes that work in both Linux/Unix and Git Bash for Windows, listed below in order from fastest to slowest on average. For your reference, they were executed at the root of a fairly deep file system (docroot in a Magento 2 Enterprise installation comprising 71,158 files in 30,027 directories).
1.
$ time find -type f -printf '%s\n' | awk '{ total += $1 }; END { print total" bytes" }'
748660546 bytes
real 0m0.221s
user 0m0.068s
sys 0m0.160s
2.
$ time echo `find -type f -print0 | xargs -0 stat --format=%s | awk '{total+=$1} END {print total}'` bytes
748660546 bytes
real 0m0.256s
user 0m0.164s
sys 0m0.196s
3.
$ time echo `find -type f -exec du -bc {} + | grep -P "\ttotal$" | cut -f1 | awk '{ total += $1 }; END { print total }'` bytes
748660546 bytes
real 0m0.553s
user 0m0.308s
sys 0m0.416s
These two also work, but they rely on commands that don't exist on Git Bash for Windows:
1.
$ time echo `find -type f -printf "%s + " | dc -e0 -f- -ep` bytes
748660546 bytes
real 0m0.233s
user 0m0.116s
sys 0m0.176s
2.
$ time echo `find -type f -printf '%s\n' | paste -sd+ | bc` bytes
748660546 bytes
real 0m0.242s
user 0m0.104s
sys 0m0.152s
If you only want the total for the current directory, then add -maxdepth 1 to find.
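For example, the fastest variant above restricted to the current directory would look like this (my combination of the pieces already shown, not separately benchmarked):
find -maxdepth 1 -type f -printf '%s\n' | awk '{ total += $1 }; END { print total" bytes" }'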
Note that some of the suggested solutions don't return accurate results, so I would stick with the solutions above instead.
$ du -sbh
832M .
$ ls -lR | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'
Total: 583772525
$ find . -type f | xargs stat --format=%s | awk '{s+=$1} END {print s}'
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
4390471
$ ls -l| grep -v '^d'| awk '{total = total + $5} END {print "Total" , total}'
Total 968133

du is handy, but find is useful if you want to calculate the size of only some files (for example, filtered by extension). Also note that find itself can print the size of each file in bytes. To calculate the total size, we can pipe into the dc command in the following manner:
find . -type f -printf "%s + " | dc -e0 -f- -ep
Here find generates a sequence of commands for dc like 123 + 456 + 11 +.
However, the complete program should look like 0 123 + 456 + 11 + p (remember postfix notation).
So, to get the complete program we need to put 0 on the stack before executing the sequence from stdin, and print the top number after executing it (the p command at the end).
We achieve this via dc options:
-e0 is just a shortcut for -e '0', which puts 0 on the stack,
-f- reads and executes commands from stdin (generated by find here),
-ep prints the result (-e 'p').
To print the size in MiB, like 284.06 MiB, we can use -e '2 k 1024 / 1024 / n [ MiB] p' instead of -ep (most spaces are optional).
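Putting those pieces together, the full MiB variant would look like this (a sketch assembled from the commands above; the printed value of course depends on your directory):
find . -type f -printf "%s + " | dc -e0 -f- -e '2 k 1024 / 1024 / n [ MiB] p'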

Use:
$ du -ckx <DIR> | grep total | awk '{print $1}'
Where <DIR> is the directory you want to inspect.
The -c option gives you a grand total line, which is extracted by the grep total portion of the command, and the count in kilobytes is extracted with the awk command.
The only caveat here is that if you have a subdirectory whose name contains the text "total", it will be spit out as well.
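One way around that caveat (my own variation, using only standard du behaviour) is to take the last line of the output instead of grepping, since du -c always prints the grand total last:
du -ckx <DIR> | tail -n 1 | awk '{print $1}'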

This may help:
ls -l| grep -v '^d'| awk '{total = total + $5} END {print "Total" , total}'
The above command sums the sizes of all the files, leaving out the directory sizes.

Related

extracting last n percentage of a file output from zcat command

I am trying to extract the last 2 percent of a file's output coming from the zcat command. I tried something like
numlines=$(zcat file.tar.gz | wc -l)
zcat file.tar.gz | tail -n + $numlines*(98/100)
But the problem with this approach is that my file is too big, and I can't afford to run the zcat command twice. Is there some way I could do it, maybe by piping the number of lines, or some other way?
EDIT :
The output of zcat file.tar.gz | tar -xO | dd 2>&1 | tail -n 1 is
16942224047 bytes (17 GB, 16 GiB) copied, 109.154 s, 155 MB/s
Any help would be greatly appreciated.
Read the content into a variable. I assume that there is enough RAM available.
content=$(zcat file.tar.gz | tar -xO)
lines=$(wc -l <<<"$content")
# number of lines in the last 2% (everything after the first 98%)
ninetyeight=$(( lines - lines * 98 / 100 ))
tail -n "$ninetyeight" <<<"$content"
This only works if the file contains at least 100 lines.
The following awk program will keep only the last n% of your file in memory. The percentage is floored; that is to say, if n% of the file represents 134.56 lines, it will print 134 lines:
awk -v n=2 '{a[FNR]=$0; min=FNR-int(FNR*n/100)}
{i=min; while(i in a) delete a[i--]}
END{for(i=min+1;i<=FNR;++i) print a[i]}' - < <(zcat file)
You can verify this by replacing zcat file with seq 100.
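That verification looks like this (with n=2 and 100 input lines it should print just the last two lines, 99 and 100):
awk -v n=2 '{a[FNR]=$0; min=FNR-int(FNR*n/100)}
{i=min; while(i in a) delete a[i--]}
END{for(i=min+1;i<=FNR;++i) print a[i]}' - < <(seq 100)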

Get size of image in bash

I want to get the size of an image. The image is in a folder, named encodedImage.jpc
a="$(ls -s encodedImage.jpc | cut -d " " -f 1)"
temp="$(( $a*1024 * 8))"
echo "$a"
The output is not correct. How do I get the size? Thank you
Rather than parsing the ls output, the proper way is to use the stat command, like this:
stat -c '%s' file
Check
man stat | less +/'total size, in bytes'
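Applied to the file from the question, and reproducing the bits calculation the asker seems to be attempting (a sketch; the echoed text is my own):
size_bytes=$(stat -c '%s' encodedImage.jpc)
size_bits=$(( size_bytes * 8 ))
echo "$size_bytes bytes = $size_bits bits"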
If by size you mean bytes or pretty-printed bytes, you can just use:
ls -lh
-h When used with the -l option, use unit suffixes: Byte, Kilobyte, Megabyte, Gigabyte, Terabyte and Petabyte in order to reduce the number of digits to three or less using base 2 for sizes.
I guess the more complete answer, if you're just trying to pull out the file size alone (I added the file name as well; you can remove ,$9 to drop it):
ls -lh | awk '{print $5,$9}'
You can use this command:
du -sh your_file

Total number of lines in a directory

I have a directory with thousands of files (100K for now). When I use wc -l ./*, I'll get:
c1 ./test1.txt
c2 ./test2.txt
...
cn ./testn.txt
c1+c2+...+cn total
Because there are a lot of files in the directory, I just want to see the total count and not the details. Is there any way to do so?
I tried several ways and got the following error:
Argument list too long
If what you want is the total number of lines and nothing else, then I would suggest the following command:
cat * | wc -l
This catenates the contents of all of the files in the current working directory and pipes the resulting blob of text through wc -l.
I find this to be quite elegant. Note that the command produces no extraneous output.
UPDATE:
I didn't realize your directory contained so many files. In light of this information, you should try this command:
for file in *; do cat "$file"; done | wc -l
Most people don't know that you can pipe the output of a for loop directly into another command.
Beware that this could be very slow. If you have 100,000 or so files, my guess would be around 10 minutes. This is a wild guess because it depends on several parameters that I'm not able to check.
If you need something faster, you should write your own utility in C. You could make it surprisingly fast if you use pthreads.
Hope that helps.
LAST NOTE:
If you're interested in building a custom utility, I could help you code one up. It would be a good exercise, and others might find it useful.
Credit: this builds on @lifecrisis's answer, and extends it to handle large numbers of files:
find . -maxdepth 1 -type f -exec cat {} + | wc -l
find will find all of the files in the current directory, break them into groups as large as can be passed as arguments, and run cat on the groups.
awk 'END {print NR" total"}' ./*
It would be an interesting comparison to find out how many lines don't end with a newline.
Combining the awk and Gordon's find solutions and avoiding the "." files:
find ./* -maxdepth 0 -type f -exec awk 'END {print NR}' {} +
No idea if this is better or worse, but it does give a more accurate count (for me) and does not count lines in "." files. Using ./* is just a guess that appears to work. You still need a depth limit, and ./* requires a depth of "0".
I did get the same result with the "cat" and "awk" solutions (using the same find), since cat * takes care of the newline issue. I don't have a directory with enough files to measure time. Interesting; I'm liking the "cat" solution.
This will give you the total count for all the files (including hidden files) in your current directory :
$ find . -maxdepth 1 -type f | xargs wc -l | grep total
1052 total
To count lines for files excluding hidden files, use:
find . -maxdepth 1 -type f -not -path "*/\.*" | xargs wc -l | grep total
(Apologies for adding this as an answer, but I do not have enough reputation for commenting.)
A comment on @lifecrisis's answer. Perhaps cat is slowing things down a bit. We could replace cat with wc -l and then use awk to add the numbers. (This could be faster since much less data needs to go through the pipe.)
That is
for file in *; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
instead of
for file in *; do cat "$file"; done | wc -l
(Disclaimer: I am not incorporating many of the improvements in other answers, but I thought the point was valid enough to write down.)
Here are my results for comparison (I ran the newer version first so that any cache effects would go against the newer candidate).
$ time for f in `seq 1 1500`; do head -c 5M </dev/urandom >myfile-$f |sed -e 's/\(................\)/\1\n/g'; done
real 0m50.360s
user 0m4.040s
sys 0m49.489s
$ time for file in myfile-*; do wc -l "$file"; done | awk '{sum += $1} END {print sum}'
30714902
real 0m3.455s
user 0m2.093s
sys 0m1.515s
$ time for file in myfile-*; do cat "$file"; done | wc -l
30714902
real 0m4.481s
user 0m2.544s
sys 0m4.312s
If you want to know only the total number of entries in the directory, excluding ls's "total" line:
ls -ltr | sed -n '/total/!p' | awk 'END {print NR}'
Note that this counts files, not lines. The commands below will provide the total count of lines from all files in the path:
for i in `ls -ltr | awk '$1~"^-rw"{print $9}'`; do wc -l "$i" | awk '{print $1}'; done >> /var/tmp/filelinescount.txt
cat /var/tmp/filelinescount.txt | sed -r "s/\s+//g" | tr "\n" "+" | sed "s:+$::g" | awk '{print "echo " $0 " | bc"}' | sh

Linux: Find total filesize of files path inside a file

I have a file containing some file paths like:
./file1
./dir/file2
./dir3/dir4/fil3
etc
How can I find the total file size of all of them? I know about du for getting the size of a single file, but I have no idea how to use it with a file of paths.
Thank you
You can use du to get the total size of multiple files:
cat file | tr "\n" "\0" | du -ch --files0-from=- | tail -n1
Or use awk to get the file sizes:
cat file | awk '{system("ls -l " $0)}' | awk '{ TOTAL += $5} END { print TOTAL}'
GNU coreutils du only suggestion
EDIT: the named option --files0-from is a GNU extension, so this suggested solution won't work with any non-GNU coreutils du version. As you don't seem to have it, the awk version posted by Vivek Goel is the one you should try instead.
You already answered your own question. Using du is the key. The "missing" option you're looking for might be this one found in the manual pages. (man du)
--files0-from=F
summarize disk usage of the NUL-terminated file names specified in file F; If F is - then read names from standard input
Usage would be like this:
tr "\n" "\0" <file-list | du --files0-from=-

How to list all binary file extensions within a directory tree?

I need to build a list of all the file extensions of binary files located within a directory tree.
The main question would need to be how to distinguish a text file from a binary one, and the rest should be cake.
EDIT: This is the closest I got, any better ideas?
find . -type f|xargs file|grep -v text|sed -r 's:.*\.(.*)\:.*:\1:g'
Here's a trick to find the binary files:
grep -r -m 1 "^" <Your Root> | grep "^Binary file"
The -m 1 makes grep stop after the first match, so it doesn't read each whole file.
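Since the question asks for extensions rather than file names, one way to finish the job is to strip everything up to the last dot (my own addition, not part of the answer; files without an extension simply won't appear):
grep -r -m 1 "^" . | grep "^Binary file" | sed -E 's/^Binary file (.*) matches$/\1/' | grep -o '\.[^./]*$' | sort -u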
This Perl one-liner worked for me, and it was also quite fast:
find . -type f -exec perl -MFile::Basename -e 'print (-T $_ ? "" : (fileparse ($_, qr/\.[^.]*/))[2] . "\n" ) for @ARGV' {} + | sort | uniq
And this is how you can find all binary files in the current folder:
find . -type f -exec perl -e 'print (-B $_ ? "$_\n" : "" ) for @ARGV' {} +
-T is a test for text files, and -B for binary, and they are opposites of each other*.
*perl file tests doc
There is no difference between a binary file and a text file on Linux. The file utility looks at the contents and guesses. Unfortunately, it's not of much help because file doesn't produce a simple "binary or text" answer; it has a complex output with a large number of cases that you would have to parse.
One approach is to read some fixed-size prefix of a file, say 256 bytes, and then apply some heuristics. For instance, are all the byte values 0x00 to 0x7F, avoiding control codes except for common whitespace? That suggests ASCII text. If there are bytes 0x80 through 0xFF, does the entire buffer (except possibly for one code at the end, which may be chopped) decode as valid UTF-8? Etc.
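A minimal shell sketch of that heuristic, assuming GNU grep and looking only at the first 256 bytes (is_probably_text is a hypothetical helper name; it skips the UTF-8 step, so UTF-8 text would be flagged as binary too):
is_probably_text() {
    # binary if the prefix contains any byte that is neither printable nor whitespace in the C locale
    head -c 256 "$1" | LC_ALL=C grep -q '[^[:print:][:space:]]' && return 1
    return 0
}
# usage: is_probably_text somefile && echo text || echo binary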
One idea might be to sneakily exploit utilities which detect binary files, like GNU diff.
$ diff -r /bin/ls <(echo foo)
Binary files /bin/ls and /dev/fd/63 differ
It still works without process substitution:
$ diff -r /bin/ls /dev/null
Binary files /bin/ls and /dev/null differ
Now just grep the output of that and look for the word Binary.
The question is whether diff's heuristic for binary files works for your purposes.
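A sketch of the complete check described above (somefile is a placeholder; it relies on GNU diff's "Binary files ... differ" message):
if diff somefile /dev/null | grep -q '^Binary'; then
    echo "somefile is probably binary"
else
    echo "somefile is probably text"
fi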
There is no sure way to differentiate a "text" file from a "binary" file; it is guesswork.
#!/bin/bash
# Guess: compare the amount of printable data (as seen by strings) in the first
# 4096 bytes against the total number of bytes read, with a 5% fudge factor.
guess=`echo \`head -c 4096 "$1" | strings -a -n 1 | wc -c\` '* 1.05 /' \`head -c 4096 "$1" | wc -c\` | bc`
if [ "$guess" -eq 1 ] ; then
    echo "$1" "is a text file"
    exit 0
else
    echo "$1" "is a binary file"
    exit 1
fi
Here is a simple command to list all binary files (those containing a NUL character) using GNU grep:
grep -Palr '\x00' .
To print only the file extensions shorter than 5 characters, we can use awk and then filter out the duplicates using either sort -u or sort | uniq.
So, all together, it should be something like:
grep -Palr '\x00' . | awk -F. '{if (length($NF) < 5) print $NF}' | sort -u
Here is a one-liner in Python to check whether a file is binary:
b"\x00" in open("/etc/hosts", "rb").read()
To use it recursively from the shell, see the example below:
IS_BINARY='import sys; sys.exit(not b"\x00" in open(sys.argv[1], "rb").read())'
find . -type f -exec bash -c "python -c '$IS_BINARY' {} && echo {}" \;
To find all non-binary files, change && to ||.
