File size check, Linux

File size check
file_path="/home/d-vm/"
cd $file_path
# file_path directory
file=($(cat video.txt | xargs ls -lah | awk '{ print $9}'))
# get names from video.txt --- two files, for example VID_141523.mp4 and VID_2_141523.mp4
minimumsize=1
actualsize=$(wc -c <"$file")
if [ $actualsize -ge $minimumsize ]; then
echo $file size $actualsize bytes
else
echo error $file size 0 bytes
fi
The VID_141523.mp4 file was corrupted during conversion; its size is 0 bytes.
Script output: error VID_20220709_141523.mp4 size 0 bytes
video.txt
VID_141523.mp4
VID_2_141523.mp4
How do I put this check into a loop?
It should check every file listed in video.txt.

To read a file's size, stat is the best fit for this use case. On a Linux machine with GNU stat:
$ stat --printf="%s" one
4
This can be looped like so:
#!/bin/bash
while read -r file ; do
if [ "$(stat --printf="%s" "$file")" -gt 0 ]
then
echo "$file" yes
else
echo "$file" no
fi
done < video.txt
That is too complicated an approach, in my opinion. Just use find:
$ touch zero
$ echo one > one
$ ll
total 8
-rw-r--r-- 1 sverma wheel 4B Jul 11 16:20 one
-rw-r--r-- 1 sverma wheel 0B Jul 11 16:20 zero
one is a 4-byte file. Use the -size predicate: here + means greater than, and -type f means regular files only:
$ find . -type f -size +0 -print
./one
You can add a filter on names with something like:
$ find . -name '*.mp4' -type f -size +0
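If your find supports it (GNU and BSD find both do), the -empty test expresses the same idea more directly; a small sketch (-printf is GNU-specific):
$ find . -name '*.mp4' -type f ! -empty        # non-empty .mp4 files
$ find . -name '*.mp4' -type f -empty -printf 'error %p size 0 bytes\n'   # the broken ones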

You probably mean: "how do I loop, instead of this: file=($(cat video.txt | xargs ls -lah | awk '{ print $9}'))?" (which is pretty horrible by itself (*)).
You should just use a while loop:
cat video.txt | while read -r file; do
# something with $file
done
Also, please use stat -c%s FILE instead of wc (which, by the way, also takes a file argument - you can use wc -c FILE instead of input redirection): stat looks only at the filesystem metadata to get the size, instead of reading and counting every byte.
(*) Doing ls -lah and then awk '{print $9}' is the same as just doing ls, and there are other issues that make this code far from idiomatic bash.
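Putting the pieces together for the original task, a minimal sketch (assuming GNU stat and that video.txt holds one filename per line, as in the question):
#!/bin/bash
# Report the size of every file listed in video.txt, flagging empty ones.
cd /home/d-vm/ || exit 1
while IFS= read -r file; do
  size=$(stat -c%s "$file") || continue   # skip names that do not exist
  if [ "$size" -gt 0 ]; then
    echo "$file size $size bytes"
  else
    echo "error $file size 0 bytes"
  fi
done < video.txt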

Related

Linux: what's a fast way to find all duplicate files in a directory?

I have a directory with many subdirectories and about 7000+ files in total. I need to find all duplicates of these files. For any given file, its duplicates might be scattered around various subdirectories and may or may not have the same file name. A duplicate is a file for which the diff command returns exit code 0.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs and not very efficient:
for f in `find /path/to/root/folder -type f`
do
for g in `find /path/to/root/folder -type f`
do
if [ "$f" = "$g" ]
then
continue
fi
diff "$f" "$g" > /dev/null
if [ $? -eq 0 ]
then
echo "$f" MATCHES "$g"
fi
done
done
Is there a more efficient way to do it?
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files null terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in the "listing.txt" file.
Grab just the sums, sort them, count them with uniq -c, and keep only the sums that appear more than once in the "dups.txt" file.
Finally, grep the duplicated sums against listing.txt to recover the matching filenames.
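Since the question defines a duplicate by diff returning 0, you may want to confirm the md5 candidates byte-for-byte rather than trust the hashes alone; a rough sketch, assuming file names without whitespace:
# For each duplicated sum, compare every candidate against the first one
while read -r sum; do
  set -- $(grep "^$sum " listing.txt | awk '{print $2}')
  first=$1; shift
  for f in "$@"; do
    cmp -s "$first" "$f" && echo "$first MATCHES $f"
  done
done < dups.txt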

How to get total count of grouped files along with the extension and path in Linux

I have multiple directories, and for each directory I want to list the total count of files grouped by their extension, like this:
/Documents
2 zip 20 M
3 tar 50 M
5 png 10 K
/Desktop
5 txt 10 K
7 png 5 M
10 jpg 5 M
The following script is my approach, but its output differs from what I expect:
#!/bin/bash
find . -maxdepth 1 -type d -print0 | while read -d '' -r dir; do
num=$(find $dir find . -type f | sed -n 's/..*\.//p' | sort | uniq -c);
printf "%5d files in directory %s\n" "$num" "$dir";
done
Script Output:
7 csv
1 csv#
2 doc
1 docx
9 html
24 jpg
33 js
20 mp3
2 mp4
9 pdf
101 png
3 ppt
128 swf
3 xml
4 xsd
26 zip
From your script, I get the idea that you just want to go one level deep. So, why not try something like this:
for dir in * ; do
  if [ -d "$dir" ] ; then
    echo "$dir"
    # list the unique extensions present in this directory (sort before uniq)
    extlist=$(ls "$dir" | sed 's/.*\.//' | sort | uniq)
    for ext in $extlist ; do
      qty=$(ls "$dir"/*"$ext" | wc -l)
      space=$(du -ch "$dir"/*"$ext" | sed -n 's/[[:space:]]*total$//p')
      echo "$qty $ext $space"
    done
  fi
done
Is this what you're after?
$ mkdir Desktop Documents
$ for d in D*; do for e in zip jpg foo; do for i in $(seq 1 $((RANDOM/3000))); do touch $d/$i.$e; done; done; done
$ declare -A a; for d in D*; do a=(); for f in $d/*.*; do (( a[${f##*.}]++ )); done; printf '%s: ' "$d"; declare -p a; done;
Desktop: declare -A a=([jpg]="3" [foo]="5" [zip]="3" )
Documents: declare -A a=([jpg]="6" [foo]="2" [zip]="2" )
You can parse the counting array as you see fit at the end of each iteration of the outer loop.
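For instance, to turn each declare -p dump into one "count extension size" line per entry, you could put something like this inside the outer loop in place of the declare -p a (the size part assumes GNU du and extension names without spaces or globbing characters):
for ext in "${!a[@]}"; do
  size=$(du -ch "$d"/*."$ext" 2>/dev/null | awk 'END{print $1}')
  printf '%s %s %s\n' "${a[$ext]}" "$ext" "$size"
done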

How to show file and folder size in ls format

ls -l shows all files and folders with detailed information, but the folder size is not correct (it always shows 4.0K).
So how can I get ls -l style output where folders show their correct (recursive) size? Thanks!
ls -l format:
drwxr-xr-x 9 jerry jerry XXX.0K Mar 3 14:34 Flask-0.10.1
-rw-rw-r-- 1 jerry jerry 532K Mar 3 14:25 Flask-0.10.1.tar.gz
drwxrwxr-x 10 jerry jerry XXXK Feb 8 15:41 leveldb1.15
You can use the -printf option of find to get most of the info - see the man page for details. The easiest way I know to find the size of a directory, including everything inside it, is du -s, so you will have to print/paste these values together, e.g.:
paste <(find . -maxdepth 1 -printf "%M %u %c %p\n") <(find . -maxdepth 1 -exec du -s {} \; | cut -f1 ) | column -t
Sample output:
drwxrwxr-x ooh Thu Apr 3 07:07:45 2014 . 12260
-rw-rw-r-- ooh Thu Apr 3 07:07:41 2014 ./test.txt 5080
drwxrwxr-x ooh Thu Apr 3 07:07:54 2014 ./testdir 7140
So: permissions / owner / date / name / size (note that du -s reports 1K blocks by default, not bytes).
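If you would rather have that last column as an apparent size in bytes, GNU du can do it with -sb; a variant of the same command:
$ paste <(find . -maxdepth 1 -printf "%M %u %c %p\n") <(find . -maxdepth 1 -exec du -sb {} \; | cut -f1 ) | column -t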
I'm not quite sure if this is what you are looking for - but from man ls:
--block-size=SIZE scale sizes by SIZE before printing them. E.g., '--block-size=M' prints sizes in units of 1,048,576 bytes. See SIZE
format below.
SIZE is an integer and optional unit (example: 10M is 10*1024*1024).
Units are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ...
(powers of 1000).
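For example, a quick illustration (sizes are simply rounded up to the chosen unit, so the 532K tarball from the question would display as 1M):
$ ls -l --block-size=M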
EDIT:
Based on your comment I would set up an alias to get an output which gives you the desired data but breaks the typical format of ls - e.g.:
alias lsd="ls -l && du -h --max-depth 1"
Result:
$ lsd
total 4
dr-xr-xr-x 4 user group 4096 Mar 12 2012 network
12K ./network
16K .
I do not know of any way to achieve your goal with ls only.
For sure there's a better way, but this works for me:
ls -l | while read f
do
echo -ne $( echo $f | cut -f -4 -d ' ')
echo -ne " $(du -s $(echo $f | cut -f 9 -d ' ') | cut -f 1) ";
echo $( echo $f | cut -f 6- -d ' ');
done
Quoting @izx:
The directory is just a link to a list of files. So the size which you
are seeing is not the total space occupied by the folder but the space
occupied by the link. The minimum size a file or directory entry/link
must occupy is one block, which is usually 4096 bytes/4K on most
ext3/4 filesystems.
So the minimum size displayed can only be 4.0K.
You can't do it with ls alone, but you can try this:
ls -lhrt | awk '$1 ~ /^d/{cmd="du -sh " $9; cmd | getline s; close(cmd); split(s,a," "); $5=a[1]}1'
I made a script similar to this before (replicating some ls options); with the du change it should work here.
#!/bin/bash
[[ ! -d "$1" ]] && echo "Error $1 not a directory" 1>&2 && exit
for f in "$1"/*; do
f="${f##*/}"
symlink=
if [[ "$(stat -c '%h' "$f")" -gt 9 ]]; then
string="$(stat -c "%A %h %U %G" "$f")"
else
string="$(stat -c "%A %h %U %G" "$f")"
fi
t="$(stat -c "%x" "$f")"
date="$(date -d "$t" "+%-d")"
if [[ $date -lt 10 ]]; then
date="$(date -d "$t" "+%b %-d %H:%M")"
else
date="$(date -d "$t" "+%b %d %H:%M")"
fi
if [[ -h $f ]]; then
size="$(ls -sh "$f" | cut -d ' ' -f1)"
symlink="$f -> $(readlink "$f")"
f=
elif [[ -d $f ]]; then
size="$(du -sh "$f" | cut -f1 2>&1)"
f="$f/"
elif [[ -f $f ]]; then
size="$(ls -sh "$f" | cut -d ' ' -f1)"
else
echo "Error $f" 1>&2 && exit
fi
printf "%s %-4s %s %s\n" "$string" "$size" "$date" "${f}$symlink"
done
Will obviously be quite a bit slower than regular ls, since it has to calculate size.
Example output
> ./abovescript "./testdir"
drwxrwxr-x 5 owner group 11M Apr 3 06:53 workspace/
-rwxrwxr-x 1 owner group 8.0K Mar 30 03:53 a.out
lrwxrwxrwx 1 owner group 0 Apr 3 07:19 foo -> test/file
Try:
du -b -d 1
This prints the apparent size in bytes of each folder down to a depth of 1, which is roughly the directory equivalent of ls -al.
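Raw byte counts can be hard to read; with GNU du you can keep apparent sizes but get human-readable units, e.g.:
$ du --apparent-size -h -d 1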

File with the most lines in a directory NOT bytes

I'm trying to wc -l an entire directory and then display the filename in an echo with the number of lines.
To add to my frustration, the directory has to come from a passed argument. So, without looking stupid, can someone first tell me why a simple wc -l $1 doesn't give me the line count for the directory I type in the argument? I know I'm not understanding it completely.
On top of that I need validation too, if the argument given is not a directory or there is more than one argument.
wc works on files rather than directories, so if you want the line count of all files in the directory, you would start with:
wc -l $1/*
With various gyrations to get rid of the total, sort it and extract only the largest, you could end up with something like (split across multiple lines for readability but should be entered on a single line):
pax> wc -l $1/* 2>/dev/null
| grep -v ' total$'
| sort -n -k1
| tail -1l
2892 target_dir/big_honkin_file.txt
As to the validation, you can check the number of parameters passed to your script with something like:
if [[ $# -ne 1 ]] ; then
echo 'Whoa! Wrong parameter count'
exit 1
fi
and you can check if it's a directory with:
if [[ ! -d $1 ]] ; then
echo 'Whoa!' "[$1]" 'is not a directory'
exit 1
fi
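Putting the validation and the pipeline together, a minimal sketch of the whole script (the script name is just an example) could be:
#!/bin/bash
# Usage: ./maxlines.sh <directory>
if [[ $# -ne 1 ]] ; then
  echo 'Whoa! Wrong parameter count'
  exit 1
fi
if [[ ! -d $1 ]] ; then
  echo 'Whoa!' "[$1]" 'is not a directory'
  exit 1
fi
# line counts per file, drop the "total" line, keep the largest
wc -l "$1"/* 2>/dev/null | grep -v ' total$' | sort -n -k1 | tail -n 1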
Is this what you want?
> find ./test1/ -type f|xargs wc -l
1 ./test1/firstSession_cnaiErrorFile.txt
77 ./test1/firstSession_cnaiReportFile.txt
14950 ./test1/exp.txt
1 ./test1/test1_cnaExitValue.txt
15029 total
so your directory, which is the argument, should go here:
find $your_complete_directory_path/ -type f|xargs wc -l
I'm trying to to wc -l an entire directory and then display the
filename in an echo with the number of lines.
You can do a find on the directory and use -exec option to trigger wc -l. Something like this:
$ find ~/Temp/perl/temp/ -exec wc -l '{}' \;
wc: /Volumes/Data/jaypalsingh/Temp/perl/temp/: read: Is a directory
11 /Volumes/Data/jaypalsingh/Temp/perl/temp//accessor1.plx
25 /Volumes/Data/jaypalsingh/Temp/perl/temp//autoincrement.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless1.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless2.plx
22 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr1.plx
27 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr2.plx
7 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee1.pm
18 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee2.pm
26 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee3.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//ftp.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit1.plx
16 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit2.plx
24 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit3.plx
33 /Volumes/Data/jaypalsingh/Temp/perl/temp//persisthash.pm
Nice question!
I saw the answers, and some are pretty good. The find ... | xargs one is my favourite. It could be simplified using the find ... -exec wc -l {} + syntax, but there is a problem: when the command-line buffer fills up, wc -l ... is invoked again, and each invocation prints a <number> total line. As wc has no option to disable this feature, and filtering these lines out with grep is not nice, wc has to be reimplemented.
So my complete answer is:
#!/usr/bin/bash
[ $# -ne 1 ] && echo "Bad number of args">&2 && exit 1
[ ! -d "$1" ] && echo "Not dir">&2 && exit 1
find "$1" -type f -exec awk '{++n[FILENAME]}END{for(i in n) printf "%8d %s\n",n[i],i}' {} +
Or, using less temporary space but slightly more awk code:
find "$1" -type f -exec awk 'function pr(){printf "%8d %s\n",n,f}FNR==1{f&&pr();n=0;f=FILENAME}{++n}END{pr()}' {} +
Misc
If it should not descend into subdirectories, add -maxdepth 1 before -type f in the find command.
It is pretty fast. I was afraid it would be much slower than the find ... wc + version, but for a directory containing 14770 files (in several subdirectories) the wc version ran in 3.8 s and the awk version in 5.2 s.
awk and wc treat a final line that does not end in \n differently: wc does not count it. I prefer to count it, as awk does.
It does not print empty files.
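If you only want the single file with the most lines, the output of the awk version can be sorted numerically and trimmed, e.g.:
find "$1" -type f -exec awk '{++n[FILENAME]}END{for(i in n) printf "%8d %s\n",n[i],i}' {} + | sort -n | tail -n 1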
To find the file with most lines in the current directory and its subdirectories, with zsh:
lines() REPLY=$(wc -l < "$REPLY")
wc -l -- **/*(D.nO+lines[1])
That defines a lines function which is going to be used as a glob sorting function that returns in $REPLY the number of lines of the file whose path is given in $REPLY.
Then we use zsh's recursive globbing **/* to find regular files (.), numerically (n) reverse sorted (O) with the lines function (+lines), and select the first one [1]. (D to include dotfiles and traverse dotdirs).
Doing it with standard utilities is a bit tricky if you don't want to make assumptions on what characters file names may contain (like newline, space...). With GNU tools as found on most Linux distributions, it's a bit easier as they can deal with NUL terminated lines:
find . -type f -exec sh -c '
for file do
lines=$(wc -l < "$file") &&
printf "%s\0" "$lines:$file"
done' sh {} + |
tr '\n\0' '\0\n' |
sort -rn |
head -n1 |
tr '\0' '\n'
Or with zsh or GNU bash syntax:
biggest= max=-1
find . -type f -print0 |
{
while IFS= read -rd '' file; do
size=$(wc -l < "$file") &&
((size > max)) &&
max=$size biggest=$file
done
[[ -n $biggest ]] && printf '%s\n' "$max: $biggest"
}
Here's one that works for me with Git Bash (mingw32) under Windows:
find . -type f -print0| xargs -0 wc -l
This will list the files and line counts in the current directory and subdirectories. You can also redirect the output to a text file and import it into Excel if needed:
find . -type f -print0| xargs -0 wc -l > fileListingWithLineCount.txt

Linux Shell - String manipulation then calculating age of file in minutes

I am writing a script that calculates the age of the oldest file in a directory. The first commands run are:
OLDFILE=`ls -lt $DIR | grep "^-" | tail -1 `
echo $OLDFILE
The output contains a lot more than just the filename, e.g.:
-rwxrwxr-- 1 abc abc 334 May 10 2011 ABCD_xyz20110510113817046.abc.bak
Q1/. How do I obtain the output after the last space of the above line? This would give me the filename. I realise some sort of string manipulation is required but am new to shell scripting.
Q2/. How do I obtain the age of this file in minutes?
To obtain just the oldest file's name,
ls -lt | awk '/^-/{file=$NF}END{print file}'
However, this is not robust if you have files with spaces in their names, etc. Generally, you should try to avoid parsing the output from ls.
With stat you can obtain a file's modification time in machine-readable format, expressed as seconds since Jan 1, 1970; with date +%s you can obtain the current time in the same format. Subtract and divide by 60. (More Awk skills would come in handy for the arithmetic.)
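A minimal sketch of that arithmetic (GNU stat assumed; $oldfile stands for whatever filename you extracted above):
now=$(date +%s)                  # current time, seconds since the epoch
mtime=$(stat -c %Y "$oldfile")   # file's last-modification time, same units
echo "$oldfile is $(( (now - mtime) / 60 )) minutes old"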
Finally, for an alternate solution, look at the options for find; in particular, its printf format strings allow you to extract a file's timestamp. The following will directly get you the modification time (seconds since the epoch) and inode number of the oldest file:
find . -maxdepth 1 -type f -printf '%T@ %i\n' |
sort -n | head -n 1
Using the inode number avoids the issues of funny file names; once you have a single inode, converting that to a file name is a snap:
find . -maxdepth 1 -inum "$number"
Tying the two together, you might want something like this:
# set -- Replace $# with output from command
set -- $(find . -maxdepth 1 -type f -printf '%T@ %i\n' |
sort -n | head -n 1)
# now $1 is the timestamp and $2 is the inode
oldest_filename=$(find . -maxdepth 1 -inum "$2")
age_in_minutes=$(date +%s | awk -v d="$1" '{ print ($1 - d) / 60 }')
An awk solution, giving you how old the file is in minutes (as your ls output does not contain the minutes of the timestamp, 00 is assumed by default). Also, as tripleee pointed out, ls output is inherently risky to parse.
[[bash_prompt$]]$ echo $l; echo "##############";echo $l | awk -f test.sh ; echo "#############"; cat test.sh
-rwxrwxr-- 1 abc abc 334 May 20 2013 ABCD_xyz20110510113817046.abc.bak
##############
File ABCD_xyz20110510113817046.abc.bak is 2074.67 min old
#############
BEGIN{
# map month names to two-digit month numbers
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
date[d[o]]=sprintf("%02d",o)
}
}
{
month=date[$6];
# build a "YYYY MM DD HH MM SS" spec for mktime; time of day defaults to 00:00:00
old_time=mktime($8" "month" "$7" 00 00 00");
curr_time=systime();
print "File " $NF " is " (curr_time-old_time)/60 " min old";
}
For Q2 a bash one-liner could be:
mtime_minutes=$(( ( $(date +%s) - $(stat -c%Y "$file_to_inspect") ) / 60 ))
You probably want ls -t. Including the '-l' option explicitly asks for ls to give you all that other info that you actually want to get rid of.
However what you probably REALLY want is find, first:
NEWEST=$(find . -maxdepth 1 -type f | cut -c3- |xargs ls -t| head -1)
this will get you the plain filename of the newest item, no other processing needed.
Directories will be correctly excluded. (thanks to tripleee for pointing out that's what you were aiming for.)
For the second question, you can use stat and bc:
AGE_MINUTES=$(echo "( $(date +%s) - $(stat --format %Y "$NEWEST") ) / 60" | bc -l)
remove the '-l' option from bc if you only want a whole number of minutes.
