I have a directory with many subdirs and about 7000+ files in total. What I need is to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirs and may or may not have the same file name. A duplicate is a file for which the diff command returns exit code 0.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs, which is not very efficient:
for f in `find /path/to/root/folder -type f`
do
    for g in `find /path/to/root/folder -type f`
    do
        if [ "$f" = "$g" ]
        then
            continue
        fi
        diff "$f" "$g" > /dev/null
        if [ $? -eq 0 ]
        then
            echo "$f" MATCHES "$g"
        fi
    done
done
Is there a more efficient way to do it?
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files null terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in "listing.txt" file.
Grab just the sums and pass them through sort and uniq -c to count how many times each sum occurs.
Use awk to print the sums that occur more than once, saving them in the "dups.txt" file; then grep listing.txt for those sums to recover the filenames.
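If checksumming every one of the 7000+ files up front is still too slow, a common refinement (just a sketch, not part of the recipe above; the sizes.txt scratch file is my own name, and GNU find, xargs, and uniq are assumed) is to group files by size first, since duplicates must be the same size, and only hash the candidates:
find /path/to/root/folder -type f -printf '%s\t%p\n' > sizes.txt
# pass 1 counts each size; pass 2 prints the paths whose size occurs more than once
awk -F'\t' 'NR==FNR { c[$1]++; next } c[$1] > 1 { print $2 }' sizes.txt sizes.txt |
xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate
uniq -w32 compares only the 32-character md5 field, and --all-repeated=separate prints each group of duplicates separated by a blank line.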
I want to delete files in the current folder with the following pattern.
0_something.sql.tar
I have a string provided which contains numbers
number_string="0,1,2,3,4"
How can I delete any files not included in the number_string while also keeping to the x_x.sql.tar pattern?
For example, I have these files:
0_something.sql.tar
2_something.sql.tar
4_something.sql.tar
15_something.sql.tar
Based on this logic, and the numbers in the number string, I should only remove 15_something.sql.tar because:
It follows the x_something.sql.tar pattern.
Its number (15) is not in the number string.
This might help you out:
s="0,1,2,3,4"
s=",${s},"
for f in *.sql.tar; do
n="${f%_*}"
[ "${n//[0-9]}" ] && continue
[ "$s" == "${s/,${n},/}" ] && echo rm -- "$f"
done
Remove the echo if this answer pleases you
What this is doing is the following:
convert your number_string s into a string which is fully comma-separated and
also starts and ends with a comma (s=",0,1,2,3,4,"). This allows us to search for entries like ,5,
loop over all files matched by the glob *.sql.tar
n="${f%_*}": Extract the substring before the first underscore `
[ "{n//[0-9]}" ] && continue: validate if the substring is an integer, if not, skip the file and move to the next one.
substitute the number in the number_string (with commas), if the substring does not change, it implies we should not keep the file
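With the question's four example files in the current directory, a dry run of this loop prints just one command:
rm -- 15_something.sql.tar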
# Get the unmatched numbers from the second stream
# ie. files to be removed
join -v2 -o2.2 <(
# output sorted numbers on separate lines
sort <<<${number_string//,/$'\n'}
) <(
# find all files named in this way
# and print filename, a tab, and the path, one file per line
find . -name '[0-9]*_something.sql.tar' -printf "%f\t%p\n" |
# extract numbers from filenames only
sed 's/\([0-9]*\)[^\t]*/\1/' |
# sort for join
sort
) |
# pass the input to xargs
# remove echo to really remove files
xargs -d '\n' echo rm
Tested on repl
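For the question's four files, the two sorted streams fed to join look like this (illustrative; the second stream has a tab between number and path):
0
1
2
3
4
and
0	./0_something.sql.tar
15	./15_something.sql.tar
2	./2_something.sql.tar
4	./4_something.sql.tar
Note that 15 sorts between 1 and 2 lexically, which is why both sides use plain sort. join pairs lines on the first field, and -v2 -o2.2 prints field 2 of only the unpaired second-stream lines, i.e. ./15_something.sql.tar.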
$IFS can help here.
( IFS=,; for n in $number_string; do echo rm $n\_something.sql.tar; done; )
The parens run the command in a subshell so the reassignment of IFS is scoped.
Setting it to a comma lets the command parser split the string into discrete numbers for you and loop over them.
If that gives you the right list of commands you want to execute, just take out the echo. :)
UPDATE
OH! I see that now. Sorry, my bad, lol...
Well then, let's try a totally different approach. :)
Extended Globbing is likely what you need.
shopt -s extglob # turn extended globbing on
echo rm !(${number_string//,/\|})_something.sql.tar
That'll show you the command that would be executed. If you're satisfied, take the echo off. :)
This skips the need for a brute-force loop.
Explanation -
Once extglob is on, !(...) means "anything that does NOT match any of these patterns."
${number_string//,/\|} replaces all commas in the string with pipe separators, creating a match pattern for the extended glob.
Thus, !(${number_string//,/\|}) means anything NOT matching one of those patterns; !(${number_string//,/\|})_something.sql.tar then means "anything that starts with something NOT one of these patterns, followed by this string."
I created these:
$: printf "%s\n" *_something.sql.tar
0_something.sql.tar
1_something.sql.tar
2_something.sql.tar
3_something.sql.tar
4_something.sql.tar
5_something.sql.tar
6_something.sql.tar
7_something.sql.tar
8_something.sql.tar
9_something.sql.tar
then after setting extglob and using the above value for $number_string, I get this:
$: echo !(${number_string//,/\|})_something.sql.tar
5_something.sql.tar 6_something.sql.tar 7_something.sql.tar 8_something.sql.tar 9_something.sql.tar
Be careful about quoting, though. You can quote it to see the pattern itself, but then it matches nothing.
$: echo "!(${number_string//,/\|})_something.sql.tar"
!(0|1|2|3|4)_something.sql.tar
if you prefer the loop...
for f in *_something.sql.tar                    # iterating over all these
do case ",$number_string," in                   # the number list, padded with commas
   *",${f%_something.sql.tar},"*) continue ;;   # this file's number is in the list - keep it
   *) rm "$f" ;;                                # delete nonmatches
   esac
done
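With the question's files this keeps 0, 2 and 4 and deletes 15_something.sql.tar; to preview first, replace rm "$f" with echo rm "$f".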
Write a script to do the matching, and remove those names that do not match. For example:
$ rm -rf foo
$ mkdir foo
$ cd foo
$ touch {2,4,6,8}.tar
$ echo "$number_string" | tr , \\n | sed 's/$/.tar/' > match-list
$ find . -type f -exec sh -c 'echo "$1" | grep -f match-list -v -q' _ {} \; -print
./6.tar
./8.tar
./match-list
Replace -print with -delete to actually unlink the names. Note that this will cause problems since match-list will probably get deleted midway through and no longer exist for future matches, so you'll want to modify it a bit. Perhaps:
find . -type f -not -name match-list -name '*.tar' -exec sh -c 'echo "$1" | grep -f match-list -v -q' _ {} \; -delete
In this case, there's no need to explicitly exclude 'match-list' since it will not match the -name '*.tar' primitive, but is included here for completeness.
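One caveat not covered above: grep -f treats each line of match-list as an unanchored basic regular expression, so a pattern like 1.tar would also keep ./11.tar or ./21.tar. Anchoring the generated patterns avoids the false matches, for example:
echo "$number_string" | tr , \\n | sed 's|^|/|; s/$/\\.tar$/' > match-list
which turns each number into a pattern like /0\.tar$ that can only match the exact basename.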
I have borrowed from some previous answers, but credit is given and the resulting script is nice:
$ ls -l
total 4
-rwxr-xr-x 1 boffi boffi 355 Jul 27 10:58 rm_tars_except
$ cat rm_tars_except
#!/usr/bin/env bash
dont_rm="$1"
# https://stackoverflow.com/a/10586169/2749397
IFS=',' read -r -a dont_rm_a <<< "$dont_rm"
for tarfile in ?.tar ; do
digit=$( basename "$tarfile" .tar )
# https://stackoverflow.com/a/15394738/2749397
[[ " ${dont_rm_a[#]} " =~ " ${digit} " ]] && \
echo "# Keep $tarfile" || \
echo "rm $tarfile"
done
$ touch 1.tar 3.tar 5.tar 7.tar
$ ./rm_tars_except 3,5
rm 1.tar
# Keep 3.tar
# Keep 5.tar
rm 7.tar
$ ./rm_tars_except 3,5 | sh
$ ls -l
total 4
-rw-r--r-- 1 boffi boffi 0 Jul 27 11:00 3.tar
-rw-r--r-- 1 boffi boffi 0 Jul 27 11:00 5.tar
-rwxr-xr-x 1 boffi boffi 355 Jul 27 10:58 rm_tars_except
$
If we can drop the requirement that the "keep" info be presented as a comma-separated string, then the script can be significantly simplified:
#!/usr/bin/env bash
for tarfile in ?.tar ; do
digit=$( basename "$tarfile" .tar )
# https://stackoverflow.com/a/15394738/2749397
[[ " ${#} " =~ " ${digit} " ]] && \
echo "# Keep $tarfile" || \
echo "rm $tarfile"
done
That, of course, should be called like this: ./rm_tars_except 3 5 | sh
find . -type f -name '*_something.sql.tar' | grep -vE "<alternation of the numbers to keep>" | xargs rm -f
example:
find . -type f -name '*_something.sql.tar' | grep -vE '/(0|1|2|3|4)_' | xargs rm -f
grep -v keeps only the files whose number is not in the list, and the /(...)_  anchoring stops 15 from matching the bare 1.
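If the filenames might contain spaces, a null-delimited variant is safer; this assumes GNU grep (-z) and xargs (-0):
find . -type f -name '*_something.sql.tar' -print0 | grep -zvE '/(0|1|2|3|4)_' | xargs -0 rm -f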
I am searching for the longest filename from my root directory to the very bottom.
I have coded a C program that will calculate the longest file name's length and its name.
However, I cannot get the shell to redirect the long list of file names to standard input for my program to receive it.
Here is what I did:
ls -Rp | grep -v / | grep -v "Permission denied" | ./home/user/findlongest
findlongest has been compiled and I checked it in one of my IDEs to make sure it's working correctly. No runtime errors were detected so far.
How do I get the list of file names into my 'findlongest' code by redirecting stdin?
Try this:
find / -type f -printf '%f\n' 2>/dev/null | /home/user/findlongest
The 2>/dev/null will discard all data written to stderr (which is where you're seeing the 'Permission denied' messages from).
Or the following to remove the dependency on your application:
find / -type f -printf '%f\n' 2>/dev/null | \
awk 'length > max_length {
max_length = length; longest_line = $0
}
END {
print length(longest_line) " " longest_line
}'
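An alternative to tracking the maximum in awk, if you'd also like to see the runners-up, is to prefix every name with its length and sort (same GNU find invocation as above; just a sketch, and it does more work since it sorts everything):
find / -type f -printf '%f\n' 2>/dev/null | awk '{ print length, $0 }' | sort -n | tail -1
Change tail -1 to tail -5 to show the five longest names.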
What about
find / -type f | /home/user/findlongest
It will list all files from root with their absolute paths, printing only those files you have permission to list (error messages for the rest go to stderr).
Based on the command:
find -exec basename '{}' ';'
which recursively prints only the filenames of all the files starting from the directory you are in: all the filenames.
This bash line will provide the file with the longest name and its number of characters:
Note that the loop involved will make the process slow.
for i in $(find -exec basename '{}' ';'); do printf '%s ' "$i" && echo -n "$i" | wc -c; done | sort -nk 2 | tail -1
By parts:
Prints the name of the file followed by a single space:
printf $i" "
Prints the number of characters of such file:
echo -n "$i" | wc -c
Sorts the output by the number of characters and takes the longest one (the very last):
sort -nk 2 | tail -1
All this inside a for loop to handle line by line.
The for statement can also be changed to:
for i in $(find -type f -printf '%f\n');
As stated in Attie's answer.
How can I find files that have long chains of consecutive 0s (zero bytes - 0x00) as a result of disk failure? For example, how can I find files that have more than 10000 zero bytes in sequence?
Sure, I can write a program using Java or other programming language, but is there a way to do it using more or less standard Linux command line tools?
Update
You can generate a test file with dd if=/dev/zero of=zeros bs=1 count=100000.
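(dd with bs=1 issues 100000 one-byte writes; the equivalent dd if=/dev/zero of=zeros bs=100000 count=1 creates the same file in a single write and is much faster.)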
This may be a start:
find /some/starting/point -type f -size +9999c -exec \
perl -nE 'if (/\x0{10000}/) {say $ARGV; close ARGV}' '{}' +
(-size +9999c restricts the search to files big enough to contain the run; a bare +10000 would mean 512-byte blocks, not bytes.)
To test for a single file, named filename:
if tr -sc '\0' '\n' < filename | tr '\0' Z | grep -qE 'Z{1000}'; then
# ...
fi
You can now use a suitable find command to select the files to test; adjust Z{1000} to the run length you need (the question asks for 10000).
For example, all *.txt files in PWD:
while read -rd '' filename;do
if tr -sc '\0' '\n' < "$filename" | tr '\0' Z | grep -qE 'Z{1000}'; then
# For example, simply print "$filename"
printf '%s\n' "$filename"
fi
done < <(find . -type f -name '*.txt' -print0)
Find and grep should work just fine (GNU grep's -P option enables PCRE, whose \x00 escape matches the NUL byte):
grep -P '\x00{1000}' <file name>
if it's a single file or a group of files in the same dir
If you want to search throughout the system there's:
find /dir/ -type f -exec grep -lP '\x00{1000}' {} \; 2> /dev/null
this is very slow though; if you're looking for something faster and can live with matching shorter runs of zeros as well, lower the repeat count, for instance grep -lP '\x00{100}'
I need some help combining elements of scripts to form a read output.
Basically I need to get the user name from the folder structure listed below and then count the number of non-blank lines in that user's folder across files of type *.ano.
This is shown in the extract below; note that the position of the username in the path is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need, but I also need to know how many non-blank lines there are in that user's folder for file type *.ano. I have the grep below that works, but I don't know how to put it all together so it can output a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user, given that your awk is correct. I often use sort | uniq -c to count the number of instances of a string (in this case the username), as opposed to wc -l, which only counts input lines.
Enjoy.
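If you need non-blank line counts per user rather than file counts, the same path-splitting idea combines with the question's grep; a sketch assuming GNU grep (-H forces the filename prefix even when a batch contains a single file) and paths without colons:
find /home -name '*.ano' -exec grep -Hcv '^[[:space:]]*$' {} + |
awk -F: '{ n = split($1, parts, "/"); sum[parts[n-2]] += $2 }
         END { for (u in sum) print u, sum[u] }'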
Have a look at wc (word count).
To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
echo "user $dir"
find "$dir" -iname '*.ano' | wc -l
done
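To get the requested "UserA 500"-style layout from the same loop (assuming, as above, that each top-level directory is named after a user):
for dir in */ ; do
    printf '%s %s\n' "${dir%/}" "$(find "$dir" -iname '*.ano' | wc -l)"
done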
Execute the bash-script below from folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
#save where we start
base=$(pwd)
# get all top-level dirs, skip '.'
D=$(find . \( -type d ! -name . -prune \))
for d in $D; do
    cd "$base"
    cd "$d"
    # search for all files named *.ano and count their non-blank lines
    sum=$(find . -type f -name '*.ano' -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0}END{print sum}')
    echo "$d" "$sum"
done
This might be what you want (untested): requires bash version 4 for associative arrays
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
username=${userdir##*/}
lines=$(grep -cv '^[[:space:]]*$' "$userdir"/user.dir/*.ano | awk -F: '{sum += $2} END {print sum}')
(( count[$username] += lines ))
done
for user in "${!count[@]}"; do
echo "$user" "${count[$user]}"
done
Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
ar=( "${#%/*}" ) # perform a "dirname" command on every array item
printf "%s\000" "${ar[#]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
while IFS="" read -r -d '' userDir; do
# to-do: customize output to get example output needed
echo "$userDir"
basename "$userDir"
find -x "${userDir}" -type f -iname "*.ano" -print0 |
xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
#xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
printf '%s\n' '----------'
done