How to find files that have long chains of zero bytes? - linux

How can I find files that have long chains of consecutive 0s (zero bytes - 0x00) as a result of disk failure? For example, how can I find files that have more than 10000 zero bytes in sequence?
Sure, I could write a program in Java or another programming language, but is there a way to do it using more or less standard Linux command-line tools?
Update
You can generate a test file with dd if=/dev/zero of=zeros bs=1 count=100000.
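If you also want a test file where the zero run sits between ordinary data, as a damaged file would have it, something along these lines should work (zeros-embedded is just a made-up name; status=none is GNU dd):
# 20000 zero bytes sandwiched between two text lines
{ echo before; dd if=/dev/zero bs=1000 count=20 status=none; echo after; } > zeros-embedded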

This may be a start (note the c suffix on -size so the threshold is in bytes rather than 512-byte blocks; a file must be over 10000 bytes long to contain such a run):
find /some/starting/point -type f -size +10000c -exec \
    perl -nE 'if (/\x0{10000}/) {say $ARGV; close ARGV}' '{}' +
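Run against the test file from the update above, it should print the file's name:
perl -nE 'if (/\x0{10000}/) {say $ARGV; close ARGV}' zeros
(perl -n reads one newline-terminated record at a time; a run of NULs contains no newline, so it always falls inside a single record.)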

To test a single file, named filename (tr -sc '\0' '\n' squeezes every run of non-NUL bytes down to a single newline, leaving only the NUL runs; tr '\0' Z then makes those visible as Z characters whose run length grep can measure):
if tr -sc '\0' '\n' < filename | tr '\0' Z | grep -qE 'Z{1000}'; then
    # ...
fi
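To see the transformation on a toy input (a sketch with the run length cut down to 4 so it fits on one line):
# 'abc' and 'def' each collapse to a single newline; the four NULs become ZZZZ
printf 'abc\0\0\0\0def' | tr -sc '\0' '\n' | tr '\0' Z | grep -cE 'Z{4}'
# prints 1: one line contains a run of at least 4 Zs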
You can now use a suitable find command to select the candidate files for the test.
For example, for all *.txt files under the current directory:
while IFS= read -rd '' filename; do
    if tr -sc '\0' '\n' < "$filename" | tr '\0' Z | grep -qE 'Z{1000}'; then
        # For example, simply print "$filename"
        printf '%s\n' "$filename"
    fi
done < <(find . -type f -name '*.txt' -print0)

Find and grep should work just fine:
grep -E "(\0)\1{1000}" <file name>
if it's a single file or a group of files in the same directory.
If you want to search throughout the system there's:
find /dir/ -exec grep -E "(\0)\1{1000}" {} \; 2> /dev/null
This is very slow, though. If you're looking for something faster and can do without an exact run of a thousand (or however many) zeros, I'd suggest replacing the grep with grep 000000000* instead.
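Note that GNU grep classifies files containing NUL bytes as binary and prints only "Binary file ... matches". Since just the names are wanted here, a variant along these lines may work better, assuming a GNU grep built with PCRE support (-P), where \x00 denotes a NUL byte, -r recurses, and -l lists the matching file names:
grep -rPl '\x00{1000}' /dir/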

Related

Linux: what's a fast way to find all duplicate files in a directory?

I have a directory with many subdirs and about 7000+ files in total. I need to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirs and may or may not have the same file name. A duplicate is a file for which you get a 0 return code from the diff command.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs and not very efficient:
for f in `find /path/to/root/folder -type f`
do
    for g in `find /path/to/root/folder -type f`
    do
        if [ "$f" = "$g" ]
        then
            continue
        fi
        diff "$f" "$g" > /dev/null
        if [ $? -eq 0 ]
        then
            echo "$f" MATCHES "$g"
        fi
    done
done
Is there a more efficient way to do it?
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files, NUL-terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in the listing.txt file.
Grab just the sums and pass them through sort, then uniq -c to count occurrences.
Use awk to keep the sums that occur more than once (saved in dups.txt), then grep to find each sum and filename again in the listing.
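The same idea works in one pass without the temporary files. This is just a sketch; it assumes GNU md5sum's "SUM  NAME" output format and prints every file whose checksum occurs more than once:
find files/ -type f -print0 | xargs -0 md5sum | sort | awk '
    {
        sum = $1                                  # first column is the checksum
        name = $0; sub(/^[^ ]+ +/, "", name)      # the rest is the file name
        if (sum == prev) {                        # duplicate of the previous sum
            if (!shown) print prevname            # emit the first of the group once
            print name; shown = 1
        } else shown = 0
        prev = sum; prevname = name
    }'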

wc with find. error if space in folders name

I need to calculate folder size in bytes.
If the folder name contains a space, e.g. /folder/with spaces/, then the following command does not work properly:
wc -c `find /folder -type f` | grep total | awk '{print $1}'
failing with:
wc: /folder/with: No such file or directory
wc: spaces/file2: No such file or directory
How can it be done?
Try this line instead (xargs -I{} runs wc once per file, so the per-file counts are summed in awk):
find /folder -type f | xargs -I{} wc -c "{}" | awk '{s += $1} END {print s}'
You need the names individually quoted.
$: while IFS= read -r n;        # read each row verbatim into $n
   do a+=("$n");                # add quoted "$n" to the array
   done < <( find /folder -type f )   # reads find's output as a stream
$: wc -c "${a[@]}" |            # pass wc the quoted names
   sed -n '${ s/ .*//; p; }'    # ignore all but the total line, scrub and print
Compressed to a short couple of lines:
$: while IFS= read -r n; do a+=("$n"); done < <( find /folder -type f )
$: wc -c "${a[@]}" | sed -n '${ s/ .*//; p; }'
This is because bash (unlike zsh) word-splits the result of the command substitution. You could use an array to collect the file names:
files=()
for entry in *
do
    [[ -f $entry ]] && files+=("$entry")
done
wc -c "${files[@]}" | grep .....

Using AWK to sort lines and columns together

I'm doing an assignment where I've been asked to find files between a certain size and sha1sum them. I've specifically been asked to provide the output in the following format:
Filename Sha1 File Size
all on one line. I'm not able to create additional files.
I've tried the following:
for listed in $(find /my/output -type f -size -20000c)
do
    sha1sum "$listed" | awk '{print $1}'
    ls -l "$listed" | awk '{print $9 $5}'
done
which gives me the required output fields, but not in the requested format, i.e.
sha1sum
filename filesize
Could anyone suggest a manner in which I'd be able to get all of this on a single line?
Thank you :)
If you use the stat command to avoid needing to parse the output of ls, you can simply echo all the values you need:
while IFS= read -r -d '' listed
do
    echo "$listed" $(sha1sum "$listed" | awk '{print $1}') $(stat -c "%s" "$listed")
done < <(find /my/output -type f -size -20000c -print0)
Check your version of stat, though; the above is GNU. On OS X, for example, it would be:
stat -f "%z" "$listed"
With a single pipeline:
find /my/path -type f -size -20000c -printf "%s " -exec sha1sum {} \; | awk '{ print $3,$2,$1 }'
Example output (from a local test) in the requested FILENAME SHA1 FILESIZE format:
./GSE11111/GSE11111_RAW.tar 9ed615fcbcb0b771fcba1f2d2e0aef6d3f0f8a11 25446400
./artwork_tmp 3a43f1be6648cde0b30fecabc3fa795a6ae6d30a 40010166

File with the most lines in a directory NOT bytes

I'm trying to wc -l an entire directory and then display the filename in an echo with the number of lines.
To add to my frustration, the directory has to come from a passed argument. So without looking stupid, can someone first tell me why a simple wc -l $1 doesn't give me the line count for the directory I type in as the argument? I know I'm not understanding it completely.
On top of that I need validation too, if the argument given is not a directory or there is more than one argument.
wc works on files rather than directories so, if you want the line count for all files in the directory, you would start with:
wc -l $1/*
With various gyrations to get rid of the total line, sort numerically, and extract only the largest, you could end up with something like this (split across multiple lines for readability; the trailing pipes let it still run as typed):
pax> wc -l $1/* 2>/dev/null |
     grep -v ' total$' |
     sort -n -k1 |
     tail -1
2892 target_dir/big_honkin_file.txt
As to the validation, you can check the number of parameters passed to your script with something like:
if [[ $# -ne 1 ]] ; then
    echo 'Whoa! Wrong parameter count'
    exit 1
fi
and you can check if it's a directory with:
if [[ ! -d $1 ]] ; then
    echo 'Whoa!' "[$1]" 'is not a directory'
    exit 1
fi
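Putting those pieces together, a complete script might look like this (assembled from the fragments above):
#!/bin/bash
# print the file with the most lines directly under the given directory
if [[ $# -ne 1 ]] ; then
    echo 'Whoa! Wrong parameter count'
    exit 1
fi
if [[ ! -d $1 ]] ; then
    echo 'Whoa!' "[$1]" 'is not a directory'
    exit 1
fi
wc -l "$1"/* 2>/dev/null | grep -v ' total$' | sort -n -k1 | tail -1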
Is this what you want?
> find ./test1/ -type f | xargs wc -l
1 ./test1/firstSession_cnaiErrorFile.txt
77 ./test1/firstSession_cnaiReportFile.txt
14950 ./test1/exp.txt
1 ./test1/test1_cnaExitValue.txt
15029 total
so your directory, which is the argument, should go here:
find "$your_complete_directory_path" -type f | xargs wc -l
I'm trying to wc -l an entire directory and then display the
filename in an echo with the number of lines.
You can do a find on the directory and use the -exec option to trigger wc -l. Something like this:
$ find ~/Temp/perl/temp/ -exec wc -l '{}' \;
wc: /Volumes/Data/jaypalsingh/Temp/perl/temp/: read: Is a directory
11 /Volumes/Data/jaypalsingh/Temp/perl/temp//accessor1.plx
25 /Volumes/Data/jaypalsingh/Temp/perl/temp//autoincrement.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless1.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//bless2.plx
22 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr1.plx
27 /Volumes/Data/jaypalsingh/Temp/perl/temp//classatr2.plx
7 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee1.pm
18 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee2.pm
26 /Volumes/Data/jaypalsingh/Temp/perl/temp//employee3.pm
12 /Volumes/Data/jaypalsingh/Temp/perl/temp//ftp.plx
14 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit1.plx
16 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit2.plx
24 /Volumes/Data/jaypalsingh/Temp/perl/temp//inherit3.plx
33 /Volumes/Data/jaypalsingh/Temp/perl/temp//persisthash.pm
Nice question!
I saw the answers; some are pretty good. The find ...|xargs one is my favorite. It could be simplified using the find ... -exec wc -l {} + syntax, but there is a problem: whenever the command-line buffer fills up, a separate wc -l ... is invoked, and each invocation prints its own <number> total line. As wc has no option to disable this feature, wc has to be reimplemented; filtering the total lines out with grep is not nice.
So my complete answer is
#!/usr/bin/bash
[ $# -ne 1 ] && echo "Bad number of args">&2 && exit 1
[ ! -d "$1" ] && echo "Not dir">&2 && exit 1
find "$1" -type f -exec awk '{++n[FILENAME]}END{for(i in n) printf "%8d %s\n",n[i],i}' {} +
Or, using less temporary space but with slightly longer awk code:
find "$1" -type f -exec awk 'function pr(){printf "%8d %s\n",n,f}FNR==1{f&&pr();n=0;f=FILENAME}{++n}END{pr()}' {} +
Misc
If it should not descend into subdirectories, add -maxdepth 1 before -type f in the find call.
It is pretty fast. I was afraid it would be much slower than the find ... wc + version, but for a directory containing 14770 files (in several subdirs) the wc version ran in 3.8 s and the awk version in 5.2 s.
awk and wc treat a final line with no terminating \n differently: wc does not count it, awk does. I prefer to count it, as awk does.
It does not print empty files.
To find the file with most lines in the current directory and its subdirectories, with zsh:
lines() REPLY=$(wc -l < "$REPLY")
wc -l -- **/*(D.nO+lines[1])
That defines a lines function which is going to be used as a glob sorting function; it returns, in $REPLY, the number of lines of the file whose path is given in $REPLY.
Then we use zsh's recursive globbing **/* to find regular files (.), numerically (n) reverse-sorted (O) with the lines function (+lines), and select the first one ([1]). (D includes dotfiles and traverses dot-dirs.)
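Run from the parent of the test1 directory shown earlier, it would print something like:
14950 ./test1/exp.txt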
Doing it with standard utilities is a bit tricky if you don't want to make assumptions about what characters file names may contain (like newline, space...). With GNU tools as found on most Linux distributions, it's a bit easier, as they can deal with NUL-terminated lines; in the pipeline below, the two tr stages swap NUL and newline so that sort and head can treat each $size:$file record as one line even if the path itself contains newlines, then swap them back:
find . -type f -exec sh -c '
    for file do
        size=$(wc -l < "$file") &&
            printf "%s\0" "$size:$file"
    done' sh {} + |
    tr '\n\0' '\0\n' |
    sort -rn |
    head -n1 |
    tr '\0' '\n'
Or with zsh or GNU bash syntax:
biggest= max=-1
find . -type f -print0 |
{
    while IFS= read -rd '' file; do
        size=$(wc -l < "$file") &&
            ((size > max)) &&
            max=$size biggest=$file
    done
    [[ -n $biggest ]] && printf '%s\n' "$max: $biggest"
}
Here's one that works for me in Git Bash (mingw32) under Windows:
find . -type f -print0 | xargs -0 wc -l
This will list the files and line counts in the current directory and subdirectories. You can also redirect the output to a text file and import it into Excel if needed:
find . -type f -print0 | xargs -0 wc -l > fileListingWithLineCount.txt

Finding the number of files in a directory for all directories in pwd

I am trying to list all directories and place the number of files each one contains next to it.
I can find the total number of files with ls -lR | grep .*.mp3 | wc -l, but how can I get output like this:
dir1 34
dir2 15
dir3 2
...
I don't mind writing to a text file or CSV to get this information if it's not possible to get it on screen.
Thank you all for any help on this.
This seems to work, assuming you are in a directory where some subdirectories may contain mp3 files. It omits the top-level directory and lists the directories in order by largest number of contained mp3 files:
find . -mindepth 2 -name \*.mp3 -print0 | xargs -0 -n 1 dirname | sort | uniq -c | sort -r | awk '{print $2 "," $1}'
I updated this with -print0 to handle filenames with spaces and other tricky characters, and to print output suitable for CSV.
find . -type f -iname '*.mp3' -printf "%h\n" | uniq -c
Or, if the order (dir -> count instead of count -> dir) is really important to you:
find . -type f -iname '*.mp3' -printf "%h\n" | uniq -c | awk '{print $2" "$1}'
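One caveat: uniq -c only merges adjacent lines. find normally emits a directory's entries together, but sorting first guarantees a single line per directory (a sketch):
find . -type f -iname '*.mp3' -printf '%h\n' | sort | uniq -c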
There are probably much better ways, but this seems to work.
Put this in a shell script:
#!/bin/sh
for f in *
do
    if [ -d "$f" ]
    then
        cd "$f"
        c=`ls -l *.mp3 2>/dev/null | wc -l`
        if test $c -gt 0
        then
            echo "$f $c"
        fi
        cd ..
    fi
done
With Perl:
perl -MFile::Find -le'
    find {
        wanted => sub {
            return unless /\.mp3$/i;
            ++$_{$File::Find::dir};
        }
    }, ".";
    print "$_,$_{$_}" for
        sort {
            $_{$b} <=> $_{$a}
        } keys %_;
'
Here's yet another way, which even handles file names containing unusual (but legal) characters such as newlines:
# count .mp3 files (using GNU find)
find . -xdev -type f -iname "*.mp3" -print0 | tr -dc '\0' | wc -c
# list directories with number of .mp3 files
find "$(pwd -P)" -xdev -depth -type d -exec bash -c '
for ((i=1; i<=$#; i++ )); do
d="${#:i:1}"
mp3s="$(find "${d}" -xdev -type f -iname "*.mp3" -print0 | tr -dc "${0}" | wc -c )"
[[ $mp3s -gt 0 ]] && printf "%s\n" "${d}, ${mp3s// /}"
done
' "'\\0'" '{}' +
