How to find random files in Linux shell

How can I pick 100 files at random from a directory using the Linux shell? I read in another topic that the 'shuf' command can do this: find . -type f | shuf -n100, but our environments do not have the 'shuf' command. Is there another way to do this with bash, awk, sed or something else?

You can get a directory listing, then randomize it, then pick the top N lines.
ls | sort -R | head -n 100
Replace ls with an appropriate find command if you want a recursive listing or need finer control of the files to be included.
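For example, a recursive variant might look like this (a sketch assuming GNU sort, whose -R option shuffles its input):
find . -type f | sort -R | head -n 100   # sort -R is a GNU extension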

This should work on your CentOS5:
shuf() { awk 'BEGIN{srand()}{print rand()"\t"$0}' "$@" | sort | cut -f2- ;}
This comes from a comment by Meow on https://stackoverflow.com/a/2153889/5844347
Use like so: find . -type f | shuf | head -100

# To get an integer between 1 and 100:
N=$(awk 'BEGIN { srand(); print int(100 * rand()) + 1 }')
echo $N
# To get the Nth file:
find . -type f | head -n "$N" | tail -n 1
# To get 100 files randomly (repeats are possible):
for i in $(seq 1 100); do
    N=$(awk 'BEGIN { srand(); print int(100 * rand()) + 1 }')
    find . -type f | head -n "$N" | tail -n 1
done
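One caveat with the loop above: srand() reseeds from the clock on every iteration, so iterations within the same second will pick the same N. A sketch that seeds once and draws all 100 indices in a single awk pass (it runs find twice, and may print slightly fewer than 100 files if two draws collide):
total=$(find . -type f | wc -l)   # count candidate files once
find . -type f | awk -v total="$total" '
    # seed once, pre-draw 100 random 1-based line numbers
    BEGIN { srand(); for (i = 1; i <= 100; i++) pick[int(total * rand()) + 1] = 1 }
    NR in pick   # print the lines whose numbers were drawn
'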

Related

Use wc on all subdirectories to count the sum of lines

How can I count all lines of all files in all subdirectories with wc?
cd mydir
wc -l *
..
11723 total
man wc suggests wc -l --files0-from=-, but I do not know how to generate the list of all files as NUL-terminated names
find . -print | wc -l --files0-from=-
did not work.
You probably want this:
find . -type f -print0 | wc -l --files0-from=-
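If you only want the final number, GNU wc prints a grand total as its last line when given multiple files, so you can peel that off (a small sketch):
find . -type f -print0 | wc -l --files0-from=- | tail -n 1 | awk '{ print $1 }'   # GNU wc prints the total last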
If you only want the total number of lines, you could use
find . -type f -exec cat {} + | wc -l
Perhaps you are looking for the -exec option of find.
find . -type f -exec wc -l {} \; | awk '{total += $1} END {print total}'
To count all lines for a specific file extension you can use:
find . -name '*.fileextension' | xargs wc -l
If you want it for two or more different types of files, you can add the -o option:
find . -name '*.fileextension1' -o -name '*.fileextension2' | xargs wc -l
Another option would be to use a recursive grep:
grep -hRc '' . | awk '{k+=$1}END{print k}'
The awk simply adds the numbers. The grep options used are:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines. (-c is specified by POSIX.)
-h, --no-filename
Suppress the prefixing of file names on output. This is the
default when there is only one file (or only standard input) to
search.
-R, --dereference-recursive
Read all files under each directory, recursively. Follow all
symbolic links, unlike -r.
The grep, therefore, counts the number of lines matching anything (''), so essentially just counts the lines.
I would suggest something like
find ./ -type f | xargs wc -l | awk '$2 != "total" { total += $1 } END { print total }'
Based on ДМИТРИЙ МАЛИКОВ's answer:
Example for counting lines of java code with formatting:
one liner
find . -name '*.java' -exec wc -l {} \; | awk '{printf ("%3d: %6d %s\n",NR,$1,$2); total += $1} END {printf ("    %6d\n",total)}'
awk part:
{
printf ("%3d: %6d %s\n",NR,$1,$2);
total += $1
}
END {
printf (" %6d\n",total)
}
example result
1: 120 ./opencv/NativeLibrary.java
2: 65 ./opencv/OsCheck.java
3: 5 ./opencv/package-info.java
190
Bit late to the game here, but wouldn't this also work? find . -type f | wc -l
This counts all lines output by the 'find' command. You can fine-tune the 'find' to show whatever you want. I am using it to count the number of subdirectories, in one specific subdir, in a deep tree: find ./*/*/*/*/*/*/TOC -type d | wc -l . Output: 76435. (Just doing a find without all the intervening asterisks yielded an error.)
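If the error came from the shell expanding those globs, GNU find's -mindepth/-maxdepth can express the same seven-level constraint without any shell expansion (a sketch):
find . -mindepth 7 -maxdepth 7 -type d -name TOC | wc -l   # -mindepth/-maxdepth: GNU find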

How to write a Bash program that searches through Directories and lists the files that have the largest numbered suffix?

I am trying to write a program that will search through a main directory's sub-directories and list the files that have the largest number at the end, e.g. filename_100.
find . -name "*_*" | sort -n | tail
sort sorts starting from the beginning of the string, so you can't use it without first splitting off the leading part of the filename. This loop will do that; it prints out the part of the filename after the _, followed by the full filename.
for fn in `find . -name '*_*'`; do
echo "${fn##*_} $fn"
done
Then you can pipe the output to sort and tail to get the largest number, and then to cut to pick out only the filename itself.
for fn in `find . -name '*_*'`; do
echo "${fn##*_} $fn"
done | sort -n | tail -n 1 | cut -d' ' -f 2-
Then you'll need to extract the part of the filename before the underscore. This is probably best done by storing the result of the last part in a variable,
largest_filename="$(for fn in `find . -name '*_*'`; do
echo "${fn##*_} $fn"
done | sort -n | tail -n 1 | cut -d' ' -f 2-)"
after which you can use bash's suffix removal to strip off the part after the underscore, and then list all files that share that prefix.
largest_filename="$(for fn in `find . -name '*_*'`; do
echo "${fn##*_} $fn"
done | sort -n | tail -n 1 | cut -d' ' -f 2-)"
ls ${largest_filename%_*}_*
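For comparison, the sort/tail/cut steps can be condensed into one pipeline (a sketch assuming numeric suffixes and no whitespace in the paths):
find . -name '*_*' | awk -F_ '{print $NF, $0}' | sort -n | tail -n 1 | cut -d' ' -f2-   # prints only the winning path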
tmp1=$(mktemp)
#retrieve largest suffix
find . -name '*_*' | xargs -n 1 basename | awk -F'_' '{print($NF, $0)}' | sort -k1,1 -n -r | awk '{print($1)}' | head -1 > $tmp1
tmp2=$(mktemp)
#retrieve file names containing largest suffix
join -o2.2 -1 1 -2 1 $tmp1 <(find . -name '*_*' | xargs -n 1 basename | awk -F'_' '{print($NF, $0)}' | sort -k1,1 ) > $tmp2
join -o2.1,2.2 -1 1 -2 1 -t"_" $tmp2 <(find . -name '*_*' | xargs -n 1 basename | sort -k1,1 -t"_")

print search term with line count

Hello, bash beginner question. I want to look through multiple files, find the lines that contain a search term, count the number of unique lines in this list and then print into a text file:
the input file name
the search term used
the count of unique lines
so an example output line for file 'Firstpredictoroutput.txt' using search term 'Stop_gained' where there are 10 unique lines in the file would be:
Firstpredictoroutput.txt Stop_gained 10
I can get the unique count for a single file using:
grep 'Search_term' inputfile.txt | uniq -c | wc -l >> output.txt
But I don't know enough yet about implementing loops in pipelines using bash.
All my input files end with *predictoroutput.txt
Any help is greatly appreciated.
Thanks in advance,
Rubal
You can write a function called fun, and call it with two arguments: the file name and the pattern
$ fun() { echo "$1 $2 $(grep -c "$2" "$1")"; }
$ fun input.txt Stop_gained
input.txt Stop_gained 2
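Note that grep -c counts matching lines, not unique ones; if you need the count of unique matching lines, as the question asks, a variant might be (a sketch, assuming "unique" means distinct line content):
$ fun() { echo "$1 $2 $(grep "$2" "$1" | sort -u | wc -l)"; }   # sort -u collapses duplicate lines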
You can use find:
find . -type f -exec sh -c "grep 'Search_term' {} | uniq -c | wc -l >> output.txt" \;
Although you can have issues with weird filenames. You can add more options to find, for example to treat only '.txt' files:
find . -type f -name "*.txt" -exec sh -c "grep 'Search_term' {} | uniq -c | wc -l >> output.txt" \;
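If weird filenames are a real concern, a sketch that passes the filename to sh as an argument instead of splicing {} into the command string, and that also records which file each count came from:
find . -type f -name '*.txt' -exec sh -c '
    # $1 is the current file; print "name term count" on one line
    printf "%s Search_term %s\n" "$1" "$(grep "Search_term" "$1" | sort -u | wc -l)"
' sh {} \; >> output.txt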
q="search for this"
for f in *.txt; do echo "$f $q $(grep $q $f | uniq | wc -l)"; done > out.txt

How to count occurrences of a word in all the files of a directory?

I’m trying to count a particular word occurrence in a whole directory. Is this possible?
Say, for example, there is a directory with 100 files, each of which may have the word “aaa” in it. How would I count the number of “aaa” in all the files under that directory?
I tried something like:
zegrep "xception" `find . -name '*auth*application*' | wc -l
But it’s not working.
grep -roh aaa . | wc -w
Grep recursively all files and directories in the current dir searching for aaa, and output only the matches, not the entire line. Then, just use wc to count how many words there are.
Another solution based on find and grep.
find . -type f -exec grep -o aaa {} \; | wc -l
Should correctly handle filenames with spaces in them.
Use grep in its simplest way. Try grep --help for more info.
To get the count of a word in a particular file:
grep -c <word> <file_name>
Example:
grep -c 'aaa' abc_report.csv
Output:
445
To get the count of a word in the whole directory:
grep -c -R <word>
Example:
grep -c -R 'aaa'
Output:
abc_report.csv:445
lmn_report.csv:129
pqr_report.csv:445
my_folder/xyz_report.csv:408
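To reduce those per-file counts to a single total for the whole directory, you could sum the number after the last colon (a sketch; remember -c counts matching lines, not occurrences):
grep -c -R 'aaa' . | awk -F: '{ total += $NF } END { print total }'   # $NF is the count after the last colon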
Let's use AWK!
$ function wordfrequency() { awk 'BEGIN { FS="[^a-zA-Z]+" } { for (i=1; i<=NF; i++) { word = tolower($i); words[word]++ } } END { for (w in words) printf("%3d %s\n", words[w], w) } ' | sort -rn; }
$ cat your_file.txt | wordfrequency
This lists the frequency of each word occurring in the provided file. If you want to see the occurrences of your word, you can just do this:
$ cat your_file.txt | wordfrequency | grep yourword
To find occurrences of your word across all files in a directory (non-recursively), you can do this:
$ cat * | wordfrequency | grep yourword
To find occurrences of your word across all files in a directory (and its sub-directories), you can do this:
$ find . -type f | xargs cat | wordfrequency | grep yourword
Source: AWK-ward Ruby
find . -type f | xargs perl -pe 's/ /\n/g' | grep aaa | wc -l
cat the files together and grep the output: cat $(find /usr/share/doc/ -name '*.txt') | zegrep -ic '\<exception\>'
if you want 'exceptional' to match, don't use the '\<' and '\>' around the word.
How about starting with:
cat * | sed 's/ /\n/g' | grep '^aaa$' | wc -l
as in the following transcript:
pax$ cat file1
this is a file number 1
pax$ cat file2
And this file is file number 2,
a slightly larger file
pax$ cat file[12] | sed 's/ /\n/g' | grep 'file$' | wc -l
4
The sed converts spaces to newlines (you may want to include other space characters as well such as tabs, with sed 's/[ \t]/\n/g'). The grep just gets those lines that have the desired word, then the wc counts those lines for you.
Now there may be edge cases where this script doesn't work but it should be okay for the vast majority of situations.
If you wanted a whole tree (not just a single directory level), you can use something like:
( find . -name '*.txt' -exec cat {} ';' ) | sed 's/ /\n/g' | grep '^aaa$' | wc -l
There's also a grep regex syntax for matching words only:
# based on Carlos Campderrós solution posted in this thread
man grep | less -p '\<'
grep -roh '\<aaa\>' . | wc -l
For a different word matching regex syntax see:
man re_format | less -p '\[\[:<:\]\]'
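If you have GNU grep, its -w flag expresses the same whole-word constraint more tersely (a sketch):
grep -rohw 'aaa' . | wc -l   # -w: GNU grep whole-word match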

Combining greps to make script to count files in folder

I need some help combining elements of scripts to form a readable output.
Basically I need to get the user name from the folder structure listed below, and count the number of lines in that user's folder for files of type *.ano.
This is shown in the extract below; note that the username's position in the path is not always the same counting from the front.
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.txt
/home/user/Drive-backup/2011 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/3.ano
/home/user/Drive-backup/2010 Backup/2010 Account/Jan/usernameneedtogrep/user.dir/4.ano
awk -F/ '{print $(NF-2)}'
This will give me the username I need, but I also need to know how many non-blank lines there are in that user's folder for file type *.ano. I have the grep below that works, but I don't know how to put it all together so it can output a file that makes sense.
grep -cv '^[[:space:]]*$' *.ano | awk -F: '{ s+=$2 } END { print s }'
Example output needed
UserA 500
UserB 2
UserC 20
find /home -name '*.ano' | awk -F/ '{print $(NF-2)}' | sort | uniq -c
That ought to give you the number of "*.ano" files per user, given your awk is correct. I often use sort | uniq -c to count the number of instances of a string, in this case a username, as opposed to 'wc -l', which only counts input lines.
Enjoy.
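To sum non-blank lines rather than files per user with the same field extraction, a sketch (assuming your grep supports -H and the directory layout shown in the question):
# field $(NF-2) is the username; the digits after the last ":" are the file's count
find /home -name '*.ano' -exec grep -Hcv '^[[:space:]]*$' {} + |
    awk -F/ '{ n = split($NF, a, ":"); sum[$(NF-2)] += a[n] } END { for (u in sum) print u, sum[u] }'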
Have a look at wc (word count).
To count the number of *.ano files in a directory you can use
find "$dir" -iname '*.ano' | wc -l
If you want to do that for all directories in some directory, you can just use a for loop:
for dir in * ; do
echo "user $dir"
find "$dir" -iname '*.ano' | wc -l
done
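To get each user and count on one line, as in the example output, the loop body can be collapsed (a sketch assuming every top-level entry is a user directory):
for dir in */ ; do
    # "${dir%/}" strips the trailing slash from the glob match
    printf '%s %s\n' "${dir%/}" "$(find "$dir" -iname '*.ano' | wc -l)"
done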
Execute the bash script below from the folder
/home/user/Drive-backup/2010 Backup/2010 Account/Jan
and it will report the number of non-blank lines per user.
#!/bin/bash
# save where we start
base=$(pwd)
# get all top-level dirs, skip '.'; read line by line so dirs with spaces survive
find . \( -type d ! -name . -prune \) | while read -r d; do
    cd "$base"
    cd "$d"
    # search for all files named *.ano and count their non-blank lines
    sum=$(find . -type f -name '*.ano' -exec grep -cv '^[[:space:]]*$' {} \; | awk '{sum+=$0} END {print sum}')
    echo "$d $sum"
done
This might be what you want (untested): requires bash version 4 for associative arrays
declare -A count
cd /home/user/Drive-backup
for userdir in */*/*/*; do
    username=${userdir##*/}
    lines=$(grep -cv '^[[:space:]]*$' "$userdir"/user.dir/*.ano | awk -F: '{sum += $NF} END {print sum}')
    (( count[$username] += lines ))
done
for user in "${!count[@]}"; do
    echo "$user ${count[$user]}"
done
Here's yet another way of doing it (on Mac OS X 10.6):
find -x "$PWD" -type f -iname "*.ano" -exec bash -c '
ar=( "${#%/*}" ) # perform a "dirname" command on every array item
printf "%s\000" "${ar[#]%/*}" # do a second "dirname" and add a null byte to every array item
' arg0 '{}' + | sort -uz |
while IFS="" read -r -d '' userDir; do
# to-do: customize output to get example output needed
echo "$userDir"
basename "$userDir"
find -x "${userDir}" -type f -iname "*.ano" -print0 |
xargs -0 -n 500 grep -hcv '^[[:space:]]*$' | awk '{ s+=$0 } END { print s }'
#xargs -0 -n 500 grep -cv '^[[:space:]]*$' | awk -F: '{ s+=$NF } END { print s }'
printf '%s\n' '----------'
done
