Find file with largest number of lines in single directory - linux

I'm trying to create a function that outputs only the file with the largest number of lines in a directory (and not any sub-directories). I'm being asked to make use of the wc command but don't really understand how to read each file individually and then sort the results just to find the largest. Here is what I have so far:
#!/bin/bash

function sort {
    [ $# -ne 1 ] && echo "Invalid number of arguments" >&2 && exit 1
    [ ! -d "$1" ] && echo "Invalid input: not a directory" >&2 && exit 1
    # Insert function here
}

# prompt if wanting current directory
# if yes
#     sort $PWD
# if no
#     sort $directory

This solution is almost pure Bash (wc is the only external command used):
shopt -s dotglob   # Include filenames with an initial '.' in globs
shopt -s nullglob  # Make globs produce nothing when nothing matches

dir=$1
maxlines=-1
maxfile=

for file in "$dir"/* ; do
    [[ -f $file ]] || continue   # Skip non-files
    [[ -L $file ]] && continue   # Skip symlinks
    numlines=$(wc -l < "$file")
    if (( numlines > maxlines )) ; then
        maxfile=$file
        maxlines=$numlines
    fi
done

[[ -n "$maxfile" ]] && printf '%s\n' "$maxfile"
Remove the shopt -s dotglob if you don't want to process files whose names begin with a dot. Remove the [[ -L $file ]] && continue if you want to process symlinks to files.
This solution should handle all filenames (ones containing spaces, ones containing glob characters, ones beginning with '-', even ones containing newlines, ...), but it runs wc once per file, so it may be unacceptably slow compared to solutions that feed many files to wc at once if you need to handle directories with large numbers of files.
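For comparison, here is a sketch of such a batch approach; it relies on GNU coreutils (wc's --files0-from option is GNU-specific). After the numeric sort, wc's "total" line sorts last, so the largest single file sits on the second-to-last line:
# Sketch only: one wc invocation for all regular files directly in "$dir".
find "$dir" -maxdepth 1 -type f -print0 |
    wc -l --files0-from=- |            # read NUL-separated names from stdin
    sort -n | tail -n 2 | head -n 1    # skip the trailing "total" line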

How about this:
wc -l * | sort -nr | head -2 | tail -1
wc -l counts the lines (you get an error for directories, though), then sort -nr sorts in reverse order treating the first column as a number, then head -2 takes the first two lines, and tail -1 takes the second of those, as we need to skip over the 'total' line.
wc -l * 2>/dev/null | sort -nr | head -2 | tail -1
The 2>/dev/null throws away all the errors if you want neater output.
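For illustration, with a hypothetical directory holding notes.txt (120 lines) and todo.txt (8 lines), the session would look like:
> wc -l * 2>/dev/null | sort -nr | head -2 | tail -1
120 notes.txt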

Use a function like this:
my_custom_sort() {
    for i in "${1+$1/}"*; do            # expands to "$1/"* if an argument was given, else * in the current directory
        [[ -f "$i" ]] && wc -l "$i"     # emit "count filename" for each regular file
    done | sort -n | tail -n1 | cut -d" " -f2   # highest count sorts last; print its name
}
And use it with or without a directory argument (in the latter case, it uses the current directory):
my_custom_sort /tmp
helloworld.txt

Related

Finding index for new folder

I am given a name and I am supposed to make a dir with this name. If this dir already exists, the name of the folder should have _$number as its suffix.
The number is calculated as the highest existing value + 1. Examples:
Name: awesome
Files: dummy awesome awesome_2 awesome_4 dummy_3
New folder: awesome_5
Name: awesome
Files: dummy dummy_1
New folder: awesome
My solution for finding the highest value works only for names without special characters. If the name is, for example, "$#&*!(#)(%+#$ asdasd \ ^ sad", it fails.
function max_item() {
    local prefix="$1"
    local max="0"
    shopt -s nullglob
    for in_file in * ; do
        if [[ "$in_file" =~ ^"$prefix"_(-{0,1}[0-9][0-9]*)$ ]]; then
            num="${BASH_REMATCH[1]}"
            [[ "$max" -lt "$num" ]] && max="$num"
        fi
    done
    echo "$max"
    shopt -u nullglob
    return 0
}
I guess it has something to do with special characters in the regex, but I have exhausted all my ideas.
Since you are looking for a number at the end of the name, prefixed by an _, you could do this instead:
max=0
number='^[[:digit:]]+$'
for in_file in "${prefix}_"* ; do
    num="${in_file##*_}"
    [[ "$num" =~ $number ]] && [[ "$max" -lt "$num" ]] && max="$num"
done
num=$((max + 1))
I have incorporated @Jens' excellent suggestion to loop through just the matching files.
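To finish the original task, the result can feed straight into mkdir. A sketch of my own, not part of the question's code (the bare-name case from the second example is handled separately):
# Assumes the loop above ran with prefix set, e.g. prefix='awesome'
if [[ -d "$prefix" ]]; then
    mkdir "${prefix}_${num}"   # e.g. awesome_5 for the first example
else
    mkdir "$prefix"            # no clash yet: use the bare name
fi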
Looping in shell code is notoriously slow.
For small numbers, codeforester's solution is fine, but starting at around 30 items (the exact number depends on many factors), the external-utility-based solution below will be faster and scale much better.
(For fewer items, an external-utility solution is slower, but that will rarely matter).
The solution below has the added advantage of being more concise:
max_index() {
    printf '%d\n' "$(shopt -s nullglob;
                     printf '%s\n' "$1_"* |
                       awk -F_ '{print $NF}' |
                         sort -rn | head -n 1)"
}
Note: The reasonable assumption is made that your filenames have no embedded newlines.
shopt -s nullglob ensures that if a globbing pattern ("$1_"* in this case) matches nothing, it expands to the null (empty) string.
printf '%s\n' "$1_"* prints all matching filesystem items line by line.
awk -F_ '{print $NF}' outputs the last _-based token on each line, i.e., the trailing number.
Note: cut -d_ -f2 would work too, but makes the assumption that only one _ is present in the filename.
sort -rn sorts the trailing numbers numerically (-n), in reverse (-r).
head -n 1 then extracts only the 1st output line, which is by definition the highest number (if any).
Note that printf '%d\n' '' outputs 0, which is effectively what happens if no existing _<number> suffixes are found.
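A hypothetical session for the question's first example (files awesome, awesome_2 and awesome_4 present):
> max_index awesome
4
> mkdir "awesome_$(( $(max_index awesome) + 1 ))"   # creates awesome_5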

I want to check if some given files contain more than 3 words from an input file in a shell script

My first parameter is the file that contains the given words; the rest are the directories in which I'm searching for files that contain at least 3 of the words from the first parameter.
I can successfully print out the number of matching words, but when testing whether it's greater than 3 I get the error: test: too many arguments
Here's my code:
#!/bin/bash
file=$1
shift 1
for i in $*
do
    for j in `find $i`
    do
        if test -f "$j"
        then
            if test grep -o -w "`cat $file`" $j | wc -w -ge 3
            then
                echo $j
            fi
        fi
    done
done
You first need to execute the grep | wc pipeline and then compare its output with 3, so you need to change your if statement accordingly. Since you are already using backquotes, you cannot nest them, but you can use the equivalent syntax $(command):
if [ $(grep -o -w "`cat $file`" $j | wc -w) -ge 3 ]
then
    echo $j
fi
I believe your problem is that you are trying to get the result of grep -o -w "`cat $file`" $j | wc -w to see if it's greater than or equal to three, but your syntax is incorrect. Try this instead:
if test $(grep -o -w "`cat $file`" $j | wc -w) -ge 3
By putting the grep & wc commands inside the $(), the shell executes those commands and uses the output rather than the text of the commands themselves. Consider this:
> cat words
western
found
better
remember
> echo "cat words | wc -w"
cat words | wc -w
> echo $(cat words | wc -w)
4
> echo "cat words | wc -w gives you $(cat words | wc -w)"
cat words | wc -w gives you 4
>
Note that the $() syntax is equivalent to the backtick notation you're already using for the cat $file command.
Hope this helps!
Your code can be refactored and corrected in a few places.
Have it this way:
#!/bin/bash
input="$1"
shift
for dir; do
    while IFS= read -r -d '' file; do
        if [[ $(grep -woFf "$input" "$file" | sort -u | wc -l) -ge 3 ]]; then
            echo "$file"
        fi
    done < <(find "$dir" -type f -print0)
done
for dir loops through all the positional arguments (the directories).
sort -u removes duplicate words from the output of grep, so the same word isn't counted more than once.
Use wc -l instead of wc -w, since grep -o prints each matching word on its own line.
find ... -print0 takes care of file names that may contain whitespace.
find ... -type f retrieves only regular files, avoiding the need for the -f check later.
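A quick hypothetical test of the refactored script (saved here as matchwords.sh; all file and directory names are made up):
> cat words.txt
western
found
better
remember
> ./matchwords.sh words.txt /tmp/docs
/tmp/docs/article.txt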

Bash - Error: Syntax error: operand expected (error token is "testdir/.hidd1/")

I'm working on a task for uni work where the aim is to count all files and directories within a given directory and then all subdirectories as well. We are forbidden from using find, locate, du or any recursive commands (e.g. ls -R).
To solve this I've tried making my own recursive command and have run into the error above; more specifically, it is line 37: testdir/.hidd1/: syntax error: operand expected (error token is ".hidd1/")
The Hierarchy I'm using
The code for this is as follows:
tgtdir=$1
visfiles=0
hidfiles=0
visdir=0
hiddir=0

function searchDirectory {
    curdir=$1
    echo "curdir = $curdir"
    # Rather than change directory, ensure that each recursive call uses $curdir/NameOfWantedDirectory
    noDir=$(ls -l -A $curdir | grep ^d | wc -l) # Work out the number of directories in the current directory
    echo "noDir = $noDir"
    shopt -s nullglob # Enable nullglob to prevent a null term being added to the array
    directories=(*/ .*/) # Store all directories and hidden directories in the array 'directories'
    shopt -u nullglob # Turn off nullglob to ensure it doesn't later interfere
    echo "${directories[@]}" # Print out the array directories
    y=0 # Declare a variable to act as an index value
    for i in $( ls -d ${curdir}*/ ${curdir}.*/ ); do # Loop through all directories, both visible and hidden
        if [[ "${i:(-3)}" = "../" ]]; then
            echo "Found ./"
            continue;
        elif [[ "${i:(-2)}" = "./" ]]; then
            echo "Found ../"
            continue;
        else # When position i is ./ or ../ the loop advances; otherwise the value is added to directories and y is incremented before the loop advances
            echo "Adding $i to directories"
            directories[y]="$i"
            let "y++"
        fi
    done # Adds all directories except ./ and ../ to the array directories
    echo "${directories[@]}"
    if [[ "${noDir}" -gt "0" ]]; then
        for i in ${directories[@]}; do
            echo "at position i ${directories[$i]}"
            searchDirectory ${directories[$i]} #### <--- line 37 - the error line
        done # Loops through subdirectories to reach the bottom of the hierarchy using recursion
    fi
    visfiles=$(ls -l $tgtdir | grep -v ^total | grep -v ^d | wc -l)
    # Calls ls -l, which puts each file on a new line, then removes the line which states the total and any lines starting with a 'd' (which would be a directory) with grep -v,
    # finally counts all remaining lines using wc -l
    hiddenfiles=$(expr $(ls -l -a $tgtdir | grep -v ^total | grep -v ^d | wc -l) - $visfiles)
    # Finds the total number of files including hidden ones, one per line (using -l and -a (all)), removes the line stating the total as well as any directories, and then counts them.
    # Then stores the number of hidden files as the complete number of files minus the visible files.
    visdir=$(ls -l $tgtdir | grep ^d | wc -l)
    # Counts visible directories by using ls -l, then filtering it with grep to find all lines starting with a 'd', indicating a directory. Then counts the lines with wc -l.
    hiddir=$(expr $(ls -l -a $tgtdir | grep ^d | wc -l) - $visdir)
    # Finds hidden directories as the total number of directories including hidden ones minus the number of visible directories.
    # At minimum this will be 2, as it includes the directories . and ..
    total=$(expr $visfiles + $hiddenfiles + $visdir + $hiddir) # Calculates the total number of files and directories, including hidden ones.
}

searchDirectory $tgtdir

echo "Total Files: $visfiles (+$hiddenfiles hidden)"
echo "Directories Found: $visdir (+$hiddir hidden)"
echo "Total files and directories: $total"

exit 0
Thanks for any help you can give
Line 37 is searchDirectory ${directories[$i]}, as I count. Yes?
Replace the for loop with for i in "${directories[@]}"; do - add double quotes. This will keep each element as its own word.
Replace line 37 with searchDirectory "$i". The for loop gives you each element of the array in i, not each index. Therefore, you don't need to index into directories again - i already holds the word you need.
Also, I note that the echos on lines 22 and 25 are swapped :).
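For reference, since find, du and recursive ls are off-limits for this assignment, a much smaller pure-Bash recursion can do the counting on its own. This is a sketch written for this answer, not a fix of the code above:
#!/bin/bash
# Sketch only: count files and directories recursively without find/du/ls -R.
shopt -s dotglob nullglob   # include hidden names; empty dirs expand to nothing
files=0
dirs=0

count_dir() {
    local entry
    for entry in "$1"/*; do        # '*' never matches . or .., so no infinite loop
        if [[ -d $entry && ! -L $entry ]]; then
            (( ++dirs ))
            count_dir "$entry"     # recurse into the subdirectory
        elif [[ -f $entry ]]; then
            (( ++files ))
        fi
    done
}

count_dir "${1:-.}"
echo "Files: $files"
echo "Directories: $dirs"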

Searching for specific beginning pattern in first lines of files only

I am searching for files containing records that begin with a specific pattern, but I am now running into problems with files (bad data) that contain multiple different values in that position within the file, which should never be the case (the pattern should match every record in the file, but sometimes doesn't). Below is the current code:
echo "Parsing out list of warehouses contained in file set."
( cd $DATA && grep -l '^ 80' * ) >$TEMP/$program.list.whse80.$$
( cd $DATA && grep -l '^ 61' * ) >$TEMP/$program.list.whse61.$$
( cd $DATA && grep -l '^ 68' * ) >>$TEMP/$program.list.whse61.$$
( cd $DATA && grep -l '^ 69' * ) >>$TEMP/$program.list.whse61.$$
( cd $DATA && grep -l '^ 01' * ) >$TEMP/$program.list.whse01.$$
.etc...
What is happening is that when a file contains records beginning with both the 61 pattern (with its nine preceding spaces) and the 01 pattern, the same filename is captured in both the 61 list and the 01 list. I would like to grep only the first line of each file in this manner, as I have other logic to catch mixed files later in my program.
Many thanks in advance for any assistance.
Use head to restrict the search to the top n lines only; for example:
head -3 file | grep ...
To glob over the files you can use a for loop:
for f in *; do if [ -f "$f" ]; then head -1 "$f" | grep ...; fi; done
If you want to output the file name, grep alone won't print it here, since head passes along only the first line. However, you can check grep's exit status and report the file name yourself:
if head -1 "$f" | grep -q pattern; then echo "$f"; fi
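Put together for the question's use case (the 61 pattern with its nine leading spaces; $DATA, $TEMP and $program as in the question):
( cd $DATA &&
  for f in *; do
      if [ -f "$f" ] && head -1 "$f" | grep -q '^         61'; then
          echo "$f"
      fi
  done ) > $TEMP/$program.list.whse61.$$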
Alternatively, you can use awk instead of grep:
for f in *; do if [ -f "$f" ]; then awk 'NR==1 && /pattern/{print FILENAME}' "$f"; fi; done
Replace pattern with your pattern.
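Since awk takes file names on the command line, the per-file shell loop could also be dropped entirely. A sketch assuming GNU awk (for nextfile) and the question's nine-spaces-then-61 pattern:
# Print the name of every file in $DATA whose first line starts with nine spaces and 61
gawk 'FNR == 1 { if (/^ {9}61/) print FILENAME; nextfile }' "$DATA"/*
Here FNR == 1 restricts the test to each file's first line, and nextfile skips straight to the next file without reading the rest.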

How can I count the different file types within a folder using linux terminal?

Hey, I'm stuck on how to count the different file types/extensions recursively in a folder. I also need to print the counts to a .txt file.
For example, I have 10 .txt files and 20 .docx files mixed up across multiple folders.
Help me!
find ./ -type f | awk -F . '{print $NF}' | sort | awk '{count[$1]++} END {for (j in count) print j, "(" count[j] " occurrences)"}'
This gets all filenames with find, then uses awk to pull out the extension, then uses awk again to count the occurrences.
Just with bash (version 4 is required for this code):
#!/bin/bash
shopt -s globstar nullglob
declare -A exts
for f in **/*; do                # with globstar, '**/*' also matches top-level entries
    [[ -f $f ]] || continue      # only count files
    filename=${f##*/}            # remove directories from the pathname
    ext=${filename##*.}
    [[ $filename == "$ext" ]] && ext="no_extension"
    : ${exts[$ext]=0}            # initialize the array element if unset
    (( exts[$ext]++ ))
done
for ext in "${!exts[@]}"; do
    echo "$ext ${exts[$ext]}"
done | sort -k2nr | column -t
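With the question's example mix (10 .txt and 20 .docx files), the output would look something like the lines below, and can be sent to a file by appending > counts.txt to the last line (counts.txt being a name made up here):
docx  20
txt   10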
This one seems unsolved so far, so here is how far I got, counting files and ordering the extensions:
find . -type f | sed -n 's/..*\.//p' | sort -f | uniq -ic
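To also satisfy the requirement of printing the counts to a .txt file, redirect the output (extension_counts.txt is just a made-up name):
find . -type f | sed -n 's/..*\.//p' | sort -f | uniq -ic > extension_counts.txt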
