How to tell how many files match description with * in unix - string

Pretty simple question: say I have a set of files:
a1.txt
a2.txt
a3.txt
b1.txt
And I use the following command:
ls a*.txt
It will return:
a1.txt a2.txt a3.txt
Is there a way in a bash script to tell how many results will be returned when using the * pattern? In the above example, if I were to use a*.txt the answer should be 3, and if I used *1.txt the answer should be 2.

Comment on using ls:
I see all the other answers attempt this by parsing the output of
ls. This is unreliable because it breaks when you have
file names with "unusual characters" (e.g. spaces).
Another pitfall is that it depends on the ls implementation. A
particular implementation might format its output differently.
There is a very nice discussion on the pitfalls of parsing ls output on the bash wiki maintained by Greg Wooledge.
Solution using bash arrays
For the above reasons, using bash syntax would be the more reliable option. You can use a glob to populate a bash array with all the matching file names. Then you can ask bash the length of the array to get the number of matches. The following snippet should work.
files=(a*.txt) && echo "${#files[@]}"
To save the number of matches in a variable, you can do:
files=(a*.txt)
count="${#files[@]}"
One more advantage of this method is you now also have the matching files in an array which you can iterate over.
Note: Although I keep repeating bash syntax above, the same array syntax should also work in ksh and zsh. Arrays are not part of POSIX sh, however, so this will not work in shells such as dash.
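A minimal sketch of the array approach wrapped in a function, assuming bash. The function name count_matches and its two parameters are made up for illustration. It also turns on nullglob so that a pattern with no matches yields 0 rather than 1.

```bash
# Sketch: count glob matches with a bash array.
# count_matches, dir and pattern are hypothetical names, not from the answer above.
count_matches() {
    local dir=$1 pattern=$2 files
    shopt -s nullglob              # an unmatched pattern expands to nothing
    files=( "$dir"/$pattern )      # $pattern is left unquoted on purpose so it globs
    shopt -u nullglob
    echo "${#files[@]}"
}

demo=$(mktemp -d)
touch "$demo/a1.txt" "$demo/a2.txt" "$demo/a3.txt" "$demo/b1.txt"

a_count=$(count_matches "$demo" 'a*.txt')
one_count=$(count_matches "$demo" '*1.txt')
none_count=$(count_matches "$demo" 'z*.txt')
echo "$a_count $one_count $none_count"

rm -rf "$demo"
```

Because the pattern is expanded unquoted, a pattern containing spaces would be word-split; for such cases the glob is better written literally in the script.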

You can't know ahead of time, but you can count how many results are returned. I.e.
ls -l *.txt | wc -l
ls -l will display the directory entries matching the specified wildcard, wc -l will give you the count.
You can save the value of this command in a shell variable with either
num=$(ls -l *.txt | wc -l)
or
num=`ls -l *.txt | wc -l`
and then use $num to access it. The first form is preferred.

You can use ls in combination with wc:
ls a*.txt | wc -l
The ls command lists the matching files one per line, and wc -l counts the number of lines.

I like suvayu's answer, but there's no need to use an array:
count() { echo $#; }
count *
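One caveat is worth seeing in action: with default shell settings an unmatched pattern is passed through literally, so the function reports 1 instead of 0. A small sketch, with made-up file names in a temporary directory:

```bash
# Sketch: the positional-parameter trick, and its behaviour when nothing matches.
count() { echo "$#"; }

demo=$(mktemp -d)
touch "$demo/a1.txt" "$demo/a2.txt" "$demo/b1.txt"

# The glob is expanded inside the subshell, after the cd.
matched=$(cd "$demo" && count a*.txt)
unmatched=$(cd "$demo" && count z*.txt)   # pattern stays literal: one argument
echo "matched=$matched unmatched=$unmatched"

rm -rf "$demo"
```

In bash, shopt -s nullglob makes the unmatched case report 0.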

In order to count files that might have unpredictable names, e.g. containing new-lines, non-printable characters etc., I would use the -print0 option of find and awk with RS='\0':
num=$(find . -maxdepth 1 -print0 | awk -v RS='\0' 'END { print NR }')
Adjust the options to find to refine the count, e.g. if the criteria is files starting with a lower-case a with .txt extension in the current directory, use:
find . -maxdepth 1 -type f -name 'a*.txt' -print0
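As a rough illustration of why the NUL-delimited count matters, the sketch below creates a file whose name contains a newline. It counts the NUL bytes with tr and wc instead of awk, which is a swapped-in variant that does not depend on how a particular awk treats RS='\0'. The file names are made up, and the $'\n' quoting assumes bash.

```bash
# Sketch: count files safely even when a name contains a newline.
demo=$(mktemp -d)
touch "$demo/a1.txt" "$demo/a2.txt" "$demo/a"$'\n'"3.txt"

# One NUL byte is emitted per file; keep only the NULs and count them.
safe=$(find "$demo" -maxdepth 1 -type f -name 'a*.txt' -print0 | tr -cd '\0' | wc -c)

# Line-based counting sees the embedded newline as an extra entry.
naive=$(ls "$demo" | wc -l)
echo "find -print0: $safe, ls | wc -l: $naive"

rm -rf "$demo"
```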

Related

Bash script that counts and prints out the files that start with a specific letter

How do I print out all the files in the current directory that start with the letter "k"? I also need to count these files.
I tried some methods, but I only got errors or wrong output. I'm really stuck on this as a newbie in bash.
Try this Shellcheck-clean pure POSIX shell code:
count=0
for file in k*; do
if [ -f "$file" ]; then
printf '%s\n' "$file"
count=$((count+1))
fi
done
printf 'count=%d\n' "$count"
It works correctly (just prints count=0) when run in a directory that contains nothing starting with 'k'.
It doesn't count directories or other non-files (e.g. fifos).
It counts symlinks to files, but not broken symlinks or symlinks to non-files.
It works with 'bash' and 'dash', and should work with any POSIX-compliant shell.
Here is a pure Bash solution.
files=(k*)
printf "%s\n" "${files[@]}"
echo "${#files[@]} files total"
The shell expands the wildcard k* into the array, thus populating it with a list of matching files. We then print out the array's elements, and their count.
The use of an array avoids the various problems with metacharacters in file names (see e.g. https://mywiki.wooledge.org/BashFAQ/020), though the syntax is slightly hard on the eyes.
As remarked by pjh, this will include any matching directories in the count, and fail in odd ways if there are no matches (unless you set nullglob to true). If avoiding directories is important, you basically have to get the directories into a separate array and exclude those.
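One possible way to handle both caveats in bash is sketched below: nullglob takes care of the no-match case, and a second array collects only the regular files. The array name regular and the sample file names are assumptions for illustration.

```bash
# Sketch: count only regular files starting with "k", and cope with no matches.
shopt -s nullglob                  # no matches -> the loop body never runs

demo=$(mktemp -d)
touch "$demo/k1" "$demo/k2"
mkdir "$demo/kdir"

regular=()
for f in "$demo"/k*; do
    [[ -f $f ]] && regular+=( "$f" )   # keep files, drop directories
done

printf '%s\n' "${regular[@]##*/}"      # print the names without the directory part
count=${#regular[@]}
echo "$count files total"

rm -rf "$demo"
```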
To repeat what Dominique also said, avoid parsing ls output.
Demo of this and various other candidate solutions:
https://ideone.com/XxwTxB
To start with: never parse the output of the ls command, but use find instead.
As find basically goes through all subdirectories, you might need to limit that, using the -maxdepth switch, use value 1.
In order to count a number of results, you just count the number of lines in your output (in case your output is shown as one piece of output per line, which is the case of the find command). Counting a number of lines is done using the wc -l command.
So, this comes down to the following command:
find ./ -maxdepth 1 -type f -name "k*" | wc -l
Have fun!
This should work as well:
VAR="k"
COUNT=$(ls -p ${VAR}* | grep -v ":" | wc -w)
echo -e "Total number of files: ${COUNT}\n" 1>&2
echo -e "Files,that begin with ${VAR} are:\n$(ls -p ${VAR}* | grep -v ":" )" 1>&2

grep search for pipe term Argument list too long

I have something like
grep ... | grep -f - *orders*
where the first grep ... gives a list of order numbers like
1393
3435
5656
4566
7887
6656
and I want to find those orders in multiple files (a_orders_1, b_orders_3 etc.), these files look something like
1001|strawberry|sam
1002|banana|john
...
However, when the first grep... returns too many order numbers I get the error "Argument list too long".
I also tried to give the grep command one order number at a time using a while loop but that's just way too slow. I did
grep ... | while read order; do grep $order *orders*; done
I'm very new to Unix clearly, explanations would be greatly appreciated, thanks!
The problem is the expansion of *orders* in grep ... | grep -f - *orders*. Your shell expands the pattern to the full list of files before passing that list to grep.
So we need to pass fewer "orders" files to each grep invocation. The find program is one way to do that, because it accepts wildcards and expands them internally:
find . -name '*orders*' # note this searches subdirectories too
Now that you know how to generate the list of filenames without running into the command line length limit, you can tell find to execute your second grep:
grep ... | find . -name '*orders*' -exec grep -f - {} +
The {} is where find places the filenames, and the + terminates the command and lets find know you're OK with passing multiple arguments to each invocation of grep -f, while still respecting the command line length limit by invoking grep -f more than once if the list of files exceeds the allowed length of a single command.
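One thing to watch with grep -f - here: if find ends up running grep more than once, only the first invocation gets to read the piped patterns, since standard input is consumed by then. A sketch of one way around that, which is my assumption rather than part of the answer above, is to store the order numbers in a temporary file first. The extra sample rows and the variable names are made up.

```bash
# Sketch: keep the patterns in a file so every grep invocation can read them.
demo=$(mktemp -d)
printf '%s\n' '1001|strawberry|sam' '1002|banana|john' > "$demo/a_orders_1"
printf '%s\n' '1393|kiwi|ann' '3435|pear|bob'          > "$demo/b_orders_3"

patterns=$(mktemp)
printf '%s\n' 1393 1002 > "$patterns"    # stands in for the output of the first grep

# -h drops the file-name prefix; sort only makes the output order predictable.
matches=$(find "$demo" -maxdepth 1 -type f -name '*orders*' \
              -exec grep -h -f "$patterns" {} + | sort)
printf '%s\n' "$matches"

rm -rf "$demo" "$patterns"
```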

Unable to run cat command in CentOS (argument list too long)

I have a folder containing around 300k files, each 2-3 MB in size.
Now I want to run a shell command to find the count of the character {.
My command:
nohup cat *20200119*| grep "{" | wc -l > /mpt_sftp/mpt_cdr_ocs/file.txt
This works fine with a small number of files.
When I run it in the location where I have all the files (300k files), it shows:
Argument too long
Would you please try the following:
find . -maxdepth 1 -type f -name "*20200119*" -print0 | xargs -0 grep -F -o "{" | wc -l > /mpt_sftp/mpt_cdr_ocs/file.txt
I have actually tested with 300,000 files of 10-character-long filenames and it is working well.
xargs automatically adjusts the length of argument list fed to grep and we don't need to worry about it. (You can see how the grep command is executed by putting -t option to xargs.)
The -F option drastically speeds-up the execution of grep to search for a fixed string, not a regex.
The -o option will be needed if the character { appears multiple times in a line and you want to count them individually.
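A small sketch of the difference -o makes, using made-up file names and contents in a temporary directory. The -h flag is added here only to drop the file-name prefixes and does not change the counts.

```bash
# Sketch: count occurrences of "{" versus lines containing "{".
demo=$(mktemp -d)
printf '%s\n' '{a}{b}' 'no brace here' > "$demo/x_20200119_1.log"
printf '%s\n' '{c}'                    > "$demo/x_20200119_2.log"
printf '%s\n' '{d}'                    > "$demo/x_20200120_1.log"   # different date, ignored

# With -o, every match is printed on its own line.
occurrences=$(find "$demo" -maxdepth 1 -type f -name '*20200119*' -print0 |
                  xargs -0 grep -F -o -h '{' | wc -l)

# Without -o, each matching line is printed once.
lines=$(find "$demo" -maxdepth 1 -type f -name '*20200119*' -print0 |
            xargs -0 grep -F -h '{' | wc -l)

echo "occurrences=$occurrences lines=$lines"
rm -rf "$demo"
```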
The maximum size of the argument list varies, but it is usually something like 128 KiB or 256 KiB. That means you have an awful lot of files if the *20200119* part is overflowing the maximum argument list. But you say "around 3 lakhs files", which is around 300,000 — each file has at least the 8-character date string in it, plus enough other characters to make the name unique, so the list of file names will be far too long for even the largest plausible 'maximum argument list size'.
Note that the nohup cat part of your command is not sensible (see UUoC: Useless Use of Cat); you should be using grep '{' *20200119* to save transferring all that data down a pipe unnecessarily. However, that too would run into problems with the argument list being too long.
You will probably have to use a variant of the following command to get the desired result without overflowing your command line:
find . -depth 1 -name '*20200119*' -exec grep '{' {} + | wc -l
This uses the feature of POSIX find that groups as many arguments as will fit on the command line without overflowing to run grep on large (but not too large) numbers of files, and then pass the output of the grep commands to wc. If you're worried about the file names appearing in the output, suppress them with the grep -h.
Or you might use:
find . -depth 1 -name '*20200119*' -exec grep -c -h '{' {} + |
awk '{sum += $1} END {print sum}'
The grep -c -h on macOS produces a simple number (the count of the number of lines containing at least one {) on its standard output for each file listed in its argument list; so too does GNU grep. The awk script adds up those numbers and prints the result.
Using -depth 1 is supported by find on macOS; so too is -maxdepth 1 — they are equivalent. GNU find does not appear to support -depth 1. It would be better to use -maxdepth 1. POSIX find only supports -depth with no number. You'd probably get a better error message from using -maxdepth 1 with a find that only supports POSIX's rather minimal set of options than you would when using -depth 1.
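For reference, a sketch of the per-file count variant written with -maxdepth 1, run against made-up sample files. The sum + 0 in the awk END block is a small addition so that an empty input prints 0 rather than an empty line.

```bash
# Sketch: sum per-file counts of lines containing "{".
demo=$(mktemp -d)
printf '%s\n' '{a}{b}' 'plain' '{c}' > "$demo/cdr_20200119_1"
printf '%s\n' '{d}'                  > "$demo/cdr_20200119_2"

total=$(find "$demo" -maxdepth 1 -type f -name '*20200119*' \
            -exec grep -c -h '{' {} + |
        awk '{ sum += $1 } END { print sum + 0 }')
echo "lines containing {: $total"

rm -rf "$demo"
```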

listing file in unix and saving the output in a variable(Oldest File fetching for a particular extension)

This might be a very simple thing for a shell scripting programmer but am pretty new to it. I was trying to execute the below command in a shell script and save the output into a variable
inputfile=$(ls -ltr *.{PDF,pdf} | head -1 | awk '{print $9}')
The command works fine when I fire it from terminal but fails when executed through a shell script (sh). Why is that the command fails, does it mean that shell script doesn't support the command or am I doing it wrong? Also how do I know if a command will work in shell or not?
Just to give you a glimpse of my requirement, I was trying to get the oldest file from a particular directory (I also want to make sure upper case and lower case extensions are handled). Is there any other way to do this ?
The above command will work correctly only if BOTH *.pdf and *.PDF files are present in the current directory.
If you would like to execute it in a directory with only one of those you should consider using e.g.:
inputfiles=$(find . -maxdepth 1 -type f \( -name "*.pdf" -or -name "*.PDF" \) | xargs ls -1tr | head -1 )
NOTE: The above command doesn't work with files with new lines, or with long list of found files.
Parsing ls is always a bad idea. You need another strategy.
How about you make a function that gives you the oldest file among the ones given as argument? the following works in Bash (adapt to your needs):
get_oldest_file() {
# get oldest file among files given as parameters
# return is in variable get_oldest_file_ret
local oldest f
for f do
[[ -e $f ]] && [[ ! $oldest || $f -ot $oldest ]] && oldest=$f
done
get_oldest_file_ret=$oldest
}
Then just call as:
get_oldest_file *.{PDF,pdf}
echo "oldest file is: $get_oldest_file_ret"
Now, you probably don't want to use brace expansions like this at all. In fact, you very likely want to use the shell options nocaseglob and nullglob:
shopt -s nocaseglob nullglob
get_oldest_file *.pdf
echo "oldest file is: $get_oldest_file_ret"
If you're using a POSIX shell, it's going to be a bit trickier to have the equivalent of nullglob and nocaseglob.
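A rough sketch of what that could look like: a bracket pattern stands in for nocaseglob, and an existence test stands in for nullglob. The function name oldest_pdf is hypothetical. The -ot test operator is supported by bash, dash and ksh, but treating it as available in every sh is an assumption.

```bash
# Sketch: oldest *.pdf (any letter case) without bash-only shell options.
oldest_pdf() {
    # $1: directory to search (hypothetical parameter)
    oldest=
    for f in "$1"/*.[pP][dD][fF]; do
        [ -e "$f" ] || continue      # skips the literal pattern when nothing matches
        if [ -z "$oldest" ] || [ "$f" -ot "$oldest" ]; then
            oldest=$f
        fi
    done
    printf '%s\n' "$oldest"
}

demo=$(mktemp -d)
touch -t 202001010000 "$demo/old.PDF"
touch -t 202101010000 "$demo/mid.pdf"
touch -t 202201010000 "$demo/new.Pdf"

result=$(oldest_pdf "$demo")
echo "oldest file is: ${result##*/}"

rm -rf "$demo"
```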
Is perl an option? It's ubiquitous on Unix.
I would suggest:
perl -e 'print ((sort { -M $b <=> -M $a } glob ( "*.{pdf,PDF}" ))[0]);';
Which:
uses glob to fetch all files matching the pattern.
sort, using -M which is relative modification time. (in days).
fetches the first element ([0]) off the sort.
Prints that.
As @gniourf_gniourf says, parsing ls is a bad idea, as are leaving globs unquoted and generally not accounting for funny characters in file names.
find is your friend:
#!/bin/sh
get_oldest_pdf() {
#
# echo path of oldest *.pdf (case-insensitive) file in current directory
#
find . -maxdepth 1 -mindepth 1 -iname "*.pdf" -printf '%T@ %p\n' \
| sort -n \
| head -n 1 \
| cut -d' ' -f2-
}
whatever=$(get_oldest_pdf)
Notes:
find has numerous ways of formatting the output, including
things like access time and/or write time. I used '%T@ %p\n',
where %T@ is the last write time in UNIX time format, including the fractional part.
This will never contain a space, so it's safe to use as a separator.
Numeric sort and head get the first item, i.e. the oldest, sorting by the time;
cut removes the time from the output.
I used pipe notation with line continuations (\), which is IMO much easier to read and maintain.
The shell code itself should run in any POSIX shell, although -printf is a GNU find extension.
You could easily adjust the function to parametrize the pattern,
time used (access/write), control the search depth or starting dir.
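A sketch of such a parametrized version, assuming GNU find for -printf. The function name oldest_match and its two parameters are hypothetical, and like any line-based pipeline it will not cope with newlines in file names.

```bash
# Sketch: oldest entry matching a case-insensitive pattern in a given directory.
oldest_match() {
    # $1: starting directory, $2: name pattern (both hypothetical parameters)
    find "$1" -maxdepth 1 -mindepth 1 -iname "$2" -printf '%T@ %p\n' \
        | sort -n \
        | head -n 1 \
        | cut -d' ' -f2-
}

demo=$(mktemp -d)
touch -t 202001010000 "$demo/old.PDF"
touch -t 202101010000 "$demo/mid.pdf"
touch -t 202201010000 "$demo/new.txt"

result=$(oldest_match "$demo" '*.pdf')
echo "oldest file is: ${result##*/}"

rm -rf "$demo"
```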

Listing entries in a directory using grep

I'm trying to list all entries in a directory whose names contain ONLY upper-case letters. Directories need "/" appended.
#!/bin/bash
cd ~/testfiles/
ls | grep -r *.*
Since grep by default looks for upper-case letters only (right?), I'm just recursively searching through the directories under testfiles for all names who contain only upper-case letters.
Unfortunately this doesn't work.
As for appending directories, I'm not sure why I need to do this. Does anyone know where I can start with some detailed explanations on what I can do with grep? Furthermore how to tackle my problem?
No, grep does not only consider uppercase letters.
Your question is a bit unclear, for example:
from your usage of the -r option, it seems you want to search recursively, however you don't say so. For simplicity I assume you don't need to; consider looking into @twm's answer if you need recursion.
you want to look for uppercase (letters) only. Does that mean you don't want to accept any other (non-letter) characters that are still valid in file names (like digits, dashes, dots, etc.)?
since you don't say that it is not permissible to have only one file per line, I am assuming it is OK (thus using ls -1).
The naive solution would be:
ls -1 | grep "^[[:upper:]]\+$"
That is, print all lines containing only uppercase letters. In my TEMP directory that prints, for example:
ALLBIG
LCFEM
WPDNSE
This however would exclude files like README.TXT or FILE001, which depending on your requirements (see above) should most likely be included.
Thus, a better solution would be:
ls -1 | grep -v "[[:lower:]]\+"
That is, print all lines not containing a lowercase letter. In my TEMP directory that prints, for example:
ALLBIG
ALLBIG-01.TXT
ALLBIG005.TXT
CRX_75DAF8CB7768
LCFEM
WPDNSE
~DFA0214428CD719AF6.TMP
Finally, to "properly mark" directories with a trailing '/', you could use the -F (or --classify) option.
ls -1F | grep -v "[[:lower:]]\+"
Again, example output:
ALLBIG
ALLBIG-01.TXT
ALLBIG005.TXT
CRX_75DAF8CB7768
LCFEM/
WPDNSE/
~DFA0214428CD719AF6.TMP
Note that a different option would be to use find (e.g. find ! -regex ".*[a-z].*"), if you can live with its different output format.
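For comparison, here is a sketch that avoids parsing ls altogether by looping over a glob and filtering with a case pattern. The function name list_no_lowercase and the sample entries are made up. Like plain ls, it skips hidden entries.

```bash
# Sketch: list entries whose names contain no lowercase letter, "/" after directories.
list_no_lowercase() {
    # $1: directory to list (hypothetical parameter)
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue      # empty directory: the pattern stays literal
        name=${entry##*/}
        case $name in
            *[[:lower:]]*) continue ;;   # skip names with any lowercase letter
        esac
        if [ -d "$entry" ]; then
            printf '%s/\n' "$name"
        else
            printf '%s\n' "$name"
        fi
    done
}

demo=$(mktemp -d)
touch "$demo/ALLBIG" "$demo/ALLBIG-01.TXT" "$demo/README.txt"
mkdir "$demo/LCFEM" "$demo/lower"

listing=$(list_no_lowercase "$demo")
printf '%s\n' "$listing"

rm -rf "$demo"
```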
The exact regular expression depend on the output format of your ls command. Assuming that you do not use an alias for ls, you can try this:
ls -R | grep -o -w "[A-Z]*"
Note that with -R, ls will recursively list directories and files under the current directory. The grep option -o tells grep to print only the matched part of the text. The -w option tells grep to match only whole words. The "[A-Z]*" is a regexp that matches only upper-case words.
Note that this regexp will match TEST.txt (printing TEST) as well as TEXT.TXT (printing TEXT and TXT on separate lines). In other words, it only considers the parts of a name that consist of upper-case letters.
It's ls which lists the files, not grep, so that is where you need to specify that you want "/" appended to directories. Use ls --classify to append "/" to directories.
grep is used to process the results from ls (or some other source, generally speaking) and only show lines that match the pattern you specify. It is not limited to uppercase characters. You can limit it to just upper-case characters and "/" with grep -E '^[A-Z/]*$' or, if you also want numbers, periods, etc., you could instead filter out lines that contain lowercase characters with grep -v -E '[a-z]'.
As grep is not the program which lists the files, it is not where you want to perform the recursion. ls can list paths recursively if you use ls -R. However, you're just going to get the last component of the file paths that way.
You might want to consider using find to handle the recursion. This works for me:
find . -exec ls -d --classify {} \; | egrep -v '[a-z][^/]*/?$'
I should note, using ls --classify to append "/" to the end of directories may also append some other characters to other types of paths that it can classify. For instance, it may append "*" to the end of executable files. If that's not OK, but you're OK with listing directories and other paths separately, this could be worked around by running find twice - once for the directories and then again for other paths. This works for me:
find . -type d | egrep -v '[a-z][^/]*$' | sed -e 's#$#/#'
find . -not -type d | egrep -v '[a-z][^/]*$'

Resources