grep search for pipe term Argument list too long - linux

I have something like
grep ... | grep -f - *orders*
where the first grep ... gives a list of order numbers like
1393
3435
5656
4566
7887
6656
and I want to find those orders in multiple files (a_orders_1, b_orders_3 etc.). These files look something like
1001|strawberry|sam
1002|banana|john
...
However, when the first grep... returns too many order numbers I get the error "Argument list too long".
I also tried to give the grep command one order number at a time using a while loop but that's just way too slow. I did
grep ... | while read order; do grep $order *orders*; done
I'm very new to Unix clearly, explanations would be greatly appreciated, thanks!

The problem is the expansion of *orders* in grep ... | grep -f - *orders*. Your shell expands the pattern to the full list of files before passing that list to grep.
So we need to stop the shell from expanding the whole list of "orders" files onto one command line. The find program is one way to do that, because it takes the quoted pattern as a single argument and matches file names against it itself:
find . -name '*orders*' # note this searches subdirectories too
Now that you know how to generate the list of filenames without running into the command line length limit, you can tell find to execute your second grep:
grep ... | find . -name '*orders*' -exec grep -f - {} +
The {} is where find places the filenames, and the + terminates the command. It also lets find know you're OK with passing multiple arguments to each invocation of grep -f, while still respecting the command line length limit: find invokes grep -f more than once if the list of files exceeds the allowed length of a single command.
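One caveat to hedge on: if find has to split the file list across several grep -f invocations, only the first invocation gets to read the order numbers from the pipe. A sketch of a workaround (the temporary file name here is just an example) is to save the pattern list first:
grep ... > /tmp/order_numbers.txt
find . -name '*orders*' -exec grep -f /tmp/order_numbers.txt {} +
rm /tmp/order_numbers.txt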

Related

How to use xargs with find?

I have a large number of files on disk and am trying to use xargs with find to get the output faster.
find . -printf '%m %p\n'|sort -nr
If I write find . -printf '%m %p\n'|xargs -0 -P 0 sort -nr, it gives the error "argument line is too long". Removing the -0 option gives a different error.
Parallelizing commands such as xargs or GNU parallel are applicable only if the task can be divided into multiple independent jobs, e.g. processing multiple files at once with the same command. A single sort of one input stream cannot be split up with these tools. Although sort has a --parallel option, it may not work well for piped input. (Not fully evaluated.)
As side notes:
The mechanism of xargs is that it reads items (filenames in most cases) from standard input and generates individual commands by combining the command to be executed with those items as arguments. So the syntax .. | xargs .. sort is incorrect here: each filename is passed to sort as an argument, and sort then tries to sort the contents of those files, not the list of names.
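As an illustration (the file names here are made up), the -t option makes xargs print each command it builds before running it, which shows this mechanism:
printf '%s\n' a.txt b.txt | xargs -t sort -nr
# xargs prints and runs: sort -nr a.txt b.txt
# i.e. sort sorts the contents of a.txt and b.txt, not the list of names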
The -0 option tells xargs that input items are delimited by a null character instead of a newline. It is useful when the input filenames contain special characters, including newline characters. To use this feature you need to handle the whole piped stream consistently in that way: add the -print0 option to find and the -z option to sort. Otherwise the items are concatenated incorrectly and you get the "Argument list too long" error.
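For the original listing, a consistently null-delimited pipeline (a sketch assuming GNU find and GNU sort) needs no xargs at all:
# same result as find . -printf '%m %p\n' | sort -nr, but newline-safe
find . -printf '%m %p\0' | sort -z -nr | tr '\0' '\n'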
As an alternative, consider the locate command instead of find. You may need to update the file name database with the updatedb command first; see the locate man page for details.

Calculate the total size of all files from a generated folders list with full PATH

I have a list containing multiple directories with the full PATH:
/mnt/directory_1/sub_directory_1/
/mnt/directory_2/
/mnt/directory_3/sub_directory_3/other_directories_3/
I need to calculate the total size of this list.
From Get total size of a list of files in UNIX
du -ch $file_list | tail -1 | cut -f 1
This was the closest to an answer I could find, but it gave me the following error message:
bash: /bin/du: Argument list too long
Do not use backticks `. Use $(..) instead.
Do not use:
command $(cat something)
this is a common anti-pattern. It works for simple cases, fails for many more, because the result of $(...) undergoes word splitting and filename expansion.
Check your scripts with http://shellcheck.net
If you want to "run a command with arguments from a file", use xargs or write a loop; read https://mywiki.wooledge.org/BashFAQ/001 . xargs also handles "too many arguments" by itself, splitting the list across invocations as needed. And I would add -s to du. Try:
xargs -d'\n' du -sch < file_list.txt | tail -1 | cut -f 1
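One hedged caveat: if file_list.txt is so long that xargs has to run du several times, each run prints its own total and tail -1 keeps only the last one. A sketch that sums the per-directory sizes instead (assuming GNU du, awk and numfmt are available):
xargs -d'\n' du -s --block-size=1 < file_list.txt |
awk '{sum += $1} END {print sum}' |
numfmt --to=iec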

Unable to run cat command in CentOS (argument list too long)

I have a folder with around 300k files, each 2-3 MB in size.
Now I want to run a command in the shell to count occurrences of the character {.
My command:
nohup cat *20200119*| grep "{" | wc -l > /mpt_sftp/mpt_cdr_ocs/file.txt
This works fine with a small number of files.
When I run it in the location that holds all the files (300k of them), it shows
Argument list too long
Would you please try the following:
find . -maxdepth 1 -type f -name "*20200119*" -print0 | xargs -0 grep -F -o "{" | wc -l > /mpt_sftp/mpt_cdr_ocs/file.txt
I have actually tested with 300,000 files of 10-character-long filenames and it is working well.
xargs automatically adjusts the length of argument list fed to grep and we don't need to worry about it. (You can see how the grep command is executed by putting -t option to xargs.)
The -F option drastically speeds up the execution of grep by searching for a fixed string rather than a regex.
The -o option will be needed if the character { appears multiple times in a line and you want to count them individually.
The maximum size of the argument list varies, but it is usually something like 128 KiB or 256 KiB. That means you have an awful lot of files if the *20200119* part is overflowing the maximum argument list. You say you have around 300k files; each file name contains at least the 8-character date string, plus enough other characters to make the name unique, so the list of file names will be far too long for even the largest plausible 'maximum argument list size'.
Note that the nohup cat part of your command is not sensible (see UUoC: Useless Use of Cat); you should be using grep '{' *20200119* to save transferring all that data down a pipe unnecessarily. However, that too would run into problems with the argument list being too long.
You will probably have to use a variant of the following command to get the desired result without overflowing your command line:
find . -depth 1 -name '*20200119*' -exec grep '{' {} + | wc -l
This uses the feature of POSIX find ({} +) that groups as many arguments as will fit on the command line, running grep on large (but not too large) batches of files, and then passes the output of those grep commands to wc. If you're worried about the file names appearing in the output, suppress them with grep's -h option.
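For example, with -h added, that hedged variant would be:
find . -depth 1 -name '*20200119*' -exec grep -h '{' {} + | wc -l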
Or you might use:
find . -depth 1 -name '*20200119*' -exec grep -c -h '{' {} + |
awk '{sum += $1} END {print sum}'
The grep -c -h on macOS produces a simple number (the count of the number of lines containing at least one {) on its standard output for each file listed in its argument list; so too does GNU grep. The awk script adds up those numbers and prints the result.
Using -depth 1 is supported by find on macOS; so too is -maxdepth 1 — they are equivalent. GNU find does not appear to support -depth 1. It would be better to use -maxdepth 1. POSIX find only supports -depth with no number. You'd probably get a better error message from using -maxdepth 1 with a find that only supports POSIX's rather minimal set of options than you would when using -depth 1.
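Putting that together, a sketch of the second pipeline with the more portable -maxdepth 1 would be:
find . -maxdepth 1 -name '*20200119*' -exec grep -c -h '{' {} + |
awk '{sum += $1} END {print sum}'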

Using grep to identify a pattern

I have several documents hosted on a cloud instance. I want to extract all words conforming to a specific pattern into a .txt file. This is the pattern:
ABC123A
ABC123B
ABC765A
and so on. Essentially the words start with a specific character string 'ABC', have a fixed number of numerals, and end with a letter. This is my code:
grep -oh ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
When I execute the query, it runs for several hours without generating any output. I have over 1100 documents. However, when I run this query:
grep -r ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
the list of files with the strings is generated in a matter of seconds.
What do I need to correct in my query? Also, what is causing the delay?
UPDATE 1
Based on the answers, it's evident that the command is missing the file name on which it needs to be executed. I want to run the code on multiple document files (>1000)
The documents I want searched are in multiple sub-directories within a directory. What is a good way to search through them? Doing
grep -roh ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
only returns the file names.
UPDATE 2
If I use the updated code from the answer below:
find . -exec grep -oh "ABC[0-9].*[a-zA-Z]$" >> ~/abcLetterMatches.txt {} \;
I get a "No such file or directory" error
UPDATE 3
The pattern can be anywhere in the line.
You can use this regexp:
grep -E "^ABC[0-9]{3}[A-Z]$" docs
ABC123A
ABC123B
ABC765A
There is no delay; grep is just waiting for the input you didn't give it (with no file argument it reads standard input by default). You can correct your command by supplying a filename argument:
grep -oh "ABC[0-9].*[a-zA-Z]$" file.txt > /home/user/abcLetterMatches.txt
Source (man grep):
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
To perform the same grepping on several files recursively, combine it with find command:
find . -exec grep -oh "ABC[0-9].*[a-zA-Z]$" >> ~/abcLetterMatches.txt {} \;
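A hedged refinement of that command, restricting the search to regular files, batching them with + to respect the argument list limit, and moving the redirection to the end:
find . -type f -exec grep -oh "ABC[0-9].*[a-zA-Z]$" {} + >> ~/abcLetterMatches.txt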
This does what you ask for:
grep -hr '^ABC[0-9]\{3\}[A-Za-z]$'
-h to not get the filenames.
-r to search recursively. If no directory is given (as above) the current one is used. Otherwise just specify one as the last argument.
Quotes around the pattern to avoid accidental globbing, etc.
^ at the beginning of the pattern to — together with $ at the end — only match whole lines. (Not sure if this was a requirement, but the sample data suggests it.)
\{3\} to specify that there should be three digits.
No .* as that would match a whole lot of other things.
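Combined with the redirection from the question, a usage sketch (searching from the current directory) would be:
grep -hr '^ABC[0-9]\{3\}[A-Za-z]$' . > /home/user/abcLetterMatches.txt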

How to tell how many files match description with * in unix

Pretty simple question: say I have a set of files:
a1.txt
a2.txt
a3.txt
b1.txt
And I use the following command:
ls a*.txt
It will return:
a1.txt a2.txt a3.txt
Is there a way in a bash script to tell how many results will be returned when using the * pattern? In the above example, if I were to use a*.txt the answer should be 3, and if I used *1.txt the answer should be 2.
Comment on using ls:
I see all the other answers attempt this by parsing the output of
ls. This is very unpredictable because this breaks when you have
file names with "unusual characters" (e.g. spaces).
Another pitfall is that the result is ls implementation dependent: a particular implementation might format its output differently.
There is a very nice discussion on the pitfalls of parsing ls output on the bash wiki maintained by Greg Wooledge.
Solution using bash arrays
For the above reasons, using bash syntax would be the more reliable option. You can use a glob to populate a bash array with all the matching file names. Then you can ask bash the length of the array to get the number of matches. The following snippet should work.
files=(a*.txt) && echo "${#files[@]}"
To save the number of matches in a variable, you can do:
files=(a*.txt)
count="${#files[@]}"
One more advantage of this method is you now also have the matching files in an array which you can iterate over.
Note: Although I keep repeating bash syntax above, the same approach works in other shells with arrays, such as ksh and zsh; a plain POSIX sh has no arrays, so it will not work there.
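One hedged caveat for bash: if nothing matches the glob, the pattern is left unexpanded and the array holds the literal string a*.txt, giving a count of 1. Enabling nullglob avoids that:
shopt -s nullglob      # unmatched globs expand to nothing
files=(a*.txt)
echo "${#files[@]}"    # prints 0 when there are no matches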
You can't know ahead of time, but you can count how many results are returned. I.e.
ls -l *.txt | wc -l
ls -l lists the directory entries matching the specified wildcard, and wc -l gives you the count.
You can save the value of this command in a shell variable with either
num=$(ls -l *.txt | wc -l)
or
num=`ls -l *.txt | wc -l`
and then use $num to access it. The first form is preferred.
You can use ls in combination with wc:
ls a*.txt | wc -l
The ls command lists the matching files one per line, and wc -l counts the number of lines.
I like suvayu's answer, but there's no need to use an array:
count() { echo $#; }
count *
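With the sample files from the question, usage would look like:
count a*.txt    # prints 3
count *1.txt    # prints 2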
In order to count files that might have unpredictable names, e.g. containing new-lines, non-printable characters etc., I would use the -print0 option of find and awk with RS='\0':
num=$(find . -maxdepth 1 -print0 | awk -v RS='\0' 'END { print NR }')
Adjust the options to find to refine the count, e.g. if the criterion is regular files starting with a lower-case a and having a .txt extension in the current directory, use:
find . -maxdepth 1 -type f -name 'a*.txt' -print0
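Combined with the command substitution from above, a sketch for that specific criterion would be:
num=$(find . -maxdepth 1 -type f -name 'a*.txt' -print0 | awk -v RS='\0' 'END { print NR }')
echo "$num"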
