Grep multiple expressions in one pass, output matches to each expression in separate file - linux

I want to use grep on the top utility. Iterating through top 5 times, here are my criteria:
grep two different expressions in a single pass on the top output
expression 1: grep for the line on overall cpu usage, then output it to file: cpu_stats.txt
expression 2: grep for the line on overall memory usage, then output it to file: memory_stats.txt
Here is what I have now:
top -b -n 5 | egrep "\%Cpu\(s\):|KiB Mem :" > both_cpu_and_memory.txt
This successfully grabs the desired top output, but notice it is putting both expression matches in the exact same file.
Where I am stuck: I do not know how to, in a single pass, output the matches from one expression to one file, and how to output the matches from the other expression to another file.
Is this possible? Is it possible, in one pass, to grep for multiple expressions and have the matches for each expression written to a separate file?

Unless you do something convoluted with saving the output in a temporary file and making a couple of passes over it, you can't do what you want with grep. It's really easy with awk, though:
top -b -n 5 | awk '/%Cpu\(s\):/ { print > "cpu_stats.txt" }
/KiB Mem :/ { print > "memory_stats.txt" }'

grep cannot do what you ask. This is a job for awk, or indeed any other scripting language capable of parsing text or using regular expressions.
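For completeness, sed can also do the split in one pass, since its w command writes each matching line to a named file. A minimal sketch, assuming GNU sed and the same two top patterns as above:
top -b -n 5 | sed -n -e '/%Cpu(s):/w cpu_stats.txt' -e '/KiB Mem :/w memory_stats.txt'
With -n nothing is printed to the terminal; each matching line goes only to its own file.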

Related

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:
xxx
(null)
yyy
Now, I am trying to prefix my search terms to the output, like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know exactly what your input file contains (it is not specified properly in the question), I will make some hypotheses in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns that you are looking for (here, 1 and 2).
REMARK:
Do not cat the file and then pipe it to grep, since this does the work twice! Your grep command can read the file itself, so do not read it twice; it might not matter much on small files, but you will feel the difference on 10 GB files, for example!
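Applied to the command from the question, the remark just means dropping the cat and letting egrep read the file directly (your pattern and cut are kept exactly as they were):
$ egrep -i "abc|def|efg" Output.txt | cut -d ':' -f 2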
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing, here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format; with your pattern that could be located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
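If the one-liner is hard to read, here is the same script spread over several lines, purely for readability (a sketch, functionally identical to the command above):
$ awk 'BEGIN { FS = ":"; ORS = FS }
{
    tmp = 0
    # scan the fields until one of the patterns is found
    for (i = 1; i <= NF; i++) {
        tmp = match($i, /abc|def|efg/)
        if (tmp) { print $i; break }   # prints the matched field followed by ":"
    }
    if (tmp) printf "%s\n", $2         # then the 2nd field and a newline
}' grep_file2.in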
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} is used to select only the lines that contain one of the patterns provided, and then execute the instructions in the block. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the pattern space with the matched pattern followed by a colon. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold space and extracts the 2nd column element. Last but not least, H;x;s/\n//p regroups both extracted elements on one line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
This will print the match and the next token after the delimiter.
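Run against the grep_file.in sample shown earlier, that gives:
$ egrep -io "(abc|def|efg):[^:]*" grep_file.in
abc:xxx
def:
efg:yyy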
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

Print previous line if condition is met

I would like to grep a word, then find the second column in the line and check if it is bigger than a value. If yes, I want to print the previous line.
Ex:
Input file
AAAAAAAAAAAAA
BB 2
CCCCCCCCCCCCC
BB 0.1
Output
AAAAAAAAAAAAA
Now, I want to search for BB and if the second column (2 or 0.1) in that line is bigger than 1, I want to print the previous line.
Can somebody help me with grep and awk? Any other suggestions are also welcome. Thanks.
This can be a way:
$ awk '$1=="BB" && $2>1 {print f} {f=$1}' file
AAAAAAAAAAAAA
Explanation
$1=="BB" && $2>1 {print f} if the 1st field is exactly BB and 2nd field is bigger than 1, then print f, a stored value.
{f=$1} stores the first field of the current line in f, so that it is accessible when reading the next line (in this example the previous line has only one field; use f=$0 if you want to keep the whole line).
Another option: reverse the file and print the next line if the condition matches:
tac file | awk '$1 == "BB" && $2 > 1 {getline; print}' | tac
Concerning generality
I think it needs to be mentioned that the most general solution to this class of problem involves two passes:
the first pass to add a decimal row number ($REC) to the front of each line, effectively grouping lines into records by $REC
the second pass to trigger on the first instance of each new value of $REC as a record boundary (resetting $CURREC), thereafter rolling along in the native AWK idiom concerning the records to follow matching $CURREC.
In the intermediate file, some sequence of decimal digits followed by a separator (for human reasons, typically an added tab or space) is parsed (aka conceptually snipped off) as out-of-band with respect to the baseline file.
Command line paste monster
Even confined to the command line, it's an easy matter to ensure that the intermediate file never hits disk. You just need to use an advanced shell such as ZSH (my own favourite) which supports process substitution:
paste <( <input.txt awk "BEGIN { R=0; N=0; } /Header pattern/ { N=1; } { R=R+N; N=0; print R; }" ) input.txt | awk -f yourscript.awk
Let's render that one-liner more suitable for exposition:
P="/Header pattern/"
X="BEGIN { R=0; N=0; } $P { N=1; } { R=R+N; N=0; print R; }"
paste <( <input.txt awk "$X" ) input.txt | awk -f yourscript.awk
This starts three processes: the trivial inline AWK script, paste, and the AWK script you really wanted to run in the first place.
Behind the scenes, the <() command line construct creates a named pipe and passes the pipe name to paste as the name of its first input file. For paste's second input file, we give it the name of our original input file (this file is thus read sequentially, in parallel, by two different processes, which will consume between them at most one read from disk, if the input file is cold).
The magic named pipe in the middle is an in-memory FIFO that ancient Unix probably managed at about 16 kB of average size (intermittently pausing the paste process if the yourscript.awk process is sluggish in draining this FIFO back down).
Perhaps modern Unix throws a bigger buffer in there because it can, but it's certainly not a scarce resource you should be concerned about, until you write your first truly advanced command line with process redirection involving these by the hundreds or thousands :-)
Additional performance considerations
On modern CPUs, all three of these processes could easily find themselves running on separate cores.
The first two of these processes border on the truly trivial: an AWK script with a single pattern match and some minor bookkeeping, paste called with two arguments. yourscript.awk will be hard pressed to run faster than these.
What, your development machine has no lightly loaded cores to render this master shell-master solution pattern almost free in the execution domain?
Ring, ring.
Hello?
Hey, it's for you. 2018 just called, and wants its problem back.
2020 is officially the reprieve of MTV: That's the way we like it, magic pipes for nothing and cores for free. Not to name out loud any particular TLA chip vendor who is rocking the space these days.
As a final performance consideration, if you don't want the overhead of parsing actual record numbers:
X="BEGIN { N=0; } $P { N=1; } { print N; N=0; }"
Now your in-FIFO intermediate file is annotated with just an additional two characters prepended to each line ('0' or '1' and the default separator character added by paste), with '1' demarking first line in record.
Named FIFOs
Under the hood, these are no different than the magic FIFOs instantiated by Unix when you write any normal pipe command:
cat file | proc1 | proc2 | proc3
Three unnamed pipes (and a whole process devoted to cat you didn't even need).
It's almost unfortunate that the truly exceptional convenience of the default stdin/stdout streams as premanaged by the shell obscures the reality that paste $magictemppipe1 $magictemppipe2 bears no additional performance considerations worth thinking about, in 99% of all cases.
"Use the <() Y-joint, Luke."
Your instinctive reflex toward natural semantic decomposition in the problem domain will herewith benefit immensely.
If anyone had had the wits to name the shell construct <() as the YODA operator in the first place, I suspect it would have been pressed into universal service at least a solid decade ago.
Combining sed & awk you get this:
sed 'N;s/\n/ /' < file | awk '$3>1{print $1}'
sed 'N;s/\n/ /' : combines the 1st and 2nd lines, replacing the newline character with a space
awk '$3>1{print $1}' : prints $1 (the 1st column) if $3 (the 3rd column's value) is > 1
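For reference, the intermediate output of the sed stage on the sample input above is what the awk filter then sees:
$ sed 'N;s/\n/ /' file
AAAAAAAAAAAAA BB 2
CCCCCCCCCCCCC BB 0.1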

How to determine if the content of one file is included in the content of another file

First, my apologies for what is perhaps a rather stupid question that doesn't quite belong here.
Here's my problem: I have two large text files containing a lot of file names, let's call them A and B, and I want to determine if A is a subset of B, disregarding order, i.e. for each file name in A, check whether it is also in B; if any is not, A is not a subset.
I know how to preprocess the files (to remove anything but the file name itself, removing different capitalization), but now I'm left to wonder if there is a simple way to perform the task with a shell command.
Diff probably doesn't work, right? Even if I 'sort' the two files first, so that at least the files that are present in both will be in the same order, since A is probably a proper subset of B, diff will just tell me that every line is different.
Again, my apologies if the question doesn't belong here, and in the end, if there is no easy way to do it I will just write a small program to do the job, but since I'm trying to get a better handle on the shell commands, I thought I'd ask here first.
Do this:
cat b | sort -u | wc
cat a b | sort -u | wc
If you get the same result, a is a subset of b.
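A variation on the same idea, if your shell supports process substitution (bash/zsh): comm -23 prints the lines that appear only in its first input, so empty output means a is a subset of b. A sketch:
comm -23 <(sort -u a) <(sort -u b)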
Here's how to do it in awk
awk '
# read A, the supposed subset file
FNR == NR {a[$0]; next}
# process file B
$0 in a {delete a[$0]}
END {if (length(a) == 0) {print "A is a subset of B"}}
' A B
Test if an XSD file is a subset of a WSDL file:
xmllint --format file.wsdl | awk '{$1=$1};1' | sort -u | wc
xmllint --format file.wsdl file.xsd | awk '{$1=$1};1' | sort -u | wc
This adapts the elegant concept of RichieHindle's prior answer using:
xmllint --format instead of cat, to pretty-print the XML so that each XML element is on its own line, as required by sort -u | wc. Other pretty-printing commands might work here, e.g. jq . for JSON.
an awk command to normalise the whitespace: it strips leading and trailing whitespace (because the indentation differs between the two files) and collapses internal whitespace. Caveat: it does not account for XML attribute order within an element.

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum match count, -m 100, after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. I could previously write to files with grep (not egrep) using huge numbers, similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification is happening to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each one of those files (changing the redirection at the end to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
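Putting the pieces together, a sketch of the chunked run might look like this; PATTERN stands in for the expression and options from your original egrep command, and the chunk size is arbitrary:
split -l 10000000 source.txt split_source.
> dest.txt                                   # start with an empty output file
for f in split_source.*; do
    egrep 'PATTERN' "$f" | sort | uniq | sed -e 's/www\.//' >> dest.txt
done
sort dest.txt | uniq > dest_uniq.txt         # de-duplicate across chunks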
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
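As a purely hypothetical illustration of the anchoring point (this is not your actual pattern):
# match lines that consist of nothing but a .com domain,
# rather than any line that merely contains one somewhere
egrep '^[a-z0-9.-]+\.com$' source.txt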

What are the differences among grep, awk & sed? [duplicate]

I am confused about the differences between grep, awk and sed in terms of their role in Unix/Linux system administration and text processing.
Short definition:
grep: search for specific terms in a file
#usage
$ grep This file.txt
Every line containing "This"
Every line containing "This"
Every line containing "This"
Every line containing "This"
$ cat file.txt
Every line containing "This"
Every line containing "This"
Every line containing "That"
Every line containing "This"
Every line containing "This"
Now, awk and sed are completely different from grep.
awk and sed are text processors. Not only do they have the ability to find what you are looking for in text, they have the ability to remove, add and modify the text as well (and much more).
awk is mostly used for data extraction and reporting. sed is a stream editor.
Each one of them has its own functionality and specialties.
Example
Sed
$ sed -i 's/cat/dog/' file.txt
# this will replace the first occurrence of 'cat' with 'dog' on each line, editing file.txt in place (add the g flag to replace every occurrence)
Awk
$ awk '{print $2}' file.txt
# this will print the second column of file.txt
Basic awk usage:
Compute sums/averages/max/min/etc., whatever you may need.
$ cat file.txt
A 10
B 20
C 60
$ awk 'BEGIN {sum=0; count=0; OFS="\t"} {sum+=$2; count++} END {print "Average:", sum/count}' file.txt
Average: 30
I recommend that you read this book: Sed & Awk: 2nd Ed.
It will help you become a proficient sed/awk user in any unix-like environment.
Grep is useful if you want to quickly search for lines that match in a file. It can also return some other simple information like matching line numbers, match count, and file name lists.
Awk is an entire programming language built around reading files of field-separated records (CSV-style or whitespace-delimited), processing the records, and optionally printing out a result data set. It can do many things but it is not the easiest tool to use for simple tasks.
Sed is useful when you want to make changes to a file based on regular expressions. It allows you to easily match parts of lines, make modifications, and print out results. It's less expressive than awk, but that lends it to somewhat easier use for simple tasks. It has many more complicated operators you can use (I think it's even Turing complete), but in general you won't use those features.
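A contrived side-by-side example may make the split clearer. Suppose you want the lines of a hypothetical log.txt (space-separated fields) that contain the word ERROR:
$ grep 'ERROR' log.txt                    # just show the matching lines
$ sed -n 's/ERROR/FAILURE/p' log.txt      # show them with ERROR rewritten to FAILURE
$ awk '/ERROR/ {print $1, $3}' log.txt    # show only the 1st and 3rd fields of each match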
I just want to mention one thing: there are many tools that can do text processing, e.g.
sort, cut, split, join, paste, comm, uniq, column, rev, tac, tr, nl, pr, head, tail.....
They are very handy, but you have to learn their options, etc.
A lazy way (not the best way) to learn text processing might be: learn only grep, sed and awk. With these three tools, you can solve almost 99% of text processing problems and don't need to memorize all the different commands and options above. :)
And if you've learned and used all three, you know the difference. Actually, the difference here means which tool is good at solving which kind of problem.
An even lazier way might be to learn a scripting language (Python, Perl or Ruby) and do all your text processing with it.
