How to execute multiple commands within the same input - linux

Cyber newbie.
This has me completely stumped. I need to search a file called ‘countries’ for all countries containing the letter ‘y’. Following this, sort the output of this command in reverse order and write the output to a file called ‘output’.
How do I sort by a particular character?
Thanks

grep y countries | sort -r > output
should do it.
The pipe character | sends the output of the command on its left (grep) as input to the command on its right (sort).
The output redirection character > writes the final output to the file 'output'.
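For example, with a small made-up countries file (the file contents here are just for illustration):
$ printf 'Germany\nFrance\nItaly\nKenya\n' > countries
$ grep y countries | sort -r > output
$ cat output
Kenya
Italy
Germany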

Related

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:
xxx
(null)
yyy
Now I am trying to prefix my search strings to the output, like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know exactly what your input file contains (it is not properly specified in the question), I will make some hypotheses in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns that you are looking for (here, columns 1 and 2).
REMARK:
Do not cat the file and pipe it into grep: grep can read the file itself, and the extra cat process just copies every byte through the pipe for nothing. It might not matter much on small files, but you will feel the difference on 10 GB files, for example!
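Applied to the command in the question, that means letting egrep read the file directly; a sketch, assuming Output.txt is laid out like grep_file.in above:
$ egrep -i "abc|def|efg" Output.txt | cut -d':' -f1,2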
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format, with the pattern possibly located anywhere on the line:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
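For readability, here is the same one-liner expanded into a commented form (identical logic, just spread out):
$ awk 'BEGIN { FS = ":"; ORS = FS }          # input and output field separator is ":"
{
    tmp = 0
    for (i = 1; i <= NF; i++) {              # scan every field on the line
        tmp = match($i, /abc|def|efg/)
        if (tmp) { print $i; break }         # print the matching field (plus ORS ":") and stop
    }
    if (tmp) printf "%s\n", $2               # then print the second field and a newline
}' grep_file2.in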
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} filters the lines that contain one of the given patterns, then executes the instructions in the block. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and reduces the pattern space to the matched pattern followed by a colon. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold spaces and extracts the second column element. Last but not least, H;x;s/\n//p regroups both extracted elements on one line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
It will print the match and the next token after the delimiter.
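On the sample file from above, for instance:
$ egrep -io "(abc|def|efg):[^:]*" grep_file2.in
abc:xxx
def:
efg:yyy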
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

Linux / Unix — pipe the output to join

I am very new to Linux / Unix, and from time to time I do some exercises.
I was working through my exercises until I got stuck on one part.
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
The problem is, I don't understand what this part is asking.
A few days ago I ran this command on the server:
wget 'http://finance.yahoo.com/d/quotes.csv?s=BARC.L+0992.HK+RHT+AAPL+ADI+AEIS+AGNC+AMAT+AMGN+AMRN+ARCC+ARIA+ARNA+ATVI+BBRY+BIDU+BRCD+BRCM+BSFT+CENX+CERE+CMCSA+COCO+CSCO+CSIQ+CSOD+CTRP+CTSH+CYTX+DRYS+DTV+DXM+EA+EBAY+EGLE+ENDP+ESRX+EXPD+EXTR+FANG+FAST+FB+FCEL+FITB+FLEX+FOXA+FSLR+FTR+GALE+GERN+GILD+GMCR+GRPN+GTAT+HBAN+HDS+HIMX+HSOL+IMGN+INTC+JASO+JBLU+JDSU+KERX+LINE+LINTA+MDLZ+MNKD+MPEL+MSFT+MU+MXIM+MYL+NFLX+NIHD+NUAN+NVDA+ONNN+ORIG+OTEX+OXBT+PENN+PMCS+PSEC+QCOM+RBCN+REGN+RFMD+RSOL+SCTY+SINA+SIRI+SNDK+SPWR+SYMC+TSLA+TUES+TWGP+TXN+VOLC+WEN+YHOO+ZNGA&f=nab' -O quotes.csv
But the produced file quotes.csv was not good enough to get insight into the finance data, so I need some help from you!
Checkpointing. When you have finished this lesson you must get this:
$ sha256sum -c quotesshasums
quotes.t1: OK
quotes.t2: OK
quotes.t3: OK
quotes.t4: OK
quotes.t5: OK
quotes.t6: OK
quotes.csv
We have a source file with stock price data.
Lines are terminated with CRLF, which is not Unix style. Make them LF-terminated, i.e. remove the CR (\r) byte from each line. To do this, use the sed (man sed) substitute command; output to quotes.t1.
More info at http://en.wikipedia.org/wiki/Newline
Run checkpoint to test if quotes.t1 is OK.
Use the head and tail commands to output everything except the first and last line of file quotes.t1 to quotes.t2.
Make fields separated with pipe (vertical bar |) instead of comma.
sed -re 's/,([0-9.]+),([0-9.]+)/|\1|\2/g' quotes.t2 > quotes.t3
Numeric sort by the third field (key); don't forget the new separator. Output to quotes.t4.
Output the last five lines and cut them, keeping the first and third fields in the result; output to quotes.t5.
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
If needed, I can post all parts of this exercise, but I think you may already see what I need to do in this part.
Mainly, I need to know what that join means. I searched Google for it, but I still don't get it.
Transferring an abbreviated version of the comments into an answer.
The original version of the question was asking about:
Plain sort quotes.t5 and pipe the output to join.
In join use field separator, read from stdin and from quotes.comms, output to quotes.t6
You need to know that join is a command. It can read from standard input if you specify - as one of its two input file names.
The steps are then, it seems to me, quite straightforward:
sort quotes.t5 | join -t'|' - quotes.comms > quotes.t6
or perhaps:
sort quotes.t5 | join -t'|' quotes.comms - > quotes.t6
I'm not sure how you tell which is required, except by interpreting 'read from stdin and from quotes.comms' as meaning standard input first and quotes.comms second.
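For context, the whole exercise chain could look something like the sketch below. This is only a guess at the intended commands: it assumes GNU coreutils and sed, and the field positions and file names (including quotes.comms) are taken from the exercise text rather than verified:
sed -e 's/\r$//' quotes.csv > quotes.t1                            # strip the CR byte, leaving LF endings
tail -n +2 quotes.t1 | head -n -1 > quotes.t2                      # drop the first and last lines (GNU head)
sed -re 's/,([0-9.]+),([0-9.]+)/|\1|\2/g' quotes.t2 > quotes.t3    # separate fields with | instead of ,
sort -t'|' -k3 -n quotes.t3 > quotes.t4                            # numeric sort on the third field
tail -n 5 quotes.t4 | cut -d'|' -f1,3 > quotes.t5                  # last five lines, fields 1 and 3
sort quotes.t5 | join -t'|' - quotes.comms > quotes.t6             # plain sort, then join with quotes.comms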

Copy a section within two keywords into a target file

I have thousands of files in a directory, and each file contains a number of variable definitions starting with the keyword DEFINE and ending with a semicolon (;). I want to copy all occurrences of the data between these delimiters (inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now, from the above content, I just want to copy the section starting with DEFINE and ending with ; into a target file, i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
This needs to be done for thousands of scripts with multiple occurrences. Please help out.
Thanks a lot, the provided code works, but only when the whole statement is on a single line. The data is not guaranteed to be on one line; it can be spread over multiple lines like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code is in the above fashion; I need to capture all the data between DEFINE and the semicolon (;). After every DEFINE there will be a closing semicolon; this is the pattern.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with extension .txt in the present directory:
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o prints only the matching string rather than the whole line; useful when the line contains something else that you want to omit.
-h suppresses file names before the matching results.
Read the man page of grep by typing man grep in your terminal.
EDIT
If you want the ability to search across multiple lines, you can use pcregrep with the -M option:
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
Reference : SO Question
One can make a simple solution using sed:
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then each time a line begins with DEFINE, print the line (command p) then enter a loop: until you find a line ending with ;, grab the next line and loop to the print command. When exiting the loop, you do nothing.
It looks a bit dirty; some versions of sed have a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only in those versions of sed are both patterns of the range handled this way; for other versions of sed, like mine under cygwin, the two patterns must be handled on separate lines (as in the loop version above) to work properly.
One last thing to remember: it does not treat inclusive patterned ranges, i.e. it stops printing after the first encountered end-pattern even if multiple start patterns have been matched. Prefer something with awk if this is a feature you are looking for.
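For instance, a minimal flag-based awk sketch (file names are placeholders) that prints each DEFINE-to-semicolon block and re-arms at every new DEFINE:
awk '/^DEFINE/ { p = 1 }   # a DEFINE line turns printing on
     p         { print }   # while the flag is set, print the current line
     /;$/      { p = 0 }   # a line ending in ";" turns printing off
' *.txt > outfile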

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum match count, -m 100, after source.txt. The problem occurs if I don't specify one, or if I specify a huge number. I could previously write to files with grep (not egrep) using huge numbers similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification happens to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each of those files (perhaps changing the final redirection to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem coming from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
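For example, a sketch only, since the full original command is missing from the question: anchoring the pattern and moving sort | uniq after sed might look like this (note the character class also needs a literal dot to match multi-label names such as www.abc.yahoo.com):
egrep '^[a-z0-9.-]+\.com(\.[a-z]{2})?$' source.txt | sed -e 's/^www\.//' | sort | uniq > dest.txt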

Difference between "sort < output" and "sort output"

I just want to know the difference between:
sort < output
and
sort output
in Linux. How does it work exactly?
This has been discussed on unix.stackexchange here: Performance difference between stdin and command line argument
In sort < file the shell performs redirection. It opens the file and passes the stdin file descriptor to the sort command which reads it.
In sort file, the sort command opens the file and then reads it.
sort < output is telling the shell to use the contents of the file output and dump it to standard in for the command sort.
sort output is telling the command sort to use the file output on disk as its source.
Many unix commands will accept either standard in or a file as input. The acceptance of standard in allows easier chaining of commands, often for things like ps aux | grep "my process" | sort. (List all processes, filter by "my process", sort lines).
With sort < input the shell runs the sort command and attaches its input to the file 'input'.
With sort input the shell runs the sort command and gives it the string input as a parameter. The sort command then opens the file to read its content.
Effectively there is no difference.
sort < output uses a feature of the shell called file redirection (see e.g. here)
The shell opens the file output and attaches that open file as stdin to the sort program.
sort output gives the output filename as a command line argument to sort.
sort, like many utilities that take a filename as an argument, will read input from stdin if you do not give it a filename, as in the first case here. In both cases, sort reads the content of the output file, sorts it, and writes the result to stdout.
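A quick way to convince yourself (the file name is arbitrary):
$ printf 'banana\napple\n' > output
$ sort output
apple
banana
$ sort < output
apple
banana
$ wc -l output          # given a filename argument, wc prints it
2 output
$ wc -l < output        # with redirection, wc only sees stdin
2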
