Copy a section within two keywords into a target file - linux

I have thousand of files in a directory and each file contains numbers of defined variables starting with keyword DEFINE and ending with a semicolon (;), I want to copy all the occurrences of the data between this keyword(Inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now from the above content i just want to copy the section starting with DEFINE and ending with ; into a target file i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
this needs to done for thousands of scripts and multiple occurences, Please help out.
Thanks a lot , the provided code works, but to a limited extent only when the whole sentence is in a single line but the data is not supposed to be in one single line it is spread in multiple line like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code is also in the above fashion i need to capture all the data between DEFINE and semicolon (;) after every define there will be an ending semicolon ;, this is the pattern.

It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output

Try using grep. Let's say you have files with extension .txt in present directory,
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o will give you only matching string rather than whole line, if line also contains something else and want to ommit it.
-h will suppress file names before matching result
Read man page of grep by typing man grep on your terminal
EDIT
If you want capability to search in multiple lines, you can use pcregrep with -M option
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
Reference : SO Question

One can make a simple solution using sed with version :
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then each time a line begins with DEFINE, print the line (command p) then enter a loop: until you find a line ending with ;, grab the next line and loop to the print command. When exiting the loop, you do nothing.
It looks a bit dirty; it seems that the version sed15 has a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only for this version of sed, both patterns are treated; for other versions of sed like mine under cygwin, the range patterns must be on separate lines to work properly.
One last thing to remember: it does not treat inclusive patterned ranges, i.e. it stops printing after the first encountered end-pattern even if multiple start patterns have been matched. Prefer something with awk if this is a feature you are looking for.

Related

Simple way to remove multi-line string using sed

Using sed, is there a way to remove multiple lines from a text file based on some starting and ending expressions?
I have known markers in the file and want to remove everything between (markers inclusive). I have seen some really complicated solutions and I would like to do this without resorting to micro commands.
My file looks something like this:
cat /tmp/foobar.txt
this is line 1
this is line 3
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
ant-junit*.jar,\
ant-launcher.jar,\
ant.jar,\
asm-*.jar,\
aspectj*.jar,\
bootstrap.jar,\
catalina-ant.jar,\
catalina-ha.jar,\
catalina-ssi.jar,\
catalina-storeconfig.jar
the end leave me
and me
I want to remove everything starting at tomcat.util all the away to the last .jar
tldr;
I think this is the simplest way, ad no need for the assembly like micro commands
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
which produces
this is line 1
this is line 3
the end leave me
and me
if you wanted to remove the lines in the file rather than spit out the output to stdout then use the inline flag, so
sed -i '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
So... how does this work?
sed commands, like vi commands operate on an address. Normally we don't specify an address and that simply applies the command to all lines of the file, eg when replacing the for that in a file we'd normally do
sed -i 's/the/that/g' /tmp/foobar.txt
ie applying the substitute or s command to all lines in the file.
In this case you want to delete some lines so we can use the delete or d command. But we need to tell it where to delete. So we need to give it an address.
The format of a sed command is
[addr][!]command[options]
(see the docs )
If no address is specified then the command is applied to all lines, if the ! is specified then it is applied to all lines that don't match the pattern. So far so good.
The trick here is that addr can be a single address or a range of addresses. The address can be a line number or a regex pattern. You use a , between two addresses to to specify a range.
so to delete line 5 to 8 inclusive you could do
sed -i '5,8d' /tmp/foobar.txt
in this case rather than knowing the line number we know some "markers" and we can use Regex instead, so the first marker, a line starting with tomcat.util is found by the regex
/^tomcat\.util.*$/
The second marker is a bit more tricky but if we look we can see that the final line to remove is the first one that does not end with a \, so we can match a line that consists of "anything but does not end with \"
/^.*[^\]$/
While the second marker could match a whole bunch of lines if we make a range out of these two regexes, the range means that the second "address" is the first line after the first address that matches the regex.
Putting that all together, we want to delete (d) all lines in the range from the address that is found by the regex matching a line starting with tomcat.util and ending with a line that does not end in \ ie
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
hope that helps ;-)
Cheers
Karl
Awk is generally more useful than sed for anything spanning lines. Using any awk in any shell on every Unix box:
$ awk '!/\.jar/{f=0} /tomcat\.util/{f=1} !f' file
this is line 1
this is line 3
the end leave me
and me
This might work for you (GNU sed):
sed -n '/tomcat\.util/{:a;n;/\.jar/ba};p' file
Turn off implicit printing using the -n option.
Match on a line containing tomcat.util.
Continue fetching lines until such a line does not match one containing .jar.
Print all other lines.
Alternative:
sed -E '/tomcat\.util/{:a;$!N;/\.jar(,\\)?$/s/\n//;ta;D}' file
Gather up lines beginning tomcat.util and ending either .jar,\ or .jar, removing newlines until the end-of-file or a mis-match and then delete the collection.

Using sed to obtain pattern range through multiple files in a directory

I was wondering if it was possible to use the sed command to find a range between 2 patterns (in this case, dates) and output these lines in the range to a new file.
Right now, I am just looking at one file and getting lines within my time range of the file FileMoverTransfer.log. However, after a certain time period, these logs are moved to new log files with a suffix such as FileMoverTransfer.log-20180404-xxxxxx.gz. Here is my current code:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log >> /public/FileMoverRoot/logs/intervalFMT.log
While this doesn't work, as sed isn't able to look through all of the files in the directory starting with FileMoverTransfer.log?
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log* >> /public/FileMoverRoot/logs/intervalFMT.log
Any help would be greatly appreciated. Thanks!
The range operator only operates within a single file, so you can't use it if the start is in one file and the end is in another file.
You can use cat to concatenate all the files, and pipe this to sed:
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log
And instead of quoting and unquoting the sed command, you can use double quotes so that the variables will be expanded inside it. This will also prevent problems if the variables contain whitespace.
awk solution
As the OP confirmed that an awk solution would be acceptable, I post it.
(gunzip -c FileMoverTransfer.log-*.gz; cat FileMoverTransfer.log ) \
|awk -v st="$start_date" -v en="$end_date" '$1>=st&&$1<=en{print;next}$1>en{exit}'\
>/public/FileMoverRoot/logs/intervalFMT.log
This solution is functionally almost identical to Barmar’s sed solution, with the difference that his solution, like the OP’s, will print and quit at the first record matching the end date, while mine will print all lines matching the end date and quit at the first record past the end date, without printing it.
Some remarks:
The OP didn't specify the date format. I suppose it is a format compatible with ordinary string order, otherwise some conversion function should be used.
The files FileMoverTransfer.log-*.gz must be named in such a way that their alphabetical ordering corresponds to the chronological order (which is probably the case.)
I suppose that the dates are separated from the rest of the line by whitespace. If they aren’t, you have to supply the -F option to awk. E.g., if the dates are separated by -, you must write awk -F- ...
awk is much faster than sed in this case, because awk simply looks for the separator (whitespace or whatever was supplied with -F) while sed performs a regexp match.
There is no concept of range in my code, only date comparison. The only place where I suppose that the lines are ordered is when I say $1>en{exit}, that is exit when a line is newer than the end date. If you remove that final pattern and its action, the code will run through the whole input, but you could drop the requirement that the files be ordered.

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:-
xxx
(null)
yyy
Now, i am trying to prefix my search texts to the output like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know exactly your input file content (not specified properly in the question), I will put some hypothesis in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep just use the cut with the 2 columns that you are looking for (here it is 1 and 2)
REMARK:
Do not cat the file, pipe it and then grep it, since this is doing the work twice!!! Your grep command will already read the file so do not read it twice, it might not be that important on small files but you will feel the difference on 10GB files for example!
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing, here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format; with your pattern that could be located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file
2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} is used to filter the lines that contain only one of the patterns provided, then you execute the instructions in the block. h;s/.*\(abc\|def\|efg\).*/\1:/; save the line in the hold space and replace the line with one of the 3 patterns, x;s/^[^:]*:\([^:]*\):.*/\1/; is used to exchange the pattern and hold space and extract the 2nd column element. Last but not least, H;x;s/\n//p is used to regroup both extracted elements on 1 line and print it.
try this
$ egrep -io "(abc|def|efg):[^:]*" file
will print the match and the next token after delimiter.
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way your idiom: for f in cat files.csv should be avoid. Refer: Dangerous Backticks
while read f
do
echo "fastq/${f}/_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gz
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < temp.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...

find string and replace

Hi I have a file like this
L_00001_mRNA_interferase_MazF
ATGGATTATCCAAAACAAAAGGATATTGTCTGGATTGATTTTGACCCTTCTAAAGGCAAA
GAGATAAGAAAGCGGAGACCTGCGTTAGTAGTTAGTAAAGATGAATTTAATGAACGTACA
GGTTTCTGTTTAGTTTGCCCCATCACATCTACTAAAAGGAACTTTGCAACGTATATTGAA
ATAACAGACCCACAGAAAGTAGAAGGGGACGTAGTTACCCATCAATTGCGAGCGGTTGAT
TACACCACAAGAAATATCGAAAAAATTGAACAATGTGATATGTTGACGTGGATTGATGTA
GTAGAAGTAATCGGAATGTTTATTTAA
L_00002_hypothetical_protein
ATGGAAACGGTAGTTAGAAAGATAGGGAATTCAGTAGGAACTATTTTTCCGAAAAGTATT
TCACCACAAGTTGGAGAAAAGTTCACTATTCTTAAAGTTGGGGAAGCGTATATATTGAAA
CCTAAGAGAGAAGATATTTTTAAAAATGCTGAAGATTGGGTAGGGTTTAGAGAAGCTTTG
ACTAATGAAGATAAAGAATGGGACGAGATGAAACTTGAGGGAGGAGAACGCTAG
L_00003_hypothetical_protein
ATGACAACGTTTGGAGAAATTCATAGCAATGCAGAAGGTTATAAAAACGATTTTAATGAG
TTGAATAAATTAGTATTACGTGTAGCTGAAGAAAAAGCAAAAGGAGAGCCATTAGTAACG
TGGTTTCGGTTGCGGAATCGTAGGATTGCACAAGTATTAGACCCAATGAAAGAAGAAGTA
GAAAGTAAATCAAAGTACGAAAAAAGAAGAGTAGCAGCAATTAGTAAAAGCTTTTTTCTA
CTTAAAAAAGCTTTTAACTTTATTGAAGCAGAACAATTTGAAAAAGCAGAAAAATTAATT
I would like to substitute the header of each sequence with a string.
I have a conversion file like
L_00001_mRNA_interferase_MazF galM,GALM,aldose1-epimerase[EC:5.1.3.3]
L_00002_hypothetical_protein E3.2.1.85,lacG,6-phospho-beta-galactosidase[EC:3.2.1.85]
L_00003_hypothetical_protein PTS-Lac-EIIB,lacE,PTSsystem,lactose-specificIIBcomponent[EC:2.7.1.69]
Your question is unclear as to what platform you're on (Windows, Linux, Mac, ...), what languages you're constrained to, and the exact details of your input files.
On the assumption that you're on Linux, or otherwise have sed and awk available and a command shell, it could be as simple as (where $ indicates a Bourne-like shell prompt):
$ awk '{print "s/^" $1 "/" $2 "/"}' conversions.txt > conversions.sed
$ sed -f conversions.sed sequences.txt > relabeled.txt
This assumes that your first file (with the headings you want changed) is called sequences.txt and your second file (the “conversion file”) is called conversions.txt. It is further assumed that the “conversion file” contains one record per line with exactly two fields — the original and substitute headers — separated by whitespace (i.e. neither the original header nor the new header contain any spaces) and no blank lines.
In this solution, the first (awk) line converts the conversions.txt file into a sed script, conversions.sed; the second (sed) line then runs this script on the sequences.txt file, producing the relabeled.txt file, which may (or may not) be what you're looking for.
Depending on the exact nature of your input files, which isn't clear from your question, this may need a bit of tweaking.
Hope this helps.

Resources