Awk between two patterns with pattern in the middle - linux

Hi i am looking for an awk that can find two patterns and print the data between them to
a file only if in the middle there is a third patterns in the middle.
for example:
Start
1
2
middle
3
End
Start
1
2
End
And the output will be:
Start
1
2
middle
3
End
I found in the web awk '/patterns1/, /patterns2/' path > text.txt
but i need only output with the third patterns in the middle.

And here is a solution without flags:
$ awk 'BEGIN{RS="End"}/middle/{printf "%s", $0; print RT}' file
Start
1
2
middle
3
End
Explanation: The RS variable is the record separator, so we set it to "End", so that each Record is separated by "End".
Then we filter the Records that contain "middle", with the /middle/ filter, and for the matched records we print the current record with $0 and the separator with print RT

This awk should work:
awk '$1=="Start"{ok++} ok>0{a[b++]=$0} $1=="middle"{ok++} $1=="End"{if(ok>1) for(i=0; i<length(a); i++) print a[i]; ok=0;b=0;delete a}' file
Start
1
2
middle
3
End
Expanded:
awk '$1 == "Start" {
ok++
}
ok > 0 {
a[b++] = $0
}
$1 == "middle" {
ok++
}
$1 == "End" {
if (ok > 1)
for (i=0; i<length(a); i++)
print a[i];
ok=0;
b=0;
delete a
}' file

Just use some flags with awk:
/Start/ {
start_flag=1
}
/middle/ {
mid_flag=1
}
start_flag {
n=NR;
lines[NR]=$0
}
/End/ {
if (start_flag && mid_flag)
for(i=n;i<NR;i++)
print lines[i]
start_flag=mid_flag=0
delete lines
}

Modified the awk user000001
awk '/middle/{printf "%s%s\n",$0,RT}' RS="End" file
EDIT:
Added test for Start tag
awk '/Start/ && /middle/{printf "%s%s\n",$0,RT}' RS="End" file

This will work with any modern awk:
awk '/Start/{f=1;rec=""} f{rec=rec $0 ORS} /End/{if (rec~/middle/) printf "%s",rec}' file
The solutions that set RS to "End" are gawk-specific, which may be fine but it's definitely worth mentioning.

Related

Changing previous duplicate line in awk

I want to change all duplicate names in .csv to unique, but after finding duplicate I cannot reach previous line, because it's already printed. I've tried to save all lines in array and print them in End section, but it doesn't work and I don't understand how to access specific field in this array (two-dimensional array isn't supported in awk?).
sample input
...,9,phone,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone,...
desired output
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
My attempt ($2 - id field, $3 - name field)
BEGIN{
FS=","
OFS=","
marker=777
}
{
if (names[$3] == marker) {
$3 = $3 $2
#Attempt to change previous duplicate
results[nameLines[$3]]=$3 id[$3]
}
names[$3] = marker
id[$3] = $2
nameLines[$3] = NR
results[NR] = $0
}
END{
#it prints some numbers, not saved lines
for(result in results)
print result
}
Here is single pass awk that stores all records in buffer:
awk -F, '
{
rec[NR] = $0
++fq[$3]
}
END {
for (i=1; i<=NR; ++i) {
n = split(rec[i], a, /,/)
if (fq[a[3]] > 1)
a[3] = a[3] a[2]
for (k=1; k<=n; ++k)
printf "%s", a[k] (k < n ? FS : ORS)
}
}' file
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...
This could be easily done in 2 pass Input_file in awk where we need not to create 2 dimensional arrays in it. With your shown samples written in GNU awk.
awk '
BEGIN{FS=OFS=","}
FNR==NR{
arr1[$3]++
next
}
{
$3=(arr1[$3]>1?$3 $2:$3)
}
1
' Input_file Input_file
Output will be as follows:
...,9,phone9,...
...,43,book,...
...,27,apple,...
...,85,hook,...
...,43,phone43,...

move lines into a file by number of columns using awk

I have a sample file with '||o||' as field separator.
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
ngascash||o||||o||
ms-bronze.com.br||o||||o||
I want to move the lines with only 1 field in 1.txt and those having more than 1 field in not_1.txt. I am using the following command:
sed 's/\(||o||\)\+$//g' sample.txt | awk -F '[|][|]o[|][|]' '{if (NF == 1) print > "1.txt"; else print > "not_1.txt" }'
The problem is that it is moving not the original lines but the replaced ones.
The output I am getting is (not_1.txt):
td#the-end.org||o||srScSG2C5tg=||o||bnm
erba01#tiscali.it||o||4sQVj09gpls=
1.txt:
ngas
ms-inside#bol.com.br
As you can see the original lines are modified. I don't want to modify the lines.
Any help would be highly appreciated.
Awk solution:
awk -F '[|][|]o[|][|]' \
'{
c = 0;
for (i=1; i<=NF; i++) if ($i != "") c++;
print > (c == 1? "1" : "not_1")".txt"
}' sample.txt
Results:
$ head 1.txt not_1.txt
==> 1.txt <==
ngascash||o||||o||
ms-bronze.com.br||o||||o||
==> not_1.txt <==
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
Following awk may help you on same.
awk -F'\\|\\|o\\|\\|' '{for(i=1;i<=NF;i++){count=$i?++count:count};if(count==1){print > "1_field_only"};if(count>1){print > "not_1_field"};count=""}' Input_file
Adding a non-one liner form of solution too now.
awk -F'\\|\\|o\\|\\|' '
{
for(i=1;i<=NF;i++){ count=$i?++count:count };
if(count==1) { print > "1_field_only" };
if(count>1) { print > "not_1_field" };
count=""
}
' Input_file
Explanation: Adding explanation for above code too now.
awk -F'\\|\\|o\\|\\|' ' ##Setting field separator as ||o|| here and escaping the | here to take it literal character here.
{
for(i=1;i<=NF;i++){ count=$i?++count:count }; ##Starting a for loop to traverse through all the fields here, increasing variable count value if a field is NOT null.
if(count==1) { print > "1_field_only" }; ##Checking if count value is 1 it means fields are only 1 in line so printing current line into 1_field_only file.
if(count>1) { print > "not_1_field" }; ##Checking if count is more than 1 so printing current line into output file named not_1_field file here.
count="" ##Nullifying the variable count here.
}
' Input_file ##Mentioning Input_file name here.

Using `awk` to print number of lines in file in the BEGIN section

I am trying to write an awk script and before anything is done tell the user how many lines are in the file. I know how to do this in the END section but unable to do so in the BEGIN section. I have searched SE and Google but have only found a half dozen ways to do this in the END section or as part of a bash script, not how to do it before any processing has taken place at all. I was hoping for something like the following:
#!/usr/bin/awk -f
BEGIN{
print "There are a total of " **TOTAL LINES** " lines in this file.\n"
}
{
if($0==4587){print "Found record on line number "NR; exit 0;}
}
But have been unable to determine how to do this, if it is even possible. Thanks.
You can read the file twice:
awk 'NR!=1 && FNR==1 {print NR-1} <some more code here>' file{,}
In your example:
awk 'NR!=1 && FNR==1 {print "There are a total of "NR-1" lines in this file.\n"} $0==4587 {print "Found record on line number "NR; exit 0;}' file{,}
You can use file file instead of file{,} (it just makes it show up twice)
NR!=1 && FNR==1 this will be true only at first line of second file.
To use an awk script containing:
#!/usr/bin/awk -f
NR!=1 && FNR==1 {
print "There are a total of "NR-1" lines in this file.\n"
}
$0==4587 {
print "Found record on line number "NR; exit 0
}
call:
awk -f myscript file{,}
To do this robustly and for multiple files you need something like:
$ cat tst.awk
BEGINFILE {
numLines = 0
while ( (getline line < FILENAME) > 0 ) {
numLines++
}
print "----\nThere are a total of", numLines, "lines in", FILENAME
}
$0==4587 { print "Found record on line number", FNR, "of", FILENAME; nextfile }
$
$ cat file1
a
4587
c
$
$ cat file2
$
$ cat file3
d
e
f
4587
$
$ awk -f tst.awk file1 file2 file3
----
There are a total of 3 lines in file1
Found record on line number 2 of file1
----
There are a total of 0 lines in file2
----
There are a total of 4 lines in file3
Found record on line number 4 of file3
The above uses GNU awk for BEGINFILE. Any other solution is difficult to implement such that it will handle empty files (you need an array to track files being parsed and print info the the FNR==1 and END sections after the empty file has been skipped).
Using getline has caveats and should not be used lightly, see http://awk.info/?tip/getline, but this is one of the appropriate and robust uses of it. You can also test for non-readable files in BEGINFILE by testing ERRNO and skipping the file (see the gawk manual) - that situation will cause other scripts to abort.
BEGIN {
s="cat your_file.txt|wc -l";
s | getline file_size;
close(s);
print file_size
}
This will put the size of the file named your_file.txt into the awk variable file_size and print it out.
If your file name is dynamic you can pass the filename on the commandline and change the script to use the variable.
E.g. my.awk
BEGIN {
s="cat "VAR"|wc -l";
s | getline file_size;
close(s);
print file_size
}
Then you can call it like this:
awk -v VAR="your_file.txt" -f my.awk
If you use GNU awk and need a robust, generic solution that accommodates multiple, possibly empty input files, use Ed Morton's solution.
This answer uses portable (POSIX-compliant) code. Within the constraints noted, it is robust, but Ed's GNU awk solution is both simpler and more robust.
Tip of the hat to Ed Morton for his help.
With a single input file, it is simpler to handle line counting with a shell command in the BEGIN block, which has the following advantages:
on invocation, the filename doesn't have to be specified twice, unlike in the accepted answer
Also note that the accepted answer doesn't work as intended (as of this writing); the correct form is (see the comments on the answer for an explanation):
awk 'NR==FNR {next} FNR==1 {print NR-1} $0==4587 {print "Found record on line number "NR; exit 0}' file{,}
the solution also works with an empty input file.
In terms of performance, this approach is either only slightly slower than reading the file twice in awk, or even a little faster, depending on the awk implementation used:
awk '
BEGIN {
# Execute a shell command to count the lines and read
# result into an awk variable via <cmd> | getline <varname>.
# If the file cannot be read, abort. (The shell has already printed an error msg.)
cmd="wc -l < \"" ARGV[1] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
printf "There are a total of %s lines in this file.\n\n", count
}
$0==4587 { print "Found record on line number " NR; exit 0 }
' file
Assumptions:
The filename is passed as the 1st operand (non-option argument) on the command line, accessed as ARGV[1].
The filename doesn't contain embedded " chars.
The following solutions deal with multiple files and make analogous assumptions:
All operands passed are filenames. That is, all arguments after the program must be filenames, and not variable assignments such as var=value.
No filename contains embedded " chars.
No processing is to take place if any of the input files do not exist or cannot be read.
It's not hard to generalize this to handling multiple files, but the following solution doesn't print the line count for empty files:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 { printf "There are a total of %s lines in file %s.\n\n", counts[FILENAME], FILENAME }
# $0==4587 { print "%s: Found record on line number %d\n", FILENAME, NR; exit 0 }
' file1 file2 # ...
Things get a little trickier if you want the line count to be printed for empty files also:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
fileCount = ARGC - 1
fmtStringCount = "There are a total of %s lines in file %s.\n\n"
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 {
++fileIndex
# If there were intervening empty files, print their counts too.
while (ARGV[fileIndex] != FILENAME) {
printf fmtStringCount, 0, ARGV[fileIndex++]
}
printf fmtStringCount, counts[FILENAME], FILENAME
}
# Process input lines
$0==4587 { print "%s: Found record on line number %d\n", FILENAME, NR; exit 0 }
# If there are any remaining empty files a the end, print their counts too.
END {
while (fileIndex < fileCount) { printf fmtStringCount, 0, ARGV[++fileIndex] }
}
' file1 file2 # ...
You can get the number of lines by wc and cut, and set to awk variable with -v option, then you can use the variable in awk script.
cat awk.txt \
| awk -v FNC=`wc -l awk.txt | cut -wf 2` \
'BEGIN { print "FNC: " FNC } { print $0 }'

Check variables from different lines with awk

I want to combine values from multiple lines with different lengths using awk into one line if they match. In the following sample match values for first field,
aggregating values from second field into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (maybe some other shell expression) to check if the first field value match with the next/previous line, and then print a list of second fields values aggregated and separated by a pipe?
awk '
BEGIN {FS=";"}
{ if ($1==prev) {sec=sec "|" $2; }
else { if (prev) { print prev ";" sec; };
prev=$1; sec=$2; }}
END { if (prev) { print prev ";" sec; }}'
This, as you requested, checks the consecutive lines.
does this oneliner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
if you want to keep it sorted, add a |sort at the end.
Slightly convoluted, but does the job:
awk -F';' \
'{
if (a[$1]) {
a[$1]=a[$1] "|" $2
} else {
a[$1]=$2
}
}
END {
for (k in a) {
print k ";" a[k]
}
}' file
Assuming that you have set the field separator ( -F ) to ; :
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple if's suffice to fix that.
(Edit: Missed out last line.)
this should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' fil
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z

unfolding a file on linux

I have a huge textfile, approx 400.000 lines 80 charachters wide on liux.
Need to "unfold" the file, merging four lines into one
ending up having 1/4 of the lines, each line 80*4 charachters long.
any suggestions?
perl -pe 'chomp if (++$i % 4);'
An easier way to do it with awk would be:
awk '{ printf $0 } (NR % 4 == 0) { print }' filename
Although if you wanted to protect against ending up without a trailing newline it gets a little more complicated:
awk '{ printf $0 } (NR % 4 == 0) { print } END { if (NR % 4 != 0) print }' filename
I hope I understood your question correctly. You have an input line like this (except your lines are longer):
abcdef
ghijkl
mnopqr
stuvwx
yz0123
456789
ABCDEF
You want output like this:
abcdefghijklmnopqrstuvwx
yz0123456789ABCDEF
The following awk program should do it:
{ line = line $0 }
(NR % 4) == 0 { print line; line = "" }
END { if (line != "") print line }
Run it like this:
awk -f merge.awk data.txt

Resources