How to use AWK to continuously output lines from a file - linux

I have a file with multiple lines, and I want to repeatedly output a sliding window of lines from it: the first time, print lines 1 to 5; the next time, lines 2 to 6; and so on.
I find AWK very useful and tried to write the code myself, but it just outputs nothing.
Here is my code:
#!/bin/bash
for n in `seq 1 3`
do
    N1=$n
    N2=$((n+4))
    awk -v n1="$N1" -v n2="$N2" 'NR == n1, NR == n2 {print $0}' my_file >> new_file
done
For example, I have an input file called my_file
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos
8 1 caee
9 78 cdsa
Then I expect an output file like this:
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos

Could you please try the following, written and tested with your shown samples in GNU awk. List all the line numbers to print from in the lines_from variable; the variable named till_lines then tells how many further lines to print from each such line (e.g. from the 1st line, print the next 4 lines too). On another note, I have tested the OP's code and it worked fine for me, generating new_file; however, calling awk in a bash loop is NOT good practice, so I am adding this single-pass approach as an improvement too.
awk -v lines_from="1,2,3" -v till_lines="4" '
BEGIN{
  num=split(lines_from,arr,",")
  for(i=1;i<=num;i++){ line[arr[i]] }
}
FNR==NR{
  value[FNR]=$0
  next
}
(FNR in line){
  print value[FNR] > "output_file"
  j=""
  while(++j<=till_lines){ print value[FNR+j] > "output_file" }
}
' Input_file Input_file
The contents of output_file are then:
cat output_file
1 99 tut
2 24 bcc
3 32 los
4 33 rts
5 642 pac
2 24 bcc
3 32 los
4 33 rts
5 642 pac
6 23 caas
3 32 los
4 33 rts
5 642 pac
6 23 caas
7 231 cdos
Explanation: a detailed explanation of the above.
awk -v lines_from="1,2,3" -v till_lines="4" ' ##Start the awk program, creating 2 variables: lines_from holds all the line numbers one wants to print from; till_lines is how many further lines to print after each of them.
BEGIN{ ##Start the BEGIN section of this program.
  num=split(lines_from,arr,",") ##Split lines_from into array arr using , as the delimiter.
  for(i=1;i<=num;i++){ ##Run a for loop from i=1 to the value of num.
    line[arr[i]] ##Create array line, indexed by the value of arr[i].
  }
}
FNR==NR{ ##Condition FNR==NR is TRUE while the 1st pass over Input_file is being read.
  value[FNR]=$0 ##Store the current line in value, indexed by FNR.
  next ##next skips all further statements.
}
(FNR in line){ ##On the 2nd pass, if the current line number is an index of array line, do the following.
  print value[FNR] > "output_file" ##Print value[FNR] into output_file.
  j="" ##Nullify the value of j (so ++j starts at 1).
  while(++j<=till_lines){ ##Run a while loop from j=1 up to the value of till_lines.
    print value[FNR+j] > "output_file" ##Print value[FNR+j] into output_file.
  }
}
' Input_file Input_file ##Mention the Input_file name twice, once per pass.
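Side note: if the list of starting lines is long, it can be generated rather than typed. With GNU seq (an assumption about your environment), the -s, flag emits exactly the comma-separated form that lines_from expects:
$ seq -s, 1 3
1,2,3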

Another awk variant
awk '
BEGIN {N1=1; N2=5}
{arr[NR]=$0}
END {
  while (arr[N2]) {
    for (i=N1; i<=N2; i++)
      print arr[i]
    N1++
    N2++
  }
}
' file
This buffers the whole file in arr and then, in the END block, slides the [N1,N2] window forward one line at a time until the window's last line falls past the end of the file.

Related

How to remove some words in specific field using awk?

I have several lines of text and I want to extract the number after a specific word using awk.
I tried the following code, but it does not work.
First, create the test file with: vi test.text. There are 3 columns (the 3 fields are generated by some other pipeline commands using awk).
Index AllocTres CPUTotal
1 cpu=1,mem=256G 18
2 cpu=2,mem=1024M 16
3 4
4 cpu=12,gres/gpu=3 12
5 8
6 9
7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20
8 mem=12G,gres/gpu=3,gres/gpu:1080ti=1 21
Please note there are several empty fields in this file.
What I want to achieve: keep only the number following the first gres/gpu part and remove all the cpu= and mem= parts, using a pipeline like cat test.text | awk '{some_commands}', to output 3 columns:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
1st solution: With your shown samples, please try the following GNU awk code. It takes care of the spacing between fields.
awk '
FNR==1{ print; next }
match($0,/[[:space:]]+/){
  space=substr($0,RSTART,RLENGTH-1)
}
{
  match($2,/gres\/gpu=([0-9]+)/,arr)
  match($0,/^[^[:space:]]+[[:space:]]+[^[:space:]]+([[:space:]]+)/,arr1)
  space1=sprintf("%"length($2)-length(arr[1])"s",OFS)
  if(NF>2){ sub(OFS,"",arr1[1]); $2=space arr[1] space1 arr1[1] }
}
1
' Input_file
With the shown samples, the output of the above code is:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
2nd solution: If you don't care about the spacing, then try the following awk code.
awk 'FNR==1{print;next} match($2,/gres\/gpu=([0-9]+)/,arr){$2=arr[1]} 1' Input_file
Explanation: a detailed explanation of the above code.
awk ' ##Start the awk program.
FNR==1{ ##If this is the first line, do the following.
  print ##Print the current line.
  next ##next skips all further statements.
}
match($2,/gres\/gpu=([0-9]+)/,arr){ ##Use match to find gres/gpu= followed by digits in $2, keeping the digits in a capturing group.
  $2=arr[1] ##Assign the 1st element of array arr to the 2nd field.
}
1 ##Print the current (edited or unedited) line.
' Input_file ##Mention the Input_file name here.
Using sed
$ sed 's~\( \+\)[^,]*,\(gres/gpu=\([0-9]\)\|[^ ]*\)[^ ]* \+~\1\3 \t\t\t\t ~' input_file
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
awk '
FNR>1 && NF==3 {
n = split($2, a, ",")
for (i=1; a[i] !~ /gres\/gpu=[0-9]+,?/ && i<=n; ++i);
sub(/.*=/, "", a[i])
$2 = a[i]
}
NF==2 {$3=$2; $2=""}
{printf "%-7s%-11s%s\n",$1,$2,$3}' test.txt
Output:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
You can adjust column widths as desired.
This assumes the first and last columns always have a value, so that NF (number of fields) can be used to identify field 2. Then, if field 2 is not empty, split that field on commas, scan the resulting array for the first match of gres/gpu, strip everything up to its = sign, and print the three fields. If field 2 is empty, the second-to-last line inserts an empty awk field so that the printf always works.
If the assumption above is wrong, it's also possible to identify field 2 by its character index, as sketched below.
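A minimal sketch of that idea, where start and width are assumptions about where AllocTres begins and how wide it can be in the real file (adjust both to your data):
awk -v start=8 -v width=37 '
NR==1 { print; next }                         # header passes through untouched
{
  f2 = substr($0, start, width)               # carve field 2 out by character position
  n = ""
  if (match(f2, /gres\/gpu=[0-9]+/))          # first gres/gpu=N inside that span
    n = substr(f2, RSTART + 9, RLENGTH - 9)   # keep only the digits N ("gres/gpu=" is 9 chars)
  printf "%-7s%-11s%s\n", $1, n, $NF          # re-emit the three columns
}' test.text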
An awk-based solution without needing
- array splitting,
- regex back-referencing,
- prior state tracking, or
- input multi-passing
(multi-passing /dev/stdin would require state tracking):
{mng}awk '!_~NF || sub("[^ ]+$", sprintf("%*s&", length-length($!(NF=NF)),_))' \
FS='[ ][^ \\/]*gres[/]gpu[=]|[,: ][^= ]+[=][^,: ]+' OFS=
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
If you don't care for nawk, then there's an even simpler single-pass approach with only 1 all-encompassing call to sub() per line:
awk ' sub("[^ ]*$", sprintf("%*s&", length($_) - length($(\
gsub(" [^ /]*gres[/]gpu=|[,: ][^= ]+=[^,: ]+", _)*_)),_))'
or, even more condensed but with worse syntax styling:
awk 'sub("[^ ]*$",sprintf("%*s&",length^gsub(" [^ /]*gres\/gpu=|"\
"[,: ][^= ]+=[^,: ]+",_)^_ - length,_) )'
This might work for you (GNU sed):
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*\n(.*)\n.*/\1/
/gpu=/!{s/./ /g;G;s/(^.*)\n(.*)\n.*\n/\2\1/p;d}
s/gpu=([^,]*)/\n\1 \n/;s/(.*)\n(.*\n)/\2\1/;H
s/.*\n//;s/./ /g;H;g
s/\n.*\n(.*)\n(.*)\n.*\n(.*)/\2\3\1/' file
In essence, the solution above uses the hold space as a scratchpad to hold intermediate results. Those results are gathered by isolating the second field and then, again, the gpu info. The step-by-step story follows:
If the line does not contain a second field, leave it alone.
Surround the second field with newlines and make a copy.
Isolate the second field.
If the second field contains no gpu info, replace the entire field with spaces and, using the copy, format the line accordingly.
Otherwise, isolate the gpu info, move it to the front of the line and append that to the copy of the line in the hold space.
Meanwhile, remove the gpu info from the pattern space and replace each character in the pattern space with a space.
Append these spaces to the copy and then overwrite the pattern space with the copy.
Lastly, knowing each part of the line has been split by newlines, reassemble the parts into the desired format.
N.B. The solution depends on the spacing of columns being real spaces. If there are tabs in the file, then prepend the sed command s/\t/        /g (here each tab is replaced by 8 spaces).
Alternative:
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*(\n.*)\n.*/\1/;s/(.)(.*gpu=)([^,]+)/\3\1\2/;H
s/.*\n//;s/./ /g;G
s/(.*)\n(.*)\n.*\n(.*)\n(.*)\n.*$/\2\4\1\3/' file
In this solution, rather than treating lines that have a second field but no gpu info as a separate case, I introduce a placeholder for the missing info and follow the same steps as if the gpu info were present.
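For readers unfamiliar with the hold space, a minimal toy illustration (not part of either answer): swap each pair of lines by stashing the first line in the hold space (h), reading and printing the second (n, p), then retrieving and printing the first (g, p).
$ printf '%s\n' a b c d | sed -n 'h;n;p;g;p'
b
a
d
c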

Awk script to concatenate two columns and look for concatenated values in another file

I need your help in solving this puzzle. Any kind of help will be appreciated, as would links to any documents for learning how to deal with such scenarios.
Concatenate column1 and column2 of File1, then check for the concatenated value in column1 of File2. If found, extract the corresponding column2 and column3 values of File2, and again concatenate column1 and column2 of File2. Now look for this concatenated value back in File1.
For example: concatenate column1 (262881626) and column2 (10) of File1, then look for this concatenated value (26288162610) in column1 of File2 and extract the corresponding column2 and column3 values of File2.
Now again concatenate column1 and column2 of File2 and look for this concatenated value (2628816261050) in File1; multiply the exchange rate (2) fetched via the first key (26288162610) by the taxable value (65) that corresponds to 2628816261050 in File1. Store the multiplication result in column4 (AD) of File1 only.
File1
Bill Doc LineNo Taxablevalue AD
262881626 10 245
262881627 10 32
262881628 20 456
262881629 30 0
262881630 40 45
2628816261050 11 65
2628816271060 12 34
2628816282070 13 45
2628816293080 14 0
2628816304090 15
File2
Bill.Doc Item Exch.Rate
26288162610 50 2
26288162710 60 1
26288162820 70 45
26288162930 80 1
26288163040 90 5
Output File
Bill Doc LineNo Taxablevalue AD
262881626 10 245
262881627 10 32
262881628 20 456
262881629 30 0
262881630 40
2628816261050 11 65 130
2628816271060 12 34 34
2628816282070 13 45 180
2628816293080 14 0 0
2628816304090 15
Though your expected output is not fully clear, could you please try the following and let me know if this helps you.
awk -F"|" 'FNR==NR{a[$1$2]=$NF;next} {print $0,$1 in a?"|" a[$1]*$NF:""}' OFS="" File2 File1
Explanation:
awk -F"|" ' ##Set the field separator to | (pipe).
FNR==NR{ ##Condition FNR==NR is TRUE while the first file, File2, is being read.
  a[$1$2]=$NF; ##Create an array a whose index is $1$2 (first and second fields of the current line) and whose value is the last field.
  next} ##next skips all further statements.
{ ##These statements run only while the 2nd Input_file, File1, is being read.
  print $0,$1 in a?"|" a[$1]*$NF:"" ##Print $0 (the current line); then, if $1 of the current line is present in array a, print a[$1] * $NF after a |, else print nothing.
}
' OFS="" File2 File1 ##Set OFS to NULL and mention both Input_file names.
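The sample files shown above look whitespace-separated rather than pipe-delimited; under that assumption, a sketch of the same idea (not the original answer, and it does not preserve column alignment) could be:
awk '
FNR==NR { a[$1 $2] = $3; next }           # File2: index Bill.Doc+Item -> Exch.Rate
($1 in a) && NF >= 3 { $4 = a[$1] * $3 }  # File1: AD = Exch.Rate * Taxablevalue
1
' File2 File1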

Find lines with a common value in a particular column

Suppose I have a file like this
5 kata 45 buu
34 tuy 3 rre
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
21 plk 1 uio
23 kata 90 ty
I want the output to contain only the lines that have repeated values in the 4th column. Therefore, my desired output would be this one:
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty
How can I perform this task?
I can identify and isolate the column of interest with:
awk -F"," '{print $4}' file1 > file1_temp
and then check if there are repeated values and how many with:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' file1_temp
but that's definitely not what I would like to do.
A simple way to preserve the ordering would be to run through the file twice. The first time, keep a record of the counts, then print the ones with a count greater than 1 on the second pass:
awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
If you prefer not to loop through the file twice, you can keep track of things in a few arrays and do the printing in the END block:
awk '{ line[NR] = $0; col[NR] = $4; ++count[$4] }
END { for (i = 1; i <= NR; ++i) if (count[col[i]] > 1) print line[i] }' file
Here line stores the contents of the whole line, col stores the fourth column and count does the same as before.

How to extract every N columns and write into new files?

I've been struggling to write code for extracting every N columns from an input file and writing them into output files according to their extraction order.
(My real-world case is to extract every 800 columns from a 24005-column file, starting at column 6, so I need a loop.)
In the simpler case below, I extract every 3 columns (fields) from an input file, starting at the 2nd column.
for example, if the input file looks like:
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
and I want the output to look like this:
output_file_1:
1 2 3
1 2 3
1 2 3
1 2 3
output_file_2:
4 5 6
4 5 6
4 5 6
4 5 6
output_file_3:
7 8 9
7 8 9
7 8 9
7 8 9
I tried this, but it doesn't work:
awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>
It gave me a syntax error, and the more I fixed, the more problems came out.
I also tried the Linux command cut, but it seemed impractical while I was dealing with large files, and I wonder if cut could do a looped cut of every 3 fields just like the awk.
Can someone please help me with this and give a quick explanation? Thanks in advance.
Actions to be performed by awk on the input data must be enclosed in curly braces, so the reason the awk one-liner you tried results in a syntax error is that the for loop does not respect this rule. A syntactically correct version would be:
awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>
This is syntactically correct (almost; see the end of this post), but does not do what you think.
To separate the output columns into different files, the best thing is to use the awk redirection operator >. This will give you the desired output, given that your input file always has 10 columns:
awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>
Mind the double quotes used to specify the filenames.
EDITED: REAL WORLD CASE
If you have to loop along the columns because you have too many of them, you can still use awk (gawk), with two loops: one on the output files and one on the columns per file. This is a possible way:
#!/usr/bin/gawk -f
BEGIN{
CTOT = 24005 # total number of columns, you can use NF as well
DELTA = 800 # columns per file
START = 6 # first useful column
d = CTOT/DELTA # number of output files.
}
{
for ( i = 0 ; i < d ; i++)
{
for ( j = 0 ; j < DELTA ; j++)
{
printf("%f\t",$(START+j+i*DELTA)) > "file_out_"i
}
printf("\n") > "file_out_"i
}
}
I have tried this on the simple input file in your example. It works if CTOT can be divided evenly by DELTA. I assumed you had floats (%f); just change that to whatever you need.
Let me know.
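Since the data columns actually start at column 6, the file count is really (CTOT - START + 1) / DELTA; rounding up covers the case where it does not divide evenly. A sketch of an adjusted BEGIN block (an untested refinement based on the numbers in the question):
BEGIN{
  CTOT  = 24005                     # total number of columns
  DELTA = 800                       # columns per file
  START = 6                         # first useful column
  d = (CTOT - START + 1) / DELTA    # 24000 / 800 = 30 output files
  if (d != int(d)) d = int(d) + 1   # round up when it does not divide evenly
}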
P.S. Going back to your original one-liner, note that the loop is an infinite one, as i is never incremented: i+a must be replaced by i+=a, and a=3 must go inside the inner braces:
awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>
This evaluates a=3 on every iteration, which is a bit pointless. A better version would thus be:
awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>
Still, this will just print the 2nd, 5th and 8th column of your file, which is not what you wanted.
awk '{ print $2, $3, $4 >"output_file_1";
print $5, $6, $7 >"output_file_2";
print $8, $9, $10 >"output_file_3";
}' input_file
This makes one pass through the input file, which is preferable to multiple passes. Clearly, the code shown only deals with the fixed number of columns (and therefore a fixed number of output files). It can be modified, if necessary, to deal with variable numbers of columns and generating variable file names, etc.
(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)
In that case, you're correct; you need a loop. In fact, you need two loops:
awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
{
for (i = start; i < start + gap; i++)
{
file = sprintf("%s%d", filebase, i);
for (j = i; j <= NF; j += gap)
printf("%s ", $j) > file;
printf "\n" > file;
}
}' input_file
I demonstrated this to my satisfaction with an input file with 25 columns (numbers 1-25 in the corresponding columns) and gap set to 8 and start set to 2. The output below is the resulting 8 files pasted horizontally.
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24 9 17 25
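For reference, a test file like the one described (25 columns per row; 4 rows here, though the row count is arbitrary) can be generated with something like:
$ awk 'BEGIN{for(r=1;r<=4;r++){for(c=1;c<=25;c++)printf "%d%s",c,(c<25?" ":"\n")}}' > input_file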
With GNU awk:
$ awk -v d=3 '{for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3",""); print "----"}' file
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
Just redirect the output to files if desired:
$ awk -v d=3 '{sfx=0; for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3","") > ("output_file_" ++sfx)}' file
The idea is just to tell gensub() to skip the first few (i-1) fields, then print the number of fields you want (d = 3), and ignore the rest (.*). If you're not printing exact multiples of the number of fields, you'll need to massage how many fields get printed on the last loop iteration. Do the math...
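For instance, one way to shrink the window on the last iteration (a sketch along the same lines, with the same GNU-awk gensub() requirement):
$ awk -v d=3 '{for(i=2;i<=NF;i+=d){n=(i+d-1<=NF?d:NF-i+1); print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" n "}).*","\\3","")}}' file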
Here's a version that'd work in any awk. It requires 2 loops and modifies the spaces between fields but it's probably easier to understand:
$ awk -v d=3 '{sfx=0; for(i=2;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) {str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file
I was successful using the following command line. :) It uses a for loop and pipes the awk program into its stdin using -f -. The awk program itself is created using bash variable math.
for i in 0 1 2; do
echo "{print \$$((i*3+2)) \" \" \$$((i*3+3)) \" \" \$$((i*3+4))}" \
| awk -f - t.file > "file$((i+1))"
done
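For example, at i=0 the echo generates this program for awk to read from stdin:
{print $2 " " $3 " " $4}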
Update: After the question was updated, I tried to hack a script that creates the requested 800-column awk script dynamically (a version along the lines of Jonathan Leffler's answer) and pipes it to awk. Although the generated script looks good (to me), it produces an awk syntax error. The question is: is this too much for awk, or am I missing something? I would really appreciate feedback!
Update: I investigated this and found documentation saying that awk has a lot of restrictions and that gawk (GNU's awk implementation) should be used in these situations. I've done that, but I still get a syntax error. Feedback still appreciated!
#!/bin/bash
# Note! Although the script's output looks ok (for me)
# it produces an awk syntax error. is this just too much for awk?
# open pipe to stdin of awk
exec 3> >(gawk -f - test.file)
# verify output using cat
#exec 3> >(cat)
echo '{' >&3
# write dynamic script to awk
for i in {0..24005..800} ; do
echo -n " print " >&3
for (( j=$i; j <= $((i+800)); j++ )) ; do
echo -n "\$$j " >&3
if [ $j = 24005 ] ; then
break
fi
done
echo "> \"file$((i/800+1))\";" >&3
done
echo "}"

Slice 3TB log file with sed, awk & xargs?

I need to slice several TB of log data, and would prefer the speed of the command line.
I'll split the file up into chunks before processing, but need to remove some sections.
Here's an example of the format:
uuJ oPz eeOO 109 66 8
uuJ oPz eeOO 48 0 221
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 2 9 771
mxmx lo uUui 577 765 27878456
The gaps between the first 3 alphanumeric strings are spaces. Everything after that is tabs. Lines are separated with \n.
I want to keep only the last line in each group.
If there's only 1 line in a group, it should be kept.
Here's the expected output:
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 577 765 27878456
How can I do this with sed, awk, xargs and friends, or should I just use something higher level like Python?
awk -F '\t' '
NR==1 {key=$1}
$1!=key {print line; key=$1}
{line=$0}
END {print line}
' file_in > file_out
With FS set to a tab, $1 is everything before the first tab, i.e. the three space-separated strings that identify a group; the script remembers each group's most recent line, prints it whenever the key changes, and prints the final group's line in the END block.
Try this:
awk 'BEGIN{FS="\t"}
{if($1!=prevKey) {if (NR > 1) {print lastLine}; prevKey=$1} lastLine=$0}
END{print lastLine}'
It saves the last line and prints it only when it notices that the key has changed.
This might work for you (GNU sed):
sed ':a;$!N;/^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/;ta;P;D' file
It appends the next line and, while the two lines share the same first three fields, deletes the earlier of them, so only the last line of each group survives.
