Unfolding a file on Linux

I have a huge text file on Linux, approximately 400,000 lines, each 80 characters wide.
I need to "unfold" the file, merging every four lines into one,
ending up with 1/4 of the lines, each 80*4 characters long.
Any suggestions?

perl -pe 'chomp if (++$i % 4);'

An easier way to do it with awk would be:
awk '{ printf "%s", $0 } NR % 4 == 0 { print "" }' filename
If you want to make sure the output still ends with a newline when the line count is not a multiple of four, it gets a little more complicated:
awk '{ printf "%s", $0 } NR % 4 == 0 { print "" } END { if (NR % 4 != 0) print "" }' filename

I hope I understood your question correctly. You have input like this (except that your lines are longer):
abcdef
ghijkl
mnopqr
stuvwx
yz0123
456789
ABCDEF
You want output like this:
abcdefghijklmnopqrstuvwx
yz0123456789ABCDEF
The following awk program should do it:
{ line = line $0 }
(NR % 4) == 0 { print line; line = "" }
END { if (line != "") print line }
Run it like this:
awk -f merge.awk data.txt
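As a rough sketch, the same idea can be wrapped in a small shell function so the group size is not hard-coded. The function name unfold and the variable n are made up for this illustration; only POSIX awk features are assumed.

```shell
# unfold N: join every N input lines into one line.
# (unfold and n are hypothetical names used only for this sketch.)
unfold() {
    awk -v n="$1" '
        { printf "%s", $0 }                  # emit the line without its newline
        NR % n == 0 { print "" }             # close the group after every n-th line
        END { if (NR % n != 0) print "" }    # terminate a short final group
    '
}

# The sample input from above, merged four lines at a time.
printf '%s\n' abcdef ghijkl mnopqr stuvwx yz0123 456789 ABCDEF | unfold 4
```

For the real file this would be called as unfold 4 < data.txt > unfolded.txt.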

Related

awk print output on same line

I am counting nucleotides in the contigs of a fasta file. My file looks like
>1
ATACCTACTA
ATTTACGTCA
GTA
>2
ATATTCGTAT
GTCTCGATCT
A
>3
etc.
My command is
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0; } { seqlen += length($0)}END{print seqlen}'
The output is now like
>1
23
>2
21
How can I get the output on the same line, like this:
>1 23
>2 21
With a few more changes, voilà (thanks to @Ed Morton):
awk '/^>/ {if(seqlen)print k,seqlen; seqlen=0; k=$0; next;} { seqlen += length($0);}END{print k,seqlen;}' filename
This one works for me:
awk '/^>/ && NR>1 { printf " %d\n", x; x = 0 } /^>/ { printf "%s", $0 } !/^>/ { x += length($0) } END { if (NR) printf " %d\n", x }' file
I hope it works now as expected.
Try:
awk '/^>/{printf("%s ",$0);getline;printf("%s\n",length($0))}' Input_file
This checks whether a line starts with >, prints that line, then uses getline to move to the next line and prints that line's length followed by a newline. Input_file is the file to read. Note that this only counts the first sequence line after each header.
EDIT:
awk '/^>/{if(VAL){print Q OFS VAL;Q=VAL="";Q=$0;next};Q=$0;next} {VAL=VAL?VAL+length($0):length($0)} END{print Q,VAL}' Input_file
This checks whether a line starts with >. If VAL is already set, it prints the saved header Q and the accumulated VAL, resets VAL, stores the new header in Q, and uses next to skip the remaining rules; otherwise it just stores the header in Q and skips ahead. For every other line, VAL accumulates the line's length. The END section prints the final values of Q and VAL.
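For reference, here is a minimal sketch of the header-plus-total approach run against the sample from the question. The function name seqlens and the variables hdr and len are hypothetical; unlike the test on VAL above, it flushes on the header being non-empty, so a header with no sequence lines is still printed with a length of 0.

```shell
# seqlens: print each FASTA header followed by the total length of its sequence lines.
# (seqlens, hdr and len are names invented for this sketch.)
seqlens() {
    awk '
        /^>/ {                                  # a header line starts a new record
            if (hdr != "") print hdr, len       # flush the previous record, if any
            hdr = $0; len = 0
            next
        }
        { len += length($0) }                   # sequence line: add its length
        END { if (hdr != "") print hdr, len }   # flush the last record
    '
}

printf '%s\n' '>1' ATACCTACTA ATTTACGTCA GTA '>2' ATATTCGTAT GTCTCGATCT A | seqlens
```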

parse text vertical to horizontal

I'm looking to parse the following data:
T
E
S
T
_
7
TTTTTTT
EEEEEEE
SSSSSSS
TTTTTTT
_______
5679111
    012
into something like:
TEST_7
TEST_5, TEST_6, TEST_7, TEST_9, TEST_10, TEST_11, TEST_12
Any suggestions would help. Thanks.
awk to the rescue!
This is basically a transpose operation
awk 'BEGIN {FS=""}
     {for(i=1;i<=NF;i++) a[NR,i]=$i;
      if(max<NF) max=NF}
     END {for(i=1;i<=max;i++)
            {for(j=1;j<=NR;j++) printf "%s",a[j,i];
             print ""}}' file
TEST_7TEST_5
TEST_6
TEST_7
TEST_9
TEST_10
TEST_11
TEST_12
You need to explain the rules for transforming this into your desired layout.
Python:
#!/usr/bin/env python3
txt='''\
T
E
S
T
_
7
TTTTTTT
EEEEEEE
SSSSSSS
TTTTTTT
_______
5679111
    012'''
# Pad every line to the width of the longest one, then transpose with zip.
row_len=max(len(line.rstrip()) for line in txt.splitlines())
arr=[list('{:{w}}'.format(line.rstrip(), w=row_len)) for line in txt.splitlines()]
print('\n'.join(''.join(t) for t in zip(*arr)))
Or, awk:
awk 'BEGIN{RS="[ ]*\n"}
     {lines[NR]=$0
      max=length($0)>max ? length($0) : max }
     END{ for (i=1; i in lines; i++)
              lines[i]=sprintf("%-*s", max, lines[i])
          for (i=1;i<=max; i++){
              for (j=1; j in lines; j++)
                  printf "%s", substr(lines[j], i, 1)
              print ""
          }
     }' file
Prints:
TEST_7TEST_5
TEST_6
TEST_7
TEST_9
TEST_10
TEST_11
TEST_12
In awk (well GNU awk for -F ''):
$ awk -F '' '
NR!=1 && NF!=p {
    for(i=1;i<=p;i++)
        printf "%s%s",a[i],(i==p?ORS:"")
    delete a
    p=NF }
NR==1 || NF==p {
    for(i=1;i<=NF;i++)
        a[i]=a[i] $i
    p=NF
    j++ }
END {
    for(i=1;i<=p;i++)
        printf "%s%s",a[i],(i==p?ORS:", ") }
' file
TEST_7
TEST_5 , TEST_6 , TEST_7 , TEST_9 , TEST_10, TEST_11, TEST_12
It detects when the record length (actually NF) changes and prints the buffered columns at that point.
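A possible portable variant of the same idea, as a sketch only: it uses substr() instead of an empty field separator, strips the padding spaces, and joins the columns of each block with ", ". It assumes that a block is a run of lines of equal length and that the last input line is padded with four leading spaces so that 0, 1 and 2 sit under the last three columns. The names transpose_blocks, flush, width and col are invented for this example.

```shell
# transpose_blocks: transpose each run of equal-length lines into a comma-separated row.
# (All names here are hypothetical; block detection by line length is an assumption.)
transpose_blocks() {
    awk '
        function flush(   i, out) {
            for (i = 1; i <= width; i++) {
                gsub(/ /, "", col[i])                 # drop padding spaces
                out = out (i > 1 ? ", " : "") col[i]
                col[i] = ""                           # reset for the next block
            }
            if (width) print out
        }
        length($0) != width { flush(); width = length($0) }   # line length changed: new block
        { for (i = 1; i <= width; i++) col[i] = col[i] substr($0, i, 1) }
        END { flush() }
    '
}

printf '%s\n' T E S T _ 7 TTTTTTT EEEEEEE SSSSSSS TTTTTTT _______ 5679111 '    012' |
    transpose_blocks
```

With that input it should print TEST_7 on the first line and the seven TEST_n names, separated by ", ", on the second.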

Using `awk` to print number of lines in file in the BEGIN section

I am trying to write an awk script that, before anything else is done, tells the user how many lines are in the file. I know how to do this in the END section but am unable to do so in the BEGIN section. I have searched SE and Google but have only found a half dozen ways to do this in the END section or as part of a bash script, not how to do it before any processing has taken place. I was hoping for something like the following:
#!/usr/bin/awk -f
BEGIN{
print "There are a total of " **TOTAL LINES** " lines in this file.\n"
}
{
if($0==4587){print "Found record on line number "NR; exit 0;}
}
But have been unable to determine how to do this, if it is even possible. Thanks.
You can read the file twice:
awk 'NR!=1 && FNR==1 {print NR-1} <some more code here>' file{,}
In your example:
awk 'NR!=1 && FNR==1 {print "There are a total of "NR-1" lines in this file.\n"} $0==4587 {print "Found record on line number "NR; exit 0;}' file{,}
You can write file file instead of file{,}; the brace expansion just makes the shell pass the same filename twice.
NR!=1 && FNR==1 is true only on the first line of the second pass over the file.
To use an awk script containing:
#!/usr/bin/awk -f
NR!=1 && FNR==1 {
print "There are a total of "NR-1" lines in this file.\n"
}
$0==4587 {
print "Found record on line number "NR; exit 0
}
call:
awk -f myscript file{,}
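A minimal sketch of the two-pass idea, assuming a single non-empty input file: the first pass only counts (every other rule is skipped with next), and the pattern rules fire on the second pass, where FNR gives the line number within the file. The function name count_then_scan and the variable total are made up for this illustration.

```shell
# count_then_scan FILE: read FILE twice, first to count lines, then to process them.
# (count_then_scan and total are hypothetical names.)
count_then_scan() {
    awk '
        NR == FNR  { total = NR; next }     # first pass: count only, skip other rules
        FNR == 1   { print "There are a total of " total " lines in this file." }
        $0 == 4587 { print "Found record on line number " FNR; exit 0 }
    ' "$1" "$1"
}

tmp=$(mktemp)
printf '%s\n' a 4587 c > "$tmp"
count_then_scan "$tmp"
rm -f "$tmp"
```

With an empty file neither pass reads a record, so nothing is printed at all.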
To do this robustly and for multiple files you need something like:
$ cat tst.awk
BEGINFILE {
    numLines = 0
    while ( (getline line < FILENAME) > 0 ) {
        numLines++
    }
    print "----\nThere are a total of", numLines, "lines in", FILENAME
}
$0==4587 { print "Found record on line number", FNR, "of", FILENAME; nextfile }
$
$ cat file1
a
4587
c
$
$ cat file2
$
$ cat file3
d
e
f
4587
$
$ awk -f tst.awk file1 file2 file3
----
There are a total of 3 lines in file1
Found record on line number 2 of file1
----
There are a total of 0 lines in file2
----
There are a total of 4 lines in file3
Found record on line number 4 of file3
The above uses GNU awk for BEGINFILE. Any other solution is difficult to implement such that it will handle empty files (you need an array to track files being parsed and print info in the FNR==1 and END sections after the empty file has been skipped).
Using getline has caveats and should not be used lightly, see http://awk.info/?tip/getline, but this is one of the appropriate and robust uses of it. You can also test for non-readable files in BEGINFILE by testing ERRNO and skipping the file (see the gawk manual) - that situation will cause other scripts to abort.
BEGIN {
    s="cat your_file.txt|wc -l";
    s | getline file_size;
    close(s);
    print file_size
}
This will put the number of lines in the file named your_file.txt into the awk variable file_size and print it out.
If your file name is dynamic you can pass the filename on the commandline and change the script to use the variable.
E.g. my.awk
BEGIN {
    s="cat "VAR"|wc -l";
    s | getline file_size;
    close(s);
    print file_size
}
Then you can call it like this:
awk -v VAR="your_file.txt" -f my.awk
If you use GNU awk and need a robust, generic solution that accommodates multiple, possibly empty input files, use Ed Morton's solution.
This answer uses portable (POSIX-compliant) code. Within the constraints noted, it is robust, but Ed's GNU awk solution is both simpler and more robust.
Tip of the hat to Ed Morton for his help.
With a single input file, it is simpler to handle line counting with a shell command in the BEGIN block, which has the following advantages:
on invocation, the filename doesn't have to be specified twice, unlike in the accepted answer
Also note that the accepted answer doesn't work as intended (as of this writing); the correct form is (see the comments on the answer for an explanation):
awk 'NR==FNR {next} FNR==1 {print NR-1} $0==4587 {print "Found record on line number "FNR; exit 0}' file{,}
the solution also works with an empty input file.
In terms of performance, this approach is either only slightly slower than reading the file twice in awk, or even a little faster, depending on the awk implementation used:
awk '
BEGIN {
# Execute a shell command to count the lines and read
# result into an awk variable via <cmd> | getline <varname>.
# If the file cannot be read, abort. (The shell has already printed an error msg.)
cmd="wc -l < \"" ARGV[1] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
printf "There are a total of %s lines in this file.\n\n", count
}
$0==4587 { print "Found record on line number " NR; exit 0 }
' file
Assumptions:
The filename is passed as the 1st operand (non-option argument) on the command line, accessed as ARGV[1].
The filename doesn't contain embedded " chars.
The following solutions deal with multiple files and make analogous assumptions:
All operands passed are filenames. That is, all arguments after the program must be filenames, and not variable assignments such as var=value.
No filename contains embedded " chars.
No processing is to take place if any of the input files do not exist or cannot be read.
It's not hard to generalize this to handling multiple files, but the following solution doesn't print the line count for empty files:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 { printf "There are a total of %s lines in file %s.\n\n", counts[FILENAME], FILENAME }
# $0==4587 { printf "%s: Found record on line number %d\n", FILENAME, FNR; exit 0 }
' file1 file2 # ...
Things get a little trickier if you want the line count to be printed for empty files also:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
fileCount = ARGC - 1
fmtStringCount = "There are a total of %s lines in file %s.\n\n"
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 {
++fileIndex
# If there were intervening empty files, print their counts too.
while (ARGV[fileIndex] != FILENAME) {
printf fmtStringCount, 0, ARGV[fileIndex++]
}
printf fmtStringCount, counts[FILENAME], FILENAME
}
# Process input lines
$0==4587 { printf "%s: Found record on line number %d\n", FILENAME, FNR; exit 0 }
# If there are any remaining empty files at the end, print their counts too.
END {
while (fileIndex < fileCount) { printf fmtStringCount, 0, ARGV[++fileIndex] }
}
' file1 file2 # ...
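You can get the number of lines with wc and pass it to awk with the -v option; the variable is then available in the awk script, including its BEGIN block. Redirecting the file into wc makes it print only the count, so no cut is needed, and FNC + 0 drops any padding some wc implementations add:
awk -v FNC="$(wc -l < awk.txt)" \
'BEGIN { print "FNC: " (FNC + 0) } { print $0 }' awk.txt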

Awk between two patterns with pattern in the middle

Hi, I am looking for an awk command that finds two patterns and prints the data between them to
a file, but only if a third pattern appears between them.
For example:
Start
1
2
middle
3
End
Start
1
2
End
And the output will be:
Start
1
2
middle
3
End
On the web I found awk '/patterns1/, /patterns2/' path > text.txt
but I need only the blocks that contain the third pattern in the middle.
And here is a solution without flags:
$ awk 'BEGIN{RS="End"}/middle/{printf "%s", $0; print RT}' file
Start
1
2
middle
3
End
Explanation: The RS variable is the record separator, so we set it to "End", so that each Record is separated by "End".
Then we filter the Records that contain "middle", with the /middle/ filter, and for the matched records we print the current record with $0 and the separator with print RT
This awk should work:
awk '$1=="Start"{ok++} ok>0{a[b++]=$0} $1=="middle"{ok++} $1=="End"{if(ok>1) for(i=0; i<length(a); i++) print a[i]; ok=0;b=0;delete a}' file
Start
1
2
middle
3
End
Expanded:
awk '$1 == "Start" {
ok++
}
ok > 0 {
a[b++] = $0
}
$1 == "middle" {
ok++
}
$1 == "End" {
if (ok > 1)
for (i=0; i<length(a); i++)
print a[i];
ok=0;
b=0;
delete a
}' file
Just use some flags with awk:
/Start/ {
    start_flag=1
    n=NR                # remember the line number where the block starts
}
/middle/ {
    mid_flag=1
}
start_flag {
    lines[NR]=$0
}
/End/ {
    if (start_flag && mid_flag)
        for(i=n;i<=NR;i++)
            print lines[i]
    start_flag=mid_flag=0
    delete lines
}
A modification of user000001's awk:
awk '/middle/{printf "%s%s\n",$0,RT}' RS="End" file
EDIT:
Added test for Start tag
awk '/Start/ && /middle/{printf "%s%s\n",$0,RT}' RS="End" file
This will work with any modern awk:
awk '/Start/{f=1;rec=""} f{rec=rec $0 ORS} /End/{if (rec~/middle/) printf "%s",rec}' file
The solutions that set RS to "End" are gawk-specific, which may be fine but it's definitely worth mentioning.
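For comparison, a rough sketch of a buffer-based variant that avoids RS and RT entirely and should run on any POSIX awk. It also stops buffering once End is seen, so lines between blocks are ignored. The function name blocks_with_middle and the variables inblock and rec are hypothetical.

```shell
# blocks_with_middle: print Start..End blocks only when they contain "middle".
# (blocks_with_middle, inblock and rec are names invented for this sketch.)
blocks_with_middle() {
    awk '
        /Start/ { inblock = 1; rec = "" }        # begin buffering a new block
        inblock { rec = rec $0 ORS }             # collect lines, including Start and End
        /End/ {
            if (inblock && rec ~ /middle/) printf "%s", rec
            inblock = 0                          # ignore lines until the next Start
        }
    '
}

printf '%s\n' Start 1 2 middle 3 End Start 1 2 End | blocks_with_middle
```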

Check variables from different lines with awk

I want to use awk to combine values from multiple lines of different lengths into one line when they match. In the following sample, lines are matched on the first field,
and the values from the second field are aggregated into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (maybe some other shell expression) to check if the first field value match with the next/previous line, and then print a list of second fields values aggregated and separated by a pipe?
awk '
BEGIN {FS=";"}
{ if ($1==prev) {sec=sec "|" $2; }
  else { if (prev) { print prev ";" sec; };
         prev=$1; sec=$2; }}
END { if (prev) { print prev ";" sec; }}' file
This, as you requested, checks the consecutive lines.
Does this one-liner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
Tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
The order in which for (x in a) visits the keys is unspecified; if you want the output sorted, add a | sort at the end.
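If the output should follow the order in which the keys first appear in the input, rather than sorted order, one possible sketch is to record each new key in a second, numerically indexed array. The names group_by_first, order, vals and n are made up for this example.

```shell
# group_by_first: aggregate field 2 by field 1, keeping keys in first-seen order.
# (group_by_first, order, vals and n are hypothetical names.)
group_by_first() {
    awk -F';' '
        !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }   # first time this key is seen
        { vals[$1] = vals[$1] "|" $2 }                           # later occurrence: append
        END { for (i = 1; i <= n; i++) print order[i] ";" vals[order[i]] }
    '
}

printf '%s\n' '222;a;DB;a' '222;b;DB;a' '555;f;DB;a' \
    '4444;a;DB;a' '4444;d;DB;a' '4444;z;DB;a' | group_by_first
```

Unlike the consecutive-line check, this also merges lines with the same key that are not adjacent.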
Slightly convoluted, but does the job:
awk -F';' \
'{
    if (a[$1]) {
        a[$1]=a[$1] "|" $2
    } else {
        a[$1]=$2
    }
}
END {
    for (k in a) {
        print k ";" a[k]
    }
}' file
Assuming that you have set the field separator ( -F ) to ; :
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple if's suffice to fix that.
(Edit: Missed out last line.)
This should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' file
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
