Awk to read file as a whole - linux

Suppose a file has the following content:
abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
In general, when awk performs an operation, it iterates over the input line by line and applies the action to each line.
For example:
awk '{print substr($0,8,10)}' file
O/P:
hijklmn
wxyzabc
klmnopq
I would like to know an approach in which the entire content of the file is treated as a single record so that awk prints just one output.
Example Desired O/P:
hijklmnpqr
It's not that I specifically want the desired output above; in general I would appreciate any suggested approach that gives awk the content of a file as a whole.

This is a gawk solution
From the docs:
There are times when you might want to treat an entire data file as a single record.
The only way to make this happen is to give RS a value that you know doesn’t occur in the input file.
This is hard to do in a general way, such that a program always works for arbitrary input files.
$ cat file
abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
RS must be set to a pattern that is not present in the file, following Denis Shirokov's suggestion in the docs (thanks @EdMorton):
$ gawk '{print ">>>"$0"<<<<"}' RS='^$' file
>>>abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
<<<<
The trick is explained in the docs:
It works by setting RS to ^$, a regular expression that will never
match if the file has contents. gawk reads data from the file into
tmp, attempting to match RS. The match fails after each read, but fails quickly, such that gawk fills tmp with the entire contents of the file
So:
$ gawk '{gsub(/\n/,"");print substr($0,8,10)}' RS='^$' file
Returns:
hijklmnpqr
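A quick way to convince yourself that the whole file really arrives as a single record (a hedged check, reusing the same sample file):
$ gawk 'END{print NR}' RS='^$' file
1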

With GNU awk for multi-char RS (best approach):
$ awk -v RS='^$' '{print substr($0,8,10)}' file
hijklmn
pq
With other awks if your input can't contain NUL characters:
$ awk -v RS='\0' '{print substr($0,8,10)}' file
hijklmn
pq
With other awks otherwise:
$ awk '{rec = rec $0 ORS} END{print substr(rec,8,10)}' file
hijklmn
pq
Note that none of those produce the output you say you wanted:
hijklmnpqr
because they do what you say you wanted (a newline is just another character in your input file, nothing special):
"read file as a whole"
To get the output you say you want would require removing all newlines from the file first. You can do that with gsub(/\n/,"") or various other methods such as:
$ awk '{rec = rec $0} END{print substr(rec,8,10)}' file
hijklmnpqr
if that's really what you want.
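Another of those methods (a hedged alternative) is to strip the newlines before awk ever sees the data, for example with tr:
$ tr -d '\n' < file | awk '{print substr($0,8,10)}'
hijklmnpqr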

Related

How to truncate rest of the text in a file after finding a specific text pattern, in unix?

I have an HTML page which I extracted in Unix using the wget command. After the word "Check list" I need to remove all of the text, and with the remaining text I am trying to grep some data. I cannot think of a way to remove the text after a keyword. If I do
s/Check list.*//g
it just removes the rest of that line; I want everything below that to be gone as well. How do I do this?
The other solutions you have so far require tools that POSIX does not mandate (GNU sed, GNU awk, or perl), so YMMV on their availability, and they read the whole file into memory at once.
These will work in any awk in any shell on every Unix box and only read 1 line at a time into memory:
awk -F 'Check list' '{print $1} NF>1{exit}' file
or:
awk 'sub(/Check list.*/,""){f=1} {print} f{exit}' file
With GNU awk for multi-char RS you could do:
awk -v RS='Check list' '{print; exit}' file
but that would still read all of the text before Check list into memory at once.
Depending on which sed version you have, maybe
sed -z 's/Check list.*//'
The /g flag is useless as you only want to replace everything once.
If your sed does not have the -z option (which says to use the ASCII null character as line terminator instead of newline; this hinges on your file not containing any actual nulls, but that should trivially be true for any text file), try Perl:
perl -0777 -pe 's/Check list.*//s'
Unlike sed -z, this explicitly says to slurp the entire file into memory (the argument to -0 is the octal character code of a terminator character, but 777 is not a valid terminator character at all, so it always reads the entire file as a single "line") so this works even if there are spurious nulls in your file. The final s flag says to include newline in what . matches (otherwise s/.*// would still only substitute on the matching physical line).
I assume you are aware that removing everything will violate the integrity of the HTML file; it needs there to be a closing tag for every start tag near the beginning of the document (so if it starts with <html><body> you should keep </body></html> just before the end of the file, for example).
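A hedged sketch of restoring well-formedness after the cut, assuming the page really opens with <html><body> (the file names page.html and trimmed.html are just placeholders):
{ sed -z 's/Check list.*//' page.html; printf '</body></html>\n'; } > trimmed.html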
With awk you could make use of the RS variable, set the field separator to a regex with word boundaries, and then print the very first field.
awk -v RS="^$" -v FS='\\<check_list\\>' '{print $1}' Input_file
You might use q to instruct GNU sed to quit, thus ending processing. Consider the following simple example: let file.txt content be
123
456
789
and say you want to jettison everything beyond 5, then you could do
sed '/5/{s/5.*//;q}' file.txt
which gives output
123
4
Explanation: for the line containing 5, substitute 5 and everything beyond it with the empty string (i.e. delete it), then q. Note that lowercase q is used so the altered line is printed before quitting.
(tested in GNU sed 4.7)
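Applied to the original question's keyword, the same idea would look like this (a hedged adaptation, not tested against the real page):
sed '/Check list/{s/Check list.*//;q}' file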

Rename file as third word on it (bash)

I have several autogenerated files (an example first line is shown below) and I want to rename each according to the 3rd word of its first line (in this case, the file would become 42.txt).
First line:
ligand CC##HOc3ccccc3 42 P10000001
Is there a way to do it?
Say you have file.txt containing:
ligand CC##HOc3ccccc3 42 P10000001
and you want to rename file.txt to 42.txt based on the 3rd field in the file.
Using awk
The easiest way is simply to use mv with awk in a command substitution, e.g.:
mv file.txt $(awk 'NR==1 {print $3; exit}' file.txt).txt
Where the command substitution $(...) is just the awk expression awk 'NR==1 {print $3; exit}', which outputs the 3rd field (e.g. 42). Specifying NR==1 ensures only the first line is considered, and exit at the end of that rule ensures no more lines are processed, which would waste time if file.txt were a 100000-line file.
Confirmation
file.txt is now renamed 42.txt, e.g.
$ cat 42.txt
ligand CC##HOc3ccccc3 42 P10000001
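Since the question mentions several autogenerated files, here is a hedged loop sketch along the same lines (it assumes the files all match *.txt and that the 3rd field of the first line is never empty):
for f in *.txt; do
    new=$(awk 'NR==1{print $3; exit}' "$f")
    [ -n "$new" ] && mv -- "$f" "$new.txt"
done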
Using read
You can also use read to simply read the first line and take the 3rd word as the name there and then mv the file, e.g.
$ read -r a a name a <file.txt; mv file.txt "$name".txt
The temporary variable a above is just used to read and discard the other words in the first line of the file.

split and write the files with AWK - Bash

INPUT_FILE.txt in c:\Pro\usr\folder1
ABCDEFGH123456
ABCDEFGH123456
ABCDEFGH123456
BBCDEFGH123456
BBCDEFGH123456
I used the below awk command in a .sh script, which runs from c:\Pro\usr\folder2, to split the file into multiple txt files (with a _kg suffix) based on the first 8 characters.
awk '{ F=substr($0,1,8) "_kg" ".txt"; print $0 >> F; close(F) }' "c:\Pro\usr\folder1\input_file.txt"
This is working well, but the files are written to the location the bash script runs from. How can I route the created files to another location like c:\Pro\usr\folder3?
Thanks
The following awk code may help; it was written and tested against the shown samples in GNU awk.
awk -v outPath='c:\\Pro\\usr\\folder3' -v FPAT='^.{8}' '{outFile=($1"_kg.txt");outFile=outPath"\\"outFile;print > (outFile);close(outFile)}' Input_file
Explanation: Create an awk variable named outPath holding the path mentioned by the OP. Then set FPAT (which defines fields by a regex) so that the first field is the first 8 characters of each line. In the main program, build the outFile variable holding the output file name (1st field followed by _kg.txt), print the whole line to that file, and close the file each time to avoid a "too many open files" error.
Pass the destination folder as a variable to awk:
awk -v dest='c:\\Pro\\usr\\folder3\\' '{F=dest substr($0,1,8) "_kg" ".txt"; print $0 >> F; close(F) }' "c:\Pro\usr\folder1\input_file.txt"
I think the doubled backslashes are required so that awk does not interpret them as escape sequences.
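If the environment accepts forward slashes in paths (Cygwin and Git Bash typically do, though that is an assumption about the setup here), a hedged alternative sidesteps the escaping question entirely:
awk -v dest='c:/Pro/usr/folder3/' '{F=dest substr($0,1,8) "_kg.txt"; print $0 >> F; close(F)}' "c:\Pro\usr\folder1\input_file.txt"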

How to do something like grep -B to select only one line?

Everything is in the title. Basically, let's say I have this text:
some text lalala
another line
much funny wow grep
I grep funny and I want my output to be "lalala"
Thank you
One possible answer is to use either ed or ex to do this (it is trivial in them):
ed - yourfile <<< 'g/funny/.-2p'
(Or replace ed with ex. You might have red, the restricted editor, too; it can't modify files.) This looks for the pattern /funny/ globally, and whenever it is found, prints the line 2 before the matching line (that's the .-2p part). Or, if you want the most recent line containing 'lalala' before the line matching 'funny':
ed - yourfile <<< 'g/funny/?lalala?p'
The only problem is if you're trying to process standard input rather than a file; then you have to save the standard input to a file and process that file, which spoils the concurrency.
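If stdin is all you have, a hedged workaround is to spool it to a temporary file first (mktemp here is just one way to do that):
tmp=$(mktemp) || exit 1
cat > "$tmp"
ed - "$tmp" <<< 'g/funny/.-2p'
rm -f "$tmp"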
You can't do negative offsets in sed (though GNU sed allows you to do positive offsets, so you could use sed -n '/lalala/,+2p' file to get the 'lalala' to 'funny' lines (which isn't quite what you want) based on finding 'lalala', but you cannot find the 'lalala' lines based on finding 'funny'). Standard sed does not allow offsets at all.
If you need to print just the IP address found on a line 8 lines before the pattern-matching line, you need a slightly more involved ed script, but it is still doable:
ed - yourfile <<< 'g/funny/.-8s/.* //p'
This uses the same basic mechanism to find the right line, then runs a substitute command to remove everything up to the last space on the line and print the modified version. Since there isn't a w command, it doesn't actually modify the file.
Since grep -B prints full lines (all of the lines before the match, not just the word you want), you'll have to pipe the output into something like grep or Awk.
grep -B 2 "funny" file|awk 'NR==1{print $NF; exit}'
You could also just use Awk.
awk -v s="funny" '/[[:space:]]lalala$/{n=NR+2; o=$NF}NR==n && $0~s{print o}' file
For the specific example of an IP address 8 lines before the match as mentioned in your comment:
awk -v s="funny" '
/[[:space:]][0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ {
n=NR+8
ip=$NF
}
NR==n && $0~s {
print ip
}' file
These Awk solutions first find the output field you might want, then print the output only if the word you want exists in the nth following line.
Here's an attempt at a slightly generalized Awk solution. It maintains a circular queue of the last q lines and prints the line at the head of the queue when it sees a match.
#!/bin/sh
: ${q=8}
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $0 }
$0 ~ e { print m[((NR+1)%q)+1] }' "${#--}"
Adapting to a different default (I set it to 8) or proper option handling (currently, you'd run it like q=3 ./qgrep regex file) as well as remembering (and hence printing) the entire line should be easy enough.
(I also didn't bother to make it work correctly if you see a match in the first q-1 lines. It will just print an empty line then.)
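For instance, here is a hedged sketch of the option-handling adaptation, using getopts (the -q flag is an assumed interface, not part of the original):
#!/bin/sh
q=8                                   # default queue depth
while getopts q: opt; do
    case $opt in
        q) q=$OPTARG ;;
        *) echo "usage: $0 [-q lines] regex [file...]" >&2; exit 2 ;;
    esac
done
shift $((OPTIND - 1))
e=$1
shift
awk -v q="$q" -v e="$e" '{ m[(NR%q)+1] = $NF }
    $0 ~ e { print m[((NR+1)%q)+1] }' "${@--}"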

Issues with the AWK function

Does Awk have a limit to the amount of data it can process?
for i in "052" "064" "060" "070" "074" "076" "178"
do
awk -v f="${i}" -F, 'match ($1,f) { print $2","$3 }' uls.csv > ul$i.csv
awk -v f="${i}" -F, 'match ($1,f) { print $2","$3 }' dls.csv > dl$i.csv
awk -v n="${i}" -F, 'match ($1,n) { print $2","$3 }' dlsur.csv >> dlu$i.csv
awk -v k="${i}" -F, 'match ($1,k) { print $2","$3 }' dailyd.csv >> dla$i.csv
awk -v m="${i}" -F, 'match ($1,m) { print $2","$3 }' dailyu.csv >> ula$i.csv
done
When I run that piece of code, it basically pulls data from csv files and creates new files.
That piece of code works perfectly.
But when I add an extra value to the for loop, for example "180", it will create that file but will also include a few lines of data from other files. I went over the code many times. I even checked the raw data before it goes into this loop, and it is all correct. This seems like a glitch in awk.
Do I need to apply a wait function so that it can catch up?
Also something like
for file in uls dls dlsur dailyd dailyu; do
awk -F, -v OFS=, -v file="$file" '$1 ~ /052|064|060|070|074|076|178/ {print $2,$3 >> (file $1 ".csv")}' "$file.csv"
done
is probably better if it does what you want. Many fewer invocations of awk and loops through your files. (Slightly different output file names. That would be fixable but complicate the script a bit more than I thought was necessary for the purpose.)
No. What you say you think is happening cannot be happening - awk WILL NOT randomly pull data from unspecified files and put it in its output stream.
Note that in your 3rd and subsequent lines you are using '>>' instead of '>' for your output redirection - have you accounted for that?
If you update your question (i.e. do NOT try to do it in a comment!) to tell us what you're trying to do with some representative sample input and expected output (just 2 input files, not 5, should be enough to explain your problem), we can help you write a correct script to do that.
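If leftover output from earlier runs is the culprit, a hedged fix is to clear those files before the loop runs again (or change the last three redirections to '>', matching the first two):
rm -f dlu*.csv dla*.csv ula*.csv   # discard whatever earlier runs appended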

Resources