Parse columns with awk - linux

I am new to AWK programming and I was wondering how to filter the following text:
Goedel - Declarative language for AI, based on many-sorted logic. Strongly
typed, polymorphic, declarative, with a module system. Supports bignums
and sets. "The Goedel Programming Language", P. M. Hill et al, MIT Press
1994, ISBN 0-262-08229-2. Goedel 1.4 - partial implementation in SICStus
Prolog 2.1.
ftp://ftp.cs.bris.ac.uk/goedel
info: goedel@compsci.bristol.ac.uk
Just to print this:
Goedel
I have used the following command, but it just does not work as I wished:
awk -F " - " "/ - /{ print $1 }"
It shows the following:
Goedel
1994, ISBN 0-262-08229-2. Goedel 1.4
Could somebody tell me what I have to modify so I can get what I want?
Thanks in advance

awk 'BEGIN { RS = "" } { print $1 }' your_file.txt
which means: split the input into paragraphs at empty lines, then split each paragraph into words by the default separator (whitespace), and finally print the first word ($1) of every paragraph
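For example, with the sample from the question saved as your_file.txt (assuming entries are separated by blank lines), the whole Goedel entry is one record and its first word is the name:
$ awk 'BEGIN { RS = "" } { print $1 }' your_file.txt
Goedel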

this one-liner could work for your requirement:
awk -F ' - ' 'NF>1{print $1;exit}'
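With -F ' - ', NF>1 is only true on lines that actually contain the " - " separator, and exit stops at the first such line, so only the first name is printed.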

awk -F ' - ' '{ if (FNR % 4 != 1) next; print $1; }'
If the format is exactly the same as below, then the code above should work:
1 Author - ...
2 Year ...
3 URL
4 Extra info ...
5 Author - ...
6..N etc.
If there is a blank line between entries, you can set RS to a null string and $1 will be the author as long as the value for -F (the FS variable in an awk script) is the same. This has the advantage that if you don't have "info: ..." or a URL, you can still distinguish between entries, assuming it is not "Author - ...{newline}Year ...{newline}{newline}info: ...{newline}{newline}Author - ..." (you can't have an empty line between parts of an entry if an empty line is what separates entries.) For example:
# A blank line is what separates each entry.
BEGIN { RS = ""; }
{ print $1; }
If you have an awk that supports it, you can make RS a multiple character string if necessary (e.g. RS = "\n--\n" for entries separated by "--" on a line by itself). If you need a regex or simply don't have an awk that supports multiple character record separators, you're forced to use something like the following:
BEGIN { found_sep = 1; }
{ if (found_sep) { print $1; found_sep = 0; } }
# Entry separator is "--\n"
/^--$/ { found_sep = 1; }
More sample input will be required for something more complicated.

Related

Find and replace words using sed command not working

I have a text file which is tab-separated; the first column holds the word to be found and the second column holds the word to replace the found word. This text file contains English and Arabic pairs. Once a word has been found and replaced, it should not be changed again.
For example:
adam a +dam
a b
ال ال+
So for a given text file:
adam played with a ball ال
I expect:
a +dam played with b ball ال+
However, I get:
b +dbm plbyed with b bbll ال+
I am using the following sed command to find and replace:
sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt
How can I fix this issue?
The basic problem with your approach is that you don't want a later substitution to match text already replaced by an earlier one - you don't want to change the a's in a +dam to b's. This makes sed a pretty poor choice - you can fairly easily make a regular expression that matches all of the things you want to replace, but picking which replacement to use is an issue.
A way using GNU awk:
gawk -F'\t' '
FNR == NR { subs[$1] = $2; next } # populate the array of substitutions
ENDFILE {
    if (FILENAME == ARGV[1]) {
        # Build a regular expression of things to substitute
        subre = "\\<("
        first = 0
        for (s in subs)
            subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
        subre = sprintf("%s)\\>", subre)
    }
}
{
    # Do the substitution
    nwords = patsplit($0, words, subre, between)
    printf "%s", between[0]
    for (n = 1; n <= nwords; n++)
        printf "%s%s", subs[words[n]], between[n]
    printf "\n"
}
' tab_sep_file.txt original_file.txt
which outputs
a +dam played with b ball
First it reads the TSV file and builds an array of words to be replaced and text to replace it with (subs). Then after reading that file, it builds a regular expression to match all possible words to be found - \<(a|adam)\> in this case. The \< and \> match only at the beginning and end, respectively, of words, so the a in ball won't match.
Then for the second file with the text you want to process, it uses patsplit() to split each line into an array of matched parts (words) and the bits between matches (between), and iterates over the length of the array, printing out the replacement text for each match. That way it avoids re-matching text that's already been replaced.
And a perl version that uses a similar approach (Taking advantage of perl's ability to evaluate the replacement text in a s/// substitution):
perl -e '
    use strict;
    use warnings;
    # Set file/standard stream char encodings from locale
    use open ":locale";
    # Or for explicit UTF-8 text
    # use open ":encoding(UTF-8)", ":std";
    my %subs;
    open my $words, "<", shift or die $!;
    while (<$words>) {
        chomp;
        my ($word, $rep) = split "\t", $_, 2;
        $subs{$word} = $rep;
    }
    my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
    while (<<>>) {
        print s/$subre/$subs{$&}/egr;
    }
' tab_sep_file.txt original_file.txt
(This one will escape regular expression metacharacters in the words to replace, making it more robust)

Replace each nth occurrence of 'foo' and 'bar' on two distincts columns by numerically respective nth line of a supplied file in respective columns

I have a source.txt file like below containing two columns of data. The format of the columns of source.txt include [ ] (square bracket) as shown in my source.txt:
[hot] [water]
[16] [boots and, juice]
and I have another target.txt file and contain empty lines plus full stops at the end of each line:
the weather is today (foo) but we still have (bar).
= (
the next bus leaves at (foo) pm, we can't forget to take the (bar).
I want to replace the foo of each nth line of target.txt with the respective contents of the first column of source.txt, and likewise replace the bar of each nth line of target.txt with the respective contents of the second column of source.txt.
I tried to search other sources to understand how I would do it. I already have a command that I use to "replace each nth occurrence of 'foo' by the numerically respective nth line of a supplied file", but I couldn't adapt it:
awk 'NR==FNR {a[NR]=$0; next} /foo/{gsub("foo", a[++i])} 1' source.txt target.txt > output.txt;
I remember seeing a way to use gsub with two columns of data, but I don't remember exactly what the difference was.
EDIT: the target.txt text sometimes contains symbols such as =, ( and ) between the words. I added these symbols to the sample because some answers will not work if they are present in the target.txt file.
Note: the number of target.txt lines and therefore the number of occurrences of bar and foo in this file can vary, I just showed a sample. But the number of occurrences of both foo and bar in each row is 1 respectively.
With your shown samples, please try the following answer, written and tested in GNU awk.
awk -F'\\[|\\] \\[|\\]' '
FNR==NR{
    foo[FNR]=$2
    bar[FNR]=$3
    next
}
NF{
    gsub(/\<foo\>/,foo[++count])
    gsub(/\<bar\>/,bar[count])
}
1
' source.txt FS=" " target.txt
Explanation: Adding detailed explanation for above.
awk -F'\\[|\\] \\[|\\]' '       ##Setting field separator as [ OR ] [ OR ] here.
FNR==NR{                        ##Checking condition FNR==NR which will be TRUE when source.txt is read.
    foo[FNR]=$2                 ##Creating foo array with index of FNR and value of 2nd field here.
    bar[FNR]=$3                 ##Creating bar array with index of FNR and value of 3rd field here.
    next                        ##next will skip all further statements from here.
}
NF{                             ##If line is NOT empty then do following.
    gsub(/\<foo\>/,foo[++count])  ##Globally substituting foo with array foo value, whose index is count.
    gsub(/\<bar\>/,bar[count])    ##Globally substituting bar with array bar value, with index of count.
}
1                               ##Printing line here.
' source.txt FS=" " target.txt  ##Mentioning Input_file names here.
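Note the FS=" " between the two file names: an awk variable assignment in the argument list takes effect when that argument is reached, so the bracket-based separator applies only while reading source.txt, and target.txt is split with the default whitespace behavior.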
EDIT: Adding the following solution also, which will handle any number of occurrences of [...] in source and match them in the target file. Since this is a working solution for the OP (confirmed in comments), adding it here. Also, fair warning: this will fail when source.txt contains a &.
awk '
FNR==NR{
    while(match($0,/\[[^]]*\]/)){
        arr[++count]=substr($0,RSTART+1,RLENGTH-2)
        $0=substr($0,RSTART+RLENGTH)
    }
    next
}
{
    line=$0
    while(match(line,/\(?[[:space:]]*(\<foo\>|\<bar\>)[[:space:]]*\)?/)){
        val=substr(line,RSTART,RLENGTH)
        sub(val,arr[++count1])
        line=substr(line,RSTART+RLENGTH)
    }
}
1
' source.txt target.txt
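(The & caveat exists because in the replacement text of sub() and gsub() a bare & stands for the whole matched string, so values from source.txt would need any & escaped as \\& before being used as replacements.)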
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN {
    FS = "[][]"
    tags["foo"]
    tags["bar"]
}
NR==FNR {
    map["foo",NR] = $2
    map["bar",NR] = $4
    next
}
{
    found = 0
    head = ""
    while ( match($0,/\([^)]+\)/) ) {
        tag = substr($0,RSTART+1,RLENGTH-2)
        if ( tag in tags ) {
            if ( !found++ ) {
                lineNr++
            }
            val = map[tag,lineNr]
        }
        else {
            val = substr($0,RSTART,RLENGTH)
        }
        head = head substr($0,1,RSTART-1) val
        $0 = substr($0,RSTART+RLENGTH)
    }
    print head $0
}
$ awk -f tst.awk source.txt target.txt
the weather is today hot but we still have water.
= (
the next bus leaves at 16 pm, we can't forget to take the boots and, juice.
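The key point is that the match() loop consumes $0 left to right, appending each processed piece to head, so replacement text is never rescanned, and only (...) groups whose contents appear in tags are treated as placeholders.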
awk '
NR==FNR {                    # build lookup
    # delete gumph
    gsub(/(^[[:space:]]*\[)|(\][[:space:]]*$)/, "")
    # split
    split($0, a, /\][[:space:]]+\[/)
    # store
    foo[FNR] = a[1]
    bar[FNR] = a[2]
    next
}
!/[^[:space:]]/ { next }     # ignore blank lines
{                            # do replacements
    VFNR++                   # FNR - (ignored lines)
    # can use sub if foo/bar only appear once
    gsub(/\<foo\>/, foo[VFNR])
    gsub(/\<bar\>/, bar[VFNR])
    print
}
' source.txt target.txt
Note: \< and \> are not POSIX, but they are accepted by some versions of awk (e.g. gawk). I'm not sure POSIX awk regex has a "word boundary".
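If you need to stay portable, one workaround is to do the boundary test by hand with index() and substr(). This is a minimal sketch, assuming "word" characters are [[:alnum:]_] and that a replacement never contains the other placeholder:
function wordsub(word, rep,    rest, out, i, c) {
    rest = $0
    out = ""
    while (i = index(rest, word)) {
        c = substr(rest, i + length(word), 1)    # character after the match; "" at end of line
        if ((i == 1 || substr(rest, i - 1, 1) !~ /[[:alnum:]_]/) && c !~ /[[:alnum:]_]/)
            out = out substr(rest, 1, i - 1) rep             # boundary on both sides: replace
        else
            out = out substr(rest, 1, i + length(word) - 1)  # part of a longer word: keep
        rest = substr(rest, i + length(word))
    }
    $0 = out rest
}
The two gsub() calls above would then become wordsub("foo", foo[VFNR]) and wordsub("bar", bar[VFNR]).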

How to copy a certain amount of lines to a new txt file in bash using awk

I have a txt file which contains chapters, and I want to copy each chapter to a new txt file using bash.
for example:
"CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very .......
CHAPTER II. The Pool of Tears
‘Curiouser and curiouser!’ cried Alice (she was so much surprised, that
for the moment she quite forgot how to speak good English); .....
"
I want to create 2 files 1 for each chapter.
awk 'BEGIN{start="0"; end="0"; chapters="0"}
{if($1 -eq chapter){
chapter++
sed -n "$start,$end" Alice_book_aux > Alice_book_chapter_$chapter
start = end
}
end++;}' Alice_book
This is what I thought I should do, but it won't work :(
I'll make assumptions based on the given example.
AWK has an input parser that can process input through regexp filters.
SED is an adequate tool for taking excerpts from input, but AWK will suffice here.
Thus your revised code:
awk 'BEGIN {chapter=0; chapfile="";}
/^"?CHAPTER / {
    chapter++;
    chapfile="Alice_book_chapter_"chapter;
    printf "" > chapfile;
}
{
    if (chapter > 0) {
        print >> chapfile;
    }
}' Alice_book
As suggested by @karakfa, the awk script can be reduced to this:
awk '/^"?CHAPTER / {
chapter++;
chapfile="Alice_book_chapter_"chapter;
}
chapter{
print > chapfile;
}
' Alice_book
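One caveat: every chapter file stays open, and awks other than gawk can hit a limit on simultaneously open files with a long book. Closing the previous file when a new chapter starts avoids that (a small variation on the script above):
awk '/^"?CHAPTER / {
    if (chapfile) close(chapfile)
    chapter++
    chapfile = "Alice_book_chapter_" chapter
}
chapter { print > chapfile }
' Alice_book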

AWK to find first occurrence of string and assign to variable for compare

I have written the following line of code, which splits the string at the first occurrence of the delimiter:
echo "$line" | awk -F':' '{ st = index($0,":");print "field1: "$1 "
=> " substr($0,st+1)}';
But I don't want to display it. I want to capture both parts in a variable, so I tried the following code:
explodetext="$line" | awk -F':' '{ st = index($0,":")}';
Sample data:
id:1
url:http://test.com
Expected OutPUt will be:
key=id
val=1
key=url
val=http://test.com
but it is not working as expected. Any solution?
Thanks
Your code, expanded:
echo "$line" \
| awk -F':' '
{
st = index($0,":")
print "field1: " $1 " => " substr($0,st+1)
}'
The output of this appears merely to split the line according to the first colon. From the sample data you've provided, it seems that your lines contain two fields, which are separated by the first colon found. This means you can't safely use awk's field separator to find your data (though you can use it for field names), making index() a reasonable approach.
One strategy might be to place your input into an array, for assessment:
#!/usr/bin/awk -f
BEGIN {
    FS=":"
}
{
    record[$1]=substr($0,index($0,":")+1);
}
END {
    if (record["id"] > 0) {
        printf("Record ID %d had a value of %s.\n", record["id"], record["url"])
    } else {
        print "No valid records found."
    }
}
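If what you actually want is the key=.../val=... output shown in the question, a minimal sketch using the same index() idea (an illustration, not the script above):
awk -F':' '{ st = index($0, ":"); print "key=" $1; print "val=" substr($0, st+1) }' input.txt
which, for the sample data, prints:
key=id
val=1
key=url
val=http://test.com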
I suppose that your text file input.txt is stored in the format given below:
id:1
url:http://test1.com
You could use the piece of code below, say awkscript, to achieve what you wish to do:
#!/bin/bash
awk '
BEGIN{FS=":"}
{
    if ($2 > 0) {
        if ( getline > 0){
            st = index($0,":")
            url = substr($0,st+1);
            system("echo Do something with " url);
        }
    }
}' $1
Run the code as ./awkscript input.txt
Note: I assume that the input file contains only one id/url pair, as you confirmed in your comment.
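If the real goal is shell variables rather than awk output, plain read can also split at the first colon only, since everything after the first separator lands in the last variable (a sketch, assuming the pairs are in input.txt):
while IFS=: read -r key val; do
    printf 'key=%s\nval=%s\n' "$key" "$val"
done < input.txt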

How to parse words in awk?

I was wondering how to parse a paragraph that looks like the following:
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
And many other lines with text that I do not need
* * * * * * *
Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.
CPL -
1. Combined Programming Language. U Cambridge and U London. A very
complex language, syntactically based on ALGOL-60, with a pure functional
subset.
Modula-3* - Incoprporation of Modula-2* ideas into Modula-3. "Modula-3*:
So that I can get the following output from the awk command:
Autolisp
CPL
Modula-3*
I have tried the following statements, because the file I want to filter is huge. It is a list of all the programming languages that exist so far, but basically all the lines follow the same pattern as the above.
Statements I have used so far:
BEGIN{$0 !~ /^ / && NF == 2 && $2 == "-"} { print $1 }
BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"} /^FLIP -/{print $1,$3}
BEGIN{RS=""; FS=OFS="\n"} {print $1 NF-1}
BEGIN{NF == 2 && $2 == "-" } { print $1 }
BEGIN { RS = "" } { print $1 }
The statements that have worked for me so far are:
BEGIN { RS = "\n\n"; FS = " - " }
{ print $1 }
awk -F " - " "/ - /{ print $1 }" file.txt
But it still prints lines that I don't need, or skips lines that I do need.
Thanks for your help & response!
I have been racking my brain for days because I am a rookie at AWK programming.
The default FS should be fine. To avoid any duplicate lines, you can pipe the output to sort -u:
$ gawk '$2 == "-" { print $1 }' file | sort -u
Autolisp
CPL
Modula-3*
It might not filter out everything you want but you can keep adding rules until the bad data is filtered.
Alternatively, you can avoid sort by using an associative array:
$ gawk '$2=="-" { arr[$1] } END { for (key in arr) print key}' file
Autolisp
CPL
Modula-3*
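Note that for (key in arr) visits keys in an unspecified order, so unlike the sort -u version, the output order is not guaranteed here.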
If it doesn't have to be awk, it would probably work to first use grep to select lines of the right form, and then use sed to trim off the end, as follows:
grep -e '^.* -' file.txt | sed -e 's/\(^.*\) -.*$/\1/'
Edit: After some playing around with awk, it looks like part of your issue is that you don't always have '[languagename] - [stuff]', but rather '[languagename] -\n[stuff]', as is the case with CPL in the sample text, and therefore, FS=" - " doesn't separate on things like that.
Also, one possible thing to try is as follows:
BEGIN { r = "^.* -"; }
{
    if (match($0, r)) {
        printf("%s\n", substr($0, 1, RSTART + RLENGTH - 3));
    }
}
I don't actually know much about awk, but this is my best guess at replicating what the grep and sed do above. It does appear to work on the sample text you gave, at least.
