How to parse words in awk? - linux

I was wondering how to parse a paragraph that looks like the following:
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
And many other lines with text that I do not need
* * * * * * *
Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.
CPL -
1. Combined Programming Language. U Cambridge and U London. A very
complex language, syntactically based on ALGOL-60, with a pure functional
subset.
Modula-3* - Incorporation of Modula-2* ideas into Modula-3. "Modula-3*:
So that I can get the following output from the awk command:
Autolisp
CPL
Modula-3*
I have tried the following commands because the file I want to filter is huge. It is a list of all the existing programming languages so far, but basically all the lines follow the same pattern as the extract above.
Commands I have used so far:
BEGIN{$0 !~ /^ / && NF == 2 && $2 == "-"} { print $1 }
BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"} /^FLIP -/{print $1,$3}
BEGIN{RS=""; FS=OFS="\n"} {print $1 NF-1}
BEGIN{NF == 2 && $2 == "-" } { print $1 }
BEGIN { RS = "" } { print $1 }
The commands that have worked for me so far are:
BEGIN { RS = "\n\n"; FS = " - " }
{ print $1 }
awk -F " - " "/ - /{ print $1 }" file.txt
But it still prints lines I don't need, or skips lines I do need.
Thanks for your help & response!
I have been racking my brain for some days because I am a rookie at AWK programming.

The default FS should be fine. To avoid any duplicate lines you can pipe the output to sort -u:
$ gawk '$2 == "-" { print $1 }' file | sort -u
Autolisp
CPL
Modula-3*
It might not filter out everything you want but you can keep adding rules until the bad data is filtered.
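For example, one extra rule could drop indented continuation lines; here is a sketch with a few invented sample lines piped in (the leading-whitespace guard is an assumption about what the bad data looks like):

```shell
printf '%s\n' 'Autolisp - Dialect of LISP' 'CPL -' '   complex language, syntactically' 'Modula-3* - Incorporation' |
  awk '$2 == "-" && $0 !~ /^[[:space:]]/ { print $1 }'
# prints: Autolisp, CPL, Modula-3* (one per line)
```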
Alternately you can avoid using sort by using an associative array:
$ gawk '$2=="-" { arr[$1] } END { for (key in arr) print key}' file
Autolisp
CPL
Modula-3*

If it doesn't have to be awk, it would probably work to first use grep to select lines of the right form, and then use sed to trim off the end, as follows:
grep -e '^.* -' file.txt | sed -e 's/\(^.*\) -.*$/\1/'
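The grep-plus-sed idea can be sanity-checked like this; the three input lines are made-up samples, and sed keeps only the text captured before the " -":

```shell
printf '%s\n' 'Autolisp - Dialect of LISP' 'And many other lines' 'CPL -' |
  grep -e '^.* -' | sed -e 's/\(^.*\) -.*$/\1/'
# prints: Autolisp, CPL (one per line)
```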
Edit: After some playing around with awk, it looks like part of your issue is that you don't always have '[languagename] - [stuff]', but rather '[languagename] -\n[stuff]', as is the case with CPL in the sample text, and therefore, FS=" - " doesn't separate on things like that.
Also, one possible thing to try is as follows:
BEGIN { r = "^.* -"; }
{
    if (match($0, r)) {
        printf("%s\n", substr($0, 1, RSTART + RLENGTH - 3));
    }
}
I don't actually know much about awk, but this is my best guess at replicating what the grep and sed do above. It does appear to work on the sample text you gave, at least.

Related

I need to make an awk script to parse text in a file. I am not sure if I am doing it correctly

Hi, I need to make an awk script in order to parse a csv file and sort it in bash.
I need to get a list of presidents from Wikipedia and sort their years in office by year.
When it is all sorted out, each year needs to be in a text file.
I'm not sure I am doing it correctly.
Here is a portion of my csv file:
28,Woodrow Wilson,http:..en.wikipedia.org.wiki.Woodrow_Wilson,4.03.1913,4.03.1921,Democratic ,WoodrowWilson.gif,thmb_WoodrowWilson.gif,New Jersey
29,Warren G. Harding,http:..en.wikipedia.org.wiki.Warren_G._Harding,4.03.1921,2.8.1923,Republican ,WarrenGHarding.gif,thmb_WarrenGHarding.gif,Ohio
I want to include $2, which I think is the name, and sort by $4, which I think is the date the president took office.
Here is my actual awk file:
#!/usr/bin/awk -f
-F, '{
if (substr($4,length($4)-3,2) == "17")
{ print $2 > Presidents1700 }
else if (substr($4,length($4)-3,2) == "18")
{ print $2 > Presidents1800 }
else if (substr($4,length($4)-3,2) == "19")
{ print $2 > Presidents1900 }
else if (substr($4,length($4)-3,2) == "20")
{ print $2 > Presidents2000 }
}'
Here is my function running it:
SplitFile() {
printf "Task 4: Spliting file based on century\n"
awk -f $AFILE ${custFolder}/${month}/$DFILE
}
Where $AFILE is my awk file, and the directories listed on the right lead to my actual file.
Here is a portion of my output, it's actually several hundred lines long but in the
end this is what a portion of it looks like:
awk: presidentData/10/presidents.csv:47: 46,Joseph Biden,http:..en.wikipedia.org.wiki.Joe_Biden,20.01.2021,Incumbent , Democratic , Joe_Biden.jpg,thmb_Joe_Biden.jpg,Pennsilvania
awk: presidentData/10/presidents.csv:47: ^ syntax error
I know the output is not very helpful; I would rather just screenshot it, but I can't. I tried getting help, but these online classes can be really hard and getting help at a distance is tough. The syntax errors above seem to point to the commas in the csv file.
After the edits, it's clear you are trying to classify the presidents by the century in which they served.
As stated in my comments above, you don't include single quotes or command-line arguments in an awk script file. You use the BEGIN {...} rule to set the field separator, FS = ",". Then there are several ways you can split things in the fourth field; split() is as easy as anything else.
That will leave you with the year from the fourth field in arr[3] (split() fills the array starting at index 1). Then it is just a matter of comparing against the largest year first and decreasing from there, redirecting the output to the output file for that century.
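That split() step can be checked in isolation on a date taken from the question's sample data:

```shell
# Split "day.month.year" on "." and print the year element
echo '4.03.1913' | awk '{ split($0, arr, "."); print arr[3] }'
# prints: 1913
```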
Continuing with what you started, your awk script will look similar to:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
    split($4, arr, ".")
    if (arr[3] >= 2000)
        print $2 > "Presidents2000"
    else if (arr[3] >= 1900)
        print $2 > "Presidents1900"
    else if (arr[3] >= 1800)
        print $2 > "Presidents1800"
    else if (arr[3] >= 1700)
        print $2 > "Presidents1700"
}
Now make it executable (for convenience). Presuming the script is in the file pres.awk:
$ chmod +x pres.awk
Now simply call the awk script passing the .csv filename as the argument, e.g.
$ ./pres.awk my.csv
Now list the files named Presid* and see what is created:
$ ls -al Presid*
-rw-r--r-- 1 david david 33 Oct 8 22:28 Presidents1900
And verify the contents are what you needed:
$ cat Presidents1900
Woodrow Wilson
Warren G. Harding
Presuming that is the output you are looking for based on your attempt.
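The whole pipeline can also be exercised on a single made-up CSV line (the "url" field is a placeholder; the output file name matches the script's):

```shell
printf '28,Woodrow Wilson,url,4.03.1913,4.03.1921,Democratic\n' |
  awk -F, '{ split($4, arr, "."); if (arr[3] >= 1900 && arr[3] < 2000) print $2 > "Presidents1900" }'
cat Presidents1900
# prints: Woodrow Wilson
```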
(note: you need to quote the output file name to ensure that, e.g., Presidents1900 isn't taken as a variable that hasn't been set yet)
Let me know if you have further questions.

Find and replace words using sed command not working

I have a text file which is tab-separated; the first column holds the word to be found and the second column holds the word to replace it with. This text file contains English and Arabic pairs. Once a word is found and replaced, it should not be changed again.
For example:
adam a +dam
a b
ال ال+
So for a given text file:
adam played with a ball ال
I expect:
a +dam played with b ball ال+
However, I get:
b +dbm plbyed with b bbll ال+
I am using the following sed command to find and replace:
sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt
How can I fix this issue?
The basic problem with your approach is that you don't want text matched by a prior substitution to be replaced again by a later one - you don't want the a's in a +dam changed to b's. This makes sed a pretty poor choice - you can fairly easily build a regular expression that matches all of the things you want to replace, but picking which replacement to use is the issue.
A way using GNU awk:
gawk -F'\t' '
FNR == NR { subs[$1] = $2; next }   # populate the array of substitutions
ENDFILE {
    if (FILENAME == ARGV[1]) {
        # Build a regular expression of things to substitute
        subre = "\\<("
        first = 0
        for (s in subs)
            subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
        subre = sprintf("%s)\\>", subre)
    }
}
{
    # Do the substitution
    nwords = patsplit($0, words, subre, between)
    printf "%s", between[0]
    for (n = 1; n <= nwords; n++)
        printf "%s%s", subs[words[n]], between[n]
    printf "\n"
}
' tab_sep_file.txt original_file.txt
which outputs
a +dam played with b ball
First it reads the TSV file and builds an array of words to be replaced and text to replace it with (subs). Then after reading that file, it builds a regular expression to match all possible words to be found - \<(a|adam)\> in this case. The \< and \> match only at the beginning and end, respectively, of words, so the a in ball won't match.
Then for the second file with the text you want to process, it uses patsplit() to split each line into an array of matched parts (words) and the bits between matches (between), and iterates over the length of the array, printing out the replacement text for each match. That way it avoids re-matching text that's already been replaced.
And a perl version that uses a similar approach (Taking advantage of perl's ability to evaluate the replacement text in a s/// substitution):
perl -e '
use strict;
use warnings;
# Set file/standard stream char encodings from locale
use open ":locale";
# Or for explicit UTF-8 text
# use open ":encoding(UTF-8)", ":std";
my %subs;
open my $words, "<", shift or die $!;
while (<$words>) {
    chomp;
    my ($word, $rep) = split "\t", $_, 2;
    $subs{$word} = $rep;
}
my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
while (<<>>) {
    print s/$subre/$subs{$&}/egr;
}
' tab_sep_file.txt original_file.txt
(This one will escape regular expression metacharacters in the words to replace, making it more robust)

How to copy a certain amout of lines to a new txt file in bash using awk

I have a txt file which contains chapters, and I want to copy each chapter to a new txt file using bash.
for example:
"CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very .......
CHAPTER II. The Pool of Tears
‘Curiouser and curiouser!’ cried Alice (she was so much surprised, that
for the moment she quite forgot how to speak good English); .....
"
I want to create 2 files 1 for each chapter.
awk 'BEGIN{start="0"; end="0"; chapters="0"}
{if($1 -eq chapter){
chapter++
sed -n "$start,$end" Alice_book_aux > Alice_book_chapter_$chapter
start = end
}
end++;}' Alice_book
This is what I thought I should do, but it won't work :(
I'll make some assumptions based on the given example.
AWK has an input parser that can process input through regexp filters.
SED is an adequate tool to take excerpts from input, but AWK will suffice here.
Thus your revised code:
awk 'BEGIN { chapter = 0; chapfile = "" }
/^"?CHAPTER / {
    chapter++
    chapfile = "Alice_book_chapter_" chapter
    printf "" > chapfile
}
{
    if (chapter > 0) {
        print >> chapfile
    }
}' Alice_book
As suggested by @karakfa, the awk script can be reduced to this:
awk '/^"?CHAPTER / {
    chapter++
    chapfile = "Alice_book_chapter_" chapter
}
chapter {
    print > chapfile
}' Alice_book
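A quick check of the reduced script on an invented two-chapter sample (file names follow the pattern above):

```shell
printf '%s\n' 'CHAPTER I. Down the Rabbit-Hole' 'first chapter text' 'CHAPTER II. The Pool of Tears' 'second chapter text' > Alice_book
awk '/^"?CHAPTER / { chapter++; chapfile = "Alice_book_chapter_" chapter }
     chapter { print > chapfile }' Alice_book
cat Alice_book_chapter_2
# prints the CHAPTER II heading and its text
```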

Vim function to sort lines by tags

I'm trying to create a custom function in gVim to sort lines based on a tag and I'm a bit out of my depth. The goal is for me to create a custom function in my _gvimrc so I can resort like this quickly and as necessary. I need some pointers as to where to study.
My goal is to group lines by a tag at the start of the line
file contents 1
:a some text
:b some other text
:a more text
:b more other text
:c final text
file contents 2
will become
:a
some text
more text
:b
some other text
more other text
:c
final text
file contents 1
file contents 2
The challenge is largely that I don't want to mess with the other lines in the file - those which have no tags. The :sort function will reorder all of those.
I have a feeling I will need to:
1. yank all the lines with tags into a register, and delete them (some kind of :g/pattern/yank register ?)
2. put them all at the beginning of the file (some kind of :g/pattern/put register ?)
3. sort the block by the tags (some kind of :sort r /pattern/ ?)
4. iterate over each tag to reformat from
:a text
:a text
:a text
to
:a
text
text
text
I'm not proficient enough in gVim to know where to really start - if anyone here is expert enough to assist with one of these sub-problems, so to speak, or has a better idea for the methodology, I'd really appreciate it. I'm particularly stymied on number 4.
If this isn't the kind of thing gVim is capable of, please let me know, but I have a feeling this is quite possible and just out of my pay grade.
I think awk or some higher language would fit this better. Example using awk and Vim's filter, %!:
:%!awk '$1 ~ /^:/ { a[$1] = a[$1] substr($0, 3) "\n"; next } { r = r $0 "\n" } END { n = asorti(a, keys); for (k = 1; k <= n; k++) { print keys[k]; print a[keys[k]] } print ""; print r }'
Prettier version:
$1 ~ /^:/ {
    a[$1] = a[$1] substr($0, 3) "\n"
    next
}
{
    r = r $0 "\n"
}
END {
    n = asorti(a, keys)
    for (k = 1; k <= n; k++) {
        print keys[k]
        print a[keys[k]]
    }
    print ""
    print r
}
Note: I use gawk

Parse columns with awk

I am new at AWK programming and I was wondering how to filter the following text:
Goedel - Declarative language for AI, based on many-sorted logic. Strongly
typed, polymorphic, declarative, with a module system. Supports bignums
and sets. "The Goedel Programming Language", P. M. Hill et al, MIT Press
1994, ISBN 0-262-08229-2. Goedel 1.4 - partial implementation in SICStus
Prolog 2.1.
ftp://ftp.cs.bris.ac.uk/goedel
info: goedel#compsci.bristol.ac.uk
Just to print this:
Goedel
I have used the following sentence but it just does not work as I wished:
awk -F " - " "/ - /{ print $1 }"
It shows the following:
Goedel
1994, ISBN 0-262-08229-2. Goedel 1.4
Could somebody tell me what I have to modify so I can get what I want?
Thanks in advance
awk 'BEGIN { RS = "" } { print $1 }' your_file.txt
which means: split the input into paragraphs separated by empty lines, then split each paragraph into words using the default separator (whitespace), and finally print the first word ($1) of every paragraph.
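Paragraph mode can be demonstrated quickly; the two entries below are invented stand-ins for the real file:

```shell
printf 'Goedel - Declarative language\nmore details here\n\nMercury - Logic language\n' |
  awk 'BEGIN { RS = "" } { print $1 }'
# prints: Goedel, Mercury (one per line)
```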
this one-liner could work for your requirement:
awk -F ' - ' 'NF>1{print $1;exit}'
awk -F ' - ' '{ if (FNR % 4 != 1) next; print $1; }'
If the format is exactly the same as below, then the code above should work:
1 Author - ...
2 Year ...
3 URL
4 Extra info ...
5 Author - ...
6..N etc.
If there is a blank line between entries, you can set RS to a null string, and $1 will be the author as long as the value for -F (the FS variable in an awk script) stays the same. This has the advantage that you can still distinguish entries even when one has no "info: ..." line or URL. The one restriction is that an empty line must appear only between entries, never between parts of a single entry. For example:
# A blank line is what separates each entry.
BEGIN { RS = ""; }
{ print $1; }
If you have an awk that supports it, you can make RS a multiple character string if necessary (e.g. RS = "\n--\n" for entries separated by "--" on a line by itself). If you need a regex or simply don't have an awk that supports multiple character record separators, you're forced to use something like the following:
BEGIN { found_sep = 1; }
{ if (found_sep) { print $1; found_sep = 0; } }
# Entry separator is "--\n"
/^--$/ { found_sep = 1; }
More sample input will be required for something more complicated.
