I have a question concerning the manipulation of a text file.
I have something like this
any text keyword 21 any text 32 any text
any text keyword 12 any text keyword 12 any text 23 any text
any text keyword 34 any text (keyword 45) any text (34) any text
Now I wonder whether I can use grep/awk/sed/vi/... to add a constant to the number that follows each keyword.
For example, I want to add 10 to every integer that comes right after keyword, while leaving the other numbers and the file format unchanged:
any text keyword 31 any text 32 any text
any text keyword 22 any text keyword 22 any text 23 any text
any text keyword 44 any text (keyword 55) any text (34) any text
Sorry, I did not find anything so far...
If Perl solution is ok for you:
perl -pe 's/(?<=keyword )(\d+)/$1+10/ge;' file
you mentioned vim, here it goes:
:%s/\v(keyword )@<=[0-9]+/\=submatch(0)+10/g
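I tried hard for a sed version. sed cannot do arithmetic, so this rewrites each line into an echo command and lets sh evaluate the $(( )) expressions; it is only safe when the input contains no $, backticks or backslashes:
sed 's/keyword[ \t]*\([0-9][0-9]*\)/keyword $(( \1 + 10 ))/g;
s/"/\\"/g;
s/^/echo \"/;
s/$/\"/' input |
sh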
Look at this Perl solution (the + has to be inside the capture group, otherwise $1 only holds the last digit):
perl -pe 's/keyword (\d+)/"keyword ".($1 + 10)/eg' your_file
If you want to exclude some numbers from the addition (34 and 35 in this example):
perl -pe 's/keyword (\d+)/if ($1 != 34 && $1 != 35) { "keyword ".($1 + 10) } else { "keyword ".$1 }/eg' your_file
This works. Now I will admit I'm no awk expert so there may be shorter ways to do it but this is what I hacked together:
#!/bin/sh
awk '
# Print "keyword N+10" for the "keyword N" match passed in str.
function incr(str,    number) {
    if (match(str, "[0-9]+")) {
        number = substr(str, RSTART, RLENGTH)
        printf("keyword %d", number + 10)
    }
}
# Walk the line, copying the text before each match and rewriting the match.
function findall(str, re,    offset) {
    if (match(str, re) == 0) {
        print str
    } else {
        # substr() positions start at 1; starting at 0 drops a character in gawk
        printf("%s", substr(str, 1, RSTART - 1))
        offset = RSTART + RLENGTH
        incr(substr(str, RSTART, RLENGTH))
        findall(substr(str, offset), re)
    }
}
{
    findall($0, "keyword [0-9]+")
}' "$1"
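A shorter variant of the same idea, as a rough sketch only: it loops with match() instead of recursing and sticks to POSIX awk features. The function name add_after_keyword and the optional increment argument are my own additions for illustration, not part of the answer above.

```shell
# Hypothetical helper: add a constant (default 10) to every integer that
# directly follows "keyword " in the text read from stdin.
add_after_keyword() {
    awk -v inc="${1:-10}" '{
        rest = $0; out = ""
        while (match(rest, /keyword [0-9]+/)) {
            # "keyword " is 8 characters long, the digits come after it
            num = substr(rest, RSTART + 8, RLENGTH - 8)
            out = out substr(rest, 1, RSTART - 1) "keyword " (num + inc)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

It would be used as add_after_keyword < file, or add_after_keyword 5 < file for a different constant.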
For text with color codes, how to wrap it to a fixed length in the terminal?
Text without color codes wraps nicely with fold:
echo -e "12345678901234567890" | fold -w 10
1234567890
1234567890
But this red text wraps wrong:
echo -e "\u001b[31m12345678901234567890" | fold -w 10
12345
6789012345
67890
Note: While the red text is wrapped wrong, it still is printed in red, which is the desired behavior.
(My use case is line wrapping the output of git log --color=always --oneline --graph.)
When determining the (printable) width of a prompt (e.g. PS1), the special sequences \[ and \] are used to mark a series of non-printing characters.
So far I've been unable to find a way to use \[ and \] outside the scope of a prompt, hence this awk hack ...
Assumptions:
we don't know the color codes in advance
for this exercise it is sufficient to deal with color codes of the format \e[...m (\e[m turns off color)
may have to deal with multiple color codes in the input
We'll wrap one awk idea in a bash function (for easier use):
myfold() {
    awk -v n="${1:-10}" '    # default wrap is 10 (printable) characters
    BEGIN { regex="[[:cntrl:]][[][^m]*m"     # regex == "\e[*m"
            #regex="\x1b[[][^m]*m"           # alternatives
            #regex="\033[[][^m]*m"
          }
          { input=$0
            while (input != "" ) {           # repeatedly strip off "n" characters until we have processed the entire line
                count=n
                output=""
                while ( count > 0 ) {        # repeatedly strip off color control codes and characters until we have stripped off "n" characters
                    match(input,regex)
                    if (RSTART && RSTART <= count) {
                        output=output substr(input,1,RSTART+RLENGTH-1)
                        input=substr(input,RSTART+RLENGTH)
                        count=count - (RSTART > 1 ? RSTART-1 : 0)
                    }
                    else {
                        output=output substr(input,1,count)
                        input=substr(input,count+1)
                        count=0
                    }
                }
                print output
            }
          }
    '
}
NOTES:
other non-color, non-printing characters will throw off the count
the regex could be expanded to address other non-printing color and/or character codes
Test run:
$ echo -e "\e[31m123456789012345\e[m67890\e[32mABCD\e[m"
12345678901234567890ABCD
$ echo -e "\e[31m123456789012345\e[m67890\e[32mABCD\e[m" | myfold 10
1234567890
1234567890
ABCD
$ echo -e "\e[31m123456789012345\e[m67890\e[32mABCD\e[m" | myfold 7
1234567
8901234
567890A
BCD
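One way to sanity-check the wrapping is to measure the printable width of each output line. The following is only a sketch under the same assumption as above (color codes have the form \e[...m); the helper name printable_width is made up for this illustration.

```shell
# Hypothetical helper: print the number of printable characters on each
# input line, ignoring color codes of the form ESC [ ... m.
printable_width() {
    awk 'BEGIN { re = "\033[[][^m]*m" }      # same shape as the regex in myfold
         { s = $0; gsub(re, "", s); print length(s) }'
}
```

Piping the wrapped text through it, e.g. ... | myfold 10 | printable_width, should then report no line wider than 10.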
This post is related to my previous question about string splitting: Awk split string into words and numbers. Let's say we have the following string:
1A5T4
This string encodes the following information:
A at position 2 (1 item before A)
T at position 8 (7 items before T, i.e. 1 + A + 5)
No letters after the rightmost one means there is no more relevant information to extract.
So the desired output here is A T 2 8
I'd like to write the Awk script to get this information, preferably in two arrays: one containing positions, the other containing letters. I thought this would be a convenient way to store it, as I need to use the values in other parts of the script that I am writing (or rather struggling to write).
I thought the first step would be to delimit the string by splitting it (credits go to helpful commenters Awk split string into words and numbers).
echo 1A5T4 | awk '{gsub(/[^0-9]+/," & ")}1'
1 A 5 T 4
But maybe the delimiter is not necessary. I tried to do the task with a for loop, iterating through consecutive letter-number pairs and adding them to the arrays. However, I was not able to make it work (there is no array yet, as I could not get the loop to work properly):
echo 1A5T4 | awk '{gsub(/[0-9]+$/,"", $0); a = $0}{for (i = 1; i <= length(a); i++2) {b = substr(a, i, 1) + 1 + b; print b}}'
2
3
9
10
*idea here was to get only numbers and then the letters in the separate for loop
I also had the idea of expanding the string like this: .A.....T.... and then getting the positions of the letters by counting string lengths from the beginning until the letter.
The strings that I need to process will contain one more complication - another type of block: caret followed by a set of letters. In this block, the number of letters following a caret will be added to the final indices. Example below:
1A2^CCG3T4
A is 2 (as in the example above)
T is 11 (2 + 2 + 3 (the number of letters in CCG following the caret) + 3, so 10 positions precede T)
So the desired output here is A T 2 11
The letters following the caret are not relevant for anything else, except for shifting the indices of the letters to the right of the caret block.
Would be great to get some helpful hints on how to tackle this.
Clarification: the script should output all letters, as long as they are not preceded by caret. The letters after the caret only shift the indices. For example:
27T19T^A16G8G29
should give
T T G G 28 48 66 75
and
27T19T16G8G29
should give
T T G G 28 48 65 74
Update:
Thanks to @vgersh99, I managed to improve the code. It first converts the text blocks that follow each caret to the same format as the other blocks. Then all the blocks are dealt with in the same way (the for loop), and in the end the caret values are simply not displayed (the if statement). However, there is still a problem when there are multiple caret blocks of different lengths.
1A5T4
1A1^AAAAA2T2
1A2^CCG3T4
27T19T^A16G8G29
27T19T16G8G29
1A^AA5^TT4T4
10A3A1G9A10A25^TT1^G1^G42T12^G1G29
{
match($0, /\^[A-Z]+/);
a = "^"length(substr($0, RSTART, RLENGTH))-2"^";
gsub(/\^[A-Z]+/, a)
}
# if a letter is directly followed by a caret, such carets are removed, as they would have count==0
{
a = match($0, /[A-Z]+\^/);
a = substr($0, RSTART, RLENGTH-1);
gsub(/[A-Z]+\^/, a)
}
# intermediate string with transformed caret blocks is then used further
{
sum=0; delete(out); str=""
n=patsplit($0,b, /[[:alpha:]^]/, seps);
for(i=1; i<=n;i++) {
sum+=seps[i-1]+1
# print b[i], sum
if (b[i]!="^")
{out[sum]=b[i]}
}
PROCINFO["sorted_in"] = "@ind_num_asc"
for(i in out) {
printf("%s ", out[i])
str=(str? str OFS:"") i
}
print str
} tst.txt
A T 2 8
A T 2 12
A T 2 12
T T G G 28 48 66 75
T T G G 28 48 65 74
A T 2 17
A A G A A T G 11 15 17 27 38 117 134
The last two values in the last row are incorrect; they should be 112 and 127.
This is because gsub always uses the first match to get the replacement for the string, and therefore all the replacements are identical in the intermediate string:
10A3A1G9A10A25^1^1^1^1^1^42T12^1^1G29
It's a rough approximation, as I'm a bit confused by your explanation...
It will probably need to be tweaked a bit...
The implementation is gawk-specific: it uses gawk's patsplit and PROCINFO["sorted_in"].
Given myFile.txt:
1A5T4
1A1^AAAAA2T2
1A2^CCG3T4
27T19T^A16G8G29
27T19T16G8G29
1A^AA5^TT4T4
10A3A1G9A10A25^TT1^G1^G42T12^G1G29
$ cat tst.awk
# prep block for the following "core" mod block
{
# if a caret is followed by letters, substitute it by a caret followed by the length of
# the letter string (-1) followed by a caret
# eg: 1A1^AAAAA2T2 -> 1A1^4^2T2
#$0=gensub(/\^([[:alpha:]]+)/,"^" length("\\2")-2 "^","G")
if(match($0,/\^([[:alpha:]]+)/,sub1))
for (i=1;i in sub1;i++)
sub(sub1[i],int(sub1[i,"length"])-1 "^")
# if a letter is directly followed by a caret, such carets are removed, as they would have count==0
$0=gensub(/([[:alpha:]])\^/,"\\1","G")
#print "[" $0 "]"
#next
}
# "core" mod block
# intermediate string with transformed caret blocks is then used further
{
sum=0; delete(out); str=""
n=patsplit($0,b, /[[:alpha:]^]/, seps);
for(i=1; i<=n;i++) {
sum+=seps[i-1]+1
# print b[i], sum
if (b[i]!="^")
{out[sum]=b[i]}
}
PROCINFO["sorted_in"] = "@ind_num_asc"
for(i in out) {
printf("%s ", out[i])
str=(str? str OFS:"") i
}
print str
}
$ gawk -f tst.awk myFile.txt
A T 2 8
A T 2 12
A T 2 12
T T G G 28 48 66 75
T T G G 28 48 65 74
A T T T 2 11 12 17
A A G A A G G T G G 11 15 17 27 38 69 72 115 129 131
% echo 1A5T4 | gawk 'BEGIN{ FS=""; }{ for (i=1;i<=NF;i++) { if($i>="A"){ s=s $i } else { for(j=1;j<=$i;j++)s=s "." }} print s }'
.A.....T....
% echo 1A2^CCG3T4 | gawk 'BEGIN{ FS=""; }{ for (i=1;i<=NF;i++) { if($i>="A"){ s=s $i } else { for(j=1;j<=$i;j++)s=s "." }} print s }'
.A..^CCG...T....
%
maybe the caret handling is wrong, but that should not be too hard to fix...
maybe try this
{mawk/mawk2/gawk} 'BEGIN { FS = "[=]+";
OFS = "=";
} {
outC = outP = pos = "";
gsub(/\^/, "=&" ); # first split carets next to letters
gsub(/[0-9]+/, "=&="); # insert delims around numbers
} { $1 = $1 } {
while (match($0, /[\^][A-Z]+/)) { sub(/[\^][A-Z]+/, RLENGTH -1) }
} {
x = 1; do {
if ($(x) ~ /[0-9]+|^$/) { pos += int($(x)) } else {
outC = outC "" $(x) " ";
outP = outP "" ++pos " ";
} } while (++x <= NF); print outC outP; } '
This version of the solution works in mawk and mawk2 as well. It doesn't require any patsplit / FPAT logic or any other gawk-specific feature, and it doesn't need a single call to substr( ).
It also avoids the hash-index overhead associated with arrays, and it doesn't require any sorting, since the input is read sequentially left to right anyway.
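For comparison, here is a sketch of the same left-to-right idea written with match() and substr() instead of field splitting. It assumes the rules as stated in the question (a number skips that many positions, a caret block adds the count of its letters, any other letter is reported), and the function name letter_positions is made up for the example.

```shell
# Hypothetical helper: print the reported letters followed by their positions.
letter_positions() {
    awk '{
        s = $0; pos = 0; letters = ""; positions = ""
        while (s != "") {
            if (match(s, /^[0-9]+/)) {
                pos += substr(s, 1, RLENGTH)   # a number skips that many positions
                len = RLENGTH
            } else if (match(s, /^\^[A-Za-z]+/)) {
                pos += RLENGTH - 1             # caret block: only its letters count
                len = RLENGTH
            } else {
                pos++                          # a reported letter takes one position
                letters = letters substr(s, 1, 1) " "
                positions = positions (positions == "" ? "" : " ") pos
                len = 1
            }
            s = substr(s, len + 1)
        }
        print letters positions
    }'
}
```

With the examples from the question, echo 1A2^CCG3T4 | letter_positions gives A T 2 11, and the long last sample ends in 112 and 127 as requested.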
I have a non delimited text file consisting of around 1 million rows.
Sample rows
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659 VIDYA.SAGAR1#bank.IN VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
On each row, depending on the digit it starts with ("1", "2" or "3", the row type), I have to insert delimiters at fixed character positions, i.e. at the end of 0-1, 1-20, 21-25 and so on.
How can I do this with a Linux script?
Desired Output
1|YBL LOYALTY EXT |10001|01172019|001
2|00010010100001151|2753|184907301010614199100919699034659 |VIDYA.SAGAR1#bank.IN |VIDYA SAGAR |CROSS |BANDRA |WM |DELHI |456471
3|000000027
I tried this command
perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_"} if(/^1/) { @x=(1,16,5,8); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" } if(/^3/) { @x=(1); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }' filename
INPUT ROWS
1YBL LOYALTY EXT 1000112102018001
2000100101000002631653184911501010111199100919323739251 VIJAYPANDEY1191#GMAIL.COM VIJAY PANDEY PART OF GROUND FLOOR & BASEMENT SHOPPER STOP SV ROAD ANDHERI WEST LANDMARK-ERSTWHILE CRASSWORD BOOK STORE MUMBAI 400058
2000100101000019920453184964321010513199000919878857482 MAKSUDMASTER7775#GMAIL.COM MOHAMAD MAQSHUD MASTER H COLLECTION NEW SHIVPURI GALI NO 1 NEAR MAKHAN SINGH CHOWK LUDHIANA 141008
2000100101000023500853184923441010913197300919375580888 JAYNTITALA#GMAIL.COM JAYANTIBHAI TADA 44 KHODIYAR NAGAR B S ABHISHEK SUDAMA CHOWK KHODIYARNAGAR MOTA VARACHHA SURAT 395006
3000000066
EXPECTED OUTPUT
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |VIJAYPANDEY1191#GMAIL.COM |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |MAKSUDMASTER7775#GMAIL.COM |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |JAYNTITALA#GMAIL.COM |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|000000066
BUT GETTING THIS
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |VIJAYPANDEY1191#GMAIL.COM |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |MAKSUDMASTER7775#GMAIL.COM |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
1|41008|
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |JAYNTITALA#GMAIL.COM |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|95006
3|000000066
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='1 17 4 *' -v OFS='|' '/^2/{$1=$1; gsub(/\s+/,"&"OFS)} 1' file
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 |VIDYA.SAGAR1#bank.IN |VIDYA |SAGAR |CROSS |BANDRA |WM |DELHI |456471
3000000027
The above use of FIELDWIDTHS says the input should be treated as separated into 4 fields of width 1 char, 17 chars, 4 chars and then the rest.
When you assign a value to a field, awk rebuilds the record, replacing the input field separators with the value of OFS, so $1=$1 causes |s to be inserted between the fields described by FIELDWIDTHS.
Once that's done there's still all the remaining space-separated text to get a field separator added so the gsub() appends an OFS after every series of spaces.
Older versions of gawk don't support * as meaning the rest of the line - if you have that situation then just replace * with a large value like 99999.
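The record rebuild triggered by $1=$1 is not specific to FIELDWIDTHS; it can be seen with ordinary whitespace-separated fields in any POSIX awk. A minimal sketch (the function name rebuild_with_pipes is only for illustration):

```shell
# Hypothetical helper: assigning to any field makes awk rebuild $0,
# joining the fields with OFS instead of the original separators.
rebuild_with_pipes() {
    awk -v OFS='|' '{ $1 = $1; print }'
}
```

For example, printf 'a b   c\n' | rebuild_with_pipes prints a|b|c; without the assignment the line would be printed unchanged.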
You can try Perl as well
perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' input_file
with the given inputs
$ cat rahman.txt
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659 VIDYA.SAGAR1#bank.IN VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
$ perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 VIDYA.SAGAR1#bank.IN VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
$
To split off more fields, just add entries to @x, e.g. @x=(1,17,4) becomes @x=(1,17,4,10,20).
EDIT1:
To add delimiters for those fields which can be split by space, use the below
$ perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } s/\S+\s+\K/|/g }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 |VIDYA.SAGAR1#bank.IN |VIDYA |SAGAR |CROSS |BANDRA |WM |DELHI |456471
3000000027
$
Explanation of the code:
perl -lpe # use -p for printing by default at the end of perl one-liner
# this makes sure when you dont have a line starting with 2 the line is printed after the if statement.
' if(/^2/) # if - select line that starts with 2. $_ will have the current line
{
@x=(1,17,4); # x is an array to hold the widths of fields. - 1, 17, 4
for $i (@x) # open for loop to loop through the array x
{
s/(.{$i})//; # no variable is specified, so the substitution acts on the $_ i.e current line
# first instance is s/(.{1})// => match one character and store it in $1 capturing variable
# replace the captured part with nothing and update $_
# e.g if the line is "200010010100001151" .. loop one will capture "2" and $_ becomes "00010010100001151"
# loop 2 => s/(.{17})// matches 17 character and $1 stores "00010010100001151"
printf("%s|",$1) # print $1 along with delimiter pipe
} # end of for loop
} # end of if
# here is default print statement in perl that will print the $_ after all modification
' input_file
EDIT2
I get below results based on your inputs. It works correctly.. what issues you see?
$ perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_"} if(/^1/) { @x=(1,16,5,8); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_" } if(/^3/) { @x=(1); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_" }' rahman.txt
1|YBL LOYALTY EXT |10001|01172019|001
2|0001001010000115127|531849|0730|101|06141991|00919699034659 |VIDYA.SAGAR1#bank.IN VID|YA SAGAR CRO|SS BAN|DRA WM | DEL|HI 456|471
3|000000027
$
EDIT3:
Got the issue... $_ is modified inside the blocks, so at the end of the /^2/ if block $_ holds the value "141008", which then satisfies the next condition (/^1/), and that block executes as well. To avoid it, copy $_ to a $line variable at the beginning and test $line against /^2/, /^3/ and /^1/ in the separate if blocks.
$ perl -lne '$line=$_; if($line=~/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }
if($line=~/^1/) { @x=(1,16,5,8); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }
if($line=~/^3/) { @x=(1); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }' rahman2.txt
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |VIJAYPANDEY1191#GMAIL.COM |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |MAKSUDMASTER7775#GMAIL.COM |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |JAYNTITALA#GMAIL.COM |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|000000066
$
You do have delimiters in your file, you just don't see them: they are the space/tab characters. So you just need to replace those, using a command like sed 's/xxx/|/g' (where xxx stands for the space or TAB character). If you are unsure whether your characters are spaces or tabs, open the file in a hex editor (space is ASCII code 32 (hex 20) and TAB is 9 (hex 09)).
You can try with gnu sed :
sed -E '/^2/{s//&|/;s/(.{19})(....)(\S+\s+)/\1|\2|\3|/}' infile
In case your awk doesn't have FIELDWIDTHS, try the following. It takes the field widths from var and splits whatever is left of the line on whitespace:
awk -v var="1,17,4" -v OFS="|" '
BEGIN{
    num=split(var,array,",")          # array[i] holds the width of field i
}
{
    val=""; start=1; len=length($0)
    for(i=1;i<=num && start<=len;i++){
        val=val (i>1?OFS:"") substr($0,start,array[i])
        start+=array[i]               # next field begins right after this one
    }
    if(start<=len){
        rest=substr($0,start)
        gsub(/[[:space:]]+/,"&"OFS,rest)
        val=val OFS rest
    }
    print val
}
' Input_file
I was wondering how to parse a paragraph that looks like the following:
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
And many other lines with text that I do not need
* * * * * * *
Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.
CPL -
1. Combined Programming Language. U Cambridge and U London. A very
complex language, syntactically based on ALGOL-60, with a pure functional
subset.
Modula-3* - Incoprporation of Modula-2* ideas into Modula-3. "Modula-3*:
So that I get the following output from the awk command:
Autolisp
CPL
Modula-3*
I have tried the following commands because the file I want to filter is huge. It is a list of all the existing programming languages so far, and basically all the lines follow the same pattern as above.
Commands I have used so far:
BEGIN{$0 !~ /^ / && NF == 2 && $2 == "-"} { print $1 }
BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"} /^FLIP -/{print $1,$3}
BEGIN{RS=""; FS=OFS="\n"} {print $1 NF-1}
BEGIN{NF == 2 && $2 == "-" } { print $1 }
BEGIN { RS = "" } { print $1 }
The commands that have worked for me so far are:
BEGIN { RS = "\n\n"; FS = " - " }
{ print $1 }
awk -F " - " "/ - /{ print $1 }" file.txt
But they still print lines I don't need or skip lines I do need.
Thanks for your help & response!
I have been racking my brain over this for some days because I am a rookie with AWK programming.
The default FS should be fine, to avoid any duplicate lines you can pipe the output to sort -u
$ gawk '$2 == "-" { print $1 }' file | sort -u
Autolisp
CPL
Modula-3*
It might not filter out everything you want but you can keep adding rules until the bad data is filtered.
Alternately you can avoid using sort by using an associative array:
$ gawk '$2=="-" { arr[$1] } END { for (key in arr) print key}' file
Autolisp
CPL
Modula-3*
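Note that for (key in arr) visits the keys in an unspecified order. If the names should come out in the order they first appear in the file, a common idiom is to test and increment a counter in the pattern itself. A small sketch, with the function name first_seen_names chosen only for this example:

```shell
# Hypothetical helper: print the first word of each line whose second
# field is a lone "-", once per name, in order of first appearance.
first_seen_names() {
    awk '$2 == "-" && !seen[$1]++ { print $1 }' "$@"
}
```

It reads the files given as arguments, or stdin when there are none, e.g. first_seen_names file.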
If it doesn't have to be with awk, it would probably work to first use grep to select lines of the right form, and then use sed to trim off the end, as follows:
grep -e '^.* -' file | sed -e 's/^\(.*\) -.*$/\1/'
Edit: After some playing around with awk, it looks like part of your issue is that you don't always have '[languagename] - [stuff]', but rather '[languagename] -\n[stuff]', as is the case with CPL in the sample text, and therefore, FS=" - " doesn't separate on things like that.
Also, one possible thing to try is as follows:
BEGIN { r = "^.* -"; }
{
if (match($0, r)) {
printf("%s\n", substr($0, 1, RSTART + RLENGTH - 3));
}
}
I don't actually know much about awk, but this is my best guess at replicating what the grep and sed do above. It does appear to work on the sample text you gave, at least.
I am new at AWK programming and I was wondering how to filter the following text:
Goedel - Declarative language for AI, based on many-sorted logic. Strongly
typed, polymorphic, declarative, with a module system. Supports bignums
and sets. "The Goedel Programming Language", P. M. Hill et al, MIT Press
1994, ISBN 0-262-08229-2. Goedel 1.4 - partial implementation in SICStus
Prolog 2.1.
ftp://ftp.cs.bris.ac.uk/goedel
info: goedel#compsci.bristol.ac.uk
Just to print this:
Goedel
I have used the following command but it does not work as I wished:
awk -F " - " "/ - /{ print $1 }"
It shows the following:
Goedel
1994, ISBN 0-262-08229-2. Goedel 1.4
Could somebody tell me what I have to modify so I can get what I want?
Thanks in advance
awk 'BEGIN { RS = "" } { print $1 }' your_file.txt
This means: split the input into paragraphs at empty lines, split each paragraph into words on the default separator (whitespace), and print the first word ($1) of every paragraph.
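To see paragraph mode in action, here is a small sketch wrapped in a function (the name first_words is just for the example). It assumes entries are separated by at least one empty line, as in the sample text.

```shell
# Hypothetical helper: with RS set to the empty string awk reads one
# paragraph per record, so $1 is the first word of each paragraph.
first_words() {
    awk 'BEGIN { RS = "" } { print $1 }' "$@"
}
```

A line such as "1994, ISBN 0-262-08229-2. Goedel 1.4 - partial ..." in the middle of an entry no longer matters, because only the first word of the whole paragraph is printed.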
this one-liner could work for your requirement:
awk -F ' - ' 'NF>1{print $1;exit}'
awk -F ' - ' ' { if (FNR % 4 != 1) next; print $1; }'
If the format is exactly as shown below (one entry every four lines, with the author line first), then the code above should work:
1 Author - ...
2 Year ...
3 URL
4 Extra info ...
5 Author - ...
6..N etc.
If there is a blank line between entries, you can set RS to a null string and $1 will be the author as long as the value for -F (the FS variable in an awk script) is the same. This has the advantage that if you don't have "info: ..." or a URL, you can still distinguish between entries, assuming it is not "Author - ...{newline}Year ...{newline}{newline}info: ...{newline}{newline}Author - ..." (you can't have an empty line between parts of an entry if an empty line is what separates entries.) For example:
# A blank line is what separates each entry.
BEGIN { RS = ""; }
{ print $1; }
If you have an awk that supports it, you can make RS a multiple character string if necessary (e.g. RS = "\n--\n" for entries separated by "--" on a line by itself). If you need a regex or simply don't have an awk that supports multiple character record separators, you're forced to use something like the following:
BEGIN { found_sep = 1; }
{ if (found_sep) { print $1; found_sep = 0; } }
# Entry separator is "--\n"
/^--$/ { found_sep = 1; }
More sample input will be required for something more complicated.