Extracting word after fixed word with awk - linux

I have a file file.txt containing a very long line:
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
I want to print only the word after "ctr" in this file, which is "StaffLine",
and I don't know how many characters there are in this word.
I've tried:
awk '{comp[substr("ctr",0)]{print}}'
but it didn't work. How can I get hold of that word?

Here's one way using awk:
awk -F "[{}]" '{ for(i=1;i<=NF;i++) if ($i == "ctr") print $(i+1) }' file
Or if your version of grep supports Perl-like regex:
grep -oP "(?<=ctr{)[^}]+" file
Results:
StaffLine
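As a quick check of the look-behind on a made-up one-line input:
$ echo 'pid{1234}ctr{StaffLine}ftksn{JMT}' | grep -oP "(?<=ctr{)[^}]+"
StaffLine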

Using sed:
sed 's/.*}ctr{\([^}]*\).*/\1/' input

One way of dealing with it is with sed:
sed -e 's/.*}ctr{//; s/}.*//' file.txt
This deletes everything up to and including the { after the word ctr (avoiding issues with any words which have ctr as a suffix, such as a hypothetical pxctr{Bogus} entry); it then deletes anything from the first remaining } onwards, leaving just StaffLine on the sample data.
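For instance, on a contrived line that contains such a pxctr{Bogus} entry, the } required before ctr{ keeps the match honest:
$ echo 'pxctr{Bogus}ctr{StaffLine}tda{}' | sed -e 's/.*}ctr{//; s/}.*//'
StaffLine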

perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' your_file
tested below:
> cat temp
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
> perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' temp
StaffLine
>

Select subdomains using print command

cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
x google
But I want this output
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google
With your shown samples, please try the following awk code. It was written and tested in GNU awk, but should work in any awk.
awk '
BEGIN {
  FS=OFS="."
}
{
  nf=NF
  for (i=1; i<(nf-1); i++) {
    print
    $1=""
    sub(/^[[:space:]]*\./,"")
  }
}
' Input_file
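For reference, the same program as a one-liner; run on the sample a.txt it should print the same eight lines shown for the next solution:
$ awk 'BEGIN{FS=OFS="."} {nf=NF; for(i=1;i<(nf-1);i++){print; $1=""; sub(/^[[:space:]]*\./,"")}}' a.txt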
Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using sed
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using bash:
IFS=.
while read -ra a; do
  for ((i=${#a[@]}; i>2; i--)); do
    echo "${a[*]: -i}"
  done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is typo?)
For a shorter and arguably simpler solution, you could use Perl.
To auto-split the line on the dot character into the @F array, and then print the range you want:
perl -F'\.' -le 'print join(".", @F[0..$#F-1])' a.txt
-F'\.' will auto-split each input line into the @F array. It splits on the given regular expression, so the dot needs to be escaped to be taken literally.
$#F is the index of the last element of the array, so @F[0..$#F-1] is the range of elements from the first one ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use @F[0..$#F-2], etc.
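A quick illustration on a single made-up name:
$ echo 'x.y.z.google.com' | perl -F'\.' -le 'print join(".", @F[0..$#F-1])'
x.y.z.google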

Replacing multiple command calls

I'm able to trim and transpose the data below with sed, but it takes considerable time. I hope it will be faster with awk. Any suggestions are welcome.
Input Sample Data:
[INX_8_60L ] :9:Y
[INX_8_60L ] :9:N
[INX_8_60L ] :9:Y
[INX_8_60Z ] :9:Y
[INX_8_60Z ] :9:Y
Required Output:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
Just use awk, e.g.
awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
Which will be orders of magnitude faster. It just picks out the substring (e.g. "INX_8_60L") using substr() and match(). n is simply used as a false/true (0/1) flag to prevent outputting a "!" before the first string.
Example Use/Output
With your data in file you would get:
$ awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
INX_8_60L!INX_8_60L!INX_8_60L!INX_8_60Z!INX_8_60Z
Which appears to be what you are after. (Note: I'm not sure what your separator character is, so just change above as needed) If not, let me know and I'm happy to help further.
Edit Per-Changes
Including the '?' isn't difficult, and I just copied the character, so you would now have:
awk -v n=0 '{s=substr($0,2,match($0,/[ \t]+/)-2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1}
END {print ""}' file
Example Output
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
And to simplify, just operating on the first field as in @JamesBrown's answer, that would reduce to:
awk -v n=0 '{s=substr($1,2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Let me know if that needs more changes.
Don't start so many sed processes; separate the sed operations with semicolons instead.
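For example (illustrative substitutions, not taken from the question), instead of chaining separate processes:
sed 's/foo/bar/' file | sed 's/baz/qux/'
run a single sed that applies both operations:
sed 's/foo/bar/; s/baz/qux/' file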
Try to process the data in a single job and avoid regex where you can. Below, the fixed-size first block is read with substr() and the ? is inserted while outputting.
$ awk '{
  b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5)
}
END {
  print b
}' file
Output:
INX?_8_60L;INX?_8_60L;INX?_8_60L;INX?_8_60Z;INX?_8_60Z
If the fields are not that static in size:
$ awk '
BEGIN {
  FS="[[_ ]"                                    # split fields with a regex
}
{
  printf "%s%s?_%s_%s",(i++?";":""), $2,$3,$4   # output semicolons and fields
}
END {
  print ""
}' file
Performance of solutions for 20 M records:
Former:
real 0m8.017s
user 0m7.856s
sys 0m0.160s
Latter:
real 0m24.731s
user 0m24.620s
sys 0m0.112s
sed can be very fast when used gingerly, so for simplicity and speed you might wish to consider:
sed -e 's/ .*//' -e 's/\[INX/INX?/' file | tr '\n' '|' | sed -e '$s/|$//'
The second call to sed is there to satisfy the requirement that there is no trailing |.
Another solution using GNU awk:
awk -F'[[ ]+' '
{printf "%s%s",(o?"¦":""),gensub(/INX/,"INX?",1,$2);o=1}
END{print ""}
' file
The field separator is set (with the -F option) so that it matches the wanted parameter.
The main statement prints the modified parameter, with the ? inserted by gensub().
The variable o keeps track of when to print the delimiter ¦ (never before the first field).
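Run against the sample file, this should print:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z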

awk or sed to change column value in a file

I have a csv file with data as follows
16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,0.50822578824,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.467893525702,131072,0,0,0,0,0
I would like to shorten the value in the 5th column to two decimal places.
Desired output
16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,0.50,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.46,131072,0,0,0,0,0
Your help is highly appreciated
awk '{$5=sprintf( "%.2g", $5)} 1' OFS=, FS=, input
This will round and print .47 instead of .46 on the first line, but perhaps that is desirable.
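A quick check of how "%.2g" (two significant digits) rounds:
$ echo '0.466028518635' | awk '{printf "%.2g\n", $1}'
0.47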
Try with this:
cat filename | sed 's/\(^.*\)\(0\.[0-9][0-9]\)[0-9]*\(,.*\)/\1\2\3/g'
So far the output goes to standard output, so
cat filename | sed 's/\(^.*\)\(0\.[0-9][0-9]\)[0-9]*\(,.*\)/\1\2\3/g' > out_filename
will send the desired result to out_filename
If rounding is not desired, i.e. 0.466028518635 needs to be printed as 0.46, use:
cat <input> | awk -F, '{$5=sprintf( "%.4s", $5)} 1' OFS=,
(This is another example of a Useless Use of cat.)
If you want it in Perl, this is it:
perl -F, -lane '$F[4]=~s/^(\d+\...).*/$1/g;print join ",",@F' your_file
tested below:
> cat temp
16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,10.50822578824,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.467893525702,131072,0,0,0,0,0
> perl -F, -lane '$F[4]=~s/^(\d+\...).*/$1/g;print join ",",@F' temp
16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,10.50,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.46,131072,0,0,0,0,0
sed -r 's/^(([^,]+,){4}[^,]{4})[^,]*/\1/' file.csv
This might work for you (GNU sed):
sed -r 's/([^,]{,4})[^,]*/\1/5' file
This truncates the 5th run of non-comma characters (the 5th field) to at most 4 characters.
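A quick check on a made-up line (only the 5th run of non-commas is trimmed):
$ echo 'a,b,c,d,0.466028518635,e' | sed -r 's/([^,]{,4})[^,]*/\1/5'
a,b,c,d,0.46,e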

Can I use awk to convert all the lower-case letters into upper-case?

I have a file mixed with lower-case letters and upper-case letters, can I use awk to convert all the letters in that file into upper-case?
Try this:
awk '{ print toupper($0) }' <<< "your string"
Using a file:
awk '{ print toupper($0) }' yourfile.txt
You can use awk, but tr is the better tool:
tr a-z A-Z < input
or
tr '[:lower:]' '[:upper:]' < input
(The character classes are quoted so the shell cannot expand them as globs.)
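For example:
$ echo 'Hello World 123' | tr '[:lower:]' '[:upper:]'
HELLO WORLD 123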
Try this:
$ echo mix23xsS | awk '{ print toupper($0) }'
MIX23XSS
Something like
< yourMIXEDCASEfile.txt awk '{print toupper($0)}' > yourUPPERCASEfile.txt
You mean like this thread explains:
http://www.unix.com/shell-programming-scripting/24320-converting-file-names-upper-case.html
(OK, it's about filenames, but the same principle applies to file contents.)
If Perl is an option:
perl -ne 'print uc()' file
-n loop around input file, do not automatically print line
-e execute the perl code in quotes
uc() = uppercase
To print all lowercase:
perl -ne 'print lc()' file
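A quick check of both one-liners:
$ echo 'MiXeD' | perl -ne 'print uc()'
MIXED
$ echo 'MiXeD' | perl -ne 'print lc()'
mixed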

How to replace a multi-line string in a bunch of files

#!/bin/sh
old="hello"
new="world"
sed -i s/"${old}"/"${new}"/g $(grep "${old}" -rl *)
The preceding script only works for single-line text. How can I write a script that can replace
multi-line text?
old='line1
line2
line3'
new='newtext1
newtext2'
What command can I use.
You could use perl or awk and change the record separator to something other than newline, so you can match against bigger chunks. For example, with awk:
echo -e "one\ntwo\nthree" | awk 'BEGIN{RS="\n\n"} sub(/two\nthree\n/, "foo")'
or with perl (-00 == paragraph buffered mode)
echo -e "one\ntwo\nthree" | perl -00 -pne 's/two\nthree/foo/'
I don't know if it's possible to have no record separator at all. With perl you could slurp the whole file first (the -0777 switch does that), but that's not nice with regard to memory usage.
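For example, slurping with -0777 (fine for small files, since the whole file is held in memory):
$ printf 'one\ntwo\nthree\n' | perl -0777 -pe 's/two\nthree/foo/'
one
foo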
awk can do that for you.
awk 'BEGIN { RS="" }
FILENAME==ARGV[1] { s=$0 }
FILENAME==ARGV[2] { r=$0 }
FILENAME==ARGV[3] { sub(s,r) ; print }
' FILE_WITH_CONTENTS_OF_OLD FILE_WITH_CONTENTS_OF_NEW ORIGINALFILE > NEWFILE
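A hypothetical invocation with the question's strings (the file names are made up; note that sub() treats the old text as a regular expression, so any regex metacharacters in it would need escaping):
printf 'line1\nline2\nline3\n' > old.txt
printf 'newtext1\nnewtext2\n' > new.txt
awk 'BEGIN { RS="" }
FILENAME==ARGV[1] { s=$0 }
FILENAME==ARGV[2] { r=$0 }
FILENAME==ARGV[3] { sub(s,r) ; print }
' old.txt new.txt original.txt > result.txt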
You can also script the replacement in vim, and the sed FAQ covers multi-line replacement techniques as well.
