Select subdomains using print command - linux

cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
x google
But I want this output
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google

With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any awk.
awk '
BEGIN{
FS=OFS="."
}
{
nf=NF
for(i=1;i<(nf-1);i++){
print
$1=""
sub(/^[[:space:]]*\./,"")
}
}
' Input_file

Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com

Using sed
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com

Using bash:
IFS=.
while read -ra a; do
for ((i=${#a[@]}; i>2; i--)); do
echo "${a[*]: -i}"
done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is a typo?)
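Note that the loop above leaves IFS set to "." for the rest of the shell session. A possible variant that avoids touching IFS and needs no arrays is sketched below; the function name list_subdomains is made up for illustration, and it assumes names are read one per line from standard input.

```shell
# Hypothetical helper: for each dotted name on stdin, print the name and
# every shorter suffix that still contains at least two dots.
list_subdomains() {
  while IFS= read -r name; do
    # The case statement succeeds only while at least two dots remain.
    while case $name in *.*.*) true ;; *) false ;; esac; do
      printf '%s\n' "$name"
      name=${name#*.}        # drop the leftmost label and its dot
    done
  done
}

printf '%s\n' x.y.z.google.com | list_subdomains
```

This uses only POSIX parameter expansion, so it should also run under sh rather than requiring bash.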

For a shorter and arguably simpler solution, you could use Perl.
This auto-splits each line on the dot character into the @F array, then prints the range you want:
perl -F'\.' -le 'print join(".", @F[0..$#F-1])' a.txt
-F'\.' will auto-split each input line into the @F array. It will split on the given regular expression, so the dot needs to be escaped to be taken literally. (On Perl older than 5.20, -F does not imply -a and -n, so use perl -F'\.' -anle instead.)
$#F is the index of the last element of the array, so @F[0..$#F-1] is the slice from the first element ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use @F[0..$#F-2] etc.

Related

Replacing multiple command calls

I am able to trim and transpose the data below with sed, but it takes considerable time. I hope it would be faster with awk. Any suggestions are welcome.
Input Sample Data:
[INX_8_60L ] :9:Y
[INX_8_60L ] :9:N
[INX_8_60L ] :9:Y
[INX_8_60Z ] :9:Y
[INX_8_60Z ] :9:Y
Required Output:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
Just use awk, e.g.
awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
This should be orders of magnitude faster than chaining multiple commands. It just picks out the substring (e.g. "INX_8_60L") using substr and match. n is simply used as a false/true (0/1) flag to prevent outputting a "!" before the first string.
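To make the substr/match arithmetic concrete, here is a small sketch that prints the position returned by match next to the extracted key. The helper name pos_and_key is invented for this illustration.

```shell
# Hypothetical helper: show where the first run of blanks starts and
# which substring the substr() call picks out.
pos_and_key() {
  awk '{ p = match($0, /[ \t]+/); print p, substr($0, 2, p - 2) }'
}

echo '[INX_8_60L ] :9:Y' | pos_and_key    # prints: 11 INX_8_60L
```

The first blank is character 11, so substr($0, 2, 11 - 2) takes the nine characters after the opening bracket.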
Example Use/Output
With your data in file you would get:
$ awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
INX_8_60L!INX_8_60L!INX_8_60L!INX_8_60Z!INX_8_60Z
Which appears to be what you are after. (Note: I'm not sure what your separator character is, so just change above as needed) If not, let me know and I'm happy to help further.
Edit Per-Changes
Including the '?' isn't difficult, and I just copied the character, so you would now have:
awk -v n=0 '{s=substr($0,2,match($0,/[ \t]+/)-2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1}
END {print ""}' file
Example Output
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
And to simplify, just operating on the first field as in @JamesBrown's answer, that would reduce to:
awk -v n=0 '{s=substr($1,2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Let me know if that needs more changes.
Don't start so many sed commands; separate the sed operations with semicolons instead.
Try to process the data in a single job and avoid regex. The version below reads the fixed-size first block with substr() and inserts the ? while outputting.
$ awk '{
b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5)
}
END {
print b
}' file
Output:
INX?_8_60L;INX?_8_60L;INX?_8_60L;INX?_8_60Z;INX?_8_60Z
If the fields are not that static in size:
$ awk '
BEGIN {
FS="[[_ ]" # split field with regex
}
{
printf "%s%s?_%s_%s",(i++?";":""), $2,$3,$4 # output semicolons and fields
}
END {
print ""
}' file
Performance of solutions for 20 M records:
Former:
real 0m8.017s
user 0m7.856s
sys 0m0.160s
Latter:
real 0m24.731s
user 0m24.620s
sys 0m0.112s
sed can be very fast when used gingerly, so for simplicity and speed you might wish to consider:
sed -e 's/ .*//' -e 's/\[INX/INX?/' | tr '\n' '|' | sed -e '$s/|$//'
The second call to sed is there to satisfy the requirement that there is no trailing |.
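As a possible variation on that pipeline, paste -s joins lines with a delimiter and does not emit a trailing one, so the second sed call may not be needed. This is only a sketch: the function name join_keys is made up, and it keeps the ASCII | from the pipeline above rather than the ¦ from the question.

```shell
# Hypothetical helper: trim each line to its key, insert the "?",
# then join all lines with "|" in one pass.
join_keys() {
  sed -e 's/ .*//' -e 's/\[INX/INX?/' "$@" | paste -sd '|' -
}

printf '%s\n' '[INX_8_60L ] :9:Y' '[INX_8_60L ] :9:N' '[INX_8_60Z ] :9:Y' | join_keys
```

With no arguments the helper reads standard input; given a file name it reads that file instead.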
Another solution using GNU awk:
awk -F'[[ ]+' '
{printf "%s%s",(o?"¦":""),gensub(/INX/,"INX?",1,$2);o=1}
END{print ""}
' file
The field separator is set (with the -F option) so that the wanted parameter ends up in $2.
The main statement prints the modified parameter with the ? character inserted.
The variable o keeps track of whether the delimiter ¦ is needed.

Find words containing 20 vowels grep

I found many similar questions but most of them ask for vowels in a row which is easy. I want to find words that contain 20 vowels not in a row using grep.
I originally thought grep -Ei [aeiou]{20} would do it but that seems to search only for 20 vowels in a row
Use a regular expression that searches for 20 vowels separated by any quantity of consonants.
grep -Ei "[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*"
The backslash is just informing the shell that the expression continues on the next line. It is not part of the regex itself.
If you understand that part, you can shorten it considerably using groups. This regexp is the same as above, but uses a group in parentheses with a repetition count.
grep -Ei "([aeiou][b-df-hj-np-tv-z]*){20}"
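Two properties of this pattern are easy to miss: it is unanchored, so it matches 20 or more vowels, and the consonant class contains only letters, so a digit or punctuation mark between vowels breaks the run. A small sketch with a lower count shows both; the helper name has_vowels is hypothetical.

```shell
# Hypothetical helper: succeed if string $2 contains a run of letters
# holding at least $1 vowels.
has_vowels() {
  printf '%s\n' "$2" | grep -Eiq "([aeiou][b-df-hj-np-tv-z]*){$1}"
}

has_vowels 2 banana && echo 'banana: match (3 vowels, at least 2)'
has_vowels 2 sky    || echo 'sky: no match'
has_vowels 2 a1e    || echo 'a1e: no match (the digit is not in the consonant class)'
```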
I don't believe that's a problem that calls for just a regex. Here's a programmatic approach. We redefine the field separator to the empty string, so each character is a field (this works in GNU awk and mawk, although POSIX leaves an empty FS unspecified). We iterate over the line; if a character is a lowercase vowel we increment a counter. If, at the end of the string, the count is 20, we print it:
cat nicks.awk
BEGIN{
FS=""
}
{
c=0;
for( i=1;i<=NF;i=i+1 ){
if ($i ~ /[aeiou]/ ){
c=c+1;
}
};
if(c==20){
print $0
}
}
And this is what it does ... it only prints back the one string that has 20 vowels.
echo "contributorNickSequestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
echo "contributorNickSeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
contributorNickSeoquestionsfoundcontainingvowelsgrcep
echo "contributorNickSaeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
If all you really need is to find 20 vowels in a line then that's just:
awk '{x=tolower($0)} gsub(/[aeiou]/,"&",x)==20' file
or with grep:
grep -Ei '^[^aeiou]*([aeiou][^aeiou]*){20}$' file
To find words (assuming each is space separated) there are many options, including this with GNU awk:
awk -v RS='\\s+' -v IGNORECASE=1 'gsub(/[aeiou]/,"&")==20' file
or this with any awk:
awk '{for (i=1;i<=NF;i++) {x=tolower($i); if (gsub(/[aeiou]/,"&",x)==20) print $i} }' file
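If you would rather stay with grep for the word-by-word case, one option is to put each word on its own line first and then reuse the anchored pattern. The sketch below assumes words are separated by whitespace; the function name words_with_vowels and its optional count argument are invented for illustration.

```shell
# Hypothetical helper: print the whitespace-separated words on stdin that
# contain exactly N vowels (N defaults to 20), ignoring case.
words_with_vowels() {
  n=${1:-20}
  tr -s '[:space:]' '\n' |
    grep -Ei "^[^aeiou]*([aeiou][^aeiou]*){$n}\$"
}

echo 'short contributorNickSeoquestionsfoundcontainingvowelsgrcep words' | words_with_vowels
```

Passing a smaller count, for example words_with_vowels 2, makes it easier to try out on ordinary text.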

filter out unrecognised fields using awk

I have a CSV file where I expect some values such as Y or N. Folks are adding comments or arbitrary entries such as NA? that I want to remove:
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
I can use gsub to remove things that I am anticipating such as:
$ cat test.csv | awk '{gsub("NA\\?", ""); gsub("NA \\?",""); gsub("TBD", ""); print}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Yet that will break if someone adds a new comment. I am looking for a regex to generalise the match as "not Y".
I tried some negative look arounds but couldn't get it to work on the awk that I have which is GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). Thanks in advance!
awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Accepting only Y/N (case-insensitive).
awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'
This seems to do the trick. Loops through the 3rd through the last field, and if the field isn't Y, it's replaced with nothing. Since we're modifying fields we need to set OFS as well.
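The remark about OFS can be seen on a tiny made-up record: assigning to any field makes awk rebuild the line with OFS, which defaults to a single space.

```shell
# Blank the second field of a three-field record, without and with OFS set.
without_ofs=$(echo 'a,b,c' | awk 'BEGIN{FS=","} {$2=""; print}')
with_ofs=$(echo 'a,b,c' | awk 'BEGIN{FS=OFS=","} {$2=""; print}')

# Brackets make the separators visible: prints [a  c] then [a,,c]
printf '[%s]\n[%s]\n' "$without_ofs" "$with_ofs"
```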
$ cat file.txt
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
$ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' file.txt
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
If you wanted to accept "N" too, /^[YN]$/ would work.
cat test.CSV | awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}'
Output:
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Update: There's no need for a regex if you simply want to test whether a field is "Y" or not.
However, if you do want a regex: zzevannn's and tink's answers already give good regex conditions, so I'll show a batch replace by regex instead.
To be exact, and to increase the challenge, I created some boundary conditions:
$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,
And the batch replace is:
$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
"(,)[^,]*[^Y,]+[^,]*" matches any field between two commas that contains at least one character other than Y. gensub requires GNU awk.
Note that I first saved $1 and deleted it together with the comma after it, and printed it back afterwards.
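sed solution
# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1,/;t a' test.csv
# GNU
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1,/;ta' test.csv
The replacement keeps the comma so that the number of columns stays the same. Both forms assume an unwanted entry never starts with Y.
awk on the same concept (awk regexes support alternation, which basic sed regexes lack); here an unwanted entry must not start with Y or N:
awk -F ',' '{ Idx=$2;gsub(/,[[:blank:]]*[^YN,][^,]*/, ",");sub( /,/, "," Idx);print}' test.csv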

Extract field after colon for lines where field before colon matches pattern

I have a file file1 which looks as below:
tool1v1:1.4.4
tool1v2:1.5.3
tool2v1:1.5.2.c8.5.2.r1981122221118
tool2v2:32.5.0.abc.r20123433554
I want to extract value of tool2v1 and tool2v2
My output should be 1.5.2.c8.5.2.r1981122221118 and 32.5.0.abc.r20123433554.
I have written the following awk but it is not giving correct result:
awk -F: '/^tool2v1/ {print $2}' file1
awk -F: '/^tool2v2/ {print $2}' file1
grep -E can also do the job:
grep -E "tool2v[12]" file1 |sed 's/^.*://'
If you have a grep that supports Perl compatible regular expressions such as GNU grep, you can use a variable-sized look-behind:
$ grep -Po '^tool2v[12]:\K.*' infile
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
The -o option is to retain just the match instead of the whole matching line; \K is the same as "the line must match the things to the left, but don't include them in the match".
You could also use a normal look-behind:
$ grep -Po '(?<=^tool2v[12]:).*' infile
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
And finally, to fix your awk which was almost correct (and as pointed out in a comment):
$ awk -F: '/^tool2v[12]/ { print $2 }' infile
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
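Where grep has no -P option (BSD grep, for instance), a similar result can be had from sed alone by printing only the lines on which the prefix was actually removed. A minimal sketch, with a made-up function name:

```shell
# Hypothetical helper: print the value after the colon for tool2v1 and tool2v2.
# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded.
tool2_versions() {
  sed -n 's/^tool2v[12]://p' "$@"
}

printf '%s\n' 'tool1v1:1.4.4' 'tool2v1:1.5.2.c8.5.2.r1981122221118' \
  'tool2v2:32.5.0.abc.r20123433554' | tool2_versions
```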
You can filter with grep:
grep '\(tool2v1\|tool2v2\)'
And then remove the part before the : with sed:
sed 's/^.*://'
This sed operation means:
^ - match from beginning of string
.* - all characters
up to and including the :
... and replace this matched content with nothing.
The format is sed 's/<MATCH>/<REPLACE>/'
Whole command:
grep '\(tool2v1\|tool2v2\)' file1|sed 's/^.*://'
Result:
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
The question has already been answered, but you can also use pure bash to achieve the desired result:
#!/usr/bin/env bash
while IFS= read -r line; do
if [[ "$line" =~ ^tool2v ]]; then
echo "${line#*:}"
fi
done < ./file1.txt
The while loop reads every line of file1.txt; =~ does a regex match to check whether the value of $line starts with tool2v; ${line#*:} then strips everything up to and including the first colon.
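As a small illustration of the ${line#*:} expansion used in that loop, compare it with the ## form on a made-up value that contains two colons:

```shell
line='tool2v1:1.5.2:extra'   # invented sample value with two colons
first=${line#*:}             # removes the shortest prefix matching "*:"
last=${line##*:}             # removes the longest prefix matching "*:"

printf '%s\n' "$first" "$last"   # prints 1.5.2:extra, then extra
```

For the file in the question each line has a single colon, so both forms would give the same result there.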

Extracting word after fixed word with awk

I have a file file.txt containing a very long line:
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
I want to print only the word after "ctr" in this file, which is "StaffLine",
and I don't know how many characters there are in this word.
I've tried:
awk '{comp[substr("ctr",0)]{print}}'
but it didn't work. How can I get hold of that word?
Here's one way using awk:
awk -F "[{}]" '{ for(i=1;i<=NF;i++) if ($i == "ctr") print $(i+1) }' file
Or if your version of grep supports Perl-like regex:
grep -oP "(?<=ctr{)[^}]+" file
Results:
StaffLine
Using sed:
sed 's/.*}ctr{\([^}]*\).*/\1/' input
One way of dealing with it is with sed:
sed -e 's/.*}ctr{//; s/}.*//' file.txt
This deletes everything up to and including the { after the word ctr (avoiding issues with any words which have ctr as a suffix, such as a hypothetical pxctr{Bogus} entry); it then deletes anything from the first remaining } onwards, leaving just StaffLine on the sample data.
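The same idea can be wrapped in a small function that takes the tag name as an argument. This is only a sketch: the name tag_value is hypothetical, the tag is assumed to contain no characters special to sed, and, like the commands above, it only finds a tag that directly follows a closing brace.

```shell
# Hypothetical helper: for each line on stdin, print the text inside
# KEY{...} where KEY is the first argument and is preceded by "}".
tag_value() {
  sed -n "s/.*}$1{\\([^}]*\\)}.*/\\1/p"
}

echo 'x|tid{1}reqa{100}ctr{StaffLine}ftksn{JMT}||' | tag_value ctr
```

Because of -n and the p flag, lines that do not contain the tag produce no output instead of being echoed unchanged.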
perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' your_file
tested below:
> cat temp
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
> perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' temp
StaffLine
>

Resources