Select subdomains using print command - linux
cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
x google
But I want this output
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google
With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any awk.
awk '
BEGIN{
  FS=OFS="."
}
{
  nf=NF
  for(i=1;i<(nf-1);i++){
    print
    $1=""
    sub(/^[[:space:]]*\./,"")
  }
}
' Input_file
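As a quick check of the code above, you can run it against the sample from the question (a sketch; the file name Input_file is arbitrary):

```shell
# Build the sample input, then run the cascading-subdomain awk from above.
printf '%s\n' a.b.c.d.e.google.com x.y.z.google.com > Input_file
awk '
BEGIN{ FS=OFS="." }
{
  nf=NF                          # remember the original field count
  for(i=1;i<(nf-1);i++){
    print                        # emit the current (shrinking) record
    $1=""                        # drop the leftmost label
    sub(/^[[:space:]]*\./,"")    # remove the leading "." left behind
  }
}
' Input_file
```

This prints eight lines: a.b.c.d.e.google.com down to e.google.com, then x.y.z.google.com down to z.google.com.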
Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using sed
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using bash:
IFS=.
while read -ra a; do
  for ((i=${#a[@]}; i>2; i--)); do
    echo "${a[*]: -i}"
  done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is a typo?)
For a shorter and arguably simpler solution, you could use Perl.
To auto-split the line on the dot character into the @F array, and then print the range you want:
perl -F'\.' -le 'print join(".", @F[0..$#F-1])' a.txt
-F'\.' will auto-split each input line into the @F array. It will split on the given regular expression, so the dot needs to be escaped to be taken literally.
$#F is the index of the last element in the array. So @F[0..$#F-1] is the range of elements from the first one ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use @F[0..$#F-2] etc.
Related
Replacing multiple command calls
Able to trim and transpose the below data with sed, but it takes considerable time. Hope it would be better with AWK. Welcome any suggestions on this.
Input Sample Data:
[INX_8_60L ] :9:Y
[INX_8_60L ] :9:N
[INX_8_60L ] :9:Y
[INX_8_60Z ] :9:Y
[INX_8_60Z ] :9:Y
Required Output:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
Just use awk, e.g.:
awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
Which will be orders of magnitude faster. It just picks out the (e.g. "INX_8_60L") substring using substr() and match(). n is simply used as a false/true (0/1) flag to prevent outputting a "!" before the first string.
Example Use/Output
With your data in file you would get:
$ awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
INX_8_60L!INX_8_60L!INX_8_60L!INX_8_60Z!INX_8_60Z
Which appears to be what you are after. (Note: I'm not sure what your separator character is, so just change the above as needed.) If not, let me know and I'm happy to help further.
Edit per changes
Including the '?' isn't difficult, and I just copied the separator character, so you would now have:
awk -v n=0 '{s=substr($0,2,match($0,/[ \t]+/)-2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Example Output
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
And to simplify, just operating on the first field as in @JamesBrown's answer, that would reduce to:
awk -v n=0 '{s=substr($1,2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Let me know if that needs more changes.
Don't start so many sed commands; separate the sed operations with semicolons instead.
Try to process the data in a single job and avoid regex. Below reads the static-sized first block with substr() and inserts the ? while outputting:
$ awk '{ b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5) } END { print b }' file
Output:
INX?_8_60L;INX?_8_60L;INX?_8_60L;INX?_8_60Z;INX?_8_60Z
If the fields are not that static in size:
$ awk '
BEGIN {
  FS="[[_ ]"      # split field with regex
}
{
  printf "%s%s?_%s_%s",(i++?";":""), $2,$3,$4   # output semicolons and fields
}
END { print "" }' file
Performance of solutions for 20 M records:
Former:
real 0m8.017s
user 0m7.856s
sys 0m0.160s
Latter:
real 0m24.731s
user 0m24.620s
sys 0m0.112s
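To reproduce a comparison like this yourself, you can generate a synthetic input first. A sketch (the record text and the count n are placeholders; raise n toward 20000000 for figures like those above):

```shell
# Generate n copies of a sample record, then time the single-pass awk on it.
n=100000
awk -v n="$n" 'BEGIN{for(i=0;i<n;i++) print "[INX_8_60L    ] :9:Y"}' > file
time awk '{ b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5) } END { print b }' file > /dev/null
```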
sed can be very fast when used gingerly, so for simplicity and speed you might wish to consider:
sed -e 's/ .*//' -e 's/\[INX/INX?/' | tr '\n' '|' | sed -e '$s/|$//'
The second call to sed is there to satisfy the requirement that there is no trailing |.
Another solution using GNU awk:
awk -F'[[ ]+' '
  {printf "%s%s",(o?"¦":""),gensub(/INX/,"INX?",1,$2);o=1}
  END{print ""}
' file
The field separator is set (with the -F option) such that it matches the wanted parameter. The main statement prints the modified parameter with the ? character. The variable o keeps track of the delimiter ¦.
Find words containing 20 vowels grep
I found many similar questions but most of them ask for vowels in a row, which is easy. I want to find words that contain 20 vowels not in a row using grep. I originally thought grep -Ei [aeiou]{20} would do it, but that seems to search only for 20 vowels in a row.
Use a regular expression that searches for 20 vowels separated by any quantity of consonants:
grep -Ei "[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*"
The backslash is just informing the shell that the expression continues on the next line. It is not part of the regex itself. If you understand that part, you can shorten it considerably using groups. This regexp is the same as above, but using groups in parentheses with repetition:
grep -Ei "([aeiou][b-df-hj-np-tv-z]*){20}"
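A quick sanity check of the grouped form; note that, unanchored, it matches any line containing at least 20 vowels, not exactly 20:

```shell
# First line has 20 vowels (matches); second has 19 (does not).
printf 'aeiouaeiouaeiouaeiou\naeiouaeiouaeiouaeio\n' |
  grep -Ei "([aeiou][b-df-hj-np-tv-z]*){20}"
```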
I don't believe that's a problem that calls for just a regex. Here's a programmatic approach. We redefine the field separator to the empty string, so each character is a field. We iterate over the line; if a character is a vowel we increment a counter. If, at the end of the string, the count is 20, we print the line:
$ cat nicks.awk
BEGIN{ FS="" }
{
  c=0;
  for( i=1;i<=NF;i=i+1 ){
    if ($i ~ /[aeiou]/ ){
      c=c+1;
    }
  };
  if(c==20){ print $0 }
}
And this is what it does ... it only prints back the one string that has 20 vowels:
$ echo "contributorNickSequestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
$ echo "contributorNickSeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
contributorNickSeoquestionsfoundcontainingvowelsgrcep
$ echo "contributorNickSaeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
If all you really need is to find 20 vowels in a line then that's just:
awk '{x=tolower($0)} gsub(/[aeiou]/,"&",x)==20' file
or with grep:
grep -Ei '^[^aeiou]*([aeiou][^aeiou]*){20}$' file
To find words (assuming each is space separated) there's many options including this with GNU awk:
awk -v RS='\\s+' -v IGNORECASE=1 'gsub(/[aeiou]/,"&")==20' file
or this with any awk:
awk '{for (i=1;i<=NF;i++) {x=tolower($i); if (gsub(/[aeiou]/,"&",x)==20) print $i}}' file
filter out unrecognised fields using awk
I have a CSV file where I expect some values such as Y or N. Folks are adding comments or arbitrary entries such as NA? that I want to remove:
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
I can use gsub to remove things that I am anticipating, such as:
$ cat test.csv | awk '{gsub("NA\\?", ""); gsub("NA \\?",""); gsub("TBD", ""); print}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Yet that will break if someone adds a new comment. I am looking for a regex to generalise the match as "not Y". I tried some negative look-arounds but couldn't get them to work on the awk that I have, which is GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). Thanks in advance!
awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Accepting only Y/N (case-insensitive).
awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'
This seems to do the trick. It loops through the 3rd through the last field, and if the field isn't Y, it's replaced with nothing. Since we're modifying fields we need to set OFS as well.
$ cat file.txt
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
$ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' file.txt
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
If you wanted to accept "N" too, /^[YN]$/ would work.
cat test.CSV | awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}'
Output:
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Update: So there's no need to use regex if you simply want to determine whether a field is "Y" or not. However, if you do want regex, zzevannn's answer and tink's answer already give great regex conditions, so I'll give a batch replace by regex instead.
To be exact, and to increase the challenge, I created some boundary conditions:
$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,
And the batch replace is:
$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
"(,)[^,]*[^Y,]+[^,]*" matches anything between two commas other than a single Y. Note I saved $1 and deleted $1 and the comma after it first, then printed it back.
sed solution:
# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;t a' test.csv
# GNU sed
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;ta' test.csv
awk on the same concept (avoiding sed's lack of alternation in basic regular expressions):
awk -F ',' '{ Idx=$2;gsub(/,[[:blank:]]*[^YN,][^,]*/, "");sub( /,/, "," Idx);print}'
Extract field after colon for lines where field before colon matches pattern
I have a file file1 which looks as below:
tool1v1:1.4.4
tool1v2:1.5.3
tool2v1:1.5.2.c8.5.2.r1981122221118
tool2v2:32.5.0.abc.r20123433554
I want to extract the values of tool2v1 and tool2v2. My output should be 1.5.2.c8.5.2.r1981122221118 and 32.5.0.abc.r20123433554.
I have written the following awk but it is not giving the correct result:
awk -F: '/^tool2v1/ {print $2}' file1
awk -F: '/^tool2v2/ {print $2}' file1
grep -E can also do the job: grep -E "tool2v[12]" file1 |sed 's/^.*://'
If you have a grep that supports Perl-compatible regular expressions, such as GNU grep, you can use a variable-sized look-behind:
$ grep -Po '^tool2v[12]:\K.*' infile
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
The -o option retains just the match instead of the whole matching line; \K means "the line must match the things to the left, but don't include them in the match". You could also use a normal look-behind:
$ grep -Po '(?<=^tool2v[12]:).*' infile
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
And finally, to fix your awk, which was almost correct (and as pointed out in a comment):
$ awk -F: '/^tool2v[12]/ { print $2 }' infile
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
You can filter with grep:
grep '\(tool2v1\|tool2v2\)'
And then remove the part before the : with sed:
sed 's/^.*://'
This sed operation means:
^ - match from the beginning of the string
.* - all characters up to and including the :
... and replace this matched content with nothing. The format is sed 's/<MATCH>/<REPLACE>/'.
Whole command:
grep '\(tool2v1\|tool2v2\)' file1 | sed 's/^.*://'
Result:
1.5.2.c8.5.2.r1981122221118
32.5.0.abc.r20123433554
The question has already been answered, but you can also use pure bash to achieve the desired result:
#!/usr/bin/env bash
while read -r line; do
  if [[ "$line" =~ ^tool2v ]]; then
    echo "${line#*:}"
  fi
done < ./file1.txt
The while loop reads every line of file1.txt; =~ does a regexp match to check whether the value of the $line variable starts with tool2v, and ${line#*:} then trims everything up to and including the first :.
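The key piece above is the ${line#*:} parameter expansion, which strips the shortest prefix matching *: (everything through the first colon); a minimal illustration:

```shell
# "#*:" removes the shortest leading match of "*:", keeping the version part.
line='tool2v1:1.5.2.c8.5.2.r1981122221118'
echo "${line#*:}"   # -> 1.5.2.c8.5.2.r1981122221118
```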
Extracting word after fixed word with awk
I have a file file.txt containing a very long line:
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
I want to print only the word after "ctr" in this file, which is "StaffLine", and I don't know how many characters there are in this word. I've tried:
awk '{comp[substr("ctr",0)]{print}}'
but it didn't work. How can I get hold of that word?
Here's one way using awk:
awk -F "[{}]" '{ for(i=1;i<=NF;i++) if ($i == "ctr") print $(i+1) }' file
Or if your version of grep supports Perl-like regex:
grep -oP "(?<=ctr{)[^}]+" file
Results:
StaffLine
Using sed: sed 's/.*}ctr{\([^}]*\).*/\1/' input
One way of dealing with it is with sed: sed -e 's/.*}ctr{//; s/}.*//' file.txt This deletes everything up to and including the { after the word ctr (avoiding issues with any words which have ctr as a suffix, such as a hypothetical pxctr{Bogus} entry); it then deletes anything from the first remaining } onwards, leaving just StaffLine on the sample data.
perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' your_file
Tested below:
> cat temp
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
> perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' temp
StaffLine
>