Need to remove domains from sub-domains - linux

I am trying to get the last 2 values, from right to left, with the cut command.
I have a large database of about 110 million domains and subdomains, like:
yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
In simple words, I am trying to remove subdomains from domains:
echo a.yahoo.aa | cut -d '.' -f 2,3
yahoo.aa
but when I try
echo yahoo.aa | cut -d '.' -f 2,3
aa
it gives me only aa.
Required output is
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
Edit: thanks anubhava for the suggestion.
A TLD pattern looks like
xxxx.xx
xxx.xx
xx.xx
i.e. a ccTLD always has 2 characters at the end.

A long solution, but I think it does what you want:
Executable file domain.awk:
#! /usr/bin/awk -f
BEGIN {
    FS = "."
}
{
    ret = $NF
    if (NF >= 2 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3)) {
        ret = $(NF - 1) "." ret
        if (NF >= 3) {
            ret = $(NF - 2) "." ret
        }
    } else if (NF >= 2) {
        ret = $(NF - 1) "." ret
    }
    print ret
}
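The script must be made executable before it can be invoked directly (standard chmod; this step is implied by "Executable file" above but not shown):
chmod +x domain.awk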
With the domains.lst file:
yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
aus.co.au
Used like this:
./domain.awk domains.lst
Output:
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
aus.co.au

Using the sample input you provided, and accepting your statement that "a ccTLD always has 2 characters at the end" as the criterion for printing the last 3 instead of the last 2 segments of the input:
Using GNU grep for -o:
$ grep -Eo '[^.]+\.[^.]+(\.[^.]{2})?$' file
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
or using any awk:
$ awk 'match($0,/[^.]+\.[^.]+(\.[^.]{2})?$/){print substr($0,RSTART)}' file
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
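As a quick sanity check (my own run, not part of the original answer), the same pattern also keeps three segments for the aus.co.au sample used earlier, since its last segment has 2 characters:
$ echo aus.co.au | grep -Eo '[^.]+\.[^.]+(\.[^.]{2})?$'
aus.co.au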

Try
echo a.yahoo.aa | awk -F'.' '{print $(NF-1)"."$NF}'

"large database for about 110 Million domains and subdomains."
Due to this, I suggest using sed here. Let file.txt content be
yahoo.com
mail.yahoo.com
a.yahoo.com
then
sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' file.txt
output
yahoo.com
yahoo.com
yahoo.com
Explanation: using a regular expression spanning the whole line (^ start, $ end), I use a single capturing group which contains zero-or-more (*) non-dots, followed by a literal dot (\.), followed by zero-or-more non-dots adjacent to the end of the line, and I replace the whole line with the content of that group. Disclaimer: this solution assumes there is always at least one dot in each line.
(tested in GNU sed 4.2.2)
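Note (my own test, not from the original answer) that because the leading .* is greedy, any depth of subdomain collapses to the last two labels:
$ echo a.b.c.yahoo.com | sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/'
yahoo.com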

You are selecting only fields 2 and 3. You need to select from field 2 up to the end:
... | cut -d '.' -f 2-
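For example (my own run), the open-ended list keeps everything from the second field on:
$ echo a.yahoo.aa | cut -d '.' -f 2-
yahoo.aa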


Extracting a set of characters for a column in a txt file

I have a BED file (which is a txt file formed by columns separated by tabs). The fourth column has a name followed by numbers. Using the command line (Linux), I would like to get these names without repetition. I provided an example below.
This is my file:
$ head example.bed
1 2160195 2161184 SKI_1.2160205.2161174
1 2234406 2234552 SKI_1.2234416.2234542
1 2234713 2234849 SKI_1.2234723.2234839
1 2235268 2235551 SKI_1.2235278.2235541
1 2235721 2236034 SKI_1.2235731.2236024
1 2237448 2237699 SKI_1.2237458.2237689
1 2238005 2238214 SKI_1.2238015.2238204
1 9770503 9770664 PIK3CD_1.9770513.9770654
1 9775588 9775837 PIK3CD_1.9775598.9775827
1 9775896 9776146 PIK3CD_1.9775906.9776136
...
My list should look like this:
SKI_1
PIK3CD_1
...
Could you please help me with the code I need to use?
I found the solution years ago with grep, but I have lost the document in which I used to save all my useful commands.
Given so.txt:
1 2160195 2161184 SKI_1.2160205.2161174
1 2234406 2234552 SKI_1.2234416.2234542
1 2234713 2234849 SKI_1.2234723.2234839
1 2235268 2235551 SKI_1.2235278.2235541
1 2235721 2236034 SKI_1.2235731.2236024
1 2237448 2237699 SKI_1.2237458.2237689
1 2238005 2238214 SKI_1.2238015.2238204
1 9770503 9770664 PIK3CD_1.9770513.9770654
1 9775588 9775837 PIK3CD_1.9775598.9775827
1 9775896 9776146 PIK3CD_1.9775906.9776136
Then the following command should do the trick:
cat so.txt | awk '{split($4,f,".");print f[1];}' | sort -u
$4 is the 4th column.
We split the 4th column on the . character; the result is put into the f array.
Finally we filter out the duplicates with sort -u.
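The deduplication can also be done inside awk itself with an associative array, avoiding the external sort (a sketch; output order is first appearance rather than sorted):
awk '{split($4,f,"."); if (!seen[f[1]]++) print f[1]}' so.txt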
With data in the file bed and using awk:
awk 'NR>1 { split($4,arr,".");bed[arr[1]]="" } END { for (i in bed) { print i } }' bed
Ignore the first line, then split the 4th space-delimited field into the array arr based on "." and put the first element of this array into another array, bed. In the END block, loop through the bed array and print the indexes.
Using grep:
cat example.bed|awk '{print $4}'|grep -oP '\w+_\d+'|sort -u
produces:
PIK3CD_1
SKI_1
and without the cat command (as there are always UUOC advocates here.. ):
awk '{print $4}' < example.bed|grep -oP '\w+_\d+'|sort -u
I have found this:
head WRGL2_hg19_v1.bed | cut -f4 | cut -d "." -f1 | sort -u
Output:
PIK3CD_1
SKI_1

Isolate product names from strings by matching string after (including) first letter in a variable

I have a bunch of strings of following pattern in a text file:
201194_2012110634 Appliance 130 AB i Some optional (Notes )
300723_2017050006(2016111550) Device 16 AB i Note
The first part is the serial, the second is the date. The device/appliance name and model (about 10 possible different names) is the string after the date number and before (and including) AB i.
I was able to isolate dates and serials using
SERIAL=${line:0:6}
YEAR=${line:7:4}
I'm trying to isolate Device name and note after that:
#!/bin/bash
while IFS= read line || [[ -n $line ]]; do
NAME=${line#*[a-zA-Z]}
STRINGAP='Appliance '"${line/#*Appliance/}"
The first approach is to take everything after the first letter appearing in the line, which gives me
NAME = ppliance 130 AB i Some optional (Notes )
The second approach is to write tests for each of the ~10 possible appliance/device names, append the appliance name after the subtracted test, then check which variable actually matched Appliance / Device (or another name) and use that as input into the database.
Is it possible to write a line that would select everything, including the first letter in the line? Then I would subtract everything after AB i to get the notes, and everything before AB i would become the appliance name.
Remove the ${line#*[a-zA-Z]} line (which will, as you see, remove the first character of the name), and instead use
STRINGAP=$(echo "$line" | sed 's/^[0-9_]* \(.*\) AB i.*/\1/')
This drops the leading digits and underscore, and everything from " AB i" to the end.
Edit: The details are unclear - do you want to keep the "AB i", and will it always be "AB i"? If you want it, change the line to
STRINGAP=$(echo "$line" | sed 's/^[0-9_]* \(.* AB i\).*/\1/')
I also forgot the double quotes round the text line.
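Applied to the first sample line (my own run), the first form yields:
$ echo '201194_2012110634 Appliance 130 AB i Some optional (Notes )' | sed 's/^[0-9_]* \(.*\) AB i.*/\1/'
Appliance 130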
You can use sed and read to give you more control of parsing.
tmp> line2="300723_2017050006(2016111550) Device 16 AB i Note"
tmp> read serial date type val <<<$(echo $line2 | \
sed 's/\([0-9]*\)_\([0-9]*\)[^A-Z]*\(Device\|Appliance\) \
\([0-9]*\).*/\1 \2 \3 \4/')
tmp> echo "$serial|$date|$type|$val"
300723|2017050006|Device|16
Basically, read allows you to assign multiple variables in one line. The sed statement parses the line, and gives you space-delimited output of its results. You can also read each variable separately if you don't mind running sed a few extra times:
device="$(echo $line2 | sed -e 's/^.*Device \([0-9]*\).*/\1/;t;d')"
appliance="$(echo $line2 | sed -e 's/^.*Appliance \([0-9]*\).*/\1/;t;d')"
This way $device is populated with device if present, and is blank otherwise (note the -e and ;t;d at the end of the regex to prevent it from dumping the line if it doesn't match.)
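To see the t;d effect in isolation (my own run, with the $line2 value from above): the Device substitution succeeds, so t branches past d and the captured number prints; the Appliance substitution fails, t does not branch, and d deletes the line, so nothing is printed.
$ echo $line2 | sed -e 's/^.*Device \([0-9]*\).*/\1/;t;d'
16
$ echo $line2 | sed -e 's/^.*Appliance \([0-9]*\).*/\1/;t;d'
$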
Your question isn't clear but it seems like you might be trying to parse strings into substrings. Try this with GNU awk for the 3rd arg to match() and let us know if there's something else you were looking for:
$ awk 'match($0,/^([0-9]+)_([0-9]+)(\([0-9]+\))?\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)/,a) {
for (i=1; i<=8; i++) {
print i, a[i]
}
print "---"
}' file
1 201194
2 2012110634
3
4 Appliance
5 130
6 AB
7 i
8 Some optional (Notes )
---
1 300723
2 2017050006
3 (2016111550)
4 Device
5 16
6 AB
7 i
8 Note
---
If you wanted a CSV output, for example, then it'd just be:
$ awk -v OFS=',' 'match($0,/^([0-9]+)_([0-9]+)(\([0-9]+\))?\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)/,a) {
for (i=1; i<=8; i++) {
printf "%s%s", a[i], (i<8?OFS:ORS)
}
}' file
201194,2012110634,,Appliance,130,AB,i,Some optional (Notes )
300723,2017050006,(2016111550),Device,16,AB,i,Note
Massage to suit...

Use sed or awk to replace line after match

I'm trying to create a little script that basically uses dig +short to find the IP of a website, and then pipe that to sed/awk/grep to replace a line. This is what the current file looks like:
#Server
123.455.1.456
246.523.56.235
So, basically, I want to search for the '#Server' line in a text file, and then replace the two lines underneath it with an IP address acquired from dig.
I understand some of the syntax of sed, but I'm really having trouble figuring out how to replace two lines underneath a match. Any help is much appreciated.
Based on the OP, it's not 100% clear exactly what needs to be replaced where, but here's a one-liner for the general case, using GNU sed and bash. Replace the two lines after "3" with standard input:
echo Hoot Gibson | sed -e '/3/{r /dev/stdin' -e ';p;N;N;d;}' <(seq 7)
Outputs:
1
2
3
Hoot Gibson
6
7
Note: sed's r command is opaquely documented (in Linux anyway). For more about r, see:
"5.9. The 'r' command isn't inserting the file into the text" in this sed FAQ.
Here's how in awk:
newip=12.34.56.78
awk -v newip="$newip" '{
if($1 == "#Server"){
l = NR;
print $0
}
else if(l>0 && NR == l+1){
print newip
}
else if(l==0 || NR != l+2){
print $0
}
}' file > file.tmp
mv -f file.tmp file
explanation:
pass $newip to awk
if the first field of the current line is #Server, let l = current line number.
else if the current line is one past #Server, print the new ip.
else if the current row is not two past #Server, print the line.
overwrite original file with modified version.
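Given the #Server file from the question and newip=12.34.56.78, the rewritten file ends up as (my own run):
#Server
12.34.56.78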

find a pattern and print lines based on finding the first pattern with sed, awk, grep

I have a rather large file. What is common to all sections is the hostname line that breaks up each section, for example:
HOSTNAME:host1
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 2
HOSTNAME:host-2
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 1
As you see above, in between each section there are other sections broken down by key words or lines that have specific values.
I'd like to use a one-liner to print the hostname for each section and then print whichever lines I want to extract under each hostname section.
Can you please help? I am now using grep -C 10 HOSTNAME | grep -C pattern
but this assumes that there are 10 lines in each section. This is not an optimal way to do this; can someone show a better way? I also need to be able to print more than one line under each pattern that I find, so if I find data 1 and there are additional lines under it, I'd like to grab and print them.
So the output of the command would be like:
grep -C 10 HOSTNAME | grep 'data 1'
grep -C 10 HOSTNAME | grep -A 2 'data 1'
HOSTNAME:Host1
data 1
HOSTNAME:Hoss2
data 1
Besides grep, I use this sed command to print my output:
sed -r '/HOSTNAME|shared/!d' filename
The only problem with this sed command is that it only prints the lines that contain the patterns shared & HOSTNAME. I also need to specify the number of lines I'd like to print, in my case under the line that matched the pattern shared. So I'd like to print HOSTNAME and give the number of lines to print under the second search pattern, shared.
Thanks
awk to the rescue!
$ awk -v lines=2 '/HOSTNAME/{c=lines} NF&&c&&c--' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Prints lines number of lines including the pattern match, skipping empty lines: NF is non-zero only for non-empty lines, and c && c-- stays true while the countdown started at each match is still positive.
If you want to specify a secondary keyword instead of a number of lines:
$ awk -v key='data 1' '/HOSTNAME/{h=1; print} h&&$0~key{print; h=0}' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Here is a sed twoliner:
sed -n -r '/HOSTNAME/ { p }
/^\s+data 1/ {p }' hostnames.txt
It prints (p)
when the line contains HOSTNAME
when the line starts with some whitespace (\s+) followed by your search criterion (data 1)
Non-matching lines are not printed (due to the sed -n option).
Edit: Some remarks:
this was tested with GNU sed 4.2.2 under Linux
you don't need the -r if your sed version does not support it; replace the second pattern with /^.*data 1/
we can squash everything into one line with ;
Putting it all together, here is a revised version in one line, without the need for the extended regex (i.e. without -r):
sed -n '/HOSTNAME/ { p } ; /^.*data 1/ {p }' hostnames.txt
The OP's requirements seem to be very unclear, but the following is consistent with one interpretation of what has been requested. More importantly, the program has no special requirements, and the code can easily be modified to meet a variety of needs. In particular, both search patterns (the HOSTNAME pattern and the "data 1" pattern) can easily be parameterized.
The main idea is to print all lines in a specified subsection, or at least a certain number up to some limit.
If there is a limit on how many lines in a subsection should be printed, specify a value for limit, otherwise set it to 0.
awk -v limit=0 '
/^HOSTNAME:/ { subheader=0; hostname=1; print; next}
/^ *data 1/ { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }'
Given the lines provided in the question, the output would be:
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
(Yes, I know the variable 'hostname' in the awk program is currently unused, but I included it to make it easy to add a test to satisfy certain obvious requirements regarding the preconditions for identifying a subheader.)
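For instance (a sketch of that parameterization; the variable names are mine, not from the original answer):
awk -v limit=0 -v host='^HOSTNAME:' -v key='^ *data 1' '
$0 ~ host { subheader=0; print; next }
$0 ~ key { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }'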
sed -n -e '/hostname/,+1p' -e '/Duplex/,+1p'
The simplest way to do it is to combine two sed commands.

Using awk to print all columns from the nth to the last

This line worked until I had whitespace in the second field:
svn status | grep '\!' | gawk '{print $2;}' > removedProjs
Is there a way to have awk print everything in $2 or greater? ($3, $4... until we don't have any more columns?)
I suppose I should add that I'm doing this in a Windows environment with Cygwin.
Print all columns:
awk '{print $0}' somefile
Print all but the first column:
awk '{$1=""; print $0}' somefile
Print all but the first two columns:
awk '{$1=$2=""; print $0}' somefile
There's a duplicate question with a simpler answer using cut:
svn status | grep '\!' | cut -d' ' -f2-
-d specifies the delimiter (space), -f specifies the list of columns (all starting with the 2nd)
You could use a for-loop to loop through printing fields $2 through $NF (built-in variable that represents the number of fields on the line).
Edit:
Since "print" appends a newline, you'll want to buffer the results:
awk '{out = ""; for (i = 2; i <= NF; i++) {out = out " " $i}; print out}'
Alternatively, use printf:
awk '{for (i = 2; i <= NF; i++) {printf "%s ", $i}; printf "\n"}'
awk '{out=$2; for(i=3;i<=NF;i++){out=out" "$i}; print out}'
My answer is based on the one by VeeArr, but I noticed it started with a white space before it would print the second column (and the rest). As I only have 1 reputation point, I can't comment on it, so here it goes as a new answer:
Start with "out" as the second column and then add all the other columns (if they exist). This goes well as long as there is a second column.
Most solutions with awk leave a space. The options here avoid that problem.
Option 1
A simple cut solution (works only with single delimiters):
command | cut -d' ' -f3-
Option 2
Forcing an awk re-calculation sometimes removes the added leading space (OFS) left by removing the first fields (works with some versions of awk):
command | awk '{ $1=$2="";$0=$0;} NF=NF'
Option 3
Printing each field formatted with printf will give more control:
$ in=' 1 2 3 4 5 6 7 8 '
$ echo "$in"|awk -v n=2 '{ for(i=n+1;i<=NF;i++) printf("%s%s",$i,i==NF?RS:OFS);}'
3 4 5 6 7 8
However, all the previous answers change all repeated FS between fields to OFS. Let's build a couple of options that do not do that.
Option 4 (recommended)
A loop with sub to remove fields and delimiters at the front, using the value of FS instead of a space (which could be changed).
This is more portable, and doesn't trigger a change of FS to OFS:
NOTE: The ^[FS]* is to accept an input with leading spaces.
$ in=' 1 2 3 4 5 6 7 8 '
$ echo "$in" | awk '{ n=2; a="^["FS"]*[^"FS"]+["FS"]+";
for(i=1;i<=n;i++) sub( a , "" , $0 ) } 1 '
3 4 5 6 7 8
Option 5
It is quite possible to build a solution that does not add extra (leading or trailing) whitespace, and preserves existing whitespace, using the function gensub from GNU awk, like this:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=2 'BEGIN{ a="^["FS"]*"; b="([^"FS"]+["FS"]+)"; c="{"n"}"; }
{ print(gensub(a""b""c,"",1)); }'
3 4 5 6 7 8
It also may be used to swap a group of fields given a count n:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=2 'BEGIN{ a="^["FS"]*"; b="([^"FS"]+["FS"]+)"; c="{"n"}"; }
{
d=gensub(a""b""c,"",1);
e=gensub("^(.*)"d,"\\1",1,$0);
print("|"d"|","!"e"!");
}'
|3 4 5 6 7 8 | ! 1 2 !
Of course, in such case, the OFS is used to separate both parts of the line, and the trailing white space of the fields is still printed.
NOTE: [FS]* is used to allow leading spaces in the input line.
I personally tried all the answers mentioned above, but most of them were a bit complex or just not right. The easiest way to do it, from my point of view, is:
awk -F" " '{ for (i=4; i<=NF; i++) print $i }'
Where -F" " defines the delimiter for awk to use. In my case it is the whitespace, which is also the default delimiter for awk. This means that -F" " can be omitted.
Where NF defines the total number of fields/columns. Therefore the loop will begin from the 4th field up to the last field/column.
Where $N retrieves the value of the Nth field. Therefore print $i will print the current field/column based on the loop count. Note that print appends a newline after each field, so this prints one field per line; use printf (as in earlier answers) to keep everything on one line.
awk '{ for(i=3; i<=NF; ++i) printf $i""FS; print "" }'
lauhub proposed this correct, simple and fast solution here
This was irritating me so much that I sat down and wrote a cut-like field-specification parser, tested with GNU Awk 3.1.7.
First, create a new Awk library script called pfcut, with e.g.
sudo nano /usr/share/awk/pfcut
Then, paste in the script below, and save. After that, this is what the usage looks like:
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk -f pfcut --source '/^/ { pfcut("-4"); }'
t1 t2 t3 t4
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk -f pfcut --source '/^/ { pfcut("2-"); }'
t2 t3 t4 t5 t6 t7
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk -f pfcut --source '/^/ { pfcut("-2,4,6-"); }'
t1 t2 t4 t6 t7
To avoid typing all that, I guess the best one can do (see otherwise Automatically load a user function at startup with awk? - Unix & Linux Stack Exchange) is add an alias to ~/.bashrc; e.g. with:
$ echo "alias awk-pfcut='awk -f pfcut --source'" >> ~/.bashrc
$ source ~/.bashrc # refresh bash aliases
... then you can just call:
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk-pfcut '/^/ { pfcut("-2,4,6-"); }'
t1 t2 t4 t6 t7
Here is the source of the pfcut script:
# pfcut - print fields like cut
#
# sdaau, GNU GPL
# Nov, 2013
function spfcut(formatstring)
{
    # parse format string
    numsplitscomma = split(formatstring, fsa, ",");
    numspecparts = 0;
    split("", parts); # clear/initialize array (for e.g. `tail` piping into `awk`)
    for(i=1;i<=numsplitscomma;i++) {
        commapart=fsa[i];
        numsplitsminus = split(fsa[i], cpa, "-");
        # assume here a range is always just two parts: "a-b"
        # also assume user has already sorted the ranges
        #print numsplitsminus, cpa[1], cpa[2]; # debug
        if(numsplitsminus==2) {
            if ((cpa[1]) == "") cpa[1] = 1;
            if ((cpa[2]) == "") cpa[2] = NF;
            for(j=cpa[1];j<=cpa[2];j++) {
                parts[numspecparts++] = j;
            }
        } else parts[numspecparts++] = commapart;
    }
    n=asort(parts); outs="";
    for(i=1;i<=n;i++) {
        outs = outs sprintf("%s%s", $parts[i], (i==n)?"":OFS);
        #print(i, parts[i]); # debug
    }
    return outs;
}
function pfcut(formatstring) {
    print spfcut(formatstring);
}
Would this work?
awk '{print substr($0,length($1)+1);}' < file
It leaves some whitespace in front though.
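If that leading whitespace matters, it can be stripped in the same pass (a sketch building on the answer above, assuming the line itself has no leading blanks, as in svn status output):
awk '{s=substr($0,length($1)+1); sub(/^[ \t]+/,"",s); print s}' < file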
Printing out columns starting from #2 (the output will have no leading space):
ls -l | awk '{sub(/[^ ]+ /, ""); print $0}'
echo "1 2 3 4 5 6" | awk '{ $NF = ""; print $0}'
This one uses awk to print all except the last field.
This is what I preferred from all the recommendations:
Printing from the 6th to last column.
ls -lthr | awk '{out=$6; for(i=7;i<=NF;i++){out=out" "$i}; print out}'
or
ls -lthr | awk '{ORS=" "; for(i=6;i<=NF;i++) print $i;print "\n"}'
If you need specific columns printed with an arbitrary delimiter:
awk '{print $3 " " $4}'
col#3 col#4
awk '{print $3 "anything" $4}'
col#3anythingcol#4
So if you have whitespace in a column it will become two columns, but you can connect them with any delimiter, or without one.
Perl solution:
perl -lane 'splice @F,0,1; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,1 cleanly removes column 0 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
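A quick demonstration (my own run):
$ echo 'col1 col2 col3' | perl -lane 'splice @F,0,1; print join " ",@F'
col2 col3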
Python solution:
python -c "import sys;[sys.stdout.write(' '.join(line.split()[1:]) + '\n') for line in sys.stdin]" < file
I want to extend the proposed answers to the situation where fields are delimited by possibly several whitespace characters, which I suppose is the reason why the OP is not using cut.
I know the OP asked about awk, but a sed approach would work here (example printing columns from the 5th to the last):
pure sed approach
sed -r 's/^\s*(\S+\s+){4}//' somefile
Explanation:
s/// is the standard command to perform substitution
^\s* matches any consecutive whitespace at the beginning of the line
\S+\s+ means a column of data (non-whitespace chars followed by whitespace chars)
(){4} means the pattern is repeated 4 times.
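For example (my own run), dropping the first 4 columns:
$ echo ' 1 2 3 4 5 6' | sed -r 's/^\s*(\S+\s+){4}//'
5 6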
sed and cut
sed -r 's/^\s+//; s/\s+/\t/g' somefile | cut -f5-
This works by replacing consecutive whitespace with a single tab.
tr and cut:
tr can also be used to squeeze consecutive characters with the -s option.
tr -s '[:blank:]' <somefile | cut -d' ' -f5-
If you don't want to reformat the part of the line that you don't chop off, the best solution I can think of is written in my answer in:
How to print all the columns after a particular number using awk?
It chops what is before the given field number N, and prints all the rest of the line, including field number N, maintaining the original spacing (it does not reformat). It doesn't matter if the string of the field also appears somewhere else in the line.
Define a function:
fromField () {
awk -v m="\x01" -v N="$1" '{$N=m$N; print substr($0,index($0,m)+1)}'
}
And use it like this:
$ echo " bat bi iru lau bost " | fromField 3
iru lau bost
$ echo " bat bi iru lau bost " | fromField 2
bi iru lau bost
Output maintains everything, including trailing spaces
In your particular case:
svn status | grep '\!' | fromField 2 > removedProjs
If your file/stream does not contain new-line characters in the middle of the lines (you could be using a different Record Separator), you can use:
awk -v m="\x0a" -v N="3" '{$N=m$N ;print substr($0, index($0,m)+1)}'
The first case will fail only in files/streams that contain the rare hexadecimal char number 1
This awk function returns the substring of $0 that includes fields from begin to end:
function fields(begin, end, b, e, p, i) {
b = 0; e = 0; p = 0;
for (i = 1; i <= NF; ++i) {
if (begin == i) { b = p; }
p += length($i);
e = p;
if (end == i) { break; }
p += length(FS);
}
return substr($0, b + 1, e - b);
}
To get everything starting from field 3:
tail = fields(3);
To get section of $0 that covers fields 3 to 5:
middle = fields(3, 5);
The b, e, p, i nonsense in the function parameter list is just an awk way of declaring local variables.
All of the other answers given here and in linked questions fail in various ways given various possible FS values. Some leave leading and/or trailing white space, some convert every FS to the OFS, some rely on semantics that only apply when FS is the default value, some rely on negating FS in a bracket expression which will fail given a multi-char FS, etc.
To do this robustly for any FS, use GNU awk for the 4th arg to split():
$ cat tst.awk
{
split($0,flds,FS,seps)
for ( i=n; i<=NF; i++ ) {
printf "%s%s", flds[i], seps[i]
}
print ""
}
$ printf 'a b c d\n' | awk -v n=3 -f tst.awk
c d
$ printf ' a b c d\n' | awk -v n=3 -f tst.awk
c d
$ printf ' a b c d\n' | awk -v n=3 -F'[ ]' -f tst.awk
b c d
$ printf ' a b c d\n' | awk -v n=3 -F'[ ]+' -f tst.awk
b c d
$ printf 'a###b###c###d\n' | awk -v n=3 -F'###' -f tst.awk
c###d
$ printf '###a###b###c###d\n' | awk -v n=3 -F'###' -f tst.awk
b###c###d
Note that I'm using split() above because its 3rd arg is a field separator, not just a regexp like the 2nd arg to match(). The difference is that field separators have additional semantics beyond regexps, such as skipping leading and/or trailing blanks when the separator is a single blank char; if you wanted to use a while(match()) loop or any form of *sub() to emulate the above then you'd need to write code to implement those semantics, whereas split() already implements them for you.
The awk examples look complex here; here is simple Bash shell syntax:
command | while read -a cols; do echo ${cols[@]:1}; done
Where 1 is your nth column counting from 0.
Example
Given this content of file (in.txt):
c1
c1 c2
c1 c2 c3
c1 c2 c3 c4
c1 c2 c3 c4 c5
here is the output:
$ while read -a cols; do echo ${cols[@]:1}; done < in.txt
c2
c2 c3
c2 c3 c4
c2 c3 c4 c5
This works if you are using Bash; you can use as many 'x ' placeholders as there are elements you wish to discard, and it ignores multiple spaces if they are not escaped.
while read x b; do echo "$b"; done < filename
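For example (my own run), read splits the first column into x and leaves the remainder in b:
$ echo 'c1 c2 c3' | while read x b; do echo "$b"; done
c2 c3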
Perl:
@m=`ls -ltr dir | grep ^d | awk '{print \$6,\$7,\$8,\$9}'`;
foreach $i (@m)
{
    print "$i\n";
}
UPDATE:
If you want to use no function calls at all while preserving the spaces and tabs in between the remaining fields, then do:
echo " 1 2 33 4444 555555 \t6666666 " |
{m,g}awk ++NF FS='^[ \t]*[^ \t]*[ \t]+|[ \t]+$' OFS=
2 33 4444 555555 6666666
===================
You can make it a lot more straightforward:
svn status | [m/g]awk '/!/*sub("^[^ \t]*[ \t]+",_)'
svn status | [n]awk '(/!/)*sub("^[^ \t]*[ \t]+",_)'
Automatically takes care of the grep earlier in the pipe, as well as trimming out the extra FS after blanking out $1, with the added bonus of leaving the rest of the original input untouched instead of having tabs overwritten with spaces (unless that's the desired effect).
If you're very certain $1 does not contain special characters that need regex escaping, then it's even easier :
mawk '/!/*sub($!_"[ \t]+",_)'
gawk -c/P/e '/!/*sub($!_"""[ \t]+",_)'
Or if you prefer customizing FS+OFS to handle it all :
mawk 'NF*=/!/' FS='^[^ \t]*[ \t]+' OFS='' # this version uses OFS
This should be a reasonably comprehensive awk-field-sub-string-extraction function that
returns the substring of $0 based on input ranges, inclusive
clamps out-of-range values
handles variable-length field SEPs
has speedup treatments for:
completely no inputs, returning $0 directly
input values resulting in a guaranteed empty string ("")
FROM-field == 1
FS = "" that has split $0 out by individual chars
(so the FROM <(_)> and TO <(__)> fields behave like cut -c rather than cut -f)
original $0 restored, w/o overwriting FS seps with OFS
{m,g}awk '{
   print "\n|---BEFORE-------------------------\n" ($0) \
         "\n|----------------------------\n\n [" fld2(2, 5) "]\n [" fld2(3) \
         "]\n [" fld2(4, 2) "]<----------------------------------------------should be empty\n [" \
         fld2(3, 11) "]<------------------------should be capped by NF\n [" fld2() "]\n [" \
         fld2((OFS=FS="")*($0=$0)+11, 23) "]<-------------------FS=\"\", split by chars" \
         "\n\n|---AFTER-------------------------\n" ($0) "\n|----------------------------"
}
function fld2(_,__,___,____,_____)
{
   if (+__==(_=-_<+_ ?+_:_<_) || (___=____="")==__ || !NF) {
      return $_
   } else if (NF<_ || (__=NF<+__?NF:+__)<(_=+_?_:!_)) {
      return ___
   } else if (___==FS || _==!___) {
      return ___<FS \
         ? substr("",$!_=$!_ substr("",__=$!(NF=__)))__ \
         : substr($(_<_),_,__)
   }
   _____=$+(____=___="\37\36\35\32\31\30\27\26\25"\
                     "\24\23\21\20\17\16\6\5\4\3\2\1")
   NF=__
   if ($(!_)~("["(___)"]")) {
      gsub("..","\\&&",___) + gsub(".",___,____)
      ___=____
   }
   __=(_) substr("",_+=_^=_<_)
   while(___!="") {
      if ($(!_)!~(____=substr(___,--_,++_))) {
         ___=____
         break }
      ___=substr(___,_+_^(!_))
   }
   return \
      substr("",($__=___ $__)==(__=substr($!_, _+index($!_,___))),_*($!_=_____))(__)
}'
Those <TAB> are actual \t (\011) but relabeled for display clarity:
|---BEFORE-------------------------
1 2 33 4444 555555 <TAB>6666666
|----------------------------
[2 33 4444 555555]
[33]
[]<---------------------------------------------- should be empty
[33 4444 555555 6666666]<------------------------ should be capped by NF
[ 1 2 33 4444 555555 <TAB>6666666 ]
[ 2 33 4444 555555 <TAB>66]<------------------- FS="", split by chars
|---AFTER-------------------------
1 2 33 4444 555555 <TAB>6666666
|----------------------------
I wasn't happy with any of the awk solutions presented here because I wanted to extract the first few columns and then print the rest, so I turned to perl instead. The following code extracts the first two columns, and displays the rest as is:
echo -e "a b c d\te\t\tf g" | \
perl -ne 'my @f = split /\s+/, $_, 3; printf "first: %s second: %s rest: %s", @f;'
The advantage compared to the perl solution from Chris Koknat is that really only the first n elements are split off from the input string; the rest of the string isn't split at all and therefore stays completely intact. My example demonstrates this with a mix of spaces and tabs.
To change the number of columns that should be extracted, replace the 3 in the example with n+1.
ls -la | awk '{o=$1" "$3; for (i=5; i<=NF; i++) o=o" "$i; print o }'
from this answer is not bad, but the natural spacing is gone.
Please then compare it to this one:
ls -la | cut -d' ' -f4-
Then you'd see the difference.
Even ls -la | awk '{$1=$2=""; print}', which is based on the answer voted best thus far, does not preserve the formatting.
Thus I would use the following, which also allows explicit selective columns in the beginning:
ls -la | cut -d' ' -f1,4-
Note that every space counts as a column separator too, so for instance in the below, columns 1 and 3 are empty, 2 is INFO, and 4 is 2014-10-11:
$ echo " INFO  2014-10-11 10:16:19 main " | cut -d' ' -f1,3
$ echo " INFO  2014-10-11 10:16:19 main " | cut -d' ' -f2,4
INFO 2014-10-11
$
If you want formatted text, chain your commands with echo and use $0 to print the last field.
Example:
for i in {8..11}; do
s1="$i"
s2="str$i"
s3="str with spaces $i"
echo -n "$s1 $s2" | awk '{printf "|%3d|%6s",$1,$2}'
echo -en "$s3" | awk '{printf "|%-19s|\n", $0}'
done
Prints:
| 8| str8|str with spaces 8 |
| 9| str9|str with spaces 9 |
| 10| str10|str with spaces 10 |
| 11| str11|str with spaces 11 |
The top-voted answer by zed_0xff did not work for me.
I have a log where, after $5 with an IP address, there can be more text or no text. I need everything from the IP address to the end of the line, should there be anything after $5. In my case, this is actually within an awk program, not an awk one-liner, so awk must solve the problem. When I try to remove the first 4 fields using the solution proposed by zed_0xff:
echo " 7 27.10.16. Thu 11:57:18  37.244.182.218 one two three" | awk '{$1=$2=$3=$4=""; printf "[%s]\n", $0}'
it spits out a wrong and useless response (I added [..] to demonstrate):
[ 37.244.182.218 one two three]
There are even some suggestions to combine substr with this wrong answer, but that only complicates things. It offers no improvement.
Instead, if columns are fixed width until the cut point and awk is needed, the correct answer is:
echo " 7 27.10.16. Thu 11:57:18  37.244.182.218 one two three" | awk '{printf "[%s]\n", substr($0,28)}'
which produces the desired output:
[37.244.182.218 one two three]
