This line worked until I had whitespace in the second field.
svn status | grep '\!' | gawk '{print $2;}' > removedProjs
Is there a way to have awk print everything from $2 onward ($3, $4, and so on, until we don't have any more columns)?
I suppose I should add that I'm doing this in a Windows environment with Cygwin.
Print all columns:
awk '{print $0}' somefile
Print all but the first column:
awk '{$1=""; print $0}' somefile
Print all but the first two columns:
awk '{$1=$2=""; print $0}' somefile
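Note that blanking fields this way still leaves the field separators behind, so the result keeps a leading space for each removed column. A quick check (assuming the default FS/OFS):
$ echo 'a b c' | awk '{$1=""; print $0}'
 b c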
There's a duplicate question with a simpler answer using cut:
svn status | grep '\!' | cut -d' ' -f2-
-d specifies the delimiter (a space), -f specifies the list of fields (everything from the 2nd onward)
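As a quick sanity check, and to show the single-delimiter caveat mentioned in other answers (every space counts as a field boundary to cut):
$ echo 'a b c d' | cut -d' ' -f2-
b c d
$ echo 'a  b c d' | cut -d' ' -f2-
 b c d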
You could use a for-loop to loop through printing fields $2 through $NF (built-in variable that represents the number of fields on the line).
Edit:
Since "print" appends a newline, you'll want to buffer the results:
awk '{out = ""; for (i = 2; i <= NF; i++) {out = out " " $i}; print out}'
Alternatively, use printf:
awk '{for (i = 2; i <= NF; i++) {printf "%s ", $i}; printf "\n"}'
awk '{out=$2; for(i=3;i<=NF;i++){out=out" "$i}; print out}'
My answer is based on VeeArr's, but I noticed it started with a white space before it would print the second column (and the rest). As I only have 1 reputation point, I can't comment on it, so here it goes as a new answer:
Start with "out" as the second column and then add all the other columns (if they exist). This works fine as long as there is a second column.
Most solutions with awk leave a space. The options here avoid that problem.
Option 1
A simple cut solution (works only with single delimiters):
command | cut -d' ' -f3-
Option 2
Forcing an awk re-calculation of the record sometimes removes the added leading space (OFS) left by removing the first fields (works with some versions of awk):
command | awk '{ $1=$2="";$0=$0;} NF=NF'
Option 3
Printing each field formatted with printf will give more control:
$ in=' 1 2 3 4 5 6 7 8 '
$ echo "$in"|awk -v n=2 '{ for(i=n+1;i<=NF;i++) printf("%s%s",$i,i==NF?RS:OFS);}'
3 4 5 6 7 8
However, all previous answers change all repeated FS between fields to OFS. Let's build a couple of options that do not do that.
Option 4 (recommended)
A loop with sub to remove fields and delimiters at the front,
using the value of FS instead of a literal space (since it could have been changed).
This is more portable, and doesn't trigger a change of FS to OFS:
NOTE: The ^[FS]* is to accept an input with leading spaces.
$ in=' 1 2 3 4 5 6 7 8 '
$ echo "$in" | awk '{ n=2; a="^["FS"]*[^"FS"]+["FS"]+";
for(i=1;i<=n;i++) sub( a , "" , $0 ) } 1 '
3 4 5 6 7 8
Option 5
It is quite possible to build a solution that does not add extra (leading or trailing) whitespace, and preserves existing whitespace, using the function gensub from GNU awk, like this:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=2 'BEGIN{ a="^["FS"]*"; b="([^"FS"]+["FS"]+)"; c="{"n"}"; }
{ print(gensub(a""b""c,"",1)); }'
3 4 5 6 7 8
It also may be used to swap a group of fields given a count n:
$ echo ' 1 2 3 4 5 6 7 8 ' |
awk -v n=2 'BEGIN{ a="^["FS"]*"; b="([^"FS"]+["FS"]+)"; c="{"n"}"; }
{
d=gensub(a""b""c,"",1);
e=gensub("^(.*)"d,"\\1",1,$0);
print("|"d"|","!"e"!");
}'
|3 4 5 6 7 8 | ! 1 2 !
Of course, in such case, the OFS is used to separate both parts of the line, and the trailing white space of the fields is still printed.
NOTE: [FS]* is used to allow leading spaces in the input line.
I personally tried all the answers mentioned above, but most of them were a bit complex or just not right. The easiest way to do it from my point of view is:
awk -F" " '{ for (i=4; i<=NF; i++) print $i }'
Where -F" " defines the delimiter for awk to use. In my case is the whitespace, which is also the default delimiter for awk. This means that -F" " can be ignored.
Where NF defines the total number of fields/columns. Therefore the loop will begin from the 4th field up to the last field/column.
Where $N retrieves the value of the Nth field. Therefore print $i will print the current field/column based based on the loop count.
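Bear in mind that print appends a newline after each field, so this loop emits one field per line. A quick illustration:
$ echo 'a b c d e' | awk '{ for (i=4; i<=NF; i++) print $i }'
d
e
Use the printf form shown in other answers if you need everything on one line.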
awk '{ for(i=3; i<=NF; ++i) printf $i""FS; print "" }'
lauhub proposed this correct, simple and fast solution here
This was irritating me so much, I sat down and wrote a cut-like field specification parser, tested with GNU Awk 3.1.7.
First, create a new Awk library script called pfcut, with e.g.
sudo nano /usr/share/awk/pfcut
Then, paste in the script below, and save. After that, this is what the usage looks like:
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk -f pfcut --source '/^/ { pfcut("-4"); }'
t1 t2 t3 t4
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk -f pfcut --source '/^/ { pfcut("2-"); }'
t2 t3 t4 t5 t6 t7
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk -f pfcut --source '/^/ { pfcut("-2,4,6-"); }'
t1 t2 t4 t6 t7
To avoid typing all that, I guess the best one can do (otherwise see Automatically load a user function at startup with awk? - Unix & Linux Stack Exchange) is to add an alias to ~/.bashrc; e.g. with:
$ echo "alias awk-pfcut='awk -f pfcut --source'" >> ~/.bashrc
$ source ~/.bashrc # refresh bash aliases
... then you can just call:
$ echo "t1 t2 t3 t4 t5 t6 t7" | awk-pfcut '/^/ { pfcut("-2,4,6-"); }'
t1 t2 t4 t6 t7
Here is the source of the pfcut script:
# pfcut - print fields like cut
#
# sdaau, GNU GPL
# Nov, 2013
function spfcut(formatstring)
{
# parse format string
numsplitscomma = split(formatstring, fsa, ",");
numspecparts = 0;
split("", parts); # clear/initialize array (for e.g. `tail` piping into `awk`)
for(i=1;i<=numsplitscomma;i++) {
commapart=fsa[i];
numsplitsminus = split(fsa[i], cpa, "-");
# assume here a range is always just two parts: "a-b"
# also assume user has already sorted the ranges
#print numsplitsminus, cpa[1], cpa[2]; # debug
if(numsplitsminus==2) {
if ((cpa[1]) == "") cpa[1] = 1;
if ((cpa[2]) == "") cpa[2] = NF;
for(j=cpa[1];j<=cpa[2];j++) {
parts[numspecparts++] = j;
}
} else parts[numspecparts++] = commapart;
}
n=asort(parts); outs="";
for(i=1;i<=n;i++) {
outs = outs sprintf("%s%s", $parts[i], (i==n)?"":OFS);
#print(i, parts[i]); # debug
}
return outs;
}
function pfcut(formatstring) {
print spfcut(formatstring);
}
Would this work?
awk '{print substr($0,length($1)+1);}' < file
It leaves some whitespace in front though.
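A variant of the same idea that avoids the leading whitespace is to locate $2 with index() instead of measuring length($1); note this is only a sketch, and it can misfire if the text of $2 also occurs earlier in the line:
awk '{print substr($0, index($0, $2))}' < file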
Printing out columns starting from #2 (the output will have no trailing space in the beginning):
ls -l | awk '{sub(/[^ ]+ /, ""); print $0}'
echo "1 2 3 4 5 6" | awk '{ $NF = ""; print $0}'
This one uses awk to print everything except the last field.
This is what I preferred from all the recommendations:
Printing from the 6th to last column.
ls -lthr | awk '{out=$6; for(i=7;i<=NF;i++){out=out" "$i}; print out}'
or
ls -lthr | awk '{ORS=" "; for(i=6;i<=NF;i++) print $i;print "\n"}'
If you need specific columns printed with an arbitrary delimiter:
awk '{print $3 " " $4}'
col#3 col#4
awk '{print $3 "anything" $4}'
col#3anythingcol#4
So if you have whitespace in a column it will be two columns, but you can connect it with any delimiter or without it.
Perl solution:
perl -lane 'splice @F,0,1; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,1 cleanly removes column 0 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
Python solution:
python -c "import sys;[sys.stdout.write(' '.join(line.split()[1:]) + '\n') for line in sys.stdin]" < file
I want to extend the proposed answers to the situation where fields are delimited by possibly several whitespace characters, which I suppose is the reason the OP is not using cut.
I know the OP asked about awk, but a sed approach would work here (example with printing columns from the 5th to the last):
pure sed approach
sed -r 's/^\s*(\S+\s+){4}//' somefile
Explanation:
s/// is the standard command to perform substitution
^\s* matches any consecutive whitespace at the beginning of the line
\S+\s+ means a column of data (non-whitespace chars followed by whitespace chars)
(){4} means the pattern is repeated 4 times.
sed and cut
sed -r 's/^\s+//; s/\s+/\t/g' somefile | cut -f5-
This works by just replacing consecutive whitespace with a single tab before cutting.
tr and cut:
tr can also be used to squeeze consecutive characters with the -s option.
tr -s '[:blank:]' < somefile | cut -d' ' -f5-
If you don't want to reformat the part of the line that you don't chop off, the best solution I can think of is written in my answer in:
How to print all the columns after a particular number using awk?
It chops off what comes before the given field number N, and prints all the rest of the line, including field number N, maintaining the original spacing (it does not reformat). It doesn't matter whether the string of that field also appears somewhere else in the line.
Define a function:
fromField () {
awk -v m="\x01" -v N="$1" '{$N=m$N; print substr($0,index($0,m)+1)}'
}
And use it like this:
$ echo " bat bi iru lau bost " | fromField 3
iru lau bost
$ echo " bat bi iru lau bost " | fromField 2
bi iru lau bost
Output maintains everything, including trailing spaces
In your particular case:
svn status | grep '\!' | fromField 2 > removedProjs
If your file/stream does not contain new-line characters in the middle of the lines (you could be using a different Record Separator), you can use:
awk -v m="\x0a" -v N="3" '{$N=m$N ;print substr($0, index($0,m)+1)}'
The first case will fail only on files/streams that contain the rare hexadecimal character number 1 (\x01).
This awk function returns substring of $0 that includes fields from begin to end:
function fields(begin, end, b, e, p, i) {
b = 0; e = 0; p = 0;
for (i = 1; i <= NF; ++i) {
if (begin == i) { b = p; }
p += length($i);
e = p;
if (end == i) { break; }
p += length(FS);
}
return substr($0, b + 1, e - b);
}
To get everything starting from field 3:
tail = fields(3);
To get section of $0 that covers fields 3 to 5:
middle = fields(3, 5);
The b, e, p, i "nonsense" in the function parameter list is just the awk way of declaring local variables.
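For a quick test, here is the function embedded in a one-shot program (a minimal sketch; it assumes single-space separators, since the position arithmetic relies on length(FS)):
$ echo 'a b c d e' | awk '
  function fields(begin, end,    b, e, p, i) {
      b = 0; e = 0; p = 0;
      for (i = 1; i <= NF; ++i) {
          if (begin == i) { b = p; }
          p += length($i);
          e = p;
          if (end == i) { break; }
          p += length(FS);
      }
      return substr($0, b + 1, e - b);
  }
  { print "[" fields(2, 4) "]" }'
[b c d]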
All of the other answers given here and in linked questions fail in various ways given various possible FS values. Some leave leading and/or trailing white space, some convert every FS to the OFS, some rely on semantics that only apply when FS is the default value, some rely on negating FS in a bracket expression which will fail given a multi-char FS, etc.
To do this robustly for any FS, use GNU awk for the 4th arg to split():
$ cat tst.awk
{
split($0,flds,FS,seps)
for ( i=n; i<=NF; i++ ) {
printf "%s%s", flds[i], seps[i]
}
print ""
}
$ printf 'a b c d\n' | awk -v n=3 -f tst.awk
c d
$ printf ' a b c d\n' | awk -v n=3 -f tst.awk
c d
$ printf ' a b c d\n' | awk -v n=3 -F'[ ]' -f tst.awk
b c d
$ printf ' a b c d\n' | awk -v n=3 -F'[ ]+' -f tst.awk
b c d
$ printf 'a###b###c###d\n' | awk -v n=3 -F'###' -f tst.awk
c###d
$ printf '###a###b###c###d\n' | awk -v n=3 -F'###' -f tst.awk
b###c###d
Note that I'm using split() above because its 3rd arg is a field separator, not just a regexp like the 2nd arg to match(). The difference is that field separators have additional semantics beyond regexps, such as skipping leading and/or trailing blanks when the separator is a single blank char. If you wanted to use a while(match()) loop or any form of *sub() to emulate the above, then you'd need to write code to implement those semantics, whereas split() already implements them for you.
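To see those separator semantics concretely in GNU awk, note how the run of three spaces between the fields is preserved in the seps array instead of being collapsed:
$ echo 'a   b' | gawk '{ n = split($0, flds, FS, seps); printf "[%s][%s][%s]\n", flds[1], seps[1], flds[2] }'
[a][   ][b]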
The awk examples look complex here; here is simple Bash shell syntax:
command | while read -a cols; do echo ${cols[@]:1}; done
Where 1 is your nth column counting from 0.
Example
Given this content of file (in.txt):
c1
c1 c2
c1 c2 c3
c1 c2 c3 c4
c1 c2 c3 c4 c5
here is the output:
$ while read -a cols; do echo ${cols[@]:1}; done < in.txt
c2
c2 c3
c2 c3 c4
c2 c3 c4 c5
This works if you are using Bash; you can use as many 'x ' placeholders as fields you wish to discard (see the example after the code), and it ignores multiple spaces if they are not escaped.
while read x b; do echo "$b"; done < filename
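For example, to discard the first two fields, repeat the placeholder:
$ echo 'a b c d' | while read x x b; do echo "$b"; done
c d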
Perl:
@m=`ls -ltr dir | grep ^d | awk '{print \$6,\$7,\$8,\$9}'`;
foreach $i (@m)
{
print "$i\n";
}
UPDATE:
If you want to use no function calls at all while preserving the spaces and tabs in between the remaining fields, then do:
echo " 1 2 33 4444 555555 \t6666666 " |
{m,g}awk ++NF FS='^[ \t]*[^ \t]*[ \t]+|[ \t]+$' OFS=
which prints:
2 33 4444 555555 6666666
You can make it a lot more straightforward:
svn status | [m/g]awk '/!/*sub("^[^ \t]*[ \t]+",_)'
svn status | [n]awk '(/!/)*sub("^[^ \t]*[ \t]+",_)'
It automatically takes care of the grep earlier in the pipe, as well as trimming out the extra FS after blanking out $1, with the added bonus of leaving the rest of the original input untouched instead of having tabs overwritten with spaces (unless that's the desired effect).
If you're very certain $1 does not contain special characters that need regex escaping, then it's even easier:
mawk '/!/*sub($!_"[ \t]+",_)'
gawk -c/P/e '/!/*sub($!_"""[ \t]+",_)'
Or if you prefer customizing FS+OFS to handle it all:
mawk 'NF*=/!/' FS='^[^ \t]*[ \t]+' OFS='' # this version uses OFS
This should be a reasonably comprehensive awk-field-sub-string-extraction function that:
returns the substring of $0 that covers the input range of fields, inclusive,
clamps out-of-range values,
handles variable-length field separators,
and has speedup treatments for:
completely no inputs, returning $0 directly
input values resulting in a guaranteed empty string ("")
FROM-field == 1
FS = "" that has already split $0 out by individual chars
(so the FROM <(_)> and TO <(__)> fields behave like cut -c rather than cut -f)
The original $0 is restored, without overwriting FS separators with OFS.
{m,g}awk '{
    print "\n|---BEFORE-------------------------\n" \
          ($0) "\n|----------------------------\n\n [" \
          fld2(2, 5) "]\n [" fld2(3) "]\n [" fld2(4, 2) \
          "]<----------------------------------------------should be" \
          " empty\n [" fld2(3, 11) "]<------------------------should be" \
          " capped by NF\n [" fld2() "]\n [" fld2((OFS=FS="")*($0=$0)+11, \
          23) "]<-------------------FS=\"\", split by chars" \
          "\n\n|---AFTER-------------------------\n" ($0) \
          "\n|----------------------------"
}
function fld2(_,__,___,____,_____)
{
    if (+__==(_=-_<+_ ?+_:_<_) || (___=____="")==__ || !NF) {
        return $_
    } else if (NF<_ || (__=NF<+__?NF:+__)<(_=+_?_:!_)) {
        return ___
    } else if (___==FS || _==!___) {
        return ___<FS \
            ? substr("",$!_=$!_ substr("",__=$!(NF=__)))__ \
            : substr($(_<_),_,__)
    }
    _____=$+(____=___="\37\36\35\32\31\30\27\26\25"\
                      "\24\23\21\20\17\16\6\5\4\3\2\1")
    NF=__
    if ($(!_)~("["(___)"]")) {
        gsub("..","\\&&",___) + gsub(".",___,____)
        ___=____
    }
    __=(_) substr("",_+=_^=_<_)
    while(___!="") {
        if ($(!_)!~(____=substr(___,--_,++_))) {
            ___=____
            break
        }
        ___=substr(___,_+_^(!_))
    }
    return \
        substr("",($__=___ $__)==(__=substr($!_, \
            _+index($!_,___))),_*($!_=_____))(__)
}'
The <TAB> markers below are actual \t (\011) characters, relabeled for display clarity.
|---BEFORE-------------------------
1 2 33 4444 555555 <TAB>6666666
|----------------------------
[2 33 4444 555555]
[33]
[]<---------------------------------------------- should be empty
[33 4444 555555 6666666]<------------------------ should be capped by NF
[ 1 2 33 4444 555555 <TAB>6666666 ]
[ 2 33 4444 555555 <TAB>66]<------------------- FS="", split by chars
|---AFTER-------------------------
1 2 33 4444 555555 <TAB>6666666
|----------------------------
I wasn't happy with any of the awk solutions presented here because I wanted to extract the first few columns and then print the rest, so I turned to perl instead. The following code extracts the first two columns, and displays the rest as is:
echo -e "a b c d\te\t\tf g" | \
perl -ne 'my #f = split /\s+/, $_, 3; printf "first: %s second: %s rest: %s", #f;'
The advantage compared to the perl solution from Chris Koknat is that really only the first n elements are split off from the input string; the rest of the string isn't split at all and therefore stays completely intact. My example demonstrates this with a mix of spaces and tabs.
To change the number of columns that should be extracted, replace the 3 in the example with n+1.
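For example, to keep everything from the 4th column on (n = 3, so the limit argument becomes 4), presumably:
echo 'a b c d  e' | perl -ne 'my @f = split /\s+/, $_, 4; print $f[3];'
This prints d  e with the internal double space intact.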
ls -la | awk '{o=$1" "$3; for (i=5; i<=NF; i++) o=o" "$i; print o }'
from this answer is not bad but the natural spacing is gone.
Please then compare it to this one:
ls -la | cut -d' ' -f4-
Then you'd see the difference.
Even ls -la | awk '{$1=$2=""; print}', which is based on the answer voted best thus far, does not preserve the formatting.
Thus I would use the following, and it also allows explicit selective columns in the beginning:
ls -la | cut -d' ' -f1,4-
Note that every space counts as a column boundary too, so for instance in the below, columns 1 and 3 are empty, 2 is INFO, and 4 is 2014-10-11:
$ echo " INFO 2014-10-11 10:16:19 main " | cut -d' ' -f1,3

$ echo " INFO 2014-10-11 10:16:19 main " | cut -d' ' -f2,4
INFO 2014-10-11
$
If you want formatted text, chain your commands with echo and use $0 to print the last field.
Example:
for i in {8..11}; do
s1="$i"
s2="str$i"
s3="str with spaces $i"
echo -n "$s1 $s2" | awk '{printf "|%3d|%6s",$1,$2}'
echo -en "$s3" | awk '{printf "|%-19s|\n", $0}'
done
Prints:
| 8| str8|str with spaces 8 |
| 9| str9|str with spaces 9 |
| 10| str10|str with spaces 10 |
| 11| str11|str with spaces 11 |
The top-voted answer by zed_0xff did not work for me.
I have a log where, after $5 with an IP address, there can be more text or no text at all. I need everything from the IP address to the end of the line, should there be anything after $5. In my case, this is actually within an awk program, not an awk one-liner, so awk must solve the problem. When I try to remove the first 4 fields using the solution proposed by zed_0xff:
echo " 7 27.10.16. Thu 11:57:18 37.244.182.218" | awk '{$1=$2=$3=$4=""; printf "[%s]\n", $0}'
it spits out wrong and useless response (I added [..] to demonstrate):
[ 37.244.182.218 one two three]
There are even some suggestions to combine substr with this wrong answer, but that only complicates things. It offers no improvement.
Instead, if columns are fixed width until the cut point and awk is needed, the correct answer is:
echo " 7 27.10.16. Thu 11:57:18 37.244.182.218" | awk '{printf "[%s]\n", substr($0,28)}'
which produces the desired output:
[37.244.182.218 one two three]
What I want to achieve:
grep: extract lines with the contig number and length
awk: remove "length:" from column 2
sort: sort by length (in descending order)
Current code
grep "length:" test_reads.fa.contigs.vcake_output | awk -F:'{print $2}' |sort -g -r > contig.txt
Example content of test_reads.fa.contigs.vcake_output:
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
Expected output
>Contig_0 99995
>Contig_11 42
With your shown samples, please try the following awk + sort solution.
awk -F'[: ]' '/^>/{print $1,$3}' Input_file | sort -nrk2
Explanation: a simple explanation would be: run the awk program to read the Input_file, setting the field separator to : OR space; on lines that start with >, print the 1st and 3rd fields; then send the output (as standard input) to the sort command, which sorts numerically in reverse on the 2nd field to get the required output.
Here is a gnu-awk solution that does it all in a single command without invoking sort:
awk -F '[:[:blank:]]' '
$2 == "length" {arr[$1] = $3}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (i in arr)
print i, arr[i]
}' file
>Contig_0 99995
>Contig_11 42
Perhaps this, combining grep and awk:
awk -F '[ :]' '$2 == "length" {print $1, $3}' file | sort ...
Assumptions:
if more than one row has the same length then additionally sort the 1st column using 'version' sort
Adding some additional lines to the sample input:
$ cat test_reads.fa.contigs.vcake_output
>Contig_0 length:99995
ATTTATGCCGTTGGCCACGAATTCAGAATCATATTA
>Contig_11 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_17 length:93
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_837 ignore-this-length:1000000
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
>Contig_8 length:42
ACTCTGAGTGATCTTGGCGTAATAGGCCTGCTTAATGATCGT
One sed/sort idea:
$ sed -En 's/(>[^ ]+) length:(.*)$/\1 \2/p' test_reads.fa.contigs.vcake_output | sort -k2,2nr -k1,1V
Where:
-En - enable extended regex support and suppress normal printing of input data
(>[^ ]+) - (1st capture group) - > followed by 1 or more non-space characters
length: - space followed by length:
(.*) - (2nd capture group) - 0 or more characters (following the colon)
$ - end of line
\1 \2/p - print 1st capture group + <space> + 2nd capture group
-k2,2nr - sort by 2nd (spaced-delimited) field in reverse numeric order
-k1,1V - sort by 1st (space-delimited) field in Version order
This generates:
>Contig_0 99995
>Contig_17 93
>Contig_8 42
>Contig_11 42
Hi experts, I have a big text file that contains many columns. Now I want to extract each column into a separate text file, serially, adding two strings at the top.
Suppose I have an input file like this:
2 3 4 5 6
3 4 5 6 7
2 3 4 5 6
1 2 2 2 2
Then I need to extract each column into a separate text file with two strings at the top:
file1.txt file2.txt .... filen.txt
s=5 s=5
r=9 r=9
2 3
3 4
2 3
1 2
I tried the script below, but it does not work properly. I need help from the experts. Thanks in advance.
#!/bin/sh
for i in $(seq 1 1 5)
do
echo $i
awk '{print $i}' inp_file > file_$i
done
Could you please try the following, written and tested with the shown samples in GNU awk. It doesn't use the close() function because your sample shows you have only 5 columns in the Input_file. It also creates 2 awk variables (named var1 and var2) which will be printed before the actual column values are written to each output file.
awk -v var1="s=5" -v var2="r=9" '
{
count++
for(i=1;i<=NF;i++){
outputFile="file"i".txt"
if(count==1){
print (var1 ORS var2) > (outputFile)
}
print $i > (outputFile)
}
}
' Input_file
In case you can have many more columns, then it is better to close the output files as you go using close(), to avoid a "too many open files" error:
awk -v var1="s=5" -v var2="r=9" '
{
count++
for(i=1;i<=NF;i++){
outputFile="file"i".txt"
if(count==1){
print (var1 ORS var2) > (outputFile)
}
print $i >> (outputFile)
}
close(outputFile)
}
' Input_file
Pretty simple to do in one pass through the file with awk using its output redirection:
awk 'NR==1 { for (n = 1; n <= NF; n++) print "s=5\nr=9" > ("file_" n) }
{ for (n = 1; n <= NF; n++) print $n > ("file_" n) }' inp_file
With GNU awk to internally handle more than a dozen or so simultaneously open files:
NR == 1 {
for (i=1; i<=NF; i++) {
out[i] = "file" i ".txt"
print "s=5" ORS "r=9" > out[i]
}
}
{
for (i=1; i<=NF; i++) {
print $i > out[i]
}
}
or with any awk just close them as you go:
NR == 1 {
for (i=1; i<=NF; i++) {
out[i] = "file" i ".txt"
print "s=5" ORS "r=9" > out[i]
close(out[i])
}
}
{
for (i=1; i<=NF; i++) {
print $i >> out[i]
close(out[i])
}
}
split -nr/$(wc -w <(head -1 input) | cut -d' ' -f1) -t' ' --additional-suffix=".txt" -a4 --numeric-suffixes=1 --filter "cat <(echo -e 's=5 r=9') - | tr ' ' '\n' >\$FILE" <(tr -s '\n' ' ' <input) file
This uses the nifty split command in a unique way to rearrange the columns. Hopefully it's faster than awk, although after spending a considerable amount of time coding it, testing it, and writing it up, I find that it may not be scalable enough for you since it requires a process per column, and many systems are limited in user processes (check ulimit -u). I submit it though because it may have some limited learning usefulness, to you or to a reader down the line.
Decoding:
split -- Divide a file up into subfiles. Normally this is by lines or by size but we're tweaking it to use columns.
-nr/$(...) -- Use round-robin output: Sort records (in our case, matrix cells) into the appropriate number of bins in a round-robin fashion. This is the key to making this work. The part in parens means, count (wc) the number of words (-w) in the first line (<(head -1 input)) of the input and discard the filename (cut -d' ' -f1), and insert the output into the command line.
-t' ' -- Use a single space as a record delimiter. This breaks the matrix cells into records for split to split on.
--additional-suffix=".txt" -- Append .txt to output files.
-a4 -- Use four-digit suffixes; you probably won't get 1,000 files out of it, but just in case ...
--numeric-suffixes=1 -- Use a numeric suffix (normally it's a letter combination) and start at 1. This is pretty pedantic but it matches the example. If you have more columns than that can index, increase -a to whatever length you need.
--filter ... -- Pipe each file through a shell command.
Shell command:
cat -- Concatenate the next two arguments.
<(echo -e 's=5 r=9') -- This means execute the echo command and use its output as the input to cat. We use a space instead of a newline to separate because we're converting spaces to newlines eventually and it is shorter and clearer to read.
- -- Read standard input as an argument to cat -- this is the binned data.
| tr ' ' '\n' -- Convert spaces between records to newlines, per the desired output example.
>\$FILE -- Write to the output file, which is stored in $FILE (but we have to quote it so the shell doesn't interpret it in the initial command).
Shell command over -- rest of split arguments:
<(tr -s '\n' ' ' < input) -- Use, as input to split, the example input file but convert newlines to spaces because we don't need them and we need a consistent record separator. The -s means only output one space between each record (just in case we got multiple ones on input).
file -- This is the prefix to the output filenames. The output in my example would be file0001.txt, file0002.txt, ..., file0005.txt.
I have a sample file like
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
Here I need to find the most frequently occurring sequences of 3 letters within a word.
Output should be
acc = 5
aco = 3
Is that possible in Bash?
I have absolutely no idea how I can accomplish it with awk, sed, or grep.
Any clue how it's possible...
PS: no attempted output shown, because I have no idea how to do it; I don't want to write unnecessary awk -F stuff that is not going to help anywhere...
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
for (fldNr=1; fldNr<=NF; fldNr++) {
field = $fldNr
fieldLgth = length(field)
if ( fieldLgth >= stringLgth ) {
maxBegPos = fieldLgth - (stringLgth - 1)
for (begPos=1; begPos<=maxBegPos; begPos++) {
string = tolower(substr(field,begPos,stringLgth))
cnt[string]++
}
}
}
}
END {
for (string in cnt) {
print string, cnt[string]
}
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
This is an alternative method to Ed Morton's solution. It loops less, but needs a bit more memory. The idea is not to care about spaces or any non-alphabetic characters; we filter them out at the end.
awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
When you use GNU awk, you can do this a bit differently, and more optimized, by setting each record to be a word. This way the selection at the end does not need to happen:
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file
This might work for you (GNU sed, sort and uniq):
sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Use the first sed invocation to output 3 letter lower case words.
Sort the words.
Count the duplicates.
Sort the counts in reverse numerical order maintaining the alphabetical order.
Use the second sed invocation to manipulate the results into the desired format.
If you only want words that have duplicates, in alphabetical order and case-sensitive, use:
sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
I have a text file with random words in it. I want to find out which words have the maximum occurrence as a pair ('hi,hello' OR 'good,bye').
Simple.txt
hi there. hello this a dummy file. hello world. you did good job. bye for now.
I have written this command to get the count for each word (hi, hello, good, bye).
cat simple.txt| tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c|grep -E -i "\<hi\>|\<hello\>|\<good\>|\<bye\>"
This gives me the occurrence of each word with a count (the number of times it occurs) in the file, but now how do I refine this and get a direct output such as "hi/hello is the pair with maximum occurrence"?
To make it more interesting, let's consider this test file:
$ cat >file.txt
You say hello. I say good bye. good bye. good bye.
To get a count of all pairs of words:
$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt
3 good,bye
1 say,good
2 bye,good
1 I,say
1 You,say
1 hello,I
1 say,hello
To get the single pair with the highest count, we need to sort:
$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt | sort -nr | head -1
3 good,bye
How it works
-v RS='[[:space:][:punct:]]+'
This tells awk to use any combination of white space or punctuation as a record separator. This means that each word becomes a record.
NR>1{a[last","$0]++}
For every word after the first, increment the count in associative array a for the combination of the previous and current word.
last=$0
Save the current word in the variable last.
END{for (pair in a) print a[pair], pair}
After we have finished reading the input, print out the results for each pair.
sort -nr
Sort the output numerically in reverse (highest number first) order.
head -1
Select the first line (giving us the pair with the highest count).
Multiline version
For those who prefer their code spread out over multiple lines:
awk -v RS='[[:space:][:punct:]]+' '
NR>1 {
a[last","$0]++
}
{
last=$0
}
END {
for (pair in a)
print a[pair], pair
}' file.txt | sort -nr | head -1
some terse perl:
perl -MList::Util=max,sum0 -slne '
for $word (m/(\w+)/g) {$count{$word}++}
} END {
$pair{$_} = sum0 @count{+split} for ($a, $b);
$max = max values %pair;
print "$max => ", {reverse %pair}->{$max};
' -- -a="hi hello" -b="good bye" simple.txt
3 => hi hello
I have this file:
$ cat file
1515523 A45678BF141 A11269151
2234545 A45678BE145 A87979746
5432568 A45678B2123 A40629187
7234573 A45678B4154 A98879129
8889568 A45678B5123 A13409137
9234511 A45678B9176 A23589941
3904568 A45678B7123 A52329165
3234555 A45678B1169 A23589497
9643568 A45678B6123 A39969112
1234547 A45678B2132 A40579243
and this script:
cat file | awk '{FS = " "} {print $1" "$3" "$5}'| awk '{
n = split($3, a, "");
s = "";
for (i = 1; i <= n; i += 2) s = s a[i+1] a[i];
print $1, substr($2, length($2)-3, 4), s
}'| cut -d" " -f3,1 > output
And when I open the output with vi, I have:
1515523 F141 11621915^M
2234545 E145 78797964^M
5432568 2123 04261978^M
7234573 4154 89781992^M
8889568 5123 31041973^M
9234511 9176 32859914^M
3904568 7123 25231956^M
3234555 1169 32854979^M
9643568 6123 93691921^M
1234547 2132 04752934^M
I don't know why I obtain ^M. Also, when I run the awk snippet:
cat imei | awk '{FS=" "} {print $2","$1}'
the output is wrong, i.e., it does not exchange the columns, as it does not print the second column. Any ideas on what may be happening?
There are carriage returns (^M or Control-M) in the data file. It probably came from a Windows machine at some point.
When you print $2","$1 (which concatenates $2 with a string containing a comma and then $1 — it took me a couple of looks to see what it was really doing), the carriage return makes the second column overwrite the first.
Look at the data file with od -c or similar tools to see the carriage returns in it.
You can use dos2unix or tr or various other techniques to convert the file from DOS/Windows format to Unix format.
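For example, tr alone is enough to strip the carriage returns (a minimal sketch; file.unix is just a hypothetical output name):
tr -d '\r' < file > file.unix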
Also, given the data format shown, I'd expect not to use -F " " (or the FS = " ", which is equivalent), so that you have columns $1, $2, and $3, which is more obvious than working with columns 1, 3, 5 as shown. You could set OFS to double-blank if you wanted the output with two blanks between columns.
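In other words, something along these lines (a sketch, letting awk's default whitespace splitting do the work and setting OFS to a double blank):
awk -v OFS='  ' '{ print $1, $2, $3 }' file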
$ dos2unix file
$ awk '{split($3,a,""); print $1, substr($2,8), a[3]a[2]a[5]a[4]a[7]a[6]a[9]a[8]}' file
1515523 F141 11621915
2234545 E145 78797964
5432568 2123 04261978
7234573 4154 89781992
8889568 5123 31041973
9234511 9176 32859914
3904568 7123 25231956
3234555 1169 32854979
9643568 6123 93691921
1234547 2132 04752934
Since you are using awk, you do not need dos2unix.
Simply insert
gsub(/\r/,"");
as the first statement in your awk script.
It cleans up each line as it is read in, so subsequent matching or processing does not see any carriage return characters.
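For instance, applied to the snippet from the question (a sketch; the gsub() rebuilds $0, so the fields are re-split without the carriage return):
awk '{gsub(/\r/,""); print $2","$1}' imei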
How about a perl 'one liner' (with a continuation line)?
$ dos2unix file
$ perl -lane \
'$xxxx = substr($F[1],-4);
@c = split(//,$F[2]);
print "$F[0] $xxxx $c[2]$c[1]$c[4]$c[3]$c[6]$c[5]$c[8]$c[7]"' file