I am counting nucleotides in the contigs of a fasta file. My file looks like
>1
ATACCTACTA
ATTTACGTCA
GTA
>2
ATATTCGTAT
GTCTCGATCT
A
>3
etc.
My command is
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0; } { seqlen += length($0)}END{print seqlen}'
The output is now like
>1
23
>2
21
How can I get the output on the same line, like this?
>1 23
>2 21
A few more changes and voilà (thanks to Ed Morton):
awk '/^>/ {if(seqlen)print k,seqlen; seqlen=0; k=$0; next;} { seqlen += length($0);}END{print k,seqlen;}' filename
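To check the one-liner above against the sample from the question, here is a minimal sketch. It makes one change that is my own assumption rather than part of the original answer: it tests whether a header has been seen (k != "") instead of testing seqlen, so a record with no sequence lines is reported with length 0 rather than dropped. The temporary file and the empty >3 record are made up for illustration.

```shell
# Hypothetical sample: the two records from the question plus an empty record.
tmp=$(mktemp)
printf '%s\n' '>1' ATACCTACTA ATTTACGTCA GTA '>2' ATATTCGTAT GTCTCGATCT A '>3' > "$tmp"

result=$(awk '
  /^>/ { if (k != "") print k, seqlen   # flush the previous record, even if its length is 0
         k = $0; seqlen = 0; next }
       { seqlen += length($0) }         # add up the sequence lines
  END  { if (k != "") print k, seqlen } # flush the last record
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

With the sample above this should print ">1 23", ">2 21" and ">3 0", one per line.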
This one works for me:
awk '/^>/ && NR>1 {printf " %d\n", x; x=0} /^>/ {printf "%s", $0} !/^>/ {x += length($0)} END {printf " %d\n", x}' file
I hope it works now as expected.
try:
awk '/^>/{printf("%s ",$0);getline;printf("%s\n",length($0))}' Input_file
If a line starts with >, print it, then use getline to move to the next line and print that line's length followed by a newline. Input_file is the name of the input file. Note that this counts only the first sequence line after each header.
EDIT:
awk '/^>/{if(VAL){print Q OFS VAL;Q=VAL="";Q=$0;next};Q=$0;next} {VAL=VAL?VAL+length($0):length($0)} END{print Q,VAL}' Input_file
For each line starting with >, check whether VAL is non-empty. If it is, print Q and VAL, reset VAL, store the current header in Q, and use next to skip the remaining statements; otherwise just store the header in Q and skip ahead with next. For every other line, add the line's length to VAL. In the END section, print the final values of Q and VAL.
I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples please try following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: Within a single read of Input_file try following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation(1st solution): Adding detailed explanation for 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing 2nd field ( arr[$2] ) here.
delete arr[$2] ##Deleting arr value with 2nd field index here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value as in your example (if it isn't, just run sort -k2,2 on your input first), this uses one pass, stores only one token at a time in memory, and produces the output in the same order of $2 values as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line with the contents of the hold space.
Remove the first character (a newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.
I have a sample file with '||o||' as field separator.
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
ngascash||o||||o||
ms-bronze.com.br||o||||o||
I want to move the lines with only 1 field into 1.txt and those with more than 1 field into not_1.txt. I am using the following command:
sed 's/\(||o||\)\+$//g' sample.txt | awk -F '[|][|]o[|][|]' '{if (NF == 1) print > "1.txt"; else print > "not_1.txt" }'
The problem is that it is moving not the original lines but the replaced ones.
The output I am getting is (not_1.txt):
td#the-end.org||o||srScSG2C5tg=||o||bnm
erba01#tiscali.it||o||4sQVj09gpls=
1.txt:
ngas
ms-inside#bol.com.br
As you can see the original lines are modified. I don't want to modify the lines.
Any help would be highly appreciated.
Awk solution:
awk -F '[|][|]o[|][|]' \
'{
c = 0;
for (i=1; i<=NF; i++) if ($i != "") c++;
print > (c == 1? "1" : "not_1")".txt"
}' sample.txt
Results:
$ head 1.txt not_1.txt
==> 1.txt <==
ngascash||o||||o||
ms-bronze.com.br||o||||o||
==> not_1.txt <==
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
The following awk command may help.
awk -F'\\|\\|o\\|\\|' '{for(i=1;i<=NF;i++){count=$i?++count:count};if(count==1){print > "1_field_only"};if(count>1){print > "not_1_field"};count=""}' Input_file
Here is the same solution in multi-line form.
awk -F'\\|\\|o\\|\\|' '
{
for(i=1;i<=NF;i++){ count=$i?++count:count };
if(count==1) { print > "1_field_only" };
if(count>1) { print > "not_1_field" };
count=""
}
' Input_file
Explanation: Adding explanation for above code too now.
awk -F'\\|\\|o\\|\\|' ' ##Setting field separator as ||o|| here and escaping the | here to take it literal character here.
{
for(i=1;i<=NF;i++){ count=$i?++count:count }; ##Starting a for loop to traverse through all the fields here, increasing variable count value if a field is NOT null.
if(count==1) { print > "1_field_only" }; ##Checking if count value is 1 it means fields are only 1 in line so printing current line into 1_field_only file.
if(count>1) { print > "not_1_field" }; ##Checking if count is more than 1 so printing current line into output file named not_1_field file here.
count="" ##Nullifying the variable count here.
}
' Input_file ##Mentioning Input_file name here.
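One caveat with the count=$i?++count:count test: awk evaluates a field that looks numeric as a number, so a field containing just 0 is treated as empty. A sketch that compares against the empty string instead is below. The -v dir variable, the temporary directory, and the zero.example line are my own additions for illustration; the other four lines are the sample from the question.

```shell
work=$(mktemp -d)
cat > "$work/sample.txt" <<'EOF'
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
ngascash||o||||o||
ms-bronze.com.br||o||||o||
zero.example||o||0||o||
EOF

awk -F '[|][|]o[|][|]' -v dir="$work" '
{
  c = 0
  for (i = 1; i <= NF; i++)
    if ($i != "") c++                              # string comparison, so a field of 0 still counts
  print > (dir "/" (c == 1 ? "1" : "not_1") ".txt")  # the original line is written unmodified
}' "$work/sample.txt"

one=$(cat "$work/1.txt")
many=$(cat "$work/not_1.txt")
printf '1.txt:\n%s\nnot_1.txt:\n%s\n' "$one" "$many"
rm -rf "$work"
```

Here the zero.example line should land in not_1.txt, since it has two non-empty fields.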
Hello and thank you for taking the time to read this question. For the last day I have been trying to solve a problem and haven’t come any closer to a solution. I have a sample file of data that contains the following:
Fighter#Trainer
Bobby#SamBonen
Billy#BobBrown
Sammy#DJacobson
James#DJacobson
Donny#SonnyG
Ben#JasonS
Dave#JuanO
Derrek#KMcLaughlin
Dillon#LGarmati
Orson#LGarmati
Jeff#RodgerU
Brad#VCastillo
The goal is to identify “Trainers” that have more than one fighter. My gut feeling is that “getline” and variable declarations in AWK are going to be needed. I have tried different combinations of
awk -F# 'NR>1{a=$2; getline; if($2 = a) {print $0,"Yes"} else {print $0,"NO"}}' sample.txt
Yet, the output is nowhere near the desired results. In fact, it doesn’t even output all the rows in the sample file!
My desired results are:
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO
I am completely lost as to where to go from here. I have been searching and trying to find a solution to no avail, and I'm looking for some input. Thank you!
You don't need getline.
You could just process the input normally,
building up counts per trainer,
and print the result in an END block:
awk -F# '{
lines[NR] = $0;
trainers[NR] = $2;
counts[$2]++;
}
END {
print lines[1];
for (i = 2; i <= NR; i++) {
print lines[i] "#" (counts[trainers[i]] > 1 ? "YES" : "NO");
}
}' sample.txt
Another option is to make two passes:
$ cat p.awk
BEGIN {FS=OFS="#"}
NR==1 {print;next};
NR==FNR {++trainers[$2]; next}
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}
$ awk -f p.awk p.txt p.txt
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO
Explained:
Set the input and output field separators:
BEGIN {FS=OFS="#"}
Print the header:
NR==1 {print;next};
First pass, count occurrences of each trainer:
NR==FNR {++trainers[$2]; next}
Second pass, set YES or NO according to trainer count, and print result:
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}
I am trying to write an awk script that, before anything else is done, tells the user how many lines are in the file. I know how to do this in the END section but am unable to do so in the BEGIN section. I have searched SE and Google but have only found a half dozen ways to do this in the END section or as part of a bash script, not how to do it before any processing has taken place at all. I was hoping for something like the following:
#!/usr/bin/awk -f
BEGIN{
print "There are a total of " **TOTAL LINES** " lines in this file.\n"
}
{
if($0==4587){print "Found record on line number "NR; exit 0;}
}
But have been unable to determine how to do this, if it is even possible. Thanks.
You can read the file twice:
awk 'NR!=1 && FNR==1 {print NR-1} <some more code here>' file{,}
In your example:
awk 'NR!=1 && FNR==1 {print "There are a total of "NR-1" lines in this file.\n"} $0==4587 {print "Found record on line number "NR; exit 0;}' file{,}
You can use file file instead of file{,} (it just makes it show up twice)
NR!=1 && FNR==1 this will be true only at first line of second file.
To use an awk script containing:
#!/usr/bin/awk -f
NR!=1 && FNR==1 {
print "There are a total of "NR-1" lines in this file.\n"
}
$0==4587 {
print "Found record on line number "NR; exit 0
}
call:
awk -f myscript file{,}
To do this robustly and for multiple files you need something like:
$ cat tst.awk
BEGINFILE {
numLines = 0
while ( (getline line < FILENAME) > 0 ) {
numLines++
}
print "----\nThere are a total of", numLines, "lines in", FILENAME
}
$0==4587 { print "Found record on line number", FNR, "of", FILENAME; nextfile }
$
$ cat file1
a
4587
c
$
$ cat file2
$
$ cat file3
d
e
f
4587
$
$ awk -f tst.awk file1 file2 file3
----
There are a total of 3 lines in file1
Found record on line number 2 of file1
----
There are a total of 0 lines in file2
----
There are a total of 4 lines in file3
Found record on line number 4 of file3
The above uses GNU awk for BEGINFILE. Any other solution is difficult to implement such that it will handle empty files (you need an array to track files being parsed and print info in the FNR==1 and END sections after the empty file has been skipped).
Using getline has caveats and should not be used lightly, see http://awk.info/?tip/getline, but this is one of the appropriate and robust uses of it. You can also test for non-readable files in BEGINFILE by testing ERRNO and skipping the file (see the gawk manual) - that situation will cause other scripts to abort.
BEGIN {
s="cat your_file.txt|wc -l";
s | getline file_size;
close(s);
print file_size
}
This will put the number of lines in the file named your_file.txt into the awk variable file_size and print it out.
If your file name is dynamic you can pass the filename on the commandline and change the script to use the variable.
E.g. my.awk
BEGIN {
s="cat "VAR"|wc -l";
s | getline file_size;
close(s);
print file_size
}
Then you can call it like this:
awk -v VAR="your_file.txt" -f my.awk
If you use GNU awk and need a robust, generic solution that accommodates multiple, possibly empty input files, use Ed Morton's solution.
This answer uses portable (POSIX-compliant) code. Within the constraints noted, it is robust, but Ed's GNU awk solution is both simpler and more robust.
Tip of the hat to Ed Morton for his help.
With a single input file, it is simpler to handle line counting with a shell command in the BEGIN block, which has the following advantages:
on invocation, the filename doesn't have to be specified twice, unlike in the accepted answer
Also note that the accepted answer doesn't work as intended (as of this writing); the correct form is (see the comments on the answer for an explanation):
awk 'NR==FNR {next} FNR==1 {print NR-1} $0==4587 {print "Found record on line number "FNR; exit 0}' file{,}
the solution also works with an empty input file.
In terms of performance, this approach is either only slightly slower than reading the file twice in awk, or even a little faster, depending on the awk implementation used:
awk '
BEGIN {
# Execute a shell command to count the lines and read
# result into an awk variable via <cmd> | getline <varname>.
# If the file cannot be read, abort. (The shell has already printed an error msg.)
cmd="wc -l < \"" ARGV[1] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
printf "There are a total of %s lines in this file.\n\n", count
}
$0==4587 { print "Found record on line number " NR; exit 0 }
' file
Assumptions:
The filename is passed as the 1st operand (non-option argument) on the command line, accessed as ARGV[1].
The filename doesn't contain embedded " chars.
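If you would rather not shell out to wc, the same idea can be sketched with getline reading the file inside BEGIN. This is an alternative under the same assumption that the filename is ARGV[1]; the variable names file, line and count are mine, and the three-line temporary file is made up for the demonstration.

```shell
tmp=$(mktemp)
printf '%s\n' a 4587 c > "$tmp"

result=$(awk '
BEGIN {
  file = ARGV[1]
  # Read the file once here just to count lines; this form of getline
  # sets only the variable line and leaves NR untouched.
  while ((getline line < file) > 0) count++
  close(file)   # close it so the normal read below starts from the top
  printf "There are a total of %d lines in this file.\n", count
}
$0 == 4587 { print "Found record on line number " NR; exit 0 }
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Unlike the wc version, this does not abort on an unreadable file; the loop simply ends and the count stays at 0, so add an explicit check if that matters.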
The following solutions deal with multiple files and make analogous assumptions:
All operands passed are filenames. That is, all arguments after the program must be filenames, and not variable assignments such as var=value.
No filename contains embedded " chars.
No processing is to take place if any of the input files do not exist or cannot be read.
It's not hard to generalize this to handling multiple files, but the following solution doesn't print the line count for empty files:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 { printf "There are a total of %s lines in file %s.\n\n", counts[FILENAME], FILENAME }
# $0==4587 { printf "%s: Found record on line number %d\n", FILENAME, FNR; exit 0 }
' file1 file2 # ...
Things get a little trickier if you want the line count to be printed for empty files also:
awk '
BEGIN {
# Loop over all input files and store their line counts in an array.
for (i=1; i<ARGC; ++i) {
cmd="wc -l < \"" ARGV[i] "\""; if ((cmd | getline count) < 1) exit 1; close(cmd)
counts[ARGV[i]] = count
}
fileCount = ARGC - 1
fmtStringCount = "There are a total of %s lines in file %s.\n\n"
}
# At the beginning of every (non-empty) file, print the line count.
FNR==1 {
++fileIndex
# If there were intervening empty files, print their counts too.
while (ARGV[fileIndex] != FILENAME) {
printf fmtStringCount, 0, ARGV[fileIndex++]
}
printf fmtStringCount, counts[FILENAME], FILENAME
}
# Process input lines
$0==4587 { printf "%s: Found record on line number %d\n", FILENAME, FNR; exit 0 }
# If there are any remaining empty files at the end, print their counts too.
END {
while (fileIndex < fileCount) { printf fmtStringCount, 0, ARGV[++fileIndex] }
}
' file1 file2 # ...
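You can get the number of lines with wc and pass it to awk as a variable with the -v option; the variable is then available in the BEGIN block. Reading the file through a redirection makes wc print only the count, so no cut is needed (the -w option of cut is not available in GNU cut).
cat awk.txt \
| awk -v FNC="$(wc -l < awk.txt)" \
'BEGIN { print "FNC: " FNC } { print $0 }'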
Here is my problem
I have a File 1 where I have some data
Var1.1 Var1.2 Var1.3
Var2.1 Var2.2 Var2.3
Var3.1 Var3.2 Var3.3
And I have a File 2 that I would like to edit using the above data
File2 (1)
***pattern with Var2.1***
some text...
File2(2)
***pattern with Var2.1***
Here I want to add Var2.2 and Var2.3
some text
My first idea is to use AWK, but I don't know how to include a bash command in it. The AWK script should do something like this:
Search for the pattern in File2.
When awk finds it, awk calls a script which returns the wanted values from File1.
Then awk can edit File2.
Don't hesitate to suggest other possibilities if there are simpler ones!
Thank you !
This is how I run an external command from within awk to base64-decode a string:
cmd = "/usr/bin/base64 -i -d <<< " $2 " 2>/dev/null"
while ( ( cmd | getline result ) > 0 ) { }
close(cmd)
split(result, a, "[:=,]")
name=a[2]
Perhaps you can get some inspiration from it...
There's no need to run an external script to accomplish what you want. It can be done completely within a short AWK script.
awk 'FNR == NR {arr[$1] = $2 " " $3; next} {print; for (lookup in arr) {if ($0 ~ lookup) {split(arr[lookup], a); print "Here I want to add " a[1] " and " a[2]}}}' File1 File2
Explanation:
FNR == NR {arr[$1] = $2 " " $3; next} - Loop through the first file and save all the values in an array indexed by the first column. The record number equals the file record number for the first file.
print - Print every input line.
for (lookup in arr) {if ($0 ~ lookup) { - Loop through each of the array indices and see if the input line matches.
split(arr[lookup], a) - Split the value stored at the matched index into a temporary array.
print "Here I want to add " a[1] " and " a[2] - Print some text using the two values resulting from the split.
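One detail worth knowing: $0 ~ lookup treats the key as a regular expression, so the dot in Var2.1 matches any character. A sketch of the same lookup with a literal substring test via index() is below. The array name extra, the temporary files, and the Var2x1 line are made up to show the difference.

```shell
work=$(mktemp -d)
printf '%s\n' 'Var1.1 Var1.2 Var1.3' 'Var2.1 Var2.2 Var2.3' > "$work/File1"
printf '%s\n' 'pattern with Var2.1' 'pattern with Var2x1' 'some text' > "$work/File2"

result=$(awk '
  FNR == NR { extra[$1] = $2 " and " $3; next }   # first file: build the lookup table
  {
    print                                         # second file: echo every line
    for (key in extra)
      if (index($0, key))                         # literal substring match, not a regex
        print "Here I want to add " extra[key]
  }
' "$work/File1" "$work/File2")

printf '%s\n' "$result"
rm -rf "$work"
```

With the regex match, the Var2x1 line would also trigger the extra output; with index() only the exact Var2.1 line does.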