How to use awk with a for loop? - linux

I have a 5 column file
2 649 2 82 1
3 651 1 83 1
2 652 3 84 2
... ... ... ... ...
The first column is n, the number of points in the segment; the second is the x coordinate; the third is delta x, the difference between the current x coordinate and the next one. Similarly, the fourth column is the y coordinate and the fifth is delta y. I need to generate all points in each segment, so from the data in the first line the output should be:
649 82
650 82.5
From the data in the second line
651 83
651.33 83.33
651.67 83.67
From the data in the third line
652 84
653.5 85
Any idea how to do it?

This will do:
awk '{n=$1; x=$2; dx=$3; y=$4; dy=$5; \
for(i=0;i<n;i++) printf "%.2f %.2f\n", x+i*dx/n, y+i*dy/n; }' file
You can adjust %.2f as desired, e.g. %.4f to print 4 fractional digits.
I only used variables for clarity. Otherwise, you could simply do:
awk '{for(i=0;i<$1;i++) printf "%.2f %.2f\n", $2+i*$3/$1, $4+i*$5/$1; }' file
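To check the loop against the three sample rows from the question, here is a small sketch that feeds them in through printf instead of a file. The variable name out is only used for this illustration.

```shell
# Feed the three sample rows from the question to the loop.
out=$(printf '%s\n' '2 649 2 82 1' '3 651 1 83 1' '2 652 3 84 2' |
  awk '{
    n = $1; x = $2; dx = $3; y = $4; dy = $5
    # emit n points, starting at (x, y) and stepping by dx/n and dy/n
    for (i = 0; i < n; i++)
      printf "%.2f %.2f\n", x + i*dx/n, y + i*dy/n
  }')
printf '%s\n' "$out"
```

This prints the seven points 649.00 82.00, 650.00 82.50, 651.00 83.00, 651.33 83.33, 651.67 83.67, 652.00 84.00 and 653.50 85.00, one per line, which matches the values asked for in the question up to the number of decimals.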

Related

Extract columns from multiple text files with bash or awk or sed?

I am trying to extract column1 and column4 from multiple text files.
file1.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 2 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 32 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 2 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
file2.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 5 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 40 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 6 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
Output format (save the output as merged.txt in another directory): column 1 (#rname) should appear only once, because it is the same in every file, but there should be one column 4 (numreads) per input file, each renamed after the file it came from.
Output file looks like:
#rname file1_numreads file2_numreads
CFLAU10s46802|kraken:taxid|33189 2 5
CFLAU10s46898|kraken:taxid|33189 32 40
CFLAU10s46988|kraken:taxid|33189 2 6
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 2 2
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 6 6
Your suggestions would be appreciated.
Here is something I put together. awk gurus might have a simpler - shorter version but I am still learning awk.
Create a file script.awk and make it executable. Put in it:
#!/usr/bin/awk -f
BEGIN { FS="\t" }
# process files, ignoring comments
!/^#/ {
    # keep the first column values.
    # Only add a new value if it is not already in the array.
    if (!($1 in firstcolumns)) {
        firstcolumns[$1] = $1
    }
    # extract the 4th column of file1, put it in the array (column 1).1
    if (FILENAME == ARGV[1]) {
        results[$1 ".1"] = $4
    }
    # extract the 4th column of file2, put it in the array (column 1).2
    if (FILENAME == ARGV[2]) {
        results[$1 ".2"] = $4
    }
}
# print the results
END {
    # for each first column value...
    for (key in firstcolumns) {
        # Print the first column, then (column 1).1, then (column 1).2
        print key "\t" results[key ".1"] "\t" results[key ".2"]
    }
}
Call it like this: ./script.awk file1.txt file2.txt.
Since awk parses the files line per line, I keep the possible values of the first column in an array (firstcolumns).
For each line, if the 4th column comes from file1.txt (ARGV[1]) I store it in the results array under (firstcolumn).1.
For each line, if the 4th column comes from file2.txt (ARGV[2]) I store it in the results array under (firstcolumn).2.
In the END block, loop through the possible firstcolumn values and print the values (firstcolumn).1 and (firstcolumn).2, separated by "\t" for tabs.
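Results (the two AUZW rows show C4 because in this test input each word of the description is a separate field, so field 4 is C4 rather than numreads):
$ ./script.awk file1.txt file2.txt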
AUZW01004514.1 C4 C4
CFLAU10s46988|kraken:taxid|33189 2 6
CFLAU10s46802|kraken:taxid|33189 2 5
AUZW01004739.1 C4 C4
CFLAU10s46898|kraken:taxid|33189 32 40
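The script above is tied to exactly two files and prints no header. As a rough sketch of how the same idea could extend to any number of files, the version below counts files with FNR == 1 and builds the header from each file name. It assumes the inputs really are tab-separated (so the whole sequence name is field 1) and that the first line of every file is the header. The sample rows (seqA, seqB) and the array names seen, order and reads are made up for this illustration.

```shell
# Work in a scratch directory with two tiny tab-separated sample files.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '#rname\tstartpos\tendpos\tnumreads\n' > file1.txt
printf 'seqA\t1\t125\t2\nseqB desc, with spaces\t1\t116\t32\n' >> file1.txt
printf '#rname\tstartpos\tendpos\tnumreads\n' > file2.txt
printf 'seqA\t1\t125\t5\nseqB desc, with spaces\t1\t116\t40\n' >> file2.txt

awk -F'\t' '
FNR == 1 {                        # header line of each input file
    nfiles++
    name = FILENAME
    sub(/\.[^.]*$/, "", name)     # file1.txt -> file1
    header = header "\t" name "_" $4
    if (nfiles == 1) first = $1   # "#rname", taken from the first file
    next
}
!($1 in seen) { seen[$1] = 1; order[++n] = $1 }   # remember input order
{ reads[$1, nfiles] = $4 }                        # numreads per name and file
END {
    print first header
    for (i = 1; i <= n; i++) {
        line = order[i]
        for (f = 1; f <= nfiles; f++) line = line "\t" reads[order[i], f]
        print line
    }
}' file1.txt file2.txt > merged.txt
cat merged.txt
```

With these two sample files, merged.txt gets the header #rname, file1_numreads, file2_numreads, followed by one tab-separated row per sequence name in input order. A name that is missing from one file gets an empty cell in that file's column.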

Datamash: Transposing the column into rows based on group in bash

I have a tab-delimited file with 2 columns, like the following:
A 123
A 23
A 45
A 67
B 88
B 72
B 50
B 23
C 12
C 14
I want to transpose the above data, grouped by the first column, like the following:
A 123 23 45 67
B 88 72 50 23
C 12 14
I tried datamash transpose < input-file.txt, but it didn't yield the expected output.
One awk version:
awk '{printf ($1!=f?"\n%s":" "$2),$0;f=$1}' file
A 123 23 45 67
B 88 72 50 23
C 12 14
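With this version you get one blank line at the top and no newline after the last row, but it should be fast and handle large data, since no loops or arrays are used.
How it works:
($1!=f?"\n%s":" "$2),$0 picks the printf format. If the first field is not equal to f, the format is "\n%s", so a newline and then the whole line ($0) are printed. If $1 equals f, the format is a space followed by field 2, so only field 2 is printed.
f=$1 sets f to the first field, ready for the next line.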
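The one-liner above starts its output with a blank line and passes field 2 to printf as a format string. A slightly longer variant that avoids both, still without arrays, might look like the sketch below. It assumes the input is already grouped by the first column and that keys are non-empty; prev and out are names chosen for this illustration.

```shell
# Sample data from the question, tab-separated, piped in instead of a file.
out=$(printf '%s\t%s\n' A 123 A 23 A 45 A 67 B 88 B 72 B 50 B 23 C 12 C 14 |
  awk '
    $1 != prev {                     # a new group starts on this line
        if (NR > 1) printf "\n"      # finish the previous group
        printf "%s", $1
        prev = $1
    }
    { printf " %s", $2 }             # append the value to the current row
    END { if (NR > 0) printf "\n" }  # terminate the last row
  ')
printf '%s\n' "$out"
```

This prints the three rows A 123 23 45 67, B 88 72 50 23 and C 12 14, with no blank line at the top.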
datamash --group=1 --field-separator=' ' collapse 2 <file | tr ',' ' '
Output:
A 123 23 45 67
B 88 72 50 23
C 12 14
Input must be sorted, as in the question.
This might work for you (GNU sed):
sed -E ':a;N;s/^((\S+)\s+.*)\n\2/\1/;ta;P;D' file
Append the next line and if the first field of the first line is the same as the first field of the second line, remove the newline and the first field of the second line. Print the first line in the pattern space and then delete it and the following newline and repeat.

How to test two entries per line to given intervals?

I have a reference file ref with certain values (v1 and v2), and for every value there is an interval with upper (ub) and lower (lb) bounds, and a group number (gn) defined:
v1 v2 ub1 lb1 ub2 lb2 gn
50 25 51 49 26 24 1
86 13 86.5 85.5 14 12 2
...
Now I have a file test with many lines and two of the entries of every line have values that lie within the intervals defined in ref. The goal is to assign every line the group number which corresponds to the entries in the reference file.
Input file:
50.2 24.6
85.7 13.9
86.3 12.6
Desired output:
50.2 24.6 1
85.7 13.9 2
86.3 12.6 2
My approach so far is this code with bash and awk:
while read line
do
    lin=( ${line} )
    rot=${lin[0]}
    tilt=${lin[1]}
    awk -v line="${line}" -v rot="$rot" -v tilt="$tilt" ' {if ((rot>$4) && (rot<$3) && (tilt>$6) && (tilt<$5)) {print line,$7} } ' reference >> output
done < test
But it won't work, the test file has 130000 lines, but the output file has only 11000. So obviously I am doing something wrong. I'm grateful for any suggestions.
With . used as the decimal separator:
$ awk 'NR==FNR && NR>1{ub1[$NF]=$3;lb1[$NF]=$4;ub2[$NF]=$5;lb2[$NF]=$6; next}
{for(k in lb1)
if(lb1[k]<$1 && $1<ub1[k] &&
lb2[k]<$2 && $2<ub2[k]) print $0, k}' file input
50.2 24.6 1
85.7 13.9 2
86.3 12.6 2
You may need to change the locale settings to use , as the decimal separator. Also, the code assumes there is one pair of ranges per group number (so the ranges are indexed by group number); if not, you need to index by row number and keep a mapping from row number to group number as well.
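Indexing by row number could look roughly like the sketch below. The third reference row and the last two test points are made up for this illustration, to give group 2 a second range and to include one point that matches nothing. The array gn, the counter n and the scratch directory are likewise assumptions, not part of the original answer.

```shell
dir=$(mktemp -d) || exit 1
# Reference table; the third row is made up to give group 2 a second range.
cat > "$dir/ref" <<'EOF'
v1 v2 ub1 lb1 ub2 lb2 gn
50 25 51 49 26 24 1
86 13 86.5 85.5 14 12 2
90 30 91 89 31 29 2
EOF
# Test points; the last two are made up (one in the extra range, one in none).
cat > "$dir/test" <<'EOF'
50.2 24.6
85.7 13.9
86.3 12.6
90.5 30.2
10.0 10.0
EOF

out=$(awk '
  NR == FNR {                    # first file: the reference table
      if (FNR > 1) {             # skip its header line
          n++
          ub1[n] = $3 + 0; lb1[n] = $4 + 0
          ub2[n] = $5 + 0; lb2[n] = $6 + 0
          gn[n]  = $7            # row number -> group number
      }
      next
  }
  {                              # second file: the points to classify
      for (i = 1; i <= n; i++)
          if (lb1[i] < $1 && $1 < ub1[i] && lb2[i] < $2 && $2 < ub2[i]) {
              print $0, gn[i]
              break              # first matching range wins
          }
  }' "$dir/ref" "$dir/test")
printf '%s\n' "$out"
```

The first four points come out tagged 1, 2, 2 and 2. The point 10.0 10.0 falls in no interval and is silently dropped, which is one way an output file can end up with fewer lines than the input.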

Use part of a column in one file as search term in other file

I have two files. The output file I am searching has earthquake locations and has the following format:
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
I want to use the last 6 digits of the first column (i.e. '090418' out of '19090418') with the first 3 digits of the second column (i.e. '072' out of '0728') as my search term. The file I am searching has the following format:
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
After I search the second file for the term, I need to figure out how many lines are in that section. So for this example, if I were searching the terms shown for the second file above, I want to know there are 2 lines for 090212132 and 5 lines for 090216182.
This is my first post, so please let me know how I can improve clarity or conciseness in my posts. Thanks for the help!
awk to the rescue!
$ awk 'NR==FNR{a[substr($1,3) substr($2,1,3)]; next}
{k=substr($3,1,9)}
k in a{a[k]++}
END{for(k in a) if(a[k]>0) print k,a[k]}' file1 file2
With the sample input shown in the question, this prints 090212132 2 and 090216182 5 (one per line, in no particular order), as expected.
The answer karakfa suggested worked! My output looks like this:
100224194 7
100117172 18
091004005 11
090520220 10
090526143 21
090122033 20
Thanks for the help!
karakfa's answer, with explanation:
awk 'NR==FNR {                    # For first file
    $1 = substr($1, 3);           # Drop the first 2 characters of col 1 (keeps the last 6)
    $2 = substr($2, 1, 3);        # Get first 3 characters from second col
    a[$1 $2];                     # Add the combined key to an array
    next }                        # Move to next record in first file
                                  # Start processing second file
{k = substr($3, 1, 9)}            # Get first 9 characters of third col
k in a {a[k]++}                   # If key is in a, increment its count
END {
    for (k in a)                  # Iterate over the array
        if (a[k] > 0)             # If the key was matched
            print k, a[k]         # Print the key and its number of occurrences
}' file1 file2
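To try the command on the sample lines from the question, the sketch below writes them to two scratch files and sorts the result, since the order of for (k in a) is unspecified. The scratch directory and the names file1, file2 and out are placeholders for this illustration.

```shell
dir=$(mktemp -d) || exit 1
# The three earthquake location lines from the question.
cat > "$dir/file1" <<'EOF'
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
EOF
# The seven lines of the second file from the question.
cat > "$dir/file2" <<'EOF'
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
EOF

out=$(awk '
  NR == FNR { a[substr($1, 3) substr($2, 1, 3)]; next }  # keys from file1
  { k = substr($3, 1, 9) }                                # key of this line
  k in a { a[k]++ }                                       # count known keys only
  END { for (k in a) if (a[k] > 0) print k, a[k] }
' "$dir/file1" "$dir/file2" | sort)
printf '%s\n' "$out"
```

This prints 090212132 2 and 090216182 5. The third key, 090711211, never occurs in the second file, so it is not printed.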

How can I swap numbers inside data block of repeating format using linux commands?

I have a huge data file, and I want to swap some numbers in the 2nd column only, in a file of the following format. The file has 25,000,000 datasets of 8768 lines each.
%% Edited: shorter 10 line example. Sorry for the inconvenience. This is typical one data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 repeating header lines at the head and 1 trailing line at the end of each dataset. The header and trailing lines all begin with #. As a result, each data block has 7 header lines, 8768 data lines, and 1 trailing line, 8776 lines in total. The trailing line contains only a single '#'.
I want to swap some numbers in 2nd columns only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
of the 2nd column, and then,
666 => 6
333 => 3
222 => 2
of the 2nd column. I hope to conduct this replacing for all repeating dataset.
I tried this with Python, but the data is too big and it causes a memory error. How can I perform this swap with Linux commands like sed, awk, or cat?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as it's using the gensub command and $0 reassignment.
Put the following into an executable awk file ( like script.awk ):
#!/usr/bin/awk -f
BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7] = a[8] = 3
    a[3] = a[4] = a[5] = 2
}
function swap( c2, val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}
/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }
47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user-defined function swap to provide values for the 2nd column from the a array, or the value itself if there is no mapping. The c2 element is passed in, while the val element is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), then use gensub to replace the first occurrence of the first number pattern with itself concatenated with a space and the return from swap (the action). In this case, I'm using gensub's replacement text to preserve the first column data. The second column is passed to swap using the field data identifier of $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true provides the default action of printing $0, which for data lines might have been modified. Any line that wasn't "data" will be printed out here w/o modifications.
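If GNU awk is not available and the exact column spacing does not matter, a plain awk variant is possible. The sketch below assigns to $2 directly, which makes awk rebuild the line with single spaces between fields, so unlike the gensub version it does not preserve the original formatting. Each line is looked up only once, so the intermediate 666/333/222 step from the question is not needed. The sample rows are the first four columns of a few data lines from the question, and the array name map is chosen for this illustration.

```shell
out=$(printf '%s\n' '# Dataset 1' '5 11 3 10' '10 12 3 5' '50 10 3 48' \
                    '120 2 4 5' '186 5 4 120' '#' |
  awk '
    BEGIN {                                   # old value -> new value
        map[1] = map[9] = map[10] = map[11] = 6
        map[2] = map[6] = map[7]  = map[8]  = 3
        map[3] = map[4] = map[5]  = 2
    }
    # data line whose 2nd column has a mapping: replace it
    !/^#/ && ($2 in map) { $2 = map[$2] }
    { print }                                 # print every line, changed or not
  ')
printf '%s\n' "$out"
```

Here 11 and 10 become 6, 2 becomes 3 and 5 becomes 2, while 12 has no mapping and stays as it is. Lines starting with # pass through untouched. Since awk reads one line at a time, memory use stays flat regardless of file size.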
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
 1 1 don't change the for matting
 2 2 4 23242.223 data
 3 3 data that's formatted
 4 4 7 that's formatted
 5 5 data that's formatted
 6 6 data that's formatted
 7 7 data that's formatted
 8 8 data that's formatted
 9 9 data that's formatted
 10 10 data that's formatted
 11 11 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
 1 6 don't change the for matting
 2 3 4 23242.223 data
 3 2 data that's formatted
 4 2 7 that's formatted
 5 2 data that's formatted
 6 3 data that's formatted
 7 3 data that's formatted
 8 3 data that's formatted
 9 6 data that's formatted
 10 6 data that's formatted
 11 6 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
