Use part of a column in one file as a search term in another file - linux

I have two files. The file I take my search terms from contains earthquake locations and has the following format:
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
I want to use the last 6 digits of the first column (i.e. '090418' out of '19090418') with the first 3 digits of the second column (i.e. '072' out of '0728') as my search term. The file I am searching has the following format:
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
After I search the second file for the term, I need to figure out how many lines are in that section. So for this example, if I were searching the terms shown for the second file above, I want to know there are 2 lines for 090212132 and 5 lines for 090216182.
This is my first post, so please let me know how I can improve clarity or conciseness in my posts. Thanks for the help!

awk to the rescue!
$ awk 'NR==FNR{a[substr($1,3) substr($2,1,3)]; next}
{k=substr($3,1,9)}
k in a{a[k]++}
END{for(k in a) if(a[k]>0) print k,a[k]}' file1 file2
With the sample input files this prints 090212132 2 and 090216182 5; a key with no match in the second file (like 090711211 from the third line) produces no output, as expected.

The answer karakfa suggested worked! My output looks like this:
100224194 7
100117172 18
091004005 11
090520220 10
090526143 21
090122033 20
Thanks for the help!

karakfa's answer, with explanation:
awk 'NR==FNR { # For first file
$1 = substr($1, 3); # Get last 6 characters from first col
$2 = substr($2, 1, 3); # Get first 3 characters from second col
a[$1 $2]; # Add to an array
next } # Move to next record in first file
# Start processing second file
{k = substr($3, 1, 9)} # Get the first 9 characters of the third col
k in a {a[k]++} # If the key is in a, increment its count
END {
for (k in a) # Iterate array
if (a[k] > 0) # If pattern was matched
print k, a[k] # print the pattern and num occurrence
}'
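For a quick cross-check without awk, a shell loop with substring expansion and grep -c gives the same counts. This is only a sketch: it assumes the literal " P " separator shown in the second file, and it reruns grep once per line of file1, so it is far slower on big inputs.
while read -r date time _; do
  key="${date:2}${time:0:3}"       # last 6 digits of column 1 + first 3 digits of column 2
  n=$(grep -c " P $key" file2)     # count lines in file2 whose timestamp starts with the key
  [ "$n" -gt 0 ] && echo "$key $n"
done < file1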

Related

Extract columns from multiple text files with bash or awk or sed?

I am trying to extract column1 and column4 from multiple text files.
file1.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 2 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 32 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 2 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
file2.txt:
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
CFLAU10s46802|kraken:taxid|33189 1 125 5 105 84 1.68 36.8 24
CFLAU10s46898|kraken:taxid|33189 1 116 40 116 100 23.5862 35.7 19.4
CFLAU10s46988|kraken:taxid|33189 1 105 6 53 50.4762 1.00952 36.9 11
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 1 1102 2 88 7.98548 0.15971 36.4 10
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 1 2133 6 113 5.2977 0.186592 36.6 13
Output format (save the output as merged.txt in another directory): column 1 (#rname) appears only once, since it is the same in every file, but there is one column 4 (numreads) per input file, and each numreads column should be renamed according to its file name.
Output file looks like:
#rname file1_numreads file2_numreads
CFLAU10s46802|kraken:taxid|33189 2 5
CFLAU10s46898|kraken:taxid|33189 32 40
CFLAU10s46988|kraken:taxid|33189 2 6
AUZW01004514.1 Cronartium comandrae C4 contig1015102_0, whole genome shotgun sequence 2 88
AUZW01004739.1 Cronartium comandrae C4 contig1070682_0, whole genome shotgun sequence 6 113
Your suggestions would be appreciated.
Here is something I put together. awk gurus might have a simpler, shorter version, but I am still learning awk.
Create a file script.awk and make it executable. Put in it:
#!/usr/bin/awk -f
BEGIN { FS="\t" }
# process files, ignoring comments
!/^#/ {
# keep the first column values.
# Only add a new value if it is not already in the array.
if (!($1 in firstcolumns)) {
firstcolumns[$1] = $1
}
# extract the 4th column of file1, put it in the array (column 1).1
if (FILENAME == ARGV[1]) {
results[$1 ".1"] = $4
}
# extract the 4th column of file2, put it in the array (column 1).2
if (FILENAME == ARGV[2]) {
results[$1 ".2"] = $4
}
}
# print the results
END {
# for each first column value...
for (key in firstcolumns) {
# Print the first column, then (column 1).1, then (column 1).2
print key "\t" results[key ".1"] "\t" results[key ".2"]
}
}
Call it like this: ./script.awk file1.txt file2.txt.
Since awk parses the files line by line, I keep the possible values of the first column in an array (firstcolumns).
For each line, if the 4th column comes from file1.txt (ARGV[1]) I store it in the results array under (firstcolumn).1.
For each line, if the 4th column comes from file2.txt (ARGV[2]) I store it in the results array under (firstcolumn).2.
In the END block, loop through the possible firstcolumn values and print the values (firstcolumn).1 and (firstcolumn).2, separated by "\t" for tabs.
Results:
$ ./so.awk file1.txt file2.txt
AUZW01004514.1 C4 C4
CFLAU10s46988|kraken:taxid|33189 2 6
CFLAU10s46802|kraken:taxid|33189 2 5
AUZW01004739.1 C4 C4
CFLAU10s46898|kraken:taxid|33189 32 40
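As an aside, the AUZW rows above show what goes wrong when an rname containing spaces gets split on whitespace: C4 (the 4th whitespace field) lands where numreads should be, which is why the script sets FS to a tab. A more general variant (a sketch, untested against the real data) handles any number of input files and prints the header row too; it assumes tab-separated input and GNU awk for ARGIND, and it names the columns straight from ARGV, so they keep the .txt extension:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }
/^#/ { next }                                        # skip each file's #rname header line
{
    if (!($1 in seen)) { seen[$1]; order[++n] = $1 } # remember rnames in first-seen order
    reads[$1, ARGIND] = $4                           # numreads per rname, per file
}
END {
    printf "#rname"
    for (f = 1; f < ARGC; f++) printf "%s%s_numreads", OFS, ARGV[f]
    print ""
    for (i = 1; i <= n; i++) {
        printf "%s", order[i]
        for (f = 1; f < ARGC; f++) printf "%s%s", OFS, reads[order[i], f]
        print ""
    }
}
Call it like ./merge.awk file1.txt file2.txt > ../merged.txt.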

How can I replace a specific character in a file where its position changes, in a bash command line or script?

I have the following file:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
The character "3" that I need to change is bolded and italicized. The value of this character is dynamic, but always a single digit. I have tried a few things using sed but I can't come up with a way to account for the character changing position due to additional characters being added before that position.
This character is always at the same position from the END of the line, but not from the beginning. Meaning, the content to the left of this character may change and it may be longer, but this is always the 11th character and 6th digit from the end. It is easy to devise a way to cut it, or find it using tail, but I can't devise a way to replace it.
To be clear, the single digit character in question will always be replaced with another single digit character.
With GNU awk
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
$ gawk -i inplace -v new=9 'NF {$(NF-5) = new} 1' file
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 9 1 1 1 1 1
Where:
NF {$(NF-5) = new} means, when the line is not empty, replace the 6th-last field with the new value (9).
1 means print every record.
awk '{ $(NF-5) = ($(NF - 5) + 8) % 10; print }'
Given your input data, it produces:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 1 1 1 1 1 1
The 3 has been mapped to 1 (3 + 8 = 11, and 11 % 10 = 1). Pick your poison on how you assign the new value; the magic is $(NF - 5), which picks up the fifth field before the last one (i.e., the sixth from the end).
Would you try the following:
replace="x" # or whatever you want to replace
sed 's/\(.\)\(.\{10\}\)$/'"$replace"'\2/' file
The left portion of the sed command \(.\)\(.\{10\}\)$ matches a character, followed by ten characters, then anchored by the end of line.
Then the 1st character is replaced with the specified character and the following ten characters are reused.
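For example, with replace=9 and the sample line above (a quick check):
$ echo '2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1' | sed 's/\(.\)\(.\{10\}\)$/9\2/'
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 9 1 1 1 1 1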
I'm going to assume that the number you're looking for is the same distance from the end, regardless of what comes before it:
rev ~/test.txt | awk -v new=9 '{ $6 = new } 1' | rev
Since the replacement is a single digit, passing it back through rev is harmless; note that reassigning a field makes awk squeeze runs of spaces down to one.
Using the bash shell, which should be the last option:
rep=9
read -ra var <<< '2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1'
var[-6]=$rep              # overwrite the 6th field from the end (negative indices need bash 4.3+)
printf '%s ' "${var[@]}"; echo
If it is in a file:
rep=9
read -ra var < file.txt
var[-6]=$rep
printf '%s ' "${var[@]}"; echo
Not the shortest or fastest way, but it can be done...

How to test two entries per line against given intervals?

I have a reference file ref with certain values (v1 and v2), and for every pair of values there is an interval with upper (ub) and lower (lb) bounds and a group number (gn) defined:
v1 v2 ub1 lb1 ub2 lb2 gn
50 25 51 49 26 24 1
86 13 86.5 85.5 14 12 2
...
Now I have a file test with many lines; on every line, two of the entries have values that lie within the intervals defined in ref. The goal is to assign every line the group number corresponding to those entries in the reference file.
Input file:
50.2 24.6
85.7 13.9
86.3 12.6
Desired output:
50.2 24.6 1
85.7 13.9 2
86.3 12.6 2
My approach so far is this code with bash and awk:
while read line
do
lin=( ${line} )
rot=${lin[0]}
tilt=${lin[1]}
awk -v line="${line}" -v rot="$rot" -v tilt="$tilt" ' {if ((rot>$4) && (rot<$3) && (tilt>$6) && (tilt<$5)) {print line,$7} } ' reference >> output
done < test
But it doesn't work: the test file has 130,000 lines, but the output file has only 11,000. So obviously I am doing something wrong. I'm grateful for any suggestions.
With . used as the decimal separator:
$ awk 'NR==FNR && NR>1{ub1[$NF]=$3;lb1[$NF]=$4;ub2[$NF]=$5;lb2[$NF]=$6; next}
{for(k in lb1)
if(lb1[k]<$1 && $1<ub1[k] &&
lb2[k]<$2 && $2<ub2[k]) print $0, k}' file input
50.2 24.6 1
85.7 13.9 2
86.3 12.6 2
You may need to change your locale settings if , is used as the decimal separator. The code also assumes there is one pair of ranges per group number (the ranges are indexed by group number); if not, index by row number and keep a mapping from row number to group number as well, as in the sketch below.
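A variant keyed on the reference row instead, carrying the group number along (a sketch under the same open-interval assumption, untested beyond the sample):
$ awk 'NR==FNR && FNR>1{ub1[FNR]=$3; lb1[FNR]=$4; ub2[FNR]=$5; lb2[FNR]=$6; gn[FNR]=$7; next}
{for(r in gn)
if(lb1[r]<$1 && $1<ub1[r] &&
lb2[r]<$2 && $2<ub2[r]) {print $0, gn[r]; break}}' ref test
The break stops at the first matching interval, so a line is assigned at most one group even if ranges overlap.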

Rearrange column with empty values using awk or sed

I want to rearrange the columns of a txt file, but there are empty values, which cause a problem. For example:
testfile:
Name ID Count Date Other
A 1 10 513 x
6 15 312 x
3 18 314 x
B 19 31 942 x
8 29 722 x
When I tried $ more testfile | awk '{print $2"\t"$1"\t"$3"\t"$4"\t"$5}'
it becomes:
ID Name Count Date Other
1 A 10 513 x
15 6 312 x
18 3 314 x
19 B 31 942 x
29 8 722 x
which is not what I want. Please help; I want it to be:
ID Name Count Date Other
1 A 10 513 x
15 6 312 x
18 3 314 x
19 B 31 942 x
29 8 722 x
Moreover, I am not sure which columns might contain empty values, and the column widths are not fixed. Thank you.
Assuming your input file is not tab-separated and you have (or can get) GNU awk then I recommend:
$ awk -v FIELDWIDTHS="8 8 8 8 8" -v OFS='\t' '{
for (i=1;i<=NF;i++) {
gsub(/^\s+|\s+$/,"",$i)
}
t=$1; $1=$2; $2=t
}1' file
ID Name Count Date Other
1 A 10 513 x
6 15 312 x
3 18 314 x
19 B 31 942 x
8 29 722 x
If your file is tab-separated then all you need is:
awk 'BEGIN{FS=OFS="\t"} {t=$1; $1=$2; $2=t}1' file
Another awk alternative is to use the number of fields. If you know your data and the only deficit is in the first column, you can try this.
awk -v OFS="\t" 'NF==4{$5=$4;$4=$3;$3=$2;$2=$1;$1=""} {print $2,$1,$3,$4,$5}'
However, the output will be tab-separated instead of fixed-width format. You can achieve the same using printf and changing OFS, as in the sketch below, but perhaps tab-separated is what you really need for tabular representation.
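A printf variant of the same swap that keeps fixed-width columns (a sketch assuming 8-character columns, as in the sample):
awk 'NF==4{$5=$4;$4=$3;$3=$2;$2=$1;$1=""} {printf "%-8s%-8s%-8s%-8s%-8s\n", $2,$1,$3,$4,$5}' testfile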
The most natural model for awk to use is columns as defined by the transitions from white-space to non-white-space and back. Since you have columns that may themselves be white-space, the natural model won't work.
However, you can revert to a model based on column positions rather than transitions, so that a file containing only spaces (tabs would complicate things):
Name ID Count Date Other
A 1 10 513 x
6 15 312 x
3 18 314 x
B 19 31 942 x
8 29 722 x
can still be rearranged, though not as succinctly as transition-based columns.
The following awk script will do the trick, swapping name and id:
{
name = substr($0, 1,7);
id = substr($0, 9,7);
count = substr($0,17,7);
date = substr($0,25,7);
other = substr($0,33 );
print id" "name" "count" "date" "other;
}
If the original file is called pax.in and the awk script is stored in pax.awk, the command awk -f pax.awk pax.in will give you, as desired:
ID Name Count Date Other
1 A 10 513 x
6 15 312 x
3 18 314 x
19 B 31 942 x
8 29 722 x
Keep in mind I've written that script to be relatively flexible, allowing you to change the order of the columns quite easily. If all you want is to swap the first two columns, you could use:
awk '{print substr($0,9,8)substr($0,1,8)substr($0,17)}' qq.in
or the slightly shorter (if you're allowed to use other tools):
sed -E 's/^(.{8})(.{8})/\2\1/' qq.in

How can I swap numbers inside a data block of repeating format using Linux commands?

I have a huge data file, and I want to swap some numbers in the 2nd column only. The file has 25,000,000 datasets of 8768 lines each, in the following format.
(Edited: a shorter, 10-line example. Sorry for the inconvenience. This is one typical data block.)
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 header lines at the top and 1 trailing line at the end of the dataset, all beginning with #. As a result, each block has 7 header lines, 8768 data lines, and 1 trailing line, for 8776 lines per data block in total. The trailing line contains only a single '#'.
I want to swap some numbers in the 2nd column only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
of the 2nd column, and then,
666 => 6
333 => 3
222 => 2
of the 2nd column. I want to apply this replacement to every dataset in the file.
I tried this with Python, but the file is too big and it runs out of memory. How can I perform this swap with Linux commands like sed or awk or cat?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as it uses the gensub function and $0 reassignment.
Put the following into an executable awk file ( like script.awk ):
#!/usr/bin/awk -f
BEGIN {
a[1] = a[9] = a[10] = a[11] = 6
a[2] = a[6] = a[7] = a[8] = 3
a[3] = a[4] = a[5] = 2
}
function swap( c2, val ) {
val = a[c2]
return( val=="" ? c2 : val )
}
/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }
47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user-defined function swap to provide values for the 2nd column from the a array, or the value itself if there is no mapping. The c2 parameter is passed in, while val is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the leading number pattern with the first field, a space, and the return value of swap (the action). Here, gensub's replacement text preserves the first column's data; the second column is passed to swap via the field identifier $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true, providing the default action of printing $0, which for data lines may have been modified. Any line that wasn't "data" is printed here without modification.
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
1 1 don't change the for matting
2 2 4 23242.223 data
3 3 data that's formatted
4 4 7 that's formatted
5 5 data that's formatted
6 6 data that's formatted
7 7 data that's formatted
8 8 data that's formatted
9 9 data that's formatted
10 10 data that's formatted
11 11 data that's formatted
12 12 data that's formatted
13 13 data that's formatted
14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
1 6 don't change the for matting
2 3 4 23242.223 data
3 2 data that's formatted
4 2 7 that's formatted
5 2 data that's formatted
6 3 data that's formatted
7 3 data that's formatted
8 3 data that's formatted
9 6 data that's formatted
10 6 data that's formatted
11 6 data that's formatted
12 12 data that's formatted
13 13 data that's formatted
14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile
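If GNU awk isn't available, a plain-awk sketch of the same one-pass mapping works too (the two-step 666/333/222 replacement isn't needed, since the array maps straight to the final digits). Reassigning $2 does make awk rebuild the line, squeezing runs of spaces down to single separators:
awk 'BEGIN { m[1]=m[9]=m[10]=m[11]=6; m[2]=m[6]=m[7]=m[8]=3; m[3]=m[4]=m[5]=2 }
!/^#/ && ($2 in m) { $2 = m[$2] }
1' data > tempfile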
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
