Finding if a column is in a range - Linux

I have two files, and I want to find out whether a column of file1 falls within a range defined by two columns of file2.
file1.txt
1 19
1 21
1 24
2 22
4 45
file2.txt
1 19 23 A
1 20 28 A
4 42 45 A
I am trying to see if the 1st column of file1.txt is the same as the 1st column of file2.txt and whether the 2nd column of file1.txt is between the 2nd and 3rd columns of file2.txt, and to append the matching file1.txt fields if the value is in the range.
So the output should be:
output.txt
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
What I have tried so far finds out whether the first columns are the same:
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' file1.txt file2.txt
1 19 23 A
1 20 28 A
4 42 45 A
But I am not able to add the greater-than/less-than conditions.
How do I add them?

The following may also help you here.
# run one awk pass over file2 for every line of file1
while read first second
do
  awk -v fir="$first" -v sec="$second" '$1==fir && ($2<=sec && $3>=sec){print $0,fir,sec}' file2
done < "file1"

Using join + awk:
join file2.txt file1.txt | awk '{if ($2 <= $5 && $5 <= $3) { print $1,$2,$3,$4,$1,$5 } }'
The two files are joined on the first column, then the range columns are compared and matching lines are printed (with the first column printed twice, since join collapses the join field into a single column).
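Note that join expects both inputs to be sorted on the join field (the sample files already are). If they were not, something like this should work (a sketch using bash process substitution):
join <(sort -k1,1 file2.txt) <(sort -k1,1 file1.txt) | awk '{if ($2 <= $5 && $5 <= $3) { print $1,$2,$3,$4,$1,$5 } }'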

Using awk:
$ awk 'NR==FNR{a[$1]=a[$1]" "$2;next} {split(a[$1],b);for(i in b) if(b[i]>=$2 && b[i]<=$3) print $0,$1,b[i]}' file1 file2
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
The first block stores the elements of file1 in the array a. The array index is the first column of the file, and the array value is the concatenation of all second-column numbers that share the same first-column value.
The second block splits the a element whose index matches the first column of file2 and loops over the resulting numbers, printing those that fall within the range.
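If GNU awk 4.0 or newer is available, the concatenate-and-split step can be avoided with a true array of arrays (a sketch, not part of the original answer):
gawk 'NR==FNR { a[$1][FNR]=$2; next }
      $1 in a { for (i in a[$1]) if (a[$1][i] >= $2 && a[$1][i] <= $3) print $0, $1, a[$1][i] }' file1 file2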
Another approach is to use join:
$ join -o 1.1 1.2 1.3 1.4 1.1 2.2 file2 file1 | awk '$6 >= $2 && $6 <= $3'
1 19 23 A 1 19
1 19 23 A 1 21
1 20 28 A 1 21
1 20 28 A 1 24
4 42 45 A 4 45
join -o generates the expected output format, and the awk statement keeps only the lines that are in range.
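Depending on the join implementation, the -o field list may need to be given as a single comma-separated argument (the POSIX form); the equivalent would then be:
join -o 1.1,1.2,1.3,1.4,1.1,2.2 file2 file1 | awk '$6 >= $2 && $6 <= $3'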


Translate specific columns to rows

I have this file (space-delimited):
bc1 no 12
bc1 no 15
bc1 yes 4
bc2 no 8
bc3 yes 14
bc3 yes 12
bc4 no 2
I would like to get this output:
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
1st column: one occurrence of the first column of the input file
2nd: number of occurrences of that value in the input file
3rd: the 2nd column of the input collapsed into a row with a ";" delimiter
4th: sum of the last column
I can do what I want with the "no/yes" column:
awk -F' ' 'NF>2{a[$1] = a[$1]";"$2}END{for(i in a){print i" "a[i]}}' test.txt | sort -k1,1n
With your shown samples, please try the following awk code. Since $1 (the first column) is already sorted, there is no need to sort the input, hence this solution:
awk '
prev!=$1 && prev{             # first column changed: flush the previous group
  print prev OFS count,value,sum
  count=sum=0
  prev=value=""
}
{
  prev=$1
  value=(value?value ";":"") $2   # collapse 2nd column with ";"
  count++
  sum+=$NF                        # sum the last column
}
END{
  if(prev){                   # flush the final group
    print prev OFS count,value,sum
  }
}
' Input_file
Here's a solution with newer versions of datamash (thanks glenn jackman for the tips about count and --collapse-delimiter):
datamash -t' ' -g1 --collapse-delimiter=';' count 2 collapse 2 sum 3 <ip.txt
With older versions (mine is 1.4) and awk:
$ datamash -t' ' -g1 count 2 collapse 2 sum 3 <ip.txt
bc1 3 no,no,yes 31
bc2 1 no 8
bc3 2 yes,yes 26
bc4 1 no 2
$ <ip.txt datamash -t' ' -g1 count 2 collapse 2 sum 3 | awk '{gsub(/,/, ";", $3)} 1'
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
-t sets space as the field separator. Column 2 is collapsed using column 1 as the key, sum 3 totals the numbers, and count 2 counts the number of collapsed rows.
Then awk is used to change the , in the third column to ;
One alternative awk idea:
awk '
function print_row() {
  if (series)
    print key,c,series,sum
  c=sum=0
  series=sep=""
}

{
  if ($1 != key)        # if 1st column has changed then print previous data set
    print_row()
  key=$1
  c++
  series=series sep $2
  sep=";"
  sum+=$3
}

END { print_row() }     # flush last data set to stdout
' input
This generates:
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
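If the first column were not already sorted, an associative-array variant would also do the job (a sketch, not one of the answers above; the final sort restores the group order):
awk '{ cnt[$1]++; val[$1] = (val[$1] ? val[$1] ";" : "") $2; sum[$1] += $3 }
     END { for (k in cnt) print k, cnt[k], val[k], sum[k] }' test.txt | sort -k1,1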

Shifting column titles to the right

I have a file which I want to process in bash or python.
It has 4 columns but only 3 column titles:
input.txt
1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
I am trying to shift the column titles to the right and add the title "INPUTS" for the first column (the input column).
Desired output, with the column title added:
Desired-output-step1.csv
INPUTS 1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
I tried with sed:
sed -i '1iINPUTS, 1STCOLUMN, 2NDCOLUMN, THIRDCOLUMN' input.txt
But I would prefer not to have to retype the names of the existing columns.
How do I just insert the new title for the first column and shift the other column titles to the right?
You can specify which line to modify by using line numbers:
$ sed '1s/^/INPUTS /' ip.txt
INPUTS 1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
Here, 1 indicates that the s command should be applied only to the 1st line.
s/^/INPUTS / inserts text at the start of the line; you'll have to adjust the spacing as needed.
Instead of counting and testing the spaces, you can let column -t do the padding and formatting job:
sed '1s/^/INPUTS /' ip.txt|column -t
This will give you:
INPUTS 1STCOLUMN 2NDCOLUMN THIRDCOLUMN
input1 12 33 45
input22 10 13 9
input4 2 23 11
input4534 3 1 1
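If awk is also an option, the same idea can be written as a one-liner (a sketch, equivalent to the sed version above):
awk 'NR==1 { $0 = "INPUTS " $0 } 1' ip.txt | column -t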

How can I use awk to modify a column based on the first column?

I have data like this, and I need to automate a simple task. I need to make the second value of each row become the same as the first value of the next row in the sequence, like this:
First Second
1 2
4 6
10 12
25 28
30 35
It should become:
First Second
1 4
4 10
10 25
25 30
30 35
$ awk 'NR==1; NR>2{print p[1], $1} {split($0,p)} END{print p[1], p[2]}' file
First Second
1 4
4 10
10 25
25 30
30 35
It should be noted that your expected output cannot be produced in a single streaming pass; you cannot know the 35 at that point because its row has not been read yet:
$ awk 'NR > 1 {print $1} {printf $1 "\t"}' file
1 4
4 10
10 25
25 30
30

How to extract one column from multiple files, and paste those columns into one file?

I want to extract the 5th column from multiple files named in numerical order, and paste those columns in sequence, side by side, into one output file.
The file names look like:
sample_problem1_part1.txt
sample_problem1_part2.txt
sample_problem2_part1.txt
sample_problem2_part2.txt
sample_problem3_part1.txt
sample_problem3_part2.txt
......
Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines.
The content looks like:
sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7
sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14
sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g
The output should look like this (in the sequence problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, etc.):
1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...
I was using:
paste sample_problem1_part1.txt sample_problem1_part2.txt > \
sample_problem1_partall.txt
paste sample_problem2_part1.txt sample_problem2_part2.txt > \
sample_problem2_partall.txt
paste sample_problem3_part1.txt sample_problem3_part2.txt > \
sample_problem3_partall.txt
And then:
for i in `find . -name "sample_problem*_partall.txt"`
do
  l=`echo $i | sed 's/sample/extracted_col_/'`
  awk '{print $5, $10}' $i > $l
done
And:
paste extracted_col_problem1_partall.txt \
extracted_col_problem2_partall.txt \
extracted_col_problem3_partall.txt > \
extracted_col_problemall_partall.txt
It works fine with a few files, but it's a crazy method when the number of files is large (over 4000).
Could anyone help me with simpler solutions that are capable of dealing with multiple files, please?
Thanks!
Here's one way using awk and a sorted glob of files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)
Results:
1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g
Explanation:
For each line of each input file:
Add the file's line number to an array, with column 5 as the value.
(a[FNR] ? a[FNR] FS : "") is a ternary expression used to build up the array value as a record. It asks whether the file's line number is already in the array. If so, the existing array value followed by the default field separator is kept and the fifth column is appended. If the line number is not yet in the array, nothing is prepended and the value is simply the fifth column.
At the end of the script:
A C-style loop iterates through the array, printing each of the array's values.
For only ~4000 files, you should be able to do:
find . -name 'sample_problem*_part*.txt' | xargs paste
If find is giving names in the wrong order, pipe it through sort:
find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
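If GNU sort is available, its version sort keeps, say, problem2 before problem10 (a concrete possibility for the sort step above; -V is a GNU extension):
find . -name 'sample_problem*_part*.txt' | sort -V | xargs paste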
# print filenames in sorted order
find -name sample\*.txt | sort |
# extract the 5th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python transpose.py ?
where transpose.py is:
#!/usr/bin/env python
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import izip_longest

missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in izip_longest(*[column.split() for column in sys.stdin],
                        fillvalue=missing_value):
    print " ".join(row)
Output
1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g
This assumes the first and second files have fewer lines than the third one (missing values are replaced by '?').
Try this one. My script assumes that every file has the same number of lines.
# get number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)

for ((i=1; i<=$lines; i++)); do
    for file in sample_problem*; do
        # get line number $i and delete everything except the last column
        # and then print it
        # echo -n means that no newline is appended
        echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
    done
    echo
done
This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.
PS: The time for my script increases linearly with the number of lines. It would take a very long time to merge files with 1000 lines each. You should consider learning awk and using the script from steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!
You can pass awk output to paste and redirect it to a new file as follows:
paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt

How to subset a file - select a number of rows or columns

I would like your advice/help on how to subset a big file (millions of rows or columns).
For example,
(1)
I have a big file (millions of rows, tab-delimited). I want a subset of this file with only rows 10000 to 100000.
(2)
I have a big file (millions of columns, tab-delimited). I want a subset of this file with only columns 10000 to 100000.
I know there are tools like head, tail, cut, split, awk and sed, and I can use them for simple subsetting. But I do not know how to do this particular job.
Could you please give me some advice? Thanks in advance.
Filtering rows is easy, for example with AWK:
cat largefile | awk 'NR >= 10000 && NR <= 100000 { print }'
Filtering columns is easier with CUT (tab is its default delimiter):
cat largefile | cut -f 10000-100000
As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added you can improve performance using:
awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
Some different solutions:
For row ranges:
In sed:
sed -n 10000,100000p somefile.txt
For column ranges in awk:
awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt
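As with the awk exit trick shown earlier, the sed row-range version can also be told to stop reading once the range has been printed (a small addition, not from the original answer):
sed -n '10000,100000p;100000q' somefile.txt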
For the first problem, selecting a set of rows from a large file, piping tail to head is very simple. You want rows 10000 through 100000 of largefile, which is 90001 rows. tail grabs the back end of largefile starting at row 10000 and then head chops off all but the first 90001 rows.
tail -n +10000 largefile | head -n 90001
I was beaten to the sed solution, so I'll post a perl ditto instead.
To print selected lines:
$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20'
10
11
12
13
14
15
16
17
18
19
20
To print selected columns, use:
perl -lane 'print join " ", @F[1..3]'
-F can be used in conjunction with -a to choose the delimiter on which lines are split (whitespace by default).
To test, use seq and paste to generate some columns:
$ seq 50 | paste - - - - -
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
41 42 43 44 45
46 47 48 49 50
Let's print everything except the first and the last column:
$ seq 50 | paste - - - - - | perl -lane 'print join " ", @F[1..3]'
2 3 4
7 8 9
12 13 14
17 18 19
22 23 24
27 28 29
32 33 34
37 38 39
42 43 44
47 48 49
In the join statement above there is a tab; you can type a literal tab by pressing Ctrl-V then Tab.
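For comparison, since paste emits tab-separated output, the same column range can also be taken with cut (an aside, not part of the original answer):
$ seq 50 | paste - - - - - | cut -f2-4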
