Fast extraction of lines based on line numbers - Linux

I am looking for a fast way to extract lines of a file based on a list of line numbers read from a different file in bash.
Define three files:
position_file: Containing a single column of integers
full_data_file: Containing a single column of data
extracted_data_file: Containing those lines in full_data_file whose line numbers match the integers in position_file
My current way of doing this is:
while read position; do
    awk -v pos="$position" 'NR==pos {print; exit}' < full_data_file >> extracted_data_file
done < position_file
The problem is that this is painfully slow and I'm trying to do this for a large number of rather large files. I was hoping someone might be able to suggest a faster way.
Thank you for your help.

The right way to do this with awk:
Input files:
$ head pos.txt data.txt
==> pos.txt <==
2
4
6
8
10
==> data.txt <==
a
b
c
d
e
f
g
h
i
j
$ awk 'NR==FNR{ a[$1]; next } FNR in a' pos.txt data.txt > result.txt
$ cat result.txt
b
d
f
h
j
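The NR==FNR block is true only while the first file (pos.txt) is read, so its line numbers are stored as keys of the array a; for the second file, FNR in a prints exactly the requested lines in a single pass. Since the question mentions rather large files, one possible refinement (a sketch, assuming the line numbers in pos.txt are unique) is to stop reading as soon as every requested line has been printed:
awk 'NR==FNR{a[$1]; n++; next} FNR in a{print; if (!--n) exit}' pos.txt data.txt > result.txt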

Related

Bash script: filter columns based on a character

My text file should have two columns separated by a tab (represented below by \t). However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the 2nd value present after the space in column 1; e.g. in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=( $(cut -d$'\t' -f1 $file | cut -d' ' -f1) )
col2=( $(cut -d$'\t' -f2 $file) )
myArr=()
for ((idx=0; idx<${#col1[@]}; idx++)); do
    echo "${col1[$idx]} ${col2[$idx]}"
    # I will append to myArr here
done
The output appends the list of col2 to col1, as A B C D E 1 2 3 4 5. On top of this, my file is very large (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
A sed solution:
Search for a literal space followed by any number of non-tab characters and replace it with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there, just make it 's/ +[^\t]+//'.
Assuming that when you say a space you mean a blank character, then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
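With the sample file this produces the desired table; sub(/ .*/,"",$1) deletes everything in $1 from the first space onward, and the trailing 1 prints the rebuilt record:
$ awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
A 1
B 2
C 3
D 4
E 5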
A solution using Perl regular expressions (for me they are easier than sed's, and more portable, since sed's behavior varies between versions):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls | perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This captures the run of non-whitespace characters at the start of the line and the run of non-whitespace characters after the tab, discarding everything in between.
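Note that the replacement above joins the two captures with what may be a literal space; to be sure the output stays tab-separated, use \t explicitly in the replacement (a small variation on the same one-liner):
perl -pe 's/^(\S+).*\t(\S+)/$1\t$2/' ls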
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
Take advantage of FS and OFS and let them do all the hard work for you:
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
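Spelled out, the same FS/OFS trick reads roughly as follows (a sketch: the regex field separator swallows everything from the first blank to the last blank, and reassigning a field makes awk rebuild the record with OFS):
awk 'BEGIN{FS="[ \t].*[ \t]"; OFS="\t"} {$1=$1} 1' file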
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Concatenate two columns of a text file

I have a tsv file like
1 2 3 4 5 ...
a b c d e ...
x y z j k ...
How can I merge two contiguous columns, say the 2nd and the 3rd, to get
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
I need the code to work with text files with different numbers of columns, so I can't use something like awk 'BEGIN{FS="\t"} {print $1"\t"$2"-"$3"\t"$4"\t"$5}' file
awk is the first tool I thought about for the task and one I'm trying to learn, so I'm very interested in answers using it, but any solution with any other tool would be greatly appreciated.
With a simple sed command for a TSV file:
sed 's/\t/-/2' file
The output:
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
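The trailing 2 is the occurrence number, so this generalizes to any adjacent pair of columns: replacing the Nth tab merges columns N and N+1. For example, to join the 4th and 5th columns (a sketch; note that \t in s/// is a GNU sed extension, so insert a literal tab on other seds):
sed 's/\t/-/4' file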
The following awk may help, if you are not worried about the extra separator left behind when the 3rd field is emptied:
awk '{$2=$2"-"$3;$3=""} 1' Input_file
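For the sample input this prints the merged column but leaves a doubled separator where the 3rd field used to be (the caveat mentioned above):
$ awk '{$2=$2"-"$3;$3=""} 1' file
1 2-3  4 5 ...
a b-c  d e ...
x y-z  j k ...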
With awk:
awk -v OFS='\t' -v col=2 '{
    $(col)=$(col)"-"$(col+1);             # merge col and col+1
    for (i=col+1;i<NF;i++) $(i)=$(i+1);   # shift columns right of col+1 one to the left
    NF--;                                 # drop the last, now duplicated, field
}1' file                                  # 1 prints the record
Output:
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
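Because col is passed in with -v, the same program handles any adjacent pair without rewriting it; for instance (a sketch reusing the program above), merging the 3rd and 4th columns:
awk -v OFS='\t' -v col=3 '{$(col)=$(col)"-"$(col+1); for (i=col+1;i<NF;i++) $(i)=$(i+1); NF--}1' file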

AWK compare two columns in two separate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are tab-separated and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Did anybody try to do a similar thing? :)
Thanks in advance for your help!
Your script is fine, but you need to provide each file individually to awk and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt.gz) <(zcat file1.txt.gz)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number of the current record within its file. Thus NR == FNR is only true while awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to make a collection of things. This is a pithy way to make a collection of all the values we've seen in the 5th column of the first file. The next statement, which follows, says to immediately read the next available record without evaluating any more statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt), save the value of column 5 in the array called a and move on to the next record without continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can get past that first line of the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a
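Putting the pieces together, here is the whole program with its two pattern-action pairs annotated (the same one-liner as above, just spread out):
awk '
  NR==FNR { a[$5]; next }   # first file: remember every value of column 5
  $5 in a                   # second file: print lines whose column 5 was seen
' file2.txt file1.txt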

Extract lines from File2 already found in File1

Using the Linux command line, I need to output the lines from text file2 that are also found in file1.
File1:
C
A
G
E
B
D
H
F
File2:
N
I
H
J
K
M
D
L
A
Output:
A
D
H
Thanks!
You are looking for the tool grep.
Check this out. Let's say you have your inputs in the files file1 and file2:
grep -f file1 file2
will return:
H
D
A
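Note that grep -f treats each line of file1 as a pattern that may match anywhere within a line of file2. For exact whole-line matches, the fixed-string, line-anchored variant is safer:
grep -Fxf file1 file2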
A more flexible tool to use would be awk
awk 'NR==FNR{lines[$0]++; next} $1 in lines'
Example
$ awk 'NR==FNR{lines[$0]++; next} $1 in lines' file1 file2
H
D
A
What does it do?
NR==FNR{lines[$0]++; next}
NR==FNR checks whether the record number within the current file (FNR) is equal to the overall record number (NR). This is true only while the first file, file1, is being read.
lines[$0]++ Here we create an associative array indexed by the lines ($0) of file1.
$0 in lines This line works only for the second file, because of the next in the previous action. It checks whether the line from file2 is present in the saved array lines; if yes, the default action of printing the entire line is taken.
awk is more flexible than grep, as you can match any column in file1 against any column in file2 and decide to print any column rather than the entire line.
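For example, to match column 2 of file1 against column 1 of file2 and print only the third column of the matching lines (a sketch with made-up column choices):
awk 'NR==FNR{lines[$2]; next} $1 in lines {print $3}' file1 file2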
This is what the comm utility does, but you have to sort the files first. To get the lines in common between the two files:
comm -12 <(sort File1) <(sort File2)
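With the sample files this prints the common lines in sorted order, which here happens to match the expected output:
$ comm -12 <(sort File1) <(sort File2)
A
D
H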

Print last field in file and use it for name of another file

I have a tab-delimited file with 3 rows and 7 columns. I want to use the number at the end of the file to rename another file.
Example of tab delimited file:
a b c d e f g
a b c d e f g
a b c d e f 1235
So, I want to extract the number from the tab-delimited file and then rename "file1" to the number extracted (mv file1 1235).
I can print the column, but I cannot seem to extract just the number from the file. Even if I can extract the number I can't seem to figure out how to store that number to use as the new file name.
You can use this awk:
name=$(awk 'END {print $NF}' file)
mv file1 "$name"
In awk's END block, $0 still holds the last record read, so $NF there is the last field of the last line (here, 1235).
Something along these lines, perhaps?
num=$(tail -1 file | rev | awk '{print $1}' | rev)
mv file1 "$num"
Using a Perl one-liner:
perl -ne 'BEGIN{($f) = @ARGV} ($n) = /(\d+)$/; END{rename($f, $n)}' file1
The BEGIN block grabs the file name from @ARGV, each line's trailing digits are captured into $n, and at END the file is renamed to the last number seen.
