Compare 2 files and remove duplicate lines only once - linux

I need to remove the duplicate values from file 1 comparing with file 2 . When i was trying to do so , i am facing issue like since the value in file 2(c,g) also comes under [b] in file 1 , those are also getting deleted. but my requirement is to delete only those under [a]. Thanks
$ less file 1
[a]
c
g
d
[b]
c
g
h
and
$ less file 2
[a]
c
g
d

You can use this awk command:
awk '/^\[.*?\]/{s=$0} FNR==NR{seen[s,$0]++; next} !seen[s,$0]' file2 file1
[b]
c
g
h
This awk is using an associative array seen with a composite key of value inside [...] and all the subsequent records i.e. s,$0
While going through file2 it saves those value in array and while traversing through file1 it will print only those that aren't available in seen thus avoiding duplicates.

Related

Removing Only Sequential Duplicates from Text FIle? [duplicate]

This sounds simple on its face but is actually somewhat more complex. I would like to use a unix utility to delete consecutive duplicates, leaving the original. But, I would also like to preserve other duplicates that do not occur immediately after the original. For example, if we have the lines:
O B
O B
C D
T V
O B
I want the output to be:
O B
C D
T V
O B
Although the first and last lines are the same, they are not consecutive and therefore I want to keep them as unique entries.
You can do:
cat file1 | uniq > file2
or more succinctly:
uniq file1 file2
assuming file1 contains
O B
O B
C D
T V
O B
For more details, see man uniq. In particular, note that the uniq command accepts two arguments with the following syntax: uniq [OPTION]... [INPUT [OUTPUT]].
Finally if you'd want to remove all duplicates (and sort the file along the way), you could do:
sort -u file1 > file2

Fast extraction of lines based on line numbers

I am looking for a fast way to extract lines of a file based on a list of line numbers read from a different file in bash.
Define three files:
position_file: Containing a single column of integers
full_data_file: Containing a single column of data
extracted_data_file: Containing those lines in full_data_file whose line numbers match the integers in position_file
My current way of doing this is
while read position; do
awk -v pos="$position" 'NR==pos {print; exit}' < full_data_file >> extracted_data_file
done < position_file
The problem is that this is painfully slow and I'm trying to do this for a large number of rather large files. I was hoping someone might be able to suggest a faster way.
Thank you for your help.
The right way with awk command:
Input files:
$ head pos.txt data.txt
==> pos.txt <==
2
4
6
8
10
==> data.txt <==
a
b
c
d
e
f
g
h
i
j
awk 'NR==FNR{ a[$1]; next }FNR in a' pos.txt data.txt > result.txt
$ cat result.txt
b
d
f
h
j

AWK compare two columns in two seperate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are separated by tabulators and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Did anybody tried to do a similar thing? :)
Thanks in advance for help!
Your script is fine, but you need to provide each file individually to awk and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt) <(zcat file1.txt)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number
of the current record within its file . Thus NR == FNR is only
true when awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to make a nice collection of things. This is a
pithy way to make a collection of all the values we've seen in 5th column of the
first file. The next statement, which follows, says to immediately get the next
available record without looking at any anymore statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt),
save the value of column 5 in the array called a and move on to the record without
continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can past that first line of
the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a

How do I grep a number and all numbers above that number?

So I have a huge list of items.
I need to grep every lines containing the number: 1300 and above.
How can I do this? Will grep do this? Thanks
While grep technically can it's probably not the best tool for the job. If the list is in a fixed format you might be better off using something like awk.
Sample input:
a b c 1100 d e f
g h i 1200 j k l
m n o 1300 p q r
s t u 1400 v w x
Sample code:
awk -F' ' '($4 >= 1300) { print $0 }' input_file
Sample output:
m n o 1300 p q r
s t u 1400 v w x
awk goes through every line, splitting it into tokens, delimited by a space (as dictated by the parameter -F' ', by default it already uses space but explicitly showing it here lets you change it to however your file is formatted). The logic then says for all values in field 4 that are greater or equal to 1300, print the line (print $0).
Yes you can do this with grep, something along the lines of:
$ grep -E '(1[3-9][0-9]{2}|[2-9][0-9]{3}|[1-9][0-9]{4,})'

in Linux: merge two very big files

I would like to merge two files (one is space delimited and the other tab delimited) keeping only the records that are matching between the two files:
File 1: space delimited
A B C D E F G H
s e id_234 4 t 5 7 9
r d id_45 6 h 3 9 10
f w id_56 2 y 7 3 0
s f id_67 2 y 10 3 0
File 2: tab delimited
I L M N O P
s e 4 u id_67 88
d a 5 d id_33 67
g r 1 o id_45 89
I would like to match File 1 field 3 ("C") with file 2 field 5 ("O"), and merge the files like this:
File 3: tab delimited
I L M N O P A B D E F G H
s e 4 u id_67 88 s f 2 y 10 3 0
g r 1 o id_45 89 r d 6 h 3 9 10
There are entries in file 1 that don't appear in file 2, and vice versa, but I only want to keep the intersection (the common ids).
I don't really care about the order.
I would prefer not to use join because these are really big unsorted files and join requires to sort by common field before, which takes a very long time and much memory.
I have tried with awk but unsuccessfully
awk > file3 'NR == FNR {
f2[$3] = $2; next
}
$5 in f2 {
print $0, f2[$2]
}' file2 file1
Can someone please help me?
Thank you very much
Hmm.. you'll ideally be looking to avoid an n^2 solution which is what the awk based approach seems to require. For each record in file1 you have to scan file2 to see if occurs. That's where the time is going.
I'd suggest writing a python (or similar) script for this and building a map id->file position for one of the files and then querying that whilst scanning the other file. That'd get you an nlogn runtime which, to me at least, looks to be the best you could do here (using a hash for the index leaves you with the expensive problem of seeking to the file pos).
In fact, here's the Python script to do that:
f1 = file("file1.txt")
f1_index = {}
# Generate index for file1
fpos = f1.tell()
line = f1.readline()
while line:
id = line.split()[2]
f1_index[id] = fpos
fpos = f1.tell()
line = f1.readline()
# Now scan file2 and output matches
f2 = file("file2.txt")
line = f2.readline()
while line:
id = line.split()[4]
if id in f1_index:
# Found a matching line, seek to file1 pos and read
# the line back in
f1.seek(f1_index[id], 0)
line2 = f1.readline().split()
del line2[2] # <- Remove the redundant id_XX
new_line = "\t".join(line.strip().split() + line2)
print new_line
line = f2.readline()
If sorting the two files (on the columns you want to match on) is a possibility (and wouldn't break the content somehow), join is probably a better approach than trying to accomplish this with bash or awk. Since you state you don't really care about the order, then this would probably be an appropriate method.
It would look something like this:
join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2)
I wish there was a better way to tell it which columns to output, because that's a lot of typing, but that's the way it works. You could probably also leave off the -o ... stuff, and then just post-process the output with awk or something to get it into the order you want...

Resources