Compare one field, remove duplicate if value of another field is greater - Linux

I'm trying to do this at the Linux command line. I want to combine two files and compare values based on ID, but keep only the row whose Date value is newer/greater (edit: equal to or greater than). Because the ID 456604 is in both files, I want to keep only the one from File 2 with the newer date: "20111015 456604 tgf"
File 1
Date ID Note
20101009 456604 abc
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
File 2
Date ID Note
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf
And then I want the output to have both files combined, but keeping only the second ID's row, the one with the newer date. The order of the rows does not matter; the output below is just an example of the concept.
Output
Date ID Note
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf

$ cat file1.txt file2.txt | sort -ru | awk '!($2 in seen) { print; seen[$2] }'
Date ID Note
20111015 456604 tgf
20111015 333333 xyz
20111015 222222 abc
20111015 111111 abc
20101009 666666 xyz
20101009 555555 abc
20101009 444444 abc
Sort the combined files by descending date and only print a line the first time you see an ID.
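The same pipeline with the awk part spelled out and commented (a sketch, same file names as above):
cat file1.txt file2.txt |
sort -ru |                          # whole-line descending sort; -u drops exact duplicates
awk '!($2 in seen) {                # first time this ID (column 2) appears...
  print                             # ...print it; the newest date sorted first
  seen[$2]                          # referencing the element marks the ID as seen
}'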
EDIT
A more compact version, thanks to Steve:
cat file1.txt file2.txt | sort -ru | awk '!seen[$2]++'

You didn't specify how you'd like to handle the case where the dates are also duplicated, or even whether this case can exist. Therefore, I have assumed that by 'greater' you really mean 'greater than or equal to' (it also makes handling the header a tiny bit easier). If that's not the case, please edit your question.
awk code:
awk 'FNR==NR {
    a[$2] = $1              # date for this ID from file2
    b[$2] = $0              # whole line from file2
    next
}
a[$2] >= $1 {               # the file2 date is newer or equal: keep the file2 line
    print b[$2]
    delete b[$2]
    next
}
{ delete b[$2] }            # the file1 date is strictly newer: drop the file2 copy
1                           # print the file1 line (newer-dated or unmatched)
END {
    for (i in b) {          # anything left in b appeared only in file2
        print b[i]
    }
}' file2 file1
Explanation:
Basically, we use an associative array, called a, to store the 'ID' and 'Date' as key and value, respectively. We also store the contents of file2 in memory using another associative array called b. When file1 is read, we test whether column two exists in array a and whether that key's value is greater than or equal to column one. If it is, we print the corresponding line from array b, delete it from the array, and move on to the next record of input. Otherwise we delete any file2 copy of the ID (so it cannot be printed twice) and fall through to the 1 on its own, which always evaluates true and therefore prints the current file1 line; this covers both unmatched records and records where file1 has the strictly newer date. Finally, we print whatever is left in array b.
Results:
Date ID Note
20111015 456604 tgf
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 222222 abc
20111015 111111 abc
20111015 333333 xyz

Another awk way
awk 'NR==1;FNR>1{a[$2]=(a[$2]<$1&&b[$2]=$3)?$1:a[$2]}
END{for(i in a)print a[i],i,b[i]}' file1 file2
This compares the date stored in the array with the current record's date to determine which is newer, and also stores the third field when the current record is newer.
It then prints the stored date, the key (field 2), and the value stored for field 3. Note that for (i in a) visits keys in no particular order, which is fine here since the row order does not matter.
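Expanded with comments, the same logic reads as follows (an annotated sketch; the explicit if replaces the terse assignment-inside-condition and behaves the same for this data):
awk 'NR == 1                        # print the very first line (the header)
FNR > 1 {                           # skip each file header line
  if (a[$2] < $1) {                 # newer date for this ID?
    a[$2] = $1                      # remember the newer date
    b[$2] = $3                      # and its Note field
  }
}
END { for (i in a) print a[i], i, b[i] }' file1 file2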
Or shorter
awk 'NR==1;FNR>1{(a[$2]<$1&&b[$2]=$0)&&a[$2]=$1}END{for(i in b)print b[i]}' file1 file2

How to print all lines between a word and the first occurrence of another word?

input.txt
ABC
CDE
EFG
XYZ
ABC
PQR
EFG
From the above file I want to print the lines between 'ABC' and the first occurrence of 'EFG'.
Expected output :
ABC
CDE
EFG
ABC
PQR
EFG
How can I print lines from one word up to the first occurrence of a second word?
EDIT: In case you want to print every ABC-to-EFG block (not just the first) and skip the lines outside them, then try the following.
awk '/ABC/{found=1} found;/EFG/{found=""}' Input_file
Could you please try the following (this prints only the first ABC-to-EFG block).
awk '/ABC/{flag=1} flag && !count;/EFG/{count++}' Input_file
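The same one-liner with the logic commented (a sketch):
awk '/ABC/ { flag = 1 }             # a block starts
flag && !count                      # print while in a block and before the first EFG
/EFG/ { count++ }                   # after the first EFG, printing stops for good
' Input_file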
$ awk '/ABC/,/EFG/' file
Output:
ABC
CDE
EFG
ABC
PQR
EFG
This might work for you (GNU sed):
sed -n '/ABC/{:a;N;/EFG/!ba;p}' file
Turn off implicit printing by using the -n option.
Gather up lines between ABC and EFG and then print them. Repeat.
If you want to print only the first ABC-to-EFG block, use:
sed -n '/ABC/{:a;N;/EFG/!ba;p;q}' file
To print the second through fourth occurrences, keep a counter in the hold space: one x is appended for each completed block, and a block is printed only when the counter holds two to four x's:
sed -En '/ABC/{:a;N;/EFG/!ba;x;s/^/x/;/^x{2,4}$/{x;p;x};x;}' file
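A comparable awk sketch for the occurrence-range case (f flags being inside a block, n counts blocks; the bounds 2 and 4 are the assumed range):
awk '/ABC/ { n++; f = 1 }           # a new block starts: count it, raise the flag
f && n >= 2 && n <= 4               # print only while inside blocks 2 through 4
/EFG/ { f = 0 }' file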

linux/unix convert delimited file to fixed width

I have a requirement to convert a delimited file to fixed-width file, details as follows.
Input file sample:
AAA|BBB|C|1234|56
AA1|BB2|DD|12345|890
Output file sample:
AAA  BBB   C   1234  56
AA1  BB2   DD  12345 890
Details of field positions
Field 1 starts at position 1 and its length should be 5
Field 2 starts at position 6 and its length should be 6
Field 3 starts at position 12 and its length should be 4
Field 4 starts at position 16 and its length should be 6
Field 5 starts at position 22 and its length should be 3
Another awk solution:
echo -e "AAA|BBB|C|1234|56\nAA1|BB2|DD|12345|890" |
awk -F '|' '{printf "%-5s%-6s%-4s%-6s%-3s\n",$1,$2,$3,$4,$5}'
Note the - in the format specifiers (for example %-3s) in the printf statement, which left-aligns the fields, as required in the question. Output:
AAA  BBB   C   1234  56
AA1  BB2   DD  12345 890
With the following awk command you can achieve your goal:
awk 'BEGIN { RS=" "; FS="|" } { printf "%5s%6s%4s%6s%3s\n",$1,$2,$3,$4,$5 }' your_input_file
Your record separator (RS) is a space and your field separator (FS) is a pipe (|) character. In order to parse your data correctly we set them in the BEGIN statement (before any data is read). Then using printf and the desired format characters we output the data in the desired format.
Output:
  AAA   BBB   C  1234 56
  AA1   BB2  DD 12345890
Update:
I just saw your edits on the input file format (previously it seemed different). If your input records are separated by newlines, then simply remove the RS=" "; part from the above one-liner and add the - modifier to the format characters to left-align your fields:
awk 'BEGIN { FS="|" } { printf "%-5s%-6s%-4s%-6s%-3s\n",$1,$2,$3,$4,$5 }' your_input_file
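If the number of columns or their widths may change, one hedged variant (the width list is an assumption; adjust it to your layout) keeps the widths in a single place:
awk -F '|' 'BEGIN { n = split("5 6 4 6 3", w, " ") }  # assumed column widths
{
  for (i = 1; i <= n; i++)
    printf "%-" w[i] "s", $i        # left-align each field to its width
  print ""                          # terminate the record with a newline
}' your_input_file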

Comparing two files using Awk

I have two text files, one with a list of IDs and another with IDs and their corresponding values.
File 1
abc
abcd
def
cab
kac
File 2
abcd 100
def 200
cab 500
kan 400
So I want to compare the two files and fetch the value for the matching IDs, while keeping every ID from File 1 and assigning "NA" to the IDs that have no value in File 2.
Desired output
abc NA
abcd 100
def 200
cab 500
kac NA
PS: awk scripts/one-liners only.
The code I'm using to print matching columns:
awk 'FNR==NR{a[$1]++;next}a[$1]{print $1,"\t",$2}'
$ awk 'NR==FNR{a[$1]=$2;next} {print $1, ($1 in a? a[$1]: "NA") }' file2 file1
abc NA
abcd 100
def 200
cab 500
kac NA
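The same one-liner, annotated (a sketch; note the file order, file2 first):
awk '
NR == FNR { a[$1] = $2; next }           # first file (file2): map each ID to its value
{ print $1, ($1 in a ? a[$1] : "NA") }   # second file (file1): look up the ID, default to NA
' file2 file1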
Using join and sort (hopefully portable):
export LC_ALL=C
sort -k1 file1 > /tmp/sorted1
sort -k1 file2 > /tmp/sorted2
join -a 1 -e NA -o 0,2.2 /tmp/sorted1 /tmp/sorted2
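What the join options do, spelled out (shell comments):
# -a 1       also print unpairable lines from file 1 (IDs with no match in file2)
# -e NA      replace missing output fields with NA (takes effect with -o)
# -o 0,2.2   output the join field, then field 2 of file 2
join -a 1 -e NA -o 0,2.2 /tmp/sorted1 /tmp/sorted2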
In bash you can do it in a single line with process substitution:
LC_ALL=C join -a 1 -e NA -o 0,2.2 <(LC_ALL=C sort -k1 file1) <(LC_ALL=C sort -k1 file2)
Note 1, this gives output sorted by 1st column:
abc NA
abcd 100
cab 500
def 200
kac NA
Note 2, the commands may work even without LC_ALL=C. What matters is that all the sort and join commands use the same locale.

In AWK, how to split consecutive rows that have the same string as a "record"?

Let's say I have the text below.
aaaaaaa
aaaaaaa
bbb
bbb
bbb
ccccccccccccc
ddddd
ddddd
Is there a way to modify the text as follows?
1 aaaaaaa
1 aaaaaaa
2 bbb
2 bbb
2 bbb
3 ccccccccccccc
4 ddddd
4 ddddd
You could use something like this in awk:
$ awk '{print ($0!=p?++i:i),$0;p=$0}' file
1 aaaaaaa
1 aaaaaaa
2 bbb
2 bbb
2 bbb
3 ccccccccccccc
4 ddddd
4 ddddd
i is incremented whenever the current line differs from the previous line. p holds the value of the previous line, $0.
Alternatively, as suggested by JID:
awk '$0!=p{p=$0;i++}{print i,$0}' file
When the current line differs from p, replace p and increment i.
A further contribution (and even shorter!) by NeronLeVelu:
$ awk '{print i+=($0!=p),p=$0}' file
This version performs the addition assignment and basic assignment within the print statement. This works because the return value of each assignment is the value that has been assigned.
As pointed out in the comments, if the first line of the file is empty, the behaviour changes slightly. Assuming that the first line should always begin with a 1, the following block can be added to the start of any of the one-liners:
NR==1{p=$0;i=1}
i.e. on the first line, initialise p to the contents of the line (empty or not) and i to 1. Thanks to Wintermute for this suggestion.
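For example, combined with the first one-liner (a sketch):
awk 'NR==1{p=$0;i=1} {print ($0!=p?++i:i),$0; p=$0}' file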

How to delete the matching pattern from given occurrence

I'm trying to delete matching patterns, starting from the second occurrence, using sed or awk. The input file contains the information below:
abc
def
abc
ghi
jkl
abc
xyz
abc
I want to delete the pattern abc from the second instance onward. The output should be as below:
abc
def
ghi
jkl
xyz
Neat sed solution (GNU sed). Note that the inner 2,$ address works on line numbers, so this keeps the first abc only because it happens to sit on line 1:
sed '/abc/{2,$d}' test.txt
abc
def
ghi
jkl
xyz
$ awk '$0=="abc"{c[$0]++} c[$0]<2; ' file
abc
def
ghi
jkl
xyz
Just change the "2" to "3" or whatever number you want to keep the first N occurrences instead of just the first 1.
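If you want the pattern and the number of copies to keep as parameters, a hedged variant (the variable names p and n are arbitrary):
awk -v p="abc" -v n=1 '$0 == p && ++c > n { next } 1' file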
One way using awk:
$ awk 'f&&$0==p{next}$0==p{f=1}1' p="abc" file
abc
def
ghi
jkl
xyz
Just set p to the pattern for which you want only the first instance printed.
Taken from unix.com.
Using awk '!x[$0]++' will remove duplicate lines. x is an associative array whose values start out as 0, indexed by the whole line, $0. The first time a given line is seen, x[$0] is 0; because ++ here is a suffix increment, 0 is returned (and only then incremented), so !x[$0] is true and the line is printed by default. If $0 appears again, x[$0] is now non-zero, so !x[$0]++ is false and the line is not printed.
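For example, a minimal demo:
$ printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
a
b
c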
