I have a CSV file where I need to dedupe entries whose FIRST field matches, even if the other fields don't. In addition, the line that is kept should be the one with the highest date in one of the other fields.
This is what my data looks like:
"47917244","000","OTC","20180718","7","2018","20180719","47917244","20180719"
"47917244","000","OTC","20180718","7","2018","20180731","47917244","20180731"
"47917244","000","OTC","20180718","7","2018","20180830","47917244","20180830"
All 3 lines have the same value in the first field. The 9th field is a date field, and I want to dedupe in such a way that the third line, which has the highest date value, is kept and the other two lines are deleted.
After checking another Stack Overflow post (Is there a way to 'uniq' by column?), I got it working with a mix of sort and awk:
sort -t, -u -k1,1 -k9,9 <file> |
awk -F',' '{ x[$1]=$0 } END { for (i in x) print x[i] }'
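An alternative that skips the sort entirely is to track the maximum date per key in a single awk pass. A minimal sketch, assuming (as in the sample) that the 9th field is always a quoted, zero-padded YYYYMMDD date, so plain string comparison agrees with date order; output order across keys is unspecified, just as in the original:
awk -F',' '$9 > max[$1] { max[$1] = $9; best[$1] = $0 }
END { for (k in best) print best[k] }' file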
I'm trying to filter (or move) folders based on a maximum value in each group in Linux...
For example: given a list of filenames in a folder, I'd like to move the one with the greatest Value per Code after sorting by SN, Code, Date.
SN-Code-Date-Value
01-2L-20200417-153542
01-2L-20200417-155640 --> move to folder
01-43-20200511-192316
01-43-20200521-165949
01-43-20200521-185815 --> move to folder
Thanks!
Sort by the keys and the timestamp, then pick the first record of each group with awk; it will be the newest one, since the timestamp is sorted in reverse order.
$ sort -t- -k1,2 -k3,4r file | awk -F- '!a[$1,$2]++'
Note that this sorts on the date as well; if you want the date to be part of the grouping key, add $3 to the lookup key in the awk script too.
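To actually move the selected files, the selection above can feed a loop. A minimal sketch, assuming the listed names are real files in the current directory and that the destination folder (the hypothetical dest/ here) already exists:
sort -t- -k1,2 -k3,4r file | awk -F- '!a[$1,$2]++' | while IFS= read -r name; do
  mv -- "$name" dest/
done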
I have a document (.txt) that is composed like this:
info1: info2: info3: info4
And I want to show some information by column.
For example, there are several different values in the "info3" field, and I want to see only the lines that have "test" in the "info3" column.
I think I have to use sort, but I'm not sure.
Any idea?
The previous answers assume that the third column is exactly equal to test. It looks like you were looking for lines where the value merely contains test, which calls for awk's match function:
awk -F: 'match($3, "test")' file
You can use awk for this. Assuming your columns are delimited by : followed by a space (as in your sample) and column 3 should be exactly test, the command below lists only the lines with that value.
awk -F': ' '$3=="test"' input-file
Assuming that the spacing is consistent, and you're looking for only test in the third column, use
grep ".*:.*: test:.*" file.txt
Or, to take care of any spacing that might occur:
grep ".*:.*: *test *:.*" file.txt
What's the difference between:
!tail -n +2 hits.csv | sort -k 1n -o output.csv
and
!tail -n +2 hits.csv | sort -t "," -k1 -n -k2 > output.csv
?
I'm trying to sort a CSV file by the first column first, then by the second column, so that lines with the same first column stay together.
It seems like the first one already does that correctly, by first sorting by the field before the first comma, then by the field following the first comma. (breaking ties, that is.)
Or does it not actually do that?
And what does the second command do/mean? (And what's the difference between the two?) There is a significant difference between the two output.csv files when I run the two.
And, finally, which one should I use? (Or are they both wrong?)
See also the answer by @morido for some other pointers, but here's a description of exactly what those two sort invocations do:
sort -k 1n -o output.csv
This assumes that the "fields" in your file are delimited by a transition from non-whitespace to whitespace (i.e. leading whitespace is included in each field, not stripped, as many might expect). It tells sort to order lines by a single key that starts at the first field and extends to the end of the line, treating that key as a numeric value. Numeric parsing stops at the first character that cannot be part of a number (in the C locale the comma stops it), so the key's numeric value is effectively the first column. When two keys compare equal, sort falls back to a last-resort byte-by-byte comparison of the entire line, which is why the second column appears to break ties, but lexicographically rather than numerically. The output is sent explicitly to a specific file.
sort -t "," -k1 -n -k2
This defines the field separator as a comma and then defines two keys to sort on. One subtlety: a global ordering option such as -n applies to every key that carries no modifiers of its own, regardless of where it appears on the command line, so here both keys are treated as numeric, not just the second. The first key starts at the first field and extends to the end of the line; the second key, consulted when the first compares equal, starts at the second field and also runs to the end of the line. Since numeric parsing stops at the comma (in the C locale; some locales treat it as a thousands separator), the first key effectively compares the first column as a number and the second key compares the second column as a number.
Since you didn't provide sample data, it's unknown whether the data in the first two fields is numeric or not, but I suspect you want something like what was suggested in the answer by @morido:
sort -t, -k1,1 -k2,2
or
sort -t, -k1,1n -k2,2n (alternatively sort -t, -n -k1,1 -k2,2)
if the data is numeric.
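If it helps to see the lexicographic-versus-numeric difference concretely, here is a tiny invented example (note also that -k1,1 ends the key at field one, whereas a bare -k1 extends it to the end of the line):
$ cat file
10,2
9,10
9,3
$ sort -t, -k1,1 -k2,2 file
10,2
9,10
9,3
$ sort -t, -k1,1n -k2,2n file
9,3
9,10
10,2
In the first run "10" sorts before "9" because the keys are compared character by character; the numeric version orders the values as numbers.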
First off: you want to remove the leading ! from these two commands. In Bash (and probably other shells, since this feature comes from csh), it would otherwise be a history expansion referencing the most recent command that started with tail, which does not make sense here.
The main difference between your two versions is that in the first case you are not taking the second column into account.
This is how I would do it:
tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2 > output.csv
-t specifies the field separator
-n turns on numerical sorting order
--key specifies the fields that should be used for sorting (in order of precedence)
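As a quick sanity check of the whole pipeline, with an invented hits.csv whose first row is a header:
$ cat hits.csv
id,count
3,20
1,5
3,4
1,12
$ tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2
1,5
1,12
3,4
3,20
The global -n applies to both keys here, since neither carries modifiers of its own.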
I have a text file like this.
john,3
albert,4
tom,3
junior,5
max,6
tony,5
I'm trying to fetch the records whose column 2 value appears on more than one line. My desired output:
john,3
tom,3
junior,5
tony,5
I'm wondering if I can use uniq -d on the second column?
Here's one way using awk. It reads the input file twice, but avoids the need to sort:
awk -F, 'FNR==NR { a[$2]++; next } a[$2] > 1' file file
Results:
john,3
tom,3
junior,5
tony,5
Brief explanation:
FNR==NR is a common awk idiom that is true only while reading the first file in the argument list. Here, a counter keyed on column two is incremented for every line. On the second read of the file, we simply check whether the count for column two is greater than one (the next keyword skips the rest of the script during the first pass).
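If the input arrives on a pipe and cannot be read twice, a single-pass variant buffers the lines in memory instead. A sketch that preserves the original line order:
awk -F, '{ count[$2]++; line[NR] = $0; key[NR] = $2 }
END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }' file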
You can use uniq on fields (columns), but not easily in your case.
Uniq's -f and -s options skip leading fields and characters respectively before comparing lines. However, neither quite does what you want here.
-f divides fields by whitespace, while yours are separated by commas.
-s skips a fixed number of characters, and your names are of variable length.
More fundamentally, uniq is used to compress input by consolidating duplicates into unique lines. You actually want to retain the duplicates and eliminate the singletons, the opposite of what uniq is built for, so a different approach is needed.
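That said, uniq can still contribute if the key is isolated first: extract column two, let uniq -d report the values that occur more than once, and turn those into end-anchored patterns to filter the original file. A sketch, assuming exactly two comma-separated columns and GNU grep (for -f -, which reads patterns from stdin):
cut -d, -f2 file | sort | uniq -d | sed 's/.*/,&$/' | grep -f - file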
I have a CSV that I need to search using a key that appears in the first column, and then I need to run awk again, searching on column 2, to return all matching data.
So: I'd awk with the first key, and it'd return just the contents of the second column [so just that cell]. Then I'd awk using those cell contents and have it return all matching rows.
I have almost no bash/awk scripting experience so please bear with me. :)
Input:
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2
,TRACKINGKEY1,TRACKINGNUMBER1-3,PACKAGENUM1-3
,TRACKINGKEY1,TRACKINGNUMBER1-4,PACKAGENUM1-4
,TRACKINGKEY1,TRACKINGNUMBER1-5,PACKAGENUM1-5
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2
Command:
awk -v key=KEY1 -F' *,' '$1==key{f=1} $1 && $1!=key{f=0} f{print $3}' file
Output:
TRACKINGNUMBER1-1
TRACKINGNUMBER1-2
TRACKINGNUMBER1-3
TRACKINGNUMBER1-4
TRACKINGNUMBER1-5
That's what I've tried. I'd like to awk so that if I search for KEY1, TRACKINGKEY1 is returned, then awk with TRACKINGKEY1 and output each full matching row.
Sorry, I should have been more clear. For example - if I searched for KEY3 I'd like the output to be:
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2
So what I want is: I'd search for KEY3 initially, and it would return TRACKINGKEY3. I'd then search for TRACKINGKEY3, and it would return each full row containing that TRACKINGKEY3.
Does this do what you want?
awk -v key=KEY3 -F ',' '{if($1==key)tkey=$2;if($2==tkey)print}' file
It only makes a single pass through the file, not the multiple passes you described, but the output matches what you requested. When it finds the specified key in the first column, it grabs the tracking key from the second column. It then prints every line that matches this tracking key.
A shorter way to achieve the same thing is by using awk's implicit printing:
awk -v key=KEY3 -F ',' '$1==key{tkey=$2}$2==tkey' file
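If you would rather keep the two-step flow you described (resolve the key to its tracking key first, then pull every row for that tracking key), it splits naturally into two awk calls. A sketch:
tkey=$(awk -v key=KEY3 -F',' '$1 == key { print $2; exit }' file)
awk -v tkey="$tkey" -F',' '$2 == tkey' file
Unlike the single-pass version, this does not depend on the keyed row appearing before its continuation rows.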