Bash CSV sorting and unique-ing - linux

A Linux question: I have the CSV file data.csv with the following fields and values:
KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
...
The field KEY is a float value, while LEVEL is an integer. Both the first and the second field can contain repeated values, but taken together they form a unique pair.
What I would like to do is sort the file by the KEY column and then, for each unique KEY value, keep only the row with the highest LEVEL.
Sorting is not a problem:
$> sort -t, -k1,2 data.csv # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...
but then how can I filter the rows so that I get what I want, which is:
0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...
Is there a way to do it using command line tools like sort, uniq, awk and so on? Thanks in advance

try this line:
your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
output:
kent$ echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd
The idea is that, because your input is already sorted, you just walk through it from the top: whenever the first column (KEY) changes, print the previous line, which holds the highest LEVEL for the previous KEY.
Try it with your real data; it should work.
The whole logic (including your sort) could also be done by awk in a single process.
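A minimal sketch of that single-process idea, assuming the file starts with a header line and that LEVEL is numeric. The array names max and row are illustrative, not part of the original answer. awk remembers the best row per KEY while reading unsorted input, and a final sort only orders the few surviving rows:

```shell
# Sample input copied from the question (header included).
input='KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt'

# For each KEY keep the row with the largest LEVEL; NR > 1 skips the header.
result=$(printf '%s\n' "$input" |
  awk -F, 'NR > 1 && (!($1 in max) || $2 + 0 > max[$1]) { max[$1] = $2 + 0; row[$1] = $0 }
           END { for (k in row) print row[k] }' |
  LC_ALL=C sort -t, -k1,1n)
printf '%s\n' "$result"
```

The order of for (k in row) is unspecified in awk, which is why the output is piped through sort at the end.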

Use:
$> sort -r data.csv | uniq -w 5 | sort
given your floats are formatted "0.000"-"9.999"
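If the keys are not fixed-width, a variant that does not depend on uniq -w might look like the sketch below. It assumes a header line (dropped with tail) and numeric KEY and LEVEL fields. The seen array is an illustrative name. Sorting LEVEL in descending order within each KEY puts the wanted row first, and awk then keeps only the first row per KEY:

```shell
# Sample input copied from the question (header included).
input='KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt'

# KEY ascending, LEVEL descending, then keep the first line seen for each KEY.
result=$(printf '%s\n' "$input" | tail -n +2 |
  LC_ALL=C sort -t, -k1,1n -k2,2nr |
  awk -F, '!seen[$1]++')
printf '%s\n' "$result"
```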

Perl solution:
perl -aF, -ne '$h{$F[0]} = [@F[1,2]] if $F[1] > $h{$F[0]}[0]
}{
print join ",", $_, @{$h{$_}} for sort {$a<=>$b} keys %h' data.csv
Note that the result is different from the one you requested, the first line contains bbb, not ccc.

Related

insert column with same row content to csv in cli

I am having a csv to which I need to add a new column at the end and add a certain string to all rows of the csv in the newly added column.
Example csv:
os,num1,alpha1
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
I tried using awk command and did not get expected result. I am not sure whether indexing or counting column number is right.
awk -F'[[:null:]]' '$2 && !$1{ $4="NA" }1'
Expected result is:
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA
You can use sed:
sed 's/$/,NA/' db1.csv > db2.csv
then edit the first line containing the column titles.
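The manual header edit can probably be folded into the same sed call by addressing line 1 separately. This is a sketch using the header name code and fill value NA from the question's expected output, with the sample rows inlined instead of read from db1.csv:

```shell
# Sample rows taken from the question's example CSV.
csv='os,num1,alpha1
Unix,10,A
Linux,30,B'

# Line 1 gets the new header name; every later line gets the fill value.
result=$(printf '%s\n' "$csv" | sed -e '1s/$/,code/' -e '2,$s/$/,NA/')
printf '%s\n' "$result"
```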
I'm not quite sure how you came up w/ that awk statement of yours, why you'd think that your file has NUL-terminated lines or that [[:null:]] has become a valid character class ...
The following, however, will do your bidding:
awk 'NR==1{print $0",code"}; NR>1{print $0",NA"}' example.csv
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA
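If the column name and the fill value should not be hard-coded, one option is to pass them in with awk -v. The function name add_column and the variable names header and value below are hypothetical, chosen only for this sketch:

```shell
# add_column HEADER VALUE: append one column to CSV read from stdin.
# (add_column, header and value are illustrative names, not from the answers.)
add_column() {
  awk -v header="$1" -v value="$2" '{ print $0 "," (NR == 1 ? header : value) }'
}

result=$(printf '%s\n' 'os,num1,alpha1' 'Unix,10,A' 'Linux,30,B' | add_column code NA)
printf '%s\n' "$result"
```

Note that awk interprets backslash escapes in -v assignments, so values containing backslashes would need extra care.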

Sort data file in bash

I am using sort in a bash script to order a file generated. An example of an input file is :
2,0,2165,5
2,-10,2122,5
2,10,2830,6
2,-11,2121,5
2,11,2903,6
2,-1,2151,5
2,1,2171,5
2,-12,2114,5
2,-13,2118,5
2,-14,2121,5
2,-2,2144,5
2,2,2199,5
I need sorting on the first number and then the second, I tried the following:
sort -k1,1n -k2,2n data
The positive numbers are ordered as required, but the negative ones are dictionary ordered:
2,-10,2122,5
2,-11,2121,5
2,-1,2151,5
2,-12,2114,5
2,-13,2118,5
2,-14,2121,5
2,-2,2144,5
2,0,2165,5
2,1,2171,5
2,2,2199,5
2,10,2830,6
2,11,2903,6
Could anyone help with this one?
sort -t, -k1,1n -k2,2n nums
2,-14,2121,5
2,-13,2118,5
2,-12,2114,5
2,-11,2121,5
2,-10,2122,5
2,-2,2144,5
2,-1,2151,5
2,0,2165,5
2,1,2171,5
2,2,2199,5
2,10,2830,6
2,11,2903,6
You need to tell sort the delimiter, and it works here.
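A likely explanation, stated as an assumption: without -t, sort splits fields on blanks, so the whole line is field 1 and field 2 is empty. The numeric keys then all compare equal and sort falls back to comparing whole lines as text, which would give the dictionary-like order seen in the question. The sketch below runs the fixed command on a subset of the question's data, with LC_ALL=C set so the result does not depend on the locale:

```shell
# A subset of the question's input.
data='2,0,2165,5
2,-10,2122,5
2,10,2830,6
2,-1,2151,5
2,1,2171,5
2,-2,2144,5'

# With -t, each key is one comma-separated field, so -n compares -10, -2, -1, ...
sorted=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -k1,1n -k2,2n)
printf '%s\n' "$sorted"
```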

How can I sort the directory content by owner and group rights in the terminal?

Owner rights have priority over group rights during sorting.
I thought of something like ls -la | sort -n
The first letter, though, which shows the type of the file, gets in the way and is counted as well.
How can I start sorting from the 2nd column, where the owner rights start? (Not the 2nd field, the 2nd terminal column.)
If this is not possible, is there any other solution to my problem?
Pipe the output of ls through cut?
ls -lA | cut -c2- | sort
cut -c2- keeps every character from the second terminal column onward, dropping the file-type letter, and the output is then piped to sort.
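An alternative that keeps each line intact, including the type letter, is to tell sort to start its key at the second character of the first field with -k1.2,1. The sketch below uses a made-up listing instead of real ls output so the result is reproducible. The file names, owner u and group g are invented for the example:

```shell
# Hypothetical ls -l style lines (names, owner and group are made up).
listing='drwxr-xr-x 2 u g 4096 Jan 1 00:00 dir
-rw-r--r-- 1 u g 10 Jan 1 00:00 a
-rwxr-x--- 1 u g 10 Jan 1 00:00 b
lrwxrwxrwx 1 u g 1 Jan 1 00:00 c'

# Key = field 1 from its 2nd character to its end, i.e. the rwx bits only.
sorted=$(printf '%s\n' "$listing" | LC_ALL=C sort -k1.2,1)
names=$(printf '%s\n' "$sorted" | awk '{ print $NF }')
printf '%s\n' "$sorted"
```

With real data this would be ls -lA | sort -k1.2,1. The "total" line that ls prints first would still need to be filtered out.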
You can use this awk program (gawk only, because it relies on asorti) to do it:
NR==2,true {
    arr[substr($0, 2)] = $0
}
END {
    n = asorti(arr, sorted)
    for (i = 1; i <= n; i++)
        print arr[sorted[i]]
}
Here is how to run it: ls -l | awk -f prog.awk
What it does:
NR==2,true {
takes every line from the second one to the last one (to skip the first line of ls -l output, the "total" line). The range never ends because true is an unset variable, which evaluates to false.
arr[substr($0, 2)] = $0
takes the whole line and saves it in an associative array called arr, under an index that is the same line without its first character
END {
after reading all lines
n = asorti(arr, sorted)
sorts the array by its indices (a GNU extension) and stores the result in sorted. The sorted array is indexed by position (1 to n) and its values are the original indices (the lines without the first character)
for (i = 1; i <= n; i++)
print arr[sorted[i]]
walks the sorted indices in order and retrieves the original lines from the initial array arr. A plain for (i in sorted) would not guarantee that order.
The advantage of this method is that it saves all the information about the entries (including whether it's a link or directory or something else).

get 1 line with the same field in a file using shell script

I have a file, with contents like:
onelab2.warsaw.rd.tp.pl 5
onelab3.warsaw.rd.tp.pl 5
lefthand.eecs.harvard.edu 7
righthand.eecs.harvard.edu 7
planetlab2.netlab.uky.edu 8
planet1.scs.cs.nyu.edu 9
planetx.scs.cs.nyu.edu 9
So each line ends with a number. I want the first line for each number, so for
the content above I want to get:
onelab2.warsaw.rd.tp.pl 5
lefthand.eecs.harvard.edu 7
planetlab2.netlab.uky.edu 8
planet1.scs.cs.nyu.edu 9
How can I achieve this? I hope for shell scripts, with awk, sed, etc.
This might work for you (GNU sort):
sort -nsuk2 file
This sorts on the second field (-k2) numerically (-n), keeps the original order of lines with equal keys (-s), and removes duplicates (-u).
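For readability, the bundled options can be spelled out one by one. The sketch below runs the expanded form on the question's data. It assumes GNU sort, whose -u is documented to output the first line of each run of equal keys. Other sort implementations may pick a different line. Writing -k2,2 limits the key to the second field only:

```shell
# Input copied from the question.
hosts='onelab2.warsaw.rd.tp.pl 5
onelab3.warsaw.rd.tp.pl 5
lefthand.eecs.harvard.edu 7
righthand.eecs.harvard.edu 7
planetlab2.netlab.uky.edu 8
planet1.scs.cs.nyu.edu 9
planetx.scs.cs.nyu.edu 9'

# Same options as -nsuk2, spelled out (GNU sort assumed).
result=$(printf '%s\n' "$hosts" | LC_ALL=C sort -s -u -n -k2,2)
printf '%s\n' "$result"
```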
Use the awk command for that:
awk '{if(!a[$2]){a[$2]=1; print}}' file.dat
Explanation:
{
    # 'a' is a lookup table (array) holding every number that has
    # been printed so far. awk creates it as an empty array on first
    # use, so there is nothing to initialize.
    # $2 is the second 'column' in the line -> the number
    if (!a[$2])
    {
        # mark the number as seen, so the if statement fails
        # for any later line with the same number
        a[$2] = 1;
        # print the whole current line
        print
    }
}
With sort and uniq:
sort -n -k2 input | uniq -f1
perl -ane 'print unless $a{$F[1]}++' file

Comparing two files using awk and printing contains which are matching from other files

I have two files:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
file2.txt
919167000000
919594000000
Output
919167000000,hutch,mumbai
919594000000,idea,mumbai
How can I achieve this using AWK? I've got a huge file of phone numbers which needs to be compared like this. I believe Awk can handle it; if not please let me know how can I do this.
Extra definitions
Is the common part always a 6-digit number? Yes always 6.
Are the two files already sorted? file1 is not sorted. file2 can be sorted.
Are the trailing digits in file 2 always zeros? No, these are phone numbers this can vary, purpose of this is to get series information of the phone number.
Is there any danger of file 1 containing three records for a given number while file 2 contains 2 records, or is it one-to-one? It's one-to-one.
Can there be records in file 1 with no match in file 2, or vice versa? Yes.
If so, do you want to see the unmatched records? Yes I want both records.
Extended data
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
918888,airtel,karnataka
file2.txt
919167838888
919594998484
919212334323
Output Expected:
919167838888,hutch,mumbai
919594998484,idea,mumbai
919212334323,nomatch,nomatch
As I noted in a comment, there's a lot of unstated information needed to give a definitive answer. However, we can make some plausible guesses:
The common number is the first 6 digits of file 2 (we don't care about the trailing digits, but will simply copy them to the output).
The files are sorted in order.
If there are unmatched records in either file, those records will be ignored.
The tools of choice are probably sed and join:
sed 's/^\([0-9]\{6\}\)/\1,\1/' file2.txt |
join -t, -o 1.2,2.2,2.3 - file1.txt
This edits file2.txt to create a comma-separated first field with the 6-digit phone number followed by all the rest of the line. The input is fed to the join command, which joins on the first column, and outputs the 'rest of the line' (column 2) from file2.txt and columns 2 and 3 from file1.txt.
If the phone numbers are variable length, then the matching operation is horribly complex. For that, I'd drop into Perl (or Python) to do the work. If the data is unsorted, it can be sorted before being fed into the commands. If you want unmatched records, you can specify how to handle those in the options to join.
The extra information needed is now available. The key information is the 6-digits is fixed — phew! Since you're on Linux, I'm assuming bash is available with 'process substitution':
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -a 2 -e 'no-match' - <(sort file1.txt)
If process substitution is not available, simply sort file1.txt in situ:
sort -o file1.txt file1.txt
Then use file1.txt in place of <(sort file1.txt).
I think the comment might be asking for inputs such as:
file1.txt
919167,hutch,mumbai
919594,idea,mumbai
902130,airtel,karnataka
file2.txt
919167000000
919594000000
919342313242
Output
no-match,airtel,karnataka
919167000000,hutch,mumbai
919342313242,no-match,no-match
919594000000,idea,mumbai
If that's not what the comment is about, please clarify by editing the question to add the extra data and output in a more readable format than comments allow.
Working with the extended data, this mildly modified command:
sort file2.txt |
sed 's/^\([0-9]\{6\}\)/\1,\1/' |
join -t, -o 1.2,2.2,2.3 -a 1 -e 'no-match' - <(sort file1.txt)
produces the output:
919167838888,hutch,mumbai
919212334323,no-match,no-match
919594998484,idea,mumbai
which looks rather like a sorted version of the desired output. The -a n options control whether the unmatched records from file 1 or file 2 (or both) are printed; the -e option controls the value printed for the unmatched fields. All of this is readily available from the man pages for join, of course.
Here's one way using GNU awk. Run like:
awk -f script.awk file2.txt file1.txt
Contents of script.awk:
BEGIN {
    FS=OFS=","
}
FNR==NR {
    sub(/[ \t]+$/, "")
    # awk strings are indexed from 1, so the 6-digit prefix starts at 1
    line = substr($0, 1, 6)
    array[line]=$0
    next
}
{
    # print rather than printf, so the data is never used as a format string
    print (($1 in array) ? $0 : "FILE1 no match --> " $0)
    dup[$1]++
}
END {
    for (i in array) {
        if (!(i in dup)) {
            printf "FILE2 no match --> %s\n", array[i]
        }
    }
}
Alternatively, here's the one-liner:
awk 'BEGIN { FS=OFS="," } FNR==NR { sub(/[ \t]+$/, ""); line = substr($0, 1, 6); array[line]=$0; next } { print (($1 in array) ? $0 : "FILE1 no match --> " $0); dup[$1]++ } END { for (i in array) if (!(i in dup)) printf "FILE2 no match --> %s\n", array[i] }' file2.txt file1.txt
awk -F, 'FNR==NR{a[$1]=$2","$3;next}{for(i in a){if(index($1,i)==1) print $1","a[i]}}' file1.txt file2.txt
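For output in exactly the format the question asks for (one line per number in file2.txt, in file2 order, with nomatch,nomatch for unknown prefixes), a sketch along these lines may be enough. It assumes the prefix is always the first 6 digits, as stated in the question. The names info and prefix and the temporary directory are illustrative. Unmatched file1 records are not reported by this version:

```shell
# Recreate the question's extended data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 919167,hutch,mumbai 919594,idea,mumbai 918888,airtel,karnataka > "$tmp/file1.txt"
printf '%s\n' 919167838888 919594998484 919212334323 > "$tmp/file2.txt"

result=$(awk -F, '
  FNR == NR { info[$1] = $2 "," $3; next }   # file1: 6-digit prefix -> rest of the record
  {
    prefix = substr($1, 1, 6)                # first 6 digits of the phone number
    print $1 "," ((prefix in info) ? info[prefix] : "nomatch,nomatch")
  }' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

A hash lookup per line avoids both the sorting that join requires and the loop over every prefix for every phone number.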
