Sorting a CSV file based on two columns in Unix/Linux

I'm a beginner in Unix shell scripting. I'm trying to sort a CSV file based on two columns.
My file looks like this:
sh-4.4$ cat test.csv
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131115
603,02,9876542,2222,201806131215
603,20,9876542,2222,201806131215
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131117
I want the rows grouped by the 3rd column, and within each group the 2nd column should also be ordered, as shown below:
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131117
603,02,0123456,1111,201806131117
603,20,9876542,2222,201806131215
603,02,9876542,2222,201806131215
I tried sort -t',' -k3 -k2 test.csv. This does group by column 3, but it does not sort column 2. Its output looks like this:
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131117
603,02,9876542,2222,201806131215
603,20,9876542,2222,201806131215
I also tried sort -t',' -k3 -rk2 test.csv. This sorts column 2 as I wanted, but column 3 is no longer ordered as I expected. Its output looks like this:
603,20,9876542,2222,201806131215
603,02,9876542,2222,201806131215
603,20,0123456,1111,201806131117
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131115
Any help on this is much appreciated. Suggestions for sorting with awk are also welcome.

Restrict the sorting fields:
$ sort -t, -k3,3 -k2,2 file
should do.
Note however that the output you want doesn't match the spec you describe. You'll get
603,02,0123456,1111,201806131115
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131115
603,20,0123456,1111,201806131117
603,02,9876542,2222,201806131215
603,20,9876542,2222,201806131215
grouped by third field only and sorted by second field.
Perhaps this is what you wanted?
$ sort -t, -k3 -k2,2r file
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131117
603,02,0123456,1111,201806131117
603,20,9876542,2222,201806131215
603,02,9876542,2222,201806131215
Note that -k3 means from the 3rd field to the end of the line, which seems to be what you want based on the order of the last fields. Also, the rows are reordered by the 2nd field in reverse order.
NB: if your numerical fields are not zero-padded, you may want to add the -n option to request numerical ordering instead of lexical ordering. Here it doesn't make a difference.
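As a quick illustration of the difference (lexical ordering compares character by character, so "10" sorts before "9"):
$ printf '9\n10\n' | sort
10
9
$ printf '9\n10\n' | sort -n
9
10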

sort works on CSV and txt files and prints its output to the console.
-t says the columns are delimited by '|'; -k1 -k2 says to sort the data by column 1 and then by column 2.
$ sort -t '|' -k1 -k2 <INPUT_FILE>
To store the result in an output file, use the following command:
$ sort -t '|' -k1 -k2 <INPUT_FILE> -o <OUTPUTFILE>
If you want to sort while ignoring the header line, use the following command:
(head -n1 INPUT_FILE && sort <(tail -n+2 INPUT_FILE)) > OUTPUT_FILE
head -n1 INPUT_FILE prints only the first line of your file (i.e. the header), and the special tail -n+2 syntax gets your file from the second line up to EOF.
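Combining the two, a sketch that keeps the header in place while sorting the rest with the field options from above:
(head -n1 INPUT_FILE && tail -n+2 INPUT_FILE | sort -t '|' -k1 -k2) > OUTPUT_FILE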

Related

Loop through each column in a CSV file and export distinct values to a file

I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:
sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv
This works, but to me it's very manual and not really scalable if there were, say, 100 columns.
The code sorts the file in place on the given column, then the selected column is passed to uniq to get the distinct values, which are then written out.
NB: The first row has the header information.
The above code works, but I'm looking to streamline it somewhat.
Assuming headers can be used as file names for each column:
head -1 test.csv | \
  tr "," "\n" | \
  sed "s/ /_/g" | \
  nl -ba -s$'\t' | \
  while IFS=$'\t' read field name; do
    cut -f$field -d',' test.csv | \
      tail -n +2 | sort -u > "${name}.csv"
  done
Explanation:
head - reads the first line
tr - replaces the , with a newline
sed - replaces whitespace with _ for cleaner file names (tr would also work, and you could then combine it with the previous step, but use sed if you need more complex transforms)
nl - adds the field number
-ba - number all lines
-s$'\t' - set the separator to tab (not necessary, as it's the default, but included for clarity's sake)
while - reads through the field number/name pairs
cut - selects the field
tail - removes the header; not all tails have this option, you can replace it with sed
sort -u - sorts and removes duplicates
>"$name.csv" - saves in the appropriate file name
note: this assumes that there are no , inside the fields, otherwise you will need to use a CSV parser
Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
          { for (i = 1; i <= ncols; ++i)
              if (!seen[i ":" $i]++)
                  print $i >> (cols[i] ".csv") }' CROWN.csv
(The seen[i ":" $i]++ records each value the first time it is printed, so duplicates are skipped; the parentheses around the redirection target are required by some awk implementations.)
If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.
You can inspect this system constant with ulimit -n; on some systems, you can increase it either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
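If you'd rather keep the single-pass awk approach without raising the limit, another workaround (slower, but it holds at most one output file open at a time) is to close each file after writing to it; a sketch:
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
          { for (i = 1; i <= ncols; ++i)
              if (!seen[i ":" $i]++) {
                  f = cols[i] ".csv"
                  print $i >> f   # append, so reopening does not truncate
                  close(f)        # release the file descriptor right away
              } }' CROWN.csv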
Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:
0.001s Baseline test -- simply copy input file to an identical output file
0.242s tripleee -- this single-pass AWK script
0.561s Sorin -- multiple passes using simple shell script
2.154s Mihir -- multiple passes using AWK
Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.
An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

linux sort command with delimiter in data

I need to sort a big CSV file, so using the sort command seems like a good fit.
But I am facing an issue: the delimiter ',' is also present in the data, so sorting on ','-separated fields behaves unexpectedly.
The file contains data like
Ahmedabad ,"7,Olive residency ", 380058
Gandhinagar,"85,Kabir villa",38048
Surat ,Binory Bunglows,589635
And I am using the sort command like this:
sort --field-separator=',' -s -k 3,3 bigfile.csv
which does not give the desired output.
Can anyone help me with this?
sort -k3 -t',' -nr bigfile.csv
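Note, though, that plain sort still splits on every comma, including the ones inside the quotes. If GNU awk is available, its FPAT variable can parse quoted fields; a minimal sketch (assuming the data itself never contains a '|'):
gawk -v FPAT='([^,]+)|("[^"]+")' -v OFS='|' '{ $1 = $1; print }' bigfile.csv |
    sort -t'|' -k3,3n |
    tr '|' ','
Reassigning $1 forces gawk to rebuild each record with '|' between fields, sort then sees unambiguous columns, and tr restores the commas.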

How to filter multiple files and eliminate duplicate entries to select a single entry while using linux shell

I have a folder that contains several files, all with identical columns.
Let us say file1 and file2 have the following contents (there can be more than two files):
$cat file1.txt
9999999999|1200
8888888888|1400
7777777777|1255
6666666666|1788
7777777777|1289
9999999999|1300
$cat file2.txt
9999999999|2500
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
In my files the 1st column is a mobile number and the 2nd is a count. The same mobile number can appear in multiple files. Now I want to get, in one file, the record with the highest count for each unique mobile number.
The output should be as follows:
$cat output.txt
7777777777|1289
8888888888|2450
6666666666|2788
9999999999|3000
2222222222|3001
Any help would be appreciated.
That's probably not very efficient, but it does the job.
Put this into phones.sh and run sh phones.sh:
#!/bin/bash
files="
file1.txt
file2.txt
"
phones=$(cat $files | cut -d'|' -f1 | sort -u)
for phone in $phones; do
    grep -h $phone $files | sort -t'|' -k 2 -nr | head -n1
done | sort -t'|' -k 2
What it does, basically, is extract all the phone numbers from the files, iterate over them, grep each one in all the files, and keep the line with the highest count. I then also sorted the final result by count, which is what your expected output suggests. sort -t'|' -k 2 -nr means sort on the second column, with '|' as the delimiter, in decreasing numerical order. head -n1 selects the first line. You can add more files to the files variable.
Another way of doing this is to use the power of sort and awk:
cat file1.txt file2.txt | sort -t '|' -k1,1 -k2,2nr | awk -F"|" '!_[$1]++' | sort -t '|' -k2,2n
I think the one-liner is pretty self-explanatory, except for the awk part: it keeps only the first occurrence of each value in the first column (a uniq keyed on that column). The last sort just produces the final order you wanted.
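A single-pass awk variant that tracks the maximum count per number directly is also possible; a sketch, assuming the counts are positive integers:
awk -F'|' '$2 > max[$1] { max[$1] = $2 }
           END { for (n in max) print n "|" max[n] }' file1.txt file2.txt |
    sort -t'|' -k2,2n
The trailing sort is needed because for (n in max) yields the keys in no particular order.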

Find common lines between two files

File 1:
6
9219045
71608707
105853666
106000373
106000464
106000814
106001204
106001483
106002054
File 2:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
55555,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQ5C9LG75v8+ENmmteRa/bBHVxAH4ABQAAAFBgXjgKk6KvTg4FiPfWF/7Ittzk/MpmlBecYkc9Bc+3mAV7R58rcl1hGkFdk3MagFXjUsunbE0qcV+Gy+DwhUWpBYDpA3p9q9oO8zwDJfFqCHQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
74292,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQPxjL0KWZoaYxWY7clP57tnVxAH4ABQAAAFB6WiMY05SU2WiYqaC7CzwMP2kQ51ec9mkIPh7R4fz2LPwfT8VNpAwH0QLM3I497D2JLfK13S6S90dxpU1ny2VBwaU4imxVchwo7YrcvwvEZXQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
File 1 has only one column, and I am sorting it with the command sort -n file1.
File 2 has three columns, and I am sorting it with the command sort -t "," -k 1n,1 file2, which sorts on the basis of the 1st column.
Now, I want to find the rows in file2 that start with the numbers listed in file1.
Commands that I have tried:
grep -w -f file1 file2
join -t "," -1 1 -2 1 -o 2.2 file1 file2
But I am not getting the desired results. Please provide me with an alternate approach. File 1 has 7124458 rows and File 2 has 42987432 rows.
Use awk:
awk -F, 'FNR == NR { ++a[$0]; next } $1 in a' file1 file2
Output:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
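For reference, the same command with the idiom spelled out:
awk -F, '
    FNR == NR { ++a[$0]; next }   # 1st file: remember every ID as an array key
    $1 in a                       # 2nd file: print lines whose 1st field was seen
' file1 file2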
join(1) assumes both files are sorted alphabetically on the join fields. Try sorting the inputs without -n.
(To be more precise, it depends on the LC_COLLATE setting. If you are sorting for the benefit of two programs talking to each other, it is probably more reliable to set LC_ALL=C for both join and sort to avoid any glitches due to locale settings.)
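Putting that together with the join attempt from the question, a sketch that keeps the -o 2.2 output specification:
LC_ALL=C sort file1 > file1.sorted
LC_ALL=C sort -t, -k1,1 file2 > file2.sorted
LC_ALL=C join -t, -1 1 -2 1 -o 2.2 file1.sorted file2.sorted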

Issue with unix sort

This is more of a doubt than a question.
So I have an input file like this:
$ cat test
class||sw sw-explr bot|results|id,23,0a522b36-556f-4116-b485-adcf132b6cad,20130325,/html/body/div/div[3]/div[2]/div[2]/div[3]/div/div/div/div/div/div[2]/div/div/ul/li[4]/div/img
class||sw sw-explr bot|results|id,40,30cefa2c-6ebf-485e-b49c-3a612fe3fd73,20130323,/html/body/div/div[3]/div[2]/div[3]/div[3]/div/div/div/div/div[3]/div/div/ul/li[8]/div/img
class||sw sw-explr bot|results|id,3,72805487-72c3-4173-947f-e5abed6ea1e4,20130324,/html/body/div/div[3]/div[2]/div[2]/div[2]/div/div/div/div/div/div[3]/div/div/div[2]/ul/li[20]/div/img
It kind of defines an element in an HTML page.
Consider it as 5 comma-separated columns.
I want to sort this file on the second column, i.e. the one containing 23, 40, 3.
I am not sure why Unix sort isn't working.
These are the commands I tried; surprisingly, none gave me the desired result.
cat test | sort -nt',' -k2
cat test | sort -n -t, -k2
cat test | sort -n -t$',' -k2
cat test | sort -t"," -k2
cat test | sort -n -k2
Is there something about sort that I don't know?
This didn't cause me a problem, as I separated the columns, sorted, then joined them again. But why did sort not work??
NB: if I remove $3 of this file and then sort, it works fine!
This line should work for you:
sort -t, -n -k2,2 test
You don't need cat test | sort, just sort file.
The default end position of -k is the end of the line, so sort -k2 means sort from the 2nd field through the end of the line. What you actually need is to sort on exactly the 2nd field. This also explains why your sort worked once you removed the 3rd column.
Testing with your example:
kent$ sort -t, -n -k2,2 file
class||sw sw-explr bot|results|id,3,72805487-72c3-4173-947f-e5abed6ea1e4,20130324,/html/body/div/div[3]/div[2]/div[2]/div[2]/div/div/div/div/div/div[3]/div/div/div[2]/ul/li[20]/div/img
class||sw sw-explr bot|results|id,23,0a522b36-556f-4116-b485-adcf132b6cad,20130325,/html/body/div/div[3]/div[2]/div[2]/div[3]/div/div/div/div/div/div[2]/div/div/ul/li[4]/div/img
class||sw sw-explr bot|results|id,40,30cefa2c-6ebf-485e-b49c-3a612fe3fd73,20130323,/html/body/div/div[3]/div[2]/div[3]/div[3]/div/div/div/div/div[3]/div/div/ul/li[8]/div/img
Here comes a working solution:
cat test.file | sort -t, -k2n,2
Explanation:
-t, # Set field separator to ','
-k2n,2 # sort by the second column, numerical
