Linux sort command with delimiter in the data

I need to sort a big CSV file, so the sort command seemed like a good fit.
But I am facing an issue: the delimiter ',' is also present inside the data,
so sorting on fields separated by ',' behaves unexpectedly.
The file contains data like
Ahmedabad ,"7,Olive residency ", 380058
Gandhinagar,"85,Kabir villa",38048
Surat ,Binory Bunglows,589635
And I am using the sort command like this:
sort --field-separator=',' -s -k 3,3 bigfile.csv
This does not give the desired output.
Can anyone help me with this?

sort -k3 -t',' -nr bigfile.csv
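That still treats every ',' as a separator, though, so the quoted commas in the data keep shifting the fields. If the quoting has to be respected, plain sort cannot parse it; one workaround (just a sketch, assuming GNU awk for its FPAT variable and no tab characters in the data) is to prepend the real third field as a tab-separated key, sort on that, and strip the key off again:

# decorate-sort-undecorate: pull out the real 3rd field with CSV-aware splitting,
# sort numerically on it, then drop the helper column again
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $3 "\t" $0 }' bigfile.csv |
  sort -t $'\t' -k1,1n |
  cut -f2-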

Related

Loop through each column in a CSV file and exporting distinct values to a file

I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:
sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv
This works, but to me it is very manual and not really scalable if there were, say, 100 columns.
The code sorts the file in place on the given column, then that column is cut out and passed to uniq to get the distinct values, which are written to a file.
NB: The first row has the header information.
The above code works, but I'm looking to streamline it somewhat.
Assuming headers can be used as file names for each column:
head -1 test.csv | \
  tr "," "\n" | \
  sed "s/ /_/g" | \
  nl -ba -s$'\t' | \
  while IFS=$'\t' read field name; do
    cut -f$field -d',' test.csv | \
      tail -n +2 | sort -u > "${name}.csv"
  done
Explanation:
head - reads the first line
tr - replaces the , with a new line
sed - replaces whitespace with _ for cleaner file names (tr would also work, and you could combine it with the previous step, but for more complex transforms use sed)
nl - adds the field number
-ba - number all lines
-s$'\t' - set the separator to tab (not necessary, as it is the default, but for clarity's sake)
while - reads through the field numbers/names
cut - selects the field
tail - removes the header; not all tails have this option, you can replace it with sed
sort -u - sorts and removes duplicates
>"$name.csv" - saves in the appropriate file name
note: this assumes that there are no , in the fields, otherwise you will need to use a CSV parser
Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
          { for (i = 1; i <= ncols; ++i)
                if (!seen[i ":" $i]++)
                    print $i >> (cols[i] ".csv") }' CROWN.csv
If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.
You can inspect this system constant with ulimit -n; on some systems, you can increase it either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
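If the descriptor limit is the only worry, another option (just a sketch along the same lines, not benchmarked) is to close each output file right after appending to it, trading extra open/close calls for a small number of simultaneously open files:

awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
          { for (i = 1; i <= ncols; ++i)
              if (!seen[i ":" $i]++) {
                  fname = cols[i] ".csv"   # one output file per column header
                  print $i >> fname        # append the newly seen distinct value
                  close(fname)             # release the file descriptor immediately
              } }' CROWN.csv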
Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:
0.001s Baseline test -- simply copy input file to an identical output file
0.242s tripleee -- this single-pass AWK script
0.561s Sorin -- multiple passes using simple shell script
2.154s Mihir -- multiple passes using AWK
Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.
An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

Sorting csv file based on two columns in unix

I'm a beginner in unix shell scripting. I'm trying to sort a csv file based on two columns.
My file looks like below:
sh-4.4$ cat test.csv
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131115
603,02,9876542,2222,201806131215
603,20,9876542,2222,201806131215
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131117
I want the output grouped by the 3rd column, and the 2nd column should also be ordered, as shown below:
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131117
603,02,0123456,1111,201806131117
603,20,9876542,2222,201806131215
603,02,9876542,2222,201806131215
I tried sort -t',' -k3 -k2 test.csv. This does group by column 3, but it does not sort column 2. Its output looks like this:
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131117
603,02,9876542,2222,201806131215
603,20,9876542,2222,201806131215
I also tried sort -t',' -k3 -rk2 test.csv. This sorts column 2 as I wanted, but column 3 is not sorted as I expected. Its output looks like this:
603,20,9876542,2222,201806131215
603,02,9876542,2222,201806131215
603,20,0123456,1111,201806131117
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131115
Any help on this is much appreciated. Suggestions to sort using awk are also welcome.
restrict the sorting fields
$ sort -t, -k3,3 -k2,2 file
should do.
Note however that the output you want doesn't match the spec you describe. You'll get
603,02,0123456,1111,201806131115
603,02,0123456,1111,201806131117
603,20,0123456,1111,201806131115
603,20,0123456,1111,201806131117
603,02,9876542,2222,201806131215
603,20,9876542,2222,201806131215
grouped by third field only and sorted by second field.
Perhaps this is what you wanted?
$ sort -t, -k3 -k2,2r file
603,20,0123456,1111,201806131115
603,02,0123456,1111,201806131115
603,20,0123456,1111,201806131117
603,02,0123456,1111,201806131117
603,20,9876542,2222,201806131215
603,02,9876542,2222,201806131215
Note that -k3 means the key starts at the 3rd field and runs to the end of the line, which seems to be what you want based on the order of the last fields. Also, you want to reorder the rows based on the 2nd field in reverse order.
NB: If your numerical fields are not zero-padded, you may want to add the -n option to request numerical ordering instead of lexicographic ordering. Here it doesn't make a difference.
sort works on CSV and txt files and prints the output to the console.
-t says the columns are delimited by '|'; -k1 -k2 says it will sort the data by column 1 and then by column 2.
$ sort -t '|' -k1 -k2 <INPUT_FILE>
To store the result in an output file, use the following command:
$ sort -t '|' -k1 -k2 <INPUT_FILE> -o <OUTPUTFILE>
If you want to do it while ignoring the header line, then use the following command:
(head -n1 INPUT_FILE && sort <(tail -n+2 INPUT_FILE)) > OUTPUT_FILE
head -n1 INPUT_FILE prints only the first line of your file, i.e. the header, and the special tail -n+2 syntax reads your file from the second line up to EOF.
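For example, with the sort keys discussed above folded in (a sketch; the generic INPUT_FILE/OUTPUT_FILE placeholders are kept, since the sample test.csv shown earlier has no header row):

(head -n1 INPUT_FILE && sort -t, -k3,3 -k2,2r <(tail -n +2 INPUT_FILE)) > OUTPUT_FILE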

Sort number values separated by a dot or any other separator character - sort version values in RHEL5

Linux RHEL5 machine
How can I sort the following input to get 1.0.0.1019 into a latest variable? I tried -t, -k and -n, but it didn't help, or maybe I'm missing something.
$ echo '1.0.0
1.0.0.1018
1.0.0.1019
1.0.0.1019
1.0.0.7' | sort -u
Could you please try the following and let me know if this helps (tested with GNU sort):
echo "1.0.0
1.0.0.1018
1.0.0.1019
1.0.0.1019
1.0.0.7" | sort --version-sort --field-separator=. --key=4 -r
The above will put 1.0.0.1019 in first place (the latest one); in case you want it in last place, remove the -r from the code above.
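To actually land the newest version in a variable, as the question asks, one possibility (a sketch, assuming the versions sit in a file, here called versions.txt purely for illustration, and a GNU sort new enough to know --version-sort):

latest=$(sort --version-sort versions.txt | tail -n 1)
echo "$latest"    # prints 1.0.0.1019 for the sample input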
sort -n -t. -k1,4
Sort the input numerically.
Fields are separated by '.'
Only use the first four fields, in that order.

Merge two CSVs while resolving duplicates

I have a couple of CSVs that I need to merge. I want to consider entries which have the same first and second columns as duplicates.
I know the command for this is something like
sort -t"," -u -k 1,1 -k 2,2 file1 file2
I additionally want to resolve the duplicates in such a way that the entry from the second file is chosen every time. What is the way to do that?
If the suggestion to reverse the order of the files passed to the sort command doesn't work (see the other answer), another way to do this would be to concatenate the files, file2 first, and then sort them with the -s switch.
cat file2 file1 | sort -t"," -u -k 1,1 -k 2,2 -s
-s forces a stable sort, meaning that identical lines will appear in the same relative order. Since the input to sort has all of the lines from file2 before file1, all of the duplicates in the output should come from file2.
The sort man page doesn't explicitly state that input files will be read in the order that they're supplied on the command line, so I guess it's possible that an implementation could read the files in reverse order, or alternating lines, or whatever. But if you concatenate the files first then there's no ambiguity.
Changing the order of the two files and adding -s (as @Jim Mischel hinted) will solve your problem.
sort -t"," -u -k 1,1 -k 2,2 -s file2 file1
man sort
-u, --unique
with -c, check for strict ordering; without -c, output only the
first of an equal run
-s, --stable
stabilize sort by disabling last-resort comparison
Short answer
awk -F"," '{out[$1$2]=$0} END {for(i in out) {print out[i]}}' file1 file2
A bit long answer:
awk 'BEGIN {
    FS = OFS = ","        # set "," as the field separator
}
{
    out[$1 FS $2] = $0    # save the line keyed by fields 1 and 2; a newer value replaces an older one
}
END {
    for (i in out) {      # in the end, print every value of the dict
        print out[i]
    }
}' file1 file2
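One caveat: for (i in out) walks the array in an unspecified order, so the merged lines can come out shuffled. If the original line order matters, a variant along the same lines (again just a sketch) can record the order in which each key was first seen:

awk 'BEGIN { FS = OFS = "," }
{
    k = $1 FS $2              # key on the first two fields
    if (!(k in out))          # remember where each new key first appeared
        order[++n] = k
    out[k] = $0               # later files overwrite earlier ones, so file2 wins
}
END {
    for (i = 1; i <= n; i++)  # replay the keys in first-seen order
        print out[order[i]]
}' file1 file2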

Recursively grep unique pattern in different files

Sorry, the title is not very clear.
So let's say I'm grepping recursively for URLs like this:
grep -ERo '(http|https)://[^/"]+' /folder
and in the folder there are several files containing the same URL. My goal is to output this URL only once. I tried piping the grep to | uniq or sort -u, but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
If you only want the address and not the file it was found in, grep has an option -h to suppress the file names in the output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl-compatible regular expressions (-P instead of -E) with \K, which acts like a variable-length look-behind:
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the cut command:
cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ':') and you select the 2nd and following fields (-f 2-).
After that, you can use sort -u (plain uniq only removes adjacent duplicates) to filter out the repeats.
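Put together, that gives something like this (sort -u rather than bare uniq, since uniq only drops adjacent duplicates):

grep -ERo 'https?://[^/"]+' /folder | cut -d ':' -f 2- | sort -u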
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0, length($1) + 2)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
Doing the whole thing in Awk should not be too hard, either.
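For the record, here is a sketch of that all-Awk version (well, almost: plain Awk does not recurse into directories, so find supplies the file list, and the URL regex is the same rough approximation used above):

find /folder -type f -exec awk -v re='https?://[^/"]+' '
{
    s = $0
    while (match(s, re)) {                  # pull out every URL on the line
        u = substr(s, RSTART, RLENGTH)
        if (!seen[u]++)                     # print each URL only once
            print u
        s = substr(s, RSTART + RLENGTH)     # keep scanning the rest of the line
    }
}' {} +

(With a very large tree, find may start awk more than once, in which case the deduplication is only per batch; piping the result through sort -u would cover that.)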
