Sorting large files by [M]M/[D]D/YYYY - Linux

I have these large tab-delimited text files that I want to sort by the date field (the 17th field). The issue is that the dates come in the format [M]M/[D]D/YYYY, meaning there are no leading zeros, so dates can look like:
3/3/2013
4/17/2014
12/4/2013
Is it possible to use the sort command to do this? I haven't been able to find an example that takes into account the missing leading zeros.
As a note, I've tried recalculating the date field to be days from a certain date and then sorting on that. This works, but the read/write necessary for this extra step takes a very long time.

If the date is at the start of the line:
sort -n -t/ -k3,3 -k1,1 -k2,2
If your version of sort supports it, use the --debug option to see exactly which part of each line is used for each key.
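For instance, applied to the sample dates above (assuming each line begins with the date):
printf '3/3/2013\n4/17/2014\n12/4/2013\n' | sort -n -t/ -k3,3 -k1,1 -k2,2
The slashes act as the field separator, so the keys are year, month, and day, each compared numerically; the output is 3/3/2013, 12/4/2013, 4/17/2014.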

The following prefixes each line with a YYYYMMDD key before passing it to sort, then removes the added characters:
<file.in perl -pe'
    $_ = (
        # Field 17 (tab-delimited) holds an [M]M/[D]D/YYYY date.
        m{^(?:[^\t]*\t){16}(\d+)/(\d+)/(\d+)\t}
            ? sprintf("%04d%02d%02d", $3, $1, $2)  # build a YYYYMMDD sort key
            : " " x 8                              # pad lines without a date
    ) . $_;
' | sort | cut -b 9- >file.out
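A roughly equivalent decorate/sort/undecorate pass can be written with awk (a sketch, assuming as above that the 17th tab-separated field holds the [M]M/[D]D/YYYY date):
awk -F'\t' '{ split($17, d, "/"); printf "%04d%02d%02d\t%s\n", d[3], d[1], d[2], $0 }' file.in |
sort -k1,1 | cut -f2- > file.out
Because the YYYYMMDD key has a fixed width, a plain lexical sort on it orders the dates correctly.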

Related

How can I sort a file with a specific string as the sorting key in bash?

So I have an events.csv file with:
Date as the key, in the format DDMMYYHHMM
Name of the event
Additional (optional) info
And I want to sort them according to the current date.
The next step for me was:
cur_date=$(date +%d%m%y | sed s/0/\ /)
cat "events.csv" | sort -rk $cur_date
To get rid of the leading 0s.
The output is sorted incorrectly.
Should I sort first the year, then the month, then the day?
Any ideas?
When you say sort by current date, I am assuming you mean you want to extract all entries for today and sort those into chronological order. If that is true, the following should work for you:
grep "$(date +%d%m%y)" file | sort
It will extract all entries from file that have today's date (in the DDMMYY form used by the key). Since all of these entries have the same date, the sort will then take care of ordering by the time portion automatically.
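If instead you want to sort the entire file chronologically, you can sort on character positions within the DDMMYYHHMM key itself instead of stripping zeros (a sketch, assuming the date is the first comma-separated field):
sort -t, -k1.5,1.6n -k1.3,1.4n -k1.1,1.2n -k1.7,1.10n events.csv
This compares the year (characters 5-6), then the month (3-4), then the day (1-2), then the time (7-10), so leading zeros are never a problem.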
Hope this helps.

Differences between Unix commands for Sorting CSV

What's the difference between:
!tail -n +2 hits.csv | sort -k 1n -o output.csv
and
!tail -n +2 hits.csv | sort -t "," -k1 -n -k2 > output.csv
?
I'm trying to sort a csv file by first column first, then by the second column, so that lines with the same first column are still together.
It seems like the first one already does that correctly, by first sorting by the field before the first comma, then by the field following the first comma (breaking ties, that is).
Or does it not actually do that?
And what does the second command do/mean? (And what's the difference between the two?) There is a significant difference between the two output.csv files when I run them.
And, finally, which one should I use? (Or are they both wrong?)
See also the answer by @morido for some other pointers, but here's a description of exactly what those two sort invocations do:
sort -k 1n -o output.csv
This assumes that the "fields" in your file are delimited by a transition from non-whitespace to whitespace (i.e., leading whitespace is included in each field rather than stripped, as many might expect). It tells sort to order lines by a key that starts with the first field and extends to the end of the line, treating the key as a numeric value. The output is sent explicitly to a specific file.
sort -t "," -k1 -n -k2
This defines the field separator as a comma and then defines two keys to sort on. The first key again starts at the first field and extends to the end of the line, and is compared lexicographically (dictionary order), not numerically. The second key, used when values of the first key are identical, starts at the second field and extends to the end of the line; because of the intervening -n, it is treated as numeric data. However, because your first key essentially spans the entire line, the second key is unlikely ever to be needed (if the first keys of two separate lines are identical, the second keys most likely will be too).
Since you didn't provide sample data, it's unknown whether the data in the first two fields is numeric or not, but I suspect you want something like what was suggested in the answer by @morido:
sort -t, -k1,1 -k2,2
or
sort -t, -k1,1n -k2,2n (alternatively sort -t, -n -k1,1 -k2,2)
if the data is numeric.
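To see the lexical-versus-numeric difference in isolation, here is a hypothetical two-line input:
printf '9,x\n10,y\n' | sort -t, -k1,1    # lexical: 10,y sorts before 9,x
printf '9,x\n10,y\n' | sort -t, -k1,1n   # numeric: 9,x sorts before 10,y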
First off: you want to remove the leading ! from these two commands. In Bash (and probably other shells, since this feature comes from csh), it would otherwise reference the most recent command in your history that started with tail, which does not make sense here.
The main difference between your two versions is that in the first case you are not taking the second column into account.
This is how I would do it:
tail -n +2 hits.csv | sort -t "," -n --key=1,1 --key=2,2 > output.csv
-t specifies the field separator
-n turns on numerical sorting order
--key specifies the fields that should be used for sorting (in order of precedence)
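As a quick sanity check on hypothetical data: rows with equal first fields stay grouped and are then ordered by the second field numerically.
printf '2,10\n1,5\n2,2\n1,30\n' | sort -t, -n -k1,1 -k2,2
The result is:
1,5
1,30
2,2
2,10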

How do I sort input with a variable number of fields by the second-to-last field?

Editor's note: The original title of the question mentioned tabs as the field separators.
In a text such as
500 east 23rd avenue Toronto 2 890 400000 1
900 west yellovillage blvd Mississauga 3 800 600090 3
how would you sort in ascending order of the second to last column?
Editor's note: The OP later provided another sample input line, 500 Jackson Blvd Toronto 3 700 40000 2, which contains only 8 whitespace-separated input fields (compared to the 9 above), revealing the need to deal with a variable number of fields in the input.
Note: There are several potentially separate questions:
Update: Question C was the relevant one.
Question A: As implied by the question's title only: how can you use the tab character (\t) as the field separator?
Question B: How can you sort input by the second-to-last field, without knowing that field's specific index up front, given a fixed number of fields?
Question C: How can you sort input by the second-to-last field, without knowing that field's respective index up front, given a variable number of fields?
Answer to question A:
sort's -t option allows you to specify a field separator.
By default, sort uses any run of line-interior whitespace as the separator.
Assuming Bash, Ksh, or Zsh, you can use an ANSI C-quoted string ($'...') to specify a single tab as the field separator ($'\t'):
sort -t $'\t' -n -k8,8 file # -n sorts numerically; omit for lexical sorting
Answer to question B:
Note: This assumes that all input lines have the same number of fields, and that input comes from file file:
# Determine the index of the next-to-last column, based on the first
# line, using Awk:
nextToLastColNdx=$(head -n 1 file | awk -F '\t' '{ print NF - 1 }')
# Sort numerically by the next-to-last column (omit -n to sort lexically):
sort -t $'\t' -n -k$nextToLastColNdx,$nextToLastColNdx file
Note: To sort by a single field, always specify it as the end field too (e.g., -k8,8), as above, because sort, given only a start field index (e.g., -k8), sorts from the specified field through the remainder of the line.
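The difference is easy to demonstrate with a stable sort on a hypothetical two-line input (-s disables sort's last-resort whole-line comparison, which would otherwise mask it):
printf 'a 1 z\na 1 b\n' | sort -s -k2,2   # both keys are "1": input order is kept
printf 'a 1 z\na 1 b\n' | sort -s -k2     # keys "1 z" vs "1 b": "a 1 b" sorts first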
Answer to question C:
Note: This assumes that input lines may have a variable number of fields, and that on each line it is that line's second-to-last field that should act as the sort field; input comes from file file:
awk '{ printf "%s\t%s\n", $(NF-1), $0 }' file |
sort -n -k1,1 | # omit -n to perform lexical sorting
cut -f2-
The awk command extracts each line's second-to-last field and prepends it to the input line on output, separated by a tab.
The result is sorted by the first field (i.e., each input line's second-to-last field).
Finally, the artificially prepended sort field is removed again, using cut.
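As a quick check with the sample lines from the question (two lines with 9 whitespace-separated fields, the Jackson Blvd line with 8), the prepended keys are 400000, 600090, and 40000 respectively, so the numeric sort produces:
500 Jackson Blvd Toronto 3 700 40000 2
500 east 23rd avenue Toronto 2 890 400000 1
900 west yellovillage blvd Mississauga 3 800 600090 3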
I suggest looking at "man sort".
You will see how to specify a field separator and how to specify the field index that should be used as a key for sorting.
You can use sort -k 2
For example:
echo -e '000 west \n500 east\n500 east\n900 west' | sort -k 2
The result is:
500 east
500 east
900 west
000 west
You can find more information in the man page of sort. Take a look at the end of the man page; just before AUTHOR there is some interesting information :)
Bye

Sort command is not working properly in Unix for sorting a CSV file

I have a CSV file which I need to order based on the timestamp. It is the third column in the CSV, and I am using the command below to sort:
awk 'NR<2{print $_;next}{ print $_ | "sort -t, -k3.8,3.11nr -k3.1,3.3rM -k3.4rd" }'
This command sorts properly when only a single year is present, but for larger data where multiple years are present, it puts the older ones first or somewhere in the middle of the CSV. A sample is below:
data2,Send for Translation To CTM,Dec 30 2013 02:22
data1,Send for Translation To CTM,Dec 30 2013 02:20
data1,Send for Translation To CTM,Sep 30 2014 03:22
data2,Send for Translation To CTM,Oct 30 2014 03:21
I need to arrange the data with the latest timestamp first, so years go down in this order: 2014, 2013, 2012, and so on.
How can I achieve this?
The below should work:
awk 'NR<2{print $0;next}{ print $0 | "sort -t, -k3.8,3.11rn -k3.1,3.3rM -k3.5,3.6rn -k3.12rd" }'
The awk snippet prints the header line as-is and pipes all other lines to the sort command.
The order of the keys is important here :
-k3.8,3.11rn extracts the year part of the column (characters 8 through 11) and reverse-sorts it numerically
-k3.1,3.3rM extracts the first 3 characters of column three (the month name) and reverse-sorts them as months
-k3.5,3.6rn extracts the day and reverse-sorts it numerically; finally, -k3.12rd reverse-sorts the remainder (the time) in dictionary order
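Applied to the four sample rows above (assuming a header line precedes them), this orders the data newest first:
data2,Send for Translation To CTM,Oct 30 2014 03:21
data1,Send for Translation To CTM,Sep 30 2014 03:22
data2,Send for Translation To CTM,Dec 30 2013 02:22
data1,Send for Translation To CTM,Dec 30 2013 02:20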
Try this:
sort -rft',' -k3 merged.csv
I would try sorting by date and then by time:
awk -F, '{print $3,$1,$2}' file.csv | sort -t' ' -k1d -k2d
BTW, it would be great if you could share a small section of your file. :)

How to get CSV dimensions from terminal

Suppose I'm in a folder where ls returns Test.csv. What command do I enter to get the number of rows and columns of Test.csv (a standard comma-separated file)?
Try using awk. It's best suited for well-formatted CSV file manipulation.
awk -F, 'END {printf "Number of Rows : %s\nNumber of Columns = %s\n", NR, NF}' Test.csv
-F, specifies , as the field separator for the CSV file.
At the end of the file traversal, NR holds the number of rows (records) and NF the number of columns (fields) of the last line.
Another quick and dirty approach would be:
# Number of Rows
wc -l < Test.csv
# Number of Columns
head -1 Test.csv | sed 's/,/\t/g' | wc -w
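Note that wc -w miscounts if any header field itself contains spaces. A variant that counts the comma-separated pieces directly (still assuming no quoted commas) is:
head -1 Test.csv | tr ',' '\n' | wc -l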
Although not a native solution using GNU coreutils, it is worth mentioning (since this is one of the top Google results for this kind of question) that xsv provides a command to list the headers of a CSV file; counting those gives the number of columns.
# count rows
xsv count <filename>
# count columns
xsv headers <filename> | wc -l
For big files this is orders of magnitude faster than native solutions with awk and sed.
