Linux sort command: keys with the same start but different length are not sorted in the same order

I am trying to sort two files in order to join them. Some of the keys I am sorting by are very similar and this seems to be causing issues. For example I have two keys which are a1ke and a1k3-b3. I am using the command:
sort -nk1 file.txt -o file.txt
In one file they appear in this order and in the other they appear in reverse. This is causing issues when I try to join the files.
How can I sort these files so they are in the same order?
Thanks

Do not use the -n option, which compares according to string numerical value. From the manual:
-n
Compare according to arithmetic value: an initial numeric string consisting of optional whitespace, an optional '-' sign, and zero or more digits, optionally followed by a decimal point and zero or more digits.
Your keys are strings, not numbers.
Instead, you should just do:
sort -k1 file.txt -o file.txt
Additional info:
You can see that sort considers your keys identical when -n is used by doing a unique sort:
sort -un file
You will see that a1k3-b3 and a1ke are considered equal (and therefore only one is emitted). If instead you do:
sort -u file
The result will contain both a1k3-b3 and a1ke, which is what you want.
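You can demonstrate this with the two keys alone (a minimal sketch; printf stands in for your real files):
printf 'a1ke\na1k3-b3\n' | sort -un
printf 'a1ke\na1k3-b3\n' | sort -u
The first command prints a single line, because both keys have the numeric value 0; the second prints both.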

Related

Sorting numerically if the number is not at the start of a line

I used grep -Eo '[0-9]{1,}kg' *.dat, which picks out the values ending in kg. Now I'm trying to sort them in increasing order. My output from grep is:
blue_whale.dat:240kg
crocodile.dat:5kg
elephant.dat:6kg
giraffe.dat:15kg
hippopotamus.dat:4kg
humpback_whale.dat:5kg
ostrich.dat:1kg
sea_turtle.dat:10kg
I've tried to use sort -n, but the sorting doesn't work.
Edit:
I have a bunch of files with each animal's weight and length. I filtered out the weights of each animal; this part was easy. Then I wanted to order them in increasing order, which I thought was just sort -n.
Edit:
In my directory, I have many dat files.
They contain values like 110000kg 24m,
and I need to order them by increasing weight.
You need to use the command in this manner:
grep -Eo '[0-9]{1,}kg' *.dat | sort -t: -n -k2
Use the "-t" option to specify the colon as field separator.
You can use -r option for decreasing or reverse order.
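With the grep output above, the pipeline should give (the two 5kg lines tie on the key; GNU sort breaks the tie with a last-resort comparison of the whole line):
ostrich.dat:1kg
hippopotamus.dat:4kg
crocodile.dat:5kg
humpback_whale.dat:5kg
elephant.dat:6kg
sea_turtle.dat:10kg
giraffe.dat:15kg
blue_whale.dat:240kg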

How to sort a CSV with quoted fields (that may contain the separator)

In a shell script I'm trying to sort a CSV file. Some fields may contain the separator and are quoted to handle this correctly. Let's say I have a file with:
"2",D,Clair
1,R,Alice
"3","F","Dennis"
2,"P,F",Bob
I want to sort this on the first column, then the third. The result should be:
1,R,Alice
2,"P,F",Bob
"2",D,Clair
"3","F","Dennis"
There may also be escaped double quotes in the fields. In general, the CSV will conform to RFC 4180.
I tried to do this with a sort -t , -k 1,1 -k 3,3 but that doesn't work, because sort isn't aware of the special meaning of quotes in CSV. I couldn't find a way to make sort behave this way. Perhaps I should use another command, but I can't find any.
How to sort my CSV?
I'd use the excellent xsv for the job (selecting columns 1 and 3, as asked):
$ xsv sort --no-headers --select 1,3 input.csv
1,R,Alice
2,"P,F",Bob
2,D,Clair
3,F,Dennis
csvkit can also do it (it generates the a,b,c header row itself when told there is none):
$ csvsort --no-header-row --columns 1,3 input.csv
a,b,c
1,R,Alice
2,"P,F",Bob
2,D,Clair
3,F,Dennis
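If installing extra tools is not an option, Python's standard csv module also understands RFC 4180 quoting. A minimal sketch, sorting on the first and third columns as asked (keys compare as strings here; the lineterminator choice is mine):
python3 -c '
import csv, sys
# Parse, sort on fields 1 and 3 (0-based 0 and 2), re-emit with minimal quoting
rows = sorted(csv.reader(sys.stdin), key=lambda r: (r[0], r[2]))
csv.writer(sys.stdout, lineterminator="\n").writerows(rows)
' < input.csv
Note that the parser strips the quotes, so "2" and 2 compare equal, and the writer re-quotes only the fields that need it.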

Retrieve substring with grep

I've got a question concerning grep.
I have some address data in an asc file as plain text. The first 30 characters are for the name; if the name is shorter than 30 characters, whitespace pads it out to that length. At position 31 is a whitespace separating the name from the next datum, the address. After the address there is also a whitespace and some other data. My plan is to retrieve the address, which starts at index 32 and continues to index 50. Mostly I got either nothing or data beginning at the start of the line. I tried several methods such as
grep -iE '^.{30}' '.{8}$' myfile.asc
or
grep -o -P '^.{31,34}' myfile.asc
I can't search for a certain pattern since every set of data is different except the whitespaces which separate the data. Is it possible to retrieve my substring like that without relying on other methods through a pipe? I prefer to use grep since performance is an issue.
Why don't you use cut instead of grep if you're dealing with fixed positions?
cut -c 32-50 myfile.asc
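If it really has to be grep, GNU grep's -P mode can discard a fixed-width prefix with \K (a sketch, assuming the address occupies columns 32 through 50, i.e. 19 characters):
grep -oP '^.{31}\K.{19}' myfile.asc
But for fixed column positions, cut remains the natural (and fast) tool.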

Grep filtering of the dictionary

I'm having a hard time getting a grasp of grep for a class I am in, and I was hoping someone could help guide me through this assignment. The assignment is as follows:
Using grep, print all 5-letter lower-case words from the Linux dictionary that have a single letter duplicated one time (aabbe or ababe are not valid because both a and b are in the word twice). Next to that, print the duplicated letter followed by the non-duplicated letters in alphabetically ascending order.
The teacher noted that we will need to use several (6) grep statements (piping the results to the next grep) and a sed (stream editor) statement to reformat the final set of words, then pipe them into a read loop where you tear apart the three non-dup letters and sort them.
Sample Output:
aback a bck
abaft a bft
abase a bes
abash a bhs
abask a bks
abate a bet
I haven't figured out how to do more than print the 5-character words:
grep "^.....$" /usr/share/dict/words |
Didn't check it thoroughly, but this might work (appended to your grep above):
tr '[:upper:]' '[:lower:]' | egrep -x '[a-z]{5}' | sed -r 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' | grep " " | egrep -v "(.).*\1"
But do it your own way, because someone might see it here.
All in one sed
sed -n '
# filter 5 letter word
/^[a-zA-Z]\{5\}$/ {
# lower letters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
# filter non single double letter
/\(.\).*\1/ !b
/\(.\).*\(.\).*\1.*\1/ b
/\(.\).*\(.\).*\1.*\2/ b
/\(.\).*\(.\).*\2.*\1/ b
# extract peer and single
s/\(.\)*\(.\)\(.*\)\2\(.*\)/a & \2:\1\3\4/
# sort singles
:sort
s/:\([^a]*\)a\(.*\)$/:\1\2a/
y/abcdefghijklmnopqrstuvwxyz/zabcdefghijklmnopqrstuvwxy/
/^a/ !b sort
# clean and print
s/..//
s/:/ /p
}' YourFile
This is POSIX sed, so use --posix on GNU sed.
The first bit, obviously, is to use grep to get it down to just the words that have a single duplication in them. I will give you some clues on how to do that.
The key is to use backreferences, which allow you to specify that something that matched a previous expression should appear again. So if you write
grep -E "^(.)...\1...\1$"
then you'll get all the words that have the starting letter reappearing in fifth and ninth positions. The point of the brackets is to allow you to refer later to whatever matched the thing in brackets; you do that with a \1 (to match the thing in the first lot of brackets).
You want to say that there should be a duplicate anywhere in the word, which is slightly more complicated, but not much. You want a character in brackets, then any number of characters, then the repeated character (with no ^ or $ specified).
That will also include ones where there are two or more duplicates, so the next stage is to filter them out. You can do that by a grep -v invocation. Once you've got your list of 5-character words that have at least one duplicate, pipe them through a grep -v call that strips out anything with two (or more) duplicates in. That'll have a (.), and another (.), and a \1, and a \2, and these might appear in several different orders.
You'll also need to strip out anything that has a (.) and a \1 and another \1, since that will have a letter with three occurrences.
That should be enough to get you started, at any rate.
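For reference, here is one way those hints can assemble into a pipeline (a sketch, not checked against every dictionary; note that it uses exactly six greps, as the teacher hinted, and the final alphabetical sorting of the three leftover letters is still left to the read loop):
grep -E '^[a-z]{5}$' /usr/share/dict/words \
| grep -E '(.).*\1' \
| grep -Ev '(.).*\1.*\1' \
| grep -Ev '(.).*\1.*(.).*\2' \
| grep -Ev '(.).*(.).*\1.*\2' \
| grep -Ev '(.).*(.).*\2.*\1' \
| sed -E 's/^(.*)(.)(.*)\2(.*)$/& \2 \1\3\4/'
The three -v calls after the triple filter cover the three possible interleavings of two different pairs (xxyy, xyxy, xyyx).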
Your next step should be to find the 5-letter words containing a duplicate letter. To do that, you will need to use back-references. Example:
grep "[a-z]*\([a-z]\)[a-z]*\1[a-z]*"
The \1 picks up the contents of the first parenthesized group and expects to match that group again. In this case, it matches a single letter. See http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples--part-ii/ for more description of this capability.
You will next need to filter out those cases that have either a letter repeated 3 times or a word with 2 letters repeated. You will need to use the same sort of back-reference trick, but you can use grep -v to filter the results.
sed can be used for the final display. Grep will merely allow you to construct the correct lines to consider.
Note that the dictionary contains capital letters and also non-letter characters, plus accented characters used in Southern European languages, say "è".
If you want to distinguish "A" from "a", that's automatic; if on the other hand "A" and "a" count as the same letter, you must use the -i option in ALL grep invocations, to instruct grep to ignore case.
Next, you always want to pass the -E option, to avoid the so-called backslashitis gravis in the regexps that you pass to grep.
Further, if you want to exclude the lines matching a regexp from the output, the correct option is -v.
Finally, if you want to specify many different regexes in a single grep invocation, this is the way (just an example, btw):
grep -E -i -v -e 'regexp_1' -e 'regexp_2' ... -e 'regexp_n'
With the preliminaries behind us, let's move on, using the answer from chiastic-security as a reference to understand the procedure.
There are only these possibilities for a duplicate in a 5-character string:
(.)\1
(.).\1
(.)..\1
(.)...\1
grep -E -i -e 'regexp_1' ...
Now you have all the doubles, but this doesn't exclude triples etc., which are identified by the following patterns (edit: completed the list so it covers every spacing that fits in five characters):
(.)\1\1
(.)\1.\1
(.)\1..\1
(.).\1\1
(.).\1.\1
(.)..\1\1
(.)\1\1\1
(.)\1\1.\1
(.)\1.\1\1
(.).\1\1\1
(.)\1\1\1\1
You want to exclude these patterns, so grep -E -i -v -e 'regexp_1' ...
At this point, you have a list of words with at least one doubled character and no triples etc., and you want to drop the double doubles. These are the regexes that match double doubles; note that the x x y y arrangement needs its own patterns alongside x y x y and x y y x:
(.)\1(.)\2
(.)\1.(.)\2
(.).\1(.)\2
(.)\1(.).\2
(.)(.)\1\2
(.)(.)\2\1
(.).(.)\1\2
(.).(.)\2\1
(.)(.).\1\2
(.)(.).\2\1
(.)(.)\1.\2
(.)(.)\2.\1
and you want to exclude the lines with these patterns, so it's grep -E -i -v ...
A final hint: to play with my answer, copy a few hundred lines of the dictionary into your working directory, head -n 3000 /usr/share/dict/words | tail -n 300 > ./300words, so that you can really understand what you're doing without being overwhelmed by the volume of the output.
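Assembled, the stages look something like this (a sketch over the 300words sample; each -e argument is an independent pattern, so group numbering restarts inside each one, and the quadruple/quintuple patterns are omitted because any of them contains a triple):
grep -E -i -x '[a-z]{5}' 300words \
| grep -E -i -e '(.)\1' -e '(.).\1' -e '(.)..\1' -e '(.)...\1' \
| grep -E -i -v -e '(.)\1\1' -e '(.)\1.\1' -e '(.)\1..\1' -e '(.).\1\1' -e '(.).\1.\1' -e '(.)..\1\1' \
| grep -E -i -v -e '(.)\1(.)\2' -e '(.)\1.(.)\2' -e '(.).\1(.)\2' -e '(.)\1(.).\2' -e '(.)(.)\1\2' -e '(.)(.)\2\1' -e '(.).(.)\1\2' -e '(.).(.)\2\1' -e '(.)(.).\1\2' -e '(.)(.).\2\1' -e '(.)(.)\1.\2' -e '(.)(.)\2.\1'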
And yes, this is not a complete answer, but it is maybe too much, isn't it?

Sorting on the last field of a line

What is the simplest way to sort a list of lines, sorting on the last field of each line? Each line may have a variable number of fields.
Something like
sort -k -1
is what I want, but sort(1) does not take negative numbers to select fields from the end instead of the start.
I'd also like to be able to choose the field delimiter too.
Edit: To add some specificity to the question: The list I want to sort is a list of pathnames. The pathnames may be of arbitrary depth hence the variable number of fields. I want to sort on the filename component.
This additional information may change how one manipulates the line to extract the last field (basename(1) may be used), but does not change sorting requirements.
e.g.
/a/b/c/10-foo
/a/b/c/20-bar
/a/b/c/50-baz
/a/d/30-bob
/a/e/f/g/h/01-do-this-first
/a/e/f/g/h/99-local
I want this list sorted on the filenames, which all start with numbers indicating the order the files should be read.
I've added my answer below which is how I am currently doing it. I had hoped there was a simpler way - maybe a different sort utility - perhaps without needing to manipulate the data.
awk '{print $NF,$0}' file | sort | cut -f2- -d' '
Basically, this command does the following:
Repeat the last field at the beginning, separated by a space (the default OFS)
Sort; duplicated filenames are resolved using the full path ($0) that follows
Cut the repeated first field; f2- means from the second field to the last
Here's a Perl command line (note that your shell may require you to escape the $s):
perl -e "print sort {(split '/', $a)[-1] <=> (split '/', $b)[-1]} <>"
Just pipe the list into it or, if the list is in a file, put the filename at the end of the command line.
Note that this script does not actually change the data, so you don't have to be careful about what delimiter you use. (The <=> operator compares numerically, which suits the numeric filename prefixes here; for general strings you would use cmp instead.)
Here's sample output:
>perl -e "print sort {(split '/', $a)[-1] <=> (split '/', $b)[-1]} " files.txt
/a/e/f/g/h/01-do-this-first
/a/b/c/10-foo
/a/b/c/20-bar
/a/d/30-bob
/a/b/c/50-baz
/a/e/f/g/h/99-local
Something like this:
awk '{print $NF"|"$0}' file | sort -t"|" -k1,1 | awk -F"|" '{print $NF}'
A one-liner in perl for reversing the order of the fields in a line:
perl -lne 'print join " ", reverse split / /'
You could use it once, pipe the output to sort, then pipe it back and you'd achieve what you want. You can change / / to / +/ so it squeezes spaces. And you're of course free to use whatever regular expression you want to split the lines.
I think the only solution would be to use awk; a sketch follows the steps:
Put the last field to the front using awk.
Sort lines.
Put the first field to the end again.
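A sketch of those three steps on the path list (assuming the paths contain no spaces; paths.txt is a placeholder name):
# 1. copy the last /-separated field to the front; 2. sort on it; 3. drop the copy
awk -F/ '{print $NF" "$0}' paths.txt | sort | cut -d' ' -f2-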
Replace the last delimiter on the line with another delimiter that does not otherwise appear in the list, sort on the second field using that other delimiter as the sort(1) delimiter, and then revert the delimiter change.
delim=/
new_delim=" "
cat $list \
| sed "s|\(.*\)$delim|\1$new_delim|" \
| sort -t"$new_delim" -k 2,2 \
| sed "s|$new_delim|$delim|"
The problem is knowing what delimiter to use that does not appear in the list. You can make multiple passes over the list and then grep for a succession of potential delimiters, but it's all rather nasty - particularly when the concept of "sort on the last field of a line" is so simply expressed, yet the solution is not.
Edit: One safe delimiter to use for $new_delim is NUL, since that cannot appear in filenames, but I don't know how to put a NUL character into a Bourne/POSIX shell script (not bash), nor whether sort and sed would handle it properly.
#!/usr/bin/ruby
# Read everything, split each line on the delimiter,
# sort by the last field, then reassemble the lines.
f = ARGF.read
lines = f.lines
broken = lines.map {|l| l.split(/:/) }   # ":" is the field delimiter here
sorted = broken.sort {|a, b|
a[-1] <=> b[-1]                          # compare on the last field only
}
fixed = sorted.map {|s| s.join(":") }
puts fixed
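ARGF reads the files named on the command line, or standard input if there are none, so both of these work (sort_last.rb is a hypothetical name for the script above):
ruby sort_last.rb file.txt
cat file.txt | ruby sort_last.rb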
If all the answers involve perl or awk, might as well solve the whole thing in the scripting language. (Incidentally, I tried in Perl first and quickly remembered that I dislike Perl's lists-of-lists. I'd love to see a Perl guru's version.)
I want this list sorted on the filenames, which all start with numbers
indicating the order the files should be read.
find . | sed 's#.*/##' | sort
The sed strips everything up to the last slash; the filenames are what's left, and you sort on those. (Note that this outputs only the filenames, not the full paths.)
Here is a Python one-liner version; note that it assumes the last (space-separated) field is an integer, but you can change that as needed.
python3 -c 'import sys; list(map(sys.stdout.write, sorted(sys.stdin, key=lambda x: int(x.rsplit(" ", 1)[-1]))))' < file.txt
| sed "s#(.*)/#\1"\\$'\x7F'\# \
| sort -t\\$'\x7F' -k2,2 \
| sed s\#\\$'\x7F'"#/#"
Still way worse than simple negative field indexes for sort(1), but using the DEL character as the delimiter shouldn't cause any problems in this case.
I also like how symmetrical it is.
sort allows you to specify the delimiter with the -t option, if I remember well. To compute the index of the last field, you can count the number of delimiters in a line and add one. For instance something like this (assuming the ":" delimiter):
d=`head -1 FILE | tr -cd : | wc -c`
d=`expr $d + 1`
($d now contains the index of the last field.)
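You can then hand $d to sort (a sketch, assuming every line has the same number of fields):
sort -t: -k"$d,$d" FILE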
