Sorting on the last field of a line - linux

What is the simplest way to sort a list of lines, sorting on the last field of each line? Each line may have a variable number of fields.
Something like
sort -k -1
is what I want, but sort(1) does not take negative numbers to select fields from the end instead of the start.
I'd also like to be able to choose the field delimiter.
Edit: To add some specificity to the question: The list I want to sort is a list of pathnames. The pathnames may be of arbitrary depth hence the variable number of fields. I want to sort on the filename component.
This additional information may change how one manipulates the line to extract the last field (basename(1) may be used), but does not change sorting requirements.
e.g.
/a/b/c/10-foo
/a/b/c/20-bar
/a/b/c/50-baz
/a/d/30-bob
/a/e/f/g/h/01-do-this-first
/a/e/f/g/h/99-local
I want this list sorted on the filenames, which all start with numbers indicating the order the files should be read.
I've added my answer below which is how I am currently doing it. I had hoped there was a simpler way - maybe a different sort utility - perhaps without needing to manipulate the data.

awk '{print $NF,$0}' file | sort | cut -f2- -d' '
Basically, this command does:
Repeat the last field at the beginning of the line, separated by a space (awk's default OFS)
Sort; lines with the same filename are disambiguated by the full line ($0) that follows it
Cut away the repeated first field; -f2- means from the second field to the last
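Note that with awk's default whitespace field separator, a pathname like /a/b/c/10-foo is a single field, so $NF is the whole line. For the question's pathnames, a sketch of the same idea splitting on / instead (assuming the paths contain no spaces):
awk -F/ '{print $NF" "$0}' file | sort | cut -f2- -d' '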

Here's a Perl command line (note that your shell may require you to escape the $s):
perl -e "print sort {(split '/', $a)[-1] <=> (split '/', $b)[-1]} <>"
Just pipe the list into it or, if the list is in a file, put the filename at the end of the command line.
Note that this script does not actually change the data, so you don't have to be careful about what delimiter you use.
Here's sample output:
>perl -e "print sort {(split '/', $a)[-1] <=> (split '/', $b)[-1]} <>" files.txt
/a/e/f/g/h/01-do-this-first
/a/b/c/10-foo
/a/b/c/20-bar
/a/d/30-bob
/a/b/c/50-baz
/a/e/f/g/h/99-local

Something like this:
awk '{print $NF"|"$0}' file | sort -t"|" -k1 | awk -F"|" '{print $NF}'
(This assumes | never occurs in the data; as with the answer above, use awk -F/ in the first step to make $NF the filename component when the input is pathnames.)

A one-liner in perl for reversing the order of the fields in a line:
perl -lne 'print join " ", reverse split / /'
You could use it once, pipe the output to sort, then pipe it back and you'd achieve what you want. You can change / / to / +/ so it squeezes spaces. And you're of course free to use whatever regular expression you want to split the lines.
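For the pathname case, that round trip might look like this (a sketch; the third argument -1 to split preserves the empty leading field, so the leading slash survives the double reversal):
perl -lne 'print join "/", reverse split m{/}, $_, -1' file | sort | perl -lne 'print join "/", reverse split m{/}, $_, -1'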

I think the only solution would be to use awk:
Put the last field to the front using awk.
Sort lines.
Strip the duplicated first field again.
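A sketch of those three steps, using a tab as the temporary separator (assuming the paths contain no tabs):
awk -F/ -v OFS='\t' '{print $NF, $0}' file | sort | cut -f2-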

Replace the last delimiter on the line with another delimiter that does not otherwise appear in the list, sort on the second field using that other delimiter as the sort(1) delimiter, and then revert the delimiter change.
delim=/
new_delim=" "
cat "$list" \
| sed "s|\(.*\)$delim|\1$new_delim|" \
| sort -t"$new_delim" -k 2,2 \
| sed "s|$new_delim|$delim|"
The problem is knowing what delimiter to use that does not appear in the list. You can make multiple passes over the list and then grep for a succession of potential delimiters, but it's all rather nasty - particularly when the concept of "sort on the last field of a line" is so simply expressed, yet the solution is not.
Edit: One safe delimiter to use for $new_delim is NUL, since that cannot appear in filenames, but I don't know how to put a NUL character into a Bourne/POSIX shell script (not bash), nor whether sort and sed will handle it properly.

#!/usr/bin/ruby
f = ARGF.read
lines = f.lines
broken = lines.map { |l| l.split(/:/) }
sorted = broken.sort { |a, b| a[-1] <=> b[-1] }
fixed = sorted.map { |s| s.join(":") }
puts fixed
If all the answers involve perl or awk, might as well solve the whole thing in the scripting language. (Incidentally, I tried in Perl first and quickly remembered that I dislike Perl's lists-of-lists. I'd love to see a Perl guru's version.)

I want this list sorted on the filenames, which all start with numbers
indicating the order the files should be read.
find . | sed 's#.*/##' | sort
The sed strips everything up to and including the last slash from each result; the filenames are what's left, and you sort on those. (Note that this prints only the filenames, discarding the directory part.)

Here is a Python one-liner version; note that it assumes the last field is an integer, but you can change that as needed.
cat file.txt | python3 -c 'import sys; list(map(sys.stdout.write, sorted(sys.stdin, key=lambda x: int(x.rsplit(" ", 1)[-1]))))'
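For the pathnames in the question, a variant of the same idea that sorts on the filename component as a string rather than an integer (a sketch):
python3 -c 'import sys; sys.stdout.write("".join(sorted(sys.stdin, key=lambda x: x.rsplit("/", 1)[-1])))' < file.txt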

| sed "s#(.*)/#\1"\\$'\x7F'\# \
| sort -t\\$'\x7F' -k2,2 \
| sed s\#\\$'\x7F'"#/#"
Still way worse than simple negative field indexes for sort(1) but using the DEL character as delimiter shouldn’t cause any problem in this case.
I also like how symmetrical it is.

sort allows you to specify the delimiter with the -t option, if I remember correctly. To compute the last field index, you can count the number of delimiters in a line and add one. For instance, something like this (assuming the ":" delimiter):
d=`head -1 FILE | tr -cd : | wc -c`
d=`expr $d + 1`
($d now contains the last field index).
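The sort invocation can then use $d (a sketch; note this only works if every line has the same number of fields as the first, which is not the case for the variable-depth paths in the original question):
sort -t: -k"$d,$d" FILE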

Related

Linux remove whitespace first line

I have the file virt.txt containing:
0302 000000 23071SOCIETY 117
0602 000000000000000001 PAYMENT BANK
I want to remove the 3 whitespace characters in columns 6 through 8, on the first line only.
I do:
sed '1s/[[:blank:]]+[[:blank:]]+[[:blank:]]//6' virt.txt
It doesn't work.
Please help.
Your regex would consume all the available blanks from a sequence of three or more (in a quite inefficient way) and replace the sixth occurrence of that. Because your first input line does not contain six or more separate stretches of three or more whitespace characters, it actually did nothing. But you can in fact use sed to do exactly what you say you want:
sed '1s/^\(.....\)   /\1/' virt.txt
(or for convenience, if you have sed -E or the variant sed -r which works on some platforms, but neither of these is standard):
sed -E '1s/^(.{5}) {3}/\1/' virt.txt # -E is not portable
The parentheses capture the first five characters into a back reference, and we then use the first back reference \1 as the replacement string, effectively replacing only the text which matched outside the parentheses.
If your sed supports the -i option, you can use that to modify the file directly; but this is also not standard, so the most portable solution is to write the result to a new file, then move it back on top of the original file if you want to replace it.
sed is convenient if you are familiar with it, but as you are clearly not, perhaps a better approach would be to use a different language, ideally one which, unlike sed, is not write-only for many users.
If you know the three characters will always be spaces, just do a static replacement.
awk 'NR==1 { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
On the first line (NR is the current input line number) replace the input line $0 with a catenation of the substrings on both sides of the part you want to cut.
For a simple replacement like that, you can also use basic Unix text manipulation utilities, though it's rather inefficient and inelegant:
head -n 1 virt.txt | cut -c1-5,9- >newfile.txt
tail -n +2 virt.txt >>newfile.txt
If you need to check that the three characters are spaces, the Awk script only needs a minor tweak.
awk 'NR==1 && /^.{5} {3}/ { $0 = substr($0, 1, 5) substr($0, 9) } 1' virt.txt
You should recognize the regex from above. Awk is less succinct than sed but, as a consequence, also quite a lot more readable.

remove almost-duplicates containing substring of next line

I need to know a way to remove duplicate strings in line, but let me explain, cause I have already used uniq. In a file, I get these two lines:
ANASI:A=4-63261950;
ANASI:A=4-63261950,ES=541;
The string 4-63261950 is duplicated in both lines, but the line itself is different, only that string is equal in both lines. I just need a way to process the entire file and remove the first line and leave only the one with the ANASI:A=4-63261950,ES=541;. The file will contain several lines with this exact same scenario. Is there a way to do this with sed or something?
awk to the rescue...
assuming your delimiters and structure stay the same
sort file | awk -F"[;,]" '!a[$1]++'
will pick the first one based on lexical order (, < ;)
If the file is huge (and memory an issue):
sort YourFile | awk -F '[;,]' 'Last != $1{print}{Last = $1}'
This might work for you (GNU sed):
sed -r 'N;/^(.*);\n\1,/!P;D' file
This uses a moving window to compare successive pairs of lines to print the required match.

How to process nested delimiters in a same file in unix

I have a semicolon-separated file.
And one of the fields in the file is pipe-separated.
Example.
field-1;field-2;value-1|value-2|value-3;field-4;field-5
Here field-3 is nothing but pipe-separated values.
Now my requirement is that if there are multiple values in field-3,
I only have to keep the first value. Discard anything after the first pipe in that field.
i.e., I want to keep only first value in field-3 as shown below
field-1;field-2;value-1;field-4;field-5
Can you please tell me how can I achieve this using sed/awk?
sed 's/[|][^;]*//g' YourFile
It removes (replaces with nothing) everything from a | (the [|] pattern) up to, but not including, the next ; (the [^;]* pattern), for every occurrence (the g flag).
This assumes | appears only as a separator and never inside a field value (e.g. within a string).
One way:
$ x='field-1;field-2;value-1|value-2|value-3;field-4;field-5'
$ echo $x | awk -F";" '{gsub(/\|.*/,"",$3);}1' OFS=";"
field-1;field-2;value-1;field-4;field-5
$
This might work for you (GNU sed):
sed -r 's/^(([^;]*;){2}[^|;]*)[^;]*/\1/' file
This matches the first three fields and replaces it by the first two fields and the first value in the third field.

Grep filtering of the dictionary

I'm having a hard time getting a grasp of using grep for a class I am in, and I was hoping someone could help guide me through this assignment. The assignment is as follows.
Using grep, print all 5-letter lowercase words from the Linux dictionary that have a single letter duplicated one time (aabbe or ababe are not valid because both a and b are in the word twice). Next to that, print the duplicated letter followed by the non-duplicated letters in alphabetically ascending order.
The teacher noted that we will need to use several (6) grep statements (piping the results to the next grep) and a sed statement (stream editor) to reformat the final set of words, then pipe them into a read loop where you tear apart the three non-dup letters and sort them.
Sample Output:
aback a bck
abaft a bft
abase a bes
abash a bhs
abask a bks
abate a bet
I haven't figured out how to do more than printing 5-character words,
grep "^.....$" /usr/share/dict/words |
Didn't check it thoroughly, but this might work
tr '[:upper:]' '[:lower:]' | egrep -x '[a-z]{5}' | sed -r 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' | grep " " | egrep -v "(.).*\1"
But do it your way, because someone might see it here.
All in one sed:
sed -n '
# keep only 5-letter words
/^[a-zA-Z]\{5\}$/ {
# lowercase the letters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
# skip words without a doubled letter
/\(.\).*\1/ !b
# skip words with a letter occurring three or more times
/\(.\).*\1.*\1/ b
# skip words with two different doubled letters (aabb, abab, abba shapes)
/\(.\).*\1.*\(.\).*\2/ b
/\(.\).*\(.\).*\1.*\2/ b
/\(.\).*\(.\).*\2.*\1/ b
# extract the pair and the singles; the leading a counts the rotations below
s/\(.\)*\(.\)\(.*\)\2\(.*\)/a & \2:\1\3\4/
# sort the singles: move each current "a" to the end, then rotate the alphabet
:sort
s/:\([^a]*\)a\(.*\)$/:\1\2a/
y/abcdefghijklmnopqrstuvwxyz/zabcdefghijklmnopqrstuvwxy/
/^a/ !b sort
# clean up and print
s/..//
s/:/ /p
}' YourFile
This is POSIX sed, so use --posix on GNU sed.
The first bit, obviously, is to use grep to get it down to just the words that have a single duplication in. I will give you some clues on how to do that.
The key is to use backreferences, which allow you to specify that something that matched a previous expression should appear again. So if you write
grep -E "^(.)...\1...\1$"
then you'll get all the words that have the starting letter reappearing in fifth and ninth positions. The point of the brackets is to allow you to refer later to whatever matched the thing in brackets; you do that with a \1 (to match the thing in the first lot of brackets).
You want to say that there should be a duplicate anywhere in the word, which is slightly more complicated, but not much. You want a character in brackets, then any number of characters, then the repeated character (with no ^ or $ specified).
That will also include ones where there are two or more duplicates, so the next stage is to filter them out. You can do that by a grep -v invocation. Once you've got your list of 5-character words that have at least one duplicate, pipe them through a grep -v call that strips out anything with two (or more) duplicates in. That'll have a (.), and another (.), and a \1, and a \2, and these might appear in several different orders.
You'll also need to strip out anything that has a (.) and a \1 and another \1, since that will have a letter with three occurrences.
That should be enough to get you started, at any rate.
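For reference, here is one way to assemble the stages described above into the shape the teacher asked for: six grep statements, a sed to reformat, and a read loop to sort the leftover letters (a sketch, assuming an all-lowercase /usr/share/dict/words; the dictionary path and loop details are illustrative):
grep -xE '[a-z]{5}' /usr/share/dict/words |  # exactly five lowercase letters
grep -E '(.).*\1' |                          # at least one duplicated letter
grep -vE '(.).*\1.*\1' |                     # drop letters occurring three or more times
grep -vE '(.).*\1.*(.).*\2' |                # drop double doubles: aabb shape
grep -vE '(.).*(.).*\1.*\2' |                # drop double doubles: abab shape
grep -vE '(.).*(.).*\2.*\1' |                # drop double doubles: abba shape
sed -E 's/^(.*)(.)(.*)\2(.*)$/& \2 \1\3\4/' |  # append the duplicated letter and the leftovers
while read -r word dup rest; do
  # sort the three leftover letters one per line, then rejoin them
  printf '%s %s %s\n' "$word" "$dup" "$(printf '%s\n' "$rest" | fold -w1 | sort | tr -d '\n')"
done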
Your next step should be to find the 5-letter words containing a duplicate letter. To do that, you will need to use back-references. Example:
grep "[a-z]*\([a-z]\)[a-z]*\$1[a-z]*"
The $1 picks up the contents of the first parenthesized group and expects to match that group again. In this case, it matches a single letter. See: http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples--part-ii/ for more description of this capability.
You will next need to filter out those cases that have either a letter repeated 3 times or a word with 2 letters repeated. You will need to use the same sort of back-reference trick, but you can use grep -v to filter the results.
sed can be used for the final display. Grep will merely allow you to construct the correct lines to consider.
Note that the dictionary contains capital letters and also non-letter characters, plus the strange characters used in Southern Europe, say "è".
If you want to distinguish "A" and "a", it's automatic, on the other hand if "A" and "a" are the same letter, in ALL grep invocations you must use the -i option, to instruct grep to ignore case.
Next, you always want to pass the -E option, to avoid the so called backslashitis gravis in the regexp that you want to pass to grep.
Further, if you want to exclude the lines matching a regexp from the output, the correct option is -v.
Eventually, if you want to specify many different regexes to a single grep invocation, this is the way (just an example btw)
grep -E -i -v -e 'regexp_1' -e 'regexp_2' ... -e 'regexp_n'
The preliminaries are behind us; let's look forward, using the answer from chiastic-security as a reference to understand the proceedings.
There are only these possibilities to find a duplicate in a 5 character string
(.)\1
(.).\1
(.)..\1
(.)...\1
grep -E -i -e 'regexp_1' ...
Now you have all the doubles, but this doesn't exclude triples etc., which are identified by the following patterns (Edit: added a couple of additional matching triple patterns)
(.)\1\1
(.).\1\1
(.)\1.\1
(.)\1..\1
(.)..\1\1
(.).\1.\1
(.)\1\1\1
(.).\1\1\1
(.)\1\1\1\1
you want to exclude these patterns, so grep -E -i -v -e 'regexp_1' ...
At this point, you have a list of words with at least one pair of the same character and no triples etc., and you want to drop double doubles. These are the regexes that match double doubles:
(.)(.)\1\2
(.)(.)\2\1
(.).(.)\1\2
(.).(.)\2\1
(.)(.).\1\2
(.)(.).\2\1
(.)(.)\1.\2
(.)(.)\2.\1
and you want to exclude the lines with these patterns, so it's grep -E -i -v ...
A final hint: to play with my answer, copy a few hundred lines of the dictionary into your working directory, head -n 3000 /usr/share/dict/words | tail -n 300 > ./300words, so that you can really understand what you're doing and avoid being overwhelmed by the volume of the output.
And yes, this is not a complete answer, but it is maybe too much, isn't it?

truncate output in BASH

How do I truncate output in BASH?
For example, if I "du file.name" how do I just get the numeric value and nothing more?
Later addition:
All solutions work perfectly. I chose to accept the most enlightening cut answer because I prefer the simplest approach in bash files that others are supposed to be able to read.
If you know what the delimiters are then cut is your friend
du | cut -f1
Cut defaults to tab delimiters, so in this case you are selecting the first field.
You can change delimiters: cut -d ' ' would use a space as a delimiter. (from Tomalak)
You can also select individual character positions or ranges:
ls | cut -c1-2
I'd recommend cut, as others have said. But another alternative that is sometimes useful because it allows any whitespace as separators, is to use awk:
du file.name | awk '{print $1}'
du | cut -f 1
If you just want the number of bytes in a single file, you can use the -s file-test operator. That operator is Perl's; a sketch of one way to call it from the shell:
SIZE=$(perl -e 'print -s shift' file.name)
That gives you a different number than du (bytes rather than disk blocks), but I'm not sure how exactly you're using this.
This has the advantage of not having to run du at all.
It's hard to answer questions like this in a vacuum, because we don't know how you're going to use the data. Knowing that might suggest an entirely different answer.
