Natural sorting in reverse? - linux

I'm trying to sort a text file in this manner:
6 aaa
4 bbb
2 ccc
2 ddd
That is, lines sorted first by the number in descending order (the number is the count of occurrences of the word to its right), and when several words occur the same number of times, I'd like those words sorted alphabetically.
What I have:
6 aaa
4 bbb
2 ddd
2 ccc
When I try sort -nr | sort -V it kind of does what I want but in ascending order.
2 ccc
2 ddd
4 bbb
6 aaa
What's a clean way to accomplish this?

I think you just need to specify that the numeric reverse sort only applies to the first field:
$ sort -k1,1nr file
6 aaa
4 bbb
2 ccc
2 ddd
-k1,1[OPTS] means that OPTS apply only to the span from the 1st field through the 1st field, i.e. to the first field alone. The rest of the line is sorted according to the global ordering options; since no other options were passed here, that means the default lexicographic sort.
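If you prefer to spell out the tie-breaking rather than rely on the global default, an explicit second key gives the same result for this input (a sketch):
$ sort -k1,1nr -k2,2 file
6 aaa
4 bbb
2 ccc
2 ddd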

Maybe using tac? (not a shell expert here, just remembering uni days...)
sort -nr | sort -V | tac

Related

Excel parent list chain

I have a large table of two columns in Excel 2010. Column A is the user, column B is the person who invited the user. Usernames are alphanumeric, including some which are just numeric. The earliest users don't have an inviter.
User | Parent
-------------
AAA |
BBB |
CCC | AAA
DDD | BBB
EEE | DDD
FFF | DDD
GGG | FFF
HHH |
III | GGG
What I would like to do is have a formula which allows me to go to the grandparent (and great-grandparent, and beyond), so I'm trying to find a formula-based solution which uses mixed relative and absolute references where appropriate.
The above chain would go to a maximum of four, but I have reason to believe my data set goes to no more than 20 levels deep at maximum. I would like to find a formula or combination of formulas that get me to this (and, as I said, beyond):
USER | PARENT | P2 | P3 | P4 | ...
AAA | |
BBB | |
CCC | AAA |
DDD | BBB |
EEE | DDD | BBB |
FFF | DDD | BBB |
GGG | FFF | DDD | BBB
HHH |
III | GGG | FFF | DDD | BBB
...
I've tried various methods combining the VLOOKUP, MATCH, and INDEX functions, with and without a key column of user ID numbers (some of the solutions without a numeric column broke down when faced with the fact that "0" is a valid username, which makes error trapping more difficult). I can get to P2 pretty reliably, but I can't ever seem to get to P3 without it breaking down. Incidentally, the formulas I've tried are very CPU-intensive, given the data runs to nearly 400,000 rows, but calculation time doesn't concern me much. My brute-force methods aren't working. There are several somewhat similar questions on Stack Overflow, but they're asking for slightly different things, and I haven't been able to adapt any of them.
If this can be done via standard functions, that would be preferable to VBA (which I am not familiar with), even if the calculation time is longer, as it would increase my ability to maintain it when I need to revisit this issue next year.
Try this formula: =IFERROR(VLOOKUP(C5,UserParent,2,FALSE)&"",""), replacing UserParent with your absolutely referenced column pair (e.g. $B$5:$C$30) or an appropriate named range. Copy it down and across your grandparent columns.
I'm betting this is the approach that you tried before, but you end up with a bunch of zeroes in the output. The juicy bit in this formula is the & "". This forces the empty cells in your parent column to be treated as empty strings rather than zero-valued cells when VLOOKUP does its work. This removes all those zeroes that dork up the output.
I was able to make it work with a bunch of random alphanumerics, but without sample data, this is the best I could do.
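For instance (a sketch, assuming the User/Parent pairs sit in $B$5:$C$30 and the formula above lives in the first parent column), the copy one column to the right becomes:
=IFERROR(VLOOKUP(D5,$B$5:$C$30,2,FALSE)&"","")
so each column looks up the parent of whatever the column to its left produced.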
As you have noted, the existence of 0 as a valid username is a real problem, since 0 is also what VLOOKUP() (and equivalently INDEX(,MATCH())) returns for names with no parent.
An alternative strategy is to use some dummy value which does not appear in either the User or Parent column, such as -99999, to signify the absent parent and to add this in place of any empty cell in the Parent column of the User/Parent table. Also add a row to this table with this same dummy value in both the User and Parent columns. Now you will only get a zero returned by VLOOKUP if 0 is genuinely the parent of the cell whose parent you are attempting to find. You will detect when there are no more "levels" when all the values in the column are equal to the dummy value.
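A minimal sketch of one ancestor cell under this scheme, assuming the User/Parent pairs (with -99999 substituted for blank parents, plus the extra -99999/-99999 row) sit in $A$2:$B$400002 and the previous level's result is in C2:
=VLOOKUP(C2,$A$2:$B$400002,2,FALSE)
Once a chain runs out of ancestors, every further column simply returns -99999, which is easy to filter or blank out afterwards.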

Counting Combination Pairs in Excel

Alright, I have a spreadsheet that looks like this in the "B" column:
AAA
BBB
BBB
AAA
BBB
CCC
DDD
BBB
AAA
BBB
CCC
What I need to do is to count how many times "BBB" directly follows "AAA" (In this example 3 times)
I've tried multiple things with =SUMPRODUCT, like =SUMPRODUCT(COUNTIFS(K6:K10,{"AAA","BBB"})), but that returns the combined count of all the "AAA"s and "BBB"s instead of just the number of AAA & BBB pairs. The best I could do is to use =COUNTIFS(), but I can't see a way to do a forward lookup like =COUNTIFS("B:B","AAA","B+1:B+1","BBB")
Also, I should mention, I was hoping to use this somewhere in the formula "Table13[[#All],[Radio State]]." That way the formula would grow and shrink depending on the size of the table.
Does anyone happen to know if it's possible to do this?
Thanks Guys,
You can 'offset' the range by a bit like this:
=COUNTIFS(A1:A10, "AAA", A2:A11, "BBB")
You can change the range accordingly.
With a table:
=COUNTIFS(Table13[[#All],[Radio State]],"AAA",OFFSET(Table13[[#All],[Radio State]],1,0),"BBB")
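Applied to the sample list, assuming the eleven values sit in B1:B11, the plain-range version would be:
=COUNTIFS(B1:B10,"AAA",B2:B11,"BBB")
which pairs each cell with the one directly below it and returns 3 for this data.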

For every string in file1.txt check if it exists in file2.txt then do something

I've got two txt files, file1.txt and file2.txt.
Both of them have one single string on each line. Strings in file1.txt are unique (no duplication), as are the strings in file2.txt.
The files have different numbers of strings.
file1.txt    file2.txt
FFF          AAA
GGG          BBB
ZZZ          CCC
             ZZZ
I'd like to compare those files, so that for every string in file1.txt, if it exists in file2.txt then it's OK. If not, then write that string to another file (file3.txt).
In this example, file3.txt would be:
file3.txt
FFF
GGG
I'd like to use the command shell, doing something like:
cat file1.txt | while read a; do something on file2.txt ...
but that is not compulsory.
See the man page for grep, specifically the -f option.
grep -vf file2.txt file1.txt
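Since the entries are fixed strings that should match whole lines, it may be safer (an optional refinement for this data) to add the -F (fixed strings) and -x (whole-line match) flags and redirect the result:
grep -vFxf file2.txt file1.txt > file3.txt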
Your best bet would be to read in the input from file 2, put it in a sorted list (or even better, a balanced search tree) and then as you read in each line from file1, go through the tree or do a binary search of the list to find if the string exists.
The idea is that you want to do the processing once, to make the list of allowed values as easy to check as possible. Putting them in a binary search tree means that you first compare the string against the word in the middle (alphabetically) of list 2; if it comes before that word you take the left branch (which contains only the words that sort earlier), and if it comes after, you only have to look at the right branch.
Similarly, if using a list, you look at the word in the middle of the list and then can remove half of the remaining list from consideration each iteration. This means you only have to do log n steps to check each of the words in List1 against the n words in list2.
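In a shell pipeline the same idea (preprocess file2 into a fast lookup structure, then test each line of file1 against it) is usually done with an awk hash rather than a hand-rolled tree; a minimal sketch:
# first pass stores every line of file2.txt; second pass prints file1.txt lines not stored
awk 'NR==FNR { seen[$0] = 1; next } !($0 in seen)' file2.txt file1.txt > file3.txt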

How to delete double lines in bash

Given a long text file like this one (that we will call file.txt):
EDITED
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How to delete the lines that appear at least twice in the same file in bash? What I mean is that I want to have this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want to have the same line appear twice in the file. Could you show me the command, please?
Assuming whitespace is significant, the typical solution is:
awk '!x[$0]++' file.txt
(e.g., the line "ab " is not considered the same as "ab"; it is probably simplest to pre-process the data if you want to treat whitespace differently.)
--EDIT--
Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:
awk '!x[ substr( $0, 2 )]++' file.txt
This compares everything from the 2nd character to the end of the line, which for this data means ignoring the one-character line number at the start. This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen. The first time a string is seen, the line is printed. In the first case we use the entire input line contained in $0; in the second case we use only the substring consisting of everything from the 2nd character onward.
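If the leading column can be wider than one character, keying on everything after the first whitespace-separated field may be more robust (a sketch, not part of the original answer):
# strip the first field and the whitespace after it before using the line as a key
awk '{ key = $0; sub(/^[^ \t]+[ \t]+/, "", key) } !x[key]++' file.txt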
Try this simple script:
cat file.txt | sort | uniq
cat will output the contents of the file,
sort will put duplicate entries adjacent to each other
uniq will remove adjacent duplicate entries.
Hope this helps!
The uniq command will do what you want.
But make sure the file is sorted first; uniq only checks consecutive lines.
Like this:
sort file.txt | uniq
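As an aside, sort -u folds the two steps into one command, if collapsing exact duplicate lines is all you need:
sort -u file.txt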

Using uniq to compare 2 dictionaries

So I have two dictionaries to compare (american english vs british english).
How do I use the uniq command to count (-c) how many words are in the american english or british english but not both?
Also how do I count the number of word occurrences of one dictionary that appears in a different dictionary?
Just trying to understand how uniq works on a more complicated level. Any help is appreciated!
Instead of uniq, use the comm command for this. It finds lines that are in common between two files, or are unique to one or the other.
This counts all the words that are in one dictionary but not both
comm -3 american british | wc -l
This counts the words that are in both dictionaries:
comm -12 american british | wc -l
By default, comm shows the lines that are only in the first file in column 1, the lines that are only in the second file in column 2, and the lines in both files in column 3. You can then use the -[123] options to tell it to leave out the specified columns. So -3 only shows columns 1 and 2 (the unique words in each file), while -12 only shows column 3 (the common words).
It requires that the files be sorted, which I assume your dictionary files are.
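If you want the two directions separately (again assuming both files are sorted), suppress the other columns:
comm -23 american british | wc -l   # words only in the American list
comm -13 american british | wc -l   # words only in the British list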
You can also do it with uniq. It has the option -u to show only lines that appear once, and -d to show only lines that are repeated.
sort american british | uniq -u | wc -l # words in just one language
sort american british | uniq -d | wc -l # words in both languages
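Both of these assume each dictionary is already duplicate-free on its own, which dictionary files normally are. If the files might not be sorted, a grep-based check (a sketch) avoids the sorting requirement:
grep -cFxf american british   # counts British words that also appear in the American list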
