Using uniq to compare 2 dictionaries - linux

So I have two dictionaries to compare (American English vs. British English).
How do I use the uniq command to count (-c) how many words are in the American English or the British English dictionary but not both?
Also, how do I count how many words from one dictionary appear in the other dictionary?
Just trying to understand how uniq works on a more complicated level. Any help is appreciated!

Instead of uniq, use the comm command for this. It finds lines that are in common between two files, or are unique to one or the other.
This counts all the words that are in one dictionary but not both:
comm -3 american british | wc -l
This counts the words that are in both dictionaries:
comm -12 american british | wc -l
By default, comm shows the lines that are only in the first file in column 1, the lines that are only in the second file in column 2, and the lines in both files in column 3. You can then use the -[123] options to tell it to leave out the specified columns. So -3 only shows columns 1 and 2 (the unique words in each file), while -12 only shows column 3 (the common words).
It requires that the files be sorted, which I assume your dictionary files are.
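To make the columns concrete, here is a small illustration with two tiny, hypothetical sorted word lists (in the real output, column 2 is indented by one tab and column 3 by two tabs):
printf 'banana\ncolor\ntheater\n'  > american    # hypothetical sample data
printf 'banana\ncolour\ntheatre\n' > british
comm american british
#         banana    (column 3: in both)
# color             (column 1: only in american)
#     colour        (column 2: only in british)
# theater
#     theatre
comm -3  american british | wc -l    # -> 4 (color, colour, theater, theatre)
comm -12 american british | wc -l    # -> 1 (banana)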
You can also do it with uniq. It has options -u to show only lines that appear once, and -d to show only lines that are repeated.
sort american british | uniq -u | wc -l # words in just one language
sort american british | uniq -d | wc -l # words in both languages
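One caveat (hedged, in case your word lists are not already deduplicated): uniq -d reports a word as "in both languages" as soon as it appears twice in the combined stream, so a word listed twice within a single dictionary would be miscounted. Deduplicating each file first avoids that:
sort <(sort -u american) <(sort -u british) | uniq -d | wc -l   # words in both languages
sort <(sort -u american) <(sort -u british) | uniq -u | wc -l   # words in just one language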

Related

Check if any file in a directory contains words in a constant file

I have a directory with n text files. Now I want to check whether any of these files contains one (or more) of the words from a constant file.
These files are all dictionaries with differing numbers of words. The constant file is a password list whose words I want to check for. The number of correct hits should be saved in a variable. The matching words should also be saved in a variable (I think as an array).
For example: file1 contains This is my dictionary, file2 contains And another one, and my password list contains this is a test for the dictionary and we have no other one.
The hits from file1 are This is dictionary (n1 = 3 words) and from file2 and one (n2 = 2 words).
My present code is
#!/bin/bash
# program_call passwordlist.txt *.txt
passwordlist="$1"
dictionarys="$*"
for comparison in $dictionarys; do
cat $passwordlist $comparison| sort | uniq -d >${comparison}.compare
done
One of my biggest problems here is that I've got a varying number of dictionaries, maybe 2, maybe 200. Regardless, all of them have to be checked against the password list, and the result (the number of correct words and the correct words themselves) has to be saved in its OWN variables, so I think two variables for each dictionary.
Another way:
$ for f in file{1,2}; do
    echo -n "$f: "
    grep -iow -f <(tr ' ' '\n' < cons) "$f" | wc -l
  done
file1: 3
file2: 2
This converts the constants file to one word per line, then checks the dictionary files for whole-word matches (ignoring case) and counts the matched occurrences.
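Since the question also asks to keep the number of hits and the matched words themselves in variables, the same pipeline can feed a count and an array per file. A rough sketch, assuming bash 4+ and the file names from the example (cons, file1, file2); adapt the names to your own files:
#!/bin/bash
passwordlist="cons"                     # the constant/password file (assumed name)
declare -A counts matches               # per-dictionary results
for f in file1 file2; do                # or: for f in "$@"
    words=$(grep -iow -f <(tr ' ' '\n' < "$passwordlist") "$f")
    counts[$f]=$(wc -w <<< "$words")    # number of hits
    matches[$f]=$words                  # the hit words themselves
done
echo "file1: ${counts[file1]} hits: ${matches[file1]}"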
My solution:
#!/bin/bash
# program_call_is /dictionarys/*.txt passwordlist.txt
passwordlist="${!#}"              # the last argument is the password list
dictionarys=("${@:1:$#-1}")       # everything before it is a dictionary
for comparison in "${dictionarys[@]}"; do
    fgrep -x -f "$passwordlist" "$comparison" >"${comparison}.compare"
done
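To get the results back into their own variables afterwards, as asked, one possibility (a sketch, assuming bash 4+ for mapfile and that the loop above has already written the .compare files):
for comparison in "${dictionarys[@]}"; do
    count=$(wc -l < "${comparison}.compare")      # number of matching words
    mapfile -t words < "${comparison}.compare"    # the matching words, as an array
    echo "$comparison: $count matches: ${words[*]}"
done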

How do I sort input with a variable number of fields by the second-to-last field?

Editor's note: The original title of the question mentioned tabs as the field separators.
In a text such as
500 east 23rd avenue Toronto 2 890 400000 1
900 west yellovillage blvd Mississauga 3 800 600090 3
how would you sort in ascending order of the second to last column?
Editor's note: The OP later provided another sample input line, 500 Jackson Blvd Toronto 3 700 40000 2, which contains only 8 whitespace-separated input fields (compared to the 9 above), revealing the need to deal with a variable number of fields in the input.
Note: There are several, potentially separate questions:
Update: Question C was the relevant one.
Question A: As implied by the question's title only: how can you use the tab character (\t) as the field separator?
Question B: How can you sort input by the second-to-last field, without knowing that field's specific index up front, given a fixed number of fields?
Question C: How can you sort input by the second-to-last field, without knowing that field's respective index up front, given a variable number of fields?
Answer to question A:
sort's -t option allows you to specify a field separator.
By default, sort uses any run of line-interior whitespace as the separator.
Assuming Bash, Ksh, or Zsh, you can use an ANSI C-quoted string ($'...') to specify a single tab as the field separator ($'\t'):
sort -t $'\t' -n -k8,8 file # -n sorts numerically; omit for lexical sorting
Answer to question B:
Note: This assumes that all input lines have the same number of fields, and that input comes from file file:
# Determine the index of the next-to-last column, based on the first
# line, using Awk:
nextToLastColNdx=$(head -n 1 file | awk -F '\t' '{ print NF - 1 }')
# Sort numerically by the next-to-last column (omit -n to sort lexically):
sort -t $'\t' -n -k$nextToLastColNdx,$nextToLastColNdx file
Note: To sort by a single field, always specify it as the end field too (e.g., -k8,8), as above, because sort, given only a start field index (e.g., -k8), sorts from the specified field through the remainder of the line.
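A small, hypothetical illustration of why the end field matters (lexical sort, two made-up lines):
printf 'z 1 b\na 1 c\n' | sort -k2      # key runs to end of line: 'z 1 b' sorts first (b < c)
printf 'z 1 b\na 1 c\n' | sort -k2,2    # key is field 2 only; the tie is broken by the whole line, so 'a 1 c' sorts first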
Answer to question C:
Note: This assumes that input lines may have a variable number of fields, and that on each line it is that line's second-to-last field that should act as the sort field; input comes from file file:
awk '{ printf "%s\t%s\n", $(NF-1), $0 }' file |
sort -n -k1,1 | # omit -n to perform lexical sorting
cut -f2-
The awk command extracts each line's second-to-last field and prepends it to the input line on output, separated by a tab.
The result is sorted by the first field (i.e., each input line's second-to-last field).
Finally, the artificially prepended sort field is removed again, using cut.
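As a hedged trace on the question's three sample lines (assuming plain whitespace-separated input), the decorated stream after the awk and sort steps looks like this (sort key, then a tab, then the original line):
40000	500 Jackson Blvd Toronto 3 700 40000 2
400000	500 east 23rd avenue Toronto 2 890 400000 1
600090	900 west yellovillage blvd Mississauga 3 800 600090 3
cut -f2- then strips the prepended key, leaving the original lines in that order.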
I suggest looking at "man sort".
You will see how to specify a field separator and how to specify the field index that should be used as a key for sorting.
You can use sort -k 2
For example :
echo -e '000 west \n500 east\n500 east\n900 west' | sort -k 2
The result is :
500 east
500 east
900 west
000 west
You can find more information in the man page of sort. Take a look at the end of the man page; just before AUTHOR there is some interesting information :)
Bye

How to find unique words from file linux

I have a big file; the lines look like this:
Text numbers etc. [Man-(some numbers)]
There is a lot of this, and the same Man-somenumbers is repeated on a few lines; I want to count only the unique Man- words. I can't use uniq on the file directly, because the text before the Man- words is different on each line.
How can I count only the unique Man-somenumbers words in the file?
If I understand what you want to do correctly, then
grep -oE 'Man-[0-9]+' filename | sort | uniq -c
should do the trick. It works as follows: First
grep -oE 'Man-[0-9]+' filename
isolates all words from the file that match the Man-[0-9]+ regular expression. That list is then piped through sort to get the sorted list that uniq requires, and then that sorted list is piped through uniq -c to count how often each unique Man- word appears.
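If you only need the number of distinct Man- words, rather than a count per word, a small variant of the same pipeline does it:
grep -oE 'Man-[0-9]+' filename | sort -u | wc -l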

How to get CSV dimensions from terminal

Suppose I'm in a folder where ls returns Test.csv. What command do I enter to get the number of rows and columns of Test.csv (a standard comma separated file)?
Try using awk. It's well suited to manipulating well-formatted CSV files.
awk -F, 'END {printf "Number of Rows : %s\nNumber of Columns = %s\n", NR, NF}' Test.csv
-F, specifies , as a field separator in csv file.
At the end of the file traversal, NR and NF hold the number of rows and the number of columns, respectively.
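For example, with a hypothetical 3x4 Test.csv:
printf 'a,b,c,d\ne,f,g,h\ni,j,k,l\n' > Test.csv   # made-up sample data
awk -F, 'END {printf "Number of Rows : %s\nNumber of Columns = %s\n", NR, NF}' Test.csv
# Number of Rows : 3
# Number of Columns = 4
Note that NF in the END block comes from the last line read, so the column count is only reliable when every row has the same number of fields and no field contains a quoted, embedded comma.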
Another quick and dirty approach would be like
# Number of Rows
cat Test.csv | wc -l
# Number of Columns
head -1 Test.csv | sed 's/,/\t/g' | wc -w
Although not a native solution using GNU coreutils, it is worth mentioning (since this is one of the top Google results for this question) that xsv provides a command to list the headers of a CSV file; counting them obviously gives the number of columns.
# count rows
xsv count <filename>
# count columns
xsv headers <filename> | wc -l
For big files this is orders of magnitude faster than native solutions with awk and sed.

How to delete double lines in bash

Given a long text file like this one (that we will call file.txt):
EDITED
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How do I delete the lines that appear at least twice in the same file, in bash? What I mean is that I want to get this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want to have the same lines duplicated in a given text file. Could you show me the command, please?
Assuming whitespace is significant, the typical solution is:
awk '!x[$0]++' file.txt
(e.g., the line "ab " is not considered the same as "ab". It is probably simplest to pre-process the data if you want to treat whitespace differently.)
--EDIT--
Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:
awk '!x[ substr( $0, 2 )]++' file.txt
This compares everything from the second character onward, which here amounts to ignoring the first column because that column is only one character wide. This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen. The first time a string is seen, the line is printed. In the first case, we use the entire input line contained in $0. In the second case, we use only the substring consisting of everything from the 2nd character onward.
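If the first column can be wider than a single character (substr($0, 2) only drops the first character, not the first field), a hedged variant keys on everything after the first space-separated column instead:
awk '{ key = $0; sub(/^[^ ]+ +/, "", key) } !x[key]++' file.txt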
Try this simple script:
cat file.txt | sort | uniq
cat will output the contents of the file,
sort will put duplicate entries adjacent to each other
uniq will remove adjacent duplicate entries.
Hope this helps!
The uniq command will do what you want.
But make sure the file is sorted first; it only checks consecutive lines.
Like this:
sort file.txt | uniq
