I have a big file whose lines look like this:
Text numbers etc. [Man-(some numbers)]
There is a lot of this, and the same Man-somenumbers word is repeated on several lines. I want to count only the unique Man- words. I can't use uniq on the file directly, because the text before the Man- word is different on each line.
How can I count only the unique Man-somenumbers words in the file?
If I understand what you want to do correctly, then
grep -oE 'Man-[0-9]+' filename | sort | uniq -c
should do the trick. It works as follows: First
grep -oE 'Man-[0-9]+' filename
isolates all words from the file that match the Man-[0-9]+ regular expression. That list is then piped through sort to get the sorted list that uniq requires, and then that sorted list is piped through uniq -c to count how often each unique Man- word appears.
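For example, with a made-up three-line file in which Man-123 appears twice and Man-456 once, the pipeline would print:
$ grep -oE 'Man-[0-9]+' filename | sort | uniq -c
      2 Man-123
      1 Man-456
If you only want the number of distinct Man- words rather than a per-word count, you can instead pipe through sort -u and wc -l:
$ grep -oE 'Man-[0-9]+' filename | sort -u | wc -l
2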
I have a .csv file filled with people's names, their group, the city they live in, and the day they are able to work; these 4 pieces of information are separated by ":".
For example:
Dennis:GR1:Thursday:Paris
Charles:GR3:Monday:Levallois
Hugues:GR2:Friday:Ivry
Michel:GR2:Tuesday:Paris
Yann:GR1:Monday:Pantin
I'd like to drop the 2nd and the 3rd columns and print only the lines whose names end with "s", while keeping the remaining column.
For example, I would like to get something like this:
Dennis:Paris
Charles:Levallois
Hugues:Ivry
I tried to do this with grep and cut, but using cut I end up with just the 1st column remaining.
I hope that I've been able to make myself understood!
It sounds like all you need is:
$ awk 'BEGIN{FS=OFS=":"} $1~/s$/{print $1, $4}' file
Dennis:Paris
Charles:Levallois
Hugues:Ivry
Here BEGIN{FS=OFS=":"} sets both the input and output field separators to ":", $1~/s$/ selects the lines whose first field ends in "s", and print $1, $4 prints the name and the city.
To address your comment requesting a grep+cut solution:
$ grep -E '^[^:]+s:' file | cut -d':' -f1,4
Dennis:Paris
Charles:Levallois
Hugues:Ivry
but awk is the right way to do this.
I have a directory with n text files. I want to check whether any of these files contains one (or more) words from a constant file.
These files are all dictionaries with different numbers of words. The constant file is a password list against which I want to check those words. The number of correct hits should be saved in a variable. The matching words themselves should also be saved in a variable (I think as an array).
For example: file1 contains This is my dictionary, file2 contains And another one, and my password list contains this is a test for the dictionary and we have no other one.
The hits from file1 are This is dictionary (n1=3 words) and from file2 and one (n2=2 words).
My present code is
#!/bin/bash
# program_call passwordlist.txt *.txt
passwordlist="$1"
dictionarys="$*"
for comparison in $dictionarys; do
cat $passwordlist $comparison| sort | uniq -d >${comparison}.compare
done
One of my biggest problems here is that the number of dictionaries varies: maybe 2, maybe 200. Regardless, all of them have to be checked against the password list, and the result (the number of correct words and the correct words themselves) has to be saved in its OWN variables. So I think that means two variables for each dictionary.
Another way:
$ for f in file{1,2}; do
    echo -n "$f: "
    grep -iow -f <(tr ' ' '\n' <cons) "$f" | wc -l
  done
file1: 3
file2: 2
This converts the constants file to one word per line, checks each dictionary file for whole-word matches (ignoring case), and counts the matched occurrences.
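The question also asks to keep the number of hits and the matching words themselves in variables. A minimal sketch of that, assuming bash 4+ (for mapfile and associative arrays) and the same file names cons, file1 and file2 as above:
#!/bin/bash
# Sketch: per dictionary file, store the number of hits and the matched words.
declare -A counts    # counts[file] -> number of matching words
declare -A words     # words[file]  -> the matching words, space separated

for f in file1 file2; do
    # unique whole-word, case-insensitive matches against the word list in "cons"
    mapfile -t hits < <(grep -iow -f <(tr ' ' '\n' <cons) "$f" | sort -u)
    counts[$f]=${#hits[@]}
    words[$f]="${hits[*]}"
    echo "$f: ${counts[$f]} hits: ${words[$f]}"
done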
My solution:
#!/bin/bash
# program_call_is /dictionarys/*.txt passwordlist.txt
passwordlist="${@: -1}"          # the last argument is the password list
dictionarys=("${@:1:$#-1}")      # everything before it are the dictionaries
for comparison in "${dictionarys[@]}"; do
    fgrep -x -f "$passwordlist" "$comparison" >"${comparison}.compare"
done
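If you then also want the counts, a small follow-on sketch that just counts the lines of each generated .compare file (adjust the glob to wherever those files were written):
for c in *.compare; do
    echo "$c: $(wc -l < "$c") matching words"
done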
I have a text file like this:
john,3
albert,4
tom,3
junior,5
max,6
tony,5
I'm trying to fetch the records where the column 2 value occurs in more than one record. My desired output:
john,3
tom,3
junior,5
tony,5
Can we use uniq -d on the second column?
Here's one way using awk. It reads the input file twice, but avoids the need to sort:
awk -F, 'FNR==NR { a[$2]++; next } a[$2] > 1' file file
Results:
john,3
tom,3
junior,5
tony,5
Brief explanation:
FNR==NR is a common awk idiom that is true only while the first file in the argument list is being read. Here, the value of column two is used as an array key and its count is incremented (the next keyword skips the rest of the code for that line). On the second read of the file, we simply print the lines whose column-two value has a count greater than one.
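If reading the input twice is not convenient (for example because it comes from a pipe), a sketch of a single-pass variant that buffers the lines in memory instead:
awk -F, '{ line[NR] = $0; key[NR] = $2; count[$2]++ }
         END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }' file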
You can use uniq on fields (columns), but not easily in your case.
uniq's -f and -s options skip fields and characters respectively when comparing lines. However, neither of these quite does what you want:
-f counts fields separated by whitespace, but your fields are separated by commas.
-s skips a fixed number of characters, but your names are of variable length.
Overall, though, uniq is used to compress input by consolidating duplicates into unique lines. You actually want to retain the duplicates and eliminate the singletons, which is the opposite of what uniq does. It would appear you need a different approach.
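If you do want to build it from cut, sort and uniq, one possible sketch (it assumes, as in your sample, that the file has exactly two comma-separated columns, so the value sits at the end of each line):
# find the column-2 values that occur more than once, turn each into an
# end-of-line anchored pattern such as ,3$ and filter the original file with it
grep -f <(cut -d, -f2 file | sort | uniq -d | sed 's/^/,/; s/$/$/') file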
I have a dictionary defined in Vim. I need to do a search in a text file and match all occurrences of the words in the dictionary.
For example, I could do a search /[[:alpha:]] to match all letters in my file; I was thinking of something like /[[:dictionary:]] that would match all the words in the previously defined dictionary. Is there a way to do this?
If all you want is to count words from a dictionary, assuming the words consist only of ASCII letters, and the dictionary has exactly one word per line:
tr -cs A-Za-z '\n' <file.txt | fgrep -xof dictionary.txt | sort | uniq -c
Here tr -cs A-Za-z '\n' splits the text into one word per line (every run of non-letters becomes a single newline), fgrep -x -o -f dictionary.txt keeps only the words that exactly match a dictionary entry, and sort | uniq -c counts how often each of them occurs.
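For instance, with a made-up dictionary.txt containing cat and dog, and a file.txt containing "The cat chased the dog; the dog ran.", the pipeline would print something like:
$ tr -cs A-Za-z '\n' <file.txt | fgrep -xof dictionary.txt | sort | uniq -c
      1 cat
      2 dog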
So I have two dictionaries to compare (American English vs British English).
How do I use the uniq command to count (-c) how many words are in the American English or the British English dictionary but not in both?
Also how do I count the number of word occurrences of one dictionary that appears in a different dictionary?
Just trying to understand how uniq works on a more complicated level. Any help is appreciated!
Instead of uniq, use the comm command for this. It finds lines that are in common between two files, or are unique to one or the other.
This counts all the words that are in one dictionary but not in both:
comm -3 american british | wc -l
This counts the words that are in both dictionaries:
comm -12 american british | wc -l
By default, comm shows the lines that are only in the first file in column 1, the lines that are only in the second file in column 2, and the lines in both files in column 3. You can then use the -[123] options to tell it to leave out the specified columns. So -3 only shows columns 1 and 2 (the unique words in each file), while -12 only shows column 3 (the common words).
It requires that the files be sorted, which I assume your dictionary files are.
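As a quick illustration, take two small made-up word lists: american containing color, gray, truck and british containing colour, gray, lorry (both already sorted). Then comm prints the three tab-indented columns, and the counts come out as:
$ comm american british
color
	colour
		gray
	lorry
truck
$ comm -3 american british | wc -l
4
$ comm -12 american british | wc -l
1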
You can also do it with uniq. It has the option -u to show only lines that appear once, and -d to show only lines that are repeated.
sort american british | uniq -u | wc -l # words in just one language
sort american british | uniq -d | wc -l # words in both languages
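With the same made-up lists as above, these give:
$ sort american british | uniq -u | wc -l
4
$ sort american british | uniq -d | wc -l
1
Note that this assumes each word appears at most once within each file; a word duplicated inside a single dictionary would otherwise show up in the uniq -d output as well.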