I have a dictionary defined in Vim. I need to do a search in a text file and match all occurrences of the words in the dictionary.
For example, I can search with /[[:alpha:]] and match all letters in my file; I was thinking of something like /[[:dictionary:]] that would match all the words in the previously defined dictionary. Is there a way to do this?
If all you want is to count words from a dictionary, assuming the words consist only of ASCII letters, and the dictionary has exactly one word per line:
tr -cs A-Za-z '\n' <file.txt | fgrep -xof dictionary.txt | sort | uniq -c
I have a directory with n text files. Now I want to check whether any of these files contains one (or more) of the words from a constant file.
These files are all dictionaries with different numbers of words. The constant file is a password list against which I want to check those words. The number of correct hits should be saved in a variable. The matching words themselves should also be saved in a variable (as an array, I think).
For example: file1 contains This is my dictionary, file2 contains And another one, my password list contains this is a test for the dictionary and we have no other one.
The hits from file1 are This is dictionary (n1=3 words) and from file2 and one (n2=2 words).
My current code is:
#!/bin/bash
# program_call passwordlist.txt *.txt
passwordlist="$1"
dictionarys="$*"
for comparison in $dictionarys; do
    cat $passwordlist $comparison | sort | uniq -d > ${comparison}.compare
done
One of my biggest problems here is that I have a varying number of dictionaries, maybe 2, maybe 200. Regardless, all of them have to be checked against the password list, and the result (the number of correct words and the correct words themselves) has to be saved in its own variables. So I think that means two variables for each dictionary.
Another way:
$ for f in file{1,2}; do
    echo -n "$f: "
    grep -iow -f <(tr ' ' '\n' < cons) "$f" | wc -l
  done
file1: 3
file2: 2
This converts the constants file to one word per line, checks the dictionary files for word matches while ignoring case, and counts the matched occurrences.
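If you also need the matching words themselves and not just the count, a minimal sketch building on the same grep call could look like this (cons, file1, and file2 are just the example file names from above):
#!/bin/bash
# For each file, store the matching words in an array and the hit count in a variable.
for f in file1 file2; do
    mapfile -t hits < <(grep -iow -f <(tr ' ' '\n' < cons) "$f")
    count=${#hits[@]}
    echo "$f: $count hits: ${hits[*]}"
done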
My solution:
#!/bin/bash
# program_call_is passwordlist.txt /dictionarys/*.txt
passwordlist="$1"
shift
dictionarys="$*"
for comparison in $dictionarys; do
    fgrep -x -f "$passwordlist" "$comparison" > "${comparison}.compare"
done
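If you also want the hit count and the hit words in variables rather than only in the .compare files, one possible addition inside the loop (a sketch, using bash's mapfile) is:
    count=$(wc -l < "${comparison}.compare")    # number of matching words
    mapfile -t words < "${comparison}.compare"  # the matching words, as an array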
I have multiple text files from which I need to extract a variable amount of characters between two specific words, "".
Can someone give me an example grep pattern that will find all characters of any kind, including spaces, between these two words so I can then replace with a blank space? Thank you.
I don't have any example code I can put in my question. I am using a text editing program, and I would like to find all the text between two unique words in the file and delete it; the text editing program allows the use of grep patterns.
You can use the grep line below to search for a pattern like word1 something and other things and then word2:
grep -o -E "word1(\b).*word2(\b)" file.txt
As you may notice, this command's output also includes word1 and word2.
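If the goal is to delete everything between the two words but keep the words themselves, a sed substitution along these lines may work, assuming both words appear on the same line (word1 and word2 are placeholders, as above):
sed 's/word1.*word2/word1 word2/' file.txt
Note that .* is greedy, so if word2 appears more than once on a line the match extends to its last occurrence.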
I have a big file; the lines look like this:
Text numbers etc. [Man-(some numbers)]
There is a lot of this; Man-somenumbers is repeated on a few lines, and I want to count only the unique Man- words. I can't just run uniq on the file, because the text before the Man- words is different on every line.
How can I count only the unique Man-somenumbers words in the file?
If I understand what you want to do correctly, then
grep -oE 'Man-[0-9]+' filename | sort | uniq -c
should do the trick. It works as follows: First
grep -oE 'Man-[0-9]+' filename
isolates all words from the file that match the Man-[0-9]+ regular expression. That list is then piped through sort to get the sorted list that uniq requires, and then that sorted list is piped through uniq -c to count how often each unique Man- word appears.
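For example, with a small made-up input file (the Man- numbers are invented for illustration):
$ cat filename
Text numbers etc. [Man-123] more text
different text here Man-123 again
and something else Man-456
$ grep -oE 'Man-[0-9]+' filename | sort | uniq -c
      2 Man-123
      1 Man-456
If all you need is the number of distinct Man- words, you can count the lines of the de-duplicated list instead: grep -oE 'Man-[0-9]+' filename | sort -u | wc -l.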
I have a dictionary with words separated by line breaks. How can I find all the words that are three characters or fewer?
You can just do:
egrep -x '.{1,3}' myfile
This will also skip blank lines, which are technically not words. Unfortunately, the above regex will count apostrophes in contractions as characters, as well as hyphens in hyphenated compound words. Hyphenated compound words are not a problem at such a low letter count, but I am not sure whether you want to count apostrophes in contractions, which are possible (e.g., I'm). You can try a regex such as:
egrep -x '\w{1,3}' myfile
..., but this will only match word characters (letters, digits, and the underscore) and will not match contractions or hyphenated compound words at all.
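If you want to allow an apostrophe or a hyphen inside a word while still counting only the letters, one rough sketch (assuming GNU grep, and not tested against every edge case) is:
egrep -x "[[:alpha:]](['-]?[[:alpha:]]){0,2}" myfile
This matches one to three letters with at most one apostrophe or hyphen between consecutive letters, so I'm counts as a two-letter word.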
Like this:
grep -v "^...." my_file
Try this regular expression:
grep -E '^.{1,3}$' your_dictionary
So I have two dictionaries to compare (American English vs. British English).
How do I use the uniq command to count (-c) how many words are in American English or British English but not both?
Also, how do I count the number of words from one dictionary that appear in a different dictionary?
Just trying to understand how uniq works on a more complicated level. Any help is appreciated!
Instead of uniq, use the comm command for this. It finds lines that are in common between two files, or are unique to one or the other.
This counts all the words that are in one dictionary but not both:
comm -3 american british | wc -l
This counts the words that are in both dictionaries:
comm -12 american british | wc -l
By default, comm shows the lines that are only in the first file in column 1, the lines that are only in the second file in column 2, and the lines in both files in column 3. You can then use the -[123] options to tell it to leave out the specified columns. So -3 only shows columns 1 and 2 (the unique words in each file), while -12 only shows column 3 (the common words).
It requires that the files be sorted, which I assume your dictionary files are.
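As a quick illustration with two tiny sorted word lists (made up for the example), the columns look like this:
$ cat american
color
gray
truck
$ cat british
colour
gray
lorry
$ comm american british
color
	colour
		gray
	lorry
truck
$ comm -3 american british | wc -l
4
$ comm -12 american british | wc -l
1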
You can also do it with uniq. It has options -u to show only lines that appear once, and -d to show only lines that are repeated. (This assumes that neither dictionary contains duplicate lines of its own.)
sort american british | uniq -u | wc -l # words in just one language
sort american british | uniq -d | wc -l # words in both languages