Grep (a.txt: English word list, b.txt: one string per line) Q: is a string from b.txt built only from words or not?

I have a list of English words (one per line, around 100,000) in a.txt, and a file b.txt that contains strings (around 50,000 lines, one string per line; a string can be a pure word, a word plus something else, or garbage). I would like to know which strings from b.txt consist of English words only (without any additional characters).
Can I do this with grep?
Example:
a.txt:
apple
pie
b.txt:
applepie
applebs
bspie
bsabcbs
Output:
c.txt:
applepie

Since your question is underspecified, maybe this answer can help as a shot in the dark to clarify your question:
c='cat b.txt'
while IFS='' read -e line
do
c="$c | grep '$line'"
done < a.txt
eval "$c" > c.txt
But this would also match a line like "this is my apple on a pie". I don't know if that's what you want.
This is another try:
re=''
while IFS='' read -e line
do
re="$re${re:+|}$line"
done < a.txt
grep -E "^($re)*$" b.txt > c.txt
This will only let through lines which consist of nothing but a concatenation of these words. But it will also let through things like appleapplepieapplepiepieapple. Again, I don't know if this is what you want.
Given your latest explanation in the question I would propose another approach (because building such an expression out of 100,000+ words is not going to work).
A working approach for this amount of words could be to remove all recognized words from the text and see which lines get emptied in the process. This can easily be done iteratively without exploding the memory usage or other resources. It will take time, though.
cp b.txt inprogress.txt
while IFS='' read -e line
do
sed -i "s/$line//g" inprogress.txt
done < a.txt
for lineNumber in $(grep -n '^$' inprogress.txt | sed 's/://')
do
sed -n "${lineNumber}p" b.txt
done
rm inprogress.txt
But this still would not really solve your issue: consider having the words to and potato in your list. If to happens to be removed first, it leaves pota in your text file, and pota is not a word, so it will never be removed.
You could address that issue by sorting your word file by word length (longest words first), but that would still be problematic in some cases of compound words, e.g. redart (being red + art): dart would be removed first, so re would remain, and if re is not in your word list, you would not recognize this compound word.
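If you want to try that ordering anyway, here is a minimal sketch of it (my addition, assuming GNU awk, sort and cut are available; a_sorted.txt is just a hypothetical file name):
awk '{ print length($0), $0 }' a.txt | sort -rn | cut -d' ' -f2- > a_sorted.txt
Then use a_sorted.txt instead of a.txt when feeding the removal loop above.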
Actually, your problem is one of logical programming and natural language processing and probably is not a good fit for SO. You should have a look at the language Prolog, which is designed around problems like yours.

I will post this as an answer as well since I feel this is the correct answer to your specific question.
Your requirement is to check, based on a word list (a.txt) containing English words, which lines of a file (b.txt) are built from English words only. Based on the example in your question, the word list does not contain compound words (e.g. applepie), but you would still like to match the file against compound words built from words in your word list (e.g. apple and pie).
There are two problems you are facing:
Not every permutation of words in a.txt will be a valid English compound word, so based on this alone your problem is already impossible to solve.
If you nonetheless attempted to build such a list of compound words yourself by compiling all possible permutations, you could not easily do so because of the size of your word list (and the resulting memory problems). You would most probably have to store your words in a more complex data structure, e.g. a tree, and build permutations on the fly by traversing it, which is not doable in shell scripting.
Because of these points, and because your actual question is "can this be done with grep?", the answer is no, this is not possible.

Related

How would I use grep to find all the words in a file that start with a-f

Here is the code to look for all the words in a file starting with a. What about words that start with a-f?
grep -E '\ba' data.txt
Since the question may be quite ambiguous, I will try to cover all the possible solutions I can find for it. Let's start with a sample file, in my case containing the following words, so I can explain myself better:
ascii
affair
a-f-f-i-l-i-a-t-e
barcode
break
charset
character
delete
draft
execute
example
force
failure
grip
group
halt
held
inode
instance
joke
jekyll
That said, if you want to grep all the words starting with the letters that go from a to f (that is, every word starting with a, b, c, d, e or f), you would use:
grep -E "^[a-f]" data.txt
And the output you would see is this one:
ascii
affair
barcode
break
charset
character
delete
draft
execute
example
force
failure
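As a side note (not part of the original question), if the words were not one per line, a word-boundary variant could pull them out of longer lines. This is just a sketch, assuming GNU grep, which understands \b in -E patterns:
grep -oE '\b[a-f][[:alpha:]]*' data.txt
The -o flag prints each match on its own line instead of the whole matching line.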
Now, if for some reason you want to filter the words starting literally with the string a-f, then you would use this:
grep -E "^a-f" data.txt
And you would see that the output would look like this:
a-f-f-i-l-i-a-t-e
In this case the key is not so much grep knowledge as regular expressions.
Hope you find it useful.

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:-
xxx
(null)
yyy
Now, I am trying to prefix my search texts to the output like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know your exact input file content (it is not specified properly in the question), I will make some hypotheses in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If that is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns that you are looking for (here, 1 and 2).
REMARK:
Do not cat the file and pipe it into grep: grep can read the file itself, so the extra cat just copies all the data through a pipe for nothing. It might not be that important on small files, but you will feel the difference on 10 GB files, for example!
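For the command in the question, that simply means dropping the cat:
egrep -i "abc|def|efg" Output.txt | cut -d ':' -f 2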
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format, with your pattern possibly located in any column:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS sets your input and output field separators to :. Then, on each line, you define a test variable that becomes true when you reach your pattern: you loop over all the fields of the line until you find one that matches; if you do, you print that field, break the loop, and then print the second field plus an end-of-line character.
If you do not meet your pattern, you do nothing.
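For readability, here is the same awk program spread over several lines; only whitespace and comments are added, the logic is unchanged:
awk 'BEGIN { FS = ":"; ORS = FS }          # read and write ":"-separated fields
{
    tmp = 0
    for (i = 1; i <= NF; i++) {            # scan every field of the line
        tmp = match($i, /abc|def|efg/)     # tmp becomes non-zero on a match
        if (tmp) { print $i; break }       # print the matching field plus ":"
    }
    if (tmp) printf "%s\n", $2             # then print the 2nd field and a newline
}' grep_file2.in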
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} is used to filter the lines that contain one of the patterns provided; the instructions in the block are then executed. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the line with the matched pattern followed by :. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold space and extracts the 2nd column element. Last but not least, H;x;s/\n//p regroups both extracted elements on one line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
This will print the match and the next token after the delimiter.
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

How to make a Palindrome with a sed command?

I'm trying to find the code that searches for all palindromes in a dictionary file.
This is what I've got at the moment, which is wrong:
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
Can somebody explain the code as well?
Found the right answer.
sed -n '/^\([a-z]\)\([a-z]\)\2\1$/p' /usr/share/dict/words
I have no idea why I used the -.
I also don't have an explanation for the \ after each group.
You can use the grep command as explained here
grep -w '^\(.\)\(.\).\2\1'
Explanation: the grep command matches any first three letters using \(.\)\(.\)., and after that checks whether the 2nd character and then the 1st character occur again.
The above grep command will find only 5-letter palindrome words.
An extended version is proposed on that page as well; it works correctly for the first line but then crashes... there is surely something good to keep and maybe to adapt...
Guglielmo Bondioni proposed a single RE that finds all palindromes up to 19 characters long using 9 subexpressions and 9 back-references:
grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1' file
You can extend this further as much as you want :)
Perl to the rescue:
perl -lne 'print if $_ eq reverse' /usr/share/dict/words
Hate to say it, but while regex may be able to cook your breakfast, I don't think it can find a palindrome. According to the all-knowing Wikipedia:
In the automata theory, a set of all palindromes in a given alphabet is a typical example of a language that is context-free, but not regular. This means that it is impossible for a computer with a finite amount of memory to reliably test for palindromes. (For practical purposes with modern computers, this limitation would apply only to incredibly long letter-sequences.)
In addition, the set of palindromes may not be reliably tested by a deterministic pushdown automaton which also means that they are not LR(k)-parsable or LL(k)-parsable. When reading a palindrome from left-to-right, it is, in essence, impossible to locate the "middle" until the entire word has been read completely.
So a regular expression won't be able to solve the problem, by the problem's very nature, but a computer program (or sed examples like @NeronLeVelu's or @potong's) will work.
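For completeness, a plain shell take on the same idea (just a sketch, assuming the rev utility is available):
while IFS= read -r word
do
    [ "$word" = "$(rev <<< "$word")" ] && printf '%s\n' "$word"
done < /usr/share/dict/words
This is much slower than the Perl one-liner above, since it starts one rev process per word, but it shows the same reverse-and-compare check in plain bash.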
explanation of your code
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
Select and print lines that correspond to: a first (line-starting) lowercase alphabetic character, followed by -, followed by another lowercase alphabetic character (which could be the same as the first), followed by the last letter of the previous group, followed by the first letter, i.e. Letter1-Letter2Letter2Letter1, and no other element (end of line).
sample:
a-bba
a is first letter
b second letter
b is \2
a is \1
But it's a bit strange for this to match anything unless it comes from a very specific dictionary (limited to such combinations, for example).
This might work for you (GNU sed):
sed -r 'h;s/[^[:alpha:]]//g;H;x;s/\n/&&/;ta;:a;s/\n(.*)\n(.)/\n\2\1\n/;ta;G;/\n(.*)\n\n\1$/IP;d' file
This copies the original string(s) to the hold space (HS), then removes everything but alpha characters from the string(s) and appends this to the HS. The second copy is then reversed and the current string(s) and the reversed copy compared. If the two strings are equal then the original string(s) is printed out otherwise the line is deleted.

What are the differences among grep, awk & sed? [duplicate]

This question already has answers here:
What are the differences between Perl, Python, AWK and sed? [closed]
(5 answers)
What is the difference between sed and awk? [closed]
(3 answers)
Closed last month.
I am confused about the differences between grep, awk and sed in terms of their role in Unix/Linux system administration and text processing.
Short definition:
grep: search for specific terms in a file
#usage
$ grep This file.txt
Every line containing "This"
Every line containing "This"
Every line containing "This"
Every line containing "This"
$ cat file.txt
Every line containing "This"
Every line containing "This"
Every line containing "That"
Every line containing "This"
Every line containing "This"
Now, awk and sed are completely different from grep.
awk and sed are text processors. Not only do they have the ability to find what you are looking for in text, they have the ability to remove, add and modify the text as well (and much more).
awk is mostly used for data extraction and reporting; sed is a stream editor.
Each one of them has its own functionality and specialties.
Example
Sed
$ sed -i 's/cat/dog/' file.txt
# this will replace the first occurrence of 'cat' with 'dog' on each line
# (add the g flag, s/cat/dog/g, to replace every occurrence)
Awk
$ awk '{print $2}' file.txt
# this will print the second column of file.txt
Basic awk usage:
Compute sum/average/max/min/etc. what ever you may need.
$ cat file.txt
A 10
B 20
C 60
$ awk 'BEGIN {sum=0; count=0; OFS="\t"} {sum+=$2; count++} END {print "Average:", sum/count}' file.txt
Average: 30
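A similar sketch (my addition, using the same sample file) for the maximum of the second column:
$ awk 'NR == 1 || $2 > max { max = $2 } END { print "Max:", max }' file.txt
Max: 60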
I recommend that you read this book: Sed & Awk: 2nd Ed.
It will help you become a proficient sed/awk user in any unix-like environment.
Grep is useful if you want to quickly search for lines that match in a file. It can also return some other simple information like matching line numbers, match count, and file name lists.
Awk is an entire programming language built around reading CSV-style files, processing the records, and optionally printing out a result data set. It can do many things but it is not the easiest tool to use for simple tasks.
Sed is useful when you want to make changes to a file based on regular expressions. It allows you to easily match parts of lines, make modifications, and print out results. It's less expressive than awk, but that lends it to somewhat easier use for simple tasks. It has many more complicated operators you can use (I think it's even Turing-complete), but in general you won't use those features.
I just want to mention one thing: there are many tools that can do text processing, e.g.
sort, cut, split, join, paste, comm, uniq, column, rev, tac, tr, nl, pr, head, tail.....
They are very handy, but you have to learn their options, etc.
A lazy way (not the best way) to learn text processing might be: learn only grep, sed and awk. With these three tools, you can solve almost 99% of text processing problems and don't need to memorize all the different commands and options above. :)
And, once you've learned and used these three, you will know the difference. Actually, the difference here means which tool is good at solving which kind of problem.
An even lazier way might be to learn a scripting language (Python, Perl or Ruby) and do all text processing with it.

Determining Word Frequency of Specific Terms

I'm a non-computer-science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to determine changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100) and for consistency's sake would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a "stop list").
This should be kept very simple. At the end all I need to have is the frequencies of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file) so that I can then create graphs and visualizations using that data.
I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.
I see two ways to solve this problem:
Find a way to strip out all the words in a text file EXCEPT those in the pre-defined list and then do the frequency count from there, or:
Find a way to do a frequency count using just the terms from the pre-defined list.
Any ideas?
I would go with the second idea. Here is a simple Perl program that will read a list of words from the first file provided and print a count of each word in the list from the second file provided in tab-separated format. The list of words in the first file should be provided one per line.
#!/usr/bin/perl
use strict;
use warnings;
my $word_list_file = shift;
my $process_file = shift;
my %word_counts;
# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
open(WORDS, $word_list_file) or die "Failed to open list file: $!\n";
while (<WORDS>) {
chomp;
# Store words in lowercase for case-insensitive match
$word_counts{lc($_)} = 0;
}
close(WORDS);
# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(FILE, $process_file) or die "Failed to open process file: $!\n";
while (<FILE>) {
chomp;
while ( /-$/ ) {
# If the line ends in a hyphen, remove the hyphen and
# continue reading lines until we find one that doesn't
chop;
my $next_line = <FILE>;
defined($next_line) ? $_ .= $next_line : last;
}
my @words = split /\b/, lc; # Split the lower-cased version of the string
foreach my $word (@words) {
$word_counts{$word}++ if exists $word_counts{$word};
}
}
close(FILE);
# Print each word in the hash in alphabetical order along with the
# number of time encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts)
{
print "$word\t$word_counts{$word}\n"
}
If the file words.txt contains:
linux
frequencies
science
words
And the file text.txt contains the text of your post, the following command:
perl analyze.pl words.txt text.txt
will print:
frequencies 3
linux 1
science 1
words 3
Note that breaking on word boundaries using \b may not work the way you want in all cases, for example, if your text files contain words that are hyphenated across lines you will need to do something a little more intelligent to match these. In this case you could check to see if the last character in a line is a hyphen and, if it is, just remove the hyphen and read another line before splitting the line into words.
Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.
Note that if there are hyphenated words, some of which are broken across lines and some that are not, this won't find them all because it only removed hyphens at the end of a line. In this case you may want to just remove all hyphens and match words after the hyphens are removed. You can do this by simply adding the following line right before the split function:
s/-//g;
I do this sort of thing with a script like the following (in bash syntax):
for file in *.txt
do
sed -r 's/([^ ]+) +/\1\n/g' "$file" \
| grep -F -f 'go-words' \
| sort | uniq -c > "${file}.frq"
done
You can tweak the regex you use to delimit individual words; in the example I just treat whitespace as the delimiter. The -f argument to grep is a file that contains your words of interest, one per line.
First familiarize yourself with lexical analysis and how to write a scanner generator specification. Read the introductions to using tools like YACC, Lex, Bison, or my personal favorite, JFlex. Here you define what constitutes a token. This is where you learn about how to create a tokenizer.
Next you have what is called a seed list. The opposite of the stop list is usually referred to as the start list or limited lexicon. Lexicon would also be a good thing to learn about. Part of the app needs to load the start list into memory so it can be quickly queried. The typical way to store is a file with one word per line, then read this in at the start of the app, once, into something like a map. You might want to learn about the concept of hashing.
From here you want to think about the basic algorithm and the data structures necessary to store the result. A distribution is easily represented as a two dimensional sparse array. Learn the basics of a sparse matrix. You don't need 6 months of linear algebra to understand what it does.
Because you are working with larger files, I would advocate a stream-based approach. Don't read in the whole file into memory. Read it as a stream into the tokenizer that produces a stream of tokens.
In the next part of the algorithm think about how to transform the token list into a list containing only the words you want. If you think about it, the list is in memory and can be very large, so it is better to filter out non-start-words at the start. So at the critical point where you get a new token from the tokenizer and before adding it to the token list, do a lookup in the in-memory start-words-list to see if the word is a start word. If so, keep it in the output token list. Otherwise ignore it and move to the next token until the whole file is read.
Now you have a list of tokens only of interest. The thing is, you are not looking at other indexing metrics like position and case and context. Therefore, you really don't need a list of all tokens. You really just want a sparse matrix of distinct tokens with associated counts.
So, first create an empty sparse matrix. Then think about the insertion of a newly found token during parsing. When it occurs, increment its count if it's already in the list, or otherwise insert a new token with a count of 1. This time, at the end of parsing the file, you have a list of distinct tokens, each with a frequency of at least 1.
That list is now in-mem and you can do whatever you want. Dumping it to a CSV file would be a trivial process of iterating over the entries and writing each entry per line with its count.
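If you prefer to stay on the command line, the counting part of this idea can be sketched with awk's associative arrays. This is only a rough sketch: wordlist.txt and text.txt are hypothetical file names, tokenization is by whitespace only, and punctuation is not handled:
awk 'NR == FNR { keep[tolower($1)] = 1; next }       # first file: load the start list
     { for (i = 1; i <= NF; i++) {                   # remaining input: check every token
           w = tolower($i)
           if (w in keep) count[w]++                 # count only start-list words
       }
     }
     END { for (w in count) printf "%s\t%d\n", w, count[w] }' wordlist.txt text.txt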
For that matter, take a look at the non-commercial product called "GATE" or a commercial product like TextAnalyst or products listed at http://textanalysis.info
I'm guessing that new files get introduced over time, and that's how things change?
I reckon your best bet would be to go with something like your option 2. There's not much point pre-processing the files, if all you want to do is count occurrences of keywords. I'd just go through each file once, counting each time a word in your list appears. Personally I'd do it in Ruby, but a language like perl or python would also make this task pretty straightforward. E.g., you could use an associative array with the keywords as the keys, and a count of occurrences as the values. (But this might be too simplistic if you need to store more information about the occurrences).
I'm not sure if you want to store information per file, or about the whole dataset? I guess that wouldn't be too hard to incorporate.
I'm not sure what to do with the data once you've got it -- exporting it to a spreadsheet would be fine, if that gives you what you need. Or you might find it easier in the long run just to write a bit of extra code that displays the data nicely for you. It depends on what you want to do with the data (e.g. if you want to produce just a few charts at the end of the exercise and put them into a report, then exporting to CSV would probably make the most sense, whereas if you want to generate a new set of data every day for a year, then building a tool to do that automatically is almost certainly the best idea).
Edit: I just figured out that since you're studying history, the chances are your documents are not changing over time, but rather reflect a set of changes that happened already. Sorry for misunderstanding that. Anyway, I think pretty much everything I said above still applies, but I guess you'll lean towards going with exporting to CSV or what have you rather than an automated display.
Sounds like a fun project -- good luck!
Ben
I'd do a "grep" on the files to find all the lines that contain your key words (grep -f can be used to specify an input file of words to search for), and pipe the output of grep to a file. That will give you a list of lines which contain instances of your words. Then do a "sed" to replace your word separators (most likely spaces) with newlines, to give you a file of separate words (one word per line). Now run through grep again, with your same word list, except this time specify -c (to get a count of the lines with the specified words, i.e. a count of the occurrences of each word in the original file).
The two-pass method simply makes life easier for "sed"; the first grep should eliminate a lot of lines.
You can do this all in basic linux command-line commands. Once you're comfortable with the process, you can put it all into shell script pretty easily.
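Put together, a rough (untested) sketch of that idea might look like this, with go-words as the one-word-per-line keyword list and input.txt as a stand-in for your text file, and with tr doing the separator-to-newline step instead of sed:
grep -F -f go-words input.txt \
  | tr -s ' ' '\n' \
  | grep -x -F -f go-words \
  | sort | uniq -c
The final sort | uniq -c gives a per-word count directly, which may be more convenient than running grep -c once per word.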
Another Perl attempt:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Tie::File;
# Usage:
#
# $ perl WordCount.pl <Files>
#
# Example:
#
# $ perl WordCount.pl *.text
#
# Counts words in all files given as arguments.
# The words are taken from the file "WordList".
# The output is appended to the file "WordCount.out" in the format implied in the
# following example:
#
# File,Word1,Word2,Word3,...
# File1,0,5,3,...
# File2,6,3,4,...
# .
# .
# .
#
### Configuration
my $CaseSensitive = 1; # 0 or 1
my $OutputSeparator = ","; # another option might be "\t" (TAB)
my $RemoveHyphenation = 0; # 0 or 1. Careful, may be too greedy.
###
my @WordList = read_file("WordList");
chomp @WordList;
tie (my @Output, 'Tie::File', "WordCount.out");
push (@Output, join ($OutputSeparator, "File", @WordList));
for my $InFile (@ARGV)
{ my $Text = read_file($InFile);
if ($RemoveHyphenation) { $Text =~ s/-\n//g; };
my %Count;
for my $Word (@WordList)
{ if ($CaseSensitive)
{ $Count{$Word} = ($Text =~ s/(\b$Word\b)/$1/g); }
else
{ $Count{$Word} = ($Text =~ s/(\b$Word\b)/$1/gi); }; };
my $OutputLine = "$InFile";
for my $Word (@WordList)
{ if ($Count{$Word})
{ $OutputLine .= $OutputSeparator . $Count{$Word}; }
else
{ $OutputLine .= $OutputSeparator . "0"; }; };
push (@Output, $OutputLine); };
untie @Output;
When I put your question in the file wc-test and Robert Gamble's answer into wc-ans-test, the Output file looks like this:
File,linux,frequencies,science,words
wc-ans-test,2,2,2,12
wc-test,1,3,1,3
This is a comma-separated values (CSV) file (but you can change the separator in the script). It should be readable by any spreadsheet application. For plotting graphs, I would recommend gnuplot, which is fully scriptable, so you can tweak your output independently of the input data.
To hell with big scripts. If you're willing to grab all words, try this shell fu:
cat *.txt | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c | sort -rn |
sed 's/[0-9] /&, /'
That (tested) will give you a list of all words sorted by frequency in CSV format, easily imported by your favorite spreadsheet. If you must have the stop words then try inserting grep -w -F -f stopwords.txt into the pipeline (not tested).
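One reasonable placement for that filter (still untested, in the spirit of the original answer; stopwords.txt here is really the list of words you want to keep):
cat *.txt | tr A-Z a-z | tr -cs a-z '\n' | grep -w -F -f stopwords.txt |
sort | uniq -c | sort -rn | sed 's/[0-9] /&, /'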

Resources