Parsing program output line by line on-the-fly - linux

I have a program that does some heavy processing (ML algorithm) and writes lots (read GBs of plain text) of data to standard output. In some particular scenarios, I only require a tiny portion of the output, however right now I am saving a (huge) text file and then parsing the lines in there to get my data.
While totally effective, my approach is awfully inefficient. Is there a way to avoid generating such big files (since most of the data will be discarded anyway) and do the parsing on the fly, line by line?
Execute:
./myProgram model test > myOutput
myOutput content (millions of lines):
0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]
0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]
....
What I think could be the best option would be to use the Unix pipeline to chain results, but I do not know how to send the data line by line to, let's say, a Python or Java app to parse it.
./myProgram model test | <now what>

To read and write data in the script you want to use to filter the data just read and write from/to standard input/output.
./myProgram model test | ./filter.py > myOutput
filter.py:
import sys
for line in sys.stdin:
    if some_condition:
        sys.stdout.write(line)
If the condition is just to have some pattern in the data you don't need a script, you can simply use grep to filter the lines:
./myProgram model test | grep 'interesting_pattern' > myOutput

That pipeline does exactly that. It sends the data (possibly buffered) to the program on the RHS of the pipeline.
That program can then operate on that data in any way it wants.
Programs like grep, sed and awk operate on that data in line-oriented fashion.
Other programs can do other things as they want/need to.

./myProgram model test | now what
If I understand correctly that you want [ I care data inside brackets ] (just the data between the brackets), then one proper approach is to pipe the output to sed and use a backreference to replace the line of text with just what is inside the brackets. So "now what" is sed -e 's/^.*[[]\(.*\)[]].*$/\1/'. Or put together:
./myProgram model test | sed -e 's/^.*[[]\(.*\)[]].*$/\1/' > myOutput
If your program provides the output provided, an example is:
$ echo "0, blah blah blah thousand of more blahs -> [ I care data inside brackets ]" |
sed -e 's/^.*[[]\(.*\)[]].*$/\1/'
I care data inside brackets
A brief explanation of the regular expression is relatively easy if you look at it in pieces:
's/^.*[[]\(.*\)[]].*$/\1/'
Is a simple substitution expression of the form s/this/that/. Looking at the first (or this) part you have:
^.* # from the beginning of the line, match as many characters as possible
[[] # up to the last open bracket [ (the preceding .* is greedy)
\( # begin saving the pattern that follows
.* # all characters in this case
\) # stop collecting the pattern
[]] # before the last close bracket ]
.*$ # and then all remaining characters in the line.
Next the second (or that) part of the s/this/that/ expression is a backreference that says:
\1 # substitute the (1st) pattern you collected. All between \(...\) for the line.
Which when put together simply says replace the line with just what is between the brackets. (and of course if I didn't understand what you needed, it's a long explanation down the tubes.)
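Since the question mentions Python as a possible consumer, here is a minimal sketch of the same extraction as a pipeline filter. The function name extract_bracketed is made up for illustration. It uses the same greedy pattern as the sed expression, with two assumed differences: lines without brackets are skipped rather than passed through, and surrounding spaces are trimmed.

```python
import re
import sys

# Same greedy pattern as the sed expression: capture what lies between
# the last "[" and the last "]" on the line.
BRACKETED = re.compile(r"^.*\[(.*)\].*$")


def extract_bracketed(lines):
    """Yield the bracketed payload of each line that has one.

    Hypothetical helper: lines without brackets are skipped (sed would
    print them unchanged) and surrounding whitespace is stripped.
    """
    for line in lines:
        match = BRACKETED.match(line.rstrip("\n"))
        if match:
            yield match.group(1).strip()


if __name__ == "__main__":
    # sys.stdin is consumed lazily, so nothing is held in memory
    # beyond the current line.
    for payload in extract_bracketed(sys.stdin):
        print(payload)
```

Saved as, say, brackets.py, it would be used as ./myProgram model test | python3 brackets.py > myOutput.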

If indeed the output is line oriented and you want to extract or process some of it, pipe the output to some awk command, i.e.
./myProgram model test | awk ...
Of course, replace the ... with appropriate arguments for awk. Learn more about GNU awk (a.k.a. gawk); it is designed for such tasks:
If you are like many computer users, you would frequently like to make changes in various text files wherever certain patterns appear, or extract data from parts of certain lines while discarding the rest. To write a program to do this in a language such as C or Pascal is a time-consuming inconvenience that may take many lines of code. The job is easy with awk, especially the GNU implementation: gawk.
Alternatively, you could modify your original ./myProgram so that it fills a database, either with SQLite (an easy-to-use library) or with something more serious like PostgreSQL or MongoDB.

Related

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer: Dangerous Backticks
while read -r f
do
    echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gze
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < temp.csv)
do
    echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
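If the producer of the file cannot be fixed, the carriage-return cleanup can also be done outside the shell. A rough sketch in Python, assuming one sample name per line; the function name fastq_names and its default arguments are invented for this example:

```python
def fastq_names(lines, prefix="fastq/", suffix="_1.fastq.gze"):
    """Yield prefix + sample + suffix for each non-empty sample name.

    str.strip() removes "\n", "\r\n" and a stray "\r" alike, so Windows
    line endings no longer leak into the generated names.
    """
    for line in lines:
        sample = line.strip()
        if sample:
            yield f"{prefix}{sample}{suffix}"


if __name__ == "__main__":
    # Inline sample with CRLF line endings, standing in for files.csv.
    sample_lines = "Sample_11\r\nSample_12\r\n".splitlines(keepends=True)
    for name in fastq_names(sample_lines):
        print(name)
```

For a real file, passing an open file object works the same way; Python's default text mode already treats a bare carriage return as a line ending.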

Bash - process backspace control character when redirecting output to file

I have to run a third-party program in background and capture its output to file. I'm doing this simply using the_program > output.txt. However, the coders of said program decided to be flashy and show processed lines in real-time, using \b characters to erase the previous value. So, one of the lines in output.txt ends up like Lines: 1(b)2(b)3(b)4(b)5, (b) being an unprintable character with ASCII code 08. I want that line to end up as Lines: 5.
I'm aware that I can write it as-is and post-process the file using AWK, but I wonder if it's possible to somehow process the control characters in-place, by using some kind of shell option or by piping some commands together, so that line would become Lines: 5 without having to run any additional commands after the program is done?
Edit:
Just a clarification: what I wrote here is a simplified version; the actual line count processed by the program is in the hundreds of thousands, so that string ends up quite long.
Thanks for your comments! I ended up piping the output of that program to AWK Script I linked in the question. I get a well-formed file in the end.
the_program | ./awk_crush.sh > output.txt
The only downside is that I get the output only once the program itself has finished, even though the initial output exceeds 5M and should be passed on in smaller chunks. I don't know the exact reason; perhaps the AWK script waits for EOF on stdin. Either way, on a more modern system I would use
stdbuf -oL the_program | ./awk_crush.sh > output.txt
to process the output line by line. I'm stuck on RHEL4 with expired support though, so I can use neither stdbuf nor unbuffer. I'll leave it as-is; it's fine too.
The contents of awk_crush.sh are based on this answer, except with ^H sequences (which are supposed to be ASCII 08 characters entered via VIM commands) replaced with escape sequence \b:
#!/usr/bin/awk -f
function crushify(data) {
    while (data ~ /[^\b]\b/) {
        gsub(/[^\b]\b/, "", data)
    }
    print data
}
{ crushify($0) }
Basically, it replaces character before \b and \b itself with empty string, and repeats it while there are \b in the string - just what I needed. It doesn't care for other escape sequences though, but if it's necessary, there's a more complete SED solution by Thomas Dickey.
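For comparison, the same "erase the character before each backspace" idea can be sketched in Python with a simple stack. The name crush_backspaces is hypothetical. Unlike the AWK version, a backspace with nothing before it is silently dropped instead of being kept.

```python
import sys


def crush_backspaces(line):
    """Apply each "\\b" by deleting the character written just before it."""
    out = []
    for ch in line:
        if ch == "\b":
            if out:  # a backspace at the very start has nothing to erase
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(crush_backspaces(line))
        sys.stdout.flush()  # hand each finished line on immediately
```

Note that this only controls buffering on the filter's side; if the producing program buffers its own output when writing to a pipe, lines will still arrive in large chunks.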
Pipe it to col -b, from util-linux:
the_program | col -b
Or, if the input is a file, not a program:
col -b < input > output
Mentioned in Unix & Linux: Evaluate large file with ^H and ^M characters.

Print previous line if condition is met

I would like to grep a word and then find the second column in the line and check if it is bigger than a value. If yes, I want to print the previous line.
Ex:
Input file
AAAAAAAAAAAAA
BB 2
CCCCCCCCCCCCC
BB 0.1
Output
AAAAAAAAAAAAA
Now, I want to search for BB and if the second column (2 or 0.1) in that line is bigger than 1, I want to print the previous line.
Can somebody help me with grep and awk? Thanks. Any other suggestions are also welcome. Thanks.
This can be a way:
$ awk '$1=="BB" && $2>1 {print f} {f=$0}' file
AAAAAAAAAAAAA
Explanation
$1=="BB" && $2>1 {print f} if the 1st field is exactly BB and the 2nd field is bigger than 1, then print f, a stored value.
{f=$0} store the current line in f, so that it is accessible when reading the next line.
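The same "remember the previous line" technique, sketched in Python for readers who prefer it. The function name previous_lines and its parameters are illustrative, and it assumes whitespace-separated fields with a numeric second column on the matching lines.

```python
def previous_lines(lines, key="BB", threshold=1.0):
    """Yield the line preceding each line whose first field equals key
    and whose second field is a number greater than threshold."""
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split()
        if prev is not None and len(fields) >= 2 and fields[0] == key:
            try:
                value = float(fields[1])
            except ValueError:
                value = None  # non-numeric second column: no match
            if value is not None and value > threshold:
                yield prev
        prev = line  # becomes "the previous line" for the next iteration


if __name__ == "__main__":
    sample = ["AAAAAAAAAAAAA", "BB 2", "CCCCCCCCCCCCC", "BB 0.1"]
    for hit in previous_lines(sample):
        print(hit)
```

Only one line of state is kept, so this also works on a stream of arbitrary length.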
Another option: reverse the file and print the next line if the condition matches:
tac file | awk '$1 == "BB" && $2 > 1 {getline; print}' | tac
Concerning generality
I think it needs to be mentioned that the most general solution to this class of problem involves two passes:
the first pass to add a decimal row number ($REC) to the front of each line, effectively grouping lines into records by $REC
the second pass to trigger on the first instance of each new value of $REC as a record boundary (resetting $CURREC), thereafter rolling along in the native AWK idiom concerning the records to follow matching $CURREC.
In the intermediate file, some sequence of decimal digits followed by a separator (for human reasons, typically an added tab or space) is parsed (aka conceptually snipped off) as out-of-band with respect to the baseline file.
Command line paste monster
Even confined to the command line, it's an easy matter to ensure that the intermediate file never hits disk. You just need to use an advanced shell such as ZSH (my own favourite) which supports process substitution:
paste <( <input.txt awk "BEGIN { R=0; N=0; } /Header pattern/ { N=1; } { R=R+N; N=0; print R; }" ) input.txt | awk -f yourscript.awk
Let's render that one-liner more suitable for exposition:
P="/Header pattern/"
X="BEGIN { R=0; N=0; } $P { N=1; } { R=R+N; N=0; print R; }"
paste <( <input.txt awk "$X" ) input.txt | awk -f yourscript.awk
This starts three processes: the trivial inline AWK script, paste, and the AWK script you really wanted to run in the first place.
Behind the scenes, the <() command line construct creates a named pipe and passes the pipe name to paste as the name of its first input file. For paste's second input file, we give it the name of our original input file (this file is thus read sequentially, in parallel, by two different processes, which will consume between them at most one read from disk, if the input file is cold).
The magic named pipe in the middle is an in-memory FIFO that ancient Unix probably managed at about 16 kB of average size (intermittently pausing the paste process if the yourscript.awk process is sluggish in draining this FIFO back down).
Perhaps modern Unix throws a bigger buffer in there because it can, but it's certainly not a scarce resource you should be concerned about, until you write your first truly advanced command line with process redirection involving these by the hundreds or thousands :-)
Additional performance considerations
On modern CPUs, all three of these processes could easily find themselves running on separate cores.
The first two of these processes border on the truly trivial: an AWK script with a single pattern match and some minor bookkeeping, paste called with two arguments. yourscript.awk will be hard pressed to run faster than these.
What, your development machine has no lightly loaded cores to render this master shell-master solution pattern almost free in the execution domain?
Ring, ring.
Hello?
Hey, it's for you. 2018 just called, and wants its problem back.
2020 is officially the reprieve of MTV: That's the way we like it, magic pipes for nothing and cores for free. Not to name out loud any particular TLA chip vendor who is rocking the space these days.
As a final performance consideration, if you don't want the overhead of parsing actual record numbers:
X="BEGIN { N=0; } $P { N=1; } { print N; N=0; }"
Now your in-FIFO intermediate file is annotated with just an additional two characters prepended to each line ('0' or '1' and the default separator character added by paste), with '1' demarking first line in record.
Named FIFOs
Under the hood, these are no different than the magic FIFOs instantiated by Unix when you write any normal pipe command:
cat file | proc1 | proc2 | proc3
Three unnamed pipes (and a whole process devoted to cat you didn't even need).
It's almost unfortunate that the truly exceptional convenience of the default stdin/stdout streams as premanaged by the shell obscures the reality that paste $magictemppipe1 $magictemppipe2 bears no additional performance considerations worth thinking about, in 99% of all cases.
"Use the <() Y-joint, Luke."
Your instinctive reflex toward natural semantic decomposition in the problem domain will herewith benefit immensely.
If anyone had had the wits to name the shell construct <() as the YODA operator in the first place, I suspect it would have been pressed into universal service at least a solid decade ago.
Combining sed & awk you get this:
sed 'N;s/\n/ /' < file |awk '$3>1{print $1}'
sed 'N;s/\n/ /': join each pair of lines (1st and 2nd, 3rd and 4th, ...) by replacing the newline between them with a space
awk '$3>1{print $1}': print $1 (the 1st column) if $3 (the 3rd column) is greater than 1

How do I grep for entire, possibly wrapped, lines of code?

When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I also am aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish under Linux? FYI my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!
You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like:
pcregrep -M ";*\R*.*thingtosearchfor*\R*.*;.*"
Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
the concept behind: RS is record separator. by setting it to ")", then every record in your file is separated by ")" instead of newline. This make it easy to find your "function1" since you can then "grep" for it. If you don't use awk, the same concept can be applied using "splitting" on ")".
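The "split on a different record separator" idea carries over directly to other languages. Below is a small Python sketch; records and grep_records are made-up names. It takes ";" as the default separator, as the question suggests, and returns whole records rather than trimming them the way the gsub call above does.

```python
def records(text, sep=";"):
    """Split text into records that end at sep, keeping the separator."""
    parts = text.split(sep)
    # Every part except the last one was followed by sep in the input.
    for part in parts[:-1]:
        yield part + sep
    if parts[-1]:
        yield parts[-1]


def grep_records(text, needle, sep=";"):
    """Yield each whole record (whitespace-trimmed) that contains needle."""
    for record in records(text, sep):
        if needle in record:
            yield record.strip()


if __name__ == "__main__":
    source = (
        'a = 1;\n'
        'someFunctionCall ("test",\n'
        '    MY_CONSTANT,\n'
        '    (some *really) - long / expression);\n'
        'b = 2;\n'
    )
    for hit in grep_records(source, "MY_CONSTANT"):
        print(hit)
```

This is purely textual: a ";" inside a string literal or a comment still ends a record, so it is a rough aid rather than a parser.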
You can write a command line using grep with the options that give you the line number and the filename, then pipe these results through xargs into awk to parse these columns, and then use a little script of your own to display the N lines surrounding that line. :)
If this isn't an academic endeavour you could just use cscope (for C code only though). If you are willing to drop the requirement to search in comments ctags should be enough (and it also supports Perl).
I had a situation in which I had an XML file full of the names of zip files in an XML-style format, that is, with angle brackets (carets) bracketing the names of the files, say example.zip<\stuff>
I used awk to change all the angle brackets into newlines and then used grep :)

Determining Word Frequency of Specific Terms

I'm a non-computer science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to determine changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100) and for consistency's sake would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a "stop list").
This should be kept very simple. At the end all I need to have is the frequencies for the specific words for each text file I process, preferably in spreadsheet format (tab-delimited file) so that I can then create graphs and visualizations using that data.
I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.
I see two ways to solve this problem:
Find a way strip-out all the words in a text file EXCEPT for the pre-defined list and then do the frequency count from there, or:
Find a way to do a frequency count using just the terms from the pre-defined list.
Any ideas?
I would go with the second idea. Here is a simple Perl program that will read a list of words from the first file provided and print a count of each word in the list from the second file provided in tab-separated format. The list of words in the first file should be provided one per line.
#!/usr/bin/perl
use strict;
use warnings;
my $word_list_file = shift;
my $process_file = shift;
my %word_counts;
# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
open(WORDS, $word_list_file) or die "Failed to open list file: $!\n";
while (<WORDS>) {
    chomp;
    # Store words in lowercase for case-insensitive match
    $word_counts{lc($_)} = 0;
}
close(WORDS);
# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(FILE, $process_file) or die "Failed to open process file: $!\n";
while (<FILE>) {
    chomp;
    while ( /-$/ ) {
        # If the line ends in a hyphen, remove the hyphen and
        # continue reading lines until we find one that doesn't
        chop;
        my $next_line = <FILE>;
        last unless defined $next_line;
        # Drop the newline so the next chop removes a hyphen, not "\n"
        chomp $next_line;
        $_ .= $next_line;
    }
    my @words = split /\b/, lc; # Split the lower-cased version of the string
    foreach my $word (@words) {
        $word_counts{$word}++ if exists $word_counts{$word};
    }
}
close(FILE);
# Print each word in the hash in alphabetical order along with the
# number of time encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts)
{
    print "$word\t$word_counts{$word}\n"
}
If the file words.txt contains:
linux
frequencies
science
words
And the file text.txt contains the text of your post, the following command:
perl analyze.pl words.txt text.txt
will print:
frequencies 3
linux 1
science 1
words 3
Note that breaking on word boundaries using \b may not work the way you want in all cases, for example, if your text files contain words that are hyphenated across lines you will need to do something a little more intelligent to match these. In this case you could check to see if the last character in a line is a hyphen and, if it is, just remove the hyphen and read another line before splitting the line into words.
Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.
Note that if there are hyphenated words, some of which are broken across lines and some that are not, this won't find them all because it only removes hyphens at the end of a line. In this case you may want to just remove all hyphens and match words after the hyphens are removed. You can do this by simply adding the following line right before the split function:
s/-//g;
I do this sort of thing with a script like following (in bash syntax):
for file in *.txt
do
    sed -r 's/([^ ]+) +/\1\n/g' "$file" \
    | grep -F -f 'go-words' \
    | sort | uniq -c > "${file}.frq"
done
You can tweak the regex you use to delimit individual words; in the example I just treat whitespace as the delimiter. The -f argument to grep is a file that contains your words of interest, one per line.
First familiarize yourself with lexical analysis and how to write a scanner generator specification. Read the introductions to using tools like YACC, Lex, Bison, or my personal favorite, JFlex. Here you define what constitutes a token. This is where you learn about how to create a tokenizer.
Next you have what is called a seed list. The opposite of the stop list is usually referred to as the start list or limited lexicon. Lexicon would also be a good thing to learn about. Part of the app needs to load the start list into memory so it can be quickly queried. The typical way to store is a file with one word per line, then read this in at the start of the app, once, into something like a map. You might want to learn about the concept of hashing.
From here you want to think about the basic algorithm and the data structures necessary to store the result. A distribution is easily represented as a two dimensional sparse array. Learn the basics of a sparse matrix. You don't need 6 months of linear algebra to understand what it does.
Because you are working with larger files, I would advocate a stream-based approach. Don't read in the whole file into memory. Read it as a stream into the tokenizer that produces a stream of tokens.
In the next part of the algorithm think about how to transform the token list into a list containing only the words you want. If you think about it, the list is in memory and can be very large, so it is better to filter out non-start-words at the start. So at the critical point where you get a new token from the tokenizer and before adding it to the token list, do a lookup in the in-memory start-words-list to see if the word is a start word. If so, keep it in the output token list. Otherwise ignore it and move to the next token until the whole file is read.
Now you have a list of tokens only of interest. The thing is, you are not looking at other indexing metrics like position and case and context. Therefore, you really don't need a list of all tokens. You really just want a sparse matrix of distinct tokens with associated counts.
So, first create an empty sparse matrix. Then think about the insertion of the newly found token during parsing. When it occurs, increment its count if it's in the list or otherwise insert a new token with a count of 1. This time, at the end of parsing the file, you have a list of distinct tokens, each with a frequency of at least 1.
That list is now in-mem and you can do whatever you want. Dumping it to a CSV file would be a trivial process of iterating over the entries and writing each entry per line with its count.
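The steps above (load the start list once, stream tokens, count only the wanted ones, dump one row per file) can be sketched compactly in Python. This is one possible reading, not the only one: a plain Counter stands in for the sparse matrix, a simple regular expression stands in for a generated scanner, and the names TOKEN, count_terms and tsv_row are invented for the example.

```python
import re
from collections import Counter

# Very simple tokenizer: runs of ASCII letters and apostrophes.
# An assumption for this sketch; real texts may need something richer.
TOKEN = re.compile(r"[a-z']+")


def count_terms(lines, terms):
    """Count occurrences of the start-list terms in a stream of lines."""
    wanted = {term.lower() for term in terms}
    counts = Counter({term: 0 for term in wanted})
    for line in lines:  # one line at a time, never the whole file
        for token in TOKEN.findall(line.lower()):
            if token in wanted:  # filter at the source, as described
                counts[token] += 1
    return counts


def tsv_row(name, counts, terms):
    """Format one spreadsheet row: file name, then one count per term."""
    return "\t".join([name] + [str(counts[term.lower()]) for term in terms])


if __name__ == "__main__":
    terms = ["linux", "words", "science"]
    text = ["Linux is fun.\n", "I use linux and Words, words.\n"]
    print("\t".join(["file"] + terms))
    print(tsv_row("example.txt", count_terms(text, terms), terms))
```

For real input, an open file object can be passed as lines, once per text file, with each resulting row appended to the same tab-separated output.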
For that matter, take a look at the non-commercial product called "GATE" or a commercial product like TextAnalyst or products listed at http://textanalysis.info
I'm guessing that new files get introduced over time, and that's how things change?
I reckon your best bet would be to go with something like your option 2. There's not much point pre-processing the files, if all you want to do is count occurrences of keywords. I'd just go through each file once, counting each time a word in your list appears. Personally I'd do it in Ruby, but a language like perl or python would also make this task pretty straightforward. E.g., you could use an associative array with the keywords as the keys, and a count of occurrences as the values. (But this might be too simplistic if you need to store more information about the occurrences).
I'm not sure if you want to store information per file, or about the whole dataset? I guess that wouldn't be too hard to incorporate.
I'm not sure about what to do with the data once you've got it -- exporting it to a spreadsheet would be fine, if that gives you what you need. Or you might find it easier in the long run just to write a bit of extra code that displays the data nicely for you. Depends on what you want to do with the data (e.g. if you want to produce just a few charts at the end of the exercise and put them into a report, then exporting to CSV would probably make most sense, whereas if you want to generate a new set of data every day for a year then building a tool to do that automatically is almost certainly the best idea).
Edit: I just figured out that since you're studying history, the chances are your documents are not changing over time, but rather reflect a set of changes that happened already. Sorry for misunderstanding that. Anyway, I think pretty much everything I said above still applies, but I guess you'll lean towards going with exporting to CSV or what have you rather than an automated display.
Sounds like a fun project -- good luck!
Ben
I'd do a "grep" on the files to find all the lines that contain your key words. (grep -f can be used to specify an input file of words to search for; pipe the output of grep to a file.) That will give you a list of lines which contain instances of your words. Then do a "sed" to replace your word separators (most likely spaces) with newlines, to give you a file of separate words (one word per line). Now run through grep again, with your same word list, except this time specify -c (to get a count of the lines with the specified words, i.e. a count of the occurrences of the word in the original file).
The two-pass method simply makes life easier for "sed"; the first grep should eliminate a lot of lines.
You can do this all in basic linux command-line commands. Once you're comfortable with the process, you can put it all into shell script pretty easily.
Another Perl attempt:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Tie::File;
# Usage:
#
# $ perl WordCount.pl <Files>
#
# Example:
#
# $ perl WordCount.pl *.text
#
# Counts words in all files given as arguments.
# The words are taken from the file "WordList".
# The output is appended to the file "WordCount.out" in the format implied in the
# following example:
#
# File,Word1,Word2,Word3,...
# File1,0,5,3,...
# File2,6,3,4,...
# .
# .
# .
#
### Configuration
my $CaseSensitive = 1; # 0 or 1
my $OutputSeparator = ","; # another option might be "\t" (TAB)
my $RemoveHyphenation = 0; # 0 or 1. Careful, may be too greedy.
###
my @WordList = read_file("WordList");
chomp @WordList;
tie (my @Output, 'Tie::File', "WordCount.out");
push (@Output, join ($OutputSeparator, "File", @WordList));
for my $InFile (@ARGV)
    { my $Text = read_file($InFile);
      if ($RemoveHyphenation) { $Text =~ s/-\n//g; };
      my %Count;
      for my $Word (@WordList)
          { if ($CaseSensitive)
                { $Count{$Word} = ($Text =~ s/(\b$Word\b)/$1/g); }
            else
                { $Count{$Word} = ($Text =~ s/(\b$Word\b)/$1/gi); }; };
      my $OutputLine = "$InFile";
      for my $Word (@WordList)
          { if ($Count{$Word})
                { $OutputLine .= $OutputSeparator . $Count{$Word}; }
            else
                { $OutputLine .= $OutputSeparator . "0"; }; };
      push (@Output, $OutputLine); };
untie @Output;
When I put your question in the file wc-test and Robert Gamble's answer into wc-ans-test, the Output file looks like this:
File,linux,frequencies,science,words
wc-ans-test,2,2,2,12
wc-test,1,3,1,3
This is a comma separated value (csv) file (but you can change the separator in the script). It should be readable for any spreadsheet application. For plotting graphs, I would recommend gnuplot, which is fully scriptable, so you can tweak your output independently of the input data.
To hell with big scripts. If you're willing to grab all words, try this shell fu:
cat *.txt | tr A-Z a-z | tr -cs a-z '\n' | sort | uniq -c | sort -rn |
sed 's/[0-9] /&, /'
That (tested) will give you a list of all words sorted by frequency in CSV format, easily imported by your favorite spreadsheet. If you must have the stop words then try inserting grep -w -F -f stopwords.txt into the pipeline (not tested).

Resources