Find unique sequences within dna strings - linux

I have a file which contains a bunch of sequences. The strings have a prefix of AAGCTT and a suffix of GCGGCCGC.
Between these two patterns lie unique sequences. I want to find these sequences and count their occurrences.
Example below
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
String CTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCG is present 1000 times.

I'd divide the problem into these subproblems:
Extract all sequences between AAGCTT and GCGGCCGC:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)'.
-P is a GNU extension. If your implementation of grep does not support it use pcregrep.
Assumption: The sequences to be extracted never contain AAGCTT/GCGGCCGC except at the beginning/end.
Count the found sequences:
sort | uniq -c
Putting everything together, we end up with:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)' yourInputFile | sort | uniq -c
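If neither grep -P nor pcregrep is available, the same idea can be sketched in plain awk. The helper name count_inserts is made up for illustration, and the sketch only looks at the first prefix/suffix pair on each line, so it assumes one sequence per line:

```shell
# count_inserts [FILE...]: hypothetical helper, reads stdin when no file is given.
# Prints "COUNT SEQUENCE" for every distinct insert, most frequent first.
count_inserts() {
    awk -v pre='AAGCTT' -v suf='GCGGCCGC' '
    {
        s = index($0, pre)              # position of the first prefix, 0 if absent
        if (s == 0) next
        rest = substr($0, s + length(pre))
        e = index(rest, suf)            # first suffix after that prefix
        if (e == 0) next
        count[substr(rest, 1, e - 1)]++
    }
    END { for (seq in count) print count[seq], seq }
    ' "$@" | sort -k1,1nr -k2
}
```

Called as count_inserts yourInputFile, it should give the same counts as the grep pipeline as long as the one-sequence-per-line assumption holds.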

It's hard (impossible?) to assess whether this will work for you, given the sample size. My one-liner assumes one sequence per line, lines defined by unix line endings.
echo "AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC" | awk '{a[gensub( /AAGCTT(.*)GCGGCCGC/,"\\1",1,$0)]++}END{for(i in a){print i" is present "a[i]" times"}}'
CTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCG is present 1 times

I believe this will do what you want:
awk '/^AAGCTT/ && /GCGGCCGC$/ {arr[$0]++} END {for (i in arr) {print i "\t" arr[i]}}' file
Explanation: find lines beginning with the first adapter and ending with the last adapter, then load these into an array and print the unique lines followed by the count for each line
With this test data:
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGCAACT
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCccccccccc
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCccccccccc
The output is:
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC 5
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC 4
If you just want the count, you can use print arr[i] instead of print i "\t" arr[i], or if you want the count before the read you can use print arr[i] "\t" i

Assuming you have some file dna.txt you could simply:
Separate your original continuous DNA string into multiple lines, using your PREFIX as a line delimiter, then simply remove all the suffixes and all irrelevant DNA following them
Then use sort -u to iterate through all lines in your new file with no repeats (All the unique patterns).
Then simply use grep -o and wc -l to count the occurrences!
PREFIX='AAGCTT'
SUFFIX='GCGGCCGC'
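find_traits() {
    # Step 1: start a new line at every prefix, then cut each line at its suffix
    # (\n in the replacement and sed -i are GNU sed features)
    sed "s/${PREFIX}/\n/g" dna.txt > /tmp/dna_lines.txt
    sed -i "s/${SUFFIX}.*//" /tmp/dna_lines.txt
    # Step 2: loop over the unique patterns
    for pattern in $(sort -u /tmp/dna_lines.txt)
    do
        # Step 3: count how often the pattern occurs in the original file
        printf '\nPATTERN [%s] : |%s|%s|%s|\n' \
            "$(grep -o -- "$pattern" dna.txt | wc -l)" "$PREFIX" "$pattern" "$SUFFIX"
    done
}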

Related

Extract lines containing two patterns

I have a file which contains several lines as follows:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
I want to extract all lines containing both and including the header lines.
I have tried using grep, but it only extracts the sequence lines but not the header lines.
grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta
How to extract lines containing both the patterns and the headers in Linux? The patterns can be present anywhere in the lines. Not limited to start or end of the lines.
Expected output:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
$ grep -A 1 header[12] file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
man grep:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
grep -B 1 'pattern_[12]' could also work, but it matches lines containing either pattern, and the sample data has several lines with only one of them, so... not this time.
You can easily do that with awk like this:
awk '/^>/{h=$0;next}
/<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta
And here is a sed solution which yields the desired output as well:
sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta
If it is likely that multiline records exist, you can use this:
awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p
function chk( tmp){
tmp=gensub(/\n/,"","g",r)
return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
You might be interested in BioAwk, it is an adapted version of awk which is tuned to process fasta files
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta
If you want seq1 at the beginning and seq2 at the end, you can change it into:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta
This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq contains the full sequence.
If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:
read a single record at a time (this assumes no > in the header, except the first character)
extract the header from the record and store it in name (not really needed)
merge the full sequence in a single string of characters, removing all newlines and spaces. This ensures that searching for pattern1 or pattern2 will not fail if the pattern is split over multiple lines.
if a match is found, print the record.
The following awk does the requested:
awk -v seq1="pattern1" -v seq2="pattern2" \
'BEGIN{RS=">"; ORS=""; FS="\n"}
{ seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
(seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta
If the record header contains other > characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)
awk -v seq1="pattern1" -v seq2="pattern2" \
'/^>/ && (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
/^>/ { seq=""; delete aseq; n=0; name=$0; next }
{ aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
END { if (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
}' file.fasta
note: we make use of sub here in case unexpected characters are introduced in the fasta file (eg. spaces/tabs or CR (\r))
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
If you want grep to print lines around the match, use the -B flag for lines before, the -A for lines after, and -C for both before and after the match.
In your case, grep -B 1 seems like it would do the job.
If your input file is exactly as described in your post then you can use:
grep -B1 '^<pattern_1>.*<pattern_2>$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
Here -B1 prints the line preceding each matching line. The regex assumes that the two patterns appear in exactly this order, at the beginning and at the end of the line. If this is not the case, use '.*<pattern_1>.*<pattern_2>.*'. Last but not least, if the order of the two patterns is not always respected, you can use: '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$'
On the following input file:
cat input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
output:
grep -B1 '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>

awk Print Line Issue

I'm experiencing some issues with an awk command right now. The original script was developed using awk on macOS and was then ported to Linux, where awk behaves differently.
What I want to do is to count the occurrences of single strings provided via /tmp/test.uniq.txt in the file /tmp/test.txt.
awk '{print $1, system("cat /tmp/test.txt | grep -o -c " $1)}' /tmp/test.uniq.txt
Mac delivers an expected output like:
test1 2
test2 1
The output is on one line: the string and the number of occurrences, separated by whitespace.
Linux delivers an output like:
2
test1 1
test2
The output is not on one line, and the output of the system command is printed first.
Sample input:
test.txt looks like:
test1 test test
test1 test test
test2 test test
test.uniq.txt looks like:
test1
test2
As the comments suggested, calling grep, cat, etc. through the system function is not recommended, because awk is a complete language that can perform most of these tasks itself.
You can use following awk command to replace your cat | grep functionality:
awk 'FNR == NR {a[$1]=0; next} {for (i=1; i<=NF; i++) if ($i in a) a[$i]++}
END { for (i in a) print i, a[i] }' uniq.txt test.txt
test1 2
test2 1
Note that this output doesn't match with the count 5 as your question states as your sample data is probably different.
References:
Effective AWK Programming
Awk Tutorial
It looks to me as if you're trying to count the number of lines containing each unique string in the uniq file. But the way you're doing it is .. awkward, and as you've demonstrated, inconsistent between versions of awk.
The following might work a little better:
$ awk '
NR==FNR {
a[$1]
next
}
{
for (i in a) {
if ($1~i) {
a[i]++
}
}
}
END {
for (i in a)
printf "%6d\t%s\n",a[i],i
}
' test.uniq.txt test.txt
2 test1
1 test2
This loads your uniq file into an array, then for every line in your text file, steps through the array to count the matches.
Note that these are being compared as regular expressions, without word boundaries, so test1 will also be counted as part of test12.
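If the external grep really has to stay, one option is to read its output with getline instead of system(). As far as I know, system() only returns the command's exit status while the command writes its own output directly to stdout, so where that output lands depends on buffering. The sketch below is an illustration, not a drop-in fix: the function name is made up, and it assumes the search strings are plain words without quotes or shell metacharacters.

```shell
# count_with_getline PATTERN_FILE TARGET_FILE (hypothetical helper)
# Prints "STRING COUNT", where COUNT is the number of matching lines (grep -c).
count_with_getline() {
    awk -v target="$2" '
    {
        n = 0
        cmd = "grep -c -F -- \"" $1 "\" \"" target "\""
        cmd | getline n     # capture the command output instead of letting it print itself
        close(cmd)
        print $1, n
    }' "$1"
}
```

With the sample files from the question, count_with_getline /tmp/test.uniq.txt /tmp/test.txt keeps each string and its count on one line.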
Another way might be to use grep+sort+uniq:
grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
It's a pipeline but a short one
From man grep:
-F, --fixed-strings, --fixed-regexp Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
(-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-o, --only-matching Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.

how to count occurrence of specific word in group of file by bash/shellscript

I have two text files, 'simple' and 'simple1', with the following data in them:
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" for each file, but I want to display the total count, 7, i.e. the total across both files. Can I do this in a single command/shell script?
P.S.: I had to write two commands as tr does not take two file names.
Try this, the straightforward way :
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record.
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i.
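A regular expression as RS is an extension (gawk and mawk support it), so here is a sketch that stays within POSIX awk by looping over the fields of each line instead. The function name count_h_words is made up for this example, and it lowercases each word so that it matches case-insensitively like the grep -i in the question:

```shell
# count_h_words [FILE...]: hypothetical helper, reads stdin when no file is given.
count_h_words() {
    awk '
    {
        for (i = 1; i <= NF; i++)           # every whitespace-separated word
            if (tolower($i) ~ /^h/) n++
    }
    END { print n + 0 }                     # n + 0 prints 0 when nothing matched
    ' "$@"
}
```

For the two sample files, count_h_words simple.txt simple1.txt should print 7.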
It is not that tr accepts only one filename: it does not accept any filename at all and always reads from stdin. That's why, even in your own solution, you didn't give tr a filename but used input redirection.
In your case, I think you can replace tr by fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it better readable this way, but this is a matter of taste).
Note that both solutions (mine and your original ones) would count a string consisting of letters and one or more non-space characters - for instance the string haaaa.hbbbbbb.hccccc - as a "single block", i.e. it would only add 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour, it's up to you to decide.
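If you would rather count each of those pieces separately, one possible sketch is to split on everything that is not a letter, digit or underscore before counting. The helper name count_h_tokens is an assumption made for this example:

```shell
# count_h_tokens [FILE...]: hypothetical helper, reads stdin when no file is given.
count_h_tokens() {
    cat "$@" |
        tr -cs '[:alnum:]_' '\n' |   # one word per line, punctuation acts as a separator
        grep -c -i '^h'              # count the words starting with h or H
}
```

With this, haaaa.hbbbbbb.hccccc adds 3 to the count instead of 1.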

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I can also break apart each part of text that is ^A-delimited (Note that ^A is SOH special character and can be typed by using Ctrl-A)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail -n +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2| tail -1 | cut -d# -f2
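For the remaining part, splitting the text between the hashes on ^A, tr can turn each SOH byte into a newline. A minimal sketch, assuming the output has exactly the shape described in the question; the function name second_line_fields is made up:

```shell
# second_line_fields: hypothetical helper, reads the program output on stdin.
second_line_fields() {
    sed -n '2p' |          # keep only the second line
        cut -d '#' -f 2 |  # text between the first pair of hashes
        tr '\001' '\n'     # one ^A-delimited piece per line
}
```

Usage would be your.program | second_line_fields.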
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "\001")  # "\001" is the SOH character (^A); a literal caret followed by A would not work
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read -r -a result <<< "$result"
$result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
here's a possible awk solution
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
n=split($i,a,"\001") # split on SOH
for(o=1;o<=n;o++) print a[o] # print each piece in order
}
}' file

How to remove duplicate words from a plain text file using linux command

I have a plain text file with words, which are separated by comma, for example:
word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3
i want to delete the duplicates and to become:
word1, word2, word3, word4, word5, word6, word7
Any Ideas? I think, egrep can help me, but i'm not sure, how to use it exactly....
Assuming that the words are one per line, and the file is already sorted:
uniq filename
If the file's not sorted:
sort filename | uniq
If they're not one per line, and you don't mind them being one per line:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that removes the hyphen from hyphenated words. "man tr" for more options.
ruby -pi.bak -e '$_ = $_.chomp.split(/,\s*/).uniq.join(", ") + "\n"' filename
I'll admit the two kinds of quotations are ugly.
Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that:
$ sed 's/, /\n/g' filename | sort | uniq
word1
word2
word3
word4
word5
word6
word7
The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)
$ sed 's/, /\n/g' filename | sort | uniq | perl -e '@a = <>; chomp @a; print((join ", ", @a), "\n")'
word1, word2, word3, word4, word5, word6, word7
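A possible alternative to the perl step is paste, which joins lines with a separator and not a terminator. The sketch below assumes a single comma-separated line on stdin; the function name dedupe_csv_line is made up, and the result comes out sorted, not in the original order:

```shell
# dedupe_csv_line: hypothetical helper, reads one comma-separated line on stdin.
dedupe_csv_line() {
    tr ',' '\n' |                     # one word per line
        sed 's/^ *//; s/ *$//' |      # trim the spaces around each word
        sort -u |                     # sort and drop duplicates
        paste -s -d ',' - |           # join the lines back with commas
        sed 's/,/, /g'                # restore the space after each comma
}
```

Usage would be dedupe_csv_line < filename.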
Here's an awk script that will leave each line intact, only removing the duplicate words:
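BEGIN {
    FS=", "
}
{
    out = ""
    for (i=1; i <= NF; i++)
        if (!($i in used)) {          # first time this word appears on the line
            used[$i] = 1
            out = out (out == "" ? "" : ", ") $i
        }
    print out                         # words in their original order, no trailing comma
    split("", used)                   # clear the array for the next line
}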
I had the very same problem today: a word list with 238,000 words, of which about 40,000 were duplicates. I already had them on individual lines by doing
cat filename | tr " " "\n" | sort
To remove the duplicates I simply did
cat filename | uniq > newfilename
(uniq only removes adjacent duplicates, so the input has to be sorted first.)
This worked perfectly with no errors, and my file is now down from 1.45MB to 1.01MB.
I'd think you'll want to replace the spaces with newlines, use the uniq command to find unique lines, then replace the newlines with spaces again.
I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.
while (<DATA>)
{
chomp;
my %seen = ();
my @words = split(m!,\s*!);
@words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;
print join(", ", @words), "\n";
}
__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3
If you want uniqueness over the whole file, you can just move the %seen hash outside the while (<DATA>) {} loop.
Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of doubles. Also, many non-standard characters. I didn't really need them sorted, but it seemed that was gonna be necessary for uniq.
I tried:
sort /Users/me/Documents/file.txt | uniq -u
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'
Tried:
sort -u /Users/me/Documents/file.txt >> /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.
And even tried passing it through cat first, just so I could see if we were getting a proper input.
cat /Users/me/Documents/file.txt | sort | uniq -u > /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `zon\351s' and `zoologie'.
I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t/203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".
What finally worked for me was:
awk '!x[$0]++' /Users/me/Documents/file.txt > /Users/me/Documents/file2.txt
It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
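For anyone wondering how that one-liner works: x[$0]++ is 0 (false) the first time a line is seen and non-zero afterwards, so negating it prints only first occurrences. Spelled out as a sketch, with a made-up function name:

```shell
# dedupe_keep_order [FILE...]: hypothetical helper, reads stdin when no file is given.
dedupe_keep_order() {
    awk '
    {
        if (!($0 in seen)) {   # first time this exact line appears
            seen[$0] = 1
            print              # keep it, in its original position
        }
    }' "$@"
}
```

Since awk compares the lines as plain strings, no sorting is needed and lines that differ only in case are kept apart.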
And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
open file with vim (vim filename) and run sort command with unique flag (:sort u).
