I have a cord.txt file as shown below,
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
I need to count each letter and make a summary as shown below (expected output):
H,4
D,5
E,4
T,1
I know how to count a single letter using grep "<letter>" cord.txt | wc. But I have a huge file which contains many more letters, so please help me do the same.
Thanks in advance.
You're missing the N :-)
grep -o '[[:alpha:]]' cord.txt | sort | uniq -c
grep -o only outputs the matching part. With the POSIX class [[:alpha:]], it outputs all the letters contained in the input.
sort groups the same letters together
uniq -c reports unique lines with their counts. It needs sorted input, as it only compares the current line to the previous one.
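If you also need the exact letter,count format from the question, one way (a sketch, not part of the original answer) is to let awk swap uniq's columns:
grep -o '[[:alpha:]]' cord.txt | sort | uniq -c | awk '{print $2 "," $1}'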
The following command
Removes any character that is not an ASCII letter;
Places every character on its own line;
Sorts the characters;
Counts the number of identical consecutive lines.
sed 's/[^a-zA-Z]//g' < input.txt | fold -w 1 -s | sort | uniq -c > output.txt
# ^                                ^              ^      ^
# 1.                               2.             3.     4.
Input:
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
Output:
5 D
4 E
4 H
1 N
1 T
You might use Python's collections.Counter as follows. Let cord.txt's content be
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
and counting.py be
import collections
counter = collections.Counter()
with open("cord.txt", "r") as f:
    for line in f:
        counter.update(i for i in line if i.isalpha())
for char, cnt in counter.items():
    print("{},{}".format(char, cnt))
then python counting.py outputs
H,4
D,5
E,4
T,1
N,1
Note that I used for line in f, where f is a file handle, to avoid loading the whole file into memory. Disclaimer: I used Python 3.7; older versions should work but might produce a different order in the output, as collections.Counter is a subclass of dict, and dicts did not preserve insertion order in older Python versions.
In short:
tr '[0-9],' \\n <input | sort | uniq -c
43
5 D
4 E
4 H
1 N
1 T
OK, there are 43 empty lines left over from the translated digits and commas... You could drop them and match your requested format by adding sed:
tr '[0-9],' \\n </tmp/so/input | sort | uniq -c |
sed -ne 's/^ *\([0-9]\+\) \(.\)/\2,\1/p'
D,5
E,4
H,4
N,1
T,1
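A variant (an untested sketch) avoids most of the empty lines in the first place by letting tr squeeze the runs of newlines it produces:
tr -s '[0-9],' \\n <input | sort | uniq -c
With -s, each run of translated newlines collapses to a single line break, typically leaving at most a single leading empty line.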
I have a file which contains a bunch of sequences. The strings have a prefix of AAGCTT and a suffix of GCGGCCGC.
Between these two patterns lie unique sequences. I want to find these sequences and count their occurrences.
Example below
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
String CTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCG is present 1000 times.
I'd divide the problem into these subproblems:
Extract all sequences between AAGCTT and GCGGCCGC:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)'.
-P is a GNU extension. If your implementation of grep does not support it, use pcregrep.
Assumption: The sequences to be extracted never contain AAGCTT/GCGGCCGC except at the beginning/end.
Count the found sequences:
sort | uniq -c
Putting everything together, we end up with:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)' yourInputFile | sort | uniq -c
It's hard (impossible?) to assess whether this will work for you, given the sample size. My one-liner assumes one sequence per line, with lines defined by Unix line endings.
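For completeness, the pcregrep fallback mentioned above would be the direct equivalent (a sketch, untested):
pcregrep -o 'AAGCTT\K.*?(?=GCGGCCGC)' yourInputFile | sort | uniq -c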
echo "AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC" | awk '{a[gensub( /AAGCTT(.*)GCGGCCGC/,"\\1",1,$0)]++}END{for(i in a){print i" is present "a[i]" times"}}'
CTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCG is present 1 times
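Note that gensub is a GNU awk (gawk) extension, so this requires gawk. Applied to a whole file of reads rather than a single echoed string, the same approach would look like this (a sketch):
awk '{a[gensub( /AAGCTT(.*)GCGGCCGC/,"\\1",1,$0)]++}END{for(i in a){print i" is present "a[i]" times"}}' yourInputFile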
I believe this will do what you want:
awk '/^AAGCTT/ && /GCGGCCGC$/ {arr[$0]++} END {for (i in arr) {print i "\t" arr[i]}}' file
Explanation: find lines beginning with the first adapter and ending with the last adapter, then load these into an array and print the unique lines, each followed by its count.
With this test data:
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGCAACT
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCccccccccc
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCccccccccc
The output is:
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC 5
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC 4
If you just want the count, you can use print arr[i] instead of print i "\t" arr[i]; or, if you want the count before the read, you can use print arr[i] "\t" i, as shown below.
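For example, the count-first variant would be (a sketch):
awk '/^AAGCTT/ && /GCGGCCGC$/ {arr[$0]++} END {for (i in arr) {print arr[i] "\t" i}}' file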
Assuming you have some file dna.txt, you could simply:
Separate your original continuous DNA string into multiple lines, using your PREFIX as a line delimiter; then remove all the suffixes and any irrelevant DNA following them.
Then use sort -u to iterate through all lines in your new file with no repeats (all the unique patterns).
Then simply use grep -o and wc -l to count the occurrences!
PREFIX='AAGCTT'
SUFFIX='GCGGCCGC'
find_traits() {
    # Step 1
    sed "s/${PREFIX}/\n/g" dna.txt > /tmp/dna_lines.txt
    sed -i "s/${SUFFIX}.*//" /tmp/dna_lines.txt
    # Step 2
    for pattern in $(sort -u /tmp/dna_lines.txt)
    do
        # Step 3
        printf "
PATTERN [$(grep -o $pattern dna.txt | wc -l)] : |${PREFIX}|${pattern}|${SUFFIX}|
"
    done
}
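To try it (a sketch; the file name find_traits.sh is hypothetical), save the definitions next to dna.txt, source them, and call the function; it prints one PATTERN [count] : |PREFIX|pattern|SUFFIX| line per unique sequence:
source ./find_traits.sh
find_traits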
I'm studying bash scripting and I'm stuck fixing an exercise of this site: https://ryanstutorials.net/bash-scripting-tutorial/bash-variables.php#activities
The task is to write a bash script to output a random word from a dictionary whose length is equal to the number supplied as the first command line argument.
My idea was to create a sub-dictionary, assign each word a line number, select a random number from those lines, and filter the output. This worked for a similar, simpler script, but not for this one.
This is the code I used:
6 DIC='/usr/share/dict/words'
7 SUBDIC=$( egrep '^.{'$1'}$' $DIC )
8
9 MAX=$( $SUBDIC | wc -l )
10 RANDRANGE=$((1 + RANDOM % $MAX))
11
12 RWORD=$(nl "$SUBDIC" | grep "\b$RANDRANGE\b" | awk '{print $2}')
13
14 echo "Random generated word from $DIC which is $1 characters long:"
15 echo $RWORD
and this is the error I get when using "21" as input:
bash script.sh 21
script.sh: line 9: counterintelligence's: command not found
script.sh: line 10: 1 + RANDOM % 0: division by 0 (error token is "0")
nl: 'counterintelligence'\''s'$'\n''electroencephalograms'$'\n''electroencephalograph': No such file or directory
Random generated word from /usr/share/dict/words which is 21 characters long:
I tried splitting the code into smaller pieces in bash, and got no error (input=21):
egrep '^.{'21'}$' /usr/share/dict/words | wc -l
3
but once in the script, lines 9 and 10 give errors.
Where do you think the error is?
problems
SUBDIC=$( egrep '^.{'$1'}$' $DIC ) will store all words of the given length in the SUBDIC variable, so its content is now something like foo bar baz.
MAX=$( $SUBDIC | ... ) will try to run the command foo bar baz, which is obviously bogus; it should be more like MAX=$(echo $SUBDIC | ... )
MAX=$( ... | wc -l ) will count the lines; when using the above-mentioned echo $SUBDIC you will have multiple words, but all in one line...
RWORD=$(nl "$SUBDIC" | ...) same problem as above: there's only one line (also note @armali's answer that nl requires a file or stdin)
RWORD=$(... | grep "\b$RANDRANGE\b" | ...) might match the dictionary entry catch 22
likely RWORD=$(... | awk '{print $2}') won't handle lines containing spaces
a simple solution
doing a "random sort" over all the possible words and taking the first line should be sufficient:
egrep "^.{$1}$" "${DIC}" | sort -R | head -1
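On systems with GNU coreutils, shuf is an alternative to sort -R (a sketch):
egrep "^.{$1}$" "${DIC}" | shuf -n 1
Unlike sort -R, which sorts by a hash of the keys, shuf performs a true random shuffle, and -n 1 stops after emitting one line.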
MAX=$( $SUBDIC | wc -l ) - A pipe is used for connecting a command's output, while $SUBDIC isn't a command; an appropriate syntax is MAX=$( <<<$SUBDIC wc -l ).
nl "$SUBDIC" - The argument to nl has to be a filename, which "$SUBDIC" isn't; an appropriate syntax is nl <<<"$SUBDIC".
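Putting these fixes together with the grep/awk caveats from the other answer, a minimally repaired version of the script might look like this (a sketch):
#!/bin/bash
DIC='/usr/share/dict/words'
SUBDIC=$( egrep "^.{$1}$" "$DIC" )              # all words of the requested length

MAX=$( <<<"$SUBDIC" wc -l )                     # count lines via stdin, not command expansion
RANDRANGE=$((1 + RANDOM % MAX))                 # 1..MAX (assumes at least one word matched)

RWORD=$( <<<"$SUBDIC" sed -n "${RANDRANGE}p" )  # print exactly that line; no grep/awk pitfalls

echo "Random generated word from $DIC which is $1 characters long:"
echo "$RWORD"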
This code will do it. My test dictionary of words is in the file file. It's a good idea to get all words of a given length first, but put them in an array, not in a variable. Then get a random index and echo it.
dic=( $(sed -n "/^.\{$1\}$/p" file) )
ind=$((0 + RANDOM % ${#dic[@]}))
echo ${dic[$ind]}
I am also doing this activity and came up with one simple solution.
I created this script:
#!/bin/bash
awk "NR==$1 {print}" /usr/share/dict/words
If you want a random word, run the script from the terminal as below:
./script.sh $RANDOM
If you want to print the word at a specific line number, run:
./script.sh 465
cat /usr/share/dict/american-english | head -n $RANDOM | tail -n 1
$RANDOM - Returns a different random number each time it is referred to. (Note that $RANDOM is at most 32767, so only a word from the first 32767 lines of the dictionary can ever be picked.)
This simple line outputs a random word from the mentioned dictionary.
Otherwise, as umläute mentioned, you can do:
cat /usr/share/dict/american-english | sort -R | head -1
I have a number of files in a directory on Linux, each of which contains a version line in the format: #version x (where x is the version number).
I'm trying to find a way to count the number of times each different version appears across all the files, and output something like:
#version 1: 12
#version 2: 36
#version 3: 2
I don't know all the potential versions that might exist, so I'm really trying to match lines that contain #version.
I've tried using things like grep -c; however, that only gives the total of all lines containing #version, and I can't find a nice way to split on the different version numbers.
A possibility piping multiple commands:
strings * | grep '#version \w' | sort | uniq --count | awk '{printf("%s: %s\n", substr($0, index($0, $2)), $1)}'
Operations breakdown:
strings *: Extract text strings from all the files (*) in the current directory.
| grep '#version \w': Pipe the strings into the grep command, to find all occurrences of #version followed by a word character.
| sort: Pipe the version strings into the sort command.
| uniq --count: Pipe the sorted occurrences of #version lines into the uniq command, to output the count of each unique #version... string.
| awk '{printf("%s: %s\n", substr($0, index($0, $2)), $1)}': Pipe the unique counts into the awk command, to re-format the output as: #version ...: count.
Testing the process:
cd /tmp
mkdir testing 2>/dev/null || true
cd testing
# Create 10 testfile#.txt with random #version 1 to 4
for i in {1..10}; do
echo "#version $(($RANDOM%4+1))" >"testfile${i}.txt"
done
# Now get the counts per version
strings * \
| grep '#version \w' \
| sort \
| uniq --count \
| awk '{printf("%s: %s\n", substr($0, index($0, $2)), $1)}'
Example of test output:
#version 1: 4
#version 2: 2
#version 3: 1
#version 4: 3
Something like this may do the trick:
grep -h '#version' * | sort | uniq -c | awk '{print $2,$3": found "$1}'
example files:
filename:filecontent
file1:#version 1
file1.1:#version 1
file111:#version 1
file2:#version 2
file3:#version 3
file4:#version 4
file44:#version 4
Output:
#version 1: found 3
#version 2: found 1
#version 3: found 1
#version 4: found 2
grep -h '#version' * gets all lines containing #version from the files (-h suppresses the filenames); sort sorts the results for uniq -c, which counts the duplicates; then awk rearranges the output into the desired format.
Note: grep might have a slightly different separator than : on your OS.
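If you want exactly the #version x: count shape from the question, a small tweak to the awk part (a sketch) drops the word "found":
grep -h '#version' * | sort | uniq -c | awk '{print $2, $3 ": " $1}'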
I have two text files, 'simple.txt' and 'simple1.txt', with the following data in them
simple.txt--
hello
hi hi hello
this
is it
simple1.txt--
hello hi
how are you
[]$ tr ' ' '\n' < simple.txt | grep -i -c '\bh\w*'
4
[]$ tr ' ' '\n' < simple1.txt | grep -i -c '\bh\w*'
3
These commands show the number of words that start with "h" for each file, but I want to display the total count, i.e. 7, the total across both files. Can I do this in a single command/shell script?
P.S.: I had to write two commands because tr does not take two file names.
Try this, the straightforward way:
cat simple.txt simple1.txt | tr ' ' '\n' | grep -i -c '\bh\w*'
This alternative requires no pipelines:
$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7
How it works
-v RS='[[:space:]]+'
This tells awk to treat each word as a record. (Using a regular expression as RS requires an awk that supports it, such as GNU awk; POSIX awk uses only the first character of RS.)
/^h/{i++}
For any record (word) that starts with h, we increment variable i by 1.
END{print i+0}
After we have finished reading all the files, we print out the value of i. Adding 0 forces a numeric result, so 0 is printed rather than an empty string when nothing matched.
It is not the case that tr accepts only one filename; it does not accept any filenames at all (it always reads from stdin). That's why, even in your solution, you didn't provide a filename to tr but used input redirection.
In your case, I think you can replace tr by fmt, which does accept filenames:
fmt -1 simple.txt simple1.txt | grep -i -c -w 'h.*'
(I also changed the grep a bit, because I personally find it more readable this way, but this is a matter of taste.)
Note that both solutions (mine and your original ones) would count a string consisting of letters joined by one or more non-space characters, for instance haaaa.hbbbbbb.hccccc, as a single "block", i.e. it would add only 1 to the count of "h"-words, not 3. Whether or not this is the desired behaviour is up to you to decide.
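If you do want each of those embedded words counted separately, grep -o extracts every match instead of counting lines (a sketch):
grep -ioh '\bh\w*' simple.txt simple1.txt | wc -l
Here -o prints each match on its own line and -h suppresses the filename prefixes, so haaaa.hbbbbbb.hccccc would contribute 3 to the total.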
How do I grep and show the preceding and following 5 lines surrounding each matched line?
For BSD or GNU grep you can use -B num to set how many lines before the match and -A num for the number of lines after the match.
grep -B 3 -A 2 foo README.txt
If you want the same number of lines before and after you can use -C num.
grep -C 3 foo README.txt
This will show 3 lines before and 3 lines after.
-A and -B will work, as will -C n (for n lines of context), or just -n (shorthand for -C n).
ack works with arguments similar to grep's, and accepts -C. But it's usually better for searching through code.
grep astring myfile -A 5 -B 5
That will grep "myfile" for "astring" and show 5 lines before and after each match.
ripgrep
If you care about the performance, use ripgrep which has similar syntax to grep, e.g.
rg -C5 "pattern" .
-C, --context NUM - Show NUM lines before and after each match.
There are also parameters such as -A/--after-context and -B/--before-context.
The tool is built on top of Rust's regex engine, which makes it very efficient on large data.
I normally use
grep searchstring file -C n # n for number of lines of context up and down
Many tools like grep also have really great man pages. I find myself referring to grep's man page a lot because there is so much you can do with it.
man grep
Many GNU tools also have an info page that may have more useful information in addition to the man page.
info grep
Use grep's own help to see the context options:
$ grep --help | grep -i context
Context control:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
If you search code often, ag (The Silver Searcher) is much more efficient (i.e. faster) than grep.
You show context lines by using the -C option.
Eg:
ag -C 3 "foo" myFile
line 1
line 2
line 3
line that has "foo"
line 5
line 6
line 7
Search for "17655" in /some/file.txt, showing 10 lines of context before and after (using awk), with each output line preceded by its line number and a colon. Use this on Solaris when grep does not support the -[ACB] options.
awk '
/17655/ {                          # matched line
    for (i = 0; i < 10; i++) {     # print the saved leading context, oldest first
        j = (b + i) % 10
        if (before[j] != "") print before[j]
    }
    print (NR ":" ($0))
    a = 10                         # arm 10 lines of trailing context
    next                           # do not let the rules below reprint this line
}
a-- > 0 {                          # inside the trailing-context window
    print (NR ":" ($0))
}
{
    before[b] = (NR ":" ($0))      # ring buffer of the last 10 lines
    b = (b + 1) % 10
}' /some/file.txt
Let's understand using an example.
We can use grep with options:
-A 5 # this will give you 5 lines after the searched string.
-B 5 # this will give you 5 lines before the searched string.
-C 5 # this will give you 5 lines before & after the searched string.
Example.
file.txt contains 6 lines, and the following are the operations.
[abc@xyz]~/% cat file.txt # print all file data
this is first line
this is 2nd line
this is 3rd line
this is 4th line
this is 5th line
this is 6th line
[abc@xyz]~% grep "3rd" file.txt # we are searching for the keyword '3rd' in the file
this is 3rd line
[abc@xyz]~% grep -A 2 "3rd" file.txt # print 2 lines after finding the searched string
this is 3rd line
this is 4th line
this is 5th line
[abc@xyz]~% grep -B 2 "3rd" file.txt # print 2 lines before the searched string
this is first line
this is 2nd line
this is 3rd line
[abc@xyz]~% grep -C 2 "3rd" file.txt # print 2 lines before and 2 lines after the searched string
this is first line
this is 2nd line
this is 3rd line
this is 4th line
this is 5th line
Trick to remember options:
-A → A means "after"
-B → B means "before"
-C → C means "in between"
I do it the compact way:
grep -5 string file
That is the equivalent of:
grep -A 5 -B 5 string file
Here is @Ygor's solution in awk:
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=3 a=3 s="pattern" myfile
Note: Replace the a and b variables with the number of lines before and after, as shown below.
It's especially useful for systems which don't support grep's -A, -B and -C parameters.
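For instance, to emulate grep -C 5 for the pattern from the earlier answer (a sketch, reusing the same hypothetical file):
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=5 a=5 s="17655" /some/file.txt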
grep has a group of options for Context Line Control; you can use --context from it, or simply:
| grep -C 5
or
| grep -5
should do the trick.
$ grep thestring thefile -5
-5 gets you 5 lines above and below each match of 'thestring'; it is equivalent to -C 5 or -A 5 -B 5.