Counting total occurrences of each 'version' across multiple files - linux

I have a number of files in a directory on Linux, each of which contains a version line in the format: #version x (where x is the version number).
I'm trying to find a way to count the number of times each different version appears across all the files, and output something like:
#version 1: 12
#version 2: 36
#version 3: 2
I don't know all the potential versions that might exist, so I'm really trying to match lines that contain #version.
I've tried using things like grep -c - however that only gives the total of all lines containing #version - I can't find a nice way to split on the different version numbers.
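One minimal way to split on the version numbers is to have grep print only the matched text and then count the distinct matches; this is just a sketch, assuming each file contains a plain-text line like #version N:
# -h drops the filename prefix, -o prints only the matched "#version N" text;
# uniq -c counts each distinct version and awk reshapes the output.
grep -ho '#version [0-9]*' * | sort | uniq -c | awk '{print $2, $3": "$1}'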

A possibility piping multiple commands:
strings * | grep '#version \w' | sort | uniq --count | awk '{printf("%s: %s\n", substr($0, index($0, $2)), $1)}'
Operations breakdown:
strings *: Extract text strings from *, i.e. all files in the current directory.
| grep '#version \w': Pipe the strings into grep to keep only the lines containing #version followed by a word character.
| sort: Pipe the version strings into sort so that identical lines end up adjacent.
| uniq --count: Pipe the sorted #version lines into uniq to output the count of each distinct #version... string.
| awk '{printf("%s: %s\n", substr($0, index($0, $2)), $1)}': Pipe the counts into awk to re-format the output as #version ...: count.
Testing the process:
cd /tmp
mkdir testing 2>/dev/null || true
cd testing
# Create 10 testfile#.txt with random #version 1 to 4
for i in {1..10}; do
echo "#version $(($RANDOM%4+1))" >"testfile${i}.txt"
done
# Now get the counts per version
strings * \
| grep '#version \w' \
| sort \
| uniq --count \
| awk '{printf("%s: %s\n", substr($0, index($0, $2)), $1)}'
Example of test output:
#version 1: 4
#version 2: 2
#version 3: 1
#version 4: 3

Something like this may do the trick:
grep -h '#version' * | sort | uniq -c | awk '{print $2,$3": found "$1}'
example files:
filename:filecontent
file1:#version 1
file1.1:#version 1
file111:#version 1
file2:#version 2
file3:#version 3
file4:#version 4
file44:#version 4
Output:
#version 1: found 3
#version 2: found 1
#version 3: found 1
#version 4: found 2
grep -h '#version' * gets all matching lines from the files (without filenames), sort sorts the results for uniq -c, which counts the number of duplicates, and then awk rearranges the output into the desired format.
Note: grep might have a slightly different separator than : on your OS.
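An awk-only variant is also possible; the following is just a sketch (assuming the current directory contains only regular text files), counting each whole #version line directly:
awk '/#version/ { c[$0]++ } END { for (v in c) print v": "c[v] }' *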

Related

How to count each letter from a file?

I have a cord.txt file as shown below,
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
I need to count each letter and make a summary as shown below (expected output):
H,4
D,5
E,4
T,1
I know how to count a single letter by using grep "<letter>" cord.txt | wc, but I have a huge file containing many different letters, so please help me do the same for all of them at once.
Thanks in advance.
You're missing the N :-)
grep -o '[[:alpha:]]' cord.txt | sort | uniq -c
grep -o only outputs the matching part. With the POSIX class [[:alpha:]], it outputs all the letters contained in the input.
sort groups the same letters together
uniq -c reports unique lines with their counts. It needs sorted input, as it only compares the current line to the previous one.
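If you want the output in the question's letter,count format, a small awk step can be appended to reshape it (a sketch):
grep -o '[[:alpha:]]' cord.txt | sort | uniq -c | awk '{print $2","$1}'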
The following command
Removes any character that is not an ASCII letter;
Places every character on its own line;
Sorts the characters;
Counts the number of same consecutive lines.
sed 's/[^a-zA-Z]//g' < input.txt | fold -w 1 -s | sort | uniq -c > output.txt
# ^                                ^              ^      ^
# 1.                               2.             3.     4.
Input:
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
output:
5 D
4 E
4 H
1 N
1 T
You might use Python's collections.Counter as follows. Let cord.txt content be
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
and counting.py be
import collections
counter = collections.Counter()
with open("cord.txt", "r") as f:
    for line in f:
        counter.update(i for i in line if i.isalpha())
for char, cnt in counter.items():
    print("{},{}".format(char, cnt))
then python counting.py outputs
H,4
D,5
E,4
T,1
N,1
Note that I used for line in f, where f is the file handle, to avoid loading the whole file into memory. Disclaimer: I used Python 3.7; older versions should work but might give a different order in the output, as collections.Counter is a subclass of dict and dicts do not keep insertion order in older Python versions.
In short:
tr '[0-9],' \\n <input | sort | uniq -c
43
5 D
4 E
4 H
1 N
1 T
Ok, there are 43 other characters... You could drop them and match your requested format by adding sed:
tr '[0-9],' \\n </tmp/so/input | sort | uniq -c |
sed -ne 's/^ *\([0-9]\+\) \(.\)/\2,\1/p'
D,5
E,4
H,4
N,1
T,1

Extracting a set of characters for a column in a txt file

I have a BED file (which is a text file formed by columns separated by tabs). The fourth column has a name followed by numbers. Using the command line (Linux), I would like to get these names without repetition. I provided an example below.
This is my file:
$head example.bed
1 2160195 2161184 SKI_1.2160205.2161174
1 2234406 2234552 SKI_1.2234416.2234542
1 2234713 2234849 SKI_1.2234723.2234839
1 2235268 2235551 SKI_1.2235278.2235541
1 2235721 2236034 SKI_1.2235731.2236024
1 2237448 2237699 SKI_1.2237458.2237689
1 2238005 2238214 SKI_1.2238015.2238204
1 9770503 9770664 PIK3CD_1.9770513.9770654
1 9775588 9775837 PIK3CD_1.9775598.9775827
1 9775896 9776146 PIK3CD_1.9775906.9776136
...
My list should look like this:
SKI_1
PIK3CD_1
...
Could you please help me with the code I need to use?
I found the solution years ago with grep, but I have lost the document in which I used to save all my useful commands.
Given so.txt:
1 2160195 2161184 SKI_1.2160205.2161174
1 2234406 2234552 SKI_1.2234416.2234542
1 2234713 2234849 SKI_1.2234723.2234839
1 2235268 2235551 SKI_1.2235278.2235541
1 2235721 2236034 SKI_1.2235731.2236024
1 2237448 2237699 SKI_1.2237458.2237689
1 2238005 2238214 SKI_1.2238015.2238204
1 9770503 9770664 PIK3CD_1.9770513.9770654
1 9775588 9775837 PIK3CD_1.9775598.9775827
1 9775896 9776146 PIK3CD_1.9775906.9776136
Then the following command should do the trick:
cat so.txt | awk '{split($4,f,".");print f[1];}' | sort -u
$4 is the 4th column
We split the 4th column on the . character. The result is put into the f array
Finally we filter out the duplicates with sort -u
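If you prefer to avoid the extra sort, the de-duplication can also be done inside awk; a sketch:
awk '{ split($4,f,"."); if (!seen[f[1]]++) print f[1] }' so.txt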
With data in the file bed and using awk:
awk 'NR>1 { split($4,arr,".");bed[arr[1]]="" } END { for (i in bed) { print i } }' bed
Ignore the first line, then split the 4th space-delimited field into the array arr based on ".", and put the first element of this array into another array, bed, as an index. In the END block, loop through the bed array and print the indexes.
Using grep:
cat example.bed|awk '{print $4}'|grep -oP '\w+_\d+'|sort -u
produces:
PIK3CD_1
SKI_1
and without the cat command (as there are always UUOC advocates here.. ):
awk '{print $4}' < example.bed|grep -oP '\w+_\d+'|sort -u
I have found this:
head WRGL2_hg19_v1.bed | cut -f4 | cut -d "." -f1
Output:
PIK3CD_1
SKI_1
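Note that head limits this to the first 10 lines and cut alone does not remove repeats; applied to the whole file, a de-duplicated sketch of the same idea would be:
cut -f4 WRGL2_hg19_v1.bed | cut -d "." -f1 | sort -u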

How can I fix my bash script to find a random word from a dictionary?

I'm studying bash scripting and I'm stuck fixing an exercise of this site: https://ryanstutorials.net/bash-scripting-tutorial/bash-variables.php#activities
The task is to write a bash script to output a random word from a dictionary whose length is equal to the number supplied as the first command line argument.
My idea was to create a sub-dictionary, assign each word a line number, select a random number from those lines and filter the output, which worked for a similar, simpler script, but not for this one.
This is the code I used:
6 DIC='/usr/share/dict/words'
7 SUBDIC=$( egrep '^.{'$1'}$' $DIC )
8
9 MAX=$( $SUBDIC | wc -l )
10 RANDRANGE=$((1 + RANDOM % $MAX))
11
12 RWORD=$(nl "$SUBDIC" | grep "\b$RANDRANGE\b" | awk '{print $2}')
13
14 echo "Random generated word from $DIC which is $1 characters long:"
15 echo $RWORD
and this is the error I get using "21" as input:
bash script.sh 21
script.sh: line 9: counterintelligence's: command not found
script.sh: line 10: 1 + RANDOM % 0: division by 0 (error token is "0")
nl: 'counterintelligence'\''s'$'\n''electroencephalograms'$'\n''electroencephalograph': No such file or directory
Random generated word from /usr/share/dict/words which is 21 characters long:
I tried in bash to split the code into smaller pieces, obtaining no error (input=21):
egrep '^.{'21'}$' /usr/share/dict/words | wc -l
3
but once in the script, lines 9 and 10 give errors.
Where do you think the error is?
problems
SUBDIC=$( egrep '^.{'$1'}$' $DIC ) will store all words of the given length in the SUBDIC variable, so its content is now something like foo bar baz.
MAX=$( $SUBDIC | ... ) will try to run the command foo bar baz which is obviously bogus; it should be more like MAX=$(echo $SUBDIC | ... )
MAX=$( ... | wc -l ) will count the lines; when using the above mentioned echo $SUBDIC you will have multiple words, but all in one line...
RWORD=$(nl "$SUBDIC" | ...) same problem as above: there's only one line (also note @armali's answer that nl requires a file or stdin)
RWORD=$(... | grep "\b$RANDRANGE\b" | ...) might match the dictionary entry catch 22
likely RWORD=$(... | awk '{print $2}') won't handle lines containing spaces
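Putting those fixes together, a sketch of a corrected script might look like this (same variable names as in the question; sed -n picks the Nth line, replacing the nl | grep | awk chain):
#!/usr/bin/env bash
DIC='/usr/share/dict/words'
SUBDIC=$(grep -E "^.{$1}$" "$DIC")            # all words of the requested length, one per line
MAX=$(wc -l <<<"$SUBDIC")                     # count those lines
RANDRANGE=$((1 + RANDOM % MAX))               # pick a random line number
RWORD=$(sed -n "${RANDRANGE}p" <<<"$SUBDIC")  # print exactly that line
echo "Random generated word from $DIC which is $1 characters long:"
echo "$RWORD"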
a simple solution
doing a "random sort" over the all the possible words and taking the first line, should be sufficient:
egrep "^.{$1}$" "${DIC}" | sort -R | head -1
MAX=$( $SUBDIC | wc -l ) - A pipe is used for connecting a command's output, while $SUBDIC isn't a command; an appropriate syntax is MAX=$( <<<$SUBDIC wc -l ).
nl "$SUBDIC" - The argument to nl has to be a filename, which "$SUBDIC" isn't; an appropriate syntax is nl <<<"$SUBDIC".
This code will do it. My test dictionary of words is in the file file. It's a good idea to get all words of a given length first, but put them in an array, not in a plain variable, and then get a random index and echo it.
dic=( $(sed -n "/^.\{$1\}$/p" file) )
ind=$((0 + RANDOM % ${#dic[@]}))
echo ${dic[$ind]}
I am also doing this activity and I came up with one simple solution.
I created this script:
#!/bin/bash
awk "NR==$1 {print}" /usr/share/dict/words
If you want a random word, run the script from the terminal with the command below.
./script.sh $RANDOM
If you want to print the word at a specific line number, you can run it from the terminal with a command like the one below.
./script.sh 465
cat /usr/share/dict/american-english | head -n $RANDOM | tail -n 1
$RANDOM - Returns a different random number each time it is referred to.
This simple line outputs a random word from the mentioned dictionary.
Otherwise, as umläute mentioned, you can do:
cat /usr/share/dict/american-english | sort -R | head -1

Print a row of 16 lines evenly side by side (column)

I have a file with an unknown number of lines (but always an even number). I want to print them side by side based on the total number of lines in that file. For example, I have a file with 16 lines like below:
asdljsdbfajhsdbflakjsdff235
asjhbasdjbfajskdfasdbajsdx3
asjhbasdjbfajs23kdfb235ajds
asjhbasdjbfajskdfbaj456fd3v
asjhbasdjb6589fajskdfbaj235
asjhbasdjbfajs54kdfbaj2f879
asjhbasdjbfajskdfbajxdfgsdh
asjhbasdf3709ddjbfajskdfbaj
100
100
150
125
trh77rnv9vnd9dfnmdcnksosdmn
220
225
sdkjNSDfasd89asdg12asdf6asdf
So now I want to print them side by side. As they have 16 lines in total, I am trying to get the result 8:8 like below:
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
The paste command did not work for me exactly (paste - - - - - - - - < file1), nor did the awk command that I used: awk '{printf "%s" (NR%2==0?RS:FS),$1}'
Note: the number of lines in the file is dynamic. The only known thing in my scenario is that it is always an even number.
If you have the memory to hash the whole file ("max" below):
$ awk '{
    a[NR]=$0                 # hash all the records
}
END {                        # after hashing
    mid=int(NR/2)            # compute the midpoint, int in case NR is uneven
    for(i=1;i<=mid;i++)      # iterate from start to midpoint
        print a[i],a[mid+i]  # output
}' file
If you have the memory to hash half of the file ("mid"):
$ awk '
NR==FNR {                          # on 1st pass hash second half of records
    if(FNR>1) {                    # we dont need the 1st record ever
        a[FNR]=$0                  # hash record
        if(FNR%2)                  # if odd record
            delete a[int(FNR/2)+1] # remove one from the past
    }
    next
}
FNR==1 {                           # on the start of 2nd pass
    if(NR%2==0)                    # if record count is uneven
        exit                       # exit as there is always even count of them
    offset=int((NR-1)/2)           # compute offset to the beginning of hash
}
FNR<=offset {                      # only process the 1st half of records
    print $0,a[offset+FNR]         # output one from file, one from hash
    next
}
{                                  # once 1st half of 2nd pass is finished
    exit                           # just exit
}' file file                       # notice filename twice
And finally, if you have awk compiled into a worm's brain (i.e. not so much memory, "min"):
$ awk '
NR==FNR {                                       # just get the NR of 1st pass
    next
}
FNR==1 {
    mid=(NR-1)/2                                # get the midpoint
    file=FILENAME                               # filename for getline
    while(++i<=mid && (getline line < file)>0); # jump getline to mid
}
{
    if((getline line < file)>0)                 # getline read from mid+FNR
        print $0,line                           # output
}' file file                                    # notice filename twice
Standard disclaimer on getline and no real error control implemented.
Performance:
I ran seq 1 100000000 > file and tested how the above solutions performed. Output went to /dev/null; writing it to a file lasted around 2 s longer. max performance is so-so, as its memory footprint was 88 % of my 16 GB, so it might have swapped. Well, I killed all the browsers and shaved 7 seconds off the real time of max.
+------------------+-----------+-----------+
| which            |           |           |
|   min            |   mid     |   max     |
+------------------+-----------+-----------+
| time             |           |           |
|   real 1m7.027s  | 1m30.146s | 0m48.405s |
|   user 1m6.387s  | 1m27.314  | 0m43.801s |
|   sys  0m0.641s  | 0m2.820s  | 0m4.505s  |
+------------------+-----------+-----------+
| mem              |           |           |
|   3 MB           | 6.8 GB    | 13.5 GB   |
+------------------+-----------+-----------+
Update:
I tested @DavidC.Rankin's and @EdMorton's solutions and they ran, respectively:
real 0m41.455s
user 0m39.086s
sys 0m2.369s
and
real 0m39.577s
user 0m37.037s
sys 0m2.541s
Memory footprint was about the same as my mid solution had. It pays to use wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
if you want just one space between columns, change to
$ pr -2ts' ' file
You can also do it with awk simply by storing the first-half of the lines in an array and then concatenating the second half to the end, e.g.
awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
Example Use/Output
With your data in file, then
$ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use columns utility from autogen:
columns -c2 --by-columns file.txt
You can use column, but the count of columns is calculated in a strange way from the count of columns of your terminal. So assuming your lines have 28 characters, you also can:
column -c $((28*2+8)) file.txt
I do not want to solve this, but if I were you:
wc -l file.txt
gives number of lines
echo $(($(wc -l < file.txt)/2))
gives a half
head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt
tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt
create files with the first half and the last half of the original file. Now you can merge those files together side by side as described here.
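For example, with the two halves created above, paste joins them line by line (a sketch):
paste first.txt last.txt        # tab between the columns
paste -d' ' first.txt last.txt  # single space between the columns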
Here is my take on it using the bash shell, wc(1) and ed(1):
#!/usr/bin/env bash
array=()
file=$1
total=$(wc -l < "$file")
half=$(( total / 2 ))
plus1=$(( half + 1 ))
for ((m=1;m<=half;m++)); do
array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j')
done
After all of that, if you just want to print the output to stdout, add the line below to the script.
printf '%s\n' "${array[@]}" ,p Q | ed -s "$file"
If you want to write the changes directly to the file itself, use this line at the end of the script instead.
printf '%s\n' "${array[@]}" w | ed -s "$file"
Here is an example.
printf '%s\n' {1..10} > file.txt
Now running the script against that file.
./myscript file.txt
Output
1 6
2 7
3 8
4 9
5 10
Or, using the bash 4+ feature mapfile, aka readarray:
Save the file in an array named array.
mapfile -t array < file.txt
Separate it into two halves.
left=("${array[@]::((${#array[@]} / 2))}") right=("${array[@]:((${#array[@]} / 2 ))}")
Loop and print side by side.
for i in "${!left[@]}"; do
printf '%s %s\n' "${left[i]}" "${right[i]}"
done
Given what you said, "The only known thing in my scenario is, they are even number all the time", that solution should work.

How can I count the most occurring sequences of 3 letters within a word with a bash script

I have a sample file like
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
Here I need to grep the most occurring sequences of 3 letters within a word.
Output should be
acc = 5
aco = 3
Is that possible in Bash?
I have absolutely no idea how I can accomplish it with awk, sed, or grep.
Any clue how it's possible...
PS: I show no attempted output because I have no idea how to do that; I don't want to write unnecessary awk -F, xyz abc... that's not gonna help anywhere...
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
This is an alternative to Ed Morton's solution. It does less looping but needs a bit more memory. The idea is not to care about spaces or any non-alphabetic characters; we filter them out at the end.
awk -v n=3 '{ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) if (s !~ /[^a-z]/) print s,a[s] }' file
When you use GNU awk, you can do this a bit differently and more optimized by setting each record to be a word. This way the selection at the end is not needed:
awk -v n=3 -v RS='[[:space:]]' '
(length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
END {for(s in a) print s,a[s] }' file
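To get the output in the question's acc = 5 style, sorted by count, the same command can be reshaped and sorted; a sketch based on the GNU awk variant above:
awk -v n=3 -v RS='[[:space:]]' '
   (length>=n){ for(i=length-n+1;i>0;--i) a[tolower(substr($0,i,n))]++ }
   END {for(s in a) print s" = "a[s] }' file | sort -k3,3nr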
This might work for you (GNU sed, sort and uniq):
sed -E 's/.(..)/\L&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -c |
sort -s -k1,1rn |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
Use the first sed invocation to output 3 letter lower case words.
Sort the words.
Count the duplicates.
Sort the counts in reverse numerical order maintaining the alphabetical order.
Use the second sed invocation to manipulate the results into the desired format.
If you only want strings that occur more than once, in alphabetical order and case-sensitive, use:
sed -E 's/.(..)/&\n\1/;/^\S{3}/P;D' file |
sort |
uniq -cd |
sed -En 's/^\s*(\S+)\s*(\S+)/\2 = \1/;H;$!b;x;s/\n/ /g;s/.//p'
