I am trying to count how many times each ASCII printable character is present in a file. I thought a good way to do this might be to list the printable characters in a { } enclosed list and use grep on each item within the braces. Example code is below. I would like to expand the char list to include all 64 ASCII printable characters, but I cannot figure out how to get the code to read and use each character between the braces separately. I would really like to output a file in the format "character\tcharacterCount". Any suggestions?
char={" ",!,\",#,"\$"}
cat PHRED_scores.txt | grep -e "$char" | wc -m
The command below will display the special characters present in the file and their total counts.
grep -oP '[ !\\$#]' file | sort | uniq -c
Explanation:
-o - print only the matched parts.
-P - use Perl-compatible regular expressions.
[ !\\$#] - the character class listing the characters to match. You have to escape \ (as \\) so that it means a literal \.
sort - sorts the matches so that duplicates become adjacent.
uniq -c - collapses each run of duplicates into a single line, prefixed with its count.
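If you also want the "character\tcharacterCount" format asked for in the question, one way (a sketch assuming GNU sed, which understands \t in the replacement) is to swap the two columns that uniq -c produces:
grep -oP '[ !\\$#]' file | sort | uniq -c | sed 's/^ *\([0-9]*\) \(.*\)/\2\t\1/' > counts.txt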
There is a way to avoid listing all 64 characters individually. Regular expressions provide character classes and ranges that represent numerous characters without listing each one. Some examples are:
[a-z] match all lowercase characters
[A-Z] match all uppercase characters
[0-9] match all digits
[[:print:]] all printable characters
So with very little effort, you can match all upper and lowercase characters and all digits with:
[a-zA-Z0-9]
You can then add the additional printable characters, but you must take care to escape, or position carefully, those with special meaning to regular expressions themselves. An example (not intended to be all-inclusive) is:
[a-zA-Z0-9:;~!#$%&*()_+=-]
(the - is placed last so it is taken literally rather than as a range)
or you can use the predefined class, with the double brackets required inside a bracket expression:
[[:print:]]
You can add classes as required. To solve your problem, sort | uniq -c (as Avinash provided) gives the individual counts, and an additional call to wc provides the total. With that, it is not difficult to develop a script that takes the filename as an argument and gives the total and individual character counts you require. Something similar to the following will work:
#!/bin/bash

cclass='[[:print:]]'    # bracket expression matching any printable character

echo -n "Total character count: "
grep -o "$cclass" "$1" | wc -l    # one match per output line, so the line count is the total

echo -e "\nIndividual frequency:"
grep -o "$cclass" "$1" | sort | uniq -c    # obtain the individual frequency

exit 0
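Save the script as, say, charfreq.sh (the name is arbitrary), make it executable, and pass the file to examine as the first argument:
chmod +x charfreq.sh
./charfreq.sh PHRED_scores.txt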
Sample output:
Total character count: 455
Individual frequency:
6 =
10 _
7 -
4 ,
12 ;
1 /
4 .
6 "
9 (
9 )
2 {
2 }
2 *
5 \
2 #
4 %
4 0
3 a
17 b
11 c
1 C
24 d
4 D
28 e
1 E
...
My text file should have two columns separated by a tab (represented by \t) as shown below. However, there are a few corrupted rows where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the second value present after the space in column 1. For example, in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the columns on \t into independent arrays, then cut the first column again on \s, and join them back together. However, it did not work.
Here is the snippet:
col1=($(cut -d$'\t' -f1 "$file" | cut -d' ' -f1))
col2=($(cut -d$'\t' -f2 "$file"))
myArr=()
for ((idx=0; idx<${#col1[@]}; idx++)); do
    echo "${col1[$idx]} ${col2[$idx]}"
    # I will append to myArr here
done
The output appends the list of col2 to col1, as A B C D E 1 2 3 4 5. On top of this, my file is very large, i.e. 5,300,000 rows, so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search for a literal space followed by any number of non-tab characters, and replace the whole match with nothing:
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there, just make it 's/ +[^\t]+//' ...
Assuming that when you say a space you mean a blank character, then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
With tab as both the input and output field separator, sub(/ .*/,"",$1) deletes everything from the first blank to the end of the first field, and the final 1 prints each (possibly modified) line.
A solution using Perl regular expressions (for me they are easier than sed's, and more portable, since sed exists in several incompatible versions):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls | perl -pe 's/^(\S+).*\t(\S+)/$1\t$2/g'
A 1
B 2
C 3
D 4
E 5
This captures the non-whitespace characters at the front of the line and the non-whitespace characters after the \t, and joins them with a tab.
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C quoting ($'...') feature of Bash is used so that each \t in the sed expression becomes a literal tab character.
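For illustration, the same quoting mechanism outside of sed:
$ printf '%s\n' $'a\tb'    # Bash turns \t into a real tab before printf runs
a	b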
Take advantage of FS and OFS and let them do all the hard work for you:
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' FS='[ \t].*[ \t]' OFS='\t' RS='[\r]?\n'
(note that FS must be set here as well, otherwise the default whitespace splitting would keep the wrong fields)
Is it possible using sed to replace the first occurrence of a character or substring in line of file only if it is the first 2 characters in the line?
For example we have this text file:
15 hello
15 h15llo
1 hello
1 h15loo
Using the following command: sed -i 's/15/0/' file.txt
will give this output:
0 hello
0 h15llo
1 hello
1 h0loo
What I am trying to avoid is sed considering the characters past the first two.
Is this possible?
Desired output:
0 hello
0 h15llo
1 hello
1 h15loo
You can use
sed -i 's/^15 /0 /' file.txt
sed -i 's/^15\([[:space:]]\)/0\1/' file.txt
sed -i 's/^15\(\s\)/0\1/' file.txt
Here, ^ matches the start-of-line position, 15 matches the 15 substring, and then a space matches a literal space.
The second and third solutions are equivalent: instead of a literal space, they capture a whitespace character into Group 1, and the group value is put back into the result using the \1 placeholder.
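To check the result before editing the file in place, run the same substitution without -i first:
sed 's/^15 /0 /' file.txt    # prints the edited text to stdout; file.txt is untouched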
Linux's sys filesystem represents sets of CPU ids with the syntax:
0,2,8: Set of CPUs containing 0, 2 and 8.
4-6: Set of CPUs containing 4, 5 and 6.
Both syntaxes can be mixed and matched, for example: 0,2,4-6,8
For example, running cat /sys/devices/system/cpu/online prints 0-3 on my machine which means CPUs 0, 1, 2 and 3 are online.
The problem is that the above syntax is difficult to iterate over with a for loop in a shell script. How can it be converted to a more conventional form such as 0 2 4 5 6 8?
Try:
$ echo 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' ORS=' ' RS=, FS=-
0 2 4 5 6 8
This can be used in a loop as follows:
for n in $(echo 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' RS=, FS=-)
do
echo cpu="$n"
done
Which produces the output:
cpu=0
cpu=2
cpu=4
cpu=5
cpu=6
cpu=8
Or like:
printf "%s" 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' RS=, FS=- | while read n
do
echo cpu="$n"
done
Which also produces:
cpu=0
cpu=2
cpu=4
cpu=5
cpu=6
cpu=8
How it works
The awk command works as follows:
RS=,
This tells awk to use , as the record separator.
If, for example, the input is 0,2,4-6,8, then awk will see four records: 0 and 2 and 4-6 and 8.
FS=-
This tells awk to use - as the field separator.
With FS set this way and if, for example, the input record consists of 2-4, then awk will see 2 as the first field and 4 as the second field.
/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next}
For any record that contains -, we print each number starting with the value of the first field, $1, and ending with the value of the second field, $2. Each such number is followed by the Output Record Separator, ORS. By default ORS is a newline character; in the first example above we set ORS to a space so the numbers appear on one line.
After we have printed these numbers, we skip the rest of the commands and jump to the next record.
1
If we get here, then the record did not contain - and we print it out as is. 1 is awk's shorthand for print-the-line.
A Perl one:
echo "0,2,4-6,8" | perl -lpe 's/(\d+)-(\d+)/{$1..$2}/g; $_="echo {$_}"' | bash
Just convert the original string into echo {0,2,{4..6},8} and let Bash brace expansion interpolate it.
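You can see the generated command by leaving off the final pipe to bash:
$ echo "0,2,4-6,8" | perl -lpe 's/(\d+)-(\d+)/{$1..$2}/g; $_="echo {$_}"'
echo {0,2,{4..6},8}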
eval echo $(cat /sys/devices/system/cpu/online | sed 's/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g' | tr , ' ')
Explanation:
cat /sys/devices/system/cpu/online reads the file from sysfs. This can be changed to any other file such as offline.
The output is piped through the substitution s/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g. This matches something like 4-6 and replaces it with $(seq 4 6).
tr , ' ' replaces all commas with spaces.
At this point, the input 0,2,4-6,8 is transformed to 0 2 $(seq 4 6) 8. The final step is to eval this sequence to get 0 2 4 5 6 8.
The example echoes the output. Alternatively, the result can be stored in a variable or used in a for loop.
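For example, a sketch of the variable-plus-loop variant built on the same pipeline:
cpus=$(eval echo $(sed 's/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g' /sys/devices/system/cpu/online | tr , ' '))
for n in $cpus
do
    echo cpu="$n"
done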
I have a file that has the following user names in random places in the file:
albert#ghhdh
albert#jdfjgjjg
john#jfkfeie
mike#fjfkjf
bill#fjfj
bill#fkfkfk
Usernames are the names to the left of the # symbol.
I want to use Unix commands to extract the usernames from the file and then count the unique usernames.
Therefore, using the example above, the output should state that there are 4 unique users (I just need the count as the output, no words).
Can someone help me determine the correct count?
You could extract the words before #, sort them and count them :
cat test.txt | cut -d '#' -f 1 | sort | uniq -c
With test.txt :
albert#ghhdh
john#jfkfeie
bill#fjfj
mike#fjfkjf
bill#fkfkfk
albert#jdfjgjjg
It outputs :
2 albert
2 bill
1 john
1 mike
Note that the duplicate usernames don't have to be grouped in the input list.
If you're just interested in the count of unique users:
cat test.txt | cut -d '#' -f 1 | sort -u | wc -l
# => 4
Or shorter :
cut -d '#' -f 1 test.txt | sort -u | wc -l
Here is the solution that finds the usernames anywhere on the line (not just at the beginning), even if there are multiple usernames on a single line, and finds their unique count:
grep -oE '\b[[:alpha:]_][[:alnum:]_.]*#' file | cut -f1 -d# | sort -u | wc -l
-o only fetches the matched portion
-E processes extended regex
\b[[:alpha:]_][[:alnum:]_.]*# matches usernames (a string following a word boundary \b that starts with a letter or underscore, followed by zero or more alphanumerics, underscores, or dots, and ending with a #)
cut -f1 -d# extracts the username portion which is then sorted and counted for unique names
Faster with one awk command, if awk is allowed:
awk -F'#' '!seen[$1]++{c++} END{print c}' file
A small explanation:
Using # as the field delimiter (-F), the username is field 1, $1, for awk.
For every $1 that has not been seen before, we increment a counter c.
At the same time we increment seen[$1], so when the same name appears again the "not seen" test fails.
At the end we print the counter of unique names, which is the bare count the question asks for.
As a plus, this solution does not require pre-sorting: duplicates are found even if the file is not sorted.
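With the question's sample lines saved in a file (here called users.txt, any name works), the unique names are albert, john, mike and bill:
$ awk -F'#' '!seen[$1]++{c++} END{print c}' users.txt
4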
I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') any number of times ('*'), but as few times as possible ('?'), i.e. only as many as needed before the next token in the regex (\d in our case) can match.
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first parenthesized group is referenced as $1, the second as $2, and so on. In our case:
.*?(\d+):(\d+).*?(\d+)
the three (\d+) groups become $1, $2 and $3 respectively.
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output; append s/\t$// or s/'$'\t''$// respectively to remove it.
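For example, the GNU sed version with the trailing tab stripped in one command:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g; s/\t$//' infile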
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Here grep -o prints each number on its own line, and paste - - - joins every three lines into one tab-separated line.
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed (GNU sed):
sed 's/[^0-9]*\([0-9]\+\)[^0-9]\+\([0-9]\+\)[^0-9]\+\([0-9]\+\).*/\1\t\2\t\3/' file.txt