Using sed to change only consecutive repeated letters - text

Using sed, how can I change the letter 'a' to 'A', but only where it appears as two or more consecutive letters? For example, from:
galaxy
ear
aardvak
Haaaaaaaaa
into
galaxy
ear
AArdvak
HAAAAAAAAA

You can do it using groups. If you have this file:
$ cat a
galaxy
ear
aardvak
Haaaaaaaaa
Ulaanbaatar
You can use this sed command:
$ sed 's/\(.\)\1\{1,\}/\U&/g' a
galaxy
ear
AArdvak
HAAAAAAAAA
UlAAnbAAtar
What happens here? If a character is captured in a group (\(.\)) and that group (\1) repeats one or more times (\1\{1,\}), the whole matched run (&) is replaced by its uppercased version (\U&). Note that this uppercases runs of any repeated character, not only 'a'.
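The question asked only about the letter 'a'. A minimal sketch restricted to that letter, assuming GNU sed (\U is a GNU extension); the function name upper_a_runs is made up for illustration:

```shell
# Uppercase only runs of two or more consecutive 'a' characters.
# Assumes GNU sed: \U in the replacement is a GNU extension.
upper_a_runs() {
  sed 's/aa\{1,\}/\U&/g' "$@"
}

printf '%s\n' galaxy ear aardvak Haaaaaaaaa Ulaanbaatar | upper_a_runs
```

Because the pattern names the letter explicitly, no group or backreference is needed, and other doubled letters (as in "eel") are left alone.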

EDIT
You can do this with:
sed 's/a\(a\+\)/A\U\1/;s/b\(b\+\)/B\U\1/;s/c\(c\+\)/C\U\1/;s/d\(d\+\)/D\U\1/;s/e\(e\+\)/E\U\1/;s/f\(f\+\)/F\U\1/;s/g\(g\+\)/G\U\1/;s/h\(h\+\)/H\U\1/;s/i\(i\+\)/I\U\1/;s/j\(j\+\)/J\U\1/;s/k\(k\+\)/K\U\1/;s/l\(l\+\)/L\U\1/;s/m\(m\+\)/M\U\1/;s/n\(n\+\)/N\U\1/;s/o\(o\+\)/O\U\1/;s/p\(p\+\)/P\U\1/;s/q\(q\+\)/Q\U\1/;s/r\(r\+\)/R\U\1/;s/s\(s\+\)/S\U\1/;s/t\(t\+\)/T\U\1/;s/u\(u\+\)/U\U\1/;s/v\(v\+\)/V\U\1/;s/w\(w\+\)/W\U\1/;s/x\(x\+\)/X\U\1/;s/y\(y\+\)/Y\U\1/;s/z\(z\+\)/Z\U\1/'
(Thanks to shelter)
Or with a pipe of sed:
function capitalize_consecutives () {
sed 's/a\(a\+\)/A\U\1/' |
sed 's/b\(b\+\)/B\U\1/' |
sed 's/c\(c\+\)/C\U\1/' |
sed 's/d\(d\+\)/D\U\1/' |
sed 's/e\(e\+\)/E\U\1/' |
sed 's/f\(f\+\)/F\U\1/' |
sed 's/g\(g\+\)/G\U\1/' |
sed 's/h\(h\+\)/H\U\1/' |
sed 's/i\(i\+\)/I\U\1/' |
sed 's/j\(j\+\)/J\U\1/' |
sed 's/k\(k\+\)/K\U\1/' |
sed 's/l\(l\+\)/L\U\1/' |
sed 's/m\(m\+\)/M\U\1/' |
sed 's/n\(n\+\)/N\U\1/' |
sed 's/o\(o\+\)/O\U\1/' |
sed 's/p\(p\+\)/P\U\1/' |
sed 's/q\(q\+\)/Q\U\1/' |
sed 's/r\(r\+\)/R\U\1/' |
sed 's/s\(s\+\)/S\U\1/' |
sed 's/t\(t\+\)/T\U\1/' |
sed 's/u\(u\+\)/U\U\1/' |
sed 's/v\(v\+\)/V\U\1/' |
sed 's/w\(w\+\)/W\U\1/' |
sed 's/x\(x\+\)/X\U\1/' |
sed 's/y\(y\+\)/Y\U\1/' |
sed 's/z\(z\+\)/Z\U\1/'
}
Then let it parse your file:
capitalize_consecutives < myfile
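\U uppercases the text that follows it in the replacement. It is a GNU sed extension, as is \+. Note that without the g flag each substitution only changes the first run of its letter on a line; append g (for example s/a\(a\+\)/A\U\1/g) to change every run.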
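Where GNU sed is not available, the same idea can be sketched with awk, which has toupper() built in. This is an illustrative alternative, not the method from the answers above, and upper_runs is a made-up name:

```shell
# Uppercase every run of two or more identical consecutive characters.
# Uses only awk features (substr, length, toupper) found in POSIX awk.
upper_runs() {
  awk '{
    out = ""
    n = length($0)
    i = 1
    while (i <= n) {
      c = substr($0, i, 1)
      j = i
      # extend j to the last character of the run starting at i
      while (j < n && substr($0, j + 1, 1) == c) j++
      run = substr($0, i, j - i + 1)
      out = out ((j > i) ? toupper(run) : run)
      i = j + 1
    }
    print out
  }' "$@"
}

printf '%s\n' galaxy aardvak Ulaanbaatar | upper_runs
```

It reads the files named as arguments, or standard input when none are given, so it can be used as capitalize_consecutives is used above: upper_runs < myfile.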

Related

Linux command to retrieve unique words and count along with punctuation marks

tr -c '[:alnum:]' '[\n*]' < 4300-0.txt | sort | uniq -c | sort -nr | head
The command above retrieves unique words along with their counts. I'd like to retrieve punctuation marks as well, along with the unique word counts.
How can I achieve this?
You could split your input with tee and extract punctuation and alphanumeric characters separately.
echo "Helo, world!" |
{
tee >(tr -c '[:alnum:]' '\n' >&3) |
tr -c '[:punct:]' '\n'
} 3>&1 |
sed '/^$/d' |
sort | uniq -c | sort -nr | head
should output:
1 world
1 Helo
1 !
1 ,
A short sed script also seems to work:
echo "Helo, world!
OK!" |
sed '
s/\([[:alnum:]]\+\)\([^[:alnum:]]\)/\1\n\2/g
s/\([[:punct:]]\+\)\([^[:punct:]]\)/\1\n\2/g
s/[^[:punct:][:alnum:]]/\n/g
' |
sed '/^$/d' |
sort | uniq -c | sort -nr | head
should output:
2 !
1 world
1 OK
1 Helo
1 ,
You can use [:punct:] to retrieve the punctuation marks
And you can run:
tr -c '[:alnum:][:punct:]' '[\n*]' < 4300-0.txt | sort | uniq -c | sort -nr | head
This prints the punctuation marks as well, but each mark stays attached to the word it adjoins.
For example, if your text file contains:
aaa,
aaa
the output will be:
1 aaa
1 aaa,
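If words and punctuation marks should instead be counted as separate tokens (so that "aaa," counts as "aaa" plus ","), one option is grep -o, which prints each match on its own line. A sketch, assuming a grep that supports -o (GNU and BSD grep do; POSIX does not require it); count_tokens is a made-up name:

```shell
# Print every word (run of alphanumerics) and every single punctuation
# mark on its own line, then count identical tokens, most frequent first.
count_tokens() {
  grep -oE '[[:alnum:]]+|[[:punct:]]' "$@" | sort | uniq -c | sort -nr
}

printf 'Helo, world!\nOK!\n' | count_tokens
```

With this input the "!" token is counted twice and every other token once. To run it on the file from the question: count_tokens 4300-0.txt | head.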

Using Sed multiple search pattern printing specific lines

SED command usage multiple pattern
I am using the sed command to search for multiple patterns.
The command works and prints the lines where it finds matches.
However, I need to do two more things (here is the command I use):
sed -r '/pattern1|pattern2/!d' filename
A - Print the line containing the first pattern,
then print not only the line matching the second pattern
but also a number of lines below it. I would like to specify
the number of lines printed below the second pattern.
B - Print the first pattern and then only a certain number of lines below
the second pattern, but omit the line containing the second pattern itself.
In short, I need to control the number of lines printed below
my second search pattern, and optionally omit the line containing
the search pattern.
Hostname1
section1
a
section2
a
c
d
Hostname2
section1
a
section2
x
y
d
desired Output
hostname1
section2
a
c
hostname2
section2
x
y
# Create test file
(
cat << EOF
Hostname1
section1
a
section2
a
c
d
Hostname2
section1
a
section2
x
y
d
EOF
) > filename
# transformation
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n' | sed -r '/Hostname1|Hostname2|section2/!d' | cut -d"|" -f-3 | tr '|' '\n' | grep -v "^ *$" | sed -e "s/\(Hostname\)/\n\1/g"
Explanation
# Step 1: transform each section into one line, with "|" as the delimiter:
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n'
#Hostname1|
#section1|a|
#section2|a|c|d|
#
#Hostname2|
#section1|a|
#section2|x|y|d|
# Step 2: keep the first n+1 fields (cut -d"|" -f-3):
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n' | sed -r '/Hostname1|Hostname2|section2/!d' | cut -d"|" -f-3
#Hostname1|
#section2|a|c
#Hostname2|
#section2|x|y
# Step 3: transform to the wanted format:
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n' | sed -r '/Hostname1|Hostname2|section2/!d' | cut -d"|" -f-3 | tr '|' '\n' | grep -v "^ *$" | sed -e "s/\(Hostname\)/\n\1/g"
#Hostname1
#section2
#a
#c
#
#Hostname2
#section2
#x
#y
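The same result can probably be reached with a single awk program that counts down the lines still to be printed. This is a sketch rather than the answer's method; it assumes host lines start with "Hostname" and section lines with "section", as in the sample, and print_after is a made-up name:

```shell
# print_after N KEEP FILE
#   N    = number of lines to print below each "section2" line
#   KEEP = 1 to print the "section2" line itself, 0 to omit it
print_after() {
  awk -v n="$1" -v keep="$2" '
    NF == 0     { next }                                 # skip blank lines
    /^Hostname/ { print; left = 0; next }                # always print host lines
    /^section2/ { left = n + 0; if (keep + 0) print; next }
    /^section/  { left = 0; next }                       # other sections stop the output
    left > 0    { print; left-- }                        # lines below section2
  ' "$3"
}
```

For the test file above, print_after 2 1 filename covers case A (host line, the section2 line, and two lines below it), and print_after 2 0 filename covers case B by leaving the section2 line out.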

How to change some symbols (e.g. "space") to other symbol in Bash/shell

I have some output from
ps -ef | grep apache
I need to change all spaces in that output to '#' symbol
Is it possible to use some bash script for this?
Thanks
Use tr:
ps -ef | grep apache | tr ' ' '#'
use tr:
$ echo 'foo bar baz' | tr ' ' '#'
foo#bar#baz
(documentation)
Basic sed command:
ps -ef | grep apache | sed 's/ /#/g'
sed 's/text/new text/g' looks for "text" and replaces it with "new text".
In case you want to replace more characters, for example replace all spaces and _ with #: (thanks Adrian Frühwirth):
ps -ef | grep apache | sed 's/[_ ]/#/g'
You can skip the extra grep if you use awk:
ps -ef | awk '/apache/{gsub(/ /,"#");print}'
If you want multiple space characters to be replaced with only one # symbol, you can use -s flag with tr:
ps -ef | grep apache | tr -s ' ' '#'
or this sed solution:
ps -ef | grep apache | sed -r 's/ +/#/g'
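If the output may contain tabs as well as spaces, the tr variant can be widened to the [:blank:] class. A small sketch; hash_blanks is a made-up name, and '[#*]' is the tr notation for "repeat # as often as needed" in the second set:

```shell
# Replace each run of spaces and/or tabs with a single '#'.
hash_blanks() {
  tr -s '[:blank:]' '[#*]'
}

printf 'foo  bar\tbaz\n' | hash_blanks
```

Used on the command from the question: ps -ef | grep apache | hash_blanks.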

Need to remove the count from the output when using "uniq -c" command

I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most repeated date in a log file: I use uniq -c and sort the result in descending order, something like this:
uniq -c | sort -nr
This produces output like this:
809 23/Dec/2008:19:20
The first field, which is the count, is the problem for me. I want to get only the date from the above output, but I am not able to. I tried the cut command:
uniq -c | sort -nr | cut -d' ' -f2
but this just prints a blank. Can someone please help me get only the date and chop off the count? I want only
23/Dec/2008:19:20
Thanks
The count from uniq is preceded by spaces unless there are more than 7 digits in the count, so you need to do something like:
uniq -c | sort -nr | cut -c 9-
to get columns (character positions) 9 upwards. Or you can use sed:
uniq -c | sort -nr | sed 's/^.\{8\}//'
or:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
This second option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. And there are undoubtedly other options available too.
Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:
printf("%d %s", repeat_count, line);
which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
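As a quick check of that sed expression, here is a small end-to-end sketch on made-up sample dates; strip_counts is a hypothetical wrapper name:

```shell
# Remove the leading count that uniq -c adds, whether or not it is padded.
strip_counts() {
  sed 's/^ *[0-9]* //'
}

printf '%s\n' '23/Dec/2008:19:20' '24/Dec/2008:08:00' '23/Dec/2008:19:20' |
  sort | uniq -c | sort -nr | strip_counts
```

The most frequent date comes out first, without its count. Because the pattern is anchored with ^, only the count is removed, even though the dates themselves begin with digits.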
Instead of cut -d' ' -f2, try
awk '{$1="";print}'
Maybe you need to remove one more blank in the beginning:
awk '{$1="";print}' | sed 's/^.//'
or completely with sed, preserving the original whitespace:
sed -r 's/^[^0-9]*[0-9]+//'
The following awk may help you here.
awk '{a[$0]++} END{for(i in a){print a[i],i | "sort -k2"}}' Input_file
Second solution: in case you want the output in the same order as the input rather than sorted.
awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}' Input_file
an alternative solution is this:
uniq -c | sort -nr | awk '{print $1, $2}'
You can also easily print a single field.
Since you used -f2 with cut in your question, use:
cat file |sort |uniq -c | awk '{ print $2; }'
If you want to work with the count field downstream, following command will reformat it to a 'pipe friendly' tab delimited format without the left padding:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'
For the original task it is a bit of an overkill, but after reformatting, cut can be used to remove the field, as OP intended:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-
Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:
uniq -c | tr -s ' ' | cut -d ' ' -f3
tr is very useful in some obscure places. Unfortunately it doesn't get rid of the first leading space, hence the -f3
You could make use of sed to strip both the leading spaces and the numbers printed by uniq -c
sort file | uniq -c | sed 's/^ *[0-9]* //'
To illustrate this with an example, consider a file containing:
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
The command
sort file | uniq -c | sed 's/^ *[0-9]* //'
would return
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
First solution
If the number of repetitions does not matter, sort alone is enough: it has a unique option, -u.
sort -u file
sort -u < file
Ex.:
$ cat > file
a
b
c
a
a
g
d
d
$ sort -u file
a
b
c
d
g
Second solution
If ordering by the number of repetitions matters:
sort txt | uniq -c | sort -k1 -nr | sed 's/^ \+[0-9]\+ //g'
sort txt | uniq -c | sort -k1 -nr | perl -lpe 's/^ +[\d]+ +//g'
which has this output:
a
d
g
c
b

Bash script to find the frequency of every letter in a file

I am trying to find the frequency of every letter of the English alphabet in an input file. How can I do this in a bash script?
My solution using grep, sort and uniq.
grep -o . file | sort | uniq -c
Ignore case:
grep -o . file | sort -f | uniq -ic
Just one awk command
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file
if you want case insensitive, add tolower()
awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file
and if you want only characters,
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file
and if you want only digits, change /[a-zA-Z]/ to /[0-9]/
if you do not want to show unicode, do export LC_ALL=C
A solution with sed, sort and uniq:
sed 's/\(.\)/\1\n/g' file | sort | uniq -c
This counts all characters, not only letters. You can filter out with:
sed 's/\(.\)/\1\n/g' file | grep '[A-Za-z]' | sort | uniq -c
If you want to consider uppercase and lowercase as same, just add a translation:
sed 's/\(.\)/\1\n/g' file | tr '[:upper:]' '[:lower:]' | grep '[a-z]' | sort | uniq -c
Here is a suggestion:
while read -n 1 c
do
echo "$c"
done < "$INPUT_FILE" | grep '[[:alpha:]]' | sort | uniq -c | sort -nr
Similar to mouviciel's answer above, but more portable to the Bourne and Korn shells used on BSD systems. When you don't have GNU sed, which supports \n in a replacement, you can use a backslash-escaped newline instead:
sed -e's/./&\
/g' file | sort | uniq -c | sort -nr
or, to avoid the visual split on the screen, insert a literal newline by typing CTRL+V CTRL+J:
sed -e's/./&\^J/g' file | sort | uniq -c | sort -nr
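Another way to avoid the newline-in-replacement problem altogether is fold, which breaks its input into lines of a fixed width. A sketch assuming plain ASCII input (fold and tr may work on bytes rather than characters in some implementations); letter_freq is a made-up name:

```shell
# Count letters case-insensitively, most frequent first.
letter_freq() {
  tr -cd '[:alpha:]' |                # keep letters only (drops newlines too)
    tr '[:upper:]' '[:lower:]' |      # fold case
    fold -w 1 |                       # one character per line
    sort | uniq -c | sort -nr
}

printf 'Hello World\n' | letter_freq
```

The function reads standard input, so for a file it is called as letter_freq < file.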
